De-identifying OMOP Databases

A survey of existing tools and approaches for de-identifying OMOP data.

Why De-identification Matters

De-identification is the process of removing or transforming data elements that could identify individuals. In healthcare, this typically means addressing the 18 identifiers specified under HIPAA’s Safe Harbor provision, or demonstrating through Expert Determination that re-identification risk is “very small.”

The practical payoff: properly de-identified data is no longer considered PHI under HIPAA. That means:

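IRB review is often not required for secondary research use, since work on de-identified data is generally not treated as human subjects research (many institutions still ask for a formal determination)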
Simplified data sharing agreements
Faster time-to-research

However, achieving “properly de-identified” status—especially with third-party certification—is not trivial.

Why De-identification is Hard

The Privacy-Utility Trade-off

De-identification isn’t free. Every step taken to protect privacy reduces the analytical value of the data.

A 2024 study from Seoul demonstrated this tension clearly. The researchers tested 19 different de-identification configurations using ARX on clinical data and found:

“All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility.”

— Im et al., BMC Medical Informatics and Decision Making, 2024

This creates a fundamental tension:

Too little de-identification → High utility, high re-identification risk, regulatory problems
Too much de-identification → Low risk, but data becomes useless for research

Finding the sweet spot requires understanding both your privacy requirements and your research needs.

OMOP-Specific Challenges

The OMOP Common Data Model inherits all the de-identification challenges of longitudinal clinical data, but its standardized structure makes these challenges explicit:

Temporal data creates fingerprints: OMOP captures the entire patient journey—conditions, procedures, drug exposures—all linked by a persistent Person ID and precise dates. Even without names, this longitudinal pattern can be uniquely identifying.
Granularity can be identifying: Rare conditions, unusual lab values, or specific treatment combinations can act as identifiers themselves.
Clinical notes are a minefield: The NOTE table contains free-text narratives where PHI hides in unpredictable ways—embedded names, locations, family members, referring physicians.
No community standard: While OHDSI provides guidance on CDM privacy, there’s no universally adopted de-identification standard for OMOP ETL pipelines.
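
To make the "fingerprint" point concrete, here is a minimal sketch that counts how many patients have an event history shared with nobody else. The rows, concept IDs, and function names are made up for illustration; a real check would run against CONDITION_OCCURRENCE and the other event tables.

```python
from collections import Counter

# Toy CONDITION_OCCURRENCE rows: (person_id, condition_concept_id, condition_start_date).
# All values are invented for illustration.
rows = [
    (1, 317009, "2021-03-02"),
    (1, 201826, "2021-07-15"),
    (2, 317009, "2021-03-02"),
    (2, 201826, "2021-07-15"),
    (3, 317009, "2021-03-02"),
    (3, 4329847, "2022-01-09"),
]

def fingerprints(rows):
    """Map each person to the frozenset of (concept, date) events they have."""
    events = {}
    for person_id, concept_id, start_date in rows:
        events.setdefault(person_id, set()).add((concept_id, start_date))
    return {pid: frozenset(ev) for pid, ev in events.items()}

def unique_persons(rows):
    """Return person_ids whose event fingerprint is shared with nobody else."""
    fps = fingerprints(rows)
    counts = Counter(fps.values())
    return sorted(pid for pid, fp in fps.items() if counts[fp] == 1)

print(unique_persons(rows))
```

Persons 1 and 2 share the same history, so only person 3 is reported as unique. On real longitudinal data the share of unique patients is typically far higher, which is why removing names alone is not enough.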

Understanding the OMOP Data Landscape

The OMOP CDM (v5.4) consists of approximately 39 tables. From a de-identification perspective, the challenge isn’t just “which tables”—it’s “which columns require which approach.”

Based on Posada et al.’s framework from Stanford (2025 OHDSI Symposium), OMOP columns fall into three categories:

1. Structured Columns with Predictable Format

These are columns where the data type and format are well-defined: dates, numeric IDs, coded values. Examples: birth_datetime, person_id, visit_start_date, condition_concept_id.

Approaches:

Date shifting: Consistent per-patient offset (SANT, the shift-and-truncate method of Hripcsak et al.)
Hashing/encryption: Deterministic for referential integrity
Format Preserving Encryption (FPE): When format must be maintained
Generalization: Age → age range, ZIP → 3-digit

Tools like ARX and sdcMicro handle these systematically.
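
As a rough sketch of the first two approaches, the snippet below derives a stable pseudonymous ID and a stable date offset for each patient from a keyed hash. This is a plain per-patient shift, not the full SANT procedure, and the key, the shift window, and all function names are my own assumptions.

```python
import hashlib
import hmac
from datetime import date, timedelta

SECRET_KEY = b"replace-with-a-key-from-your-secrets-manager"  # hypothetical key
MAX_SHIFT_DAYS = 365  # assumed window; set this according to your own policy

def _digest(person_id: int, purpose: str) -> bytes:
    """Keyed hash of the person_id, separated by purpose so outputs are unrelated."""
    msg = f"{purpose}:{person_id}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).digest()

def pseudonymize_id(person_id: int) -> int:
    """Deterministic replacement ID that fits a signed 64-bit integer column."""
    return int.from_bytes(_digest(person_id, "id")[:8], "big") >> 1

def shift_days(person_id: int) -> int:
    """Deterministic offset in [-MAX_SHIFT_DAYS, MAX_SHIFT_DAYS] for this patient."""
    n = int.from_bytes(_digest(person_id, "shift")[:8], "big")
    return n % (2 * MAX_SHIFT_DAYS + 1) - MAX_SHIFT_DAYS

def shift_date(person_id: int, d: date) -> date:
    """Apply the patient's offset; intervals within one patient are preserved."""
    return d + timedelta(days=shift_days(person_id))

admit, discharge = date(2020, 1, 1), date(2020, 1, 5)
print(shift_date(42, discharge) - shift_date(42, admit))  # still a 4-day stay
```

Because both values come from a keyed hash rather than a stored lookup table, the same patient gets the same ID and offset in every table, which keeps foreign keys and within-patient intervals intact. The trade-off is that anyone holding the key can recompute the mapping, so the key needs the same protection as the source data.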

2. The `_source_value` Problem

Fields ending in _source_value preserve raw data from the source system. This is where things get complicated—these columns are in structured tables, but their contents vary wildly depending on the source institution:

Sometimes they contain clean codes: "M", "F", "ICD10:J45.909"
Sometimes they contain free-text with embedded PHI: "Dr. Smith's patient from St. Mary's clinic"

Approaches (from Posada et al.):

Remove everything unmapped to concept_ids (simplest, but loses provenance)
Curated allow-lists: Human review of unique unmapped values
NLP for values that look like free-text

The Jeon et al. paper recommends masking most _source_value fields (e.g., 12345*****), but this assumes the values are structured. If your ETL populates these with free-text, you need NLP.
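
One way to combine these options is to triage each value before deciding what to do with it. The sketch below keeps allow-listed values, masks values that look like codes (mirroring the 12345***** example), and routes everything else to NLP. The allow-list contents, the "code-like" regex, and the half-masking rule are assumptions of mine, not rules taken from either paper.

```python
import re

# Hypothetical allow-list built from human review of unique source values.
ALLOW_LIST = {"M", "F", "ICD10:J45.909"}

# Assumed heuristic: a code-like value is a single short token with no spaces.
CODE_LIKE = re.compile(r"[A-Za-z0-9.:_-]{1,20}")

def mask(value):
    """Keep the first half of the value and replace the rest with asterisks."""
    keep = len(value) // 2
    return value[:keep] + "*" * (len(value) - keep)

def triage_source_value(value):
    """Return (action, output) for one _source_value cell."""
    if not value or value in ALLOW_LIST:
        return ("keep", value)
    if CODE_LIKE.fullmatch(value):
        return ("mask", mask(value))
    # Anything with spaces or punctuation is treated as free text and is
    # withheld until an NLP de-identification step has processed it.
    return ("nlp", None)

for v in ["M", "1234567890", "Dr. Smith's patient from St. Mary's clinic"]:
    print(triage_source_value(v))
```

The default branch is deliberately the conservative one: a value that is neither reviewed nor obviously a code is never passed through unchanged.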

3. Free-Text Columns (Require NLP)

These columns explicitly contain narrative text:

Table	Column	Content
NOTE	`note_text`	Full clinical narratives
DRUG_EXPOSURE	`sig`	Prescription instructions
OBSERVATION	`value_as_string`	Free-text observation values

These require specialized NLP-based de-identification—pattern matching, named entity recognition, or hybrid approaches.
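
For a sense of what the pattern-matching layer looks like, and where it stops working, here is a minimal regex-only redactor. The three patterns and the sample note are invented for illustration and are nowhere near a complete PHI rule set.

```python
import re

# Illustrative patterns only; real systems carry far larger rule sets.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("MRN", re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE)),
]

def redact(text):
    """Replace every pattern match with a bracketed category label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Seen on 03/14/2021, MRN: 00123456. Call 415-555-0199 to reschedule."
print(redact(note))
print(redact("Dr. Smith reviewed the labs."))  # the name passes through untouched
```

Well-formatted identifiers are caught, but the physician's name in the second note is not. That gap is the reason production systems add named entity recognition or allow-list filtering on top of regular expressions.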

The Tools

For Structured Data: ARX

ARX is an open-source tool for anonymizing tabular data. It appears to be the most widely used framework for healthcare data anonymization in the academic literature.

Key features:

Implements k-anonymity, l-diversity, t-closeness, and differential privacy
GUI and Java API available
Handles large datasets
GitHub: https://github.com/arx-deidentifier/arx

How it works (simplified):

Load your tabular data
Classify columns as identifiers (remove), quasi-identifiers (transform), or sensitive (protect)
Define generalization hierarchies (e.g., exact age → 5-year bands → 10-year bands)
Select privacy model and parameters (e.g., k=5 for k-anonymity)
ARX searches the generalization hierarchies you defined for a transformation that satisfies the privacy requirements while minimizing information loss
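
The steps above can be mimicked in a few lines to show the idea. This is a brute-force toy in Python, not the ARX API or its search algorithm; the hierarchies, the records, and the "sum of levels" loss measure are all assumptions chosen for illustration.

```python
from collections import Counter

def generalize_age(age, level):
    """Level 0: exact age, 1: 5-year band, 2: 10-year band, 3: suppressed."""
    if level == 0:
        return str(age)
    if level == 3:
        return "*"
    width = 5 if level == 1 else 10
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, level):
    """Level 0: full ZIP, 1: first 3 digits, 2: suppressed."""
    if level == 0:
        return zip_code
    if level == 1:
        return zip_code[:3] + "**"
    return "*****"

def min_class_size(records, age_level, zip_level):
    """Size of the smallest group of records sharing the same generalized values."""
    keys = [(generalize_age(a, age_level), generalize_zip(z, zip_level))
            for a, z in records]
    return min(Counter(keys).values())

def find_transformation(records, k):
    """Try level combinations from least to most generalized; return the first
    (age_level, zip_level) under which every group has at least k records."""
    candidates = [(a, z) for a in range(4) for z in range(3)]
    candidates.sort(key=lambda c: (c[0] + c[1], c))  # crude information-loss proxy
    for age_level, zip_level in candidates:
        if min_class_size(records, age_level, zip_level) >= k:
            return (age_level, zip_level)
    return None

records = [(34, "94110"), (36, "94112"), (31, "94117"),
           (52, "94305"), (58, "94301"), (55, "94306")]
print(find_transformation(records, k=3))
```

For these six records and k=3, the search settles on 10-year age bands plus 3-digit ZIP codes, since every less generalized combination leaves at least one record alone in its group. ARX does the same kind of search over much larger hierarchies, with record suppression and better loss metrics.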

Limitations:

Requires expert configuration—choosing appropriate quasi-identifiers and hierarchies requires domain knowledge
No standard OMOP-specific configuration exists (yet)
Only handles structured data—not designed for free-text

For Clinical Notes: The NLP Options

This is where things get more complex. Clinical notes are messy, unstructured, and full of embedded PHI.

Tool	Certification	Performance	Notes
Philter-UCSF	Third-party HIPAA certified	99.46% recall	Only certified solution I’ve found
Stanford TiDE	Classified “High Risk”	Scalable (100M notes in ~7hrs)	Powers Stanford’s STARR-OMOP-deid

Philter-UCSF

Philter appears to be unique in having achieved third-party HIPAA certification.

Why it stands out:

Third-party certification: Audited by ArcherHall against 70+ million known PIIs
Highest published recall: 99.46% on UCSF corpus, 99.92% on i2b2 benchmark
Production scale: Over 130 million certified de-identified notes at UCSF
Multi-institutional adoption: UC Irvine, UC Davis, UCLA
Open source: BSD-2-Clause license

Technical approach: Philter uses a hybrid method combining regular expressions, part-of-speech tagging, and named entity recognition. Importantly, it uses a whitelist approach—rather than trying to identify all possible PHI (an unbounded problem), Philter identifies what’s definitely not PHI and flags everything else.
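
The allow-list idea is easy to show in miniature. The sketch below is not Philter's code or its rule set; the safe-token list and the masking style are assumptions. It keeps only words known to be safe and masks everything else.

```python
import re

# Hypothetical allow-list of tokens judged safe (ordinary and clinical vocabulary).
SAFE_TOKENS = {
    "patient", "was", "seen", "by", "dr", "for", "asthma", "follow", "up",
}

def allowlist_filter(text, safe_tokens=SAFE_TOKENS):
    """Mask every word that is not on the allow-list; punctuation is untouched."""
    def keep_or_mask(match):
        word = match.group(0)
        return word if word.lower() in safe_tokens else "*" * len(word)
    return re.sub(r"\w+", keep_or_mask, text)

print(allowlist_filter("Patient was seen by Dr. Nguyen for asthma follow up."))
```

The surname is masked without any rule describing what a name looks like: it simply is not on the list. The cost of this design is over-masking, since any legitimate clinical term missing from the list is masked too, so the quality of the allow-list drives how useful the output is.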

Key papers:

Norgeot et al., npj Digital Medicine 2020 — Original Philter paper
Radhakrishnan et al., JAMIA Open 2023 — Philter V1.0 certification

Stanford TiDE

TiDE is another production system worth noting, with an important caveat from Stanford’s own documentation:

“Note that our method makes re-identification harder. It does not remove the possibility of leaked PHI. The STARR-OMOP-deid dataset is deemed as High Risk dataset due to the presence of occasional leaked PHI.”

Strengths:

Highly scalable: processes 100 million notes in ~7 hours
Uses “Hiding in Plain Sight” (HIPS) surrogate replacement
Open source
Integrated with Stanford’s OMOP implementation
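
To illustrate the HIPS idea, the sketch below swaps detected names for realistic surrogates instead of a visible redaction marker. It is a toy under my own assumptions (surrogate list, salt, and function names are invented), not TiDE's implementation.

```python
import hashlib

# Hypothetical surrogate pool; a real system would draw from much larger lists.
SURROGATE_NAMES = ["Alex Morgan", "Jamie Lee", "Sam Rivera", "Taylor Brooks"]

def surrogate_for(name, salt="demo-salt"):
    """Pick a surrogate deterministically so the same name maps the same way."""
    digest = hashlib.sha256((salt + name.lower()).encode()).digest()
    return SURROGATE_NAMES[digest[0] % len(SURROGATE_NAMES)]

def replace_detected(text, spans):
    """Replace each detected (start, end) span with a surrogate name.
    Spans are assumed to be non-overlapping character offsets."""
    out, cursor = [], 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])
        out.append(surrogate_for(text[start:end]))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

text = "Dr. Smith referred Maria Lopez to Dr. Chen."
# Pretend the detector found two names and missed "Chen".
detected = []
for name in ["Smith", "Maria Lopez"]:
    start = text.index(name)
    detected.append((start, start + len(name)))

print(replace_detected(text, detected))
```

The missed name is still in the output, but a reader can no longer tell which of the names are real and which are surrogates. That matches Stanford's caveat: the method makes re-identification harder without guaranteeing that no PHI leaks.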

Proposed Approach: ARX + Philter

Based on this research, a practical pipeline would look something like:

OMOP Source Database
       ↓
┌──────────────────────────────────────┐
│  Structured Tables                   │
│  (PERSON, VISIT, CONDITION, etc.)    │
│            ↓                         │
│         ARX Framework                │
│  - k-anonymity for quasi-identifiers │
│  - Date shifting (SANT method)       │
│  - Masking of _source_value fields   │
└──────────────────────────────────────┘
       ↓
┌──────────────────────────────────────┐
│  Unstructured Tables                 │
│  (NOTE, NOTE_NLP)                    │
│            ↓                         │
│         Philter                      │
│  - NER for PHI detection             │
│  - Pattern matching for dates, IDs   │
│  - Whitelist-based filtering         │
└──────────────────────────────────────┘
       ↓
De-identified OMOP Database

Why this combination:

ARX: Most widely cited tool for structured health data anonymization
Philter: Only tool with third-party HIPAA certification for clinical text
Both open source: No licensing costs, full transparency
Both used in practice: UCSF runs Philter at scale; ARX has been applied in published clinical data studies (e.g., Im et al.)

Alternative: TiDE, which powers Stanford’s STARR-OMOP-deid, though Stanford itself classifies that dataset as High Risk.

What I Still Don’t Know

UCSF’s structured data approach: The Philter papers focus on clinical notes, but I haven’t found documentation of exactly what UCSF uses for structured table de-identification.
Ready-to-use ARX configs: The Jeon et al. paper provides field-level recommendations, but I haven’t found a ready-to-use ARX configuration file for OMOP tables.
Integration complexity: What does it actually take to integrate ARX and Philter into an ETL pipeline?
Performance at scale: How do these tools perform on databases with tens of millions of patients?

References

Jeon S, et al. Proposal and Assessment of a De-Identification Strategy to Enhance Anonymity of the OMOP-CDM. J Med Internet Res. 2020. https://www.jmir.org/2020/11/e19597/
Posada JD, Flowers N, Desai P. Considerations for De-identification of the OMOP Common Data Model. 2025 OHDSI Symposium. https://www.ohdsi.org/2025showcase-133/
Norgeot B, et al. Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes. npj Digital Medicine. 2020. https://www.nature.com/articles/s41746-020-0258-y
Radhakrishnan L, et al. A certified de-identification system for all clinical text documents. JAMIA Open. 2023. https://academic.oup.com/jamiaopen/article/6/3/ooad045/7219298
Hripcsak G, et al. Preserving temporal relations in clinical data while maintaining privacy. JAMIA. 2016. https://pmc.ncbi.nlm.nih.gov/articles/PMC5070517/
Im E, et al. Exploring the tradeoff between data privacy and utility. BMC Med Inform Decis Mak. 2024. https://link.springer.com/article/10.1186/s12911-024-02545-9
OHDSI. Preserving Privacy in an OMOP CDM Implementation. https://ohdsi.github.io/CommonDataModel/cdmPrivacy.html
HHS. Guidance Regarding Methods for De-identification of PHI. https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html
Stanford STARR. TiDE Clinical Text Safe Harbor. https://starr.stanford.edu/methods/tide-clinical-text-safe-harbor

These are working notes—they’ll evolve as I learn more and (eventually) start building.
