Before building anything, I wanted to understand what’s already out there.
I’m planning to build an open-source de-identification pipeline for OMOP data. But before writing any code, I needed to survey the landscape: what tools exist, what’s been proven in production, and what the research says actually works.
These are my notes.
Why De-identification Matters
De-identification is the process of removing or transforming data elements that could identify individuals. In healthcare, this typically means addressing the 18 identifiers specified under HIPAA’s Safe Harbor provision, or demonstrating through Expert Determination that re-identification risk is “very small.”
The practical payoff: properly de-identified data is no longer considered PHI under HIPAA. That means:
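- Secondary research use generally no longer counts as human subjects research, so full IRB review is usually not required (many institutions still ask for a formal determination)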
- Simplified data sharing agreements
- Faster time-to-research
However, achieving “properly de-identified” status—especially with third-party certification—is not trivial.
Why De-identification is Hard
The Privacy-Utility Trade-off
De-identification isn’t free. Every step taken to protect privacy reduces the analytical value of the data.
A 2024 study from Seoul demonstrated this tension clearly. The researchers tested 19 different de-identification configurations using ARX on clinical data and found:
“All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility.”
— Im et al., BMC Medical Informatics and Decision Making, 2024
This creates a fundamental tension:
- Too little de-identification → High utility, high re-identification risk, regulatory problems
- Too much de-identification → Low risk, but data becomes useless for research
Finding the sweet spot requires understanding both your privacy requirements and your research needs.
OMOP-Specific Challenges
The OMOP Common Data Model inherits all the de-identification challenges of longitudinal clinical data, but its standardized structure makes these challenges explicit:
- Temporal data creates fingerprints: OMOP captures the entire patient journey—conditions, procedures, drug exposures—all linked by a persistent Person ID and precise dates. Even without names, this longitudinal pattern can be uniquely identifying.
- Granularity can be identifying: Rare conditions, unusual lab values, or specific treatment combinations can act as identifiers themselves.
- Clinical notes are a minefield: The NOTE table contains free-text narratives where PHI hides in unpredictable ways—embedded names, locations, family members, referring physicians.
- No community standard: While OHDSI provides guidance on CDM privacy, there’s no universally adopted de-identification standard for OMOP ETL pipelines.
Understanding the OMOP Data Landscape
The OMOP CDM (v5.4) consists of approximately 39 tables. From a de-identification perspective, the challenge isn’t just “which tables”—it’s “which columns require which approach.”
Based on Posada et al.’s framework from Stanford (2025 OHDSI Symposium), OMOP columns fall into three categories:
1. Structured Columns with Predictable Format
These are columns where the data type and format are well-defined: dates, numeric IDs, coded values.
Examples: birth_datetime, person_id, visit_start_date, condition_concept_id.
Approaches:
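- Date shifting: a consistent per-patient offset, as in the shift-and-truncate (SANT) method of Hripcsak et al.
- Hashing/encryption: deterministic, so the same ID maps to the same pseudonym in every table and referential integrity survives
- Format-preserving encryption (FPE): when the output must keep the input's format
- Generalization: age → age range, ZIP → first three digits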
Tools like ARX and sdcMicro handle these systematically.
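To make the structured-column approaches above concrete for myself, here is a minimal Python sketch of three of them: keyed hashing of IDs, a per-patient date shift, and generalization. None of it comes from the cited papers or tools. The key handling, the function names, and the 365-day window are my own assumptions, and it shows only the shifting half of SANT, not the truncation step. The 90+ age cap reflects the Safe Harbor rule that ages over 89 be aggregated; Safe Harbor also restricts three-digit ZIPs covering small populations, which this sketch ignores.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical secret; in practice this would live in a key vault, not in code.
SECRET_KEY = b"replace-with-a-real-secret"


def _keyed_digest(value: str, purpose: str) -> bytes:
    """HMAC-SHA256 of a value, namespaced by purpose so derived values differ."""
    message = f"{purpose}:{value}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).digest()


def pseudonymize_id(person_id: int) -> str:
    """Deterministic pseudonym: the same input always gives the same output,
    so foreign keys across tables still join."""
    return _keyed_digest(str(person_id), "id").hex()[:16]


def shift_date(person_id: int, d: date, max_days: int = 365) -> date:
    """Shift a date back by a per-patient offset between 1 and max_days.
    Every date for one patient moves by the same amount, so intervals survive."""
    raw = int.from_bytes(_keyed_digest(str(person_id), "shift")[:4], "big")
    offset = raw % max_days + 1
    return d - timedelta(days=offset)


def generalize_age(age: int, band: int = 5) -> str:
    """Exact age -> band such as '40-44'; ages 90 and over collapse to '90+'."""
    if age >= 90:
        return "90+"
    low = (age // band) * band
    return f"{low}-{low + band - 1}"


def generalize_zip(zip_code: str) -> str:
    """Five-digit ZIP -> first three digits."""
    return zip_code[:3]
```

Deriving the offset from the patient ID and a secret means no offset table has to be stored, at the cost that anyone holding the key can reverse the shift.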
2. The _source_value Problem
Fields ending in _source_value preserve raw data from the source system. This is where things get complicated: these columns are in structured tables, but their contents vary wildly depending on the source institution:
- Sometimes they contain clean codes: "M", "F", "ICD10:J45.909"
- Sometimes they contain free text with embedded PHI: "Dr. Smith's patient from St. Mary's clinic"
Approaches (from Posada et al.):
- Remove everything unmapped to concept_ids (simplest, but loses provenance)
- Curated allow-lists: Human review of unique unmapped values
- NLP for values that look like free-text
The Jeon et al. paper recommends masking most _source_value fields (e.g., 12345*****), but this assumes the values are structured. If your ETL populates these with free text, you need NLP.
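One way I could imagine combining these options is a triage step that sorts each distinct _source_value into "keep", "review", or "nlp". The sketch below is purely illustrative: the allow-list contents, the regular expression, and the function names are my own assumptions, not anything from Posada et al. or Jeon et al. The `mask_tail` helper only reproduces the 12345***** format mentioned above.

```python
import re
from typing import Optional

# Hypothetical allow-list of values a human has already reviewed as safe.
ALLOW_LIST = {"M", "F", "ICD10:J45.909"}

# Heuristic (an assumption): code-like values are short and contain no spaces.
CODE_LIKE = re.compile(r"[A-Za-z0-9:._\-]{1,20}")


def triage_source_value(value: Optional[str]) -> str:
    """Return 'keep', 'review', or 'nlp' for one raw _source_value."""
    if value is None or value.strip() == "":
        return "keep"      # nothing there to leak
    v = value.strip()
    if v in ALLOW_LIST:
        return "keep"      # already approved by a reviewer
    if CODE_LIKE.fullmatch(v):
        return "review"    # looks structured, but nobody has approved it yet
    return "nlp"           # spaces or other punctuation: treat as free text


def mask_tail(value: str, keep: int = 5) -> str:
    """Keep the first `keep` characters and star out the rest."""
    return value[:keep] + "*" * max(len(value) - keep, 0)
```

Running the triage over distinct values rather than rows keeps the human review queue small, since most tables repeat a limited set of source codes.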
3. Free-Text Columns (Require NLP)
These columns explicitly contain narrative text:
| Table | Column | Content |
|---|---|---|
| NOTE | note_text | Full clinical narratives |
| DRUG_EXPOSURE | sig | Prescription instructions |
| OBSERVATION | value_as_string | Free-text observation values |
These require specialized NLP-based de-identification—pattern matching, named entity recognition, or hybrid approaches.
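For a sense of what the pattern-matching piece looks like, and why it is not enough on its own, here is a small Python sketch. The three patterns and the bracketed placeholder labels are my own assumptions and are nowhere near complete; real systems use far larger rule sets plus NER.

```python
import re

# Illustrative patterns only; each is an assumption about how PHI might appear.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE),
}


def redact_patterns(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Seen 03/14/2021, MRN: 00123456, call 415-555-0100. Referred by Dr. Smith."
print(redact_patterns(note))
```

The dates, phone number, and record number are caught, but "Dr. Smith" passes straight through, which is the gap that named entity recognition or hybrid approaches are meant to close.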
The Tools
For Structured Data: ARX
ARX is an open-source tool for anonymizing tabular data. It appears to be the most widely used framework for healthcare data anonymization in the academic literature.
Key features:
- Implements k-anonymity, l-diversity, t-closeness, and differential privacy
- GUI and Java API available
- Handles large datasets
- GitHub: https://github.com/arx-deidentifier/arx
How it works (simplified):
- Load your tabular data
- Classify columns as identifiers (remove), quasi-identifiers (transform), or sensitive (protect)
- Define generalization hierarchies (e.g., exact age → 5-year bands → 10-year bands)
- Select privacy model and parameters (e.g., k=5 for k-anonymity)
- ARX finds the optimal transformation that satisfies privacy requirements while minimizing information loss
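The steps above can be illustrated with a toy k-anonymity check in Python. This is not ARX's API (ARX is a Java tool that searches over many attributes at once and can also suppress records); it is a single-attribute sketch with a made-up dataset, hypothetical function names, and an assumed hierarchy of age-band widths, meant only to show what "find the least generalization that satisfies k" means.

```python
from collections import Counter


def age_band(age: int, width: int) -> str:
    """One level of the generalization hierarchy: exact age -> band."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def min_class_size(rows, width: int) -> int:
    """Size of the smallest group sharing the same (age band, sex) values."""
    classes = Counter((age_band(r["age"], width), r["sex"]) for r in rows)
    return min(classes.values())


def smallest_width_for_k(rows, k: int, widths=(1, 5, 10, 20)):
    """Try hierarchy levels from finest to coarsest and return the first
    width at which every group has at least k records, or None."""
    for width in widths:
        if min_class_size(rows, width) >= k:
            return width
    return None


# Made-up records with two quasi-identifiers.
rows = [
    {"age": 41, "sex": "F"}, {"age": 43, "sex": "F"},
    {"age": 47, "sex": "F"}, {"age": 48, "sex": "F"},
    {"age": 52, "sex": "M"}, {"age": 58, "sex": "M"},
]

print(smallest_width_for_k(rows, k=2))  # 10
print(smallest_width_for_k(rows, k=3))  # None
```

With k=2 the 10-year bands are enough; with k=3 no level in this hierarchy works, which is the point where a tool has to suppress records or generalize another attribute, and where the utility loss described earlier starts to bite.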
Limitations:
- Requires expert configuration—choosing appropriate quasi-identifiers and hierarchies requires domain knowledge
- No standard OMOP-specific configuration exists (yet)
- Only handles structured data—not designed for free-text
For Clinical Notes: The NLP Options
This is where things get more complex. Clinical notes are messy, unstructured, and full of embedded PHI.
| Tool | Certification | Performance | Notes |
|---|---|---|---|
| Philter-UCSF | Third-party HIPAA certified | 99.46% recall | Only certified solution I’ve found |
| Stanford TiDE | None found; output dataset classified “High Risk” | Scalable (100M notes in ~7 hrs) | Powers Stanford’s STARR-OMOP-deid |
Philter-UCSF
Philter appears to be unique in having achieved third-party HIPAA certification.
Why it stands out:
- Third-party certification: Audited by ArcherHall against 70+ million known PIIs
- Highest published recall I've found: 99.46% on UCSF corpus, 99.92% on i2b2 benchmark
- Production scale: Over 130 million certified de-identified notes at UCSF
- Multi-institutional adoption: UC Irvine, UC Davis, UCLA
- Open source: BSD-2-Clause license
Technical approach: Philter uses a hybrid method combining regular expressions, part-of-speech tagging, and named entity recognition. Importantly, it uses a whitelist approach—rather than trying to identify all possible PHI (an unbounded problem), Philter identifies what’s definitely not PHI and flags everything else.
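To check my understanding of the whitelist idea, here is a deliberately tiny Python sketch. It is not Philter's code or pipeline (there is no part-of-speech tagging or NER here), and the safe-word list, mask string, and function name are all my own assumptions. It only shows the inversion: keep what is known to be safe, mask everything else.

```python
import re

# Hypothetical safe-word list; a real one would hold many thousands of terms.
SAFE_WORDS = {
    "patient", "was", "seen", "by", "dr", "for", "mild", "asthma",
    "and", "the", "a", "with", "reports",
}

# Split text into alternating word and non-word chunks so spacing is preserved.
TOKEN = re.compile(r"\w+|\W+")


def allowlist_filter(text: str, mask: str = "***") -> str:
    """Keep known-safe words and all punctuation; mask every other word."""
    out = []
    for tok in TOKEN.findall(text):
        is_word = re.fullmatch(r"\w+", tok) is not None
        if is_word and tok.lower() not in SAFE_WORDS:
            out.append(mask)
        else:
            out.append(tok)
    return "".join(out)


print(allowlist_filter("Patient was seen by Dr. Garcia for mild asthma."))
# Patient was seen by Dr. *** for mild asthma.
```

The failure mode flips compared with a blocklist: an unknown clinical term gets masked (a utility loss), but an unknown surname gets masked too (a privacy win).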
Key papers:
- Norgeot et al., npj Digital Medicine 2020 — Original Philter paper
- Muenzen et al., JAMIA Open 2023 — Philter V1.0 certification
Stanford TiDE
TiDE is another production system worth noting, with an important caveat from Stanford’s own documentation:
“Note that our method makes re-identification harder. It does not remove the possibility of leaked PHI. The STARR-OMOP-deid dataset is deemed as High Risk dataset due to the presence of occasional leaked PHI.”
Strengths:
- Highly scalable: processes 100 million notes in ~7 hours
- Uses “Hiding in Plain Sight” (HIPS) surrogate replacement
- Open source
- Integrated with Stanford’s OMOP implementation
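The idea behind HIPS, as I understand it, is to replace detected PHI with realistic surrogates so that any PHI the detector misses is hard to tell apart from the fakes. The Python sketch below is my own illustration, not TiDE's implementation: the surrogate pool, the hash-based choice of surrogate, and the assumption that a detector has already supplied character spans are all hypothetical.

```python
import hashlib

# Hypothetical surrogate pool; a real system would draw from large name lists.
SURROGATE_NAMES = ["Alex Morgan", "Jamie Lee", "Sam Patel", "Chris Kim"]


def surrogate_for(name: str) -> str:
    """Pick a surrogate so the same detected name always maps to the same fake.
    The unkeyed hash is an illustration; a real mapping should be keyed or random."""
    digest = hashlib.sha256(name.lower().encode()).hexdigest()
    return SURROGATE_NAMES[int(digest, 16) % len(SURROGATE_NAMES)]


def replace_spans(text: str, spans) -> str:
    """Replace each (start, end) span of detected PHI with a surrogate.
    Spans are assumed to be non-overlapping character offsets."""
    out, cursor = [], 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])
        out.append(surrogate_for(text[start:end]))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Seen by John Doe. John Doe agreed to follow up."
spans = [(8, 16), (18, 26)]  # offsets a detector would have produced
print(replace_spans(text, spans))
```

Keeping the mapping consistent within a note preserves readability (both mentions still refer to one person), though whether to keep it consistent across notes is a separate privacy decision.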
Proposed Approach: ARX + Philter
Based on this research, a practical pipeline would look something like:
OMOP Source Database
↓
┌──────────────────────────────────────┐
│  Structured Tables                   │
│  (PERSON, VISIT, CONDITION, etc.)    │
│  ↓                                   │
│  ARX Framework                       │
│  - k-anonymity for quasi-identifiers │
│  - Date shifting (SANT method)       │
│  - Masking of _source_value fields   │
└──────────────────────────────────────┘
↓
┌──────────────────────────────────────┐
│  Unstructured Tables                 │
│  (NOTE, NOTE_NLP)                    │
│  ↓                                   │
│  Philter                             │
│  - NER for PHI detection             │
│  - Pattern matching for dates, IDs   │
│  - Whitelist-based filtering         │
└──────────────────────────────────────┘
↓
De-identified OMOP Database
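In code, I picture the routing in this diagram as a lookup from (table, column) to a strategy name. The Python sketch below is only a thought experiment: the strategy names, the specific routes, and the fallback rules are my own assumptions, and nothing here calls ARX or Philter.

```python
from typing import Dict, Tuple

# Hypothetical routing table: (table, column) -> strategy name.
ROUTES: Dict[Tuple[str, str], str] = {
    ("person", "person_id"): "hash",
    ("person", "birth_datetime"): "date_shift",
    ("person", "year_of_birth"): "arx",          # quasi-identifier
    ("person", "gender_concept_id"): "arx",      # quasi-identifier
    ("visit_occurrence", "visit_start_date"): "date_shift",
    ("note", "note_text"): "philter",
    ("drug_exposure", "sig"): "philter",
}


def route(table: str, column: str) -> str:
    """Return the strategy for a column, failing closed on anything unknown."""
    key = (table.lower(), column.lower())
    if key in ROUTES:
        return ROUTES[key]
    if key[1].endswith("_source_value"):
        return "mask"      # assumption: structured source values get masked
    if key[1].endswith("_concept_id"):
        return "keep"      # assumption: rare concepts may still need review
    return "drop"          # unknown columns are dropped, not passed through


for table, column in [("note", "note_text"),
                      ("condition_occurrence", "condition_source_value"),
                      ("person", "some_new_column")]:
    print(table, column, "->", route(table, column))
```

The design choice that matters most here is the last line of `route`: a column nobody has classified is dropped, so a schema change in the source system cannot silently leak a new field.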
Why this combination:
- ARX: Most widely cited tool for structured health data anonymization
- Philter: Only tool with third-party HIPAA certification for clinical text
- Both open source: No licensing costs, full transparency
- Both production-proven: UCSF runs Philter at scale; ARX has been validated in multiple OMOP studies
Alternative: TiDE, which powers Stanford’s STARR-OMOP-deid. It is proven at scale, but Stanford itself classifies the resulting dataset as High Risk.
What I Still Don’t Know
- UCSF’s structured data approach: The Philter papers focus on clinical notes, but I haven’t found documentation of exactly what UCSF uses for structured table de-identification.
- Ready-to-use ARX configs: The Jeon et al. paper provides field-level recommendations, but I haven’t found a ready-to-use ARX configuration file for OMOP tables.
- Integration complexity: What does it actually take to integrate ARX and Philter into an ETL pipeline?
- Performance at scale: How do these tools perform on databases with tens of millions of patients?
References
- Jeon S, et al. Proposal and Assessment of a De-Identification Strategy to Enhance Anonymity of the OMOP-CDM. J Med Internet Res. 2020. https://www.jmir.org/2020/11/e19597/
- Posada JD, Flowers N, Desai P. Considerations for De-identification of the OMOP Common Data Model. 2025 OHDSI Symposium. https://www.ohdsi.org/2025showcase-133/
- Norgeot B, et al. Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes. npj Digital Medicine. 2020. https://www.nature.com/articles/s41746-020-0258-y
- Muenzen K, et al. A certified de-identification system for all clinical text documents. JAMIA Open. 2023. https://academic.oup.com/jamiaopen/article/6/3/ooad045/7219298
- Hripcsak G, et al. Preserving temporal relations in clinical data while maintaining privacy. JAMIA. 2016. https://pmc.ncbi.nlm.nih.gov/articles/PMC5070517/
- Im E, et al. Exploring the tradeoff between data privacy and utility. BMC Med Inform Decis Mak. 2024. https://link.springer.com/article/10.1186/s12911-024-02545-9
- OHDSI. Preserving Privacy in an OMOP CDM Implementation. https://ohdsi.github.io/CommonDataModel/cdmPrivacy.html
- HHS. Guidance Regarding Methods for De-identification of PHI. https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html
- Stanford STARR. TiDE Clinical Text Safe Harbor. https://starr.stanford.edu/methods/tide-clinical-text-safe-harbor
These are working notes—they’ll evolve as I learn more and (eventually) start building.