Reproducing CEHR-XGPT: A Beginner's Journey into EHR Foundation Models
Introduction

In my previous post, I set up a local OHDSI development environment with synthetic data from Synthea. As I continued learning about the OMOP Common Data Model, I became interested in a specific question: How can I generate realistic synthetic patient data from an OMOP instance?

While searching for approaches, I found CEHR-XGPT (pronounced “seer-ex-gpt”), a foundation model for electronic health records developed by Chao Pang and colleagues at Columbia University. I think it’s a fantastic piece of work: the idea of using time tokens to preserve temporal structure is elegant, and the fact that a single model can handle feature extraction, prediction, and synthetic generation is impressive. ...