When I started working with OHDSI WebAPI to build a research data platform, one of the first things that required a genuine shift in thinking was the relationship between cohort definitions, cohort generations, and cohort characterizations.
OHDSI separates these into distinct concepts with independent lifecycles — a design that reflects the realities of large-scale observational research: multi-source execution, longitudinal re-runs, and full reproducibility requirements. Once you internalize why the separation exists, the WebAPI endpoint structure starts to feel very intentional.
This post is the walkthrough I put together while building that understanding.
The Three Core Concepts
1. Cohort Definition — the “what”
A cohort definition is the logic — the inclusion and exclusion criteria that describe which patients belong to a cohort. Think of it as a saved query template. It answers:
“What criteria must a patient meet to be included?”
For example: “All patients with a first diagnosis of Chronic Kidney Disease, with at least 365 days of prior observation.” The definition itself contains no patient data — it’s purely declarative.
A useful mental model: a cohort definition is analogous to a SQL view definition. It describes what to select, but doesn’t execute against any data until you explicitly run it against a source.
Key characteristics:
- Source-agnostic — can be executed against any CDM
- Authored once, potentially executed many times
- Stored in the cohort_definition table in WebAPI’s own database
Relevant endpoints:
GET /WebAPI/cohortdefinition # list all definitions
GET /WebAPI/cohortdefinition/{id} # get a definition with its full expression
GET /WebAPI/cohortdefinition/{id}/info # get generation status across all sources
GET /WebAPI/cohortdefinition/{id}/generate/{sourceKey} # trigger a generation run
2. Cohort Generation — the “when” and “how many”
A cohort generation is the execution of a cohort definition against a specific CDM data source. It answers:
“When this definition was run against source X, how many patients qualified?”
A single definition can have many generations: one per source, or multiple re-runs on the same source as the underlying data changes over time. Each run reports a status, timing information, and a person count.
Key characteristics:
- Has a status: RUNNING, COMPLETE, or FAILED
- Records startTime, endTime, and personCount
Relevant endpoint:
GET /WebAPI/cohortdefinition/{id}/info
This returns an array — one entry per source where the cohort has been generated:
[
{
"id": {
"cohortDefinitionId": 8,
"sourceId": 1
},
"status": "COMPLETE",
"personCount": 7664,
"startTime": 1770144930800,
"endTime": 1770144985000,
"isValid": true
}
]
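To make the response above easier to work with, here is a minimal Python sketch that summarizes each entry. It uses the payload exactly as shown, and it assumes that startTime and endTime are epoch milliseconds in UTC (the magnitude of the values suggests this, but I have not confirmed it against the WebAPI source). The function names are my own, not part of WebAPI.

```python
import json
from datetime import datetime, timezone

# Sample payload copied from the /info response shown above.
SAMPLE_INFO = json.loads("""
[
  {
    "id": {"cohortDefinitionId": 8, "sourceId": 1},
    "status": "COMPLETE",
    "personCount": 7664,
    "startTime": 1770144930800,
    "endTime": 1770144985000,
    "isValid": true
  }
]
""")


def summarize_generation(entry):
    """Turn one entry of the /info array into a small, readable summary."""
    # Assumption: both timestamps are epoch milliseconds (UTC).
    started = datetime.fromtimestamp(entry["startTime"] / 1000, tz=timezone.utc)
    duration_seconds = (entry["endTime"] - entry["startTime"]) / 1000
    return {
        "source_id": entry["id"]["sourceId"],
        "status": entry["status"],
        "person_count": entry["personCount"],
        "started_utc": started.isoformat(),
        "duration_seconds": duration_seconds,
    }


def completed_generations(info):
    """Keep only entries that finished and are still flagged as valid."""
    return [
        summarize_generation(entry)
        for entry in info
        if entry["status"] == "COMPLETE" and entry.get("isValid", False)
    ]


if __name__ == "__main__":
    for summary in completed_generations(SAMPLE_INFO):
        print(summary)
```

Filtering on isValid as well as status is a design choice on my part: it skips runs whose results may no longer match the current definition.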
3. Cohort Characterization — the “analyze what”
A cohort characterization is an analysis configuration that defines what you want to know about a cohort. It answers:
“Given a generated cohort, what clinical features should we analyze?”
A characterization bundles together:
- One or more cohort definitions to analyze
- A set of feature analyses to run (demographics, conditions, drugs, procedures, etc.)
Like the cohort definition, a characterization is declarative — it contains no patient data and must be executed against a CDM source to produce results. That execution has its own lifecycle, completely independent of cohort generations.
Key characteristics:
- References one or more cohort definitions
- References feature analyses (DemographicsGender, ConditionOccurrenceLongTerm, etc.)
- Has its own generation lifecycle — separate from cohort definition generations
Relevant endpoints:
GET /WebAPI/cohort-characterization # list all characterizations
GET /WebAPI/cohort-characterization/{id}/design # full design: cohorts + analyses
GET /WebAPI/cohort-characterization/{id}/export # design with full cohort expressions
GET /WebAPI/cohort-characterization/{id}/generation # list generation runs
POST /WebAPI/cohort-characterization/generation/{id}/result # fetch the actual results
How They Relate
Cohort Definition ←── "What patients qualify?"
│
│ executed against a CDM source
▼
Cohort Generation ←── "These N patients qualified on this date"
│
│ referenced by
▼
Cohort Characterization ←── "What do we want to know about those patients?"
│
│ executed against a CDM source
▼
Characterization Generation ←── "Here are the analysis results"
│
│ retrieved via
▼
POST /cohort-characterization/generation/{id}/result
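One way to keep these four layers straight is to model them as plain data types. The sketch below is only a mental model in Python, not WebAPI’s actual schema: the class names, field names, and the sample IDs other than cohort 8 and characterization 12 are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class CohortDefinition:
    """Declarative criteria only; holds no patient data."""
    id: int
    name: str


@dataclass
class CohortGeneration:
    """One execution of a definition against one CDM source."""
    cohort_definition_id: int
    source_key: str
    status: str
    person_count: Optional[int] = None


@dataclass
class Characterization:
    """Analysis configuration: which cohorts, which feature analyses."""
    id: int
    name: str
    cohort_definition_ids: List[int] = field(default_factory=list)
    feature_analysis_ids: List[int] = field(default_factory=list)


@dataclass
class CharacterizationGeneration:
    """One execution of a characterization against one CDM source."""
    id: int
    characterization_id: int
    source_key: str
    status: str


def result_path(generation: CharacterizationGeneration) -> str:
    """Path of the results endpoint for a given characterization generation."""
    return f"/WebAPI/cohort-characterization/generation/{generation.id}/result"


if __name__ == "__main__":
    ckd = CohortDefinition(id=8, name="Chronic Kidney Disease")
    characterization = Characterization(
        id=12, name=ckd.name, cohort_definition_ids=[ckd.id]
    )
    # Generation ID 345 is a made-up value for illustration.
    run = CharacterizationGeneration(
        id=345, characterization_id=characterization.id,
        source_key="SYNPUF5PCT", status="COMPLETED",
    )
    print(result_path(run))
```

Note that the two generation types are separate classes with no shared identifier, which mirrors the independent lifecycles described above.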
Walking Through the API
Here’s a concrete walkthrough for fetching characterization results for “Chronic Kidney Disease” (characterization ID 12) on SYNPUF5PCT.
Step 1 — Find the characterization
GET /WebAPI/cohort-characterization?size=10000
Returns a paginated list of characterizations; the large size value is there to pull everything back in a single page. Find the entry with ID 12.
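Once the list is in hand, locating the entry is a simple lookup. The sketch below assumes the response is a page object with its items under a "content" key and that each item carries "id" and "name" fields. That shape is my assumption, so the helper also accepts a bare list. The sample page is invented for illustration.

```python
def find_characterization(page, characterization_id):
    """Return the list entry with the given ID, or None if it is absent."""
    # Assumption: paginated responses wrap the items in a "content" list.
    items = page.get("content", []) if isinstance(page, dict) else page
    for item in items:
        if item.get("id") == characterization_id:
            return item
    return None


if __name__ == "__main__":
    # Hypothetical page body; only ID 12 and its name come from this post.
    sample_page = {
        "content": [
            {"id": 11, "name": "Some other characterization"},
            {"id": 12, "name": "Chronic Kidney Disease"},
        ]
    }
    print(find_characterization(sample_page, 12))
```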
Step 2 — Get the design
GET /WebAPI/cohort-characterization/12/design
Returns the cohort definitions and feature analyses referenced by this characterization. You’ll need both to construct the result request in step 4.
Step 3 — Find a completed generation for your source
GET /WebAPI/cohort-characterization/12/generation
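Filter for status == "COMPLETED" and your target sourceKey, then extract the generationId. Note the spelling difference: this endpoint reports COMPLETED, whereas the cohort generation info in section 2 reports COMPLETE.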
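The filter can be sketched in a few lines of Python. I am assuming each entry in the generation list exposes "id", "status", "sourceKey", and "startTime" fields; treat those names as assumptions and check them against your own response. The sample entries are invented. When several runs qualify, this version picks the most recently started one.

```python
def latest_completed_generation_id(generations, source_key):
    """Return the ID of the newest COMPLETED run for a source, or None."""
    candidates = [
        g for g in generations
        if g.get("status") == "COMPLETED" and g.get("sourceKey") == source_key
    ]
    if not candidates:
        return None
    # Assumption: startTime is a sortable number such as epoch milliseconds.
    newest = max(candidates, key=lambda g: g.get("startTime", 0))
    return newest["id"]


if __name__ == "__main__":
    # Hypothetical generation list for characterization 12.
    sample_generations = [
        {"id": 101, "status": "COMPLETED", "sourceKey": "SYNPUF5PCT", "startTime": 1000},
        {"id": 102, "status": "FAILED", "sourceKey": "SYNPUF5PCT", "startTime": 2000},
        {"id": 103, "status": "COMPLETED", "sourceKey": "SYNPUF5PCT", "startTime": 3000},
        {"id": 104, "status": "COMPLETED", "sourceKey": "OTHER_SOURCE", "startTime": 4000},
    ]
    print(latest_completed_generation_id(sample_generations, "SYNPUF5PCT"))
```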
Step 4 — Fetch the results
POST /WebAPI/cohort-characterization/generation/{generationId}/result
{
"cohortIds": [8],
"analysisIds": [70, 71, 72, 74, 67, 76, 90],
"domainIds": ["DEMOGRAPHICS", "CONDITION", "DRUG"],
"showEmptyResults": false
}
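Rather than typing those IDs by hand, the body can be derived from the design fetched in step 2. The Python sketch below builds the request without sending it. It assumes the design exposes its cohorts under "cohorts" and its analyses under "featureAnalyses", each with an "id" field. Those key names, the base URL, and generation ID 345 are assumptions for illustration.

```python
import json
import urllib.request


def build_result_request(base_url, generation_id, design, domain_ids):
    """Prepare (but do not send) the POST request for characterization results."""
    body = {
        # Assumption: the design lists cohorts and analyses under these keys.
        "cohortIds": [cohort["id"] for cohort in design["cohorts"]],
        "analysisIds": [analysis["id"] for analysis in design["featureAnalyses"]],
        "domainIds": list(domain_ids),
        "showEmptyResults": False,
    }
    url = (
        f"{base_url.rstrip('/')}"
        f"/cohort-characterization/generation/{generation_id}/result"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Trimmed, hypothetical design; the IDs match the request body above.
    sample_design = {
        "cohorts": [{"id": 8}],
        "featureAnalyses": [{"id": i} for i in (70, 71, 72, 74, 67, 76, 90)],
    }
    request = build_result_request(
        "http://localhost:8080/WebAPI",  # hypothetical base URL
        345,                             # hypothetical generation ID
        sample_design,
        ["DEMOGRAPHICS", "CONDITION", "DRUG"],
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(request)
```

Keeping construction separate from sending makes the payload easy to inspect or log before it touches a real server.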
Why This Design Makes Sense
The separation of these three concepts serves real research workflow requirements:
Reusability. A cohort definition can be referenced by many characterizations. Define “Type 2 Diabetes” once and use it across a comorbidity study, a drug utilization analysis, and a treatment pathway analysis — without duplicating the definition logic.
Source independence. Both definitions and characterizations are source-agnostic. The same characterization can be executed against synthetic data during development and real-world data in production, producing comparable results without any changes to the analysis configuration.
Reproducibility. Separating the definition from the execution creates a verifiable audit trail. You can always answer: “What exact criteria were in effect when this cohort was generated on this date?” — critical for regulatory submissions and scientific reproducibility.
Incremental re-execution. When source data is refreshed, you re-execute the characterization without touching the definition. This supports longitudinal studies where you need to track how a cohort’s characteristics evolve as new data is ingested.
Closing Thoughts
If I had to distill it: a definition is the recipe, a generation is the cooked meal, and a characterization is a nutritional analysis of that meal. They’re related but serve distinct purposes, and the separation is what gives OHDSI the flexibility to support multi-source, longitudinal, reproducible research.
The OHDSI community has been a great resource throughout this work. If you’re getting started, the OHDSI forums are the best place to ask questions.