Making Sense of Cohort Definitions, Characterizations, and Generations in OHDSI WebAPI

When I started working with OHDSI WebAPI to build a research data platform, one of the first things that required a genuine shift in thinking was the relationship between cohort definitions, cohort generations, and cohort characterizations.

OHDSI separates these into distinct concepts with independent lifecycles — a design that reflects the realities of large-scale observational research: multi-source execution, longitudinal re-runs, and full reproducibility requirements. Once you internalize why the separation exists, the WebAPI endpoint structure starts to feel very intentional.

This post is the walkthrough I put together while building that understanding.

The Three Core Concepts

1. Cohort Definition — the “what”

A cohort definition is the logic — the inclusion and exclusion criteria that describe which patients belong to a cohort. Think of it as a saved query template. It answers:

“What criteria must a patient meet to be included?”

For example: “All patients with a first diagnosis of Chronic Kidney Disease, with at least 365 days of prior observation.” The definition itself contains no patient data — it’s purely declarative.

A useful mental model: a cohort definition is analogous to a SQL view definition. It describes what to select, but doesn’t execute against any data until you explicitly run it against a source.

Key characteristics:

Source-agnostic — can be executed against any CDM
Authored once, potentially executed many times
Stored in the cohort_definition table in WebAPI’s own database

Relevant endpoints:

GET  /WebAPI/cohortdefinition                              # list all definitions
GET  /WebAPI/cohortdefinition/{id}                         # get a definition with its full expression
GET  /WebAPI/cohortdefinition/{id}/info                    # get generation status across all sources
GET  /WebAPI/cohortdefinition/{id}/generate/{sourceKey}    # trigger a generation run
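
As a rough sketch, the four endpoints above can be wrapped in a small URL builder. The base URL and the function name are placeholders I made up for illustration; they are not part of WebAPI itself.

```python
from urllib.parse import quote

# Hypothetical base URL; replace with your own WebAPI deployment.
BASE_URL = "http://localhost:8080/WebAPI"


def cohort_definition_url(definition_id=None, action=None, source_key=None):
    """Build a cohortdefinition endpoint URL from the pieces listed above."""
    parts = [BASE_URL, "cohortdefinition"]
    if definition_id is not None:
        parts.append(str(int(definition_id)))
    if action is not None:
        parts.append(action)  # "info" or "generate"
    if source_key is not None:
        parts.append(quote(source_key, safe=""))
    return "/".join(parts)


print(cohort_definition_url())                             # list all definitions
print(cohort_definition_url(8))                            # one definition
print(cohort_definition_url(8, "info"))                    # generation status
print(cohort_definition_url(8, "generate", "SYNPUF5PCT"))  # trigger a run
```

Authentication is deliberately left out; whether you need a token header depends on how your instance is configured.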

2. Cohort Generation — the “when” and “how many”

A cohort generation is the execution of a cohort definition against a specific CDM data source. It answers:

“When this definition was run against source X, how many patients qualified?”

A single definition can have many generations: one per source, or multiple re-runs on the same source as the underlying data changes over time. What you get back for each one is a status, timing information, and a person count.

Key characteristics:

Has a status: RUNNING, COMPLETE, or FAILED
Records startTime, endTime, and personCount

Relevant endpoint:

GET /WebAPI/cohortdefinition/{id}/info

This returns an array — one entry per source where the cohort has been generated:

[
  {
    "id": {
      "cohortDefinitionId": 8,
      "sourceId": 1
    },
    "status": "COMPLETE",
    "personCount": 7664,
    "startTime": 1770144930800,
    "endTime": 1770144985000,
    "isValid": true
  }
]
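
The startTime and endTime values look like epoch milliseconds. Under that assumption, a minimal sketch for turning one entry of the response above into something readable might be the following (the function name and the output keys are my own):

```python
import json
from datetime import datetime, timezone

# The sample response shown above, embedded so the sketch runs offline.
SAMPLE_INFO = """
[
  {
    "id": {"cohortDefinitionId": 8, "sourceId": 1},
    "status": "COMPLETE",
    "personCount": 7664,
    "startTime": 1770144930800,
    "endTime": 1770144985000,
    "isValid": true
  }
]
"""


def summarize_generation(entry):
    """Flatten one /info entry, assuming timestamps are epoch milliseconds."""
    started_at = datetime.fromtimestamp(entry["startTime"] / 1000, tz=timezone.utc)
    return {
        "source_id": entry["id"]["sourceId"],
        "status": entry["status"],
        "person_count": entry["personCount"],
        "started_at": started_at,
        "duration_seconds": (entry["endTime"] - entry["startTime"]) / 1000,
    }


summaries = [summarize_generation(e) for e in json.loads(SAMPLE_INFO)]
for s in summaries:
    print(s["source_id"], s["status"], s["person_count"], s["duration_seconds"])
```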

3. Cohort Characterization — the “analyze what”

A cohort characterization is an analysis configuration that defines what you want to know about a cohort. It answers:

“Given a generated cohort, what clinical features should we analyze?”

A characterization bundles together:

One or more cohort definitions to analyze
A set of feature analyses to run (demographics, conditions, drugs, procedures, etc.)

Like the cohort definition, a characterization is declarative — it contains no patient data and must be executed against a CDM source to produce results. That execution has its own lifecycle, completely independent of cohort generations.

Key characteristics:

References one or more cohort definitions
References feature analyses (DemographicsGender, ConditionOccurrenceLongTerm, etc.)
Has its own generation lifecycle — separate from cohort definition generations

Relevant endpoints:

GET  /WebAPI/cohort-characterization                           # list all characterizations
GET  /WebAPI/cohort-characterization/{id}/design               # full design: cohorts + analyses
GET  /WebAPI/cohort-characterization/{id}/export               # design with full cohort expressions
GET  /WebAPI/cohort-characterization/{id}/generation           # list generation runs
POST /WebAPI/cohort-characterization/generation/{id}/result    # fetch the actual results

How They Relate

Cohort Definition        ←── "What patients qualify?"
        │
        │ executed against a CDM source
        ▼
Cohort Generation        ←── "These N patients qualified on this date"
        │
        │ referenced by
        ▼
Cohort Characterization  ←── "What do we want to know about those patients?"
        │
        │ executed against a CDM source
        ▼
Characterization Generation  ←── "Here are the analysis results"
        │
        │ retrieved via
        ▼
POST /cohort-characterization/generation/{id}/result

Walking Through the API

Here’s a concrete walkthrough for fetching characterization results for “Chronic Kidney Disease” (characterization ID 12) on SYNPUF5PCT.

Step 1 — Find the characterization

GET /WebAPI/cohort-characterization?size=10000

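This returns a paginated list of characterizations; the large size parameter is there to pull everything back in a single page. Find the entry with ID 12.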
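
A small sketch of that lookup, assuming the page comes back as an object with its entries under a "content" key (an assumption about the response shape that is worth checking against your WebAPI version). The sample page is made up.

```python
def find_characterization(page, characterization_id):
    """Return the entry with the given id, or None if it is not on this page."""
    for item in page.get("content", []):
        if item.get("id") == characterization_id:
            return item
    return None


# Made-up page for illustration only; field names are assumptions.
sample_page = {
    "content": [
        {"id": 11, "name": "Example characterization"},
        {"id": 12, "name": "Chronic Kidney Disease"},
    ]
}

match = find_characterization(sample_page, 12)
print(match["name"] if match else "not found")
```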

Step 2 — Get the design

GET /WebAPI/cohort-characterization/12/design

Returns the cohort definitions and feature analyses referenced by this characterization. You’ll need both to construct the result request in step 4.
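
Pulling the IDs out of the design could look roughly like this. The key names ("cohorts", "featureAnalyses", "domain") are assumptions about the design payload, and the sample design below is invented, so treat it as a sketch rather than a description of the real response.

```python
def ids_from_design(design):
    """Collect cohort IDs, analysis IDs, and distinct domains from a design."""
    cohort_ids = [c["id"] for c in design.get("cohorts", [])]
    analyses = design.get("featureAnalyses", [])
    analysis_ids = [a["id"] for a in analyses]
    # dict.fromkeys de-duplicates while keeping first-seen order.
    domain_ids = list(dict.fromkeys(a["domain"] for a in analyses))
    return cohort_ids, analysis_ids, domain_ids


# Invented design; the analysis IDs and domain pairings are not real data.
sample_design = {
    "cohorts": [{"id": 8, "name": "Chronic Kidney Disease"}],
    "featureAnalyses": [
        {"id": 101, "domain": "DEMOGRAPHICS"},
        {"id": 102, "domain": "DEMOGRAPHICS"},
        {"id": 103, "domain": "CONDITION"},
        {"id": 104, "domain": "DRUG"},
    ],
}

cohort_ids, analysis_ids, domain_ids = ids_from_design(sample_design)
print(cohort_ids, analysis_ids, domain_ids)
```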

Step 3 — Find a completed generation for your source

GET /WebAPI/cohort-characterization/12/generation

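Filter for status == "COMPLETED" (note the spelling: characterization generations report COMPLETED, while the cohort generation status shown earlier is COMPLETE) and your target sourceKey. Take that generation's ID; it becomes {generationId} in step 4.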
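
A hedged sketch of that filter, picking the most recent completed run for a source. The field names ("id", "status", "sourceKey", "startTime") are assumed, and the sample list is made up.

```python
def latest_completed_generation(generations, source_key):
    """Return the id of the newest COMPLETED run for source_key, or None."""
    candidates = [
        g for g in generations
        if g.get("status") == "COMPLETED" and g.get("sourceKey") == source_key
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda g: g.get("startTime", 0))["id"]


# Invented generation list for illustration.
sample_generations = [
    {"id": 340, "status": "COMPLETED", "sourceKey": "SYNPUF5PCT", "startTime": 1000},
    {"id": 345, "status": "COMPLETED", "sourceKey": "SYNPUF5PCT", "startTime": 2000},
    {"id": 346, "status": "FAILED", "sourceKey": "SYNPUF5PCT", "startTime": 3000},
    {"id": 347, "status": "COMPLETED", "sourceKey": "OTHER", "startTime": 4000},
]

generation_id = latest_completed_generation(sample_generations, "SYNPUF5PCT")
print(generation_id)
```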

Step 4 — Fetch the results

POST /WebAPI/cohort-characterization/generation/{generationId}/result

{
  "cohortIds": [8],
  "analysisIds": [70, 71, 72, 74, 67, 76, 90],
  "domainIds": ["DEMOGRAPHICS", "CONDITION", "DRUG"],
  "showEmptyResults": false
}
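
Putting the request together with only the standard library might look like the sketch below. The base URL and the generation ID 345 are placeholders, and the request is built but not sent, so the snippet runs without a live server.

```python
import json
import urllib.request

# Hypothetical base URL; replace with your own WebAPI deployment.
BASE_URL = "http://localhost:8080/WebAPI"


def build_result_request(generation_id, cohort_ids, analysis_ids, domain_ids):
    """Build (but do not send) the POST request for characterization results."""
    body = {
        "cohortIds": cohort_ids,
        "analysisIds": analysis_ids,
        "domainIds": domain_ids,
        "showEmptyResults": False,
    }
    return urllib.request.Request(
        f"{BASE_URL}/cohort-characterization/generation/{generation_id}/result",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


request = build_result_request(
    345,  # placeholder generation ID
    [8],
    [70, 71, 72, 74, 67, 76, 90],
    ["DEMOGRAPHICS", "CONDITION", "DRUG"],
)
print(request.get_method(), request.full_url)

# To actually send it against a running instance:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```

If your instance requires authentication, the token header would need to be added to the headers dictionary.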

Why This Design Makes Sense

The separation of these three concepts serves real research workflow requirements:

Reusability. A cohort definition can be referenced by many characterizations. Define “Type 2 Diabetes” once and use it across a comorbidity study, a drug utilization analysis, and a treatment pathway analysis — without duplicating the definition logic.

Source independence. Both definitions and characterizations are source-agnostic. The same characterization can be executed against synthetic data during development and real-world data in production, producing comparable results without any changes to the analysis configuration.

Reproducibility. Separating the definition from the execution creates a verifiable audit trail. You can always answer: “What exact criteria were in effect when this cohort was generated on this date?” — critical for regulatory submissions and scientific reproducibility.

Incremental re-execution. When source data is refreshed, you re-execute the characterization without touching the definition. This supports longitudinal studies where you need to track how a cohort’s characteristics evolve as new data is ingested.

Closing Thoughts

If I had to distill it: a definition is the recipe, a generation is the cooked meal, and a characterization is a nutritional analysis of that meal. They’re related but serve distinct purposes, and the separation is what gives OHDSI the flexibility to support multi-source, longitudinal, reproducible research.

The OHDSI community has been a great resource throughout this work. If you’re getting started, the OHDSI forums are the best place to ask questions.
