A quick hands-on exercise to understand ARX’s features through its GUI, using the bundled adult dataset (US Census data).
Setup
Download the ARX repository to access the example data:
git clone https://github.com/arx-deidentifier/arx.git
The example dataset is at arx/data/adult.csv — US Census data from the UCI ML Repository.
Creating a Project and Importing Data
In ARX GUI:
- File → New Project
- File → Import data → CSV file
- Select adult.csv

Leave “Perform data cleansing” checked. With cleansing enabled, values that do not match a column's data type are replaced with NULL during import.
Setting Attribute Types
Each column needs an attribute type that tells ARX how to handle it:
| Type | Purpose | Color |
|---|---|---|
| Identifying | Direct identifiers (name, SSN) — will be removed | Red |
| Quasi-identifying | Can be combined to re-identify (age, zip, gender) | Yellow |
| Sensitive | What you’re protecting but want to analyze (diagnosis, salary) | Purple |
| Insensitive | Safe to keep as-is | Green |
For the adult dataset, I set:
- Quasi-identifying: age, sex, race, marital-status, education, native-country, workclass, occupation
- Sensitive: salary-class
Click a column header, then change the “Type” dropdown in the right panel.

Loading Hierarchies
Hierarchies tell ARX how to generalize values. For each quasi-identifier, load the corresponding hierarchy file from arx/data/:
- Click the column header (e.g., “age”)
- File → Import hierarchy
- Select the hierarchy file (e.g., adult_hierarchy_age.csv)
The age hierarchy shows generalization levels:
Level-0: age 39 (original — most useful, least private)
Level-1: 35-39 (5-year band)
Level-2: 30-39 (10-year band)
Level-3: 20-39 (20-year band)
Level-4: * (suppressed — least useful, most private)
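To make the levels concrete, here is a minimal Python sketch that rebuilds the row for age 39 shown above. The function names are my own, and the band boundaries are an assumption taken from that single example; the bundled hierarchy file may cut its bands differently.

```python
def generalize_age(age, level):
    """Return the label for `age` at a hierarchy level (0 = original, 4 = suppressed)."""
    band_widths = {1: 5, 2: 10, 3: 20}  # assumed band width in years for levels 1-3
    if level == 0:
        return str(age)   # original value
    if level == 4:
        return "*"        # fully suppressed
    width = band_widths[level]
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def hierarchy_row(age):
    """All five levels for one age, from most specific to most general."""
    return [generalize_age(age, level) for level in range(5)]


print(hierarchy_row(39))  # ['39', '35-39', '30-39', '20-39', '*']
```

Each step up the hierarchy merges neighbouring bands, so a higher level never splits a group that a lower level already formed.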
How Levels Work

With the default global transformation model, ARX doesn’t generalize record-by-record. It applies one generalization level per attribute to the entire dataset, choosing the combination of levels that:
- Satisfies your privacy requirement (e.g., k=5)
- Minimizes information loss
When you run anonymization with k-anonymity (k=5):
- ARX tries Level-0 (exact ages) — Can it make everyone indistinguishable from 4 others? Probably not.
- ARX tries Level-1 (5-year bands) — Better, but maybe still not enough.
- ARX tries Level-2 (10-year bands) — Maybe now k=5 is achievable.
- ARX picks the minimum generalization that satisfies the privacy model.

The output might be: “age at Level-2, sex at Level-1, race at Level-1”, meaning all ages become 10-year bands, while sex and race are each generalized by one level.
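The "try a level, check k, move up" idea can be sketched for a single attribute. This is a deliberately simplified illustration, not ARX's actual search: ARX works over combinations of levels across all quasi-identifiers and also scores utility. The helper names, band widths, and toy ages are hypothetical.

```python
from collections import Counter


def generalize(age, level):
    """Map an age to its label at the given level (assumed 5/10/20-year bands)."""
    widths = {1: 5, 2: 10, 3: 20}
    if level == 0:
        return str(age)
    if level == 4:
        return "*"
    w = widths[level]
    low = age // w * w
    return f"{low}-{low + w - 1}"


def min_level_for_k(ages, k, max_level=4):
    """Lowest level at which every group of equal labels has at least k records."""
    for level in range(max_level + 1):
        groups = Counter(generalize(a, level) for a in ages)
        if min(groups.values()) >= k:
            return level
    return None  # even full suppression leaves fewer than k records


ages = [21, 23, 24, 26, 28, 29, 31, 33, 34, 36, 38, 39]
print(min_level_for_k(ages, k=5))  # 2: the 10-year bands hold 6 records each
print(min_level_for_k(ages, k=3))  # 1: the 5-year bands hold 3 records each
```

The same level is applied to every record, which is why one sparse region of the data can force coarser bands for everyone.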
Repeat hierarchy loading for all quasi-identifiers: sex, race, marital-status, education, native-country, workclass, occupation.
Understanding Sensitive Attributes
A sensitive attribute is the private information you want to protect from being linked to individuals, but still want to analyze (e.g., salary, diagnosis, disease).
ARX does not transform sensitive values. Instead, it generalizes the quasi-identifiers enough so that attackers cannot link a specific person to their sensitive value.
Example: In a salary study, an attacker who knows “39-year-old white male from the US” shouldn’t be able to find that exact person’s salary in the dataset. ARX achieves this by generalizing age/sex/race until multiple people share the same combination.
How ARX Protects Sensitive Attributes
- With k-anonymity: At least k people share the same quasi-identifier combination (so the attacker can’t pinpoint you)
- With ℓ-diversity: Within each group, there are at least ℓ different sensitive values (so even if they find your group, they can’t be certain of your salary)
Goal: “Analyze salary patterns across demographics, without anyone being able to figure out a specific person’s salary.”
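Both guarantees can be checked by grouping records on their quasi-identifier values. The sketch below uses a made-up five-record table and hypothetical function names; it tests distinct ℓ-diversity only.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier combination."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record)
    return classes


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every group contains at least k records."""
    groups = equivalence_classes(records, quasi_identifiers).values()
    return all(len(group) >= k for group in groups)


def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every group contains at least l different sensitive values."""
    groups = equivalence_classes(records, quasi_identifiers).values()
    return all(len({r[sensitive] for r in group}) >= l for group in groups)


records = [
    {"age": "30-39", "sex": "Male", "salary-class": ">50K"},
    {"age": "30-39", "sex": "Male", "salary-class": "<=50K"},
    {"age": "30-39", "sex": "Male", "salary-class": "<=50K"},
    {"age": "20-29", "sex": "Female", "salary-class": "<=50K"},
    {"age": "20-29", "sex": "Female", "salary-class": "<=50K"},
]
qis = ["age", "sex"]
print(is_k_anonymous(records, qis, k=2))                        # True
print(is_distinct_l_diverse(records, qis, "salary-class", l=2))  # False
```

The second group is 2-anonymous but not 2-diverse: anyone known to be in it earns "<=50K", which is exactly the leak ℓ-diversity is meant to prevent.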
Adding Privacy Models
k-Anonymity
In the “Privacy models” panel (bottom-right):
- Click the “+” button
- Select k-Anonymity
- Set k = 5 (double-click the knob to edit)
- Click OK

k=5 is a commonly cited choice for healthcare data in the USA. k=3 is sometimes considered acceptable, but k=5 ensures that every record shares its quasi-identifier combination with at least four others.
ℓ-Diversity
If you have a sensitive attribute, you also need a privacy model that protects it, such as ℓ-diversity:
- Click “+” again
- Select ℓ-Diversity
- Set ℓ = 2 (since salary-class only has 2 values: “>50K” and “<=50K”)
- Variant: Distinct-ℓ-diversity (simplest option)
- Click OK
Variant options:
| Variant | Description |
|---|---|
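| Distinct-ℓ-diversity | Simply counts distinct values. Each group must contain at least ℓ different sensitive values. |
| Entropy-ℓ-diversity | Uses an entropy measure. The entropy of the sensitive values in each group must be at least log(ℓ), so values must be distributed more evenly. |
| Recursive-(c,ℓ)-diversity | Stricter. Ensures the most common value doesn’t dominate: its count must be less than c times the combined count of the ℓ-th most frequent and all rarer values. |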
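The three variants differ only in the test applied to the sensitive values of one group. Here is a rough sketch using the textbook definitions; the function names and the small float tolerance are my own choices, not ARX's implementation.

```python
import math
from collections import Counter


def distinct_l(values, l):
    """At least l different sensitive values in the group."""
    return len(set(values)) >= l


def entropy_l(values, l):
    """Entropy of the value distribution must be at least log(l)."""
    n = len(values)
    entropy = -sum(c / n * math.log(c / n) for c in Counter(values).values())
    return entropy >= math.log(l) - 1e-12  # tolerance for float rounding


def recursive_cl(values, c, l):
    """Most frequent count r1 must satisfy r1 < c * (r_l + r_(l+1) + ...)."""
    counts = sorted(Counter(values).values(), reverse=True)
    if len(counts) < l:
        return False
    return counts[0] < c * sum(counts[l - 1:])


skewed = ["<=50K"] * 9 + [">50K"]        # one value dominates
balanced = ["<=50K"] * 5 + [">50K"] * 5  # even split

print(distinct_l(skewed, 2), entropy_l(skewed, 2), recursive_cl(skewed, 3, 2))
# True False False
print(distinct_l(balanced, 2), entropy_l(balanced, 2), recursive_cl(balanced, 3, 2))
# True True True
```

The skewed group passes the distinct test even though an attacker could guess "<=50K" and be right 90% of the time, which is the gap the other two variants close.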
Running Anonymization
Edit → Anonymize

The anonymization options dialog appears:

Search Strategy:
| Option | Description |
|---|---|
| Optimal | Exhaustive search — guarantees best solution, slow for many quasi-identifiers |
| Best-effort, binary | Faster heuristic using binary search |
| Best-effort, top down | Starts from maximum generalization, works down |
| Best-effort, bottom up | Starts from minimum generalization, works up |
| Best-effort, genetic | Uses genetic algorithm — good for high-dimensional data |
Rule of thumb: Use “Optimal” if you have ≤15 quasi-identifiers.
Transformation Model:
| Option | Description |
|---|---|
| Global transformation | Same generalization level applied to ALL records (standard) |
| Local transformation | Different generalization for different records — better utility, more complex |
For learning, keep the defaults (Optimal + Global) and click OK.
Exploring Results
After anonymization, the top-right shows something like:
Transformations: 6480 | Selected: [0, 4, 0, 2, 3, 2, 2, 1] | Applied: [0, 4, 0, 2, 3, 2, 2, 1]

This means the solution space contained 6,480 candidate transformations, from which ARX selected the optimal one. The numbers are the generalization levels chosen for each quasi-identifier, one entry per attribute.
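One way to read the figure 6,480: if every combination of levels counts as a transformation, the total is the product of the hierarchy heights. The heights and attribute order below are assumptions on my part, chosen to be consistent with the selected vector and the reported total; check them against the hierarchy files you actually loaded.

```python
from math import prod

# Assumed number of levels per hierarchy (including level 0) and assumed attribute order.
levels = {
    "sex": 2,
    "age": 5,
    "race": 2,
    "marital-status": 3,
    "education": 4,
    "native-country": 3,
    "workclass": 3,
    "occupation": 3,
}
selected = [0, 4, 0, 2, 3, 2, 2, 1]  # vector reported in the GUI

print(prod(levels.values()))  # 6480

# Show each attribute's chosen level next to its highest possible level.
for (attribute, height), level in zip(levels.items(), selected):
    print(f"{attribute}: level {level} of {height - 1}")
```

Under these assumptions, age sits at its top level (4 of 4), which matches the full suppression of age seen in the utility view.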
Solution Space Lattice
Click the “Explore results” tab to see the lattice visualization:

- Each node is a transformation (combination of generalization levels)
- Green nodes = satisfy privacy (k=5, ℓ=2)
- Blue border = the selected optimal transformation
- Lines connect transformations that differ by one level
ARX found many valid solutions but selected the one with the best utility score — preserving the most information while satisfying privacy requirements.
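The lattice structure itself is easy to sketch: nodes are tuples of levels, and an edge joins two nodes that differ by one level in one attribute. This toy example uses three attributes with hypothetical maximum levels, not the full adult configuration.

```python
from itertools import product


def successors(node, max_levels):
    """Nodes reached by generalizing exactly one attribute by one more level."""
    return [
        node[:i] + (level + 1,) + node[i + 1:]
        for i, (level, top) in enumerate(zip(node, max_levels))
        if level < top
    ]


def predecessors(node):
    """Nodes reached by lowering exactly one attribute by one level."""
    return [
        node[:i] + (level - 1,) + node[i + 1:]
        for i, level in enumerate(node)
        if level > 0
    ]


max_levels = (1, 4, 1)  # hypothetical top levels for sex, age, race
all_nodes = list(product(*(range(top + 1) for top in max_levels)))

print(len(all_nodes))                     # 20
print(successors((0, 2, 1), max_levels))  # [(1, 2, 1), (0, 3, 1)]
print(predecessors((0, 2, 1)))            # [(0, 1, 1), (0, 2, 0)]
```

The bottom node (0, 0, 0) is the original data and the top node is full generalization; the lines in the GUI correspond to these successor and predecessor links.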
Analyzing Utility
Click “Analyze utility” to see what was lost.

In my run, age became completely suppressed (*) in the output. To satisfy k=5 and ℓ=2, ARX had to generalize aggressively because the dataset has too many unique combinations.
The Distribution tab shows this clearly: the original age histogram (17-90 range) became a single bar at 100% — all records now have the same age value: “*”.

This is the utility cost of achieving strong privacy with this dataset.
Analyzing Risk
Click “Analyze risk” to see re-identification metrics.

| Metric | Before | After |
|---|---|---|
| Average prosecutor risk | 60% | 0.09% |
| Highest prosecutor risk | 100% | 6.25% |
| Sample uniques | 46.5% | 0% |
| Population uniques | 2.97% | 0% |
Attacker models:
| Model | Who they are |
|---|---|
| Prosecutor | Knows a specific person is in the dataset, tries to find them |
| Journalist | Targets a specific person but does not know whether that person is in the dataset |
| Marketer | Aims to re-identify as many records as possible rather than one specific person |
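The prosecutor figures in the table follow from group sizes. Assuming the usual definition, where a record's prosecutor risk is 1 divided by the size of its equivalence class, a minimal sketch looks like this (toy data and function name are hypothetical):

```python
from collections import Counter


def prosecutor_metrics(class_keys):
    """Return (highest risk, average risk, share of unique records).

    `class_keys` holds one quasi-identifier combination per record.
    """
    sizes = Counter(class_keys)
    n = len(class_keys)
    highest = 1 / min(sizes.values())  # the smallest group is the riskiest
    average = len(sizes) / n           # mean of 1/size over all records
    uniques = sum(1 for s in sizes.values() if s == 1) / n
    return highest, average, uniques


before = ["x", "y", "y", "z"]      # two records are unique
after = ["a"] * 16 + ["b"] * 48    # smallest group has 16 records

print(prosecutor_metrics(before))  # (1.0, 0.75, 0.5)
print(prosecutor_metrics(after))   # (0.0625, 0.03125, 0.0)
```

Under this definition, a highest risk of 6.25% means the smallest group holds 16 records (1/16 = 0.0625), which is well beyond the 5 records (20% risk) that k=5 alone guarantees.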
Bottom Line
- Before: 46.5% of records were unique → easy to re-identify
- After: 0% unique, max risk 6.25% → strong privacy protection
The cost: most data became “*” (suppressed).
The Privacy-Utility Trade-off
This tutorial demonstrated the fundamental tension in de-identification:
More privacy = less utility
Options to improve utility:
- Relax privacy: Lower k to 3, or remove ℓ-diversity
- Fewer quasi-identifiers: Remove some columns from quasi-identifying
- Allow record suppression: Let ARX suppress some records entirely instead of generalizing everything
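The record-suppression option is worth a sketch, since it is often the cheapest way to regain utility. The idea, under my assumed semantics of a suppression limit as a maximum fraction of records: drop the few records in groups smaller than k instead of generalizing everyone further. Names and data are hypothetical.

```python
from collections import Counter


def suppress_small_classes(class_keys, k, limit):
    """Return indices of kept records, or None if too many would be suppressed.

    `limit` is the maximum fraction of records that may be removed.
    """
    sizes = Counter(class_keys)
    outliers = [i for i, key in enumerate(class_keys) if sizes[key] < k]
    if len(outliers) > limit * len(class_keys):
        return None  # this generalization level needs more suppression than allowed
    return [i for i, key in enumerate(class_keys) if sizes[key] >= k]


# 12 records at the 10-year-band level; one lone record in "80-89".
keys = ["30-39"] * 6 + ["20-29"] * 5 + ["80-89"]

print(suppress_small_classes(keys, k=5, limit=0.0))       # None
print(len(suppress_small_classes(keys, k=5, limit=0.1)))  # 11
```

With no suppression allowed, the single outlier would force coarser age bands for all 12 records; allowing 10% suppression keeps the 10-year bands at the cost of one record.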
Takeaways
The following became clear from this exercise:
- ARX configuration is dataset-specific and use-case-specific. Quasi-identifier selection, hierarchy design, and privacy parameters all depend on your data distribution and research goals.
- Expert configuration is required. Someone with privacy expertise needs to analyze the dataset, identify risks, design appropriate hierarchies, and iterate until the privacy-utility balance is acceptable.