A quick hands-on exercise to understand ARX’s features through its GUI, using the bundled adult dataset (US Census data).

Setup

Download the ARX repository to access the example data:

git clone https://github.com/arx-deidentifier/arx.git

The example dataset is at arx/data/adult.csv — US Census data from the UCI ML Repository.

Creating a Project and Importing Data

In ARX GUI:

  1. File → New Project
  2. File → Import data → CSV file
  3. Select adult.csv


Leave “Perform data cleansing” checked. It handles common data quality issues: trimming whitespace, normalizing empty strings, removing duplicates, and fixing inconsistent formatting.

Setting Attribute Types

Each column needs an attribute type that tells ARX how to handle it:

| Type | Purpose | Color |
|---|---|---|
| Identifying | Direct identifiers (name, SSN) — will be removed | Red |
| Quasi-identifying | Can be combined to re-identify (age, zip, gender) | Yellow |
| Sensitive | What you’re protecting but want to analyze (diagnosis, salary) | Purple |
| Insensitive | Safe to keep as-is | Green |

For the adult dataset, I set:

  • Quasi-identifying: age, sex, race, marital-status, education, native-country, workclass, occupation
  • Sensitive: salary-class

Click a column header, then change the “Type” dropdown in the right panel.

Loading Hierarchies

Hierarchies tell ARX how to generalize values. For each quasi-identifier, load the corresponding hierarchy file from arx/data/:

  1. Click the column header (e.g., “age”)
  2. File → Import hierarchy
  3. Select the hierarchy file (e.g., adult_hierarchy_age.csv)

The age hierarchy shows generalization levels:

Level-0: age 39 (original — most useful, least private)
Level-1: 35-39 (5-year band)
Level-2: 30-39 (10-year band)
Level-3: 20-39 (20-year band)
Level-4: * (suppressed — least useful, most private)
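As a rough sketch, the banding above can be reproduced in a few lines of Python. This is a hypothetical re-implementation for intuition only; ARX reads the actual mapping from adult_hierarchy_age.csv rather than computing it.

```python
def generalize_age(age: int, level: int) -> str:
    """Map a raw age to its generalization at each hierarchy level.

    Mirrors the bands shown above (toy version; ARX loads the real
    mapping from the hierarchy CSV).
    """
    if level == 0:
        return str(age)               # original value
    if level == 1:
        lo = (age // 5) * 5           # 5-year band, e.g. 35-39
        return f"{lo}-{lo + 4}"
    if level == 2:
        lo = (age // 10) * 10         # 10-year band, e.g. 30-39
        return f"{lo}-{lo + 9}"
    if level == 3:
        lo = (age // 20) * 20         # 20-year band, e.g. 20-39
        return f"{lo}-{lo + 19}"
    return "*"                        # level 4: fully suppressed

print([generalize_age(39, lvl) for lvl in range(5)])
# ['39', '35-39', '30-39', '20-39', '*']
```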

How Levels Work

ARX doesn’t generalize record-by-record. It applies one transformation level per attribute to the entire dataset, choosing the level that:

  • Satisfies your privacy requirement (e.g., k=5)
  • Minimizes information loss

When you run anonymization with k-anonymity (k=5):

  1. ARX tries Level-0 (exact ages) — Can it make everyone indistinguishable from 4 others? Probably not.
  2. ARX tries Level-1 (5-year bands) — Better, but maybe still not enough.
  3. ARX tries Level-2 (10-year bands) — Maybe now k=5 is achievable.
  4. ARX picks the minimum generalization that satisfies the privacy model.

The output might be: “age at Level-2, sex at Level-1, race at Level-1” — meaning all ages become 10-year bands, while sex and race use less generalization.
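The level search can be sketched as: apply one level to every record, count the resulting equivalence classes, and stop at the first level where the smallest class has at least k members. This is a toy illustration with a single age attribute, not ARX’s actual search algorithm.

```python
from collections import Counter

def band(age: int, level: int) -> str:
    # Toy generalization: level 0 = exact age, 1 = 10-year band, 2 = suppressed
    if level == 0:
        return str(age)
    if level == 1:
        lo = (age // 10) * 10
        return f"{lo}-{lo + 9}"
    return "*"

def min_level_for_k(ages, k):
    """Return the smallest level at which every equivalence class has >= k records."""
    for level in range(3):
        classes = Counter(band(a, level) for a in ages)
        if min(classes.values()) >= k:
            return level
    return None

ages = [31, 33, 35, 38, 39, 42, 44, 45, 47, 49]
print(min_level_for_k(ages, k=5))  # 1: exact ages are all unique, 10-year bands suffice
```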

Repeat hierarchy loading for all quasi-identifiers: sex, race, marital-status, education, native-country, workclass, occupation.

Understanding Sensitive Attributes

A sensitive attribute is the private information you want to protect from being linked to individuals, but still want to analyze (e.g., salary, diagnosis, disease).

ARX does not transform sensitive values. Instead, it generalizes the quasi-identifiers enough so that attackers cannot link a specific person to their sensitive value.

Example: In a salary study, an attacker who knows “39-year-old white male from the US” shouldn’t be able to find that exact person’s salary in the dataset. ARX achieves this by generalizing age/sex/race until multiple people share the same combination.

How ARX Protects Sensitive Attributes

  • With k-anonymity: At least k people share the same quasi-identifier combination (so the attacker can’t pinpoint you)
  • With ℓ-diversity: Within each group, there are at least ℓ different sensitive values (so even if they find your group, they can’t be certain of your salary)

Goal: “Analyze salary patterns across demographics, without anyone being able to figure out a specific person’s salary.”
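The grouping effect can be shown with a toy example (made-up records; the generalization mirrors the 10-year age band from the hierarchy above). The sensitive column is never touched; only the quasi-identifiers are coarsened until the attacker’s query matches several people with mixed salaries.

```python
from collections import defaultdict

# Toy records: (age, sex, race, salary). Values invented for illustration.
records = [
    (39, "Male", "White", ">50K"),
    (36, "Male", "White", "<=50K"),
    (31, "Male", "White", "<=50K"),
    (38, "Male", "White", ">50K"),
    (33, "Male", "White", "<=50K"),
]

def generalize(rec):
    age, sex, race, salary = rec
    lo = (age // 10) * 10
    return (f"{lo}-{lo + 9}", sex, race), salary   # age -> 10-year band

groups = defaultdict(list)
for rec in records:
    qi, salary = generalize(rec)
    groups[qi].append(salary)

# Background knowledge "39-year-old white male" now matches a group of
# five records with mixed salaries: no single salary can be inferred.
print(groups[("30-39", "Male", "White")])
# ['>50K', '<=50K', '<=50K', '>50K', '<=50K']
```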

Adding Privacy Models

k-Anonymity

In the “Privacy models” panel (bottom-right):

  1. Click the “+” button
  2. Select k-Anonymity
  3. Set k = 5 (double-click the knob to edit)
  4. Click OK

The most commonly recommended k-value for healthcare data in the US is k=5. While k=3 is sometimes acceptable, k=5 is the standard best practice: every record then shares its quasi-identifier combination with at least four others.

ℓ-Diversity

If you have a sensitive attribute, you also need ℓ-diversity:

  1. Click “+” again
  2. Select ℓ-Diversity
  3. Set ℓ = 2 (since salary-class only has 2 values: “>50K” and “<=50K”)
  4. Variant: Distinct-l-diversity (simplest option)
  5. Click OK

Variant options:

| Variant | Description |
|---|---|
| Distinct-ℓ-diversity | Simply counts distinct values. At least ℓ different values must exist. |
| Entropy-ℓ-diversity | Uses an entropy measure. Values must be distributed more evenly. |
| Recursive-(c,ℓ)-diversity | Stricter. Ensures the most common value doesn’t dominate. |
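The first two variants are easy to sketch from their standard definitions: distinct counts unique values, while entropy requires the group’s entropy to reach log ℓ. Note how the entropy variant rejects a skewed group that the distinct variant accepts.

```python
import math
from collections import Counter

def distinct_l(values, l):
    """Distinct l-diversity: at least l different sensitive values in the group."""
    return len(set(values)) >= l

def entropy_l(values, l):
    """Entropy l-diversity: entropy of the value distribution must be >= log(l)."""
    counts = Counter(values)
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(l)

group = [">50K", "<=50K", "<=50K", "<=50K", "<=50K"]
print(distinct_l(group, 2))   # True: two distinct values appear
print(entropy_l(group, 2))    # False: the skewed 1-vs-4 split fails the entropy test
```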

Running Anonymization

Edit → Anonymize

The anonymization options dialog appears:

Search Strategy:

| Option | Description |
|---|---|
| Optimal | Exhaustive search — guarantees the best solution, slow for many quasi-identifiers |
| Best-effort, binary | Faster heuristic using binary search |
| Best-effort, top down | Starts from maximum generalization, works down |
| Best-effort, bottom up | Starts from minimum generalization, works up |
| Best-effort, genetic | Uses a genetic algorithm — good for high-dimensional data |

Rule of thumb: Use “Optimal” if you have ≤15 quasi-identifiers.

Transformation Model:

| Option | Description |
|---|---|
| Global transformation | Same generalization level applied to ALL records (standard) |
| Local transformation | Different generalization for different records — better utility, more complex |

For learning, keep the defaults (Optimal + Global) and click OK.

Exploring Results

After anonymization, the top-right shows something like:

Transformations: 6480 | Selected: [0, 4, 0, 2, 3, 2, 2, 1] | Applied: [0, 4, 0, 2, 3, 2, 2, 1]

This means the solution space contained 6,480 candidate transformations, from which ARX selected the optimal one. The numbers are the generalization levels applied to each quasi-identifier.
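The 6,480 figure is the size of the generalization lattice: the product of (maximum level + 1) across the eight quasi-identifiers. The depths below are an assumption inferred from that count, not read from the bundled hierarchy files, but they multiply out exactly.

```python
# Assumed maximum generalization level for each of the eight
# quasi-identifiers (assumption; check the loaded hierarchies in ARX).
max_levels = {
    "age": 4, "sex": 1, "race": 1, "marital-status": 2,
    "education": 3, "native-country": 2, "workclass": 2, "occupation": 2,
}

# Each attribute can sit at any level from 0..max, so the lattice size
# is the product of (max_level + 1) over all quasi-identifiers.
size = 1
for m in max_levels.values():
    size *= m + 1
print(size)  # 6480
```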

Solution Space Lattice

Click the “Explore results” tab to see the lattice visualization:

  • Each node is a transformation (combination of generalization levels)
  • Green nodes = satisfy privacy (k=5, ℓ=2)
  • Blue border = the selected optimal transformation
  • Lines connect transformations that differ by one level

ARX found many valid solutions but selected the one with the best utility score — preserving the most information while satisfying privacy requirements.

Analyzing Utility

Click “Analyze utility” to see what was lost.

In my run, age became completely suppressed (*) in the output. To satisfy k=5 and ℓ=2, ARX had to generalize aggressively because the dataset has too many unique combinations.

The Distribution tab shows this clearly: the original age histogram (17-90 range) became a single bar at 100% — all records now have the same age value: “*”.

This is the utility cost of achieving strong privacy with this dataset.

Analyzing Risk

Click “Analyze risk” to see re-identification metrics.

| Metric | Before | After |
|---|---|---|
| Average prosecutor risk | 60% | 0.09% |
| Highest prosecutor risk | 100% | 6.25% |
| Sample uniques | 46.5% | 0% |
| Population uniques | 2.97% | 0% |
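The first three metrics can be recomputed from equivalence-class sizes alone; this is a sketch of the standard prosecutor-risk definitions, not ARX’s implementation. Population uniques additionally require a statistical model of the underlying population, which is out of scope here.

```python
from collections import Counter

def prosecutor_risk(qi_tuples):
    """Prosecutor-style risk metrics from equivalence-class sizes.

    Per-record risk is 1 / (size of the record's equivalence class);
    the metrics are the average and maximum of that risk, plus the
    fraction of records that are unique in the sample.
    """
    sizes = Counter(qi_tuples)                  # class -> number of records
    n = len(qi_tuples)
    per_record = [1 / sizes[t] for t in qi_tuples]
    return {
        "average_risk": sum(per_record) / n,
        "highest_risk": max(per_record),
        "sample_uniques": sum(1 for t in qi_tuples if sizes[t] == 1) / n,
    }

# Toy data: two records share a class, one is unique (and thus at 100% risk).
data = [("30-39", "M"), ("30-39", "M"), ("40-49", "F")]
print(prosecutor_risk(data))
```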

Attacker models:

| Model | Who they are |
|---|---|
| Prosecutor | Knows a specific person is in the dataset, tries to find them |
| Journalist | Tries to find any person they can re-identify |
| Marketer | Tries to re-identify people for commercial purposes |

Bottom Line

  • Before: 46.5% of records were unique → easy to re-identify
  • After: 0% unique, max risk 6.25% → strong privacy protection

The cost: most data became “*” (suppressed).

The Privacy-Utility Trade-off

This tutorial demonstrated the fundamental tension in de-identification:

More privacy = less utility

Options to improve utility:

  1. Relax privacy: Lower k to 3, or remove ℓ-diversity
  2. Fewer quasi-identifiers: Remove some columns from quasi-identifying
  3. Allow record suppression: Let ARX suppress some records entirely instead of generalizing everything
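Option 3 often helps the most, because a handful of outlier records can force heavy generalization on everyone else. A toy sketch (not ARX’s actual mechanism, which is controlled by a suppression limit in the anonymization dialog):

```python
from collections import Counter

def band(age):
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"       # 10-year age band

def satisfies_k(values, k):
    """True if every equivalence class has at least k records."""
    return min(Counter(values).values()) >= k

ages = [31, 33, 35, 38, 39, 72]   # one outlier at 72

# Without suppression: 10-year bands fail k=5, because "70-79" holds one record.
print(satisfies_k([band(a) for a in ages], k=5))   # False

# Suppressing just the outlier record lets the same bands satisfy k=5.
kept = [a for a in ages if a != 72]
print(satisfies_k([band(a) for a in kept], k=5))   # True
```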

Takeaways

Three things became clear from this exercise:

  1. ARX configuration is dataset-specific and use-case-specific. Quasi-identifier selection, hierarchy design, and privacy parameters all depend on your data distribution and research goals.
  2. Expert configuration is required. Someone with privacy expertise needs to analyze the dataset, identify risks, design appropriate hierarchies, and iterate until the privacy-utility balance is acceptable.