A quick hands-on exercise to understand ARX’s features through its GUI, using the bundled adult dataset (US Census data).

Setup

Download the ARX repository to access the example data:

git clone https://github.com/arx-deidentifier/arx.git

The example dataset is at arx/data/adult.csv — US Census data from the UCI ML Repository.

Creating a Project and Importing Data

In ARX GUI:

  1. File → New Project
  2. File → Import data → CSV file
  3. Select adult.csv


Leave “Perform data cleansing” checked. It handles common data quality issues: trimming whitespace, normalizing empty strings, removing duplicates, and fixing inconsistent formatting.

Setting Attribute Types

Each column needs an attribute type that tells ARX how to handle it:

| Type | Purpose | Color |
|---|---|---|
| Identifying | Direct identifiers (name, SSN) — will be removed | Red |
| Quasi-identifying | Can be combined to re-identify (age, zip, gender) | Yellow |
| Sensitive | What you’re protecting but want to analyze (diagnosis, salary) | Purple |
| Insensitive | Safe to keep as-is | Green |

For the adult dataset, I set:

  • Quasi-identifying: age, sex, race, marital-status, education, native-country, workclass, occupation
  • Sensitive: salary-class

Click a column header, then change the “Type” dropdown in the right panel.

Loading Hierarchies

Hierarchies tell ARX how to generalize values. For each quasi-identifier, load the corresponding hierarchy file from arx/data/:

  1. Click the column header (e.g., “age”)
  2. File → Import hierarchy
  3. Select the hierarchy file (e.g., adult_hierarchy_age.csv)

The age hierarchy shows generalization levels:

Level-0: age 39 (original — most useful, least private)
Level-1: 35-39 (5-year band)
Level-2: 30-39 (10-year band)
Level-3: 20-39 (20-year band)
Level-4: * (suppressed — least useful, most private)
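As a rough sketch, the banding above can be reproduced in a few lines of Python. This is a hypothetical re-implementation for intuition only; ARX reads the actual mapping from adult_hierarchy_age.csv rather than computing it.

```python
def generalize_age(age: int, level: int) -> str:
    """Map a raw age to its generalization at each hierarchy level.

    Mirrors the bands shown above (toy version; ARX loads the real
    mapping from the hierarchy CSV).
    """
    if level == 0:
        return str(age)               # original value
    if level == 1:
        lo = (age // 5) * 5           # 5-year band, e.g. 35-39
        return f"{lo}-{lo + 4}"
    if level == 2:
        lo = (age // 10) * 10         # 10-year band, e.g. 30-39
        return f"{lo}-{lo + 9}"
    if level == 3:
        lo = (age // 20) * 20         # 20-year band, e.g. 20-39
        return f"{lo}-{lo + 19}"
    return "*"                        # level 4: fully suppressed

print([generalize_age(39, lvl) for lvl in range(5)])
# ['39', '35-39', '30-39', '20-39', '*']
```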

How Levels Work

ARX doesn’t generalize record-by-record. It applies one transformation level per attribute to the entire dataset, choosing the level that:

  • Satisfies your privacy requirement (e.g., k=5)
  • Minimizes information loss

When you run anonymization with k-anonymity (k=5):

  1. ARX tries Level-0 (exact ages) — Can it make everyone indistinguishable from 4 others? Probably not.
  2. ARX tries Level-1 (5-year bands) — Better, but maybe still not enough.
  3. ARX tries Level-2 (10-year bands) — Maybe now k=5 is achievable.
  4. ARX picks the minimum generalization that satisfies the privacy model.

The output might be: “age at Level-2, sex at Level-1, race at Level-1” — meaning all ages become 10-year bands, while sex and race use less generalization.
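The level search can be sketched as: apply one level to every record, count the resulting equivalence classes, and stop at the first level where the smallest class has at least k members. This is a toy illustration with a single age attribute, not ARX’s actual search algorithm.

```python
from collections import Counter

def band(age: int, level: int) -> str:
    # Toy generalization: level 0 = exact age, 1 = 10-year band, 2 = suppressed
    if level == 0:
        return str(age)
    if level == 1:
        lo = (age // 10) * 10
        return f"{lo}-{lo + 9}"
    return "*"

def min_level_for_k(ages, k):
    """Return the smallest level at which every equivalence class has >= k records."""
    for level in range(3):
        classes = Counter(band(a, level) for a in ages)
        if min(classes.values()) >= k:
            return level
    return None

ages = [31, 33, 35, 38, 39, 42, 44, 45, 47, 49]
print(min_level_for_k(ages, k=5))  # 1: exact ages are all unique, 10-year bands suffice
```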

Repeat hierarchy loading for all quasi-identifiers: sex, race, marital-status, education, native-country, workclass, occupation.

Understanding Sensitive Attributes

A sensitive attribute is the private information you want to protect from being linked to individuals, but still want to analyze (e.g., salary, diagnosis, disease).

ARX does not transform sensitive values. Instead, it generalizes the quasi-identifiers enough so that attackers cannot link a specific person to their sensitive value.

Example: In a salary study, an attacker who knows “39-year-old white male from the US” shouldn’t be able to find that exact person’s salary in the dataset. ARX achieves this by generalizing age/sex/race until multiple people share the same combination.

How ARX Protects Sensitive Attributes

  • With k-anonymity: At least k people share the same quasi-identifier combination (so the attacker can’t pinpoint you)
  • With ℓ-diversity: Within each group, there are at least ℓ different sensitive values (so even if they find your group, they can’t be certain of your salary)

Goal: “Analyze salary patterns across demographics, without anyone being able to figure out a specific person’s salary.”
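The grouping effect can be shown with a toy example (made-up records; the generalization mirrors the 10-year age band from the hierarchy above). The sensitive column is never touched; only the quasi-identifiers are coarsened until the attacker’s query matches several people with mixed salaries.

```python
from collections import defaultdict

# Toy records: (age, sex, race, salary). Values invented for illustration.
records = [
    (39, "Male", "White", ">50K"),
    (36, "Male", "White", "<=50K"),
    (31, "Male", "White", "<=50K"),
    (38, "Male", "White", ">50K"),
    (33, "Male", "White", "<=50K"),
]

def generalize(rec):
    age, sex, race, salary = rec
    lo = (age // 10) * 10
    return (f"{lo}-{lo + 9}", sex, race), salary   # age -> 10-year band

groups = defaultdict(list)
for rec in records:
    qi, salary = generalize(rec)
    groups[qi].append(salary)

# Background knowledge "39-year-old white male" now matches a group of
# five records with mixed salaries: no single salary can be inferred.
print(groups[("30-39", "Male", "White")])
# ['>50K', '<=50K', '<=50K', '>50K', '<=50K']
```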

Adding Privacy Models

k-Anonymity

In the “Privacy models” panel (bottom-right):

  1. Click the “+” button
  2. Select k-Anonymity
  3. Set k = 5 (double-click the knob to edit)
  4. Click OK

The most commonly recommended k-value for healthcare data in the US is k=5. While k=3 is sometimes acceptable, k=5 is the standard best practice: every record then shares its quasi-identifier combination with at least four others.

ℓ-Diversity

If you have a sensitive attribute, you also need ℓ-diversity:

  1. Click “+” again
  2. Select ℓ-Diversity
  3. Set ℓ = 2 (since salary-class only has 2 values: “>50K” and “<=50K”)
  4. Variant: Distinct-l-diversity (simplest option)
  5. Click OK

Variant options:

| Variant | Description |
|---|---|
| Distinct-ℓ-diversity | Simply counts distinct values. At least ℓ different values must exist. |
| Entropy-ℓ-diversity | Uses an entropy measure. Values must be distributed more evenly. |
| Recursive-(c,ℓ)-diversity | Stricter. Ensures the most common value doesn’t dominate. |
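The first two variants are easy to sketch from their standard definitions: distinct counts unique values, while entropy requires the group’s entropy to reach log ℓ. Note how the entropy variant rejects a skewed group that the distinct variant accepts.

```python
import math
from collections import Counter

def distinct_l(values, l):
    """Distinct l-diversity: at least l different sensitive values in the group."""
    return len(set(values)) >= l

def entropy_l(values, l):
    """Entropy l-diversity: entropy of the value distribution must be >= log(l)."""
    counts = Counter(values)
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(l)

group = [">50K", "<=50K", "<=50K", "<=50K", "<=50K"]
print(distinct_l(group, 2))   # True: two distinct values appear
print(entropy_l(group, 2))    # False: the skewed 1-vs-4 split fails the entropy test
```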

Running Anonymization

Edit → Anonymize

The anonymization options dialog appears:

Search Strategy:

| Option | Description |
|---|---|
| Optimal | Exhaustive search — guarantees the best solution, slow for many quasi-identifiers |
| Best-effort, binary | Faster heuristic using binary search |
| Best-effort, top down | Starts from maximum generalization, works down |
| Best-effort, bottom up | Starts from minimum generalization, works up |
| Best-effort, genetic | Uses a genetic algorithm — good for high-dimensional data |

Rule of thumb: Use “Optimal” if you have ≤15 quasi-identifiers.

Transformation Model:

| Option | Description |
|---|---|
| Global transformation | Same generalization level applied to ALL records (standard) |
| Local transformation | Different generalization for different records — better utility, more complex |

For learning, keep the defaults (Optimal + Global) and click OK.

Exploring Results

After anonymization, the top-right shows something like:

Transformations: 6480 | Selected: [0, 4, 0, 2, 3, 2, 2, 1] | Applied: [0, 4, 0, 2, 3, 2, 2, 1]

This means the solution space contained 6,480 candidate transformations, from which ARX selected the optimal one. The numbers are the generalization levels applied to each quasi-identifier.
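The 6,480 figure is the size of the generalization lattice: the product of (maximum level + 1) across the eight quasi-identifiers. The depths below are an assumption inferred from that count, not read from the bundled hierarchy files, but they multiply out exactly.

```python
# Assumed maximum generalization level for each of the eight
# quasi-identifiers (assumption; check the loaded hierarchies in ARX).
max_levels = {
    "age": 4, "sex": 1, "race": 1, "marital-status": 2,
    "education": 3, "native-country": 2, "workclass": 2, "occupation": 2,
}

# Each attribute can sit at any level from 0..max, so the lattice size
# is the product of (max_level + 1) over all quasi-identifiers.
size = 1
for m in max_levels.values():
    size *= m + 1
print(size)  # 6480
```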

Solution Space Lattice

Click the “Explore results” tab to see the lattice visualization:

  • Each node is a transformation (combination of generalization levels)
  • Green nodes = satisfy privacy (k=5, ℓ=2)
  • Blue border = the selected optimal transformation
  • Lines connect transformations that differ by one level

ARX found many valid solutions but selected the one with the best utility score — preserving the most information while satisfying privacy requirements.

Analyzing Utility

Click “Analyze utility” to see what was lost.

In my run, age became completely suppressed (*) in the output. To satisfy k=5 and ℓ=2, ARX had to generalize aggressively because the dataset has too many unique combinations.

The Distribution tab shows this clearly: the original age histogram (17-90 range) became a single bar at 100% — all records now have the same age value: “*”.

This is the utility cost of achieving strong privacy with this dataset.

Analyzing Risk

Click “Analyze risk” to see re-identification metrics.

| Metric | Before | After |
|---|---|---|
| Average prosecutor risk | 60% | 0.09% |
| Highest prosecutor risk | 100% | 6.25% |
| Sample uniques | 46.5% | 0% |
| Population uniques | 2.97% | 0% |
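The first three metrics can be recomputed from equivalence-class sizes alone; this is a sketch of the standard prosecutor-risk definitions, not ARX’s implementation. Population uniques additionally require a statistical model of the underlying population, which is out of scope here.

```python
from collections import Counter

def prosecutor_risk(qi_tuples):
    """Prosecutor-style risk metrics from equivalence-class sizes.

    Per-record risk is 1 / (size of the record's equivalence class);
    the metrics are the average and maximum of that risk, plus the
    fraction of records that are unique in the sample.
    """
    sizes = Counter(qi_tuples)                  # class -> number of records
    n = len(qi_tuples)
    per_record = [1 / sizes[t] for t in qi_tuples]
    return {
        "average_risk": sum(per_record) / n,
        "highest_risk": max(per_record),
        "sample_uniques": sum(1 for t in qi_tuples if sizes[t] == 1) / n,
    }

# Toy data: two records share a class, one is unique (and thus at 100% risk).
data = [("30-39", "M"), ("30-39", "M"), ("40-49", "F")]
print(prosecutor_risk(data))
```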

Attacker models:

| Model | Who they are |
|---|---|
| Prosecutor | Knows a specific person is in the dataset, tries to find them |
| Journalist | Tries to find any person they can re-identify |
| Marketer | Tries to re-identify people for commercial purposes |

Bottom Line

  • Before: 46.5% of records were unique → easy to re-identify
  • After: 0% unique, max risk 6.25% → strong privacy protection

The cost: most data became “*” (suppressed).

The Privacy-Utility Trade-off

This tutorial demonstrated the fundamental tension in de-identification:

More privacy = less utility

Options to improve utility:

  1. Relax privacy: Lower k to 3, or remove ℓ-diversity
  2. Fewer quasi-identifiers: Remove some columns from quasi-identifying
  3. Allow record suppression: Let ARX suppress some records entirely instead of generalizing everything
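Option 3 often helps the most, because a handful of outlier records can force heavy generalization on everyone else. A toy sketch (not ARX’s actual mechanism, which is controlled by a suppression limit in the anonymization dialog):

```python
from collections import Counter

def band(age):
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"       # 10-year age band

def satisfies_k(values, k):
    """True if every equivalence class has at least k records."""
    return min(Counter(values).values()) >= k

ages = [31, 33, 35, 38, 39, 72]   # one outlier at 72

# Without suppression: 10-year bands fail k=5, because "70-79" holds one record.
print(satisfies_k([band(a) for a in ages], k=5))   # False

# Suppressing just the outlier record lets the same bands satisfy k=5.
kept = [a for a in ages if a != 72]
print(satisfies_k([band(a) for a in kept], k=5))   # True
```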

Takeaways

Three things became clear from this exercise:

  1. ARX configuration is dataset-specific and use-case-specific. Quasi-identifier selection, hierarchy design, and privacy parameters all depend on your data distribution and research goals.
  2. Expert configuration is required. Someone with privacy expertise needs to analyze the dataset, identify risks, design appropriate hierarchies, and iterate until the privacy-utility balance is acceptable.