Data anonymization is a popular topic today for both enterprise and open public data uses. Less commonly used techniques are data shuffling and swapping, techniques which work well when retention of data distribution is important. For example, retaining the age distributions of employees.
Shuffling data reorders one or more columns so that the statistical distribution remains the same but the shuffled values no longer can be used to re-identify entities in rows. Since shuffling doesn’t remove or alter uniquely identifying individual values, it may need to be combined with other techniques to properly anonymize a data set.
Random Shuffle
Basic data shuffling will randomly permute a list of elements. Consider as an example, an open public data set for the City of Atlanta employee salary data in 2015. This data is in the public record, and contains the employees' full names, ages and salaries.
We'll start de-identification by first replacing each employee’s name with a sequential ID.
Original public data with name replaced:
Employee | Age | Annual Salary |
0001 | 38 | 46,000 |
0002 | 52 | 26,700 |
0003 | 44 | 46,575 |
0004 | 42 | 42,867 |
0005 | 32 | 28,035 |
0006 | 44 | 67,800 |
0007 | 33 | 46,378 |
0008 | 28 | 39,328 |
0009 | 58 | 125,000 |
0010 | 45 | 60,466 |
To further de-identify the data set, while preserving data utility, we might shuffle the age column. Quasi-identifiers are attributes which do not uniquely identify an individual, but are sufficiently correlated that they can sometimes or when combined with other data can be used to re-identify someone.
Shuffling the age column removes the correlation between age and salary. The statistical distributions of age and salary are unaffected with this approach.
With age shuffled:
Employee | Age | Annual Salary |
0001 | 44 | 46,000 |
0002 | 28 | 26,700 |
0003 | 44 | 46,575 |
0004 | 38 | 42,867 |
0005 | 52 | 28,035 |
0006 | 45 | 67,800 |
0007 | 42 | 46,378 |
0008 | 32 | 39,328 |
0009 | 58 | 125,000 |
0010 | 33 | 60,466 |
Shuffling Data
All permutations are equally likely when shuffling, including the original list. In this instance, run using python's shuffle, employee 0003's age after shuffling retained the same value.
Data shuffling has been endorsed by the EU Data Protection Board in 2014 which wrote:
Many ETL tools provide shuffling as a de-identification technique, including Talend, Informatica, Ab Initio and even Oracle.
History of the Fisher-Yates Shuffle
The first shuffle algorithm was described by Ronald Fisher and Frank Yates in their 1938 book Statistical tables for biological, agricultural and medical research. A software version of the algorithm, optimized by Richard Durstenfeld to run in linear O(n) time, was popularized in Donald Knuth’s The Art of Computer Programming Volume II.
Below is a python example of this algorithm:
import random
list = [38, 52, 44, 42, 32, 44, 33, 28, 58, 45]
print ("The original list is : " + str(list))
# Fisher–Yates Shuffle
for i in range(len(list)-1, 0, -1):
# Pick a random index from 0 to i
j = random.randint(0, i + 1)
# Swap arr[i] with the element at random index
list[i], list[j] = list[j], list[i]
print ("The shuffled list is : " + str(list))
Python and Java, however, provide built-in shuffle methods. Below is python code to shuffle the employee ages in the example above:
list = [38, 52, 44, 42, 32, 44, 33, 28, 58, 45]
random.shuffle(list)
[44, 28, 44, 38, 52, 45, 42, 32, 58, 33]
Well Known Uses
The U.S. Census Bureau began using a variant of data swapping for the 1990 decennial census. The method was tested with extensive simulations, and the results were considered to be a success and essentially the same methodology was used for actual data releases. In the U.S. Census Bureau’s version, records were swapped between census blocks for individuals or households that have been matched on a predetermined set of k variables. A similar approach was used for the U.S. 2000 census. The Office for National Statistics in the UK applied data swapping as part of its disclosure control procedures for the U.K. 2001 Census.
In 2002, researchers at the U.S. National Institute of Statistical Science (NISS) developed WebSwap, a public web-based tool to perform data swapping in databases of categorical variables.
Other Shuffle and Swapping Techniques
Group shuffle is used when two or more rows need to be shuffled together. Some data sets may have columns with highly correlated data, so shuffling a single column in isolation would diminish the analytical value of the data.
Consider our earlier employee example with the addition of a years-of-service column. Since a persons age is related to their potential employment longevity, it would make sense to shuffle these together.
Direct swapping is a non-random approach which handpicks records by finding pairs to swap. For a record set with age and salary, you can swap the salary of individuals of the same age. For example the following rows :
Dept. Id | Age | Annual Salary |
A03031 | 44 | 95,000 |
A68002 | 44 | 60,000 |
swapping salary where age = 44:
Dept. Id | Age | Annual Salary |
A03031 | 44 | 60,000 |
A68002 | 44 | 95,000 |
Direct swapping works with categorical data (e.g., eye color green, blue, brown or black) or a discrete number of values, in our example age represents a range 0..110.
Rank swap is similar, swapping pairs which are not exact matches, but close in value. It works well on continuous variables. For example blood pressure within 5 hg could be chosen as pairs to swap:
original:
Systolic Blood Pressure (Hg) | Height | Weight |
121 | 5' 8" | 155 |
122 | 5' 11" | 180 |
swapped:
Systolic Blood Pressure (Hg) | Height | Weight |
121 | 5' 11" | 180 |
122 | 5' 8" | 155 |
In practice, both direct and rank swapping typically swap a subset of the records, not the entire set.
Summary
Shuffle can be applied as one of several valuable techniques for data de-identification. As the EU Data Protection Board recommends, it may not be appropriate for stand-alone use, and always requires careful analysis with a given data set that obvious identifiers and quasi-identifiers are protected. Consider applying a k-anonymization test on the output data.
No comments:
Post a Comment