anonympy - Data Anonymization with Python

9 Feb 2022 | Public Domain | 3 min read
An overview of the newly written package anonympy and a walk-through of some of its methods and functionality
Data anonymization plays a huge role in today's data-driven society, where much of the data collected is sensitive. This article uses the `anonympy` package to address that problem. The right method depends on the type of data being anonymized. Data anonymization has many pitfalls, so there is no single-step approach to it. Every method should be applied carefully, and before applying anything we should thoroughly understand the data and keep the end goal in mind.

Introduction

Our world is flooded with digital data: by one often-quoted estimate, 2.5 quintillion bytes are produced every day. Much of that data is personal and sensitive, something the person it relates to would not want disclosed. Examples include names, identification card numbers, and ethnicity. However, data also contains valuable business insights. So how do we balance privacy against the need to gather and share valuable information? That's where data anonymization comes in.

Background

With the rising need for data anonymization and the extensibility of Python's package ecosystem, I thought it would be nice to create a library that provides numerous data anonymization techniques and is easy to use. Please meet my very first package, anonympy, created in the hope of contributing to the open-source community and helping other users deal with sensitive data. For now, the package provides functions to anonymize tabular (pd.DataFrame) and image data.

Using the Code

As a usage example, let's anonymize the following dataset: sample.csv.
Let's start by installing the package, which takes two steps:

Python
pip install anonympy
pip install cape-privacy==0.3.0 --no-deps

Next, load our sample dataset which we will try to anonymize:

Python
import pandas as pd

# the URL is split across two adjacent string literals, which Python joins into one
url = ('https://raw.githubusercontent.com/ArtLabss/open-data-anonimizer/'
       '0287f675a535101f145cb975baf361a96ff71ed3/examples/files/new.csv')
df = pd.read_csv(url, parse_dates=['birthdate'])
df.head()

Image 1

Looking at the columns, we can see that all of them are personal and sensitive, so we will have to apply a relevant technique to each one. First, we need to initialize our dfAnonymizer object.

Python
from anonympy.pandas import dfAnonymizer 

anonym = dfAnonymizer(df)

It’s important to know a column’s data type before applying any function to it. Let’s check the data types and see which methods are available to us.

Python
# check dtypes 
print(anonym.numeric_columns) 
print(anonym.categorical_columns) 
print(anonym.datetime_columns) 

... ['salary', 'age']
... ['first_name', 'address', 'city', 'phone', 'email', 'web']
... ['birthdate']

# available methods for each data type
from anonympy.pandas.utils import available_methods

print(available_methods())

... `numeric`:        
  * Perturbation - "numeric_noise"         
  * Binning - "numeric_binning"         
  * PCA Masking - "numeric_masking"        
  * Rounding - "numeric_rounding" 
`categorical`:         
  * Synthetic Data - "categorical_fake"         
  * Synthetic Data Auto - "categorical_fake_auto"         
  * Resampling from same Distribution - "categorical_resampling"         
  * Tokenazation - "categorical_tokenization"         
  * Email Masking - "categorical_email_masking" 
`datetime`:         
  * Synthetic Date - "datetime_fake"         
  * Perturbation - "datetime_noise" 
`general`:         
  * Drop Column - "column_suppression" 

In our dataset, we have 6 categorical columns, 2 numerical and 1 of datetime type. Also, from the list that available_methods returned, we can find functions for each data type.
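
To make the idea of grouping columns by data type concrete, here is a minimal plain-Python sketch. It is not how anonympy implements `numeric_columns` and friends; the helper `classify_columns` and the two sample rows are hypothetical and only illustrate why the type of a column decides which methods apply to it.

```python
from datetime import date, datetime
from numbers import Number


def classify_columns(rows):
    """Group column names by the type of their first non-missing value."""
    groups = {"numeric": [], "categorical": [], "datetime": []}
    for column in rows[0]:
        sample = next((r[column] for r in rows if r[column] is not None), None)
        if isinstance(sample, (date, datetime)):
            groups["datetime"].append(column)
        elif isinstance(sample, Number) and not isinstance(sample, bool):
            groups["numeric"].append(column)
        else:
            # strings and anything unrecognized are treated as categorical
            groups["categorical"].append(column)
    return groups


# Hypothetical sample rows, loosely modelled on the columns in this article
rows = [
    {"first_name": "Ann", "age": 33, "salary": 59234.5, "birthdate": date(1988, 5, 1)},
    {"first_name": "Bob", "age": 51, "salary": 41000.0, "birthdate": date(1970, 2, 9)},
]
groups = classify_columns(rows)
print(groups)
```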

Let’s add some random noise to the age column, round the values in the salary column, and partially mask the email column.

Python
anonym.numeric_noise('age')   
anonym.numeric_rounding('salary')  
anonym.categorical_email_masking('email') 

# or with a single call
# anonym.anonymize({'age': 'numeric_noise',
#                   'salary': 'numeric_rounding',
#                   'email': 'categorical_email_masking'})
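
For readers who want to see what these three techniques do to individual values, here is a rough plain-Python sketch. The function names, the noise range, the rounding base, and the masking format are all assumptions made for illustration; anonympy's own defaults and output may differ.

```python
import random


def add_noise(values, low=-5, high=5, seed=None):
    """Perturb each number by a random integer offset in [low, high]."""
    rng = random.Random(seed)
    return [v + rng.randint(low, high) for v in values]


def round_to(values, base=1000):
    """Round each number to the nearest multiple of `base`."""
    return [base * round(v / base) for v in values]


def mask_email(email, keep=1, mask_char="*"):
    """Keep the first `keep` characters of the local part and mask the rest."""
    local, _, domain = email.partition("@")
    return local[:keep] + mask_char * max(len(local) - keep, 0) + "@" + domain


ages = [33, 51, 27]
noisy_ages = add_noise(ages, seed=0)             # each age moves by at most 5
rounded = round_to([59234.5, 41499.0, 87650.0])  # nearest thousand
masked = mask_email("jane.doe@example.com")      # domain stays readable
print(noisy_ages, rounded, masked)
```

Noise and rounding trade precision for privacy, so the ranges chosen should reflect how much accuracy the later analysis actually needs.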

To see the changes, call to_df(); for a short summary, call the info() method.

Python
anonym.info()

Image 2

Now we would like to substitute the names in the first_name column with fake ones. To do that, we first have to check whether Faker has a corresponding method.

Python
from anonympy.pandas.utils import fake_methods  

print(fake_methods('f'))  # args: None / 'all' / any letter

... factories, file_extension, file_name, file_path, firefox, first_name, 
first_name_female, first_name_male, first_name_nonbinary, fixed_width, 
format, free_email, free_email_domain, future_date, future_datetime 

Good, Faker has a method called first_name. Let’s use it to replace the values in the column.

Python
anonym.categorical_fake('first_name') 

# passing a dictionary is also valid -> {column_name: method_name} 
# anonym.categorical_fake({'first_name': 'first_name_female'})
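
The idea behind substituting real values with synthetic ones can be sketched without any third-party library. In the sketch below, `FAKE_NAMES` and `substitute` are hypothetical stand-ins for what a generator such as Faker provides. Mapping equal inputs to the same replacement is a design choice of this sketch, not a documented behaviour of `categorical_fake`.

```python
import random

# Hypothetical pool of replacement names; a library such as Faker would
# generate these instead of drawing from a fixed list.
FAKE_NAMES = ["Alex", "Blake", "Casey", "Devon", "Emery", "Finley"]


def substitute(values, pool, seed=None):
    """Replace each value with a name drawn from `pool`.

    Equal inputs always receive the same replacement, so repeated
    values stay recognizable as repeats after substitution.
    """
    rng = random.Random(seed)
    mapping = {}
    result = []
    for value in values:
        if value not in mapping:
            mapping[value] = rng.choice(pool)
        result.append(mapping[value])
    return result


original = ["Maria", "John", "Maria", "Wei"]
fake = substitute(original, FAKE_NAMES, seed=42)
print(fake)
```

Note that two different originals can still receive the same fake name here, because the pool is sampled with replacement.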

Checking fake_methods for the other column names, it turns out that Faker also has methods for address and city. The web column can be substituted using the url method, and phone using phone_number.

Python
# `address` and `city` are changed automatically,
# because these column names match Faker method names
anonym.categorical_fake_auto()
# here the methods must be specified,
# because the column names differ from the method names
anonym.categorical_fake({'web': 'url', 'phone': 'phone_number'})
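
The difference between automatic matching and an explicit mapping can be illustrated with a small sketch. `TinyProvider`, `auto_match`, and `apply_fakes` are hypothetical names invented for this example; they mimic the idea of looking up a generator method by column name and are not anonympy's or Faker's actual code.

```python
class TinyProvider:
    """Hypothetical stand-in for a fake-data generator with named methods."""

    def address(self):
        return "1 Example Street"

    def city(self):
        return "Exampleville"

    def url(self):
        return "https://example.org"


def auto_match(columns, provider):
    """Return the columns whose name is also a callable method on `provider`."""
    return [c for c in columns if callable(getattr(provider, c, None))]


def apply_fakes(rows, mapping, provider):
    """Replace values per column; `mapping` is {column_name: method_name}."""
    return [
        {col: (getattr(provider, mapping[col])() if col in mapping else val)
         for col, val in row.items()}
        for row in rows
    ]


provider = TinyProvider()
columns = ["address", "city", "phone", "web"]
matched = auto_match(columns, provider)   # column names that match directly
mapping = {c: c for c in matched}
mapping["web"] = "url"                    # explicit mapping where names differ
rows = [{"address": "9 Real Rd", "city": "Realtown",
         "phone": "555-0100", "web": "http://real.example"}]
faked = apply_fakes(rows, mapping, provider)
print(matched, faked)
```

In this sketch the phone column is left untouched because the hypothetical provider has no matching method and no explicit mapping was given for it.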

The last column left to anonymize is birthdate. Since the age column contains the same information, we could drop birthdate using the column_suppression method. However, for the sake of demonstration, let’s add some noise to it instead.

Python
anonym.datetime_noise('birthdate')
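
Conceptually, date perturbation shifts each date by a small random offset. The following sketch shows one way to do that with the standard library; the function `date_noise` and the 30-day window are assumptions for illustration and do not describe the defaults of `datetime_noise`.

```python
import random
from datetime import date, timedelta


def date_noise(dates, max_days=30, seed=None):
    """Shift each date by a random whole number of days in [-max_days, max_days]."""
    rng = random.Random(seed)
    return [d + timedelta(days=rng.randint(-max_days, max_days)) for d in dates]


birthdates = [date(1988, 5, 1), date(1970, 2, 9)]
shifted = date_noise(birthdates, max_days=30, seed=1)
print(shifted)
```

Because age and birthdate are perturbed independently, the two columns may no longer agree with each other after anonymization, which is one more reason to consider dropping one of them.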

That’s it. Let’s now compare our datasets before and after anonymization.

Before:

Image: the dataset before anonymization

After:

Image: the dataset after anonymization

And now, your dataset is safe for public release.

Points of Interest

Data privacy and protection are an important part of data handling and deserve proper attention. Everyone wants their personal and sensitive data to be protected and secure. In this article, I showed how to use anonympy for simple anonymization and pseudonymization with Python. The library should not be treated as a magic wand that does everything: you still have to thoroughly understand your data and the techniques being applied, and always keep your end goal in mind.
Here is the GitHub repository for the package - anonympy.

Good luck anonymizing your data!

History

  • 9th February, 2022: Initial version

License

This article, along with any associated source code and files, is licensed under A Public Domain dedication


Written By
Student ArtLabs
Russian Federation
Hey there! My name is Shakhansho, but I prefer to call myself Shukur. I am 21 years old and a junior at the University of Central Asia (UCA), majoring in computer science. Through my experience working in consumer services, I have developed strong communication and interpersonal skills. In addition, this summer I was lucky to start my internship at ArtLabs.
So now, apart from my studies, I am also working as a part-time Machine Learning Developer. Although my hard skills are not as strong as my soft skills, I compensate for that with hard work, dedication, and passion for the field of computer science.
... and excessive Googling