Generate custom datasets using Python Faker

former_member559925 · ‎05-26-2021

As Solution Advisors, we often need to create custom datasets to support customer opportunities.

We can create more engaging customer experiences when our datasets are realistic and closely resemble the customer's own data.

Ideally, we could easily create a dataset of any size and specify constraints on the data, such as matching the data formats the customer uses or specifying the statistical distribution of the random values. It would also be useful to generate realistic-looking PII in case we need to demonstrate data masking.

We can easily create such datasets in Python, and this blog will serve as a guide on how to use the Faker, numpy, and pandas libraries in Python to generate any datasets you need.

Once we create the datasets, we have a lot of flexibility with how we use them. For this demo, we'll upload the newly created datasets to SAP HANA Cloud as tables.

Faker is a Python fake data generator

Faker is a Python library that generates fake data for you. It is useful to create realistic looking datasets and can generate all types of data. We'll explore those most relevant for customer demos but the documentation details all the "providers" of fake data available in the library.

We will also use the Python numpy library, since it allows us to create numeric fields (e.g. sales) based on a distribution or to randomly select values from a list.

To begin, let's make sure we have the necessary libraries installed. In addition to Faker and numpy, we'll also need the handy pandas library. The hana_ml library will be used to upload the dataset we create to SAP HANA Cloud.

!pip install numpy

!pip install faker

!pip install pandas

!pip install hana_ml

Next, let's instantiate the Faker library. For this demo, we'll create an instance of Faker called fake and use that instance to generate all our fake data.

import pandas as pd

from faker import Faker 

import numpy as np



fake = Faker()

Once we have our instance, we can use that instance to call any number of fake data "providers" Faker includes. For example, we can easily generate 5 fake first names:

# First name

for _ in range(5):

    print(fake.first_name())

Faker will generate random data every time it is called

Faker has providers for many types of data about a fake "customer", and we generate each one by calling the appropriate provider method. For example:

# There are specific versions of these generators



# It can generate names

print('Male first names: ' + fake.first_name_male())

print('Female first names: ' + fake.first_name_female())

print('Last names: ' + fake.last_name())

print('Full names: ' + fake.name())



# Generate prefixes and suffixes (there are also gender specific versions e.g. prefix_female())

print('Prefix: ' + fake.prefix())

print('Suffix: ' + fake.suffix())



# Generate emails

print('Company emails: ' + fake.ascii_company_email())

print('Safe emails: ' + fake.ascii_safe_email())

print('Free emails: ' + fake.ascii_free_email())

print('ASCII Emails: ' + fake.ascii_email())

print('Emails: ' + fake.email())

Faker can easily generate realistic looking PII. For more options, see https://faker.readthedocs.io/en/master/providers.html

If you prefer to create a company focused dataset, you can do that too.

# Company names

print('Company name: ' + fake.company())

print('Company suffix: ' + fake.company_suffix())



# Generate Address components

print('Street address: ' + fake.street_address())

print('Bldg #: ' + fake.building_number())

print('City: ' + fake.city())

print('Country: ' + fake.country())

print('Postcode: ' + fake.postcode())



# Or generate full addresses

print('Full address: ' + fake.address())



# Even generate motto, etc.

print('Catch phrase: ' + fake.catch_phrase())

print('Motto: ' + fake.bs())

Generate columns that match specific formats

If you need fake data that follows a specific format, such as a product code or iPhone model, you can do that too:

# Use bothify to generate random numbers(#) or letters(?). Can limit the letters used with letters= 

print(fake.bothify('PROD-??-##', letters='ABCDE'))

print(fake.bothify('iPhone-#'))

Generating categorical columns based on probabilities/weights

For boolean columns, you can even specify the percentage chance that the random value will be "True".

# Create fake True/False values

# Random True/False

print(fake.boolean())



# Specify % True

print(fake.boolean(chance_of_getting_true=25))

For categorical columns, you can specify a list of values to randomly choose from. Optionally, you can also specify the weights to give to each value if you don't want each element in the list to have an equal chance of being selected.

import numpy as np



industry = ['Automotive','Health Care','Manufacturing','High Tech','Retail']

# Specify probabilities of each category (must sum to 1.0)

weights = [0.6, 0.2, 0.1, 0.07, 0.03]



# p= specifies the probabilities of each category. Must sum to 1.0

print(np.random.choice(industry, p=weights))



# Generating choice without weights (equal probability on all elements)

print(np.random.choice(industry))
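To convince yourself that the weights behave as expected, you can draw many values at once with the size= argument and compare the observed share of each category with the requested weight. This is a small sketch; the sample size of 10,000 and the seed are arbitrary.

```python
import numpy as np

industry = ['Automotive', 'Health Care', 'Manufacturing', 'High Tech', 'Retail']
weights = [0.6, 0.2, 0.1, 0.07, 0.03]

np.random.seed(0)  # arbitrary seed so the run is repeatable
sample = np.random.choice(industry, size=10_000, p=weights)

# Compare the observed share of each category with the requested weight
values, counts = np.unique(sample, return_counts=True)
observed = dict(zip(values.tolist(), (counts / len(sample)).tolist()))

for name, weight in zip(industry, weights):
    print(f'{name}: requested {weight:.2f}, observed {observed.get(name, 0.0):.3f}')
```

The observed shares will not match the weights exactly, but they should get closer as the sample grows.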

Generate numeric columns centered around a distribution

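For columns that represent information such as sales, you can create numeric columns by specifying the mean and standard deviation of a normal distribution. Alternatively, you can generate random integers by specifying an upper bound; note that np.random.randint(5) excludes the bound itself, so it returns values from 0 to 4.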

# 1st argument is mean of distribution, 2nd is standard deviation

print(np.random.normal(1000, 100))

# Rounded result

print(round(np.random.normal(1000, 100)))



# Generate random integer between 0 and 4

print(np.random.randint(5))
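When you need an entire column, it is usually simpler to ask numpy for all values in one call with size=. The sketch below also floors the values at zero, since a normal draw can in principle be negative and negative sales may not make sense in a demo; the clipping step and the rounding to two decimals are my own additions.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable

# 1,000 sales values with mean 1000 and standard deviation 100
sales = np.random.normal(1000, 100, size=1_000)

# A normal draw can be negative; floor at zero, then round to cents
sales = np.clip(sales, 0, None).round(2)

print(sales[:5])
print(sales.mean(), sales.std())
```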

Generate Dates between a range

Dates or datetimes can also be created in multiple ways. You can specify a date within this century, decade, year, or month, or between two dates.

print(fake.date_this_century().strftime('%m-%d-%Y'))

print(fake.date_this_decade().strftime('%m-%d-%Y'))

print(fake.date_this_year().strftime('%m-%d-%Y'))

print(fake.date_this_month().strftime('%m-%d-%Y'))

print(fake.time())

import pandas as pd



# Start and end dates to generate data

my_start = pd.to_datetime('01-01-2021')

my_end = pd.to_datetime('12-31-2021')



print(f'Random date between {my_start} & {my_end}')

print(fake.date_between_dates(my_start, my_end).strftime('%m-%d-%Y'))

Or even parts of dates or specifying dates relative to today:

print(fake.year())

print(fake.month())

print(fake.day_of_month())

print(fake.day_of_week())

print(fake.month_name())

print(fake.past_date('-1y'))

print(fake.future_date('+1d'))

Let's Put It All Together!

Now that we have some familiarity with how to use Faker and numpy to generate different types of columns, let's put everything together to create a full dataset we can use.

Note: To create categorical columns based on a choice, you can also use Faker's random_element method as an alternative to numpy's random.choice when you don't need to specify weights. Your choice! In the example below, I used both: random_element for industry and random.choice for industry2.

We'll create a function that builds rows of fake data, then call it to create 5 rows and save the result as a pandas DataFrame (df).

from faker import Faker     

import numpy as np

import pandas as pd



industry = ['Automotive','Health Care','Manufacturing','High Tech','Retail']





fake = Faker() 

def create_data(x): 

  

    # dictionary 

    b_user ={} 

    for i in range(0, x): 

        b_user[i] = {} 

        b_user[i]['name'] = fake.name()

        b_user[i]['job'] = fake.job()

        b_user[i]['birthdate'] = fake.date_of_birth(minimum_age=18,maximum_age=65)

        b_user[i]['email'] = fake.company_email()

        b_user[i]['company'] = fake.company()

        b_user[i]['industry'] = fake.random_element(industry)

        b_user[i]['city'] = fake.city()

        b_user[i]['state'] = fake.state()

        b_user[i]['zipcode'] = fake.postcode()

        b_user[i]['netNew'] = fake.boolean(chance_of_getting_true=65)

        b_user[i]['sales_rounded'] = round(np.random.normal(1000,200))

        b_user[i]['sales_decimal'] = np.random.normal(1000,200)

        b_user[i]['priority'] = fake.random_digit()

        b_user[i]['industry2'] = np.random.choice(industry)        

        

    return b_user

    

df = pd.DataFrame(create_data(5)).transpose()

df.head(5)

Multi-Country Support! Localize to other Countries/Languages

Saving the best for last, I think the coolest thing about the Faker library is the ability to generate fake datasets for many different locales. Although Faker is set to use US English by default, we can easily set the locale when we initialize the library. We can even specify multiple locales if we want to generate datasets that are truly multi-lingual.

For example, we can generate names in a number of locales by specifying the locale code:

fake = Faker('en_US')

print(fake.name())



fake = Faker('ja_JP')

print(fake.name())



fake = Faker('ru_RU')

print(fake.name())



fake = Faker('it_IT')

print(fake.name())



fake = Faker('de_DE')

print(fake.name())



fake = Faker('pt_BR')

print(fake.name())

Generate multi-lingual datasets easily

We can generate datasets in any supported language by passing a list of locale codes when instantiating Faker.

# Instantiate Faker with multiple locales

fake = Faker(['en_US','de_DE','pt_BR','ja_JP','zh_CN'])

This modified code will create 1000 profiles across English, German, Portuguese, Japanese, and Chinese.

from faker import Faker     

import numpy as np

import pandas as pd



industry = ['Automotive','Health Care','Manufacturing','High Tech','Retail']



# Instantiate Faker with multiple locales

fake = Faker(['en_US','de_DE','pt_BR','ja_JP','zh_CN'])

def create_data(x): 

  

    # dictionary 

    b_user ={} 

    for i in range(0, x): 

        b_user[i] = {} 

        b_user[i]['name'] = fake.name()

        b_user[i]['job'] = fake.job()

        b_user[i]['birthdate'] = fake.date_of_birth(minimum_age=18,maximum_age=65)

        b_user[i]['email'] = fake.company_email()

        b_user[i]['company'] = fake.company()

        b_user[i]['industry'] = fake.random_element(industry)

        b_user[i]['city'] = fake.city()

        b_user[i]['state'] = fake.state()

        b_user[i]['zipcode'] = fake.postcode()

        b_user[i]['netNew'] = fake.boolean(chance_of_getting_true=65)

        b_user[i]['sales_rounded'] = round(np.random.normal(1000,200))

        b_user[i]['sales_decimal'] = np.random.normal(1000,200)

        b_user[i]['priority'] = fake.random_digit()

        b_user[i]['industry2'] = np.random.choice(industry)        

        

    return b_user

    

df = pd.DataFrame(create_data(1000)).transpose()

df.head(10)

The resulting dataset contains records drawn from all of the locales listed.

Upload our Dataset to SAP HANA Cloud

Since we're already in Python, let's leverage the hana_ml library to connect to SAP HANA Cloud and upload our newly created dataset as a table.

# Create connection to HANA Cloud

import hana_ml.dataframe as dataframe



# Instantiate connection object

conn = dataframe.ConnectionContext(address = '<Your HANA tenant info.hanacloud.ondemand.com>',

                                   port = 443, 

                                   user = '<USERNAME>', 

                                   password = "<PASSWORD>", 

                                   encrypt = 'true',

                                   sslValidateCertificate = 'false')



# Display HANA version to test connection

print('HANA version: ' + conn.hana_version())



# Print APL version to confirm PAL/APL are enabled

import hana_ml.algorithms.apl.apl_base as apl_base

v = apl_base.get_apl_version(conn)

v.head(5)

# Upload pandas DataFrame to HANA Cloud

dataframe.create_dataframe_from_pandas(connection_context = conn, 

                                       pandas_df = df, 

                                       table_name = 'FAKER',

                                       force = True)
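If you share the notebook with colleagues, you may prefer not to type the password directly into the connection call. One option, sketched below, is to read the connection settings from environment variables. The variable names (HANA_ADDRESS, HANA_PORT, HANA_USER, HANA_PASSWORD) and the helper hana_connection_settings are hypothetical; use whatever naming your team prefers.

```python
import os


def hana_connection_settings(env=os.environ):
    """Collect connection settings from environment variables (names are hypothetical)."""
    return {
        'address': env['HANA_ADDRESS'],
        'port': int(env.get('HANA_PORT', '443')),
        'user': env['HANA_USER'],
        'password': env['HANA_PASSWORD'],
        'encrypt': 'true',
        'sslValidateCertificate': 'false',
    }


# Example with placeholder values; in practice these come from the real environment
example_env = {
    'HANA_ADDRESS': 'example.hanacloud.ondemand.com',
    'HANA_USER': 'DEMO_USER',
    'HANA_PASSWORD': 'not-a-real-password',
}
settings = hana_connection_settings(example_env)

# Print everything except the password
print({key: value for key, value in settings.items() if key != 'password'})

# The dictionary can then be unpacked into the connection call used earlier:
# conn = dataframe.ConnectionContext(**settings)
```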

The resulting table, including the multi-language records, is now in SAP HANA Cloud:

I hope this blog was helpful in showing how Python and its powerful libraries let you easily create flexible datasets to support your customer engagements. Thanks for your interest in this topic! Please let me know if you have any comments or questions in the Q&A section below.