Introduction to Synthetic Data Generation with iData

Introduction to Synthetic Data Generation

Prerequisites

We start this topic with a Project created, a Data Source and Data Entities have been defined.
This will require the AcmeData source to be defined.

Video


Introduction

In this short article we will be:

  1. Creating a new Data Generation processing stage
  1. Populating fields with:
  1. Random words
  1. Random names
  1. Random dates
  1. Random email addresses,
  1. Values from a pre-defined list
  1. All combinations of data

Steps

In previous articles we have covered topics such as Project Creation, Data SourcesData Entities, Data stages and writing transforms. We recommend that you view these as we will be using the functionality demonstrated within them in this article.


Today we are going to be looking at the Synthetic Data Generation.

Logging In

From the welcome screen, press the Login button taking us to the Login screen.

We enter our iData credentials and press Login.

From the project screen select our AcmeData project.


Creating a New Stage

The first step is to create a new stage.


Click on Data Processing Stages and then New Stage.


Select a Data Generation Stage

Then press ‘Setup new processing stage’.


Give the stage a name.

Select All Data Entities from the left.

And press save

We can now see our Data Generation stage in the list of stages

System Generated Default Data

Let’s take a look at what the default generation scripts create by running the stage.

Click on the Actions button next to the Data Generation stage, and select Run.


Then press run again and wait until the View Reports button is enabled.

Now select View Reports and click on the Reports tab, then click on Company, and click on the Details tab.


As we can see, the column values are not particularly exciting, but we are providing a default value for every single column. We can update the transforms to make these fields more realistic.

Click on the stage name in the Project ribbon, click on Transforms and select the Company transform.

We now have a new browser tab containing our default transform rules for the Company entity.


We can update this to pick some defaults for Department, and we can tag things email, phone number and Company name and address and iData will generate more realistic values.

  1.     stream.Make($CompanyID).NextIntAsc(1, 1);
  2.     
  3.     stream.Make($AddressID).RndIntRange(1,100);
  4.     stream.Make($Company).TagAsCompanyName();
  5.     stream.Make($CompanyID).RndIntRange(1,100);
  6.     stream.Make($Departments).RndText('Sales','Accounts','Support','Production');
  7.     stream.Make($Description).RndWords(50);
  8.     stream.Make($Email).TagAsEmail();
  9.     stream.Make($JoinedDate).RndDateRange('1/1/2010','31/12/2020').ConvertToText("d","en-UK");
  10.     stream.Make($MainPhone).TagAsPhoneNumber();
  11.     stream.MakeRndAddressCols();
Like this:

Press Save


We can perform similar changes to the other entities.

Switching to the main iData browser tab and select the Client transform.

We can update the contents to make similar changes:

  1.     stream.Make($ClientID).NextIntAsc(1, 1);
  2.     
  3.     stream.Make($AddressID).RndIntRange(1,100);
  4.     stream.Make($Age).RndIntRangeNormalized(18,100,45,10);
  5.     stream.Make($ClientID).RndIntRange(1,100);
  6.     stream.Make($Comments).RndWords(50);
  7.     stream.Make($CompanyID).RndIntRange(1,100);
  8.     stream.Make($Email).TagAsEmail();
  9.     stream.Make($Firstname).TagAsPersonFirstName();
  10.     stream.Make($JoinedDate).RndDateRange('1/1/2010','31/12/2020').ConvertToText("d","en-UK");
  11.     stream.Make($PAN).RndPatterns("AAAAAAAAA");
  12.     stream.Make($Phone).TagAsPhoneNumber();
  13.     stream.Make($Surname).TagAsPersonSurname();
  14.     stream.MakeRndPersonCols();
Like this:


We press Save to commit the changes.


Now we can do that same to the Addresses table.

Switching to the main iData browser tab and select the Addresses transform.

We update the transformation rules similarly:

  1.     stream.Make($AddressID).NextIntAsc(1, 1);
  2.     
  3.     stream.Make($AddressLine).TagAsAddressStreet();
  4.     stream.Make($City).TagAsAddressCity();
  5.     stream.Make($Country).TagAsAddressCountry();
  6.     stream.Make($Postcode).TagAsAddressPostCode();
  7.     stream.Make($Region).TagAsAddressRegion();
  8.     stream.MakeRndAddressCols();
  9.     stream.Make($Type).RndText('Home','Billing');

Like this:



Press Save, and close.

Wcan rerun the stage to see what the results look like.

Click on the stage name in the Projects Ribbon, click Run Stage, press run again and wait until the View Reports button is enabled.

Now select View Reports and click on the Reports tab, then click on Company, and press the Details tab.

When we look at the generated data now, it looks a lot more interesting. We have realistic looking company names and email addresses.

Clicking on the Data Generation Report on the Project Ribbon, and clicking on Client, and then the Details tab.

We see realistic client data with generated email addresses linked to the generated names.


Click on the Data Generation Report in the Project ribbon, click on dbo_Addresses


Similarly for the addresses data.


Generating Data Spread of Certain Values 

IData can create a data spread of different values in record sets. Let’s use this orders table as an example. 

Clicking on the Editor browser tab for dbo_Orders. We’ll make a few small changes to the original script, and the interesting area is this section at the bottom after we define the Test Data Set ‘Shirt’

  1.     stream.Make($ClientID).NextIntAsc(1, 1);
  2.     stream.Make($OrderID).NextIntAsc(1, 1);
  3.     
  4.     stream.Make($Email).TagAsEmail();
  5.     stream.Make($OrderDate).RndDateRange('1/1/2014','31/12/2021').ConvertToText("d","en-UK");
  6.     stream.MakeRndPersonCols();

  7.     stream.TestDataSet("Shirt");
  8.     stream.Make($Colour).ForEachText('Red','Green','Blue');
  9.     stream.Make($Product).SetText('Shirt');
  10.     stream.Make($Size).ForEachText('Small','Medium','Large');
  11.     
  12.     stream.EndTestDataSet();
Like this:


What we are asking for is the output to use each of the combinations of colour, and all of the combinations of size. And this can be used to ensure that we have full coverage of a particular set of values for testing purposes in our test environment. 

We are using a data set definition here, so it is important to comment out the ‘Default’ definition in line 23.

Also note that the OrderID, ClientID, Email and OrderDate are outside of the test data set definition, as these will apply to any test data sets we create subsequently in the transformation.

We can run this to see the results of the Orders we have defined.

Press Save, then select the Main iData browser tab. Click on the Stage name in the Project Ribbon, Press Run Stage, Run, and wait until the View Reports is enabled. Select the Report tab, and select Orders.

We can see that we have a combination of shirt sizes and colours.

By using an additional Test Data Set definition in our Transformation, we can add another produce to the orders.

Click on the Editor browser tab for the dbo_Orders.js transformation.

Adding a section for the Trousers product line.

  1.     stream.TestDataSet("Trousers");
  2.     stream.Make($Colour).ForEachText('Black','Grey','Navy');
  3.     stream.Make($Product).SetText('Trousers');
  4.     stream.Make($Size).ForEachText('Short','Long');
  5.     stream.EndTestDataSet();
Like this:

Press Save

Returning to the Main browser tab we can rerun the stage by select the Main iData browser tab. Click on the Stage name in the Project Ribbon, Press Run Stage, Run, and wat until the View Reports is enabled. Select the Report tab, and select dbo_Orders.

We can see in the report that we have two products, and that they have not mixed their colours and the orders only have colours and sizes that make sense for that particular product.

Defaults Settings

iData by default will create 50 rows of data, which of course is configurable in the main entity job script. We can perform many other generation options such as limit the output of data to only the test combinations listed in the transforms, lookup values to populate from and existing table.

For our dbo_Orders example, if we navigate to Stages / Scripts / Jobs and select dbo_Orders.js we can see the default limit on line 13.
 

These facilities are documented in the user manual.


    • Related Articles

    • Introduction to Data Sources and Data Entities

      Prerequisites User access to iData Single, empty project Access to the local server data ‘localhost\SQLEXPRESS’ database ‘AcmeData’ Video Introduction In this short article will be: Using an existing project Adding a data source for a SQL Server ...
    • Introduction To Profiling

      Prerequisites We start this topic with a Project created, a Data Source and Data Entities have been defined. In this Topic we have AcmeData database defined as the Data Source and all tables imported as Data Entities.  Video Introduction In this ...
    • Introduction to Similar Records

      Prerequisites We start this topic with a Project created, a Data Source and Data Entities have been defined. We will be using the AcmeData project we created in previous articles. Video Introduction In this short article we will be: Creating a ...
    • Introduction to Data Validation

      Prerequisites We start this topic with a Project created, a Data Source and Data Entities have been defined. We have created and run a Profile against all entities in the AcmeData database. Video Introduction In this short article will be: Creating a ...
    • Introduction to Obfuscation

      Prerequisites We start this topic with a Project created, a Data Source and Data Entities have been defined. Video Introduction In this short article we will be: Creating a new Validate and Clean stage Editing transformation rules to: Replace data ...