Cleansing and Transforming data with iData 2

Introduction to Cleansing Data From Profile Information

Prerequisites

We start this topic with a Project created, a Data Source and Data Entities have been defined.

We have created and run a Profile against all entities in the AcmeData database.

We have created a Profile stage and run this for the AcmeData Entities.


Video


Introduction

In this short article we will be:

Looking at transforming data using information provided by the profiling stage.

Steps

In previous videos we have covered creating a project, and connecting a data source, defining data entitiescreating and executed a profiling stage, and we have performed some data validation.

In this article we are going to be looking at cleansing data using basic replacement on conditions, and then we shall update recordusing rules, based on the Profiling stage output.

In today’s article we are going to create a new stage, update the transforms within this stage. If you have not yet viewed the previous topics, we recommend that you do so before proceeding.


Logging In

From the welcome screen, click on Login,

We enter our iData credentials and press Login.


From the Projects page, select the project.


Select the Data Stages for the project.

Creating a New Stage

We’ll create a New Stage, select the Validation and Clean stage, and Click on Create Stage.

Give the stage a name, and selecting Auto Map. We can now save the stage.

Selecting our new stage, we then want to open the transforms.


Changing the Data Conditionally

We shall start off by making a simple modification to the Clients transform age column.

Click on the Client transform to edit it.

Let’s assume that we want to rectify underage client records by setting the age to 18, if it is less than 18. Obviously, we are unlikely to want to do this with production data, but it will demonstrate the principle of how simple it is to create a rule and repair some data.

Take our simple age rule on line 24, uncomment it and modify it to change the data. All we need to do is add Make and setInt, to a value if its less than 18 like this.

  1. stream.If($Age).LessThan(18).Make().SetInt(18);



Save this and select the main iData browser tab.


Run the Stage

To run the stage, 

Navigate to the stage, click on Run Stage and Run.

Once again select the View Report when it is enabled, and select the Client report on the Reports tab when it has completed.

The report summary shows us that we have transformed a number of Ages

Selecting the Details tab we can see the changed records and the changes made.

This also provides traceability of what the old and new values were.

Adding Annotation

We can also add annotations to track any changes to the data. 

Selecting the Client transformation browser tab

If we add a line before the Age transformation rule on line 24. We can a TrackRowUpdate statement and mark them as an age change, any following rules that makes changes to the data will mark those records with this annotation.

  1. stream.TrackRowUpdates(Age Change);
 

Like this:


  1.     let roads = lookups.UniqueSuffixes_AddressLine_Lookup();
  2.     stream.Make($AddressLine).ReplaceLookup(roads, 'Suffix', 'NewSuffix');

Press Save, then we can run the stage again.

Select the main iData browser tab. Navigate to the stage in the project breadcrumb ribbon, and click on Run Stage and Run.

Once again select the View Report when it is enabled, and select the Client report on the Reports tab when it has completed. Click on the Details tab to view the results.

If we scroll over to the right, and hover our mouse over a record, we can see the comment has been added to the rows changed.


Changing Data Using Lookups

We can also use information directly from the profile to help us clean the data. If we go into the profiler and find our addresses. 

Navigating back to Stages on the Projects Ribbon, Selecting our Profile Stage, Selecting Stage Reports, and Clicking on the latest Clean report.

If we open the Addresses report, and navigate to the AddressLine columns UniqueSuffixes row.

Clicking on the link to view the Addressline Suffixes:


We have these entries with the unique suffixes on the end of the road. We can export this to our clean stage and use it in our rules.

To do this we click Export, 

Select a stage and chose our Clean stage. Then we press Export to Stage.

We should then see the message ‘The lookup has been successfully exported.’

This will have added this table to the Lookups within our Clean stage.

Changing Lookup Table Values

To view this, navigate back to our Stages from the Project Ribbon, select our Clean stage. Now we select Scripts, in here we have a Lookups folder. 

We open this and we can see the table has been imported as a CSV.

Click on the UniqueSuffixes_AddressLine_Lookup link to edit it.

This will open up our editor in a new tab, as before.

This lookup table has the existing suffix, followed by a comma, the expected replacement value, followed by a comma, and finally the count from the profile, which we can ignore.

Not that there is an Asterix (*) to denote a wildcard, which we will need to retain for the suffixes.

We shall update these to standardise our street, road and avenue suffixes like this:

  1. Suffix,NewSuffix,Count
  2. * Rd.,* Road,196
  3. * St.,* Street,192
  4. * Road,* Road,130
  5. * Ave,* Avenue,123
  6. * Avenue,* Avenue,113
  7. * Av.,* Avenue,111
  8. * Street,* Street,96
  9. * rd,* Road,35
  10. * st,* Street,31
So it looks like this:

We can now press save and close.

To apply this replacement transform, we need to open the Addresses Transform.

Applying a Lookup Transform

Selecting the iData browser tab, Clicking on Clean stage from the Project Ribbon, selecting Transforms, and clicking on addresses

We can now edit our Addresses Transforms.

To apply the lookup for AddressLine inset a line creating a logical link to the new lookup before line 24, and update line 25 like this:

Press Save.

Now we can rerun our stage and look at the results.

Select the main iData browser tab. Navigate to the stage in the project breadcrumb ribbon and click on Run Stage and Run.

Once again select the View Report when it is enabled and select the Addresses report on the Reports tab when it has completed. Click on the Details tab to view the results.

We can now see the roads where the abbreviated suffixes have been replaced with our defined full values for them.

    • Related Articles

    • Introduction To Profiling

      Prerequisites We start this topic with a Project created, a Data Source and Data Entities have been defined. In this Topic we have AcmeData database defined as the Data Source and all tables imported as Data Entities.  Video Introduction In this ...
    • Introduction to Synthetic Data Generation

      Prerequisites We start this topic with a Project created, a Data Source and Data Entities have been defined. This will require the AcmeData source to be defined. Video Introduction In this short article we will be: Creating a new Data Generation ...
    • Introduction to Data Validation

      Prerequisites We start this topic with a Project created, a Data Source and Data Entities have been defined. We have created and run a Profile against all entities in the AcmeData database. Video Introduction In this short article will be: Creating a ...
    • Introduction to Data Sources and Data Entities

      Prerequisites User access to iData Single, empty project Access to the local server data ‘localhost\SQLEXPRESS’ database ‘AcmeData’ Video Introduction In this short article will be: Using an existing project Adding a data source for a SQL Server ...
    • Introduction to Similar Records

      Prerequisites We start this topic with a Project created, a Data Source and Data Entities have been defined. We will be using the AcmeData project we created in previous articles. Video Introduction In this short article we will be: Creating a ...