Introduction to Data Profiling with iData

Introduction To Profiling

Prerequisites

We start this topic with a Project created and a Data Source and Data Entities defined. In this topic we have the AcmeData database defined as the Data Source, with all of its tables imported as Data Entities.

Video



Introduction

In this short article we will be:

  1. Defining a Data Processing Stage
  2. Creating a Profile Stage to profile our data
  3. Running the Profile stage and viewing the resulting reports

Steps


Logging In

This is the welcome screen.

To log in, we press the Login button, which takes us to the Login screen.

We enter our iData credentials and press Login.


From the Projects page, select the project.


Creating a Stage

A Data Processing stage defines the activity type we wish to perform in iData as a set of work.

Once we are in a project we can navigate to Stages in a number of ways.

Select Data Processing Stages from the left-hand project ribbon:

Or click on the Data Processing Stages tile for the Project.


If we had just created the Data Entities, then a Helper prompt would also be available to click.


As we have no stages defined, we have a single option: press ‘New Stage’.

We now see five options for selecting the type of data processing we wish to perform in this stage.

As we wish to profile our data, click on Profiling.

Whenever we create a new stage, a pop-up appears outlining the stage type we have selected.

Click on Setup processing stage.

We can now define the stage by entering a name in the Name field.


We could filter the tables by typing in the filter field or by dragging and dropping, but here we press Select All, then press ‘Save’.

We now see our new stage in the list.


Running a Stage

To run the stage, we can press the action dropdown button next to the stage:

Then choose Run.


The next box shows us the stage run log. To start the stage run, press the Run button.

This starts the job initiator. Once the initiator has completed, the View Report button is enabled.

Pressing this while the job worker is still processing the entities displays the job progress; if the job has completed, we see the output reports tab instead.

We can see here that the status shows the dbo_client entity is still processing:

We can see the progress of the job here, with the status of each entity defined in the stage shown on the right hand side.

Once the entities have been profiled, we can click on the Reports tab.


We now see the individual entities and the status of the profile run.


We can open the report for one of them by clicking on the entity name on the left.

What we are doing here is profiling all of the data entities: examining the patterns and values in each table and producing a report on the values they contain.

We can see what our data looks like for Clients by clicking on the dbo_Client entity.

The column names run across the top, and the categories of information that those columns fall into run down the left.

We can see, for instance, that the surname column has some values that are blank or whitespace.

Scrolling down to the IsNull profile type, we can see that our firstname column has some entries that are NULL, and that some email addresses do not contain a period. These are the kinds of findings we might be concerned about if we found them in our data.
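The checks behind these profile types are simple to illustrate. As a minimal Python sketch of the idea (not iData's actual implementation), a null and blank-or-whitespace profile over a single column might look like this:

```python
def profile_column(values):
    """Count NULL and blank/whitespace-only entries in a column."""
    stats = {"IsNull": 0, "BlankOrWhiteSpace": 0, "Populated": 0}
    for value in values:
        if value is None:
            stats["IsNull"] += 1
        elif value.strip() == "":
            stats["BlankOrWhiteSpace"] += 1
        else:
            stats["Populated"] += 1
    return stats

# Illustrative surname values, not the real AcmeData contents
surnames = ["Smith", "  ", "", "Jones", None]
print(profile_column(surnames))
# {'IsNull': 1, 'BlankOrWhiteSpace': 2, 'Populated': 2}
```

A profiler runs checks like this for every column, which is why a single stage run can surface problems across the whole database at once.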

As well as looking at the content of a particular table, we can look at values across multiple tables by selecting all tables, picking a particular column, and filtering on that.

By clicking on the Filter tab, clicking on View all tables, unticking View all Columns, selecting ‘Email’ from the Columns, and pressing the Filter button, we can see that there is an email column in several different tables in our database.


We can see how it is represented differently in each of those tables. We can look at things like the number of values, and the minimum and maximum, and see whether there are any significant differences in this column across the tables.
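Conceptually, the cross-table view just computes the same summary per table. A minimal Python sketch, with illustrative table names and values (not taken from AcmeData):

```python
def column_stats(values):
    """Summarise a column: row count, and min/max length of populated values."""
    populated = [v for v in values if v is not None and v.strip()]
    lengths = [len(v) for v in populated]
    return {
        "count": len(values),
        "min_len": min(lengths) if lengths else None,
        "max_len": max(lengths) if lengths else None,
    }

# The same Email column as it might appear in different tables
tables = {
    "dbo_Client":   ["a@x.com", "longer.name@example.org"],
    "dbo_Supplier": ["info@acme", None],
}
for table, emails in tables.items():
    print(table, column_stats(emails))
```

Laying the per-table summaries side by side, as the report does, makes structural differences in the same logical column easy to spot.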


Filtering Results

We can also use the profiler to filter the results.

Navigating back to the Filter tab, we tick ‘View All Columns’, untick ‘View all profiles’, select ‘BlankOrWhiteSpace’ and ‘IsNull’ from the Profilers, then select ‘Greater Than’ from the Results Filter and enter zero in the Filter value field.

We can then click the Filter button to show the fields that contain blanks or only spaces.

We can see the columns we saw earlier: surname flagged as having some whitespace entries, and firstname as having some NULL values.

Scrolling further to the right, we also discover that our department name column has a lot of blank entries.
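The effect of this filter can be sketched in a few lines of Python. The counts below are illustrative, not the actual AcmeData results:

```python
# Hypothetical per-column profile counts from a profile run
profile_results = {
    "surname":    {"BlankOrWhiteSpace": 3, "IsNull": 0},
    "firstname":  {"BlankOrWhiteSpace": 0, "IsNull": 2},
    "department": {"BlankOrWhiteSpace": 5, "IsNull": 0},
    "clientcode": {"BlankOrWhiteSpace": 0, "IsNull": 0},
}

# 'Greater Than' zero on the selected profilers keeps only flagged columns
flagged = {
    column: counts
    for column, counts in profile_results.items()
    if any(count > 0 for count in counts.values())
}
print(sorted(flagged))
# ['department', 'firstname', 'surname']
```

Clean columns drop out of the view entirely, which is what makes this filter useful for triaging a large database.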



Pattern Checking

We can also use the profiler to look at the patterns of values that occur in our address table.

Navigating back to the Filter tab, we untick ‘View all tables’, select the Addresses entity from Tables, tick ‘View all Profiles’ and click ‘Clear’ on the Results Filter.

Press the Filter button to see the results.


Examining the Postcode column in the Addresses table, we can see the unique patterns occurring in the field.

Click the ‘UniquePatterns’ link under the Postcode column.


These are UK postcodes, so the patterns at the top are all ones we would expect; the patterns at the bottom are not, so we can see we have some problems with our postcode data.
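Pattern profiles of this kind are commonly built by mapping each letter to one placeholder and each digit to another, then counting the distinct shapes. A minimal sketch assuming that convention (A for a letter, 9 for a digit); this illustrates the idea rather than iData's implementation:

```python
from collections import Counter

def to_pattern(value):
    """Map letters to 'A' and digits to '9', keeping other characters as-is."""
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c
                   for c in value)

# Illustrative postcode values: valid UK shapes plus two bad entries
postcodes = ["SW1A 1AA", "M1 1AE", "B33 8TH", "unknown", "12345"]
print(Counter(to_pattern(p) for p in postcodes))
```

Outliers such as `AAAAAAA` or `99999` stand out immediately against the expected UK shapes like `AA9A 9AA`.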


Looking at our AddressLine column, we can examine values such as unique word triples in our addresses.

If we click on the Triple link:

Here we can see that the address field has been used to record a note saying the address is not known; this is perhaps something we should follow up, to see why that is the case and why the note is not held in a more appropriate field.
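Word triples like this can be produced by sliding a three-word window across each value. As a minimal sketch of the idea (not iData's implementation):

```python
def word_triples(value):
    """Return every consecutive three-word sequence in a string."""
    words = value.lower().split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

print(word_triples("address is not known"))
# ['address is not', 'is not known']
```

Recurring triples such as "is not known" are a strong hint that free-text notes have leaked into a structured field.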


Selecting the unique suffixes for the AddressLine column, we can see the different endings that occur in the addresses.


We can see that ‘road’, ‘avenue’ and ‘street’ appear both abbreviated and fully spelled out. It is always a good idea to standardise your data as much as possible, and doing so also helps in identifying duplicate records.
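One simple way to standardise such endings is a lookup table of known abbreviations. A minimal sketch; the mapping and addresses below are illustrative, and real address standardisation is considerably more involved:

```python
# Illustrative abbreviation map; a production system would use a fuller list
SUFFIX_MAP = {"rd": "road", "ave": "avenue", "st": "street"}

def standardise_suffix(address):
    """Expand an abbreviated final word using the lookup table above."""
    words = address.lower().split()
    words[-1] = SUFFIX_MAP.get(words[-1].rstrip("."), words[-1])
    return " ".join(words)

print(standardise_suffix("12 High St."))
# 12 high street
```

Once ‘St.’ and ‘Street’ collapse to the same form, two records for the same address become much easier to match as duplicates.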

We will cover cleaning these entries up in another topic.



    • Related Articles

    • Introduction to Similar Records
    • Introduction to Data Validation
    • Introduction to Cleansing Data From Profile Information
    • Introduction and Projects
    • Introduction to Obfuscation