Using idata2 to identify duplicate records

Introduction to Similar Records


Prerequisites

We start this topic with a Project already created and with a Data Source and Data Entities defined.

We will be using the AcmeData project we created in previous articles.


Introduction

In this short article we will be:

    1. Creating a Similar Records stage
    2. Configuring our columns to assess similarity
    3. Executing the Stage
    4. Examining the stage results

Steps

Previous topics have covered Creating Projects, Data Sources, Profiling, Validation, and Transformation. We recommend that you take time to look at these topics.

Today we are going to look at Similar Records.


Logging In

From the welcome screen, press the Login button, which takes us to the Login screen.

We enter our iData credentials and press Login.


From the Projects page, we select our project.



Creating a New Stage

To analyse similar records, we need to create a new stage. So, we click on the Data Processing Stages icon and select ‘New Stage’.

The stage type we want is ‘Similar Records’.

Then we press Setup Processing Stage.


Selecting the Entities

In this exercise we will look at our Client table only.

Type a name in Stage Name, then select Client from ‘Available data entities’ and drag it to ‘Entities to check for similar records’.

When we are looking for similar records, we need to tell iData which columns are relevant to the matching process. To do this we click on ‘Select Columns’.

We are going to select Age, email, firstname, account number (shown as PAN here), phone and surname. We can clear the selection by pressing ‘Remove All’.

We drag these into the left-hand pane.

We will also tell iData to group the records by Phone, which tags that field as the initial sort order.
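To illustrate why a grouping field is useful in record matching generally, here is a minimal Python sketch. It is not iData's implementation: the function name `group_by_key`, the sample records and the digit-only normalisation are all assumptions made for this example. The idea shown is that records sharing a phone number land in the same group, so only records within a group need to be compared with each other.

```python
from collections import defaultdict


def group_by_key(records, key):
    """Bucket records by a normalised key value (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        # Assumption: keep digits only, so "555-0101" and "555 0101" match.
        normalised = "".join(ch for ch in str(record.get(key, "")) if ch.isdigit())
        groups[normalised].append(record)
    return dict(groups)


# Made-up sample data, not taken from the AcmeData project.
clients = [
    {"firstname": "Jon", "surname": "Smith", "phone": "555-0101"},
    {"firstname": "John", "surname": "Smith", "phone": "555 0101"},
    {"firstname": "Mary", "surname": "Jones", "phone": "555-0199"},
]

groups = group_by_key(clients, "phone")
for phone, members in groups.items():
    print(phone, [m["firstname"] for m in members])
```

Grouping first keeps the number of record-to-record comparisons small, because candidates are only compared inside their own group.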

Press Save, then Save the Stage.

We can see the new stage in our project's stage list:

Running the Stage

We can run this stage from here by pressing Actions next to the Similar Records stage we have just created and pressing Run.

Press Run once more, and when the work initiator has enabled the View Report button, select it.

Select Client from the Reports tab.

The Summary tab shows us that we have identified some potentially similar records.

Selecting the Detail tab, we can view what these records are.


In the firstname column we can see different spellings of the same name that are very similar. The entries in blue are an exact match of the primary record we discovered, whereas those in orange differ from it. We have scored the columns and identified that they are a very close match, and we can see the overall score for the first record in this column here.

Each group of records is also numbered, so we can identify them.
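As a rough illustration of per-column scoring, the sketch below compares a candidate record with a primary record one column at a time. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; iData's own scoring method is not described in this article, and the function names and sample values are assumptions for this example only.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Return a similarity between 0.0 and 1.0 for two field values."""
    # Assumption: ignore case and surrounding whitespace before comparing.
    a_norm = str(a).strip().lower()
    b_norm = str(b).strip().lower()
    return SequenceMatcher(None, a_norm, b_norm).ratio()


def compare_to_primary(primary, candidate, columns):
    """Score each selected column of a candidate against the primary record."""
    return {col: field_similarity(primary[col], candidate[col]) for col in columns}


# Made-up records for illustration.
primary = {"firstname": "Jonathan", "surname": "Smith"}
candidate = {"firstname": "Jonathon", "surname": "Smith"}

scores = compare_to_primary(primary, candidate, ["firstname", "surname"])
for column, score in scores.items():
    label = "exact match" if score == 1.0 else "close match"
    print(f"{column}: {score:.3f} ({label})")
```

In this sketch a score of 1.0 corresponds to an exact match with the primary record, and anything lower indicates a difference in that column.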

This process can be fine-tuned by changing the weighting of the columns so that, for example, a match on email address is significant but a match on surname is less important. You can also change the type of matching used between fields.

We will cover this in other support materials and in the help documentation.
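The effect of column weighting can be sketched as a weighted average of per-column scores. This is an illustrative assumption about how weights could combine scores, not a description of iData's calculation; the function `weighted_score`, the weights and the scores are all hypothetical.

```python
def weighted_score(column_scores, weights):
    """Combine per-column scores (0.0 to 1.0) into one weighted average."""
    total_weight = sum(weights[col] for col in column_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(column_scores[col] * weights[col] for col in column_scores)
    return weighted_sum / total_weight


# Hypothetical per-column scores for one candidate record.
column_scores = {"email": 1.0, "surname": 0.5}

# Equal weights: both columns count the same.
equal = weighted_score(column_scores, {"email": 1, "surname": 1})

# Email weighted three times as heavily as surname.
email_heavy = weighted_score(column_scores, {"email": 3, "surname": 1})

print(f"equal weights: {equal:.3f}")
print(f"email-heavy weights: {email_heavy:.3f}")
```

With the heavier email weight, the exact email match pulls the overall score up even though the surname only partly matches.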
