We start this topic with a Project already created, and with a Data Source and Data Entities already defined.
We will be using the AcmeData project we created in previous articles.
Previous topics have covered Creating Projects, Data Sources, Profiling, Validation, and Transformation. We recommend that you take time to look at these topics.
In this short article we will be looking at Similar Records.
From the welcome screen, we press the Login button, which takes us to the Login screen.
We enter our iData credentials and press Login.
From the Projects page, we select our project and select ‘New Stage’.
The stage type we want is ‘Similar Records’.
Then we press Setup Processing Stage.
Type a name in Stage Name, then select Client from ‘Available data entities’ and drag it to ‘Entities to check for similar records’.
When we are looking for similar records, we need to tell iData which columns are relevant to the matching process. To do this we click on ‘Select Columns’.
We are going to select Age, email, firstname, account number (shown as PAN here), phone and surname. We can clear the selection by pressing ‘Remove All’.
We drag these into the left-hand pane.
We will also tell iData to group the records by Phone, which tags that field as the initial sort order.
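To illustrate why an initial sort order helps, here is a minimal Python sketch of one common way a grouping field can be used: sort the records on that field, then only compare records that land next to each other with the same value. This is not iData's implementation; the sample rows and the candidate_groups helper are hypothetical and exist only for illustration.

```python
from itertools import groupby

# Hypothetical sample rows; the column names mirror those selected above.
clients = [
    {"firstname": "Jon", "surname": "Smith", "phone": "555-0101"},
    {"firstname": "Ann", "surname": "Lee", "phone": "555-0199"},
    {"firstname": "John", "surname": "Smith", "phone": "555-0101"},
]


def candidate_groups(rows, key):
    """Sort rows by `key`, then yield (value, rows) for values shared by 2+ rows."""
    ordered = sorted(rows, key=lambda r: r[key])
    for value, group in groupby(ordered, key=lambda r: r[key]):
        members = list(group)
        if len(members) > 1:  # a single row has nothing to be similar to
            yield value, members


groups = list(candidate_groups(clients, "phone"))
for phone, members in groups:
    print(phone, [m["firstname"] for m in members])
```

Grouping first keeps the number of record-to-record comparisons small, because only records inside the same group need to be scored against each other.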
Press Save, then Save the Stage.
We can see the new stage in our project's stage list:
Press Run once more, and when the work initiator has enabled the View Report button, select it.
Select Client from the Reports tab.
The Summary tab shows us that we have identified some potential similar records.
Selecting the Detail tab, we can view what these records are.
In the firstname column we can see different spellings of the same name that are very similar. The entries in blue are an exact match of the primary record we discovered, whereas those in orange differ from it. iData has scored the columns and identified that they are a very close match, and we can see the overall score for the first record in this column here.
Each of the groups of records is also numbered, so we can identify them.
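As a rough picture of what scoring a column means, the sketch below compares a candidate record to a primary record column by column, flags each column as exact or different (the blue and orange entries in the report), and averages the column scores into an overall score. It uses Python's standard difflib.SequenceMatcher as a stand-in similarity measure; the measure iData actually uses is not stated here, and the function names and sample records are hypothetical.

```python
from difflib import SequenceMatcher


def compare_columns(primary, candidate, columns):
    """Return {column: (score, is_exact)} for candidate compared to primary."""
    result = {}
    for col in columns:
        a = str(primary[col]).strip().lower()
        b = str(candidate[col]).strip().lower()
        score = SequenceMatcher(None, a, b).ratio()  # 1.0 means identical
        result[col] = (score, a == b)
    return result


def overall_score(column_scores):
    """Unweighted mean of the per-column scores (an assumption for this sketch)."""
    scores = [score for score, _ in column_scores.values()]
    return sum(scores) / len(scores)


# Hypothetical records: the same person with two spellings of the first name.
primary = {"firstname": "Catherine", "surname": "Jones"}
candidate = {"firstname": "Katherine", "surname": "Jones"}

cols = compare_columns(primary, candidate, ["firstname", "surname"])
for name, (score, is_exact) in cols.items():
    print(name, round(score, 2), "exact" if is_exact else "differs")
print("overall", round(overall_score(cols), 2))
```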
This process can be fine-tuned by changing the weighting of the columns so that, for example, a match on email address is significant but a match on surname is less important. You can also change the type of matching between fields.
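The effect of weighting can be sketched in a few lines of Python. In this hypothetical configuration each column has a weight and its own type of matching (exact or fuzzy), and the overall score is the weighted average. The RULES table, the weights and the function names are assumptions made for illustration and do not reflect iData's actual settings.

```python
from difflib import SequenceMatcher


def exact(a, b):
    """All-or-nothing match."""
    return 1.0 if a == b else 0.0


def fuzzy(a, b):
    """Similarity between 0.0 and 1.0 based on shared character runs."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical configuration: column -> (weight, type of matching).
RULES = {
    "email": (3.0, exact),      # an email match is significant
    "firstname": (1.0, fuzzy),
    "surname": (0.5, fuzzy),    # a surname match matters less
}


def weighted_score(primary, candidate, rules=RULES):
    """Weighted average of the per-column match scores."""
    total = sum(weight for weight, _ in rules.values())
    earned = sum(
        weight * match(primary[col].lower(), candidate[col].lower())
        for col, (weight, match) in rules.items()
    )
    return earned / total


a = {"email": "j.smith@example.com", "firstname": "Jon", "surname": "Smith"}
b = {"email": "j.smith@example.com", "firstname": "Jon", "surname": "Smyth"}
c = {"email": "jon@example.org", "firstname": "Jon", "surname": "Smith"}

print(round(weighted_score(a, b), 2))  # same email, different surname
print(round(weighted_score(a, c), 2))  # different email, same surname
```

With these weights, the pair that shares an email address scores well above the pair that shares only a first name and surname, which is the behaviour the weighting is meant to produce.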
We will cover this in other support materials and in the help documentation.