Problem
Data migration projects are often protracted and complex affairs. Comprehensive quality assurance activities are an essential part of such projects, in order to prove that the output is acceptable to all stake holders, and is thus fit for its intended purpose.
However, it is a difficult task to undertake QA activities on a large and complex dataset, especially if relying largely on manual testing.
Solution
iData provides extensive features which allow thorough assurance to be undertaken on the output of a data migration. This article describes how the product can be used to provide such assurance.
Profiling
A key feature of iData is the ability to generate a profile of a dataset, which provides a snapshot summary of the content, expressed as a series of metrics. Amongst these metrics are:
- Record counts
- Column definitions
- Count of the frequency of NULL values in each column
- Breakdown of each column by letter case - i.e. whether the content is all upper case, all lower case or mixed case
- Minimum and maximum values in each column
- Minimum and maximum string lengths in each column
- Breakdown of unique values / patterns / words in each column
This is just a fraction of the information returned - dozens of metric are derived from each column.
A dataset can be profiled as many times as desired, and all historic profiling reports are preserved. This allows multiple reports to be viewed side-by-side so that differences can be identified. Some of these may be expected differences, whilst others might indicate faults in the migration process. As always, the challenge is to make the distinction between the two!
Once a cut of data is available for testing, a profile of it can be generated and compared to a profile of the original data. In this way, fundamental problems can be quickly identified (e.g. incorrect record counts, different minimum/maximum values in a column) and reported via the proper channels.
Where multiple data sets are being merged together, profiling can be used in a similar fashion, albeit multiple reports would need to be compared in order to identify anomalies between the original data sets and the single merged set being assured.
iData's profiling reports can also provide assurance regarding any cleansing activities that might have been applied by the migration process. If a data issue apparent in the profile of the source data should be resolved during the migration, then the issue should not be present when the migrated data is profiled.
Comparing Data Sets
iData can make a detailed comparison between a pair of data sets, after which a detailed report on its findings is returned. For providing quality assurance on a data migration, the original and migrated data would be compared in this fashion.
This activity encompasses 100% of the content of both sets, so that all the following issues will be identified:
- Original records which cannot be found in the output of the migration
- Migrated records which cannot be found in the original data
- Records which appear in both but which contain differences in at least one column
Mappings can be defined so that iData knows how each source column should be migrated, which means it can flag potential anomalies when the output does not match what is expected.
Migration projects involving data cleansing can also be assured using iData's reports, as they should result in differences between the source and migrated data. The reports can be inspected to ensure that the items in question have changed, and been changed correctly.
Every single data item is included in the comparison here, so if just one column in one record is incorrectly migrated, it will be reported by iData.
As with profiling, the comparison process can be run whenever required, and the generated report from each run is preserved for future reference. This allows the latest report to be gauged against historic reports, so that it can be determined whether the latest data cut is better or worse than its predecessors.
Example Project - Migration and Merge
Consider the common scenario of an organisation being acquired by a larger parent with a different back-office system. The master data from the acquired organisation needs both migrating to a new format, and then merging with the existing records relating to the parent, the end result being a single common set of master data which covers both concerns.
The following diagram gives a high level view of the project, together with the three main phases where iData can be utilised:
The assurance activities using iData for the three phases are summarised below:
Phase 1
Profiling reports for both the acquired master data and the intermediate data set can be generated and compared to detect any issues.
A comparison report between the acquired and intermediate data can also be generated, which will flag all the changes made by the cleansing and transformation processes. This can be inspected to detect any incorrect or unnecessary changes.
Phase 2
A comparison report between the intermediate data and the merged data can be generated, to confirm that all records which exist in both are identical.
N.B. Since the records relating to the parent company will only exist in the merged data, iData will flag all of them as missing from the intermediate data. However, such instances can be filtered out of the final report, in order to avoid them causing confusion.
Phase 3
A comparison report between the parent company data and the merged data sets can be generated, to confirm that all records which exist in both are identical.
As with phase 2, iData will detect many records which only exist in the merged data, albeit this time, these will be the subset belonging to the acquired company. Again, these instances can be filtered out of the report for the sake of clarity.
If deemed necessary, additional assurance could be provided by inspecting the profiling reports for the intermediate, parent company and merged data sets, and reconciling them to identify any anomalies.
Conclusion
iData's features allow comprehensive assurance of complex and large-scale data migrations to be performed quickly, covering 100% of both the source and migrated data. Such activities give confidence that the final output is accurate.