De-duping using GT Datamaker™

Create progressively de-duped data to test integrated systems

De-duplicating and matching customer records can be difficult to test. Although most users rely on standard algorithms to match duplicates, those algorithms still need to be tested fully as part of an integrated system. With this in mind, we added a de-duping component to Datamaker™ that offers multiple types of progressive de-duping methods built right into the tool. Users can also create their own variations based on the data they require.

Datamaker’s de-duping and matching module generates high-quality ‘incorrect’ data containing duplicates for systems testing. Its most distinctive feature is that it allows the user to create progressively de-duped data by choosing from a list of functions. These functions produce variations that start at ‘very similar’ and end at ‘very different’. Because the de-duping methods escalate, the user can build data sets step by step, testing as they go.

Grid-Tools built the de-duping solution on standard algorithms such as Jaro-Winkler similarity and Levenshtein distance, both established measures of how similar or different two character sequences are. The data sets are built so that the distance from the original record grows with each step and the names become progressively more different.
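To illustrate what a growing distance means in practice, here is a minimal Python sketch of the Levenshtein distance (the number of single-character insertions, deletions, and substitutions needed to turn one string into another). It is not Datamaker's implementation; the function names `levenshtein` and `order_by_distance` and the sample names are assumptions made for this example, and Jaro-Winkler is not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (case-sensitive)."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def order_by_distance(original: str, variants: list[str]) -> list[str]:
    """Hypothetical helper: sort variants from most to least similar."""
    return sorted(variants, key=lambda v: levenshtein(original, v))


if __name__ == "__main__":
    for name in order_by_distance("Jonathan", ["Jon", "Nathan", "Jonathon"]):
        print(name, levenshtein("Jonathan", name))
```

Sorting candidate duplicates by distance in this way gives a 'very similar' to 'very different' ordering against which a matching system's thresholds can be checked.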

The component also includes standard Soundex functionality, as seen in MySQL and Oracle databases. Soundex breaks a name down and encodes it as a four-character code, the first letter of the name followed by three digits, so that names that sound alike receive the same code and are easy to identify for matching purposes.
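As a rough illustration of the encoding, the following Python sketch implements the common American Soundex rules. It is an assumption-laden example rather than Datamaker's code, the function name `soundex` is chosen for this sketch, and individual database implementations can differ in details such as code length.

```python
# Map each consonant to its Soundex digit; vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code, or "" if name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            # H and W do not separate consonants that share a digit.
            continue
        code = CODES.get(letter, "")
        if code and code != previous:
            encoded += code
        # A vowel resets 'previous', so a repeated digit can follow it.
        previous = code
    # Pad with zeros or truncate to one letter plus three digits.
    return (encoded + "000")[:4]


if __name__ == "__main__":
    for name in ("Smith", "Smyth", "Robert", "Rupert"):
        print(name, soundex(name))
```

Records whose names produce the same code, such as Smith and Smyth, are candidates for a match even though their spellings differ.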