Duplicate-Check – Documentation

Application

On this page you can find the user manual for the Duplicate-Check.

Duplicate-Check

Settings

The Duplicate-Check has no default setting and can be configured individually for each check. Select all the rules for cleansing and duplicate checking that you want to perform in batch processing.

Purge

All selected cleanups are performed one after the other, and the cleaned values are exported in new fields. If cleanups are performed, the duplicate check runs afterwards.

Duplicates

The duplicate check examines all transferred data records one after the other and groups matching records. Using these groups, all duplicate records can be reviewed and processed afterwards. For each duplicate, the rules that matched during the check are also output.
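Conceptually, the grouping works like clustering records under a shared comparison key. The following is an illustrative sketch only; the actual matching logic is internal to the Duplicate-Check, and the records and field names are made up:

```python
from collections import defaultdict

# Illustrative records; in practice these come from the import file.
records = [
    {"lastname": "Meyer", "town": "Berlin"},
    {"lastname": "Meyer", "town": "Berlin"},
    {"lastname": "Schulz", "town": "Hamburg"},
]

# Group records that match exactly (comparable to rule d100).
groups = defaultdict(list)
for record in records:
    key = tuple(sorted(record.items()))
    groups[key].append(record)

# Groups with more than one member contain duplicates.
duplicates = [members for members in groups.values() if len(members) > 1]
```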

Batch processing

Select a file, or drag and drop it onto the “Import file” field, and check it by pressing the “Check” button. This starts the verification of the entire file.

The process of checking takes some time depending on the size of the file.

The name of the export file is assigned automatically and uses the same format as the file passed in. For example, if a file named “dubletten.json” is passed, the log file will be named “dubletten-log.json”.
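The naming scheme described above can be reproduced in a few lines. This is a sketch: the “-log” suffix follows the example given, and applying it to other formats is an assumption:

```python
from pathlib import Path

def log_filename(import_file: str) -> str:
    """Derive the log file name by inserting '-log' before the extension.

    Assumption: the naming scheme from the documentation example
    ("dubletten.json" -> "dubletten-log.json") applies to all formats.
    """
    p = Path(import_file)
    return p.with_name(f"{p.stem}-log{p.suffix}").name
```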

Pre-validation

Pre-validating an import file lets you check the format and the number of records found without running the actual check. The file is analyzed and the result is shown to you in a message.

Cleanup Rules

The duplicate cleanup rules listed here are part of the Duplicate-Check. We will add supplementary cleanup options for the software over time.

Duplicate Check
Cleanup   Description
c001      Remove multiple spaces
c002      Remove trailing spaces
c003      Remove non printable characters
c004      Remove German letters (umlauts, ß)
c005      Normalize quotes, special chars (Result: ‘, “, -)
c006      Remove duplicate letters, e.g. tt -> t
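To illustrate what the first two rules do, here is a sketch of c001 and c002 using regular expressions. This is our own approximation, not the Duplicate-Check implementation:

```python
import re

def c001_remove_multiple_spaces(value: str) -> str:
    """Collapse runs of spaces into a single space (illustrates rule c001)."""
    return re.sub(r" {2,}", " ", value)

def c002_remove_trailing_spaces(value: str) -> str:
    """Strip spaces at the end of the value (illustrates rule c002)."""
    return value.rstrip(" ")
```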

If you notice a discrepancy between the rules listed here and the Duplicate-Check itself, please let us know. Do you need cleanups that we don’t yet provide? Let us know; we will be very happy to add them.

Duplicate Rules

The duplicate rules listed here are part of the Duplicate-Check. We will add supplementary duplicate checks to the software over time.

Duplicate Check
Rule      Description
d100      Match data 100%
d101      Match on lower case
d102      Ignore country
d103      Ignore first name/last name
d104      Ignore department
d105      Ignore country/first name/last name/department
d106      Ignore house number
d107      Ignore postcode
d108      Ignore city
d109      Ignore first name
d110      Ignore last name
d111      Ignore street
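As an illustration of how these rules can be read: a rule like d101 compares records after lower-casing all values, and an "ignore" rule like d102 drops the respective field before comparing. The sketch below is our own approximation, not the Duplicate-Check implementation:

```python
def match_key(record: dict, ignore: frozenset = frozenset(),
              lower: bool = False) -> tuple:
    """Build a comparison key, optionally ignoring fields (like d102-d111)
    or lower-casing values (like d101)."""
    items = []
    for field, value in sorted(record.items()):
        if field in ignore:
            continue
        items.append((field, value.lower() if lower else value))
    return tuple(items)

a = {"lastname": "Meyer", "country": "DE"}
b = {"lastname": "MEYER", "country": "AT"}

# d105-style comparison: ignore country and compare case-insensitively.
is_duplicate = (match_key(a, frozenset({"country"}), lower=True)
                == match_key(b, frozenset({"country"}), lower=True))
```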

If you notice a discrepancy between the rules listed here and the Duplicate-Check itself, please let us know. Do you need duplicate rules that we don’t yet provide? Let us know; we will be very happy to add them.

Configuration

On this tab you can define settings for the Duplicate-Check. They are taken into account during processing and are described in detail below.

Duplicate Check – Settings Processing

Separator

The separator is only relevant when importing CSV files into the Duplicate-Check (whether in batch processing or background processing). The data is split into individual values based on this separator and prepared for checking. When choosing a separator, consider whether it occurs in your master data: a comma or a semicolon may appear in a company name and lead to processing errors. The pipe character “|” is the default. For JSON and XLSX this separator is not needed.
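The risk of a separator that also occurs in the data can be seen in a short example (illustrative values):

```python
# A comma inside a company name produces a wrong column count
# when the comma is also used as the separator.
line_comma = "ACME, Inc.,Berlin"
columns_comma = line_comma.split(",")   # 3 values instead of 2

# With the default pipe separator the split stays correct.
line_pipe = "ACME, Inc.|Berlin"
columns_pipe = line_pipe.split("|")     # 2 values
```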

Character encoding of the output

This setting controls the code page of the Duplicate-Check’s CSV output. UTF-8 is the default. If the CSV file is processed further with Microsoft Excel, Win1252 (which corresponds to the ANSI encoding) is recommended.

If records in the output file are not displayed correctly in your text editor or in Microsoft Excel, for example records with umlauts, try setting this parameter to a different encoding than the one currently set. This solves display problems in most cases.
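If you need to convert an existing output file between encodings yourself, a few lines suffice. This is a sketch; the file names are placeholders, and "cp1252" is Python’s name for the Win1252 code page:

```python
def convert_encoding(src: str, dst: str,
                     src_enc: str = "utf-8", dst_enc: str = "cp1252") -> None:
    """Re-encode a text file, e.g. UTF-8 -> Win1252 for Microsoft Excel.

    src and dst are placeholder paths; adjust them to your environment.
    """
    with open(src, encoding=src_enc) as f:
        text = f.read()
    with open(dst, "w", encoding=dst_enc, newline="") as f:
        f.write(text)
```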

Command Line Interface (cli)

You can also run the duplicate check without the graphical interface. To run the client tool from a command line, specify all necessary parameters.

Parameters

Run ew_service_duplicate --help to get an overview of all Duplicate-Check parameters that can be passed to the cli.

Usage of: ew_service_duplicate.exe [options]

Main options:
      --lang=ARG          Language (de,en). Overwrites settings.
  -c, --cleaner=ARG       List of cleaning rules (default: all), comma-separated.
  -d, --duplicates=ARG    List of duplicate checks (default: all), comma-separated
      --inputfile=ARG     Filename to import (csv, json, xlsx)
      --outputfile=ARG    Filename to export the results (csv, json, xlsx)
      --split             Split export into different files.
      --testmail          Send a testmail.
      --validatefile=ARG  Check file, if structure is readable.
Information:
  -h, --help     Show help and exit.
  -v, --version  Return the version information.

Cleaning Rules:
  c001 -  Remove multiple spaces
  c002 -  Remove trailing spaces
  c003 -  Remove non printable characters
  c004 -  Remove German letters (umlauts, ß)
  c005 -  Normalize quotes, special chars (Result: ', ", -)

Duplicate Checks:
  d100 -  Entries matching with 100%
  d101 -  Entries matching ignoring case
  d102 -  Entries matching ignoring country
  d103 -  Entries matching ignoring firstname/lastname
  d104 -  Entries matching ignoring department
  d105 -  Entries matching ignoring country/firstname/lastname/department
  d106 -  Entries matching ignoring number
  d107 -  Entries matching ignoring postcode
  d108 -  Entries matching ignoring town
  d109 -  Entries matching ignoring firstname
  d110 -  Entries matching ignoring lastname
  d111 -  Entries matching ignoring street
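Putting the options together, a full run could be assembled as follows. This is a sketch using Python’s subprocess module; the file names are placeholders, and the flags are taken from the help output above (on Windows the binary is ew_service_duplicate.exe):

```python
import subprocess

# Placeholder file names; adjust to your environment.
cmd = [
    "ew_service_duplicate",
    "--cleaner", "c001,c003",        # run only these cleanups
    "--duplicates", "d100,d105",     # run only these duplicate checks
    "--inputfile", "dubletten.csv",
    "--outputfile", "dubletten-out.csv",
]

# subprocess.run(cmd, check=True)  # uncomment to actually start the check
```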

-h --help

Shows an overview of all parameters that the cli supports.

-v --version

Outputs the currently installed version of the Duplicate-Check.

--lang

This parameter allows you to specify or override the language of the Duplicate-Check.

-c --cleaner

If you specify this parameter without listing any cleanup rules, all of them are executed one after the other. To perform only certain cleanups, list them, e.g. --cleaner c001,c003.

If this parameter is omitted, no cleanups will be performed.

-d --duplicates

If you specify this parameter without listing any duplicate rules, all of them are performed one after the other. To run only certain duplicate checks, list them, e.g. --duplicates d100,d105.

If this parameter is omitted, no duplicate checks will be performed.

-i --inputfile

Use this parameter to specify the file with data to be imported.

-o --outputfile

This parameter specifies the export file. It must not be identical to the import file.

--split

Splits the results into separate files for unique and duplicate entries.

--testmail

Sends a test e-mail from the Duplicate-Check. After a file has been processed, an e-mail can be sent to you; this parameter lets you verify that the mail setup works.

--validatefile

Checks the import file for formal correctness before a run.

Outputs of the command line interface (cli)

During runtime, the cli continuously prints messages on the command line so that you can track how far the checks have progressed.

CSV Interface

We recommend using the XLSX or JSON import interfaces.

By using a simple CSV file, the Duplicate-Check software provides you with a way to check your entire data set.

We take care to maintain compatibility when extending the CSV import interface of the Duplicate-Check software. This means that you can always use the latest version without generating additional effort when integrating it into your ERP system.

To be able to assign the individual duplicate records back to your master data, you can specify up to two unique keys in the import file.

The default separator between the individual elements is the “|” (pipe) character; it can be changed via the settings. Bold field names are mandatory fields.

Please note that all fields must be specified in the Duplicate-Check import file, even if you do not use key1 and key2.

Structure – CSV Import File

Field        Format    Example
key1         String
key2         String
firstname    String
lastname     String
name1        String
name2        String
name3        String
name4        String
street       String
number       String
postcode     String
town         String
department   String
country      String

Example in the form of a CSV file:

key1;key2;firstname;lastname;name1;name2;name3;name4;street;number;postcode;town;department;country;
val_key1;val_key2;val_firstname;val_lastname;val_name1;val_name2;val_name3;val_name4;val_street;val_number;val_postcode;val_town;val_department;val_country;

… (more duplicate checks)

Note

Please make sure the CSV import file contains the correct number of columns (14). This helps avoid errors during CSV import. Alternatively, you can use the XLSX or JSON import format to eliminate this source of error.
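A quick pre-check of the column count can be scripted before passing a CSV file to the Duplicate-Check. This sketch assumes the default pipe separator and the 14 columns listed above; the handling of a trailing separator is a simplification:

```python
EXPECTED_COLUMNS = 14  # see the structure table above

def check_column_count(path: str, separator: str = "|") -> list:
    """Return the (1-based) numbers of lines whose column count is wrong."""
    bad_lines = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            # A trailing separator, as in the example above, adds one
            # empty value; strip it before counting (simplification:
            # this also hides a genuinely empty last column).
            values = line.rstrip("\n").rstrip(separator).split(separator)
            if len(values) != EXPECTED_COLUMNS:
                bad_lines.append(number)
    return bad_lines
```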

Structure – CSV Export File

The CSV export file of the Duplicate-Check contains the transferred values as well as cleaned values and values marked as duplicates.

Field                 Format    Example
internalid            String
key1                  String
key2                  String
firstname             String
lastname              String
name1                 String
name2                 String
name3                 String
name4                 String
street                String
number                String
postcode              String
town                  String
department            String
country               String
// cleaned data
cleaned firstname     String
cleaned lastname      String
cleaned name1         String
cleaned name2         String
cleaned name3         String
cleaned name4         String
cleaned street        String
cleaned number        String
cleaned postcode      String
cleaned town          String
cleaned department    String
cleaned country       String
// applied cleaners
applied cleaners      String
// applied duplicates
duplicate ids         String
address group         String

The CSV export file of the duplicate check always includes an additional header row containing the column headings. Please take this into account for any automatic re-import of the check results.

JSON Interface

With the import interface for JSON files, the Duplicate-Check offers you a way to check your entire data set from your master data.

We make sure that compatibility is always maintained when extending the JSON duplicate interface. This means that you can always use the latest version without generating additional effort when integrating it into your ERP system.

To be able to uniquely assign a JSON data record from your ERP system during the duplicate check, you can specify up to two unique keys in the import file. These are returned in the export file and can be used for re-import into your ERP system. You can also leave these two fields (key1 and key2) blank; they are not required for processing.

Please note that all bold fields must be specified in the import file.

Structure – JSON Import File

Field        Format    Example
key1         String
key2         String
firstname    String
lastname     String
name1        String
name2        String
name3        String
name4        String
street       String
number       String
postcode     String
town         String
department   String
country      String

Example in the form of a JSON file:

[
    {
        "key1":"val_key1",
        "key2":"val_key2",
        "firstname":"val_firstname",
        "lastname":"val_lastname",
        "name1":"val_name1",
        "name2":"val_name2",
        "name3":"val_name3",
        "name4":"val_name4",
        "street":"val_street",
        "number":"val_number",
        "postcode":"val_postcode",
        "town":"val_town",
        "department":"val_department",
        "country":"val_country"
    },
    {...}
]
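An import file in this format can be generated with the standard library. The sketch below ensures every record carries all fields, with empty strings for unused ones; the field list is taken from the table above:

```python
import json

FIELDS = ["key1", "key2", "firstname", "lastname", "name1", "name2",
          "name3", "name4", "street", "number", "postcode", "town",
          "department", "country"]

def make_record(**values) -> dict:
    """Build a record containing every field; unknown fields are rejected."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    return {field: values.get(field, "") for field in FIELDS}

records = [make_record(lastname="Meyer", town="Berlin")]
payload = json.dumps(records, ensure_ascii=False, indent=4)
```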

Structure – JSON Export File

The Duplicate-Check JSON export file contains the previously imported data in the same data format unless otherwise specified. Note that the JSON interface outputs all available fields and is therefore the most complete format.

Field                 Format    Example
internalid            String
key1                  String
key2                  String
firstname             String
lastname              String
name1                 String
name2                 String
name3                 String
name4                 String
street                String
number                String
postcode              String
town                  String
department            String
country               String
// cleaned data
cleaned firstname     String
cleaned lastname      String
cleaned name1         String
cleaned name2         String
cleaned name3         String
cleaned name4         String
cleaned street        String
cleaned number        String
cleaned postcode      String
cleaned town          String
cleaned department    String
cleaned country       String
// applied cleaners
applied cleaners      String
// applied duplicates
duplicate ids         String
address group         String

In contrast to XLSX or CSV, the Duplicate-Check keys within JSON use English names only. This avoids runtime errors caused by an incorrect conversion from the outset.

Please note that the fields in the Duplicate-Check export file are not always output in the same order as specified in the table above.
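Because the field order is not guaranteed, any re-import should access values by key, never by position. The sketch below collects the duplicate groups from an export file; it assumes the field name "address group" from the table above, so verify the exact key spelling against your own export file:

```python
import json
from collections import defaultdict

def groups_from_export(path: str) -> dict:
    """Group exported records by their address group, accessing fields
    by name so that the output order of the fields does not matter."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    groups = defaultdict(list)
    for record in records:
        # "address group" is the field name from the table above;
        # check the exact key spelling against your export file.
        group = record.get("address group", "")
        if group:
            groups[group].append(record)
    return dict(groups)
```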

XLSX Interface

With the import interface for Microsoft Excel (XLSX) files, the Duplicate-Check software offers you a way to check your entire data set from your master data.

We take care to maintain compatibility (also with Microsoft Excel) when extending the XLSX duplicate interface. This means that you can always use the latest version without generating additional effort when integrating it into your ERP system.

To be able to uniquely assign an XLSX data record from your ERP system in the Duplicate-Check, you can specify up to two unique keys in the import file. These are returned in the export file and can be used for re-import into your ERP system. You can also leave these two fields (key1 and key2) blank.

During import, the designations of the column headers are searched for within the XLSX file and assigned to the corresponding fields. Please use only one name per column, e.g. “key1” and not “key1,key_1”.

The upper and lower case of the column headers is not relevant for the import.
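The case-insensitive header matching can be mirrored when preparing your own files. The lookup below is our own approximation, not the actual import code:

```python
EXPECTED = ["key1", "key2", "firstname", "lastname", "name1", "name2",
            "name3", "name4", "street", "number", "postcode", "town",
            "department", "country"]

def map_headers(headers: list) -> dict:
    """Map found column headers to field names, ignoring case and order."""
    lookup = {h.strip().lower(): i for i, h in enumerate(headers)}
    return {field: lookup[field] for field in EXPECTED if field in lookup}

# Headers as they might appear in a spreadsheet, in mixed case.
columns = map_headers(["Key1", "KEY2", "Firstname", "Lastname", "Name1",
                       "Name2", "Name3", "Name4", "Street", "Number",
                       "Postcode", "Town", "Department", "Country"])
```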

Structure – XLSX Import File

Field        Format    Example
key1         String
key2         String
firstname    String
lastname     String
name1        String
name2        String
name3        String
name4        String
street       String
number       String
postcode     String
town         String
department   String
country      String

Structure – XLSX Export File

The XLSX export file of the Duplicate-Check contains the returned values of the individual checks, also in the same data format unless otherwise specified.

Field                 Format    Example
internalid            String
key1                  String
key2                  String
firstname             String
lastname              String
name1                 String
name2                 String
name3                 String
name4                 String
street                String
number                String
postcode              String
town                  String
department            String
country               String
// cleaned data
cleaned firstname     String
cleaned lastname      String
cleaned name1         String
cleaned name2         String
cleaned name3         String
cleaned name4         String
cleaned street        String
cleaned number        String
cleaned postcode      String
cleaned town          String
cleaned department    String
cleaned country       String
// applied cleaners
applied cleaners      String
// applied duplicates
duplicate ids         String
address group         String

XLSX versions

The Duplicate-Check software supports all XLSX versions up to and including the latest version of Office 365. Please understand that we no longer support older versions (XLS).

Our test scope for the Duplicate-Check software covers a large number of variants of what a Microsoft Excel document can look like. Nevertheless, we cannot test every function and possibility. If you encounter a problem when importing XLSX files, please contact us and send us one or two test data sets so that we can help you quickly.