Data cleaning

Aim

To ensure a clean data file, from which errors have been removed.

 

Requirements

  • Data of all variables in the files is cleaned, not just those used in statistical analyses.

 

Documentation

  • Note all modifications in the logbook or syntax file.

 

Responsibilities

Executing researcher:
  1. To clean the data in the right order by checking for:
  • The presence of duplicates;
  • The presence of ghost patients;
  • Compulsory completion of a variable;
  • Out-of-range values;
  • Logical inconsistencies between variables, for instance, “pregnant men”;
  2. To trace errors back to their source;
  3. To include any corrections in a copy of the raw SPSS files and to store this copy under a new name;
  4. To note modifications in the logbook/syntax.
Project leaders: To ensure the executing researcher cleans the data, by raising this topic at a regular meeting.
Research assistant: N.a.

 

How To

The aim of the data cleaning process is to obtain a data file which is as clean as possible, i.e. one from which as many errors as possible have been removed. Data cleaning involves checking the following (in this order):
  1. Presence of duplicates in the file (the same respondent occurring more than once);
  2. Presence of ghost patients (non-existent respondent numbers occurring in the file);
  3. Compulsory completion of a variable (it is, for instance, essential that the respondent number is always filled in);
  4. Out-of-range values (impossible variable values, for instance, a height of 3 metres);
  5. Logical inconsistencies between variables, for instance, “pregnant men”;
  6. If applicable, longitudinal data cleaning will need to be subsequently carried out.
Work in the right order: first resolve the “out-of-range” values, and only then check for inconsistencies, because fewer inconsistencies are found once the “out-of-range” values have been corrected.
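The checks listed above can be sketched in a few lines of code. The example below is only an illustration in Python, not a prescribed implementation: the field names (resp_id, sex, height_cm, pregnant), the list of enrolled respondent numbers and the plausible height range are all assumptions made for the example, and in practice the same checks would usually be written in the syntax of the statistical package in use.

```python
# Illustrative records; all field names and values are hypothetical.
records = [
    {"resp_id": 101, "sex": "F", "height_cm": 168, "pregnant": 1},
    {"resp_id": 102, "sex": "M", "height_cm": 300, "pregnant": 0},   # out of range
    {"resp_id": 102, "sex": "M", "height_cm": 181, "pregnant": 0},   # duplicate number
    {"resp_id": 999, "sex": "F", "height_cm": 159, "pregnant": 0},   # ghost patient
    {"resp_id": None, "sex": "M", "height_cm": 175, "pregnant": 0},  # number missing
    {"resp_id": 104, "sex": "M", "height_cm": 178, "pregnant": 1},   # inconsistent
]
enrolled_ids = {101, 102, 103, 104}  # assumed list of existing respondent numbers
HEIGHT_RANGE_CM = (50, 250)          # assumed plausible range, for illustration


def find_duplicates(rows):
    """Return respondent numbers that occur more than once."""
    seen, dupes = set(), set()
    for row in rows:
        rid = row["resp_id"]
        if rid is None:
            continue
        if rid in seen:
            dupes.add(rid)
        seen.add(rid)
    return sorted(dupes)


def find_ghosts(rows, known_ids):
    """Return respondent numbers that do not exist in the enrolment list."""
    return sorted({r["resp_id"] for r in rows
                   if r["resp_id"] is not None and r["resp_id"] not in known_ids})


def find_missing(rows, field):
    """Return row positions where a compulsory field is empty."""
    return [i for i, r in enumerate(rows) if r[field] is None]


def find_out_of_range(rows, field, low, high):
    """Return row positions where a filled-in value lies outside [low, high]."""
    return [i for i, r in enumerate(rows)
            if r[field] is not None and not (low <= r[field] <= high)]


def find_pregnant_men(rows):
    """Return row positions with the logical inconsistency 'pregnant men'."""
    return [i for i, r in enumerate(rows)
            if r["sex"] == "M" and r["pregnant"] == 1]


# Run the checks in the order given in the text.
report = {
    "duplicates": find_duplicates(records),
    "ghost_patients": find_ghosts(records, enrolled_ids),
    "missing_resp_id": find_missing(records, "resp_id"),
    "height_out_of_range": find_out_of_range(records, "height_cm", *HEIGHT_RANGE_CM),
    "pregnant_men": find_pregnant_men(records),
}

for check, hits in report.items():
    print(check, hits)
```

Each function only reports suspect records; it does not change any data. Deciding what the correct value is still requires going back to the source.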

 

Data improvement
When tracing errors, go back to the source, for instance the relevant registration form, patient record or questionnaire, to assess where the problem lies (data entry error, interpretation error, wrong entry by the respondent, or an issue which cannot be resolved any further). In the case of an interview (open or closed), it is possible to return to the tape recording, to the report on the contact form associated with the interview, or to the report (form) created by the interviewer or respondent.
Improvements then need to be made in the database that was used for the data collection or, when this is impossible, in a copy of the raw SPSS files at variable/form level, which is stored under a new name. The raw SPSS files are the files in which data entry checking has already taken place (see Data Entry Accuracy), but in which no variables or questionnaires have yet been combined into a single file (see also the schematic overview of the various stages of the files in the data processing phase) and no new variables have yet been created.
In case of an incorrect response by the respondent, the associated variable should be coded as “user missing”. It is of the utmost importance to clean every variable in a file, and not just the variables to be used in the statistical analysis. Once the improvements have been made, the files should be stored under a new name; these are referred to as cleaned SPSS system files. It is important that the modifications carried out during the data cleaning process are documented in a logbook.
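As a rough illustration of the steps described above (work on a copy, recode an unresolvable response to “user missing”, and log every modification), the following Python sketch may help. The missing-value code -9, the field names and the helper apply_correction are hypothetical and chosen only for the example; the actual “user missing” codes are those defined for the study's own files.

```python
import copy
from datetime import date

USER_MISSING = -9  # assumed "user missing" code, for illustration only


def apply_correction(rows, logbook, resp_id, field, new_value, reason, when=None):
    """Change one value for one respondent and record the change in the logbook."""
    for row in rows:
        if row["resp_id"] == resp_id:
            logbook.append({
                "date": (when or date.today()).isoformat(),
                "resp_id": resp_id,
                "field": field,
                "old": row[field],
                "new": new_value,
                "reason": reason,
            })
            row[field] = new_value
            return
    raise KeyError(f"respondent {resp_id} not found")


# Hypothetical raw data; duplicates are assumed to have been resolved already.
raw = [
    {"resp_id": 102, "height_cm": 300},
    {"resp_id": 104, "height_cm": 178},
]

cleaned = copy.deepcopy(raw)  # corrections go into a copy; the raw data stay untouched
logbook = []

apply_correction(
    cleaned, logbook,
    resp_id=102, field="height_cm", new_value=USER_MISSING,
    reason="implausible value entered by respondent; not resolvable at source",
)

for entry in logbook:
    print(entry)
```

Keeping the old value, the new value and the reason in each logbook entry makes it possible to reconstruct afterwards how the cleaned file differs from the raw file.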
Note that research carried out under the strict conditions of GCP is subject to many more regulations on data validation and data cleaning, for example the use of a GCP-compliant database such as Castor or OpenClinica, and the use of discrepancy management and queries.

 

Appendices/references/links

 

Audit questions

  1. Have the data been cleaned?
  2. Have all the variables in the raw SPSS files been cleaned?
    1. If so, how (documented?)?
    2. If not, why not?
  3. Have any errors found been corrected in the original database that was used for data collection, or in the raw SPSS files at the level of the measuring point or registration form/questionnaire?
  4. Has the new cleaned file been stored under a new name?
  5. How has the data cleaning been documented?

 

V3.0: 1 December 2016: Text updated
V2.0: 28 May 2015: Revision format
V1.2: 1 Jan 2010: English translation
V1.1: 29 Nov 2006: Small textual amendments.