Data Cleaning Guideline

Aim

To ensure that the data file is clean, with as many errors as possible removed, before the statistical analysis starts.

Description

The data cleaning process is aimed at obtaining a data file which is as clean as possible. Data cleaning involves checking the following (in this order):

  1. Presence of duplicates in the file (the same respondent occurring more than once).
  2. Presence of ghost patients (non-existent respondent numbers occurring in the file).
  3. Compulsory completion of a variable (it is, for instance, essential that the respondent number is always filled in).
  4. Out-of-range values (impossible variable values, for instance, a height of 3 metres).
  5. Logical inconsistencies between variables, for instance, “pregnant men”.
  6. If applicable, longitudinal data cleaning will need to be subsequently carried out.
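
The first three checks in the list above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed procedure: the field names (`respnr`, `height_cm`), the sample records and the set of issued respondent numbers are all hypothetical, and in practice the same checks would usually be run in SPSS itself.

```python
from collections import Counter

# Hypothetical sample records; None stands for a value that was not filled in.
records = [
    {"respnr": 101, "height_cm": 172},
    {"respnr": 102, "height_cm": 165},
    {"respnr": 102, "height_cm": 165},   # duplicate respondent
    {"respnr": 999, "height_cm": 180},   # number that was never issued
    {"respnr": None, "height_cm": 158},  # compulsory variable missing
]

# Hypothetical set of respondent numbers that were actually issued.
issued_numbers = {101, 102, 103}


def find_duplicates(rows, key="respnr"):
    """Return the key values that occur more than once."""
    counts = Counter(row[key] for row in rows if row[key] is not None)
    return sorted(value for value, n in counts.items() if n > 1)


def find_ghosts(rows, valid_ids, key="respnr"):
    """Return key values present in the file that were never issued."""
    present = {row[key] for row in rows if row[key] is not None}
    return sorted(present - set(valid_ids))


def find_missing(rows, key="respnr"):
    """Return the 0-based positions of rows where a compulsory variable is empty."""
    return [i for i, row in enumerate(rows) if row[key] is None]


duplicates = find_duplicates(records)
ghosts = find_ghosts(records, issued_numbers)
missing = find_missing(records)
print(duplicates, ghosts, missing)
```

Each function only reports suspect records; deciding what the correct value is still requires going back to the source.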

Work in the right order: first deal with the “out-of-range” issues, and only then carry out the inconsistency assessments, as the risk of finding inconsistencies is smaller once the “out-of-range” values have been corrected.
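
The order described here, range checks first and consistency checks second, might look as follows in Python. This is a sketch under stated assumptions: the variable names, the sample records and the plausible ranges in `RANGES` are hypothetical and would have to come from the study's own codebook.

```python
# Hypothetical plausible ranges per variable (inclusive bounds).
RANGES = {"height_cm": (100, 230), "age": (18, 110)}

# Hypothetical sample records.
records = [
    {"respnr": 1, "sex": "M", "pregnant": False, "height_cm": 181, "age": 45},
    {"respnr": 2, "sex": "F", "pregnant": True, "height_cm": 300, "age": 31},
    {"respnr": 3, "sex": "M", "pregnant": True, "height_cm": 175, "age": 52},
]


def out_of_range(rows, ranges):
    """Step 1: list (respnr, variable, value) for values outside the allowed range."""
    problems = []
    for row in rows:
        for var, (low, high) in ranges.items():
            value = row.get(var)
            if value is not None and not (low <= value <= high):
                problems.append((row["respnr"], var, value))
    return problems


def inconsistencies(rows):
    """Step 2: list respondent numbers showing the 'pregnant men' contradiction."""
    return [row["respnr"] for row in rows if row["sex"] == "M" and row["pregnant"]]


range_problems = out_of_range(records, RANGES)
# In practice the range problems would be traced and corrected at the source
# before the consistency check is run on the corrected file.
logic_problems = inconsistencies(records)
print(range_problems, logic_problems)
```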

Data improvement
When tracing errors, go back to the source. This may, for instance, be the relevant questionnaire, which can be used to assess where the problem lies (a data entry error, an interpretation error, a wrong entry by the respondent, or an issue which cannot be resolved any further). For an interview (open or closed), it is possible to return to the tape recording, to the contact form associated with the interview, or to the report (form) created by the interviewer or respondent.

Corrections should then be made in a copy of the raw SPSS files, at variable/form level, and the copy should be stored under a new name. The raw SPSS files are the files in which the data entry has already been checked (see Data Entry Accuracy), but in which no variables or questionnaires have yet been merged into a single file and no new variables have yet been created (see also the schematic overview of the various stages of the files in the data processing phase).

In the event of an incorrect response by the respondent, the associated variable should be coded as “user missing”. It is of the utmost importance to clean every variable in a file, and not just those variables to be used in the statistical analysis. Once the improvements have been made, the files should be stored under a new name. These are referred to as cleaned SPSS system files.
It is important that the modifications carried out during the data cleaning process are documented in a logbook.
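
As an illustration of correcting a copy and documenting the change, the Python sketch below recodes an incorrect response to a user-missing code and writes an entry to a logbook. The missing-value code `-9`, the field names and the logbook layout are assumptions made for this example; the guideline itself does not prescribe them.

```python
import copy
from datetime import date

USER_MISSING = -9  # hypothetical user-missing code; use the one defined for the study

# Hypothetical raw data, standing in for the raw SPSS file.
raw = [
    {"respnr": 1, "height_cm": 181},
    {"respnr": 2, "height_cm": 300},  # impossible value given by the respondent
]


def set_user_missing(rows, respnr, variable, reason, logbook):
    """Recode one value to user missing and record the change in the logbook."""
    for row in rows:
        if row["respnr"] == respnr:
            logbook.append({
                "date": date.today().isoformat(),
                "respnr": respnr,
                "variable": variable,
                "old": row[variable],
                "new": USER_MISSING,
                "reason": reason,
            })
            row[variable] = USER_MISSING
            return
    raise KeyError(f"respondent {respnr} not found")


cleaned = copy.deepcopy(raw)  # work on a copy; the raw data stay untouched
logbook = []
set_user_missing(cleaned, 2, "height_cm", "incorrect response by respondent", logbook)
print(cleaned[1], logbook[0]["old"])
```

Keeping the old value, the new value and the reason in each logbook entry makes it possible to reconstruct afterwards how the cleaned file differs from the raw file.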

Introductory Meeting Data Management part 2.

Introductory Course SPSS for the Post-Initial Master’s programme in Epidemiology

Information and guides for the data processing phase can be found on the Data and System Management’s intranet pages.

V1.2: 1 Jan 2010: English translation.
V1.1: 29 Nov 2006: Small textual amendments.

    • Have the data been cleaned?
    • Have all the variables in the raw SPSS files been cleaned? If so, how, and has this been documented? If not, why not?
    • Have any uncovered errors been amended in the raw SPSS files at questionnaire level?
    • Has the new cleaned file been stored under a new name?
    • How has the data cleaning been documented?