Data Transformation

Aim

Creating (composite/derived) analysis variables from measured variables correctly.

Description

The phase following data cleaning transforms measured variables into variables that can be used for analysis (analysis variables). This often requires some processing, such as recoding an income variable into a new income class variable, or calculating a new Body Mass Index variable from the height and weight variables. This is referred to as data transformation. As the name suggests, it is often done with the data transformation commands in SPSS, such as recode and compute. It is important that these transformations are carried out properly and that they are documented. If the transformations are carried out in SPSS via the syntax window, it is sufficient to save the syntax file as a logbook. If they are carried out via the menu, the documentation may consist of an SPSS log file annotated in MS Word.
It is recommended that a data transformation schedule is created prior to any modifications, with columns for the measured variable(s), the process(es), and the name and a description of the resulting variable (see details). Appropriate variable labels, value labels and any missing value definitions need to be assigned to the new variables. Standard practice is to calculate the frequency distribution of each new variable in order to check for odd values and outliers. When recoding a variable, it is important that the original variable is left intact. Therefore always create a new variable rather than overwriting the measured one.
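The two safeguards described above (recode into a new variable, then inspect its frequency distribution) can be sketched as follows. The guideline itself works in SPSS; the Python below is only an illustration using the standard library. The record layout, the sample values and the helper name `income_class` are assumptions. The example schedule lists 1000 Euro in both the low and the medium range, so the sketch assumes that boundary values fall in the lower class.

```python
from collections import Counter

# Hypothetical measured data: monthly income in euros, None = missing.
records = [
    {"id": 1, "monthinc": 850},
    {"id": 2, "monthinc": 1500},
    {"id": 3, "monthinc": 2600},
    {"id": 4, "monthinc": None},
    {"id": 5, "monthinc": 1000},
]


def income_class(monthinc):
    """Recode monthly income into an income class label.

    Assumption: boundary values (1000, 2000) fall in the lower class.
    """
    if monthinc is None:
        return None  # a missing income stays missing in the new variable
    if monthinc <= 1000:
        return "low"
    if monthinc <= 2000:
        return "medium"
    return "high"


# Store the result under a NEW key so the measured variable stays intact.
for rec in records:
    rec["incclass"] = income_class(rec["monthinc"])

# Frequency distribution of the new variable, missing values included.
frequencies = Counter(rec["incclass"] for rec in records)
for label in ("low", "medium", "high", None):
    print(label, frequencies[label])
```

Any category in the frequency table that was not expected, or a count that looks implausible, is a signal to go back to the recoding rule before the variable is used in an analysis.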
Once the transformations have been made, the files should be stored under a new name. These files are referred to as the working data: the cleaned files, including the derived variables.

Example of a data transformation schedule

| Measured variable(s) | Transformed variable | Transformation |
|---|---|---|
| Monthinc (Monthly income) | Income class (INCCLASS) | 0-1000 Euro = low; 1000-2000 Euro = medium; > 2000 Euro = high |
| Weight, Height | Body Mass Index (BMI) | Weight/(Height * Height) |
| Fatigue, Polyu, Drink, Thirst | Number of diabetes symptoms (DMSYMP) | Count how frequently a patient scores 1 on the variables Fatigue, Polyu, Drink and Thirst |
| Dob, StuDate | Age at time of study (AGE) | Absolute value of (StuDate – Dob)/365.25 |
| Sbldsys1, Sbldsys2 | Average systolic blood pressure (SBLPAV) | (Sbldsys1 + Sbldsys2)/2 |
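The computed rows of the schedule (BMI, DMSYMP, AGE and SBLPAV) can be written out as small functions, which makes each transformation easy to document and to check. This is a minimal Python sketch, not the SPSS syntax the guideline refers to. The function names and the patient record are made up for illustration, and the units (weight in kilograms, height in metres) are an assumption, since the schedule does not state them.

```python
from datetime import date


def bmi(weight_kg, height_m):
    """Body Mass Index: Weight / (Height * Height).

    Assumption: weight in kilograms and height in metres.
    """
    return weight_kg / (height_m * height_m)


def dmsymp(fatigue, polyu, drink, thirst):
    """Number of symptom variables on which the patient scores 1."""
    return sum(1 for score in (fatigue, polyu, drink, thirst) if score == 1)


def age_at_study(dob, studate):
    """Age in years: absolute value of (StuDate - Dob) / 365.25."""
    return abs((studate - dob).days) / 365.25


def sblpav(sbldsys1, sbldsys2):
    """Average of the two systolic blood pressure readings."""
    return (sbldsys1 + sbldsys2) / 2


# Hypothetical patient record; every value is invented for illustration.
patient = {
    "weight": 70.0, "height": 1.75,
    "fatigue": 1, "polyu": 0, "drink": 1, "thirst": 1,
    "dob": date(1950, 1, 1), "studate": date(2004, 1, 1),
    "sbldsys1": 130, "sbldsys2": 140,
}

# Derived variables are collected separately; the measured values are not
# modified.
derived = {
    "bmi": bmi(patient["weight"], patient["height"]),
    "dmsymp": dmsymp(patient["fatigue"], patient["polyu"],
                     patient["drink"], patient["thirst"]),
    "age": age_at_study(patient["dob"], patient["studate"]),
    "sblpav": sblpav(patient["sbldsys1"], patient["sbldsys2"]),
}
print(derived)
```

The sketch does not handle missing values. In practice each function would need a rule for missing input, recorded in the schedule alongside the transformation itself.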

Data transformation: Creating analysis variables from measured variables via specific processes (recoding, calculations).
This does not include, for instance, logarithmic transformations applied to variables that are not normally distributed.

V1.1: 1 Jan 2010: English translation.
V1.0: 31 Mar 2004: Data reduction has been changed to data transformation.

  • Has the data transformation been carried out correctly?
  • How has the data transformation been documented?
  • Has care been taken to ensure that the process is reversible, that is, can the file be restored to its status prior to transformation, if required?