Initial data analysis

Aim

To get a first impression of the data:
  • To get an impression of the distribution properties of the continuous variables and the numbers in the subgroups;
  • To explore the validity and reliability of the measurement instruments.

 

Requirements

  • To investigate continuous outcome variables for normal distribution;
  • To check for percentages of missing values and outliers;
  • To calculate Cronbach’s alpha for subscales.

 

Documentation

  • Syntax which has been used to perform the initial data analysis;
  • Percentages of missing values and outliers;
  • Cronbach’s alpha of new subscales.

 

Responsibilities

Executing researcher: To perform the initial data analysis, including:
  • Normal distribution of continuous variables;
  • Percentages of missing values and outliers;
  • Cronbach’s alpha for new subscales.
Project leaders: To advise the executing researcher on performing the initial data analysis.
Research assistant: N.a.

 

How To

First impression
It is advisable to always investigate the distribution of all the variables that you are going to use. Frequencies are examined for all categorical variables (e.g. marital status, education). Descriptive statistics (percentage of missing values, mean, “trimmed” mean, standard deviation, median, other relevant percentiles, minimum, maximum) are calculated for continuous variables (e.g. body weight, blood pressure). It is also advisable to create figures, e.g. boxplots or histograms, in order to review the distribution.
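The descriptive statistics listed above can be sketched in a few lines of Python. This is only an illustration using the standard library; the function names, the 10% trimming fraction and the example data are assumptions, not part of this guideline.

```python
import statistics
from collections import Counter


def describe_continuous(values, trim=0.1):
    """Summarise a continuous variable; None marks a missing value."""
    observed = sorted(v for v in values if v is not None)
    n_total, n_obs = len(values), len(observed)
    k = int(n_obs * trim)  # number of observations dropped from each tail
    trimmed = observed[k:n_obs - k]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    return {
        "n": n_obs,
        "pct_missing": 100 * (n_total - n_obs) / n_total,
        "mean": statistics.mean(observed),
        "trimmed_mean": statistics.mean(trimmed),
        "sd": statistics.stdev(observed),
        "median": statistics.median(observed),
        "q1": q1,
        "q3": q3,
        "min": observed[0],
        "max": observed[-1],
    }


def frequencies(values):
    """Return {category: (count, percentage)} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}


# Hypothetical example data
weights_kg = [62.0, 70.5, None, 81.2, 58.3, 75.0, 90.1, 66.4, None, 72.8, 68.0, 79.5]
marital = ["married", "single", "married", "widowed", "married", "single"]

summary = describe_continuous(weights_kg)
freq = frequencies(marital)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
print(freq)
```

Most statistical packages offer the same summaries directly; the point here is only to show which quantities are worth recording.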

 

Outliers and odd combinations
So-called outliers may occur in continuous variables. These are values that, theoretically, are not “out of range”, but are extremely unlikely given the observed distribution. Generate a boxplot to check for outliers.
Cross-tabulations can be generated for categorical variables (e.g. gender x ADL limitations) in order to assess whether odd combinations are present. Scatterplots can be created for continuous variables to reveal any unlikely combinations (simply reviewing correlations is not sufficient). For instance, a weight of 120 kg combined with a height of 1.50 metres will be an outlier in most populations. When it has been decided that a certain value or combination of values is an outlier and the true value cannot be recovered from the raw data, it needs to be recoded as “missing”.
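As a numerical counterpart to the boxplot, the sketch below flags values outside the usual boxplot whiskers (1.5 times the interquartile range beyond the quartiles) and recodes them as missing. The 1.5 multiplier is the common boxplot convention, not a rule set by this guideline, and the function names and data are hypothetical.

```python
import statistics


def boxplot_fences(values, k=1.5):
    """Lower and upper fence as drawn by a conventional boxplot."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the positions of values that fall outside the fences."""
    low, high = boxplot_fences(values, k)
    return [i for i, v in enumerate(values)
            if v is not None and not (low <= v <= high)]


def recode_as_missing(values, indices):
    """Replace the values at the given positions by None (missing)."""
    drop = set(indices)
    return [None if i in drop else v for i, v in enumerate(values)]


# Hypothetical body weights in kg; 190.0 is suspicious
weights_kg = [62.0, 70.5, 81.2, 58.3, 75.0, 190.0, 66.4, 72.8]

flagged = flag_outliers(weights_kg)
print("Positions to check against the raw data:", flagged)

# Only recode after the raw data have been checked and the true
# value turned out to be unrecoverable.
cleaned = recode_as_missing(weights_kg, flagged)
print(cleaned)
```

A flagged value is a prompt to look at the raw data, not an automatic reason to discard it.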

 

Missing values
Also carefully review missing values when evaluating the distributions. Often specific codes (e.g. -1 or 9) are used for missing values, although this is not really necessary and is often not very convenient when the data are analysed. Check whether these codes have been defined as missing values, because otherwise they will be treated as real values in the analysis. If there are missing values, consider whether these need to be imputed (filled in). Because there are a number of methods for this, it is advisable to consult a statistician.
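The effect of undefined missing codes can be illustrated as follows. The item, its scoring range and the helper names are hypothetical; the codes -1 and 9 are the examples mentioned above.

```python
import statistics


def recode_missing_codes(values, missing_codes):
    """Replace special missing-value codes by None."""
    codes = set(missing_codes)
    return [None if v in codes else v for v in values]


def percent_missing(values):
    """Percentage of values that are None."""
    return 100 * sum(v is None for v in values) / len(values)


# Hypothetical item scored 1-4, with -1 and 9 used as missing codes
raw = [3, 1, 9, 2, 9, 4, 2, -1, 1, 3]

clean = recode_missing_codes(raw, missing_codes=(-1, 9))
pct = percent_missing(clean)

# Mean when the codes are wrongly treated as real scores
naive_mean = statistics.mean(raw)
# Mean over the observed scores only
observed_mean = statistics.mean(v for v in clean if v is not None)

print(f"missing: {pct:.1f}%")
print(f"mean with codes left in: {naive_mean:.2f}")
print(f"mean of observed scores: {observed_mean:.2f}")
```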

 

Normal distribution of outcome variables
Check for normal distribution of outcome variables. Graphs can be used for this, such as histograms or Q-Q plots. If it is apparent that the variable is not normally distributed, then a transformation could be considered (for instance a logarithmic transformation) to see whether this brings the distribution closer to normal.
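A minimal sketch of this check is given below: it computes the points of a normal Q-Q plot and compares the skewness before and after a logarithmic transformation. Skewness is used here only as a convenient numerical summary and is an addition to the graphical checks named above; the data and function names are hypothetical.

```python
import math
from statistics import NormalDist, mean


def skewness(values):
    """Moment-based skewness; roughly 0 for a symmetric distribution."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def qq_points(values):
    """Pairs (theoretical normal quantile, observed value) for a Q-Q plot."""
    n = len(values)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(values)))


# Hypothetical right-skewed outcome (all values must be > 0 for the log)
outcome = [1.2, 1.5, 1.8, 2.1, 2.4, 2.9, 3.5, 4.8, 7.9, 15.3]
log_outcome = [math.log(v) for v in outcome]

skew_raw = skewness(outcome)
skew_log = skewness(log_outcome)
points = qq_points(outcome)

print(f"skewness before transformation: {skew_raw:.2f}")
print(f"skewness after log transformation: {skew_log:.2f}")
```

Plotting the pairs from `qq_points` should give an approximately straight line when the variable is close to normally distributed.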

 

Distribution of categories
Categories can be combined if the numbers in one or more categories are too small. The need for this is not always evident from an ordinary frequency distribution. However, it can be apparent from a cross-tabulation. For instance, in a study where there is stratification by gender and education, the cross-tabulation of education by gender may show that for men the lowest category “not completed primary education” rarely occurs, whereas for women the highest category “completed university education” rarely occurs. The lowest and next lowest categories can then be combined, as can the highest and second highest.
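The education-by-gender example can be sketched as follows. The counts, the coding of education as levels 1 to 5 and the function name are made up for illustration.

```python
from collections import Counter


def crosstab(pairs):
    """Cross-tabulate (row, column) pairs, including empty cells."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}


# Hypothetical data: (gender, education level), where 1 is the lowest
# and 5 the highest education category
sample = (
    [("men", 1)] * 1 + [("men", 2)] * 14 + [("men", 3)] * 20
    + [("men", 4)] * 15 + [("men", 5)] * 10
    + [("women", 1)] * 9 + [("women", 2)] * 16 + [("women", 3)] * 21
    + [("women", 4)] * 13 + [("women", 5)] * 1
)

before = crosstab(sample)

# Combine the lowest with the next lowest category,
# and the highest with the second highest
merge = {1: 2, 5: 4}
recoded = [(gender, merge.get(level, level)) for gender, level in sample]
after = crosstab(recoded)

for label, table in (("before", before), ("after", after)):
    print(label)
    for gender, row in table.items():
        print(f"  {gender}: {row}")
```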

 

Evaluating the randomisation procedure
In order to evaluate whether the randomisation has been “successful”, the distribution of all the relevant (prognostic) variables needs to be reviewed separately for each treatment arm. Descriptive statistics (percentages, means, median, standard deviation, range) can be used for this. Differences between groups can be tested (e.g. chi-square or t-test), although it needs to be remembered that, due to the randomisation procedure, any differences found are by definition due to chance. So, if differences are found, there is no need to change anything, but it is important to keep them in mind when evaluating and interpreting the results.
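A baseline comparison per treatment arm could look like the sketch below. It is limited to descriptive statistics for one continuous prognostic variable; the records, arm labels and function name are hypothetical.

```python
import statistics
from collections import defaultdict


def baseline_by_arm(records, arm_key, variable):
    """Descriptive statistics of one variable, separately per treatment arm."""
    groups = defaultdict(list)
    for rec in records:
        if rec[variable] is not None:  # skip missing values
            groups[rec[arm_key]].append(rec[variable])
    return {
        arm: {
            "n": len(vals),
            "mean": statistics.mean(vals),
            "sd": statistics.stdev(vals),
            "median": statistics.median(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for arm, vals in groups.items()
    }


# Hypothetical baseline data
records = [
    {"arm": "intervention", "age": 54}, {"arm": "control", "age": 52},
    {"arm": "intervention", "age": 61}, {"arm": "control", "age": 60},
    {"arm": "intervention", "age": 47}, {"arm": "control", "age": 49},
    {"arm": "intervention", "age": 58}, {"arm": "control", "age": 63},
    {"arm": "intervention", "age": 65}, {"arm": "control", "age": 56},
]

table = baseline_by_arm(records, arm_key="arm", variable="age")
for arm, stats in table.items():
    print(arm, stats)
```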

 

 

Audit questions
  1. Has the distribution of all variables been reviewed?
  2. Were there variables with a high percentage of missing values?
    1. If so, how were these dealt with?
  3. Have outliers been explored?
    1. If so, how?
  4. Where relevant: How were (large) deviations from normality dealt with?
  5. Has it been assessed whether the items belonging to a scale actually fit the scale?
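For the last audit question, Cronbach’s alpha can be computed with the standard formula alpha = k / (k - 1) x (1 - sum of item variances / variance of the total score), where k is the number of items. The sketch below assumes complete cases only and uses hypothetical item scores; in practice a statistical package will usually be used.

```python
import statistics


def cronbach_alpha(rows):
    """Cronbach's alpha for complete cases.

    rows: one list of item scores per respondent, all of equal length.
    """
    k = len(rows[0])  # number of items in the (sub)scale
    items = list(zip(*rows))  # transpose: one tuple of scores per item
    item_variances = [statistics.variance(item) for item in items]
    total_variance = statistics.variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical subscale of three items answered by five respondents
scores = [
    [3, 4, 3],
    [2, 2, 3],
    [4, 5, 4],
    [1, 2, 2],
    [3, 3, 4],
]

alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha: {alpha:.3f}")
```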

 

V3.0: 23 Jan 2017: Minor revision
V3.0: 13 Oct 2016: Minor revision
V2.0: 12 May 2015: Revision format
V1.1: 1 Jan 2010: English translation
V1.0: 23 Apr 2007