Data Analysis in General

Aim

Outline of quality aspects of data analysis (principal analyses).

Description

The variety of methods used in data analysis for medical and epidemiological research is enormous. This note gives an overview of the classes of frequently used methods and, where relevant, discusses factors that may influence the quality of the interpretation and therefore of the conclusions. No attempt has been made to provide an exhaustive list: the field is much too large for that.
We discuss the following topics briefly in the details section:
General modelling
Regression analysis
(Co)-variance analysis
Multilevel analysis
Methods for longitudinal data
Factor analysis
Methods for analysing structural models
Special methods: Exact tests, non-parametric tests, bootstrapping

General modelling
Not very much has been written about the general principles of statistical modelling, although there is literature on modelling within specific academic areas. A general book about statistical modelling is Dobson’s [1]; Edwards discusses the advantages and disadvantages of iterative (stepwise) methods in detail [2] (compare this with the article by Adèr, Kuik, Hoeksma and Mellenbergh [3], which is available here as a handout).
There are a number of issues that frequently occur in modelling:

Reliability of the models determined
A fitted model may be specific to the data at hand, which means that it may not be reproduced in follow-up studies. A remedy for this is cross-validation: the sample is randomly split into two halves, one half is used to develop the model and the other half to verify it. In general, this requires a large number of observations.
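The split-half procedure can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the data are simulated, and the helper names fit_line and mean_squared_error are hypothetical. A simple linear regression stands in for whatever model is being developed.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def mean_squared_error(a, b, xs, ys):
    """Average squared prediction error of the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(42)  # fixed seed so the split is reproducible

# Simulated (hypothetical) data: y depends linearly on x plus noise.
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

# Randomly split the sample into two halves.
indices = list(range(len(xs)))
rng.shuffle(indices)
half = len(indices) // 2
dev_idx, val_idx = indices[:half], indices[half:]

# Develop the model on one half ...
a, b = fit_line([xs[i] for i in dev_idx], [ys[i] for i in dev_idx])

# ... and verify it on the other half.
mse_dev = mean_squared_error(a, b, [xs[i] for i in dev_idx], [ys[i] for i in dev_idx])
mse_val = mean_squared_error(a, b, [xs[i] for i in val_idx], [ys[i] for i in val_idx])
print(f"slope={b:.3f}  MSE development={mse_dev:.3f}  MSE validation={mse_val:.3f}")
```

A validation error that is much larger than the development error suggests that the model is specific to the development half.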

Stepwise analysis
In this method the models are built in a stepwise manner: at each step a term is added to or removed from the model. Although there are a number of arguments against this procedure (see Edwards [2]), it is still used frequently. In general it is recommended that only forward stepwise methods are used, preferably in a variation in which the user can confirm or prevent the addition or removal of a term suggested by the programme.
Alternatively, methods can be used that, instead of proceeding stepwise, run through the competing models and report on each of them [2, 4]. The result of this type of analysis therefore consists of more than one model.
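One way of reporting on competing models side by side is to rank them by an information criterion such as AIC, in the spirit of the information-theoretic approach of [4]. The sketch below is only an illustration under stated assumptions: the data are simulated, there are just two candidate models, and the function names are hypothetical. The AIC is computed up to an additive constant, which is enough for comparing models fitted to the same data.

```python
import math
import random

def residual_sum_of_squares(xs, ys, use_slope):
    """RSS of a least-squares fit: intercept only, or intercept plus slope."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = 0.0
    if use_slope:
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))

def aic_least_squares(rss, n, n_params):
    """AIC up to an additive constant; n_params counts coefficients plus the error variance."""
    return n * math.log(rss / n) + 2 * n_params

rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]       # simulated predictor
ys = [1.0 + 0.8 * x + rng.gauss(0, 2) for x in xs]  # simulated outcome
n = len(xs)

candidates = {
    "intercept only": aic_least_squares(residual_sum_of_squares(xs, ys, False), n, 2),
    "intercept + x": aic_least_squares(residual_sum_of_squares(xs, ys, True), n, 3),
}

# Report every candidate, not just the winner, with its distance to the best model.
best = min(candidates.values())
for name, aic in sorted(candidates.items(), key=lambda item: item[1]):
    print(f"{name:15s} AIC={aic:8.2f}  delta={aic - best:6.2f}")
```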

Misspecification
If terms are missing from a statistical model (for example confounders), or if the specified model does not represent certain essential aspects of the data (for instance, linear regression analysis is used while the data have a hierarchical structure, so that multilevel analysis would have been more appropriate), this may influence the results dramatically.

Missing observations
Many multivariate methods (such as multiple regression analysis) are sensitive to missing values because they apply “listwise deletion” by default: if a single observation is missing for a respondent, that respondent is not used in estimating the model parameters. Possible remedies:
(i) Apply multiple imputation [5], although this is often impractical;
(ii) Impute using the EM algorithm or using regression analysis; this can be carried out using the Missing Value Analysis (MVA) programme within SPSS;
(iii) Enter a “safe” value for the missing values, that is, a value that is not expected to disrupt the estimates of the coefficients (mean imputation, “last observation carried forward” and similar methods). Although this appears to be simple, it is not always the right option;
(iv) Use multilevel analysis for models with repeated measures where values are missing at certain time points.
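The difference between listwise deletion and a simple “safe value” remedy can be illustrated with a small sketch. The records and variable names below are invented for the example, and mean imputation is shown only because it is the easiest remedy to write down, not because it is recommended.

```python
from statistics import mean

# Hypothetical respondents; None marks a missing observation.
records = [
    {"age": 34, "bmi": 22.1, "sbp": 118},
    {"age": 51, "bmi": None, "sbp": 135},
    {"age": 47, "bmi": 27.5, "sbp": None},
    {"age": 29, "bmi": 24.0, "sbp": 121},
    {"age": 62, "bmi": 30.2, "sbp": 142},
]
variables = ["age", "bmi", "sbp"]

# Listwise deletion: drop every respondent with at least one missing value.
complete_cases = [r for r in records if all(r[v] is not None for v in variables)]

# Mean imputation: replace each missing value by the mean of the observed values.
observed_means = {
    v: mean(r[v] for r in records if r[v] is not None) for v in variables
}
imputed = [
    {v: (r[v] if r[v] is not None else observed_means[v]) for v in variables}
    for r in records
]

print(f"respondents: {len(records)}, complete cases: {len(complete_cases)}, "
      f"after imputation: {len(imputed)}")
```

In this example two of the five respondents are lost under listwise deletion although each of them misses only one value. Mean imputation keeps all respondents but tends to understate the variability of the imputed variable.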

Violations of model assumptions
Both linear regression analysis and analysis of variance require that the residuals are normally distributed. It is therefore good practice to calculate “diagnostics” for both analyses: probability plots and other diagnostic plots [6]. However, both methods are relatively robust against violations of the assumptions. The main purpose of examining the diagnostics is therefore to get an impression of the reliability of the results.
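The coordinates of a normal probability plot can be computed without special software. The following sketch rests on assumptions that are not in the guideline: the residuals are simulated, the plotting position (i - 0.5) / n is one common choice among several, and the correlation between expected and observed quantiles is used merely as a rough summary of how straight the plot is.

```python
import random
from statistics import NormalDist, mean

def probability_plot_points(residuals):
    """Pair each sorted residual with the normal quantile expected at its rank."""
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()
    expected = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, ordered))

def plot_correlation(points):
    """Pearson correlation between expected and observed quantiles."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(7)
normal_residuals = [rng.gauss(0, 1) for _ in range(200)]             # well-behaved case
skewed_residuals = [rng.expovariate(1.0) - 1.0 for _ in range(200)]  # right-skewed case

r_normal = plot_correlation(probability_plot_points(normal_residuals))
r_skewed = plot_correlation(probability_plot_points(skewed_residuals))
print(f"normal residuals: r={r_normal:.3f}   skewed residuals: r={r_skewed:.3f}")
```

A correlation close to 1 corresponds to points lying near a straight line in the plot; the plot itself remains the more informative diagnostic.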

Regression analysis
A number of methods fall into this category, each with specific properties and assumptions:
(Multiple) linear regression analysis;
Logistic regression analysis;
Poisson regression;
Cox regression analysis (survival analysis).
Some comments:
A multilevel variant exists for each of these methods and can be applied when the data have a hierarchical structure. Note that specific assumptions need to be met for Cox regression analysis, and that GEE is a preferred method for multilevel logistic analysis.
The diagnostics used for these methods differ greatly. The diagnostics for linear regression have already been described above. There are also diagnostics for logistic regression analysis: see Hosmer and Lemeshow’s book [7]. Diagnostic assessment is less common for the other two methods.
A special type of logistic regression analysis is produced by calculating ROC curves: The result is a table and plot of sensitivity against (1 - specificity) at different thresholds for the predictor.
In Cox regression analysis it is standard practice to assume that covariates are constant over time, but the time dependency of a covariate can be taken into consideration.
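The table behind a ROC curve, as mentioned in the comments above, can be built by hand. This is a minimal sketch with invented scores and outcomes; the function names are hypothetical, and a case is counted as positive when its score is at or above the threshold.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, 1 - specificity) for every observed threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, tp / positives, fp / negatives))
    return points

def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule, starting from (0, 0)."""
    area, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
    for _, tpr, fpr in points:
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
    return area

# Hypothetical predictor values and true outcomes (1 = diseased, 0 = healthy).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

points = roc_points(scores, labels)
for threshold, sensitivity, one_minus_specificity in points:
    print(f"threshold={threshold:.1f}  sensitivity={sensitivity:.2f}  "
          f"1-specificity={one_minus_specificity:.2f}")
auc = area_under_curve(points)
print(f"AUC={auc:.4f}")
```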

(Co)-variance analysis
Often a one-way ANOVA is used when the means of more than two groups need to be compared (for two groups this equates to a t-test). If, in addition, a number of covariates (categorical as well as continuous) need to be included in the model, then an analysis of (co)variance needs to be carried out.
The advantage of covariance analysis over regression analysis is that all covariates can be specified in a single model: All interactions are included automatically in the model. A disadvantage is that the variance analysis imposes strict requirements on the (continuous) covariates (the regression coefficients need to be equal in all subgroups), which are not always met. Analysis of variance is useful in the exploratory phase in order to get an impression of the influential covariates/confounders.
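The F statistic of a one-way ANOVA, and its equivalence to the t-test when there are only two groups, can be checked with a few lines of arithmetic. The group values below are invented, the function names are hypothetical, and the t-test shown is the pooled-variance version.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way analysis of variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    pooled_variance = ss / (na + nb - 2)
    return (ma - mb) / (pooled_variance * (1 / na + 1 / nb)) ** 0.5

# Hypothetical measurements in three groups.
group_1, group_2, group_3 = [4, 5, 6], [6, 7, 8], [9, 10, 11]

f3, df_b, df_w = one_way_anova_f([group_1, group_2, group_3])
print(f"three groups: F({df_b}, {df_w}) = {f3:.2f}")

# With two groups, F equals the square of the pooled t statistic.
f2, _, _ = one_way_anova_f([group_1, group_2])
t = pooled_t(group_1, group_2)
print(f"two groups: F = {f2:.2f}, t squared = {t ** 2:.2f}")
```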

Multilevel analysis
Multilevel analysis is used if the data are nested, for instance when patient data are collected from various GP practices, some of which are group practices in which each doctor has his or her own patients.
The methodology is complicated: it is advisable to take a course on the topic before undertaking the analysis and to ask for advice prior to the analysis phase.

Methods for longitudinal data
Multilevel analysis can also be used if the lowest level contains observations over time. The GEE [8] programme can be used in this situation. The GEE estimating procedures are particularly reliable when the dependent variable is dichotomous (the equivalent of logistic regression analysis).
In a methodological sense, use of GEE is recommended when comparisons between groups need to be made and the researcher is not interested in the variability between individual patients.

Factor analysis
Factor analysis is often used in validating questionnaires (see also guideline 1.1B-08 Questionnaires, selecting, translating and validating), particularly when there is an assumption that the questionnaire contains more than one dimension.
A distinction can be made between exploratory and confirmatory factor analysis. Principal Components Analysis (PCA) is often used in the former (as well as Common Factor Analysis); the latter often makes use of software to derive structural models (see below).
The use of factor analysis is anything but trivial: there are various pitfalls to avoid. The same advice applies here as for multilevel analysis: take a course and ask for advice prior to the analysis phase.

Methods for analysing structural models
Two programmes are often used for this: EQS and Lisrel. The standard reference text for SEM (Structural Equation Modelling) is Bollen [9].
Lisrel is obtainable through EMGO.

Special methods: Exact tests
In many instances a choice can be made (in SPSS) between asymptotic and exact tests, for instance when calculating chi-square tests on a cross tabulation. A dedicated statistical package has been developed for exact tests (StatXact), which can also be used to calculate exact odds ratios. This programme is available from: Joop Kuik, Department of Clinical Epidemiology and Biostatistics.
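The difference between an asymptotic and an exact test is easiest to see on a small 2x2 table. The sketch below is an illustration only: the table is invented, the function names are hypothetical, the asymptotic test is the Pearson chi-square test without continuity correction, and the exact test is Fisher's exact test with the usual two-sided rule of summing all tables that are no more probable than the observed one.

```python
import math
from math import comb

def chi_square_asymptotic(a, b, c, d):
    """Asymptotic p-value of the Pearson chi-square test (1 df, no continuity correction)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are at most as probable as the observed one.
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # small tolerance for floating-point ties
    )

# Hypothetical cross tabulation with only 8 observations.
table = (3, 1, 1, 3)
p_asymptotic = chi_square_asymptotic(*table)
p_exact = fisher_exact_two_sided(*table)
print(f"asymptotic p={p_asymptotic:.4f}  exact p={p_exact:.4f}")
```

With cell counts this small the two p-values differ considerably, which is the situation in which an exact test is usually preferred.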

Special methods: Non-parametric methods, bootstrapping
Methods such as regression analysis and analysis of variance impose relatively strict requirements on the data to which they are applied. It is important that the distribution of the data is unimodal and, more generally, that the data are normally distributed. Various options are available if these requirements cannot be met.

Non-parametric methods
SPSS has a (large) range of non-parametric tests available, for instance: Mann-Whitney U, Kruskal-Wallis, Wilcoxon and Friedman’s test. These tests do not impose all of the requirements regarding the data distribution and in most cases use only the rank order of the dependent variable.
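That such tests depend only on rank order can be made concrete with the Mann-Whitney U statistic. The sketch below uses invented data and hypothetical function names, and it computes the statistic only, not its p-value. Tied values receive the average of their ranks.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b), computed from the rank sum of group_a."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return u_a, n_a * n_b - u_a

# Hypothetical measurements in two groups.
group_a = [1.1, 2.3, 2.9, 4.0]
group_b = [3.5, 4.8, 5.2, 6.1, 7.0]

u_original = mann_whitney_u(group_a, group_b)

# A monotone transformation changes the values but not their rank order.
u_transformed = mann_whitney_u([math.exp(v) for v in group_a],
                               [math.exp(v) for v in group_b])
print(u_original, u_transformed)
```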

Bootstrapping
This is a so-called resampling method, which allows the distribution requirements for parametric tests to be by-passed. Bootstrapping is frequently used in cost-effectiveness analyses these days. The standard reference text is Efron and Tibshirani [10].
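A percentile bootstrap confidence interval for a mean can be sketched in a few lines. The cost figures are invented, the function name and its parameters are hypothetical, and the percentile interval is only one of several bootstrap intervals described in [10].

```python
import random
from statistics import mean

def bootstrap_percentile_ci(data, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(data)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    # Draw n observations with replacement, n_resamples times, and sort the estimates.
    estimates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical, strongly right-skewed cost data (as is typical for health care costs).
costs = [120, 95, 430, 210, 80, 1500, 160, 75, 300, 110, 90, 2200, 140, 180, 260]

lower, upper = bootstrap_percentile_ci(costs, mean)
print(f"mean cost={mean(costs):.1f}  95% bootstrap interval=({lower:.1f}, {upper:.1f})")
```

No normality assumption is needed for the interval, which is why the approach is attractive for skewed cost data.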

[1]  Dobson AJ. Introduction to Statistical Modelling. London New York: Chapman and Hall, 1983.
[2]  Edwards D. Introduction to Graphical Modelling. New York Berlin Heidelberg Barcelona Hong Kong London Milan Paris Singapore Tokyo: Springer, 2nd edn., 2000. ISBN 0-387-95054-0.
[3]  Adèr HJ, Kuik DJ, Hoeksma JB, Mellenbergh GJ. Methodological aspects of statistical modelling: Some new perspectives. In: Stasinopoulos M, Touloumi G, eds., Statistical Modelling in Society. Proceedings of the 17th International Workshop on Statistical Modelling. Chania, Crete, Greece, July 8-12, 2002, Athens, 2002. National & Kapodistrian University of Athens and University of North London, 2002; 59-68.
[4]  Burnham KP, Anderson DR. Model selection and multimodel inference. A Practical Information-Theoretic Approach. ??????, 2nd edn., 2002.
[5]  Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: Wiley, 1987.
[6]  Judd. Statistical methods in the social sciences. To be sorted out.
[7]  Hosmer DW, Lemeshow S. Applied Logistic Regression. New York Chichester Brisbane Toronto Singapore: John Wiley & Sons, 1989.
[8]  Liang K, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986;73:13-22.
[9]  Bollen KA. Structural Equations with Latent Variables. New York: John Wiley and Sons, 1989.
[10]  Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York London: Chapman & Hall, 1993.

Modelling

Finding a statistical model that works well with the data.

Cross-validation

Method where the sample is split in two. One half is used to develop the models, the other to test the models developed.

Stepwise modelling

Modelling method involving stepwise procedures: a term is added to or removed from the model at each step. A distinction is made between forward stepwise, backward stepwise and stepwise.

Imputing

Method of filling in missing values in a dataset.

Multilevel analysis

Type of regression analysis where a distinction can be made between more than one level: for instance, data collected from patients within a general practice, where both the patients’ and the general practitioners’ data play a role.

GEE

Generalized Estimating Equations: specific type of multilevel analysis.

Logistic regression analysis

Type of regression analysis where the dependent variable is dichotomous.

Dichotomy

A variable that can only assume one of two values.

Cox regression

Type of survival analysis. The dependent variable reflects length of survival.

Normality

Property of a variable’s distribution: The underlying distribution is normal or Gaussian.

Resampling method

Method where samples are repeatedly taken from the available data, either to by-pass the distribution requirements of a test (for instance for bootstrapping), or to increase the precision of an estimate.

V1.1:  1 Jan 2010: English translation.