Handling missing data

Aim

To give researchers a structured guideline for handling missing data

 

Requirements

Clear documentation of the decisions that were made regarding handling missing data.  If data were imputed, imputation methods are clearly documented.

 

Documentation

  • Research protocol: Describe how missing data will be prevented during data collection;
  • Data analysis plan: Describe how missing data analyses will be performed and how missing data will be handled (per questionnaire/instrument);
  • Syntax: Documentation of the missing data analyses that were performed, and of the eventual imputation method. Refer to data files with and without imputed data;
  • Logbook: decisions and arguments for the way missing data will be handled.

 

Responsibilities

Executing researcher:
  • Make sure you understand the pitfalls of ignoring missing data;
  • Make sure you discuss the way you handle missing data with your supervisor and if needed with experts;
  • Document the points mentioned under Documentation.
Project leaders:
  • Advise the executing researcher to consult sources and experts before missing data are imputed or ignored;
  • Check that the research protocol describes how missing data will be prevented during data collection;
  • Check that the analysis plan describes the steps that will be taken to analyze the missing data;
  • Ensure that the executing researcher properly documents the points mentioned under Documentation.
Research assistant:
  • When there are no guidelines, ask the executing researcher how to handle missing data in questionnaires or on instruments.

 

How To

1. Introduction
Missing data are a common problem in all kinds of research. The way you deal with them depends on how much data is missing, the kind of missing data (single items, a full questionnaire, a measurement wave), and the reasons why the data are missing. Handling missing data is an important step in several phases of your study.

 

2. Why do you need to do something with missing data
The default option in SPSS is that cases with missing values are not included in the analyses. Deleting cases or persons has two consequences. First, it results in a smaller sample size and larger standard errors. As a result the power decreases: the chance that you correctly reject the null hypothesis of no effect in favour of the alternative hypothesis of an effect becomes smaller. Second, you may introduce bias in effect estimates, such as mean differences (from t-tests) or regression coefficients (from regression analyses). When the group of non-responders is large and you delete them, the characteristics of your sample differ from those of your original sample and of the population you study, because responders and non-responders may differ in their characteristics. Therefore, always inspect the missing data in your dataset before starting your analyses, and never simply delete persons with missing values (the default option in SPSS).
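The loss of precision described above can be illustrated with a small calculation. This is a minimal sketch in Python, not an SPSS procedure. It assumes that cases are deleted completely at random, so that the standard deviation stays the same and only the sample size shrinks (standard error of a mean = sd / sqrt(n)). The function name se_inflation_factor is hypothetical.

```python
from math import sqrt

def se_inflation_factor(fraction_deleted):
    """Factor by which the standard error of a mean grows after deleting cases.

    Assumption: cases are deleted completely at random, so the standard
    deviation is unchanged and only n shrinks (SE = sd / sqrt(n)).
    """
    if not 0 <= fraction_deleted < 1:
        raise ValueError("fraction_deleted must be in [0, 1)")
    return 1 / sqrt(1 - fraction_deleted)

# Show the inflation for a few illustrative fractions of deleted cases.
for fraction in (0.05, 0.10, 0.30):
    factor = se_inflation_factor(fraction)
    print(f"{fraction:.0%} of cases deleted -> standard error x {factor:.3f}")
```

Note that this only covers the loss of power. The bias that arises when responders and non-responders differ is not captured by this factor.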

 

3. What to do with missing data in different phases of your study
Data preparation:
If you work with questionnaires, make sure that all questions are clear and applicable to your respondents. If necessary, use the ‘not applicable’ answer option. In SPSS, missing values can be coded by the user (user-defined missing values) or set automatically by SPSS itself (system-missing values). It is not necessary to code missing values with numbers such as 999 or -9999; you can also leave the cells empty. Either way the missing values are excluded from the analyses, provided that any numeric codes are declared as missing values in SPSS. To decrease the chance of missing data, use digital applications to collect your data, such as web-based questionnaires in which answering a question can be made mandatory. You can also use these applications to send reminders and to track the respondents’ progress. If you work with physical or physiological data, the most frequent cause of missing data is a technical problem with the instruments. Testing the instruments in a pilot study will help to prevent some of these problems.
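If you process your data outside SPSS, numeric missing codes are not recognised automatically. The following minimal Python sketch shows what goes wrong when such codes are treated as real values and how recoding prevents it. The codes 999 and -9999 come from the text above. The variable, the data and the function names are made up for illustration.

```python
# Codes that the (hypothetical) codebook declares as "missing".
MISSING_CODES = {999, -9999}

def recode_missing(values, missing_codes=MISSING_CODES):
    """Replace declared missing codes with None so they cannot enter a calculation."""
    return [None if v in missing_codes else v for v in values]

def mean_of_observed(values):
    """Mean over the observed (non-None) values only."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

ages = [34, 51, 999, 47, -9999, 28]                   # made-up example data

naive_mean = sum(ages) / len(ages)                    # codes wrongly used as ages
clean_mean = mean_of_observed(recode_missing(ages))   # codes excluded

print("mean with codes treated as values:", naive_mean)
print("mean after recoding to missing   :", clean_mean)
```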

 

Data collection:
Closely monitor the completeness of the data when you receive or obtain the data. When you detect missing data during data collection, try to complete your data. Look back in the raw data (questionnaires), or ask your respondents to fill out the missing items. Describe in your logbook why data are missing. This helps you to decide whether data are missing at random or not.

 

Data processing:
Investigate how much data is missing (see section 4), estimate the need for imputation, and think about the most adequate imputation method (see section 5 and further).

 

Data analyses:
If you have missing values in your dataset when starting your analyses, remember that casewise or listwise deletion (the default in SPSS regression and ANOVA) may hamper the reliability and accuracy of your results (see section 2).

 

4. How much data is missing?
SPSS can help you to identify the amount of missing data. When you are interested in the percentage of missing values for each variable separately (e.g. item on a questionnaire) use the Frequency option in SPSS:
  1. Select Analyze → Descriptive Statistics → Frequencies;
  2. Move all variables into the “Variable(s)” window;
  3. Click OK.
The “Statistics” box tells you the number of missing values for each variable. However, be aware that this only gives you information about the missing values for each variable separately. It is more important to study the overall percentage of missing data, especially when you use several variables in your analysis.
When you are interested in the full percentage of missing data use the following option:
  1. Select Analyze → Multiple Imputation → Analyze patterns;
  2. Move all variables into the “Variable(s)” window;
  3. Click OK.
The output tells you the percentage of variables with missing data, the percentage of cases with missing data, and the number of missing values. The final pie chart tells you the full percentage of missing data. Note the 5% borderline. Patterns of missing data are also presented.
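The same quantities can be computed by hand, which helps in understanding what the SPSS output reports. This is a minimal Python sketch on made-up data. It assumes that missing values are stored as None, and the function name missing_summary and its output keys are hypothetical.

```python
def missing_summary(rows, variables):
    """Summarise missing data per variable, per case and over all values."""
    n_cases = len(rows)
    # Percentage of missing values for each variable separately.
    per_variable = {
        var: 100.0 * sum(1 for row in rows if row[var] is None) / n_cases
        for var in variables
    }
    # Cases with at least one missing value on the selected variables.
    incomplete_cases = sum(
        1 for row in rows if any(row[var] is None for var in variables)
    )
    # Missing cells over the whole cases-by-variables grid.
    missing_cells = sum(
        1 for row in rows for var in variables if row[var] is None
    )
    variables_with_missing = sum(1 for pct in per_variable.values() if pct > 0)
    return {
        "pct_missing_per_variable": per_variable,
        "pct_variables_with_missing": 100.0 * variables_with_missing / len(variables),
        "pct_cases_with_missing": 100.0 * incomplete_cases / n_cases,
        "pct_values_missing": 100.0 * missing_cells / (n_cases * len(variables)),
    }

rows = [                                   # made-up example data
    {"age": 34, "income": 2100, "sbp": 120},
    {"age": 51, "income": None, "sbp": 135},
    {"age": None, "income": None, "sbp": 128},
    {"age": 47, "income": 3300, "sbp": 141},
]
summary = missing_summary(rows, ["age", "income", "sbp"])
for name, value in summary.items():
    print(name, value)
```

In this example only a quarter of all values is missing, yet half of the cases would be dropped by listwise deletion. This is why the percentage per variable alone is not enough.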
Tip: use the Help button, and click “show me” for more information about the options and output in SPSS.
When you want to find out more about the patterns of missing data and the relation between missing data on different variables, use the following option:
  1. Analyze → Missing Value Analysis;
  2. Move all variables of interest into the Quantitative or Categorical Variable(s) window;
  3. Use the ‘patterns’ button to get information about the relation between missing data on more variables;
  4. A tutorial of the Missing Value Analysis procedures in SPSS can be found via the Help button. A user’s guide can be downloaded freely on the internet.
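To see what a table of missing data patterns contains, the patterns can also be tabulated directly. This is a minimal Python sketch on made-up data, again assuming that missing values are stored as None. The function name missing_patterns and the O/M notation are choices made for this illustration and do not reproduce the SPSS output format.

```python
from collections import Counter

def missing_patterns(rows, variables):
    """Count how often each pattern of observed (O) and missing (M) values occurs."""
    return Counter(
        "".join("M" if row[var] is None else "O" for var in variables)
        for row in rows
    )

variables = ["age", "income", "sbp"]
rows = [                                   # made-up example data
    {"age": 34, "income": 2100, "sbp": 120},
    {"age": 51, "income": None, "sbp": 135},
    {"age": None, "income": None, "sbp": 128},
    {"age": 47, "income": 3300, "sbp": 141},
    {"age": 62, "income": None, "sbp": 150},
]

patterns = missing_patterns(rows, variables)
print("pattern order:", ", ".join(variables))
for pattern, count in patterns.most_common():
    print(pattern, count)
```

A pattern that occurs often, such as income missing while the other variables are observed, points to variables whose missingness deserves a closer look.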

 

5. What kind of data is missing?
The next step is to identify the kind of data that is missing. You can find this out from the steps described in section 4.
  1. A single item, or several items, of a questionnaire are missing;
  2. A full questionnaire or a single variable (such as blood pressure);
  3. A measurement wave (in longitudinal / randomized studies).

 

6. What type of missings do you have?
Missing values are either random or non-random. Random missing values may occur because the subject accidentally did not answer some questions. For example, the subject may be tired and/or not paying attention, and miss the question. Random missing values may also result from data entry mistakes. Non-random missing values may occur because subjects purposefully do not answer some questions. Subjects may be reluctant to answer some questions because of social desirability concerns about the content of the question, such as questions about sensitive topics like income, past crimes, sexual history, or prejudice or bias toward certain groups.
Think about your dataset. Is there an option that the missing values are non-random?
In 1976, Rubin developed a typology for missing data (1,2). The types of missings and their descriptions are:
  • MCAR (Missing Completely At Random): The data are MCAR when the probability that a value for a certain variable is missing is unrelated to the values of other observed variables and unrelated to the values of the variable with missing data itself. An example is when respondents accidentally skip questions. In other words, the observed values are a random sample of your dataset as it would have been had it been complete.
  • MAR (Missing At Random, the mechanism seen most of the time): The data are MAR when the probability that a value for a certain variable is missing is related to observed values on other variables. In other words, the probability of missing data can be explained by other variables. An example is when older respondents have more missing values than younger respondents. Within the group of older respondents and within the group of younger respondents the data are still MCAR, and age explains the missingness. Another example is when respondents with low scores on the first wave are not invited for a second wave.
  • MNAR (Missing Not At Random): The data are MNAR when the probability that a value for a certain variable is missing is related to the scores on that variable itself. An example is when respondents with a low income intentionally skip the income question because they feel that answering it violates their privacy. In that case, the probability that an observation is missing depends on information that is not observed, namely the value of the income score, because only low values are missing. MNAR is a serious problem, which cannot be solved with a technique such as multiple imputation.
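The three mechanisms can be made concrete with a small simulation. This is a minimal Python sketch under made-up assumptions: income rises with age, and the missingness probabilities (0.3, 0.5 and 0.1), the age cut-off of 50 and the income cut-off of 3000 are arbitrary choices for illustration. All function names are hypothetical.

```python
import random

def simulate_people(n=20000, seed=1):
    """Simulate (age, income) pairs in which income rises with age (an assumption)."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        age = rng.uniform(20, 80)
        income = 2000 + 20 * age + rng.gauss(0, 500)
        people.append((age, income))
    return people, rng

def make_missing(people, rng, mechanism):
    """Set income to None with a probability that depends on the mechanism."""
    result = []
    for age, income in people:
        if mechanism == "MCAR":
            p_missing = 0.3                              # the same for everyone
        elif mechanism == "MAR":
            p_missing = 0.5 if age >= 50 else 0.1        # depends on observed age
        elif mechanism == "MNAR":
            p_missing = 0.5 if income < 3000 else 0.1    # depends on income itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        result.append((age, None if rng.random() < p_missing else income))
    return result

def mean_income(records, older=None):
    """Mean of the observed incomes, optionally within one age group."""
    incomes = [
        income for age, income in records
        if income is not None and (older is None or (age >= 50) == older)
    ]
    return sum(incomes) / len(incomes)

people, rng = simulate_people()
true_mean = mean_income(people)
true_mean_older = mean_income(people, older=True)
mcar = make_missing(people, rng, "MCAR")
mar = make_missing(people, rng, "MAR")
mnar = make_missing(people, rng, "MNAR")

print(f"complete data, all         : {true_mean:8.1f}")
print(f"MCAR, observed only        : {mean_income(mcar):8.1f}")
print(f"MAR, observed only         : {mean_income(mar):8.1f}")
print(f"MNAR, observed only        : {mean_income(mnar):8.1f}")
print(f"complete data, older group : {true_mean_older:8.1f}")
print(f"MAR, older group only      : {mean_income(mar, older=True):8.1f}")
```

Under MCAR the observed mean stays close to the complete-data mean. Under MAR the overall observed mean is shifted, but within an age group it is close to the complete-data mean of that group, so age explains the missingness. Under MNAR the observed mean is shifted and no observed variable can correct for it.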

 

7. How do you know what kind of missings you have?
There are three kinds of methods.
  1. First, you can inspect the data yourself. Are the missings equally distributed over the data? Are low and/or high scores missing? If the missings are not equally spread, this might be an indication that the data are MNAR. For this method you must know a priori what the distribution of the variable normally is, i.e. is it normal or skewed? You need this information before you can judge which part of the data suffers from missing values. This method only applies if your dataset is large.
  2. Second, SPSS can test whether the respondents with missing data differ from the respondents without missing data on important variables (Analyze → Missing Value Analysis → select important variables → Descriptives → t-test formed by indicator). A significant difference is an indication for MAR. Be aware that if your sample size is large (>500) this t-test might be significant even if the data are truly not MAR, so looking at the means and their difference might be good enough. A very small mean difference might be an indication of MCAR.
  3. In SPSS it is also possible to test for MCAR (Analyze → Missing Value Analysis, EM button). This is called Little's test. A tutorial of the Missing Value Analysis procedures in SPSS can be found via the Help button.
It is important to note that you are not able to test whether your missing data are MAR or MNAR. The above-mentioned procedures (1 and 2) only give you an indication of MCAR data or MAR/MNAR data. Pay attention to the possibility of MNAR, because all analyses have serious problems when your missing data are MNAR.
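The comparison of respondents with and without missing data (method 2) can be sketched as follows. This minimal Python example splits the respondents on an indicator of missing income and compares their mean age. It reports the mean difference and a separate-variance (Welch) t statistic; the choice of the Welch statistic is an assumption of this sketch, no p-value is computed, and the data and function name are made up.

```python
from math import sqrt
from statistics import mean, variance

def compare_by_missingness(rows, target, covariate):
    """Compare a covariate between cases with and without a missing target value.

    Returns (mean difference, Welch t statistic), where the difference is
    'target missing' minus 'target observed'.
    """
    with_missing = [r[covariate] for r in rows
                    if r[target] is None and r[covariate] is not None]
    with_observed = [r[covariate] for r in rows
                     if r[target] is not None and r[covariate] is not None]
    difference = mean(with_missing) - mean(with_observed)
    standard_error = sqrt(
        variance(with_missing) / len(with_missing)
        + variance(with_observed) / len(with_observed)
    )
    return difference, difference / standard_error

rows = [                                   # made-up example data
    {"age": 30, "income": 2100},
    {"age": 35, "income": 2500},
    {"age": 40, "income": 2800},
    {"age": 45, "income": 3300},
    {"age": 60, "income": None},
    {"age": 65, "income": None},
    {"age": 70, "income": None},
]
difference, t_statistic = compare_by_missingness(rows, "income", "age")
print(f"mean age difference: {difference:.1f} years, t = {t_statistic:.2f}")
```

In line with the text above, look at the size of the mean difference first: in a large sample the t statistic can be large even when the difference is too small to matter.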

 

8. How to handle missing data?
Missing data is random:
For MCAR and MAR, many missing data methods have been developed in the last two decades (3). Although MCAR seems to be the least problematic mechanism, deleting cases can still reduce the power of finding an effect. It is argued that the MAR mechanism is most frequently seen in practice. An argument for this is that in most research multifactorial or multivariable problems are studied, so when data on variables are missing it is mostly related to other variables in the dataset.

 

Missing data is not random:
For MNAR, imputation is not sufficient, because the missing data are totally different from the available data, i.e. the persons with complete data have become a selective group. If you think your data are MNAR, it might be wise to contact a statistician from EMGO+ who is willing to help you.
For MCAR and MAR, there are roughly two kinds of techniques for imputation:
  1. Single imputation is possible in SPSS and is an easy way to handle missings when just a few cases are missing (less than 5%) and you think your missing values are MCAR or MAR. However, after single imputation the cases are more similar, which may result in an underestimation of the standard errors, i.e. narrower confidence intervals. This increases the chance of a type 1 error (the null hypothesis of no effect is rejected, while there is truly no effect). Therefore, this method is less adequate when you have more than 5% missing data. This is also the case when item scores are missing in questionnaires (4).
  2. Multiple imputation is more complex, but is also implemented in SPSS 17.0 and later versions. Multiple imputation takes into account the uncertainty about the missing values (which is present in all values of variables) and is therefore preferred over single imputation. When the amount of missing data is high (it exceeds 5% in several variables and different persons), multiple imputation is more adequate. Multiple imputation works for total scores in questionnaires as well as for item scores in questionnaires (4,5,6).

 

Imputation techniques
Single imputation
Single imputation techniques are based on the idea that in a random sample every person can be replaced by a new person, given that this new person is randomly chosen from the same source population as the original person. In that case you can use the observed data of the other persons to estimate the distribution of the test result in the source population. It is called single imputation because each missing value is imputed once.
There are many methods for single imputation, such as replacement by the mean or by values from a regression, expectation maximization (EM), and stochastic regression imputation. Expectation maximization estimates a correlation matrix for the incomplete data by assuming the shape of a distribution for the missing data, and imputes missing values based on the likelihood under that distribution. Single imputation is possible in SPSS (Analyze → Missing Value Analysis → EM button for expectation maximization). Further, to use single stochastic regression imputation, you can perform multiple imputation once, i.e. generate one imputed dataset. Single stochastic regression imputation may be an improvement over single regression imputation because imputation uncertainty is accounted for by adding noise (error) to the imputed values. However, it is still not the best solution and will underestimate the standard errors, like all single imputation procedures. These procedures are therefore not recommended.
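The underestimation of the standard errors can be seen in a very small example. This minimal Python sketch uses replacement by the mean, the simplest single imputation method, on made-up scores. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def impute_mean(values):
    """Replace every missing value (None) by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standard_error_of_mean(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(values) / sqrt(len(values))

scores = [2, 4, None, 6, 8, None]          # made-up example data
observed = [v for v in scores if v is not None]
imputed = impute_mean(scores)

print("observed only :", observed,
      "SD", round(stdev(observed), 3),
      "SE", round(standard_error_of_mean(observed), 3))
print("mean imputed  :", imputed,
      "SD", round(stdev(imputed), 3),
      "SE", round(standard_error_of_mean(imputed), 3))
```

The imputed values all lie exactly on the mean, so the standard deviation shrinks while the sample size appears larger. The standard error becomes smaller although no new information was added.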

 

Multiple imputation (MI)
The difference with single imputation is that in MI each missing value is imputed several times (5), so that several imputed datasets are created. The different imputations are based on random draws from different estimates of the underlying distribution in the source population. This builds the uncertainty about the missing values into the datasets. Therefore the standard error increases and becomes a better estimate of the correct standard error. The number of imputations depends on the amount of missing data (7). For example, when 10% of the cases have missing data in a multivariable model, 10 imputed datasets have to be generated, when 15% have missing data, 15 imputed datasets, etc. A drawback of this method is that the statistical analysis has to be repeated in each imputed dataset. Finally, the results have to be pooled into a summary measure. Most statistical packages can do this automatically. Multiple imputation is possible in SPSS version 17 and later (Analyze → Multiple Imputation → Impute Missing Data Values). For more information see the references.
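The pooling step and the rule of thumb for the number of imputations can be sketched in a few lines. This minimal Python example assumes that the pooling is done with Rubin's rules, the usual method for combining results from imputed datasets: the pooled estimate is the mean of the estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The estimates and standard errors below are made up, the function names are hypothetical, and the minimum of one imputation is a choice of this sketch.

```python
from math import ceil, sqrt
from statistics import mean, variance

def pool_rubin(estimates, standard_errors):
    """Pool one estimate over m imputed datasets with Rubin's rules.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean([se ** 2 for se in standard_errors])   # mean squared SE
    between = variance(estimates)                        # variance between datasets
    total = within + (1 + 1 / m) * between
    return pooled_estimate, sqrt(total)

def suggested_number_of_imputations(percent_incomplete_cases):
    """Rule of thumb from the text: one imputed dataset per percent of incomplete cases."""
    return max(1, ceil(percent_incomplete_cases))

# Made-up regression coefficients and standard errors from m = 3 imputed datasets.
estimates = [1.0, 1.2, 0.8]
standard_errors = [0.5, 0.5, 0.5]

pooled_estimate, pooled_se = pool_rubin(estimates, standard_errors)
print(f"pooled estimate {pooled_estimate:.2f}, pooled standard error {pooled_se:.3f}")
print("imputations for 15% incomplete cases:", suggested_number_of_imputations(15))
```

The pooled standard error is larger than the standard error within each single dataset, because the variation between the imputed datasets is added. This is the extra uncertainty that single imputation leaves out.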

 

Sensitivity analysis
After imputation, sensitivity analysis is needed to determine how your substantive results depend on how you handled the missing data.
Follow these steps:
  1. Do a complete case analysis (default option in SPSS; cases with missings are not included);
  2. Repeat the analysis on the imputed data;
  3. Compare substantive conclusions, decide how to report.

 

Imputation of descriptive statistics
A different approach may be used for descriptive studies. If you want to show the (observed) study data (means and standard deviations), for example to compare them with other countries/settings, without directly linking them to a conclusion, imputation is not immediately needed. However, to use statistics (t-tests, regressions, etc.) complete data analysis would certainly be needed. Do not forget the reviewer, who may have problems with the use of imputed and non-imputed data in one paper. Be clear about imputation and point out why you chose to present imputed or non-imputed data. Also, take your missing data evaluation and solution seriously. The missing data and imputation analysis can take as long as the normal data analysis, and for complex imputation models even longer.

 

9. When is imputation of missing data not necessary?
When missing data are MCAR or MAR and you use maximum likelihood estimation techniques in analyses such as Structural Equation Modelling (SEM) or Linear Mixed Models (LMM), imputation is not necessary for missing outcome data in longitudinal and multilevel study designs (8,9,10). These techniques use all available data, ignore the missing values, and still give correct results, so you do not need an extra imputation technique to handle your missing values. This is different for missing data in covariates in longitudinal and multilevel study designs. In these situations multiple imputation is indicated, although more complex imputation models have to be used (11). This is also the case for missing data that are MNAR, for which specific models are available, such as selection or pattern-mixture models.

 

10. Summary
  • Make every effort to avoid missing data, or failing that, to understand how much and why data is missing.
  • Understand missing data mechanisms (MCAR, MAR, MNAR) and their implications.
  • Avoid default methods (listwise deletion, pairwise deletion).
  • Avoid default fixups (mean imputation, etc.) where possible.
  • Use multiple imputation to take proper account of missings.
  • Do a sensitivity analysis.

 

Appendices/references/links

  1. Rubin DB. Inference and Missing Data. Biometrika 1976; 63(3):581-592.
  2. Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 2006;59(10):1087-91.
  3. Schafer, J.L. & Graham, J.W. Missing data: Our view of the state of the art. Psychological Methods 2002;7:147-177.
  4. Eekhout, I., de Vet, H.C.W., Twisk, J.W.R., Brand, J.P.L., de Boer, M.R., & Heymans, M.W. (2013). Missing data in a multi-item instrument were best handled by multiple imputation at the item score level. Journal of Clinical Epidemiology, 67(3), 335-342.
  5. van Buuren, S. (2012). Flexible Imputation of Missing Data (Chapman & Hall/CRC Interdisciplinary Statistics). Chapman and Hall/CRC.
  6. Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009;338:2393.
  7. White I, Royston P, Wood A. Multiple imputation using chained equations: issues and guidance for practice. Stat Med 2011; 30: 377–399.
  8. Eekhout, I., Enders, C.K., Twisk, J.W.R., De Boer, M.R., de Vet, H.C.W., Heymans, M. W. (2015). Analyzing Incomplete Item Scores in Longitudinal Data by Including Item Score Information as Auxiliary Variables. Structural Equation Modeling: A Multidisciplinary Journal, 00, 1-15.
  9. Eekhout, I., Enders, C.K., Twisk, J.W.R., De Boer, M.R., de Vet, H.C.W., Heymans, M. W. (2015). Longitudinal data analysis with auxiliary item information to handle missing questionnaire data. Journal of Clinical Epidemiology, 68(6):637-645.
  10. Twisk J, de Boer M, de Vente W, Heymans M. Multiple imputation of missing values was not necessary before performing a longitudinal mixed-model analysis. J Clin Epidemiol. 2013 Sep;66(9):1022-8.
  11. Enders CK, Mistler SA, Keller BT. Multilevel multiple imputation: A review and evaluation of joint modeling and chained equations imputation. Psychol Methods. 2016 Jun;21(2):222-40.

 

APH expert on Missing Data:

Martijn Heymans:  mw.heymans@vumc.nl

 

Course:
At the website of Epidm you can find information about a course about Missing data, see:
https://www.epidm.nl/en/courses/missing-data-consequences-and-solutions/

 

Audit questions

  1. Has a correct solution been chosen for the analyses with a high percentage of missing data?

 

V2.0: 12 May 2015: Revision format
V1.1: 5 July 2012: Addition to section: When is imputation of missing data not necessary?
V1.0: 1 Dec 2011