To give researchers a structured guideline for handling missing data
Missing data is a common problem in all kinds of research. How you deal with it depends on how much data is missing, what kind of data is missing (single items, a full questionnaire, a measurement wave), and why the data are missing. Handling missing data is an important step in several phases of your study.
4.2 Why do you need to do something with missing data?
By default, SPSS excludes cases with missing values from the analyses. Deleting cases or persons has two consequences. First, it results in a smaller sample size and larger standard errors. As a result, the power to find a significant result decreases: the chance that you correctly accept the alternative hypothesis of an effect (compared to the null hypothesis of no effect) is smaller. Second, you may introduce bias in effect estimates, such as mean differences (from t-tests) or regression coefficients (from regression analyses). When the group of non-responders is large and you delete them, the characteristics of your sample may differ from those of your original sample and of the population you study, because responders and non-responders can differ in their characteristics. Therefore, always inspect the missing data in your data set before starting your analyses, and never simply delete persons with missing values (the default option in SPSS).
4.3 What to do with missing data in different phases of your study
If you work with questionnaires, make sure that all questions are clear and applicable to your respondents. If necessary, use the ‘not applicable’ answer option. To decrease the chance of missing data, use digital applications to collect your data, such as web-based questionnaires in which you can make answering a question mandatory. You can also use these applications to send reminders and to track the respondents’ progress. If you work with physical or physiological data, the most frequent cause of missing data is a technical problem with the instruments. Testing the instruments in a pilot study helps to prevent some of these problems.
Closely monitor the completeness of the data when you receive or obtain the data. When you detect missing data during data collection, try to complete your data. Look back in the raw data (questionnaires), or ask your respondents to fill out the missing items. Describe in your logbook why data are missing. This helps you to decide whether data are missing at random or not.
Investigate the amount of missing data you have (see 4.4), estimate the need for imputation, and think about the most adequate imputation method (see 4.5 and further).
If you have missing values in your data set when starting your analyses, remember that casewise and listwise deletion (the default in SPSS regression and ANOVAs) may hamper the reliability of your results (see 4.2).
4.4 How much data is missing?
SPSS can help you to identify the amount of missing data. When you are interested in the percentage of missing values for each variable separately (e.g. an item on a questionnaire), use the Frequencies option in SPSS:
1. Select Analyze → Descriptive Statistics → Frequencies
2. Move all variables into the “Variable(s)” window.
3. Click OK. The “Statistics” box tells you the number of missing values for each variable.
However, be aware that this only gives you information about the percentage of missing values for each variable separately. It is more important to study the full percentage of missing data, especially when you use more variables in your analysis.
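The following sketch illustrates why per-variable percentages can understate the problem. It assumes, purely for illustration, that values are missing independently across variables and that every variable has the same missing rate; the function name `expected_complete_fraction` is hypothetical and not part of SPSS.

```python
def expected_complete_fraction(missing_rate, n_variables):
    """Expected share of cases without any missing value.

    Illustrative assumption: each variable is missing independently
    with the same probability `missing_rate`.
    """
    return (1.0 - missing_rate) ** n_variables


# 5% missing per variable looks harmless, but the share of fully
# observed cases shrinks quickly as more variables enter the analysis.
for k in (1, 5, 10, 20):
    share = expected_complete_fraction(0.05, k)
    print(f"{k:2d} variables: {share:.1%} complete cases expected")
```

In real data, missing values usually cluster in the same persons, so the actual share of complete cases can be higher than this independence assumption suggests. The point is only that it should be checked across all analysis variables together.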
When you are interested in the full percentage of missing data use the following option:
- Select Analyze → Multiple Imputation → Analyze Patterns
- Move all variables into the “Variable(s)” window.
- Click OK. The output tells you the percentage of variables with missing data, the percentage of cases with missing data, and the number of missing values. The final pie chart tells you the full percentage of missing data. Note the 5% borderline. Patterns of missing data are also presented.
- Tip: use the Help button, and click “show me” for more information about the options and output in SPSS.
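For readers who want to check these three percentages outside SPSS, here is a minimal sketch in plain Python. It assumes the data are available as a list of records in which a missing value is coded as `None`; the function name `missing_summary` and the toy data are hypothetical.

```python
def missing_summary(rows, variables):
    """Summarise missingness in a list of dict records (None = missing)."""
    n_cases = len(rows)
    # Percentage of missing values for each variable separately.
    per_variable = {
        v: 100.0 * sum(r[v] is None for r in rows) / n_cases
        for v in variables
    }
    incomplete_cases = sum(any(r[v] is None for v in variables) for r in rows)
    missing_values = sum(r[v] is None for r in rows for v in variables)
    return {
        "pct_missing_per_variable": per_variable,
        "pct_variables_with_missing":
            100.0 * sum(p > 0 for p in per_variable.values()) / len(variables),
        "pct_cases_with_missing": 100.0 * incomplete_cases / n_cases,
        "pct_values_missing": 100.0 * missing_values / (n_cases * len(variables)),
    }


# Hypothetical toy data: four respondents, three variables.
data = [
    {"age": 34, "income": 2500, "bp": 120},
    {"age": 51, "income": None, "bp": 135},
    {"age": None, "income": None, "bp": 128},
    {"age": 47, "income": 3100, "bp": 140},
]
summary = missing_summary(data, ["age", "income", "bp"])
for key, value in summary.items():
    print(key, value)
```

In this toy example a quarter of all values is missing, yet half of the cases would be lost under listwise deletion.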
When you want to find out more about the patterns of missing data and the relation between missing data between variables, use the following option:
- Select Analyze → Missing Value Analysis
- Move all variables of interest into the Quantitative or Categorical Variable(s) window.
- Use the ‘Patterns’ button to get information about how missing data on different variables are related.
- A tutorial of the Missing Value Analysis (SPSS 16 and further) procedures in SPSS can be found via the Help button. A user’s guide can be downloaded freely on the internet.
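The idea behind a missing data pattern table can be sketched in a few lines of plain Python. This is an illustration of the concept, not a reproduction of the SPSS output; the function name `missing_patterns` and the toy data are hypothetical, and missing values are assumed to be coded as `None`.

```python
from collections import Counter


def missing_patterns(rows, variables):
    """Count how often each missingness pattern occurs.

    A pattern is the tuple of variable names that are missing in a case;
    the empty tuple stands for a complete case.
    """
    return Counter(
        tuple(v for v in variables if r[v] is None) for r in rows
    )


# Hypothetical toy data: five respondents, three variables.
data = [
    {"age": 34, "income": 2500, "bp": 120},
    {"age": 51, "income": None, "bp": 135},
    {"age": None, "income": None, "bp": 128},
    {"age": 47, "income": 3100, "bp": 140},
    {"age": 62, "income": None, "bp": 150},
]
patterns = missing_patterns(data, ["age", "income", "bp"])
for pattern, count in patterns.most_common():
    label = ", ".join(pattern) if pattern else "(complete)"
    print(f"{count} case(s) missing: {label}")
```

A pattern that occurs often, such as income being missing on its own, is a hint to look for a reason in the data collection.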
4.5 What kind of data is missing?
The next step is to identify the kind of data that is missing. You can find this information by following the steps described in 4.4.
- A single item or several items of a questionnaire
- A full questionnaire or a single variable (such as blood pressure)
- A measurement wave (in longitudinal / randomized studies)
The way you deal with missing data depends on the type of missing data.
4.6 What type of missings do you have?
Missing values are either random or non-random. Random missing values may occur because the subject accidentally did not answer some questions. For example, the subject may be tired and/or not paying attention, and misses the question. Random missing values may also result from data entry mistakes. Non-random missing values may occur because subjects purposefully do not answer some questions. For example, the question may be confusing, so respondents do not answer it. The question may also lack appropriate answer choices, such as “no opinion” or “not applicable”, so the subject chooses not to answer it. Finally, subjects may be reluctant to answer some questions because of social desirability concerns about the content, such as questions about sensitive topics like income, past crimes, sexual history, or prejudice or bias toward certain groups.
Think about your dataset. Is it possible that the missing values are non-random?
Rubin (1976) developed a typology for missing data.
Type of missings
MCAR: Missing Completely At Random:
The data are MCAR when the probability that a value for a certain variable is missing is unrelated to the values of other observed variables and unrelated to the values of the variable with missing values itself. An example is when respondents accidentally skip questions. In other words, the observed values in your dataset are just a random sample of the values the dataset would have contained had it been complete.
MAR: Missing at Random (most of the time)
The data are MAR when the probability that a value for a certain variable is missing is related to observed values on other variables, but not to the value of the variable itself once those other variables are taken into account. An example is when older respondents have more missing values than younger respondents. However, within the group of older respondents and within the group of younger respondents, the data are still MCAR. Another example is when respondents with low scores on the first wave are not invited for a second wave.
MNAR: Missing Not At Random:
The data are MNAR when the probability that a value for a certain variable is missing is related to the scores on that variable itself. An example is when respondents with a low income intentionally skip the income question because they feel that answering it violates their privacy. In that case, the probability that an observation is missing depends on information that is not observed, namely the value of the income score, because only low values are missing. MNAR is a serious problem that cannot be solved with a technique such as multiple imputation.
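The three mechanisms can be made concrete with a small simulation. The sketch below is an illustration under invented assumptions: the variables `age` and `income`, their relation, the missingness probabilities and the age cut-off of 50 are all hypothetical and chosen only to make the effect visible.

```python
import random
from statistics import mean

random.seed(1)  # arbitrary seed, only to make the sketch reproducible

n = 2000
age = [random.uniform(20, 80) for _ in range(n)]
# Hypothetical relation: income rises with age, plus random noise.
income = [1000 + 30 * a + random.gauss(0, 500) for a in age]
cutoff = mean(income)


def observed_mean(values, keep):
    """Mean of the values that are not missing."""
    return mean([v for v, k in zip(values, keep) if k])


# MCAR: every income value has the same 30% chance of being missing.
keep_mcar = [random.random() > 0.30 for _ in range(n)]
# MAR: missingness depends only on the observed variable age
# (50% missing above age 50, 10% missing otherwise).
keep_mar = [random.random() > (0.50 if a > 50 else 0.10) for a in age]
# MNAR: missingness depends on income itself; half of the
# below-average incomes are withheld, higher incomes are always reported.
keep_mnar = [(y >= cutoff) or (random.random() > 0.50) for y in income]

full_mean = mean(income)
mcar_mean = observed_mean(income, keep_mcar)
mar_mean = observed_mean(income, keep_mar)
mnar_mean = observed_mean(income, keep_mnar)

print(f"complete data: {full_mean:8.1f}")
print(f"MCAR observed: {mcar_mean:8.1f}")
print(f"MAR observed:  {mar_mean:8.1f}")
print(f"MNAR observed: {mnar_mean:8.1f}")
```

Under MCAR the observed mean stays close to the complete-data mean. Under MAR the unadjusted observed mean is shifted, but the shift can be explained by the observed variable age. Under MNAR the observed mean is too high, and nothing in the observed data reveals by how much.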
How do you know what kind of missings you have?
There are three kinds of methods.
- First, you can inspect the data yourself. Are the missings equally distributed in the data? Are low and / or high scores missing? If the missings are not equally spread, this might be an indication that the data are MNAR. With this method you must know a priori what the distribution of the variable normally is, i.e. is it normal or skewed? You need this information before you can judge which part of the data suffers from missing values. This method only applies if your dataset is large.
- Second, SPSS can test whether the respondents with missing data differ from the respondents without missing data on important variables (Analyze → Missing Value Analysis → select important variables → Descriptives → t-tests with groups formed by indicator variables). A significant difference is an indication of MAR. Be aware that if your sample size is large (>500), this t-test might be significant even if the data are truly not MAR. So, just looking at the means and their difference might be good enough. If the mean difference is very small, this might be an indication of MCAR.
- Third, in SPSS (Analyze → Missing Value Analysis, EM button) it is also possible to test whether the data are MCAR. This is called Little’s test. A tutorial of the Missing Value Analysis (SPSS 16 and further) procedures in SPSS can be found via the Help button.
It is important to note that you cannot test whether your missing data are MAR or MNAR. The above-mentioned procedures (1 and 2) will only give you an indication. Pay attention to the possibility of MNAR, because all analyses have serious problems when your missing data are MNAR.
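The second method, comparing respondents with and without missing data on another variable, can be sketched as follows. This is a simplified illustration and not the SPSS procedure itself: it computes a Welch t statistic by hand, the function name `welch_t_by_indicator` and the toy data are hypothetical, and missing values are assumed to be coded as `None`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_by_indicator(rows, target, other):
    """Compare `other` between cases with and without a value on `target`.

    Returns the mean difference (observed minus missing group) and the
    Welch t statistic. Cases without a value on `other` are skipped.
    """
    present = [r[other] for r in rows
               if r[target] is not None and r[other] is not None]
    absent = [r[other] for r in rows
              if r[target] is None and r[other] is not None]
    diff = mean(present) - mean(absent)
    se = sqrt(variance(present) / len(present)
              + variance(absent) / len(absent))
    return diff, diff / se


# Hypothetical toy data: income is missing mainly for older respondents.
data = [
    {"age": 30, "income": 2400}, {"age": 35, "income": 2600},
    {"age": 40, "income": 2900}, {"age": 45, "income": 3100},
    {"age": 60, "income": None}, {"age": 65, "income": None},
    {"age": 70, "income": None},
]
diff, t = welch_t_by_indicator(data, target="income", other="age")
print(f"mean age difference: {diff:.1f}, t = {t:.2f}")
```

A large difference in age between the two groups suggests that missingness on income is related to age, which points away from MCAR. As noted above, it cannot distinguish MAR from MNAR.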
4.7 How to handle missing data?
Missing data is random:
For MCAR and MAR, many missing data methods have been developed in the last two decades (Schafer & Graham, 2002). Although MCAR seems to be the least problematic mechanism, deleting cases can still reduce the power of finding an effect. It is argued that the MAR mechanism is most frequently seen in practice. An argument for this is that in most research multifactorial or multivariable problems are studied, so when data on variables are missing it is mostly related to other variables in the dataset.
Missing data is not random:
For MNAR, imputation is not sufficient, because the missing data are totally different from the available data, i.e. the complete cases have become a selective group of persons. If you think your data are MNAR, it might be wise to contact a statistician from EMGO+ who is willing to help you.
For MCAR and MAR, there are roughly two kinds of imputation techniques: single imputation and multiple imputation.
Single imputation is possible in SPSS and is an easy way to handle missings when just a few cases are missing (less than 5%) and you think your missing values are MCAR or MAR. However, after single imputation the cases are more similar, which may result in an underestimation of the standard errors, i.e. narrower confidence intervals. This increases the chance of a type I error (the null hypothesis of no effect is rejected, while there is truly no effect). Therefore, this method is less adequate when you have >5% missing data.
Multiple imputation is more complex, but is also implemented in SPSS 17.0 and later versions. Multiple imputation takes into account the uncertainty of missing values (present in all values of variables) and is therefore preferred over single imputation. When your missingness is high (exceeds 5% in several variables and different persons), multiple imputation is more adequate.
Single imputation techniques are based on the idea that in a random sample every person can be replaced by a new person, given that this new person is randomly chosen from the same source population as the original person. In that case you can use the observed available data of the other persons to make an estimation of the distribution of the test result in the source population. It is called single imputation, because each missing is imputed once.
There are many methods for single imputation, such as replacement by the mean, regression, and expectation maximization (EM). Expectation maximization is preferred, because in the other methods the variance and standard error are reduced and the chance of type I errors increases. Expectation maximization forms a missing data correlation matrix by assuming the shape of a distribution for the missing data and imputes missing values based on the likelihood under that distribution. Single imputation is possible in SPSS (Analyze → Missing Value Analysis, EM button for expectation maximization). Contact a statistician from EMGO+ who is willing to help you with this procedure.
For the imputation of a missing score on a single item in a questionnaire (see 4.5), SPSS syntax can be found at:
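The variance reduction caused by replacement by the mean is easy to demonstrate. The sketch below uses invented scores; the function name `mean_impute` is hypothetical and missing values are assumed to be coded as `None`.

```python
from statistics import mean, stdev


def mean_impute(values):
    """Replace every missing value (None) by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical questionnaire scores with three missing values.
scores = [4, 8, None, 6, None, 10, 2, None]
observed = [v for v in scores if v is not None]
imputed = mean_impute(scores)

print(f"observed: n={len(observed)}, mean={mean(observed)}, sd={stdev(observed):.2f}")
print(f"imputed:  n={len(imputed)}, mean={mean(imputed)}, sd={stdev(imputed):.2f}")
```

The mean is unchanged, but the standard deviation shrinks because the imputed values add cases without adding any spread. The larger sample size and smaller spread together make standard errors look smaller than they should be.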
tw.zip: Software for two-way imputation in SPSS. (Van Ginkel & Van der Ark, 2003a), or
rf.zip: Software for response function imputation in SPSS (Van Ginkel & Van der Ark, 2003b).
Multiple imputation (MI)
The difference with single imputation is that in MI each missing value is imputed several times, so that several imputed datasets are created. The different imputations are based on random draws from different estimates of the underlying distribution in the source population. In this way, the imputed data come from different distributions and are therefore less alike. This reflects the uncertainty about the missing values in the dataset, and therefore the standard error increases. The number of imputations depends on the amount of missing data, but mostly 5 to 10 imputations are enough. A drawback of this method is that several imputed datasets are created and that the statistical analysis has to be repeated in each dataset. Finally, the results have to be pooled into a summary measure. Most statistical packages can do this automatically. Multiple imputation is possible in recent versions of SPSS (version 17 and later) via Analyze → Multiple Imputation → Impute Missing Data Values. For more information see the references. Contact a statistician from EMGO+ who is willing to help you with this procedure.
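The pooling step is not spelled out in this guideline. The sketch below assumes the standard pooling rules commonly known as Rubin's rules: the pooled estimate is the average of the estimates, and the pooled variance combines the average within-imputation variance with the between-imputation variance. The function name `pool_rubin` and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool results from m imputed datasets (assumes Rubin's rules).

    estimates: one effect estimate per imputed dataset
    variances: the squared standard error belonging to each estimate
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # spread between imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical regression coefficients from five imputed datasets,
# each with a standard error of 0.5 (variance 0.25).
estimate, se = pool_rubin([1.8, 2.1, 2.4, 1.9, 2.3], [0.25] * 5)
print(f"pooled estimate = {estimate:.2f}, pooled SE = {se:.2f}")
```

The pooled standard error is larger than the standard error within any single imputed dataset, which is how multiple imputation expresses the extra uncertainty caused by the missing values.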
After imputation, sensitivity analysis is needed to determine how your substantive results depend on how you handled the missing data.
Follow these steps:
- Do a complete case analysis (default option in SPSS; cases with missings are not included)
- Repeat the analysis on the data with imputed values
- Compare substantive conclusions, decide how to report.
When is imputation of missing data not necessary?
1. When your missing data are MCAR or MAR, and you use Maximum Likelihood estimation techniques in analyses such as Structural Equation Modelling (SEM) or Linear Mixed Models (LMM), imputation of missing data is not necessary. These techniques use all available data, ignore the missing values, and still give correct results. In such situations you do not have to use an extra imputation technique to handle your missing values. Missing data that are MNAR are still a problem for these methods.
2. A different approach may be used for descriptive studies. If you want to show the (observed) study data (means and standard deviations), for example to compare them with other countries/settings, without directly linking them to a conclusion, imputation is not immediately needed. However, evaluative statistics (t-tests, regressions, etc.) do need complete data. So, if you use statistical tests to compare the descriptives, imputation is needed (depending, of course, on the amount and type of missing data). In that case you link your descriptives to a conclusion and want a corrected p-value / 95% CI, and therefore you need to use the data with imputed values. Do not forget the reviewer, who may sometimes have problems with the use of imputed and non-imputed data in one paper. Be clear about imputation and explain why you chose to present imputed or non-imputed data.
- Make every effort to avoid missing data, or failing that, to understand how much and why data is missing.
- Understand missing data mechanisms (MCAR, MAR, MNAR) and their implications
- Avoid default methods (listwise deletion, pairwise deletion)
- Avoid default fixups (mean imputation, etc.) where possible
- Use multiple imputation to take proper account of missings
- Do a sensitivity analysis
Multiple Imputation Methods, Niels Smits (technical literature).
http://www.ssc.upenn.edu/~allison/MultInt99.pdf (especially for Multiple Imputation)
Ask EMGO+ statisticians for help via: http://www.emgo.nl/kc/preparation/research%20design/3%20Advice%20and%20support.html
EMGO+ experts on Missing Data
Recommended (non-technical) literature.
- Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009;338:b2393. doi: 10.1136/bmj.b2393.
- Allison, P.D. (2001). Missing Data (Sage University Papers Series on Quantitative Applications in the Social Sciences, series no. 07-136). Thousand Oaks: Sage.
- Schafer, J.L. & Graham, J.W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.
- Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol. 2006; 59(10):1087-91. Review.
- http://www.stat.psu.edu/~jls/mifaq.html (Multiple Imputation FAQ page with explanations)
- Van Ginkel, J. R., & Van der Ark, L. A. (2003a). SPSS syntax for two-way imputation of missing test data [computer software and manual]. Retrieved from http://www.tilburguniversity.edu/nl/over-tilburg-university/schools/socialsciences/organisatie/departementen/mto/onderzoek/software/
V 1.0: 1 dec 2011: First version
V1.1: 5-jul-2012 : addition to section When is imputation of missing data not necessary?