Quality Handbook
To maintain and, where possible, improve the quality of data files
Errors and inconsistencies
Once the data have been entered (see 1.2-02), checked (see 1.3-03), cleaned (see 1.3-04) and transformed (see 1.3-05), they are, in principle, ready for analysis. Inconsistencies may nevertheless still come to light during the analysis phase, because the data are now examined with a specific question in mind and can therefore be scrutinised far more closely than during data cleaning. A system is therefore needed for recording and applying subsequent corrections. Corrections should preferably be made by a single individual, and the corrected file should be given a new name to distinguish it from the original file.
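One possible way to organise such a correction system is sketched below in Python. It is an illustration only, not a prescribed procedure: the variable names (respnr, age), the file name and the layout of the correction log are assumptions made for this example. Corrections are kept in a separate log, applied to a copy of the data, and the corrected file receives a new name.

```python
from pathlib import Path

# Hypothetical original data: one row per respondent.
original_rows = [
    {"respnr": "1", "age": "34", "sex": "F"},
    {"respnr": "2", "age": "340", "sex": "M"},  # implausible age found during analysis
]

# Hypothetical correction log, maintained by the one person responsible.
corrections = [
    {"respnr": "2", "variable": "age", "old": "340", "new": "34",
     "reason": "typing error, checked against the paper form"},
]

def apply_corrections(rows, log):
    """Return corrected copies of the rows; the original rows stay untouched."""
    corrected = [dict(row) for row in rows]
    by_key = {row["respnr"]: row for row in corrected}
    for entry in log:
        row = by_key[entry["respnr"]]
        # Only apply a correction if the stored value is the one the log expects.
        if row[entry["variable"]] != entry["old"]:
            raise ValueError(f"unexpected value for respondent {entry['respnr']}")
        row[entry["variable"]] = entry["new"]
    return corrected

def corrected_name(filename, version):
    """Build a new file name so the corrected file never overwrites the original."""
    path = Path(filename)
    return path.with_name(f"{path.stem}_corr{version}{path.suffix}").name

corrected_rows = apply_corrections(original_rows, corrections)
new_filename = corrected_name("survey.csv", 1)
```

Because each log entry records the old value, the new value and the reason, every change to the data remains traceable to its source.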
Longitudinal files
Longitudinal research offers more opportunities to find inconsistencies. Because new data regularly become available for the individuals in the sample, each new round of data collection provides a further chance to detect errors and inconsistencies. For instance, a respondent may report being widowed or divorced in one interview and report never having been married in the next. The researcher will then need other information to ascertain which answer lies closest to the truth. In this example, it could be checked whether the respondent is old enough to have been married.
In longitudinal data it is good practice to maintain a “cross-sectional” file, in which the corrections based on the longitudinal information have not been made, alongside a file in which the longitudinal data have been cleaned.
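The marital status example can be turned into an automated check along the following lines. This is a minimal sketch: the status labels and the structure of the interview history are assumptions made for illustration. The check only flags the conflict; deciding which answer is correct still requires other information.

```python
# Statuses that imply the respondent has been married at some point.
EVER_MARRIED = {"married", "widowed", "divorced"}

def find_marital_inconsistencies(waves):
    """Flag waves where 'never married' is reported after an ever-married status.

    waves: list of (wave_number, status) tuples in chronological order.
    """
    flagged = []
    ever_married_seen = False
    for wave, status in waves:
        if status in EVER_MARRIED:
            ever_married_seen = True
        elif status == "never married" and ever_married_seen:
            flagged.append(wave)
    return flagged

# Hypothetical interview history for one respondent.
history = [(1, "widowed"), (2, "never married"), (3, "widowed")]
flagged_waves = find_marital_inconsistencies(history)
```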
Keeping it up to date
Official classifications are often used for coding diseases, hospital admissions and use of medicines. Examples are the International Classification of Diseases (ICD), the Diagnostic and Statistical Manual (DSM) for psychiatric disorders, and the Anatomical Therapeutic Chemical classification (ATC) for medicines. These classifications are not fixed and may change in response to new insights. For instance, in the 1990s the ICD-9 was revised to create the ICD-10 and the DSM-III-R was revised to the DSM-IV; the ATC changes virtually every year. To maintain comparability between research data covering different years, either the coding of the old data has to be adapted to the new classification, or an algorithm has to be created that links the coding systems in the data files to each other. In both cases it is important to document clearly which files and variables are coded according to which classification.
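A linking algorithm of this kind can be as simple as a lookup table. The sketch below assumes a hand-made crosswalk; the codes shown are invented placeholders and do not correspond to real ICD, DSM or ATC codes. The original code is retained next to the new one, and records without a mapping are set aside for manual review rather than guessed.

```python
# Hypothetical crosswalk from an old to a new classification version.
# Several old codes may map onto the same new code.
CROSSWALK = {"A01": "X10", "A02": "X10", "B07": "Y22"}

def recode(records, crosswalk):
    """Recode records to the new classification, keeping the original code."""
    recoded, unmapped = [], []
    for record in records:
        new_code = crosswalk.get(record["code"])
        if new_code is None:
            unmapped.append(record)  # needs manual review
        else:
            recoded.append({**record,
                            "orig_code": record["code"],
                            "code": new_code,
                            "code_system": "new"})
    return recoded, unmapped

old_records = [
    {"respnr": 1, "code": "A01", "code_system": "old"},
    {"respnr": 2, "code": "C99", "code_system": "old"},  # not in the crosswalk
]
recoded, unmapped = recode(old_records, CROSSWALK)
```

Storing the classification version in a variable of its own (here code_system) is one way of documenting which records follow which classification.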
Accessibility
(See Codebook, 1.1C-03, as well)
In complex studies, where data are derived from different sources or (as in longitudinal research) are collected at different observation time points, there will be multiple data files. In longitudinal research the number of data files will also increase in the course of the study. For accessibility it is important that files and variables are named logically. File names should make the origin of each file easily recognisable, and the same applies to variable names. In particular, variables with the same content should have a slightly different name in each file, so that the origin of each variable can be recognised. For instance, the variable “marital status” could have been obtained from the general practitioner or from the respondent. In the general practitioner file (Huisarts in Dutch) the variable name could therefore be prefixed with an “H” (e.g. HMARST), and in the respondents’ file with an “R” (e.g. RMARST). The same variable name should not be used in different files.
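The prefix convention can be applied mechanically, for example as follows. This is a sketch under assumptions: the key variable name RESPNR and the source labels are chosen for illustration, and the key variable is deliberately left unprefixed so that the files can still be linked on it.

```python
# Prefix per data source: H for the general practitioner (Huisarts), R for the respondent.
SOURCE_PREFIX = {"gp": "H", "respondent": "R"}

def prefix_variables(row, source, key="RESPNR"):
    """Prefix every variable name with its source letter, except the key variable."""
    prefix = SOURCE_PREFIX[source]
    return {(name if name == key else prefix + name): value
            for name, value in row.items()}

gp_row = prefix_variables({"RESPNR": 17, "MARST": "married"}, "gp")
resp_row = prefix_variables({"RESPNR": 17, "MARST": "divorced"}, "respondent")
```

After renaming, the two versions of marital status can sit side by side in one linked file without overwriting each other.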
Furthermore, it is important that the files can be linked in a unique way, preferably by using one key variable with a unique value for each sample unit. The key variable should not have any inherent significance. In most cases the most convenient variable to use for this is the “respondent number”.
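Before files are linked, it is worth verifying that the key variable really is unique in each file. The following sketch assumes a key variable called RESPNR and small in-memory files; both are illustrative choices, not part of this guideline.

```python
from collections import Counter

def duplicate_keys(rows, key="RESPNR"):
    """Return the key values that occur more than once in a file."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

def link(left, right, key="RESPNR"):
    """Link two files one-to-one on the key variable; refuse if a key is duplicated."""
    for rows in (left, right):
        dupes = duplicate_keys(rows, key)
        if dupes:
            raise ValueError(f"key variable {key} is not unique: {dupes}")
    right_by_key = {row[key]: row for row in right}
    # Keep only sample units that are present in both files.
    return [{**row, **right_by_key[row[key]]}
            for row in left if row[key] in right_by_key]

# Hypothetical files from two sources, with source-prefixed variable names.
gp_file = [{"RESPNR": 1, "HMARST": "married"}, {"RESPNR": 2, "HMARST": "widowed"}]
resp_file = [{"RESPNR": 1, "RMARST": "married"}]
linked = link(gp_file, resp_file)
```

In this sketch, sample units missing from either file are dropped from the linked result; whether that is appropriate depends on the analysis.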
V1.1: 1 Jan 2010: English translation
V1.0: 24 Apr 2004.