Codebook/Data dictionary

Aim

To clarify what constitutes a good codebook: a description of the data as stored in a computer program or dataset (for instance, an SPSS file).

 

Requirements

Create a codebook:
  • To specify the database and the data entry screens;
  • For data entry;
  • As part of the data validation plan;
  • For statistical analysis;
  • For archiving files once the study has been completed, to enable follow-up studies.

 

Documentation

A separate digital codebook for each questionnaire or registration form/CRF.

 

Responsibilities

Executing researcher: To create a good codebook, separately for each questionnaire.
Project leaders: To check with the executing researcher whether the codebook is up to date.
Research assistant: N.a.

 

How To

For any data entry method, the researcher needs to create a codebook (also called a data dictionary) in advance. This also applies if data entry, creation of the database or questionnaire scanning has been outsourced, or if an online questionnaire will be used.
A codebook is required:
  • For the specification of the database and the data entry screens;
  • For data entry;
  • As part of your data validation plan;
  • For statistical analysis;
  • As data documentation when archiving data files once the study has been completed, for instance to enable follow-up studies or to deposit the data in a repository for long-term storage and sharing.
A good codebook consists of columns containing: the item or a description of the item; the item number; the allocated variable name; the variable type, normally numeric (N), alphanumeric/string (A) or a date; the number of characters; and the potential values the variable can take together with their coding, including the coding for (correct/incorrect) missing values. A codebook can also be derived retrospectively from an SPSS file by selecting > File in the main menu of SPSS, and then selecting > Display Data File Information.
When creating a codebook, the questionnaire or registration form is transcribed into variables. The questions are coded, which includes allocating values for missing observations (missing values). The aim is to create a separate codebook for each registration form/questionnaire. Using the codebook, a database and data entry screens can be created in BLAISE, NetQuestionnaires (online questionnaire), SPSS or OpenClinica/Castor.
Note that the details of a codebook depend partly on the database program used.
The codebook allows the individuals entering the data to understand how encoded items should be entered. Furthermore, a codebook is a necessity for statistical analysis using, for instance, SPSS. For archiving purposes, you sometimes need to make a copy of the codebook used for your database program and make it more generic for SPSS, whose (portable) files are also a good standard for archiving.
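As an illustration, the codebook columns described above could be held in a small machine-readable structure and used to check entered values. The sketch below is in Python, which this guideline does not prescribe. The class CodebookEntry, the type code "D" for dates, and the item shown for T0PAIN (its number, wording and value labels) are hypothetical and only serve to show the idea.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class CodebookEntry:
    item_number: str
    description: str
    variable: str
    var_type: str   # "N" numeric, "A" alphanumeric/string, "D" date (assumed code)
    width: int      # maximum number of characters
    values: dict = field(default_factory=dict)   # code -> label
    missing: dict = field(default_factory=dict)  # missing-value code -> label


# Hypothetical example entries.
CODEBOOK = [
    CodebookEntry("-", "Patient ID number", "PATNO", "N", 4),
    CodebookEntry("12", "Pain in the past week", "T0PAIN", "N", 1,
                  values={0: "no", 1: "yes"}, missing={9: "not answered"}),
]


def is_valid(entry, raw):
    """Check one raw (string) data entry value against its codebook entry."""
    if raw == "":
        return True  # blank is accepted as "not applicable"
    if len(raw) > entry.width:
        return False
    if entry.var_type == "N":
        if not raw.isdecimal():
            return False
        if entry.values:
            code = int(raw)
            return code in entry.values or code in entry.missing
    return True


def to_csv(entries):
    """Write the codebook as CSV text, one row per variable."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["item", "description", "variable", "type", "width",
                     "values", "missing"])
    for e in entries:
        writer.writerow([
            e.item_number, e.description, e.variable, e.var_type, e.width,
            "; ".join(f"{k}={v}" for k, v in e.values.items()),
            "; ".join(f"{k}={v}" for k, v in e.missing.items()),
        ])
    return buf.getvalue()
```

A structure like this can serve both as documentation (via to_csv) and as input for a data validation step (via is_valid), so that the two cannot drift apart.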

 

Tips for naming variables
Try to give the variables meaningful names that say something about the nature of the item. Avoid cryptic descriptions. One suggestion is to always start the variable name with the first letter(s) of the relevant form. Use capitals for the variable names for clarity. Use special symbols with care. Avoid spaces and hyphens (-) in variable names. Do not start a variable name with a special character such as %, & or $. Many statistical packages limit the length of variable names; SPSS, for instance, cannot handle names longer than 64 characters. When a longer name is shortened during conversion, there is a risk that the variable name is no longer unique in SPSS.
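The naming tips above can be checked automatically before a database is built. The following Python sketch is one possible way to do so; the function names are hypothetical, and the truncation check assumes that a longer name is shortened by simply cutting it off after 64 characters, which may differ per package.

```python
import re

MAX_LEN = 64  # limit for SPSS mentioned in the text


def name_problems(name, max_len=MAX_LEN):
    """Return a list of naming problems for one variable name."""
    problems = []
    if re.search(r"[\s-]", name):
        problems.append("contains a space or hyphen")
    if name and not name[0].isalnum():
        problems.append("starts with a special character")
    if name != name.upper():
        problems.append("not in capitals")
    if len(name) > max_len:
        problems.append(f"longer than {max_len} characters")
    return problems


def truncation_clashes(names, max_len=MAX_LEN):
    """Group names that become identical when cut to max_len characters."""
    groups = {}
    for name in names:
        groups.setdefault(name[:max_len].upper(), []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

For example, name_problems("T0PAIN") returns an empty list, whereas name_problems("pain-score") reports the hyphen and the lower-case letters.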

 

Examples
VAR. NAME   QUESTION ON FORM
CINCLUDE    Form ‘card index search’, in which a specific inclusion criterion is completed
IORIGIN     Form ‘inclusion consultation’, in which ethnic origin is completed
FREGSYM     Form ‘follow-up consultation’, in which occurrence of regular symptoms is documented
T0SOCIAL    Patient questionnaire at the baseline time point, in which the effect of illness on social life is explored
If the same questions occur at different time points, this needs to be indicated clearly in the variable names by including the relevant time point in the name. For instance: T0PAIN, T1PAIN, T2PAIN (the participant is asked about pain at three time points). This does not, of course, apply to variables with unique values that allow files to be linked together, such as patient ID numbers. These variables should always have the same name for each time point and form, for instance, PATNO.
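The time point convention above can be sketched in Python as follows. This is only an illustration: the function names, the record layout (one dictionary per participant per time point) and the assumption that PATNO is the only linking variable are hypothetical.

```python
# Variables that link files together keep the same name at every time point.
KEY_VARIABLES = {"PATNO"}


def name_for_time_point(base, time_point):
    """Prefix a variable name with T<time point>, unless it is a linking key."""
    if base in KEY_VARIABLES:
        return base
    return f"T{time_point}{base}"


def merge_waves(waves):
    """Merge records from several time points into one record per patient.

    `waves` maps a time point number to a list of records (dicts) whose
    variable names do not yet carry a time point prefix.
    """
    merged = {}
    for time_point, records in sorted(waves.items()):
        for record in records:
            row = merged.setdefault(record["PATNO"], {})
            for variable, value in record.items():
                row[name_for_time_point(variable, time_point)] = value
    return merged
```

Because PATNO keeps the same name everywhere, it can be used to link the records, while PAIN becomes T0PAIN, T1PAIN and so on.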

 

Questions allowing more than one answer
Multiple choice questions that allow more than one answer (such as “choose all your sports activities”) need to be split up in the codebook into as many variables as there are response categories. Each of these variables takes the value 1 if the category was selected and 0 if it was not. Missing values are not possible for this type of question, as it is impossible to distinguish between “response category missed” and “response category not endorsed”. If the item is a conditional question, that is, where the response depends on the answer to a previous question, then a “not applicable” value is also possible.
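A minimal Python sketch of this splitting is given below. The prefix T0SP and the three sports categories are hypothetical, and a “not applicable” answer is represented here by an empty value (None), in line with the convention that such variables are left blank.

```python
SPORT_CATEGORIES = ["RUN", "SWIM", "CYCLE"]  # hypothetical response categories


def split_multiple_answer(prefix, categories, selected, applicable=True):
    """Return one 0/1 variable per response category.

    If the question was not applicable, every variable is left empty (None).
    """
    if not applicable:
        return {f"{prefix}{c}": None for c in categories}
    unknown = set(selected) - set(categories)
    if unknown:
        raise ValueError(f"unknown response categories: {sorted(unknown)}")
    return {f"{prefix}{c}": int(c in selected) for c in categories}
```

A participant who only ticked swimming would be stored as T0SPRUN = 0, T0SPSWIM = 1 and T0SPCYCLE = 0. A participant who ticked nothing gets 0 on all three variables, which is why a separate missing value cannot be assigned.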

 

Two types of missing values
A missing value resulting from an erroneously missing detail (omission, refusal to answer the question, etc.) is usually coded with the highest possible out-of-range value, for instance 9 or 999. For a missing date it is best to use a date from the distant past as the value for the incorrect missing value (usually the date 14-12-1985 is used for this; some data entry programs do this automatically). If the missing value results from a correct missing detail (i.e. the question is not applicable), then the variable is left blank. In SPSS this is automatically coded as system missing (a dot). The data entry program usually fills in these values automatically when a non-applicable question has been skipped. Particularly in older files, a value other than <empty> may have been entered for a correct missing value, for instance 8 or 88. As statistical programs do not process alphanumeric variables, no distinction is made between correct and erroneous missing alphanumeric values. In both instances the cell is simply left empty.
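The two types of missing values for numeric variables can be told apart with a few lines of code. The Python sketch below assumes that the erroneous missing code consists of nines filling the full width of the variable (9, 99, 999) and that this value lies outside the valid range of the variable; both assumptions should be checked against the actual codebook. The function names are hypothetical.

```python
def missing_code(width):
    """Highest value that fits in `width` digits, e.g. 9 for 1 and 999 for 3."""
    return int("9" * width)


def classify(raw, width):
    """Classify a raw numeric data entry value.

    Returns "not applicable" for a blank cell (correct missing),
    "missing" for the out-of-range code (erroneous missing),
    and "valid" otherwise.
    """
    if raw is None or raw == "":
        return "not applicable"
    if int(raw) == missing_code(width):
        return "missing"
    return "valid"


def for_analysis(raw, width):
    """Return the value as an integer, or None for either type of missing."""
    return int(raw) if classify(raw, width) == "valid" else None
```

Keeping the classification separate from the recoding for analysis preserves the distinction between the two types of missing values in the stored data.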

 

Example codebook

 

Appendices/references/links

 

Audit questions

  1. Is there a codebook?
    1. If not, why not?
  2. If yes, would another researcher not associated with the project be able to understand it and is the codebook up-to-date?

 

V3.0: 1 December 2016: Text updated
V2.0: 12 May 2015: Revision format
V1.2: 1 Jan 2010: English translation
V1.1: 19 October 2006: Updated and minor textual amendments