Quality Handbook
To clarify what constitutes a good codebook.
The codebook contains the names and descriptions of the variables, the ranges of correct and incorrect values, and the coding for correct and incorrect missing values.
A codebook is required:
A good codebook consists of columns containing: the item or the description of the item, along with the item number; the allocated variable name (a maximum of 8 characters); the variable type, normally numeric (N), alphanumeric/string (A) or a date, and the number of characters; and the potential values the variable can take and the coding for these values, including coding for (correct/incorrect) missing values. See details. A codebook can also be created by typing variable names, codes, etc. into an empty list of questions. It is also possible to derive a codebook retrospectively in SPSS from an SPSS file by selecting the Utilities option in the main menu, and then selecting File Info.
For any data entry method it is necessary for the researcher to create a codebook in advance. This also applies if the data entry and/or the creation of the entry screen or questionnaire scanning have been outsourced. When creating a codebook, a transcription takes place from the questionnaire to the variables. The questions are coded, and this includes the allocation of values for missing observations (missing values). The aim is to create a separate codebook for each questionnaire. An entry screen can be created in BLAISE, MS-Access, Netquestionaires / Elisten or Teleform using the codebook. The codebook allows the individuals entering the data to understand how encoded items should be entered. Furthermore, a codebook is a necessity for statistical analysis using, for instance, SPSS. In virtually all statistical programmes, variables are referred to by a variable name of a maximum of 8 characters. These variable names often obscure the actual meaning of the variable. Refer to the details for advice on coding variables.
Naming variables
Try to provide meaningful names for the variables that say something about the nature of the item. Avoid cryptic descriptions. One suggestion is to always start the variable name with the first letter(s) of the relevant form. Use capitals for the variable names for clarity. Use special characters with care; for instance, variable names cannot start with a number in SPSS. Avoid using spaces in variable names. Make sure that a variable name is never longer than 8 characters, even though variable names may be longer than 8 characters within the associated database programme: the majority of statistical packages are only able to handle variable names that are no longer than 8 characters. When converting a 10-character name from Access, for instance, there is a risk that the variable name is no longer unique in SPSS.
For instance:
Access variable name | Resulting SPSS variable name |
PYNSCORE10 | PYNSCOR |
PYNSCORE20 | PYNSCORE |
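The naming rules above can be checked mechanically before a codebook is finalised. The following is a minimal Python sketch, not part of the original guideline. The function names `name_problems` and `truncation_clashes` are hypothetical, and cutting a name to its first 8 characters is an assumption: a given programme may shorten names differently when converting a file.

```python
from collections import defaultdict

MAX_LEN = 8  # maximum variable name length stated in this guideline


def name_problems(name):
    """Return the naming rules from this guideline that a name breaks."""
    problems = []
    if len(name) > MAX_LEN:
        problems.append("longer than 8 characters")
    if " " in name:
        problems.append("contains a space")
    if name[:1].isdigit():
        problems.append("starts with a number")
    if name != name.upper():
        problems.append("not in capitals")
    return problems


def truncation_clashes(names, max_len=MAX_LEN):
    """Group names that become identical when cut to max_len characters.

    Assumption: names are shortened by keeping the first max_len characters.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name[:max_len]].append(name)
    # Keep only the shortened names shared by more than one original name.
    return {short: full for short, full in groups.items() if len(full) > 1}


if __name__ == "__main__":
    print(name_problems("PYNSCORE10"))
    print(truncation_clashes(["PYNSCORE10", "PYNSCORE20", "PATNO"]))
```

Running such a check on the names in the database programme before export would flag the two PYNSCORE variables as a potential clash.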
Example
VAR. NAME | QUESTION ON FORM |
CINCLUDE | Form “card index search”, on which a specific inclusion criterion is completed |
IORIGIN | Form “inclusion consultation”, on which ethnic origin is completed |
FREGSYM | Form “follow-up consultation”, on which the occurrence of regular symptoms is documented |
P1SOCIAL | Patient questionnaire time point 1, in which the effect of illness on social life is explored |
If the same questions occur at different time points, this needs to be indicated clearly in the variable names by including the relevant time point in the name. For instance: P1PAIN, P2PAIN, P3PAIN, (the participant is asked about pain at 3 time points). This does not, of course, apply to variables with unique values allowing files to be linked together, such as patient numbers, etc. These variables should always have the same name for each time point and form, for instance, PATNO.
The find and replace function in Word can be used to reduce the amount of time needed to create codebooks for different time points.
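The same find-and-replace step can also be scripted. The following minimal Python sketch derives the variable names for a later time point from those of time point 1, leaving linking variables such as PATNO unchanged. The function name `rename_for_time_point` and its parameters are hypothetical, and the sketch assumes that names follow the pattern form letter, time point, item name (as in P1PAIN).

```python
def rename_for_time_point(names, time_point, form_letter="P",
                          template_time_point=1, link_variables=("PATNO",)):
    """Return the variable names for another time point.

    Names starting with the template prefix (for example "P1") get the new
    time point; linking variables and all other names are kept as they are.
    """
    old_prefix = f"{form_letter}{template_time_point}"
    new_prefix = f"{form_letter}{time_point}"
    renamed = []
    for name in names:
        if name in link_variables or not name.startswith(old_prefix):
            renamed.append(name)
        else:
            renamed.append(new_prefix + name[len(old_prefix):])
    return renamed


if __name__ == "__main__":
    template = ["PATNO", "P1PAIN", "P1SOCIAL"]
    print(rename_for_time_point(template, 2))
    print(rename_for_time_point(template, 3))
```

Note that a two-digit time point lengthens the name by one character, so the result should still be checked against the 8-character limit.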
Questions allowing more than one answer
Multiple choice questions allowing multiple answers need to be split up in the codebook into as many variables as there are response categories. The value of each of these variables is either 1 (selected) or 0 (not selected). Missing values are not possible for this type of question, as it is impossible to distinguish between “response category missed” and “response category not endorsed”. If the item is a conditional question, that is, one where the response depends on the answer to a previous question, then the “not applicable” value is also possible.
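As an illustration of this split, the minimal Python sketch below turns the set of ticked response categories into one 0/1 variable per category. The function name `split_multiple_response` is hypothetical; the variable names are taken from the example codebook in this guideline. Representing “not applicable” as `None` (an empty cell) is an assumption made for the sketch.

```python
def split_multiple_response(selected, categories, applicable=True):
    """Return one 0/1 variable per response category.

    selected:   the categories ticked by the respondent
    categories: all response categories of the question
    applicable: False if a conditional question was rightly skipped,
                in which case every variable is left empty (None)
    """
    if not applicable:
        return {category: None for category in categories}
    return {category: 1 if category in selected else 0
            for category in categories}


if __name__ == "__main__":
    family_illnesses = ["DIABETES", "RESP", "CARDIOV", "OTHILL"]
    print(split_multiple_response({"DIABETES", "CARDIOV"}, family_illnesses))
    print(split_multiple_response(set(), family_illnesses, applicable=False))
```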
Two types of missing values
A missing value resulting from an erroneously missing detail (omission or refusal to answer the question, etc.) is usually coded with the highest possible out-of-range value. For instance, for a multiple choice question with response categories 1 to 5, this would be the value 9.
For a date variable it is best to use a date from the distant past as a value for the incorrect missing value (usually the date 14-12-1985 is used for this – the Blaise input programme does this automatically).
If the missing value results from a correct missing detail (i.e. the question is not applicable), then the variable is left blank. In SPSS this will be automatically coded as system missing (a dot). These values are usually filled in automatically in the data entry programme where the not applicable question has been omitted. Particularly in older files a value other than <empty> may be entered for a correct missing value, for instance, 8 or 88, etc.
As alphanumeric variables are not handled by statistical programmes, no distinction is made between correct or erroneous missing alphanumeric values. In both instances the cell is simply left empty.
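The two kinds of missing value for numeric variables can be told apart in a few lines of code. The Python sketch below is illustrative only. The function names are hypothetical, and the rule used to pick the code (the smallest number made up entirely of nines that lies above the highest valid value) is an assumption that generalises the example of 9 for categories 1 to 5. It covers whole numbers only and represents an empty cell as `None`.

```python
def erroneous_missing_code(max_valid):
    """Return the smallest all-nines integer above max_valid.

    Assumption: erroneously missing values are coded 9, 99, 999, ...
    whichever is the first to fall outside the valid range.
    """
    code = 9
    while code <= max_valid:
        code = code * 10 + 9
    return code


def classify(value, max_valid):
    """Classify one numeric cell of a data file."""
    if value is None:
        return "not applicable"   # correct missing value: cell left blank
    if value == erroneous_missing_code(max_valid):
        return "missing"          # erroneously missing: omission, refusal
    return "observed"


if __name__ == "__main__":
    print(erroneous_missing_code(5))
    print(classify(9, 5), classify(None, 5), classify(3, 5))
```

In older files that use a value such as 8 or 88 for a correct missing value, `classify` would need an extra rule for that code.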
Audit questions: Is there a codebook? If not, why not?
If yes, would another researcher not associated with the project be able to understand it and is the codebook up-to-date?
Example of questionnaire:

Transformation to a codebook:
Question | Variable | Type | Value | Code | Compulsory |
1. Patient number | PATNO | N 3 | 001-998 | 001-998 | yes & unique |
2. GP code | GPCODE | N 1 | 1-9 | 1-9 | yes |
3. Sex | SEX | N 1 | Male | 1 | yes |
4. Date of birth | DOB | date | 01/01/1940- | 01/01/1940- | no |
5. Number of pregnancies | PREGNANT | N 1 | 0-8 | 0-8 | no |
6. Illness in family | FAMILL | N 1 | yes | 1 | yes |
7. Which family illnesses | | | | | |
- Diabetes | DIABETES | N 1 | selected | 1 | no |
- Respiratory | RESP | N 1 | selected | 1 | no |
- Cardiovascular | CARDIOV | N 1 | selected | 1 | no |
- Other illnesses | OTHILL | N 1 | selected | 1 | no |
7a. Which other illnesses | WHICHILL | A 30 | missing | empty | no |
8. Weight in kg | WEIGHT | N 3 | 45-140 | 45-140 | yes |
9. Height in metres | HEIGHT | N 1.2 | 1.20-2.25; missing | 1.20-2.25; 9.99 | yes |
10. Last job | JOB | A 50 | | | no |
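A codebook of this kind can also be held as a small data structure and used to check entered records. The Python sketch below is a minimal illustration under stated assumptions, not a description of how any of the entry programmes named above work. The names `Variable`, `CODEBOOK` and `check_record` are hypothetical, only four numeric rows of the example are included, and an empty cell is represented by a missing key or `None`.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    name: str         # variable name from the codebook
    low: int          # lowest valid value
    high: int         # highest valid value
    compulsory: bool  # True if the item must be filled in


# Four numeric rows taken from the example codebook in this guideline.
CODEBOOK = [
    Variable("PATNO", 1, 998, True),
    Variable("GPCODE", 1, 9, True),
    Variable("PREGNANT", 0, 8, False),
    Variable("WEIGHT", 45, 140, True),
]


def check_record(record, codebook=CODEBOOK):
    """Return a list of problems found in one entered record (a dict)."""
    errors = []
    for var in codebook:
        value = record.get(var.name)
        if value is None:
            if var.compulsory:
                errors.append(f"{var.name}: compulsory value is empty")
        elif not var.low <= value <= var.high:
            errors.append(f"{var.name}: {value} outside {var.low}-{var.high}")
    return errors


if __name__ == "__main__":
    print(check_record({"PATNO": 12, "GPCODE": 3, "WEIGHT": 70}))
    print(check_record({"PATNO": 12, "GPCODE": 0, "PREGNANT": 2}))
```

This sketch reports any out-of-range value, so codes reserved for erroneously missing values would have to be allowed explicitly, and the uniqueness of PATNO across records is not checked.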
An example can be found in the quality assurance handbook with this guideline.
V1.2: 1 Jan 2010: English translation.
V1.1: 19 October 2006: Updated and minor textual amendments.