Codebook

Aim

To clarify what constitutes a good codebook.

Description

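The codebook contains the names and descriptions of the variables, the range of correct and incorrect values for each variable, and the coding for missing values. Missing values are either correct (the question was not applicable) or incorrect (the answer was omitted or refused).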
A codebook is required:

  • To create a data entry system
  • For data entry
  • For statistical analysis
  • When archiving data files once the study has been completed to, for instance, enable follow-up studies.

A good codebook consists of columns containing the item or a description of the item along with the item number; the allocated variable name (a maximum of 8 characters); the variable type, normally numeric (N), alphanumeric/string (A) or date, and the number of characters; the values the variable can take; and the coding for those values, including the coding for correct and incorrect missing values. See the details. A codebook can also be created by typing variable names, codes, etc. into an empty list of questions. It is also possible to derive a codebook retrospectively in SPSS from an SPSS file by selecting the Utilities option in the main menu, and then selecting File Info.

For any data entry method the researcher needs to create a codebook in advance. This also applies if the data entry, the creation of the entry screen or the questionnaire scanning has been outsourced. When creating a codebook, the questionnaire is transcribed into variables. The questions are coded, and this includes allocating values for missing observations (missing values). The aim is to create a separate codebook for each questionnaire. An entry screen can be created in BLAISE, MS-Access, Netquestionaires / Elisten or Teleform using the codebook. The codebook allows the individuals entering the data to understand how encoded items should be entered. Furthermore, a codebook is a necessity for statistical analysis using, for instance, SPSS. In virtually all statistical programmes, variables are referred to by a variable name of a maximum of 8 characters. These variable names often obscure the actual meaning of the variable. Refer to the details for advice on coding variables.

Naming variables
Try to give the variables meaningful names that say something about the nature of the item. Avoid cryptic descriptions. One suggestion is to always start the variable name with the first letter(s) of the relevant form. Use capitals for the variable names for clarity. Avoid spaces in variable names and use special characters with care. For instance, variable names cannot start with a number in SPSS. Make sure that a variable name is never longer than 8 characters, even if the associated database programme allows longer names, because the majority of statistical packages can only handle variable names of up to 8 characters. When converting a 10-character name from Access, for instance, there is a risk that the variable name is no longer unique in SPSS.

For instance:

Access variable name    Resulting SPSS variable name
PYNSCORE10              PYNSCOR
PYNSCORE20              PYNSCORE
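The naming rules above can be checked mechanically before a codebook is finalised. The following is a minimal sketch, not a prescribed tool: the function names are hypothetical, and it assumes that a longer name is shortened by simply cutting it off after 8 characters, which may differ from what a particular import routine actually does.

```python
from collections import defaultdict

MAX_LEN = 8  # maximum variable name length stated in this guideline


def name_problems(name, max_len=MAX_LEN):
    """Return a list of problems with a proposed variable name."""
    problems = []
    if len(name) > max_len:
        problems.append(f"longer than {max_len} characters")
    if " " in name:
        problems.append("contains a space")
    if name[:1].isdigit():
        problems.append("starts with a number")
    if name != name.upper():
        problems.append("not in capitals")
    return problems


def truncation_clashes(names, max_len=MAX_LEN):
    """Group names that become identical when cut to max_len characters.

    Assumption: plain truncation; real software may rename differently.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name[:max_len]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}
```

Calling `truncation_clashes(["PYNSCORE10", "PYNSCORE20", "PATNO"])` reports that the first two names share the same first 8 characters, which is the situation the Access example describes.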

Example

VAR. NAME    QUESTION ON FORM
CINCLUDE     Form “card index search”, on which a specific inclusion criterion is completed
IORIGIN      Form “inclusion consultation”, on which ethnic origin is completed
FREGSYM      Form “follow-up consultation”, on which occurrence of regular symptoms is documented
P1SOCIAL     Patient questionnaire time point 1, in which effect of illness on social life is explored

 

If the same questions occur at different time points, this needs to be indicated clearly in the variable names by including the relevant time point in the name, for instance P1PAIN, P2PAIN and P3PAIN where the participant is asked about pain at 3 time points. This does not, of course, apply to variables with unique values that allow files to be linked together, such as patient numbers. These variables should always have the same name for each time point and form, for instance PATNO.
The find and replace function in Word can be used to reduce the amount of time needed to create codebooks for different time points.
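As an alternative to find and replace, the time point naming convention can be expressed in a few lines of code. This is only an illustrative sketch: the function name, the `LINK_VARIABLES` set and the prefix/item split are assumptions made for the example, not part of the guideline.

```python
# Linking variables keep the same name at every time point (see PATNO above).
LINK_VARIABLES = {"PATNO"}


def name_at_timepoint(prefix, item, timepoint, max_len=8):
    """Build a variable name such as P1PAIN from form prefix, time point and item."""
    if item in LINK_VARIABLES:
        return item  # never renamed, so files can be linked
    name = f"{prefix}{timepoint}{item}"
    if len(name) > max_len:
        raise ValueError(f"{name} is longer than {max_len} characters")
    return name


# Names for the pain item of the patient questionnaire at three time points.
pain_names = [name_at_timepoint("P", "PAIN", t) for t in (1, 2, 3)]
```

Raising an error for names that become too long is a design choice for this sketch; it forces a shorter item name to be chosen rather than silently truncating.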

Questions allowing more than one answer
Multiple choice questions that allow more than one answer need to be split up in the codebook into as many variables as there are response categories. Each of these variables takes the value 1 if the category was selected and 0 if it was not. Missing values are not possible for this type of question, as it is impossible to distinguish between “response category missed” and “response category not endorsed”. If the item is a conditional question, that is, one where the response depends on the answer to a previous question, then the “not applicable” value is also possible.
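The split into one 0/1 variable per response category can be sketched as follows. The variable names are taken from the example codebook in this guideline; the function name and the `applicable` parameter are hypothetical and only illustrate the idea.

```python
# One variable per response category of "Which family illnesses".
CATEGORIES = ["DIABETES", "RESP", "CARDIOV", "OTHILL"]


def to_indicators(selected, categories=CATEGORIES, applicable=True):
    """Turn the ticked categories into one 0/1 variable per category.

    If the question was not applicable, every variable is left empty (None).
    """
    if not applicable:
        return {name: None for name in categories}
    unknown = set(selected) - set(categories)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return {name: 1 if name in selected else 0 for name in categories}
```

Note that a respondent who ticked nothing yields 0 for every variable, which is why no separate missing code exists for this type of question.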

Two types of missing values
A missing value resulting from an erroneously missing detail (omission, refusal to answer the question, etc.) is usually coded with the highest value that fits the width of the variable and lies outside the valid range. For instance, for a multiple choice question with response categories 1 to 5, this would be the value 9.

For a date variable it is best to use a date from the distant past as the value for an incorrect missing value (the date 14-12-1985 is usually used for this; the Blaise input programme does this automatically).
If the missing value results from a correct missing detail (i.e. the question is not applicable), then the variable is left blank. In SPSS this is automatically coded as system missing (a dot). The data entry programme usually fills in these values automatically when a question that is not applicable is skipped. In older files in particular, a value other than <empty> may have been entered for a correct missing value, for instance 8 or 88.

As alphanumeric variables are not handled by statistical programmes, no distinction is made between correct or erroneous missing alphanumeric values. In both instances the cell is simply left empty.
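The distinction between the two types of missing value can be made explicit when data are read for analysis. The sketch below assumes that an empty cell is represented as `None` or an empty string and that each variable has a single missing code; the function names are hypothetical.

```python
from datetime import date

# Date used in this guideline for an erroneously missing date.
MISSING_DATE = date(1985, 12, 14)


def classify(value, missing_code):
    """Label one stored value as 'not applicable', 'missing' or 'valid'."""
    if value is None or value == "":
        return "not applicable"  # left blank: the question did not apply
    if value == missing_code:
        return "missing"  # omission, refusal to answer, etc.
    return "valid"


def for_analysis(value, missing_code):
    """Return None for both kinds of missing value, otherwise the value itself."""
    return value if classify(value, missing_code) == "valid" else None
```

For a question with response categories 1 to 5 and missing code 9, `classify(9, 9)` gives `"missing"` and `classify(None, 9)` gives `"not applicable"`; both become `None` in `for_analysis`, so the missing code is never mistaken for a real answer.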

 

Audit questions: Is there a codebook? If not, why not?
If yes, would another researcher not associated with the project be able to understand it and is the codebook up-to-date?


Example of questionnaire:

Transformation to a codebook:

 

Question                    Variable name   Type    Value                    Code                     Compulsory
1. Patient number           PATNO           N 3     001-998                  001-998                  yes & unique
2. GP code                  GPCODE          N 1     1-9                      1-9                      yes
                                                    missing                  99
3. Sex                      SEX             N 1     male                     1                        yes
                                                    female                   2
                                                    missing                  9
4. Date of birth            DOB             date    01/01/1940-01/01/1980    01/01/1940-01/01/1980    no
                                                    missing                  14/12/1985
5. Number of pregnancies    PREGNANT        N 1     0-8                      0-8                      no
                                                    missing                  9
                                                    na (sex=1)               empty
6. Illness in family        FAMILL          N 1     yes                      1                        yes
                                                    no                       2
                                                    missing                  9
7. Which family illnesses
   - Diabetes               DIABETES        N 1     selected                 1                        no
                                                    not selected             0
   - Respiratory            RESP            N 1     selected                 1                        no
                                                    not selected             0
   - Cardiovascular         CARDIOV         N 1     selected                 1                        no
                                                    not selected             0
   - Other illnesses        OTHILL          N 1     selected                 1                        no
                                                    not selected             0
7a. Which other illnesses   WHICHILL        A 30    missing                  empty                    no
                                                    na (othill=0)            empty
8. Weight in kg             WEIGHT          N 3     45-140                   45-140                   yes
                                                    missing                  999
9. Height in metres         HEIGHT          N 1.2   1.20-2.25                1.20-2.25                yes
                                                    missing                  9.99
10. Last job                JOB             A 50                                                      no
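A codebook laid out like the one above can also be used to check entered data automatically. The sketch below encodes a few of the example variables as a dictionary and reports values that fall outside the permitted range. The dictionary layout and function name are hypothetical, and "compulsory" is interpreted here as "the field may not be left empty", which is an assumption.

```python
# A few rows of the example codebook: valid range, missing code, compulsory flag.
CODEBOOK = {
    "PATNO": {"min": 1, "max": 998, "missing": None, "compulsory": True},
    "SEX": {"min": 1, "max": 2, "missing": 9, "compulsory": True},
    "PREGNANT": {"min": 0, "max": 8, "missing": 9, "compulsory": False},
    "WEIGHT": {"min": 45, "max": 140, "missing": 999, "compulsory": True},
}


def check_record(record, codebook=CODEBOOK):
    """Return a list of messages for values that violate the codebook."""
    errors = []
    for name, rule in codebook.items():
        value = record.get(name)  # None means the cell was left empty
        if value is None:
            if rule["compulsory"]:
                errors.append(f"{name}: compulsory value is empty")
            continue
        if value == rule["missing"]:
            continue  # the missing code is an accepted entry
        if not rule["min"] <= value <= rule["max"]:
            errors.append(f"{name}: {value} is outside {rule['min']}-{rule['max']}")
    return errors
```

A record such as `{"PATNO": 12, "SEX": 1, "PREGNANT": None, "WEIGHT": 80}` passes without messages, whereas a SEX value of 3 is reported as outside the range 1-2.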


An example can be found in the quality assurance handbook with this guideline.


The Introductory Meeting Data Management part 1, which is compulsory for doctoral students, discusses codebooks in more detail.
Codebook: Description of the data as stored on the computer.

V1.2: 1 Jan 2010: English translation.
V1.1: 19 October 2006: Updated and minor textual amendments.
