Prognostic models

Aim

To describe how a prognostic model can be developed and tested as thoroughly as possible.

Description

This guideline describes the methods and techniques that are used to develop and validate prognostic models. The aim of a prognostic model is to estimate the probability of a particular outcome on the basis of as few variables as possible. This may involve prognostic (risk or outcome) prediction (predicting the course of a disease), but also aetiological models (predicting who will get the disease on the basis of risk factors) or diagnostic models (predicting the presence of the disease). The steps involved in developing a prognostic model are summarised, from the selection of predictors to the testing of external validity. For some steps there is a choice between a basic, simple approach and more complex techniques; these options are summarised briefly in this guideline.

Contents of this guideline
Introduction
Preparation

  • Choice of predictors
  • Defining the outcome measure
  • Choice of model
  • Sample size and number of predictors
  • Linearity
  • Correlation between predictors
  • Handling missing values

Developing a prognostic model

  • Preselecting predictors and building the model
    • Univariate and stepwise regression analysis
    • Least absolute shrinkage and selection operator (Lasso)

How the prognostic model works
Creating a prediction rule
Validity

  • Internal validity
  • External validity

A. Introduction
The aim of a prognostic model is to estimate (predict) the probability of a particular outcome as accurately as possible, and not primarily to explore whether the association between a specific factor and the outcome is causal (as in an explanatory model). A prognostic model is therefore developed differently from an explanatory model. An explanatory (causal) model normally has a single central determinant and corrects for confounding; when building a prognostic model, the focus is on finding a combination of factors that are as strongly related to the outcome as possible.

Prognostic models are often developed for clinical practice, where information combined across patients is used to calculate an individual's risk of developing a disease or of a particular disease outcome (e.g. recovery from a specific disease). The model can then be presented in the form of a clinical prediction rule (1). To ensure that a prognostic model is applicable in (clinical) practice, it is preferable that the variables in the model are easy to determine.

B. Preparation
Choice of predictors
Prognostic models can be developed using a broad variety of biological, psychological and social predictors, which need to be selected carefully. It is advisable to include all predictors that have been shown to be strongly associated with the outcome in previous research, or that can be expected to show an association on the basis of conceptual or theoretical models. A proper systematic literature review and expert advice are important in this step. When the practical applicability of the prognostic model is important, it is preferable for predictors to be determined quickly and simply (e.g. no complex or invasive tests and no extensive questionnaires).

Defining the outcome measure
The outcome is central to the prognostic model and needs to be selected carefully. Think about the nature of the outcome (which concept), the method for determining it (which measurement instrument, used by whom) and the length of follow-up (which measurement time points). The outcome of a prognostic model is often dichotomous (e.g. ill or not ill), but it may also be continuous (for instance, the severity of functional limitations) or the time until a certain event occurs (“time to event”, for instance, the time until work is resumed or the time until death). A dichotomous outcome is sometimes defined by choosing a cut-off point on a continuous scale. Bear in mind that this leads to a loss of information, so it should only be considered when there are strong arguments for it. If the outcome is dichotomised, the cut-off needs to be chosen carefully, preferably on the basis of substantive arguments and a conceptual or theoretical model. For instance, at what point do we define whether or not there is a case of depression?
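To make the loss of information concrete, the sketch below simulates a continuous outcome, dichotomises it at its median and compares how strongly a predictor correlates with each version. The data are simulated and purely illustrative. For normally distributed variables, a median split is known to reduce the correlation to roughly 80% of its original value, which the simulation should approximately reproduce.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


random.seed(1)
# Simulated (hypothetical) data: a predictor x and a continuous outcome y,
# e.g. a severity score, that depends linearly on x.
x = [random.gauss(0, 1) for _ in range(5000)]
y = [xi + random.gauss(0, 1) for xi in x]

# Dichotomise the outcome at a cut-off (here its population median, 0).
y_dich = [1.0 if yi > 0 else 0.0 for yi in y]

r_continuous = pearson(x, y)
r_dichotomised = pearson(x, y_dich)
print(f"continuous outcome:   r = {r_continuous:.2f}")
print(f"dichotomised outcome: r = {r_dichotomised:.2f}")
```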

Choice of model
The choice of statistical model for the prognostic model depends on the definition of the outcome measure. A logistic regression model should be chosen for a dichotomous outcome, a Cox regression model can be used for a “time to event” outcome and a linear regression model for a continuous outcome measure. There are various other options, but these are not discussed in this guideline.

Sample size and number of predictors
The precision of the estimates in the prognostic model depends strongly on the size of the study population. There are different ways of performing power calculations to determine the minimum sample size. The sample size in particular determines the number of variables that can be included in the regression model. A rule of thumb is that for a continuous outcome measure (linear regression) you need at least 10 – 15 participants per variable in the model. For a dichotomous outcome (logistic regression), at least 10 – 15 “events” or “non-events”, whichever group is smaller, are needed per variable (2). Events and non-events refer to whether or not the outcome occurs, for instance disease/no disease. The logistic regression rule also applies to Cox regression models. When a prognostic model is externally validated, the validation cohort (a cohort used to test the model externally) also needs a sufficient number of participants; the 10 – 15 participants rule also applies here.
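The rule of thumb can be written as a small calculation. The function names and the cohort numbers below are hypothetical. Under the usual convention, each model term counts as a variable, so a categorical predictor entered as three dummy variables uses up three.

```python
def max_predictors_binary(n_events, n_nonevents, per_variable=10):
    """Rule of thumb for logistic (and Cox) models: the smaller of the two
    outcome groups divided by the required number per variable."""
    return min(n_events, n_nonevents) // per_variable


def max_predictors_continuous(n_participants, per_variable=10):
    """Rule of thumb for linear regression: participants per variable."""
    return n_participants // per_variable


# Hypothetical cohort: 1000 participants, 150 of whom develop the outcome.
print(max_predictors_binary(150, 850, per_variable=10))  # 15
print(max_predictors_binary(150, 850, per_variable=15))  # 10
# Hypothetical cohort of 300 participants with a continuous outcome.
print(max_predictors_continuous(300, per_variable=15))   # 20
```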

Linearity
The regression models discussed in this guideline assume a linear relationship between each predictor and the outcome. In practice, however, this relationship is often non-linear. An example is the relationship between alcohol consumption and the risk of a myocardial infarction, which is U-shaped. For all potential predictors it therefore needs to be considered whether the relationship with the outcome measure is indeed linear. Nominal and dichotomous variables are the exception; nominal variables should always be included as dummy variables. At the same time, a balance must be struck between a data-driven search for non-linearity that is idiosyncratic to the sample and non-linearity that genuinely applies to the population. What matters most is not the exact form of the relationship but the gain in predictive performance. There are various options for investigating non-linearity, including spline functions. More information about the various methods for investigating linearity will be available in the Epidm course “Prediction modelling” that will start in 2012.
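A simple first look at linearity, before any modelling, is to divide a continuous predictor into groups of equal size and compare the mean outcome per group. The sketch below uses simulated, hypothetical data with a U-shaped relationship: the slope of a straight-line fit is close to zero, suggesting no association, while the group means clearly show the U shape. This is only an exploratory check and does not replace methods such as spline functions.

```python
import random

random.seed(2)
# Simulated U-shaped relationship (hypothetical): the outcome is lowest
# at moderate exposure and higher at both extremes.
x = [random.uniform(0, 10) for _ in range(2000)]
y = [(xi - 5) ** 2 + random.gauss(0, 3) for xi in x]


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


def group_means(xs, ys, n_groups=5):
    """Mean outcome within equal-sized groups of the sorted predictor."""
    pairs = sorted(zip(xs, ys))
    size = len(pairs) // n_groups
    means = []
    for g in range(n_groups):
        end = (g + 1) * size if g < n_groups - 1 else len(pairs)
        chunk = pairs[g * size:end]
        means.append(sum(b for _, b in chunk) / len(chunk))
    return means


print(f"linear slope: {ols_slope(x, y):.2f}")
print("mean outcome per fifth of x:", [round(m, 1) for m in group_means(x, y)])
```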

Spline functions
Spline functions can be used to explore the (non-)linear relationship between a predictor and the outcome in more detail (spline functions are mathematical functions that are used to analyse the relationship between a predictor and the outcome measure carefully when it is non-linear). Spline functions do not assume a linear relationship between a predictor and the outcome measure but follow the pattern of the data more closely. If there is a non-linear relationship between the predictor and the outcome, it can be included in the regression model as a function. The advantage over categorising the variable and including the categories as dummy variables, which often happens with a non-linear relationship, is that this costs the regression model less statistical power. Contact Martijn W Heymans for more information about spline functions and how to apply them.
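A minimal sketch of how spline terms can be constructed is given below. It uses the restricted cubic spline parameterisation described by Harrell (1), which is one common choice among several; the knot positions are arbitrary illustrations (Harrell recommends placing knots at fixed percentiles of the predictor). Each returned term would be entered as a separate variable in the regression model. The terms are zero below the first knot and the curve is forced to be linear beyond the last knot.

```python
def restricted_cubic_spline(x, knots):
    """Return the restricted cubic spline terms for value x.

    The first term is x itself; each further term is a cubic piece that is
    zero below the first knot and constrained so that the fitted curve is
    linear beyond the last knot. With k knots there are k - 1 terms.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("at least three knots are needed")

    def cube_plus(u):
        return max(u, 0.0) ** 3

    scale = (t[-1] - t[0]) ** 2  # keeps the terms on roughly the scale of x
    terms = [x]
    for j in range(k - 2):
        term = (cube_plus(x - t[j])
                - cube_plus(x - t[k - 2]) * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
                + cube_plus(x - t[k - 1]) * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
        terms.append(term / scale)
    return terms


# Hypothetical knots for a predictor such as age (in years).
knots = [30, 45, 55, 70]
for age in (25, 50, 80):
    print(age, [round(v, 3) for v in restricted_cubic_spline(age, knots)])
```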

Correlation between predictors
A strong correlation between two predictors will affect the selection of both. It is therefore sensible to generate a correlation table including all potential predictors. When variables are strongly correlated (e.g. r > 0.70), it is sensible to decide which of them you are going to use in building the model, or whether to combine them into a single variable. For instance, you could choose the variable most strongly associated with the outcome measure, or the one that is easiest to measure. N.B.: Strongly correlated predictors are not in themselves a problem when they are kept together in a single model. Problems arise when forward or backward selection is applied to strongly correlated (independent) variables.
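A correlation table can be produced with a few lines of code. The sketch below computes Pearson correlations for every pair of candidate predictors and flags pairs above the 0.70 example threshold; the predictor names and values are hypothetical.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))


def correlation_table(predictors):
    """Pearson correlation for every pair of candidate predictors."""
    return {(n1, n2): pearson(v1, v2)
            for (n1, v1), (n2, v2) in combinations(predictors.items(), 2)}


# Hypothetical values of three candidate predictors for six participants.
predictors = {
    "pain_score": [2, 5, 7, 3, 8, 6],
    "disability": [3, 5, 8, 3, 9, 6],
    "age": [40, 62, 35, 58, 47, 51],
}

table = correlation_table(predictors)
for (n1, n2), r in table.items():
    flag = "  <- strongly correlated" if abs(r) > 0.70 else ""
    print(f"{n1:>10} vs {n2:<10} r = {r:5.2f}{flag}")

strong = [pair for pair, r in table.items() if abs(r) > 0.70]
```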

Handling missing values
There will be dropouts and missing values in virtually every cohort study. Dropouts are participants who do not (or no longer) take part in follow-up assessments and whose outcome measures are therefore missing. The number of dropouts and the reasons for dropout need to be described. If possible, the characteristics of the dropouts should also be described and compared with those of the participants who did take part in the follow-up assessments, in order to investigate whether dropout was selective. In addition to dropouts there are often also (incidental) missing values, where the values of one or more predictors are missing for some of the participants.

There are various strategies for dealing with missing values. One of these is to use only data from participants with a complete dataset (“complete case analysis”). In the most favourable case, where values are missing completely at random, the coefficients are estimated less precisely because participants are lost. In less favourable cases, i.e. when values are missing at random or missing not at random, this method can distort both the composition of the model and the estimates of the regression coefficients. This method is therefore strongly discouraged.
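Describing missing data can start with a simple count per variable, and the same count shows how many participants a complete case analysis would discard. In the hypothetical records below, each variable is missing for only one participant in five, yet three of the five participants would be excluded.

```python
def missingness_summary(rows, variables):
    """Number of missing values (None) per variable and number of complete cases."""
    missing = {v: sum(1 for r in rows if r.get(v) is None) for v in variables}
    complete = sum(1 for r in rows if all(r.get(v) is not None for v in variables))
    return missing, complete


# Hypothetical records; None marks a missing value.
rows = [
    {"age": 54, "pain": 6, "outcome": 1},
    {"age": 61, "pain": None, "outcome": 0},
    {"age": None, "pain": 4, "outcome": 0},
    {"age": 47, "pain": 7, "outcome": None},  # dropout: outcome not measured
    {"age": 39, "pain": 3, "outcome": 1},
]
missing, complete = missingness_summary(rows, ["age", "pain", "outcome"])
print(missing)  # {'age': 1, 'pain': 1, 'outcome': 1}
print(f"{complete} of {len(rows)} participants remain in a complete case analysis")
```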

It is also possible to impute (fill in) missing values in a dataset. Various methods are available for this, including imputing the mean value or a value estimated with regression methods. However, these single imputation techniques are strongly discouraged. Multiple imputation is considered to be one of the best methods. It is common practice to consult an expert or a statistician when applying these techniques (Martijn W Heymans can be consulted for this). Always make sure that the number of dropouts and missing values is described in your study. For detailed information on techniques to evaluate and handle missing data, see the missing data guideline in the quality handbook.
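Multiple imputation creates several completed datasets, the prognostic model is fitted in each of them, and the results are then combined with Rubin's rules. The imputation step itself (for example by chained equations) is best done with dedicated software and is not shown. The sketch below only illustrates pooling one regression coefficient; the numbers are hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine one coefficient estimated in m imputed datasets (Rubin's rules).

    estimates: the coefficient from each imputed dataset
    variances: its squared standard error in each imputed dataset
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical results for one predictor from five imputed datasets.
estimates = [0.52, 0.47, 0.55, 0.50, 0.46]
squared_ses = [0.010, 0.011, 0.009, 0.010, 0.012]
estimate, se = pool_rubin(estimates, squared_ses)
print(f"pooled coefficient = {estimate:.3f} (SE {se:.3f})")
```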


Dropouts do not occur in case-control studies. However, there may of course be missing values, to which the same solutions as described above apply.

Developing the model

Preselecting predictors and building the model
Once a set of predictors has been selected, the next step is to build the prognostic model. In this process it is important to distinguish between relevant and less relevant predictors, so that the final model contains as few predictors as possible while still giving reliable predictions. The following techniques can be used to develop a prognostic model.

1. Univariate and Stepwise regression analysis

Selecting variables
First, the relationship between each individual predictor and the outcome measure is investigated in a model that includes only that predictor (univariate analysis). This relationship is evaluated against a chosen p-value threshold; 0.20 or lower is often used. Predictors with a p-value below the threshold can be considered relevant and are carried forward to the next step. In this way the importance of each predictor to the prognostic model can be explored. If too many variables are retained in this pre-selection phase, you can use a stricter threshold, e.g. p < 0.10 or p < 0.05. Note, however, that pre-selecting predictors on the basis of univariate statistical significance is arbitrary. It is better to base the initial selection of predictors on previous research and expert opinion rather than relying on statistical pre-selection alone.
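The univariate screening step can be sketched as follows for a dichotomous outcome: each candidate is fitted in its own logistic model and kept if its Wald p-value is below the threshold. The fitting routine is a small hand-written Newton-Raphson implementation for illustration only, and the simulated data and predictor names are hypothetical; a statistical package would be used in practice.

```python
import math
import random
from statistics import NormalDist


def univariable_logistic(x, y, iterations=25):
    """Fit logit P(y = 1) = b0 + b1 * x by Newton-Raphson.

    Returns the slope b1 and its two-sided Wald p-value. The information
    matrix from the final iteration is used for the standard error.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score vector
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(b1 / se_b1)))
    return b1, p_value


random.seed(3)
n = 400
# Hypothetical candidate predictors: one related to the outcome, one not.
candidates = {
    "pain": [random.gauss(0, 1) for _ in range(n)],
    "random_noise": [random.gauss(0, 1) for _ in range(n)],
}
outcome = [1 if random.random() < 1.0 / (1.0 + math.exp(-(-0.5 + 1.2 * v))) else 0
           for v in candidates["pain"]]

threshold = 0.20
selected = []
for name, values in candidates.items():
    slope, p = univariable_logistic(values, outcome)
    print(f"{name}: b = {slope:.2f}, p = {p:.3f}")
    if p < threshold:
        selected.append(name)
print("retained for the next step:", selected)
```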

You may also choose to work with groups of variables. For instance, you could first build the model from all easily obtainable variables (e.g. details from the case history) and select the most important predictors from this group (see Building the model). You can then add the next group of variables (e.g. details from the physical examination) and select the most important predictors from this group together with the variables retained from the previous group, and so on.

Building the model
The options for this are to use a forward or backward selection method, or a combination of the two (stepwise regression). Forward and backward selection methods can be used in order to select the predictors for the model step-by-step. In the forward selection method you add variables to the model, whereas in a backward selection method you remove variables from the model. The backward selection method is preferred, as it leads to fewer errors in the estimates for the predictors or in selecting the most relevant predictors. For these reasons this method is discussed in more detail here.
N.B.: Selecting predictors with forward or backward selection techniques will always generate more problems than selecting variables on the basis of previous research (prospective studies or systematic literature reviews) or by consulting clinical experts, for example in a Delphi procedure. It is therefore advisable to use forward and backward selection techniques as little as possible.

In backward selection, all the selected variables are first entered into a model at the same time. The variable with the highest p-value on the Wald test (which gives the significance level of each predictor), i.e. the variable contributing least, is then removed manually and the model is re-run. This step is repeated until all remaining variables have a p-value below the chosen threshold, usually 0.10 or 0.20. Such relatively liberal thresholds are common in prognostic models, because variables that are less strongly associated with the outcome may still make a relevant contribution to the prediction.
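The backward selection loop can be sketched as below. The guideline describes removing variables manually; the sketch automates the same rule (drop the predictor with the highest Wald p-value, refit, repeat until all p-values are below the threshold) only to show the logic. The hand-written logistic fitting routine, the simulated data and the predictor names are all hypothetical illustrations.

```python
import math
import random
from statistics import NormalDist


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        pv = m[col][col]
        m[col] = [v / pv for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def fit_logistic(X, y, iterations=15):
    """Newton-Raphson logistic regression; each row of X starts with 1 (intercept).
    Returns the coefficients and their two-sided Wald p-values."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iterations):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))
            w = p * (1.0 - p)
            for i in range(k):
                score[i] += (yi - p) * row[i]
                for j in range(k):
                    info[i][j] += w * row[i] * row[j]
        cov = invert(info)
        beta = [b + sum(c * s for c, s in zip(cov[i], score)) for i, b in enumerate(beta)]
    wald_p = [2.0 * (1.0 - NormalDist().cdf(abs(b / math.sqrt(cov[i][i]))))
              for i, b in enumerate(beta)]
    return beta, wald_p


def backward_selection(columns, y, threshold=0.20):
    """Drop the predictor with the highest Wald p-value until every
    remaining predictor has p < threshold."""
    names = list(columns)
    while names:
        X = [[1.0] + [columns[name][i] for name in names] for i in range(len(y))]
        _, wald_p = fit_logistic(X, y)
        predictor_p = wald_p[1:]  # skip the intercept
        worst = max(range(len(names)), key=lambda i: predictor_p[i])
        if predictor_p[worst] < threshold:
            break
        print(f"removing {names[worst]} (p = {predictor_p[worst]:.3f})")
        names.pop(worst)
    return names


random.seed(4)
n = 400
# Hypothetical candidate predictors; only the first two influence the outcome.
columns = {name: [random.gauss(0, 1) for _ in range(n)]
           for name in ("pain", "disability", "noise_a", "noise_b")}
y = []
for i in range(n):
    lp = -0.5 + 1.0 * columns["pain"][i] + 0.8 * columns["disability"][i]
    y.append(1 if random.random() < 1.0 / (1.0 + math.exp(-lp)) else 0)

print("final model:", backward_selection(columns, y))
```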

After this procedure, it may sometimes be informative to add specific variables that did not end up in the final model (but were perhaps expected to), to assess whether they make a significant contribution to the final model. This is occasionally successful. It may also be worthwhile to swap correlated variables (e.g. for variables that are easier to measure) to assess whether this yields an equivalent but more easily applicable model.

2. Least absolute shrinkage and selection operator (Lasso)

The Lasso is an advanced technique for selecting variables. It can shrink regression coefficients to exactly zero, which is equivalent to leaving those variables out of the multivariable model. The Lasso thus combines shrinkage with variable selection and does not need a separate shrinkage step (for more on shrinkage see paragraph G below). Furthermore, the number of potential prognostic variables to select from can be much larger with the Lasso than with “normal” backward selection. To learn more about this technique and how to apply it, contact Martijn W Heymans. The method is promising but has not yet been applied much in epidemiological studies.
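To show how the Lasso sets coefficients to exactly zero, here is a minimal coordinate-descent sketch for a continuous outcome (linear regression), the simplest case. Lasso versions of logistic and Cox regression follow the same idea but are usually fitted with dedicated software, for example the glmnet package in R. The data and the penalty value (lambda) are hypothetical; in practice lambda is typically chosen by cross-validation.

```python
import random


def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def standardise(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]


def lasso(columns, y, lam, sweeps=100):
    """Coordinate descent for the linear-regression lasso.

    Minimises (1/2n) * sum of squared residuals + lam * sum(|b_j|) with
    standardised predictors and a centred outcome (so no intercept is needed).
    """
    n = len(y)
    X = [standardise(c) for c in columns]
    mean_y = sum(y) / n
    residual = [v - mean_y for v in y]   # all coefficients start at zero
    b = [0.0] * len(X)
    for _ in range(sweeps):
        for j, xj in enumerate(X):
            # Inner product of x_j with the residual that excludes x_j itself
            rho = sum(xi * (ri + xi * b[j]) for xi, ri in zip(xj, residual)) / n
            new_bj = soft_threshold(rho, lam)
            if new_bj != b[j]:
                delta = new_bj - b[j]
                residual = [ri - delta * xi for ri, xi in zip(residual, xj)]
                b[j] = new_bj
    return b


random.seed(6)
n = 200
# Hypothetical data: only the first of three predictors affects the outcome.
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * a + random.gauss(0, 1) for a in x1]

for lam in (0.0, 0.5):
    coefs = lasso([x1, x2, x3], y, lam)
    print(f"lambda = {lam}: " + ", ".join(f"{c:.3f}" for c in coefs))
```

With lambda = 0 the result is ordinary least squares (on the standardised scale); with the larger penalty the two unrelated predictors drop out of the model entirely.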


E. The performance of the prognostic model
Once you have developed a prognostic model, it is also important to investigate how well it works, that is, how well it predicts outcomes. The section below describes which techniques can be used for this, depending on the choice of model (1):

Linear regression
The percentage variance explained (R2): This indicates the percentage of the total variance of the outcome measure explained by the predictors in the prognostic model.
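R2 compares the squared prediction errors with the total variation of the outcome around its mean; multiplied by 100 it gives the percentage of variance explained. A minimal sketch with hypothetical values follows. Note that when the same formula is applied to predictions for new patients, R2 can even become negative if the model predicts worse than simply using the mean outcome.

```python
def r_squared(observed, predicted):
    """Proportion of the variance in the outcome explained by the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# Hypothetical observed outcomes and model predictions.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [3.5, 4.5, 7.5, 8.5]
print(f"R2 = {r_squared(observed, predicted):.2f}")  # 0.95
```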

Logistic and Cox regression
Calibration: Calibration describes how well the observed probability of the outcome agrees with the probability predicted by the model. It can be presented graphically in a calibration plot, in which patients are grouped by their predicted probability (often into ten groups) and the mean predicted probability of each group is plotted against the observed proportion with the outcome in that group. You can then assess how closely these points lie along the line of perfect calibration, which forms a 45 degree angle with the horizontal axis. The Hosmer-Lemeshow test can also be used to investigate how well the predicted probabilities agree with the observed probabilities. Ideally this test is not statistically significant (null hypothesis: there is no difference between predicted and observed values).
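The sketch below computes the numbers behind a calibration plot (ten groups by predicted risk) and the Hosmer-Lemeshow statistic, with its p-value taken from a chi-square distribution with (number of groups minus 2) degrees of freedom, as is usual in the development data. To stay within the Python standard library it uses the exact closed form of the chi-square tail probability that holds for an even number of degrees of freedom. The risks and outcomes are simulated and hypothetical.

```python
import math
import random


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of degrees of
    freedom, using the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    if df % 2 != 0:
        raise ValueError("this closed form only holds for even df")
    half = x / 2.0
    term = total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def calibration_table(y, p, groups=10):
    """Split patients into groups of (nearly) equal size by predicted risk and
    return (n, mean predicted risk, observed proportion) for each group."""
    pairs = sorted(zip(p, y))
    size = len(pairs) / groups
    table = []
    for g in range(groups):
        chunk = pairs[round(g * size):round((g + 1) * size)]
        n = len(chunk)
        table.append((n, sum(pi for pi, _ in chunk) / n, sum(yi for _, yi in chunk) / n))
    return table


def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow statistic and p-value (groups - 2 degrees of freedom)."""
    statistic = 0.0
    for n, mean_pred, obs_prop in calibration_table(y, p, groups):
        expected = n * mean_pred
        observed = n * obs_prop
        statistic += (observed - expected) ** 2 / (expected * (1.0 - mean_pred))
    return statistic, chi2_upper_tail_even_df(statistic, groups - 2)


random.seed(7)
# Hypothetical predicted risks, with outcomes simulated from those same risks,
# i.e. a well-calibrated model.
risks = [random.uniform(0.05, 0.6) for _ in range(1000)]
outcomes = [1 if random.random() < r else 0 for r in risks]

for n, mean_pred, obs_prop in calibration_table(outcomes, risks):
    print(f"n = {n:4d}  predicted = {mean_pred:.2f}  observed = {obs_prop:.2f}")
hl, p_value = hosmer_lemeshow(outcomes, risks)
print(f"Hosmer-Lemeshow = {hl:.2f}, p = {p_value:.3f}")
```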

Discrimination: This indicates how well the model discriminates between people with and without the outcome. If there are few predictors in the model, many people will fall into the same group of predicted probabilities and the model will not be able to discriminate well between them. If there are many predictors in the model, few people will share the same predicted probability and the model will tend to discriminate better. A ROC curve can be generated for the predicted probabilities to determine the level of discrimination. The area under the ROC curve (AUC) is a measure of the discriminatory power of the model, that is, how well the model is able to discriminate between people with and without the outcome on the basis of the predicted probabilities (3). An AUC of 0.5 indicates that the model does not discriminate at all (no better than tossing a coin); an AUC of 1.0 indicates perfect discrimination.
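The AUC can be computed without drawing the ROC curve, as the proportion of all pairs of one person with and one person without the outcome in which the person with the outcome received the higher predicted probability (the c-statistic). A simple sketch with hypothetical data:

```python
def auc(y, risk):
    """Concordance (c) statistic: the probability that a randomly chosen person
    with the outcome has a higher predicted risk than a randomly chosen person
    without it (ties count as one half). Equals the area under the ROC curve."""
    with_outcome = [r for r, yi in zip(risk, y) if yi == 1]
    without_outcome = [r for r, yi in zip(risk, y) if yi == 0]
    pairs = 0.0
    for a in with_outcome:
        for b in without_outcome:
            if a > b:
                pairs += 1.0
            elif a == b:
                pairs += 0.5
    return pairs / (len(with_outcome) * len(without_outcome))


# Hypothetical outcomes and predicted risks for four patients.
print(auc([0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80]))  # 0.75
print(auc([0, 0, 1, 1], [0.30, 0.30, 0.30, 0.30]))  # 0.5 (no discrimination)
```

Comparing every pair takes time proportional to the product of the two group sizes, which is fine for cohort-sized data; statistical packages use a faster rank-based calculation.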

Reclassification tables: This is a novel method to evaluate the performance of a prediction model and can be seen as a refinement of the discrimination assessed with the ROC curve (4). The method is especially useful for detecting an improvement in discrimination when a new variable is added to an existing prediction model. It is based on how subjects with and without the outcome are reassigned between risk categories. If adding a new variable improves prediction, subjects with the outcome are reassigned to a higher risk category, which means that reclassification has improved; when subjects with the outcome are reassigned to a lower risk category, reclassification has worsened. For subjects without the outcome it works in the opposite direction. The Net Reclassification Improvement (NRI) and the Integrated Discrimination Improvement (IDI) can be used to test the significance of reclassification and to construct confidence intervals.
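The categorical NRI and the IDI can be computed as sketched below. The risk categories (below 10%, 10 to 20%, and 20% or more) and all risks are hypothetical. Significance tests and confidence intervals are not included in this sketch.

```python
def risk_category(p, cutoffs):
    """Index of the risk category: the number of cut-offs at or below p."""
    return sum(1 for c in cutoffs if p >= c)


def net_reclassification_improvement(y, p_old, p_new, cutoffs):
    """Categorical NRI: net proportion of people with the outcome moving up
    plus net proportion of people without the outcome moving down."""
    up_event = down_event = up_nonevent = down_nonevent = 0
    for yi, old, new in zip(y, p_old, p_new):
        move = risk_category(new, cutoffs) - risk_category(old, cutoffs)
        if yi == 1:
            up_event += move > 0
            down_event += move < 0
        else:
            up_nonevent += move > 0
            down_nonevent += move < 0
    n_event = sum(y)
    n_nonevent = len(y) - n_event
    return ((up_event - down_event) / n_event
            + (down_nonevent - up_nonevent) / n_nonevent)


def integrated_discrimination_improvement(y, p_old, p_new):
    """Change in (mean risk among events minus mean risk among non-events)."""
    def mean_difference(p):
        events = [pi for pi, yi in zip(p, y) if yi == 1]
        nonevents = [pi for pi, yi in zip(p, y) if yi == 0]
        return sum(events) / len(events) - sum(nonevents) / len(nonevents)
    return mean_difference(p_new) - mean_difference(p_old)


# Hypothetical risks from an existing model and from the model with a new variable.
cutoffs = [0.10, 0.20]
y = [1, 1, 1, 0, 0, 0]
p_old = [0.15, 0.25, 0.05, 0.15, 0.05, 0.25]
p_new = [0.25, 0.25, 0.15, 0.05, 0.05, 0.15]
print(f"NRI = {net_reclassification_improvement(y, p_old, p_new, cutoffs):.3f}")
print(f"IDI = {integrated_discrimination_improvement(y, p_old, p_new):.3f}")
```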

F. Creating a prediction rule
For logistic and Cox regression models, the regression coefficients can be used to calculate the outcome (predicted probabilities) from individual patient characteristics (the values of the determinants). To make the prediction rule easier to use in practice, the regression coefficients can be transformed into risk scores. A frequently used method is to divide the regression coefficients by the lowest coefficient, or to multiply them by a constant (for instance 10), and round the results. A score card containing these scores can then be created, allowing the probability of the outcome to be calculated easily for a given individual. This is easy to use in practice; see the article by Kuijpers et al. (2006) for an example (5). Another option is to implement the model as an algorithm on a website.
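Both ways of turning coefficients into points are sketched below for a logistic model. The coefficients, intercept and variable names are hypothetical. The helper approximate_risk (also hypothetical) converts a points total back into a risk by treating each point as worth the smallest coefficient on the log-odds scale, so the risks it gives are approximations. For a Cox model the baseline survival at the chosen time point would also be needed, which is not shown.

```python
import math

# Hypothetical coefficients of a fitted logistic model (not from any real study).
intercept = -2.0
coefficients = {"age_over_50": 0.42, "smoker": 0.83, "pain_at_night": 1.27}

# Option 1: divide by the smallest coefficient and round to whole points.
smallest = min(abs(b) for b in coefficients.values())
points = {name: round(b / smallest) for name, b in coefficients.items()}

# Option 2: multiply by a constant (here 10) and round.
points_x10 = {name: round(b * 10) for name, b in coefficients.items()}


def approximate_risk(total_points):
    """Risk implied by an option-1 points total (each point ~ 'smallest' log-odds)."""
    return 1.0 / (1.0 + math.exp(-(intercept + total_points * smallest)))


print(points)       # {'age_over_50': 1, 'smoker': 2, 'pain_at_night': 3}
print(points_x10)   # {'age_over_50': 4, 'smoker': 8, 'pain_at_night': 13}
for total in range(sum(points.values()) + 1):
    print(f"{total} points: risk about {approximate_risk(total):.0%}")
```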

G. Validity
This is perhaps the most important part of developing a prediction rule. Prediction models commonly perform better in the dataset used to develop them than in new datasets (new subjects). This means that the model’s regression coefficients and performance measures are too optimistic and have to be adjusted before the model is used in new situations (1, 6). One way to do this is to shrink (i.e. make smaller) the regression coefficients before the model is applied to new subjects. Internal and external validation are used to estimate the amount of optimism. In other words, validating the model explores how well the predictions generated by the prognostic model agree with the observed outcomes of future patients, or of comparable patients who were not part of the study population. The validity of a prediction rule can be determined in a number of ways, which are discussed briefly below. A useful reference for a more comprehensive overview is Vergouwe et al. (7).
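The guideline does not prescribe a particular shrinkage method. One simple and widely used option is the heuristic uniform shrinkage factor proposed by Van Houwelingen and Le Cessie, sketched below with hypothetical numbers; bootstrapping is an alternative way to estimate such a factor. After shrinking, the intercept has to be re-estimated so that the average predicted risk again matches the observed outcome frequency (not shown).

```python
def heuristic_shrinkage_factor(model_chi2, df):
    """Van Houwelingen-Le Cessie heuristic: s = (model chi-square - df) / model chi-square.

    model_chi2: likelihood-ratio chi-square of the fitted model versus the
    intercept-only model; df: degrees of freedom of the candidate predictors.
    """
    return (model_chi2 - df) / model_chi2


# Hypothetical values: model chi-square of 50 on 5 degrees of freedom.
s = heuristic_shrinkage_factor(50.0, 5)
coefficients = {"pain": 0.80, "age": 0.30}
shrunken = {name: s * b for name, b in coefficients.items()}
print(f"shrinkage factor = {s:.2f}")   # 0.90
print(shrunken)
```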

A distinction is made between internal and external validity when validating a prediction rule.

Internal validity
For internal validity the model is developed and validated using exactly the same dataset of patients. Techniques that can be used to determine internal validity include: Data-splitting (where the dataset is split in two at random), cross-validation (where the dataset is split into more than two datasets at random) and bootstrapping (a type of simulation technique). The last method is recommended, as this makes efficient use of all the data.
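Bootstrapping for internal validity is commonly implemented as an optimism correction: the model is refitted in each bootstrap sample, its performance in that sample is compared with its performance in the original data, and the average difference (the optimism) is subtracted from the apparent performance. The sketch below applies this to the AUC of a logistic model, using simulated data and a small hand-written fitting routine for illustration only. In a real analysis, any variable selection would also be repeated within every bootstrap sample, more replicates would be used, and an established statistical package would do the fitting.

```python
import math
import random


def solve(a, b):
    """Solve the small linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x


def fit_logistic(X, y, iterations=10):
    """Newton-Raphson logistic regression; each row of X starts with 1 (intercept)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iterations):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))
            for i in range(k):
                score[i] += (yi - p) * row[i]
                for j in range(k):
                    info[i][j] += p * (1.0 - p) * row[i] * row[j]
        beta = [b + s for b, s in zip(beta, solve(info, score))]
    return beta


def linear_predictor(beta, X):
    # Ranking by the linear predictor gives the same AUC as ranking by risk.
    return [sum(b * v for b, v in zip(beta, row)) for row in X]


def auc(y, score):
    pos = [s for s, yi in zip(score, y) if yi == 1]
    neg = [s for s, yi in zip(score, y) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


random.seed(5)
n = 150
X, y = [], []
for _ in range(n):
    row = [1.0] + [random.gauss(0, 1) for _ in range(4)]  # intercept + 4 candidates
    lp = -0.5 + 0.8 * row[1] + 0.5 * row[2]               # only two truly matter
    X.append(row)
    y.append(1 if random.random() < 1.0 / (1.0 + math.exp(-lp)) else 0)

beta = fit_logistic(X, y)
apparent = auc(y, linear_predictor(beta, X))

optimism = []
for _ in range(50):                      # in practice often 200 or more replicates
    idx = [random.randrange(n) for _ in range(n)]
    Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
    beta_b = fit_logistic(Xb, yb)
    auc_boot = auc(yb, linear_predictor(beta_b, Xb))   # performance in bootstrap sample
    auc_orig = auc(y, linear_predictor(beta_b, X))     # same model on the original data
    optimism.append(auc_boot - auc_orig)

corrected = apparent - sum(optimism) / len(optimism)
print(f"apparent AUC = {apparent:.3f}, optimism-corrected AUC = {corrected:.3f}")
```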

External validity
For external validity a model is developed in a cohort of patients and the validity is determined using another cohort of comparable patients.

The previously described measures, such as variance explained (R2), calibration and discrimination, are used to determine validity.
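For external validation the model is applied unchanged to the new cohort. The sketch below shows this for a hypothetical logistic model, computing the ratio of observed to expected events (a simple measure of overall calibration, often called calibration-in-the-large) and the c-statistic (AUC). The coefficients, variable names and the tiny validation cohort are all hypothetical; a real validation cohort needs a sufficient number of participants, as described above.

```python
import math

# Hypothetical model "frozen" after development; coefficients are illustrative only.
INTERCEPT = -1.0
COEFFICIENTS = {"pain": 0.5, "smoker": 0.7}


def predicted_risk(patient):
    lp = INTERCEPT + sum(b * patient[name] for name, b in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-lp))


def auc(y, score):
    pos = [s for s, yi in zip(score, y) if yi == 1]
    neg = [s for s, yi in zip(score, y) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def validate(cohort, outcomes):
    """Apply the unchanged model to a validation cohort."""
    risks = [predicted_risk(patient) for patient in cohort]
    return {
        # calibration-in-the-large: observed versus expected number of events
        "observed/expected": sum(outcomes) / sum(risks),
        # discrimination
        "c-statistic": auc(outcomes, risks),
    }


# Hypothetical validation cohort.
cohort = [{"pain": 0, "smoker": 0}, {"pain": 1, "smoker": 0},
          {"pain": 2, "smoker": 1}, {"pain": 3, "smoker": 1}]
outcomes = [0, 0, 1, 1]
print(validate(cohort, outcomes))
```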

If you would like more information about developing and/or validating prediction rules, please contact Martijn W Heymans. A new Epidm course on “Prediction modelling” will also start in 2012.


    1. Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer; 2001. (A new update will be available in June/July 2011.)
    2. Peduzzi P, Concato J, Feinstein AR, Holford TR. Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol 1995;48(12):1503-10.
    3. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15:361-87.
    4. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology 2010;21(1):128-38.
    5. Kuijpers T, van der Windt DA, Boeke AJ, Twisk JW, Vergouwe Y, Bouter LM, van der Heijden GJ. Clinical prediction rules for the prognosis of shoulder pain in general practice. Pain 2006;120(3):276-85.
    6. Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York: Springer; 2009.
    7. Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Validity of prognostic models: when is a model clinically useful? Semin Urol Oncol 2002;20:96-107.


Prognostic model: a multivariable model consisting of a combination of factors as strongly associated with the outcome as possible.

V1.0: 1 Jan 2010: English translation.
V1.1: 1 Mar 2011: Several textual changes and additions, replacement of bootstrapping by the Lasso technique, addition of reclassification tables, more emphasis on validating the model, updated references.

1.         Was the selection of the predictors based on a literature search and advice from experts?
2.         Has the outcome measure been clearly defined?
3.         Have dropouts and missing values been described, and have their potential consequences been discussed in the research report (have missing values been handled in a sensible way, e.g. multiple imputation)?
4.         Is the sample size of the study population sufficient?
5.         Has linearity been assessed for all potential predictors?
6.         Has a correlation table been created of all potential predictors?
7.         Has a (manual) backward selection been used for building the model?
8.         Has the model quality been assessed? If possible, have calibration and discrimination been assessed?
9.         Was the prediction model validated?