The window with settings for Cox regression is accessed via the menu Advanced statistics → Survival analysis → Cox PH regression.
Cox regression, also known as the Cox proportional hazard model (Cox D.R., 1972), is the most popular regression method for survival analysis. It allows the study of the impact of many independent variables ($X_1$, $X_2$, $\ldots$, $X_k$) on survival rates. The approach is, in a way, non-parametric and thus burdened with few assumptions, which is why it is so popular. The nature or shape of the hazard function does not have to be known; the only condition is one that also applies to most parametric survival models, namely hazard proportionality.
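The function on which the Cox proportional hazard model is based describes the resulting hazard and is the product of two values, only one of which depends on time ($t$):
$$h(t, X_1, X_2, \ldots, X_k) = h_0(t)\cdot\exp\left(\sum_{i=1}^{k}\beta_i X_i\right)$$
where:
$h_0(t)$ - the baseline hazard, i.e. the hazard when all explanatory variables are equal to zero,
$X_1, X_2, \ldots, X_k$ - explanatory variables independent of time,
$\beta_1, \beta_2, \ldots, \beta_k$ - parameters.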
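The product form described above can be illustrated with a minimal Python sketch. The coefficient values, the two subjects, and the function name `cox_hazard` are hypothetical and serve only to show that the ratio of two hazards does not depend on the time-dependent baseline part.

```python
import math

def cox_hazard(baseline_hazard, betas, xs):
    """Return h0(t) * exp(sum(beta_i * x_i)) for one fixed time t."""
    linear_predictor = sum(b * x for b, x in zip(betas, xs))
    return baseline_hazard * math.exp(linear_predictor)

# Hypothetical coefficients and two hypothetical subjects.
betas = [0.5, -0.2]
subject_a = [1.0, 3.0]
subject_b = [0.0, 3.0]

# Two different (hypothetical) baseline hazard values stand for two time points.
ratio_early = cox_hazard(0.01, betas, subject_a) / cox_hazard(0.01, betas, subject_b)
ratio_late = cox_hazard(0.30, betas, subject_a) / cox_hazard(0.30, betas, subject_b)

# The baseline hazard cancels out, so both ratios equal exp(0.5).
print(ratio_early, ratio_late)
```

The two subjects differ only in the first variable, so the ratio reduces to the exponent of the first coefficient, whatever the baseline hazard is at a given time.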
Dummy variables and interactions in the model
A discussion of the coding of dummy variables and interactions is presented in chapter Preparation of the variables for the analysis in multidimensional models.
Correction for ties in Cox regression is based on Breslow's method.
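The model can be transformed into the linear form:
$$\ln\left(\frac{h(t, X_1, X_2, \ldots, X_k)}{h_0(t)}\right) = \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$$
In such a case, the solution of the equation is the vector of the estimates of the parameters $\beta_1, \beta_2, \ldots, \beta_k$, called regression coefficients:
$$b_1, b_2, \ldots, b_k$$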
The coefficients are estimated by so-called partial maximum likelihood estimation. The method is called "partial" because the search for the maximum of the likelihood function (the program uses the Newton-Raphson iterative algorithm) takes place only over complete data; censored data are taken into account in the algorithm, but not directly.
There is a certain error of estimation for each coefficient. The magnitude of that error is estimated from the following formula:
$$SE_b = \sqrt{diag(H^{-1})_b}$$
where:
$diag(H^{-1})$ is the main diagonal of the covariance matrix.
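The estimation procedure can be sketched for the simplest case of a single independent variable. This is only an illustration under stated assumptions, not the program's implementation: the data are hypothetical, the function names are made up for this example, and with one variable the covariance matrix reduces to the reciprocal of the observed information. Tied event times share the same risk set, which corresponds to Breslow's correction.

```python
import math

def score_and_information(beta, times, events, x):
    """First derivative and negative second derivative of the log partial likelihood."""
    score, info = 0.0, 0.0
    for t_i, e_i, x_i in zip(times, events, x):
        if not e_i:
            continue  # censored cases contribute only through the risk sets
        # Risk set: all cases still under observation at time t_i.
        risk = [x_l for t_l, x_l in zip(times, x) if t_l >= t_i]
        w = [math.exp(beta * x_l) for x_l in risk]
        s0 = sum(w)
        mean = sum(w_l * x_l for w_l, x_l in zip(w, risk)) / s0
        mean_sq = sum(w_l * x_l * x_l for w_l, x_l in zip(w, risk)) / s0
        score += x_i - mean
        info += mean_sq - mean * mean
    return score, info

def fit_cox_one_covariate(times, events, x, max_iter=20, tol=1e-9):
    """Newton-Raphson search for the coefficient; returns (b, SE_b)."""
    beta = 0.0
    for _ in range(max_iter):          # iteration limit
        score, info = score_and_information(beta, times, events, x)
        step = score / info
        beta += step
        if abs(step) < tol:            # convergence criterion
            break
    _, info = score_and_information(beta, times, events, x)
    return beta, math.sqrt(1.0 / info)

# Hypothetical data: event = 1 is a complete observation, 0 is censored.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 0, 1, 1, 0, 1, 1]
group = [1, 1, 0, 1, 0, 1, 0, 0]

b, se_b = fit_cox_one_covariate(times, events, group)
print(b, se_b)
```

The parameters `max_iter` and `tol` play the role of an iteration limit and a convergence criterion for the iterative search.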
Note
When building a model, remember that the number of observations ($n$) should be greater than or equal to ten times the ratio of the number of estimated model parameters ($k$) to the smaller of the proportions of censored or complete observations ($p$), i.e. $n \ge 10k/p$ (Peduzzi P., et al., 1995).
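A minimal sketch of this rule of thumb follows. It assumes that the proportion $p$ is computed from the analyzed sample itself, in which case the condition $n \ge 10k/p$ is equivalent to requiring that the smaller of the two counts (censored or complete) is at least $10k$. The function name and the numbers are hypothetical.

```python
def meets_sample_size_rule(k, n_complete, n_censored):
    """Check n >= 10 * k / p, where p is the smaller of the two proportions.

    With p = min(n_complete, n_censored) / n, the condition simplifies to
    min(n_complete, n_censored) >= 10 * k, which avoids floating point division.
    """
    return min(n_complete, n_censored) >= 10 * k

# Hypothetical study: 30 complete and 70 censored observations.
print(meets_sample_size_rule(3, 30, 70))  # three parameters
print(meets_sample_size_rule(4, 30, 70))  # four parameters
```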
Note
When building the model, remember that the independent variables should not be multicollinear. In the case of multicollinearity, the estimation can be uncertain and the obtained error values very high.
Note
The convergence of the Newton-Raphson iterative algorithm can be controlled with two parameters: the iteration limit (the maximum number of iterations within which the algorithm should reach convergence) and the convergence criterion (the value below which an improvement of the estimate is considered insignificant, at which point the algorithm stops).
EXAMPLE cont. (remissionLeukemia.pqs file)
An individual hazard ratio (HR) is now calculated for each independent variable $X_i$:
$$HR_i = e^{\beta_i}$$
It expresses the change in the risk of a failure event when the independent variable grows by 1 unit. The result is adjusted for the remaining independent variables in the model: it is assumed that they remain stable while the studied independent variable grows by 1 unit.
The $HR_i$ value is interpreted as follows:
- $HR_i > 1$ means the stimulating influence of the studied independent variable on the occurrence of the failure event, i.e. it gives information about how much greater the risk of the occurrence of the failure event is when the independent variable grows by 1 unit.
- $HR_i < 1$ means the destimulating influence of the studied independent variable on the occurrence of the failure event, i.e. it gives information about how much lower the risk of the occurrence of the failure event is when the independent variable grows by 1 unit.
- $HR_i \approx 1$ means that the studied independent variable has no influence on the occurrence of the failure event.
Note
If the analysis is made for a model other than linear, or if interaction is taken into account, then, just as in the logistic regression model, we can calculate the appropriate $HR$ on the basis of the general formula which is a combination of independent variables.
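The relation between a regression coefficient and the hazard ratio can be sketched as follows. The function name and the coefficient value are hypothetical; the `units` argument is an addition for illustration and relies on the model being linear in the variable, so that a change of several units multiplies the exponent.

```python
import math

def hazard_ratio(coefficient, units=1.0):
    """Hazard ratio for an increase of `units` in one independent variable."""
    return math.exp(coefficient * units)

b = math.log(2.0)                    # hypothetical coefficient
hr_one_unit = hazard_ratio(b)        # risk doubles per 1 unit
hr_three_units = hazard_ratio(b, 3)  # 2 * 2 * 2 for 3 units
hr_no_effect = hazard_ratio(0.0)     # coefficient 0 gives HR = 1

print(hr_one_unit, hr_three_units, hr_no_effect)
```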
EXAMPLE cont. (remissionLeukemia.pqs file)
On the basis of the coefficient and its error of estimation we can infer whether the independent variable for which the coefficient was estimated has a significant effect on the dependent variable. For that purpose we use the Wald test.
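Hypotheses:
$$H_0: \beta_i = 0, \quad H_1: \beta_i \ne 0$$
or, equivalently:
$$H_0: HR_i = 1, \quad H_1: HR_i \ne 1$$
The Wald test statistic is calculated according to the formula:
$$\chi^2 = \left(\frac{b_i}{SE_{b_i}}\right)^2$$
The statistic asymptotically (for large sample sizes) has the Chi-square distribution with $1$ degree of freedom.
The p-value, determined on the basis of the test statistic, is compared with the significance level $\alpha$:
- if $p \le \alpha$, we reject $H_0$ in favor of $H_1$,
- if $p > \alpha$, there is no reason to reject $H_0$.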
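The Wald test for a single coefficient can be sketched in a few lines. The statistic is the squared ratio of the coefficient to its standard error, and for 1 degree of freedom the upper tail of the Chi-square distribution can be written with the complementary error function. The function name and the input values are hypothetical.

```python
import math

def wald_test(coefficient, standard_error):
    """Return (chi-square statistic, p-value) for H0: coefficient = 0."""
    chi2 = (coefficient / standard_error) ** 2
    # Upper tail of the Chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

chi2, p = wald_test(1.96, 1.0)  # hypothetical estimate and standard error
alpha = 0.05
print(chi2, p, "reject H0" if p <= alpha else "no reason to reject H0")
```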
A good model should fulfill two basic conditions: it should fit well and be as simple as possible. The quality of the Cox proportional hazard model can be evaluated with a few general measures based on:
$L_{FM}$ - the maximum value of the likelihood function of the full model (with all variables),
$L_0$ - the maximum value of the likelihood function of a model which contains only the constant term (no independent variables),
$d$ - the observed number of failure events.
$AIC$, $AICc$, and $BIC$ are a kind of compromise between goodness of fit and complexity. The second element of the sum in the formulas for the information criteria (the so-called penalty function) measures the simplicity of the model. It depends on the number of parameters ($k$) in the model and the number of complete observations ($d$). In both cases the element grows with the number of parameters, and the growth is faster the smaller the number of observations.
The information criterion, however, is not an absolute measure: if none of the compared models describes reality well, the information criterion will give no warning of that.
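Akaike information criterion:
$$AIC = -2\ln L_{FM} + 2k$$
It is an asymptotic criterion, appropriate for large sample sizes.
Corrected Akaike information criterion:
$$AICc = AIC + \frac{2k(k+1)}{d-k-1}$$
Because the correction of the Akaike information criterion concerns the sample size (the number of failure events), it is the recommended measure (also for smaller sizes).
Bayesian information criterion (Schwarz criterion):
$$BIC = -2\ln L_{FM} + k\ln(d)$$
Just like the corrected Akaike criterion, it takes into account the sample size (the number of failure events) (Volinsky and Raftery, 2000).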
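The three criteria can be computed from the maximized log-likelihood of the full model, the number of parameters, and the number of failure events. The sketch below assumes the standard definitions with the number of failure events used as the sample size: AIC as minus twice the log-likelihood plus 2k, AICc as AIC plus 2k(k+1)/(d-k-1), and BIC as minus twice the log-likelihood plus k times ln(d). The function name and the input values are hypothetical.

```python
import math

def information_criteria(log_lik_full, k, d):
    """Return (AIC, AICc, BIC) for k parameters and d failure events."""
    aic = -2.0 * log_lik_full + 2.0 * k
    aicc = aic + 2.0 * k * (k + 1) / (d - k - 1)  # requires d > k + 1
    bic = -2.0 * log_lik_full + k * math.log(d)
    return aic, aicc, bic

# Hypothetical model: ln L_FM = -100, 2 parameters, 30 failure events.
aic, aicc, bic = information_criteria(-100.0, 2, 30)
print(aic, aicc, bic)
```

When several candidate models are compared on the same data, the one with the lowest value of the chosen criterion is preferred.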
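The value of the $R^2_{Pseudo}$ coefficient falls within the range of $[0, 1)$, where values close to 1 mean excellent goodness of fit of the model, and 0 means a complete lack of fit. The coefficient $R^2_{Pseudo}$ is calculated according to the formula:
$$R^2_{Pseudo} = 1 - \frac{\ln L_{FM}}{\ln L_0}$$
As the coefficient $R^2_{Pseudo}$ does not assume the value 1 and is sensitive to the number of variables in the model, its corrected value is calculated:
$$R^2_{Nagelkerke} = \frac{1 - e^{-\frac{2}{d}(\ln L_{FM} - \ln L_0)}}{1 - e^{\frac{2}{d}\ln L_0}} \quad \textrm{or} \quad R^2_{Cox-Snell} = 1 - e^{-\frac{2}{d}(\ln L_{FM} - \ln L_0)}$$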
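The goodness of fit coefficients can be sketched from the two maximized log-likelihoods. The sketch assumes the standard McFadden, Cox-Snell, and Nagelkerke definitions, with the number of failure events used as the sample size. The function name and the input values are hypothetical.

```python
import math

def pseudo_r2(log_lik_full, log_lik_null, d):
    """Return (McFadden, Cox-Snell, Nagelkerke) pseudo R-squared values."""
    mcfadden = 1.0 - log_lik_full / log_lik_null
    cox_snell = 1.0 - math.exp(-(2.0 / d) * (log_lik_full - log_lik_null))
    # Nagelkerke rescales Cox-Snell by its maximum attainable value.
    nagelkerke = cox_snell / (1.0 - math.exp((2.0 / d) * log_lik_null))
    return mcfadden, cox_snell, nagelkerke

# Hypothetical values: ln L_FM = -90, ln L_0 = -100, 30 failure events.
mcfadden, cox_snell, nagelkerke = pseudo_r2(-90.0, -100.0, 30)
print(mcfadden, cox_snell, nagelkerke)
```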
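The basic tool for the evaluation of the significance of all variables in the model is the Likelihood Ratio test. The test verifies the hypothesis:
$$H_0: \textrm{all } \beta_i = 0, \quad H_1: \textrm{there is } \beta_i \ne 0$$
The test statistic has the form presented below:
$$\chi^2 = -2\ln\left(\frac{L_0}{L_{FM}}\right) = -2\ln L_0 - (-2\ln L_{FM})$$
The statistic asymptotically (for large sample sizes) has the Chi-square distribution with $k$ degrees of freedom.
The p-value, determined on the basis of the test statistic, is compared with the significance level $\alpha$:
- if $p \le \alpha$, we reject $H_0$ in favor of $H_1$,
- if $p > \alpha$, there is no reason to reject $H_0$.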
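A minimal sketch of the Likelihood Ratio test follows. It assumes the usual form of the statistic, twice the difference between the log-likelihood of the full model and that of the model without independent variables, with as many degrees of freedom as there are parameters. To stay free of external libraries, the upper tail of the Chi-square distribution is computed with a recurrence that is valid for integer degrees of freedom. The function names and the input values are hypothetical.

```python
import math

def chi2_upper_tail(x, df):
    """P(X > x) for a Chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-h) * sum of h^j / j! for j = 0 .. df/2 - 1.
        term = math.exp(-h)
        total = term
        for j in range(1, df // 2):
            term *= h / j
            total += term
        return total
    # Odd df: start from df = 1 and add one term per two extra degrees.
    total = math.erfc(math.sqrt(h))
    term = math.sqrt(h) * math.exp(-h) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= h / (j + 0.5)
    return total

def likelihood_ratio_test(log_lik_full, log_lik_null, k):
    """Return (chi-square statistic, p-value) with k degrees of freedom."""
    chi2 = -2.0 * log_lik_null - (-2.0 * log_lik_full)
    return chi2, chi2_upper_tail(chi2, k)

# Hypothetical values: ln L_FM = -90, ln L_0 = -100, 2 parameters.
lr_chi2, lr_p = likelihood_ratio_test(-90.0, -100.0, 2)
print(lr_chi2, lr_p)
```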
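Hypotheses (for the area under the ROC curve, $AUC$):
$$H_0: AUC = 0.5, \quad H_1: AUC \ne 0.5$$
The test statistic has the form:
$$Z = \frac{AUC - 0.5}{SE_{0.5}}$$
where:
$SE_{0.5}$ - the standard error of the area.
The statistic $Z$ asymptotically (for large sample sizes) has a normal distribution.
The p-value, determined on the basis of the test statistic, is compared with the significance level $\alpha$:
- if $p \le \alpha$, we reject $H_0$ in favor of $H_1$,
- if $p > \alpha$, there is no reason to reject $H_0$.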
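The test for the area under the ROC curve can be sketched as a two-sided Z test. The sketch assumes that the statistic is the difference between the area and 0.5 divided by the standard error of the area; how that standard error is obtained is not covered here, so it is passed in as a given value. The function name and the numbers are hypothetical.

```python
import math

def auc_z_test(auc, standard_error):
    """Return (Z statistic, two-sided p-value) for H0: AUC = 0.5."""
    z = (auc - 0.5) / standard_error
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

z, p_auc = auc_z_test(0.696, 0.1)  # hypothetical area and standard error
print(z, p_auc)
```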
In addition, a proposed cut-off point value for the combination of independent variables and model parameters is given for the ROC curve.
EXAMPLE cont. (remissionLeukemia.pqs file)
The analysis of the model residuals allows the verification of its assumptions. The main goal of the analysis in Cox regression is the localization of outliers and the study of hazard proportionality. Typically, in regression models, residuals are calculated as the differences between the observed and predicted values of the dependent variable. However, in the case of censored values such a method of determining the residuals is not appropriate. In the program we can analyze the following residuals: Martingale, deviance, and Schoenfeld. The residuals can be plotted against time or against independent variables.
Hazard proportionality assumption
A number of graphical methods for evaluating the goodness of fit of the proportional hazard model have been created (Lee and Wang, 2003). The most widely used are the methods based on the model residuals. As with other graphical methods of evaluating hazard proportionality, this is a subjective method. For the assumption of proportional hazard to be fulfilled, the residuals should not form any pattern with respect to time but should be randomly distributed around the value 0.
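One kind of residual that can be examined against time is the Schoenfeld residual. The sketch below covers a single independent variable and assumes the usual definition: for each failure event, the observed value of the variable minus its weighted mean over the cases still at risk, with weights given by the exponent of the coefficient times the variable. The data, the coefficient value, and the function name are hypothetical.

```python
import math

def schoenfeld_residuals(times, events, x, beta):
    """Return a list of (event time, residual) pairs for one variable."""
    residuals = []
    for t_i, e_i, x_i in zip(times, events, x):
        if not e_i:
            continue  # residuals are defined only for failure events
        risk = [x_l for t_l, x_l in zip(times, x) if t_l >= t_i]
        weights = [math.exp(beta * x_l) for x_l in risk]
        expected = sum(w * x_l for w, x_l in zip(weights, risk)) / sum(weights)
        residuals.append((t_i, x_i - expected))
    return residuals

# Hypothetical data: event = 1 is a complete observation, 0 is censored.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 0, 1, 1, 0, 1, 1]
group = [1, 1, 0, 1, 0, 1, 0, 0]

# The coefficient 1.5 is a hypothetical value, not a fitted estimate.
for t, r in schoenfeld_residuals(times, events, group, beta=1.5):
    print(t, round(r, 3))
```

Plotting the second element of each pair against the first gives the kind of picture described above: a trend over time suggests that the effect of the variable is not constant. At the fitted coefficient the residuals sum to zero, because that is the equation the estimate solves.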
If the assumption of hazard proportionality is not fulfilled for one of the variables in the Cox model, one possible solution is to perform a separate Cox analysis for each level of that variable.
EXAMPLE cont. (remissionLeukemia.pqs file)