
Non-parametric tests

The Kendall's concordance coefficient and a test to examine its significance

The Kendall's $\widetilde{W}$ coefficient of concordance is described in the works of Kendall and Babington-Smith (1939)1) and Wallis (1939)2). It is used when the results come from different sources (different raters) and concern a few ($k\geq2$) objects, and when the concordance of these assessments needs to be evaluated. It is often used to measure the strength of interrater reliability, i.e. the degree of concordance between the raters' assessments.

The Kendall's coefficient of concordance is calculated on an ordinal scale or an interval scale. Its value is calculated according to the following formula:

\begin{displaymath}
\widetilde{W}=\frac{12U-3n^2k(k+1)^2}{n^2k(k^2-1)-nC},
\end{displaymath}

where:

$n$ – number of different assessments sets (the number of raters),

$k$ – number of ranked objects,

$\displaystyle U=\sum_{j=1}^k\left(\sum_{i=1}^nR_{ij}\right)^2$,

$R_{ij}$ – ranks assigned to the successive objects $(j=1,2,...k)$, independently for each rater $(i=1,2,...n)$,

$\displaystyle C=\sum(t^3-t)$ – a correction for ties,

$t$ – number of cases included in a tie.

The coefficient's formula includes $C$ – the correction for ties. This correction matters only when ties occur; if there are no ties, $C=0$ and the correction has no effect.
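The formula can be sketched in a few lines of Python. This is a minimal illustration, not the program's implementation: the function name `kendall_w` and the sample ranks are hypothetical, and ties are assumed to be already expressed as averaged ranks within each rater.

```python
def kendall_w(ranks):
    """Kendall's W for an n x k matrix of ranks (rows = raters, columns = objects)."""
    n = len(ranks)       # number of raters
    k = len(ranks[0])    # number of ranked objects
    # U: sum over objects of the squared rank sums
    u = sum(sum(row[j] for row in ranks) ** 2 for j in range(k))
    # C: tie correction, t^3 - t summed over every group of tied ranks of each rater
    c = 0
    for row in ranks:
        for value in set(row):
            t = row.count(value)
            c += t ** 3 - t
    return (12 * u - 3 * n ** 2 * k * (k + 1) ** 2) / (n ** 2 * k * (k ** 2 - 1) - n * c)


# Hypothetical ranks: 3 raters, 4 objects; the third rater ties two objects at rank 2.5.
example = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [1, 2.5, 2.5, 4],
]
print(round(kendall_w(example), 4))
```

When no rater produces ties, every group has $t=1$, so the loop adds nothing and the denominator reduces to $n^2k(k^2-1)$.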

Note

$W$ – the Kendall's coefficient in a population;

$\widetilde{W}$ – the Kendall's coefficient in a sample.

The value of $\widetilde{W}\in[0, 1]$ and it should be interpreted in the following way:

  • $\widetilde{W}\approx1$ means a strong concordance in the raters' assessments;
  • $\widetilde{W}\approx0$ means a lack of concordance in the raters' assessments.

The Kendall's W coefficient of concordance vs. the Spearman coefficient:

  • When the values of the Spearman $r_s$ correlation coefficient (for all possible pairs) are calculated, the average coefficient – marked by $\bar{r}_s$ is a linear function of $\widetilde{W}$ coefficient:

\begin{displaymath}
\bar{r}_s=\frac{n\widetilde{W}-1}{n-1}
\end{displaymath}
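The linear relation can be checked numerically. The sketch below assumes rankings without ties (so the classic $1-6\sum d^2/(k(k^2-1))$ form of Spearman's coefficient applies); the helper names and the data are hypothetical.

```python
from itertools import combinations


def spearman_no_ties(a, b):
    """Spearman's r_s for two untied rankings of the same k objects."""
    k = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (k * (k ** 2 - 1))


def kendall_w_no_ties(ranks):
    """Kendall's W without the tie correction (C = 0)."""
    n, k = len(ranks), len(ranks[0])
    u = sum(sum(row[j] for row in ranks) ** 2 for j in range(k))
    return (12 * u - 3 * n ** 2 * k * (k + 1) ** 2) / (n ** 2 * k * (k ** 2 - 1))


# Hypothetical untied rankings: 3 raters, 4 objects.
ranks = [
    [1, 2, 3, 4],
    [2, 1, 3, 4],
    [1, 3, 2, 4],
]
pairs = list(combinations(ranks, 2))
mean_rs = sum(spearman_no_ties(a, b) for a, b in pairs) / len(pairs)

n = len(ranks)
from_w = (n * kendall_w_no_ties(ranks) - 1) / (n - 1)
print(mean_rs, from_w)  # both values agree up to rounding error
```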

The Kendall's W coefficient of concordance vs. the Friedman ANOVA:

  • The Kendall's $\widetilde{W}$ coefficient of concordance and the Friedman ANOVA are based on the same mathematical model. As a result, the value of the chi-square test statistic for the Kendall's coefficient of concordance and the value of the chi-square test statistic for the Friedman ANOVA are the same.

The chi-square test of significance for the Kendall's coefficient of concordance

Basic assumptions:

Hypotheses:

\begin{array}{cl}
\mathcal{H}_0: &  W=0\\
\mathcal{H}_1: &  W\neq0
\end{array}

The test statistic is defined by:

\begin{displaymath}
\chi^2=n(k-1)\widetilde{W}
\end{displaymath}

This statistic asymptotically (for large sample sizes) has the chi-square distribution with $df=k-1$ degrees of freedom.

The p-value, determined on the basis of the test statistic, is compared with the significance level $\alpha$:

\begin{array}{ccl}
$ if $ p \le \alpha & \Longrightarrow & $ reject $ \mathcal{H}_0 $ and accept $ 	\mathcal{H}_1, \\
$ if $ p > \alpha & \Longrightarrow & $ there is no reason to reject $ \mathcal{H}_0. \\
\end{array}
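The test procedure can be sketched with the standard library only. The helper `chi2_sf` below is an assumption of this example (it is not part of the program): it evaluates the upper tail of the chi-square distribution for integer degrees of freedom through the usual recurrence on $df$. The input values in the usage line are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        tail, nu = math.exp(-half), 2                 # exact tail for df = 2
    else:
        tail, nu = math.erfc(math.sqrt(half)), 1      # exact tail for df = 1
    while nu < df:
        # step from nu to nu + 2 degrees of freedom
        tail += half ** (nu / 2) * math.exp(-half) / math.gamma(nu / 2 + 1)
        nu += 2
    return tail


def kendall_w_test(w, n, k, alpha=0.05):
    """Chi-square test of H0: W = 0 for n raters and k objects."""
    stat = n * (k - 1) * w
    df = k - 1
    p = chi2_sf(stat, df)
    return stat, df, p, p <= alpha


# Hypothetical input: W = 0.5 obtained from 10 raters ranking 4 objects.
stat, df, p, reject = kendall_w_test(0.5, n=10, k=4)
print(stat, df, p, reject)
```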

The settings window with the test of the Kendall's W significance can be opened in Statistics menu → NonParametric tests → Kendall's W or in ''Wizard''.

EXAMPLE (judges.pqs file)

In the 6.0 system, dancing pairs are graded by 9 judges. The judges award points for, e.g., artistic expression. They assess the dancing pairs without comparing them with one another and without assigning them to a particular "podium place" (they create a ranking). Let's check whether the judges' assessments are concordant.

\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
Judges&Couple A&Couple B&Couple C&Couple D&Couple E&Couple F\\\hline
S1&3&6&2&5&4&1\\
S2&4&6&1&5&3&2\\
S3&4&6&2&5&3&1\\
S4&2&6&3&5&4&1\\
S5&2&6&1&5&4&3\\
S6&3&5&1&6&4&2\\
S7&5&4&1&6&3&2\\
S8&3&6&2&5&4&1\\
S9&2&6&3&5&4&1\\\hline
\end{tabular}

Hypotheses:

$\begin{array}{cl}
\mathcal{H}_0: & $a lack of concordance between 9 judges assessments,$\\
& $in the population represented by the sample, $\\
\mathcal{H}_1: & $the 9 judges assessments in the population represented$\\
& $by the sample are concordant.$
\end{array}$

Comparing p<0.0001 with the significance level $\alpha=0.05$, we conclude that the judges' assessments are statistically concordant. The concordance is strong: $\widetilde{W} = 0.83$, and so is the average Spearman's rank-order correlation coefficient: $\bar{r}_s = 0.81$. This result can be presented in a graph whose X-axis represents the successive judges. The more the lines intersect (with perfect concordance the lines would be parallel to the X-axis), the lower the concordance of the raters' evaluations.
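The reported coefficients can be reproduced directly from the table of ranks. The short script below is only a cross-check of the arithmetic, written under the assumption that the rankings contain no ties (which holds for this table).

```python
# Ranks from the table: one row per judge (S1..S9), one column per couple (A..F).
ranks = [
    [3, 6, 2, 5, 4, 1],
    [4, 6, 1, 5, 3, 2],
    [4, 6, 2, 5, 3, 1],
    [2, 6, 3, 5, 4, 1],
    [2, 6, 1, 5, 4, 3],
    [3, 5, 1, 6, 4, 2],
    [5, 4, 1, 6, 3, 2],
    [3, 6, 2, 5, 4, 1],
    [2, 6, 3, 5, 4, 1],
]
n, k = len(ranks), len(ranks[0])

# Rank sums per couple and the quantity U from the formula for W
col_sums = [sum(row[j] for row in ranks) for j in range(k)]
u = sum(s ** 2 for s in col_sums)

# No ties, so the correction C equals 0
w = (12 * u - 3 * n ** 2 * k * (k + 1) ** 2) / (n ** 2 * k * (k ** 2 - 1))
mean_rs = (n * w - 1) / (n - 1)
chi2 = n * (k - 1) * w

print(round(w, 2), round(mean_rs, 2), round(chi2, 2))  # 0.83 0.81 37.51
```

With $df=5$, a chi-square statistic of about 37.5 lies far beyond the usual critical values, which is consistent with p<0.0001.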

2022/02/09 12:56

The Cohen's Kappa coefficient and the test examining its significance

The Cohen's Kappa coefficient (Cohen J. (1960)3)) describes the level of agreement between two measurements of the same variable taken under different conditions. The variable can be measured by 2 different observers (reproducibility) or twice by the same observer (repeatability). The $\hat \kappa$ coefficient is calculated for categorical dependent variables and takes values from -1 to 1. A value of 1 means full agreement, while a value of 0 means agreement at the level that would occur if the data were distributed in the contingency table at random. Values between 0 and -1 rarely occur in practice; a negative $\hat \kappa$ means agreement lower than that expected for randomly distributed data. The $\hat \kappa$ coefficient can be calculated from raw data or from a $c\times c$ contingency table.

Unweighted Kappa (i.e., Cohen's Kappa) or weighted Kappa can be determined as needed. The assigned weights ($w_{ij}$) refer to individual cells of the contingency table: on the diagonal they are equal to 1 and off the diagonal they belong to the range $[0, 1)$.

Unweighted Kappa

It is calculated for data whose categories cannot be ordered, e.g. patients divided according to the diagnosed disease, where the diseases have no natural order: pneumonia $(1)$, bronchitis $(2)$ and other $(3)$. In such a situation, the concordance of the diagnoses given by two doctors can be checked with the unweighted Kappa, i.e. Cohen's Kappa. The discordant pairs $\{(1), (3)\}$ and $\{(1), (2)\}$ are treated equivalently, so all weights off the diagonal of the weight matrix are set to zero.

Weighted Kappa

When the data categories can be ordered, e.g. patients divided by lesion grade into: no lesion $(1)$, benign lesion $(2)$, suspected cancer $(3)$, cancer $(4)$, the concordance of the ratings given by two radiologists can be assessed taking this order into account. The pair of ratings $\{(1), (4)\}$ may then be considered more discordant than the pair $\{(1), (2)\}$. For the order of the categories to affect the concordance score in this way, the weighted Kappa should be determined.

The assigned weights can be in linear or quadratic form.

  • Linear weights (Cicchetti, 19714)) – calculated according to the formula:

\begin{displaymath}
w_{ij}=1-\frac{|i-j|}{c-1}.
\end{displaymath}

The greater the distance from the diagonal of the matrix the smaller the weight, with the weights decreasing proportionally. Example weights for matrices of size 5×5 are shown in the table:

\begin{tabular}{|c|c|c|c|c|}
\hline1&0.75&0.5&0.25&0\\\hline
0.75&1&0.75&0.5&0.25\\\hline
0.5&0.75&1&0.75&0.5\\\hline
0.25&0.5&0.75&1&0.75\\\hline
0&0.25&0.5&0.75&1\\\hline
\end{tabular}

  • Quadratic weights (Cohen, 19685)) – calculated according to the formula:

\begin{displaymath}
w_{ij}=1-\frac{(i-j)^2}{(c-1)^2}.
\end{displaymath}

The greater the distance from the diagonal of the matrix, the smaller the weight, with weights decreasing more slowly at closer distances from the diagonal and more rapidly at farther distances. Example weights for matrices of size 5×5 are shown in the table:

\begin{tabular}{|c|c|c|c|c|}
\hline1&0.9375&0.75&0.4375&0\\\hline
0.9375&1&0.9375&0.75&0.4375\\\hline
0.75&0.9375&1&0.9375&0.75\\\hline
0.4375&0.75&0.9375&1&0.9375\\\hline
0&0.4375&0.75&0.9375&1\\\hline
\end{tabular}
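Both weight tables can be generated from the formulas above. The function below is an illustrative sketch; its name and the `kind` argument are hypothetical and do not correspond to any option of the program.

```python
def kappa_weights(c, kind="linear"):
    """Return a c x c matrix of agreement weights (1 on the diagonal)."""
    if kind == "linear":
        return [[1 - abs(i - j) / (c - 1) for j in range(c)] for i in range(c)]
    if kind == "quadratic":
        return [[1 - (i - j) ** 2 / (c - 1) ** 2 for j in range(c)] for i in range(c)]
    if kind == "unweighted":
        # identity matrix: every disagreement gets weight 0
        return [[1.0 if i == j else 0.0 for j in range(c)] for i in range(c)]
    raise ValueError(f"unknown weight kind: {kind}")


for row in kappa_weights(5, "linear"):
    print(row)
for row in kappa_weights(5, "quadratic"):
    print(row)
```

For $c=5$ the first rows are 1, 0.75, 0.5, 0.25, 0 (linear) and 1, 0.9375, 0.75, 0.4375, 0 (quadratic), matching the two tables.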

Quadratic weights are of greater interest because of the practical interpretation of the Kappa coefficient, which in this case is the same as the intraclass correlation coefficient 6). To determine the Kappa coefficient, the data are presented as a table of observed counts $O_{ij}$, which is then transformed into a contingency table of probabilities $p_{ij}=O_{ij}/n$.

The Kappa coefficient ($\hat \kappa$) is expressed by the formula:

\begin{displaymath}
\hat \kappa=\frac{P_o-P_e}{1-P_e},
\end{displaymath}

where:

$P_o=\sum_{i=1}^c\sum_{j=1}^c w_{ij}p_{ij}$,

$P_e=\sum_{i=1}^c\sum_{j=1}^c w_{ij}p_{i.}p_{.j}$,

$p_{i.}$, $p_{.j}$ – row and column totals of the probability contingency table.
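As an illustration, the coefficient can be computed from a table of observed counts as follows. The function name and the 2×2 table are hypothetical; when no weights are passed, the identity matrix is assumed, which gives the unweighted Kappa.

```python
def cohen_kappa(observed, weights=None):
    """(Weighted) Kappa from a c x c table of observed counts."""
    c = len(observed)
    n = sum(sum(row) for row in observed)
    if weights is None:
        weights = [[1.0 if i == j else 0.0 for j in range(c)] for i in range(c)]
    # probability table and its marginal totals
    p = [[observed[i][j] / n for j in range(c)] for i in range(c)]
    row_tot = [sum(p[i]) for i in range(c)]
    col_tot = [sum(p[i][j] for i in range(c)) for j in range(c)]
    # observed and chance-expected (weighted) agreement
    p_o = sum(weights[i][j] * p[i][j] for i in range(c) for j in range(c))
    p_e = sum(weights[i][j] * row_tot[i] * col_tot[j] for i in range(c) for j in range(c))
    return (p_o - p_e) / (1 - p_e)


# Hypothetical counts for two raters and two categories (n = 50).
table = [[20, 5], [10, 15]]
print(round(cohen_kappa(table), 4))  # 0.4
```

Here $P_o=0.7$ and $P_e=0.5\cdot0.6+0.5\cdot0.4=0.5$, so $\hat \kappa=(0.7-0.5)/(1-0.5)=0.4$.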

Note

$\hat \kappa$ denotes the concordance coefficient in the sample, while $\kappa$ in the population.

The standard error for Kappa is expressed by the formula:

\begin{displaymath}
SE_{\hat \kappa}=\frac{1}{(1-P_e)\sqrt{n}}\sqrt{\sum_{i=1}^{c}\sum_{j=1}^{c}p_{i.}p_{.j}[w_{ij}-(\overline{w}_{i.}+\overline{w}_{.j})]^2-P_e^2}
\end{displaymath}

where:

$\overline{w}_{i.}=\sum_{j=1}^{c}p_{.j}w_{ij}$,

$\overline{w}_{.j}=\sum_{i=1}^{c}p_{i.}w_{ij}$.
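A sketch of this standard error, following the formula for $SE_{\hat \kappa}$ as written in this section. The function name and the data are hypothetical, and the weight matrix is passed in explicitly.

```python
import math


def kappa_se(observed, weights):
    """Standard error SE_kappa for a c x c table of counts and a weight matrix."""
    c = len(observed)
    n = sum(sum(row) for row in observed)
    p = [[observed[i][j] / n for j in range(c)] for i in range(c)]
    row_tot = [sum(p[i]) for i in range(c)]
    col_tot = [sum(p[i][j] for i in range(c)) for j in range(c)]
    cells = [(i, j) for i in range(c) for j in range(c)]
    p_e = sum(weights[i][j] * row_tot[i] * col_tot[j] for i, j in cells)
    # weighted averages of the weights over columns and over rows
    w_row = [sum(col_tot[j] * weights[i][j] for j in range(c)) for i in range(c)]
    w_col = [sum(row_tot[i] * weights[i][j] for i in range(c)) for j in range(c)]
    s = sum(
        row_tot[i] * col_tot[j] * (weights[i][j] - (w_row[i] + w_col[j])) ** 2
        for i, j in cells
    )
    return math.sqrt(s - p_e ** 2) / ((1 - p_e) * math.sqrt(n))


# Hypothetical 2 x 2 counts with identity weights (unweighted Kappa).
table = [[20, 5], [10, 15]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(kappa_se(table, identity))
```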

The Z test of significance for the Cohen's Kappa ($\hat \kappa$) (Fleiss, 20037)) is used to verify the hypothesis about the agreement of two measurements $X^{(1)}$ and $X^{(2)}$ of the feature $X$, and it is based on the $\hat \kappa$ coefficient calculated for the sample.

Basic assumptions:

Hypotheses:

\begin{array}{cl}
\mathcal{H}_0: & \kappa= 0, \\
\mathcal{H}_1: & \kappa \ne 0.
\end{array}

The test statistic is defined by:

\begin{displaymath}
Z=\frac{\hat \kappa}{SE_{\kappa_{distr}}},
\end{displaymath}

where:

$\displaystyle{SE_{\kappa_{distr}}=\frac{1}{(1-P_e)\sqrt{n}}\sqrt{\sum_{i=1}^c\sum_{j=1}^c p_{ij}[w_{ij}-(\overline{w}_{i.}+\overline{w}_{.j})(1-\hat \kappa)]^2-[\hat \kappa-P_e(1-\hat \kappa)]^2}}$.

The $Z$ statistic asymptotically (for a large sample size) has the normal distribution.
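The whole test can be sketched as below. The code follows the formula for $SE_{\kappa_{distr}}$ exactly as written in this section and uses a two-sided p-value from the normal distribution, in line with $\mathcal{H}_1: \kappa \ne 0$. The function name and the data are hypothetical.

```python
import math


def kappa_z_test(observed, weights):
    """Return (kappa, SE_distr, Z, two-sided p-value) for a c x c table of counts."""
    c = len(observed)
    n = sum(sum(row) for row in observed)
    p = [[observed[i][j] / n for j in range(c)] for i in range(c)]
    row_tot = [sum(p[i]) for i in range(c)]
    col_tot = [sum(p[i][j] for i in range(c)) for j in range(c)]
    cells = [(i, j) for i in range(c) for j in range(c)]
    p_o = sum(weights[i][j] * p[i][j] for i, j in cells)
    p_e = sum(weights[i][j] * row_tot[i] * col_tot[j] for i, j in cells)
    kappa = (p_o - p_e) / (1 - p_e)
    w_row = [sum(col_tot[j] * weights[i][j] for j in range(c)) for i in range(c)]
    w_col = [sum(row_tot[i] * weights[i][j] for i in range(c)) for j in range(c)]
    s = sum(
        p[i][j] * (weights[i][j] - (w_row[i] + w_col[j]) * (1 - kappa)) ** 2
        for i, j in cells
    )
    se = math.sqrt(s - (kappa - p_e * (1 - kappa)) ** 2) / ((1 - p_e) * math.sqrt(n))
    z = kappa / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return kappa, se, z, p_value


# Hypothetical 2 x 2 counts with identity weights (unweighted Kappa).
table = [[20, 5], [10, 15]]
identity = [[1.0, 0.0], [0.0, 1.0]]
kappa, se, z, p_value = kappa_z_test(table, identity)
print(kappa, se, z, p_value)
```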

The p-value, determined on the basis of the test statistic, is compared with the significance level $\alpha$:

\begin{array}{ccl}
$ if $ p \le \alpha & \Longrightarrow & $ reject $ \mathcal{H}_0 $ and accept $ 	\mathcal{H}_1, \\
$ if $ p > \alpha & \Longrightarrow & $ there is no reason to reject $ \mathcal{H}_0. \\
\end{array}

The settings window with the test of Cohen's Kappa significance can be opened in Statistics menu → NonParametric tests → Kappa-Cohen or in ''Wizard''.

EXAMPLE (diagnosis.pqs file)

We want to analyse the agreement of diagnoses made by 2 doctors. To do this, 110 patients (children) are drawn from a population. The doctors see patients in neighbouring offices. Each patient is examined first by doctor A and then by doctor B. Both doctors' diagnoses are shown in the table below.

Hypotheses:

\begin{array}{cl}
\mathcal{H}_0: & \kappa= 0, \\
\mathcal{H}_1: & \kappa \ne 0.
\end{array}

We could analyse the agreement of the diagnoses using just the percentage of the compatible values. In this example, the compatible diagnoses were made for 73 patients (31+39+3=73) which is 66.36% of the analysed group. The kappa coefficient introduces the correction of a chance agreement (it takes into account the agreement occurring by chance).

The chance-adjusted agreement $\hat \kappa=44.58\%$ is smaller than the agreement that is not adjusted for chance.

The p-value is p<0.0001. At the significance level $\alpha=0.05$, this result indicates agreement between the 2 doctors' opinions.

EXAMPLE (radiology.pqs file)

Radiological imaging assessed liver damage in the following categories: no changes (1), mild changes (2), suspicion of cancer $(3)$, cancer $(4)$. The evaluation was done by two independent radiologists based on a group of 70 patients. We want to check the concordance of the diagnosis.

Hypotheses:

\begin{array}{cl}
\mathcal{H}_0: & \kappa= 0, \\
\mathcal{H}_1: & \kappa \ne 0.
\end{array}

Because the diagnosis is issued on an ordinal scale, an appropriate measure of concordance would be the weighted Kappa coefficient.

Because the data are mainly concentrated on the main diagonal of the matrix and in close proximity to it, the coefficient weighted by the linear weights is lower ($\hat \kappa= 0.39$) than the coefficient determined for the quadratic weights ($\hat \kappa= 0.42$). In both situations, this is a statistically significant result (at the $\alpha=0.05$ significance level), p<0.0001.

If there were a large disagreement between the ratings in the two extreme categories, i.e. if the pair (no changes, cancer) located in the upper right corner of the table occurred far more often, e.g. 15 times, then this disagreement would be more apparent with quadratic weights (the Kappa coefficient would drop dramatically) than with linear weights.


The Fleiss' Kappa coefficient and a test to examine its significance

This coefficient determines the concordance of measurements conducted by several judges (Fleiss, 19718)) and is an extension of Cohen's Kappa coefficient, which allows testing the concordance of only two judges. Note that each of the $n$ randomly selected objects can be judged by a different random set of $k$ judges. The analysis is based on data transformed into a table with $n$ rows and $c$ columns, where $c$ is the number of possible categories to which the judges assign the test objects. Each cell of the table thus gives $x_{ij}$, the number of judges who assigned object $i$ to category $j$.
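The transformation into the $n\times c$ table of counts can be sketched as follows. The function name and the ratings are hypothetical; the category labels are borrowed from the temperament example later in this section.

```python
def ratings_to_counts(ratings, categories):
    """Turn per-object lists of judge ratings into an n x c table of counts x_ij."""
    return [[obj.count(cat) for cat in categories] for obj in ratings]


categories = ["choleric", "phlegmatic", "melancholic", "sanguine"]

# Hypothetical ratings: 3 objects, each judged by k = 3 judges.
ratings = [
    ["sanguine", "sanguine", "sanguine"],
    ["choleric", "phlegmatic", "choleric"],
    ["melancholic", "phlegmatic", "sanguine"],
]
x = ratings_to_counts(ratings, categories)
for row in x:
    print(row)
```

Each row of the resulting table sums to $k$, the number of judges who rated that object.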

The Kappa coefficient ($\hat \kappa$) is then expressed by the formula:

\begin{displaymath}
\hat \kappa=\frac{P_o-P_e}{1-P_e},
\end{displaymath}

where:

$P_o=\frac{1}{kn(k-1)}\left(\sum_{i=1}^n\sum_{j=1}^c x_{ij}^2-kn\right)$,

$P_e=\sum_{j=1}^c q_j^2$,

$q_j=\frac{1}{kn}\sum_{i=1}^n x_{ij}$.
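A minimal sketch of the coefficient in Fleiss' (1971) formulation, where $P_o$ is computed from the squared counts $x_{ij}^2$ and $q_j$ is the overall share of ratings falling into category $j$. The function name and the data are hypothetical, and every object is assumed to be rated by the same number of judges $k$.

```python
def fleiss_kappa(x):
    """Fleiss' Kappa from an n x c table of counts x_ij (rows = objects)."""
    n = len(x)        # number of objects
    c = len(x[0])     # number of categories
    k = sum(x[0])     # number of judges per object
    if any(sum(row) != k for row in x):
        raise ValueError("every object must be rated by the same number of judges")
    # observed agreement, based on the sum of squared counts
    p_o = (sum(v * v for row in x for v in row) - k * n) / (k * n * (k - 1))
    # share of all ratings falling into each category
    q = [sum(row[j] for row in x) / (k * n) for j in range(c)]
    p_e = sum(qj ** 2 for qj in q)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical counts: 3 objects, 3 judges, 2 categories.
x = [[3, 0], [0, 3], [2, 1]]
print(round(fleiss_kappa(x), 4))  # 0.55
```

For these counts $P_o=14/18$ and $P_e=41/81$, which gives $\hat \kappa=22/40=0.55$.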

A value of $\hat \kappa=1$ indicates full agreement among judges, while $\hat \kappa = 0$ indicates the concordance that would arise if the judges' opinions were given at random. Negative values of Kappa, on the other hand, indicate concordance less than that at random.

For a coefficient of $\hat \kappa$ the standard error $SE$ can be determined, which allows statistical significance to be tested and asymptotic confidence intervals to be determined.

Z test for significance of Fleiss' Kappa coefficient ($\hat \kappa$) (Fleiss, 20039)) is used to test the hypothesis that the ratings of several judges are consistent and is based on the coefficient $\hat \kappa$ calculated for the sample.

Basic assumptions:

  • measurement on a nominal scale – possible category ordering is not taken into account.

Hypotheses:

\begin{array}{cl}
\mathcal{H}_0: & \kappa= 0, \\
\mathcal{H}_1: & \kappa \ne 0.
\end{array}

The test statistic has the form:

\begin{displaymath}
Z=\frac{\hat \kappa}{SE},
\end{displaymath}

The $Z$ statistic asymptotically (for large sample sizes) has the normal distribution.

The p-value, determined on the basis of the test statistic, is compared with the significance level $\alpha$:

\begin{array}{ccl}
$ if $ p \le \alpha & \Longrightarrow & $ reject $ \mathcal{H}_0 $ and accept $ 	\mathcal{H}_1, \\
$ if $ p > \alpha & \Longrightarrow & $ there is no reason to reject $ \mathcal{H}_0. \\
\end{array}

Note

The determination of Fleiss's Kappa coefficient is conceptually similar to the Mantel-Haenszel method. The determined Kappa is a general measure that summarizes the concordance of all judge ratings and can be determined as the Kappa formed from individual layers, which are specific judge ratings (Fleiss, 200310)). Therefore, as a summary of each layer, the judges' concordance (Kappa coefficient) can be determined summarizing each possible rating separately.

The settings window with the test of the Fleiss's Kappa significance can be opened in Statistics menu → NonParametric tests → Fleiss Kappa.

EXAMPLE (temperament.pqs file)

20 volunteers take part in a game to determine their personality type. Each volunteer has a rating given by 7 different observers (usually people from their close circle or family). Each observer has been introduced to the basic traits describing temperament in each personality type: choleric, phlegmatic, melancholic, sanguine. We examine observers' concordance in assigning personality types. An excerpt of the data is shown in the table below.

Hypotheses:

\begin{array}{cl}
\mathcal{H}_0: & \kappa= 0, \\
\mathcal{H}_1: & \kappa \ne 0.
\end{array}

We observe a modest Kappa coefficient of 0.24, which is nevertheless statistically significant (p<0.0001), indicating non-random agreement between the judges' ratings. The significant concordance applies to each grade, as evidenced by the concordance summary report for each stratum (for each grade) and the graph showing the individual Kappa coefficients and the Kappa summarizing the total.

It may be interesting to note that the highest concordance is for the evaluation of phlegmatics (Kappa=0.48).

With a small number of people observed, it is also useful to make a graph showing how observers rated each person.

In this case, only person no. 14 received an unambiguous personality type rating: sanguine. Persons no. 13 and 16 were assessed as phlegmatic by 6 observers (out of 7 possible). For the remaining persons, there was slightly less agreement in the ratings. The hardest personality type to define seems to be that of the last person, who received the most diverse set of ratings.

1)
Kendall M.G., Babington-Smith B. (1939), The problem of m rankings. Annals of Mathematical Statistics, 10, 275-287
2)
Wallis W.A. (1939), The correlation ratio for ranked data. Journal of the American Statistical Association, 34, 533-538
3)
Cohen J. (1960), A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46
4)
Cicchetti D., Allison T. (1971), A new procedure for assessing reliability of scoring EEG sleep recordings. American Journal of EEG Technology, 11, 101-109
5)
Cohen J. (1968), Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220
6)
Fleiss J.L., Cohen J. (1973), The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613-619
7), 9), 10)
Fleiss J.L., Levin B., Paik M.C. (2003), Statistical methods for rates and proportions. 3rd ed. (New York: John Wiley) 598-626
8)
Fleiss J.L. (1971), Measuring nominal scale agreement among many raters. Psychological Bulletin, 76 (5), 378-382