The Fleiss Kappa coefficient and a test to examine its significance

This coefficient determines the concordance of measurements made by multiple judges (Fleiss, 1971)1) and is an extension of Cohen's Kappa coefficient, which allows testing the concordance of only two judges. Note that each of $n$ randomly selected objects may be judged by a different random set of $k$ judges. The analysis is based on data transformed into a table with $n$ rows and $c$ columns, where $c$ is the number of possible categories to which the judges assign the objects under study. Each cell of the table then contains $x_{ij}$, the number of judges who assigned object $i$ to the category given in column $j$.
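
For illustration, here is a minimal sketch in Python (not part of PQStat) of building such an $n \times c$ count table from raw judgments; the category names, the data, and the helper `count_table` are all hypothetical:

```python
# A sketch of building the n x c count table from raw ratings.
# Rows are objects, columns are categories; x[i][j] is the number
# of judges who assigned object i to category j.
from collections import Counter

categories = ["choleric", "phlegmatic", "melancholic", "sanguine"]  # c = 4

# Hypothetical raw data: each inner list holds the k = 3 judges'
# ratings for one object (the judges may differ between objects).
ratings = [
    ["choleric", "choleric", "sanguine"],
    ["phlegmatic", "phlegmatic", "phlegmatic"],
    ["melancholic", "sanguine", "sanguine"],
]

def count_table(ratings, categories):
    table = []
    for judgments in ratings:
        counts = Counter(judgments)
        table.append([counts.get(cat, 0) for cat in categories])
    return table

x = count_table(ratings, categories)
# Each row of x sums to k, the number of judges rating that object.
```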

The Kappa coefficient ($\hat \kappa$) is then expressed by the formula:

\begin{displaymath}
\hat \kappa=\frac{P_o-P_e}{1-P_e},
\end{displaymath}

where:

$P_o=\frac{1}{kn(k-1)}\left(\sum_{i=1}^n\sum_{j=1}^c x_{ij}^2-kn\right)$,

$P_e=\sum_{j=1}^c q_j^2$,

$q_j=\frac{1}{kn}\sum_{i=1}^n x_{ij}$.

A value of $\hat \kappa=1$ indicates full agreement among the judges, while $\hat \kappa = 0$ indicates the level of agreement that would arise if the judges' ratings were assigned at random. Negative values of Kappa indicate agreement weaker than would be expected by chance.
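
Putting the formulas above together, a minimal sketch of the coefficient's computation, assuming an $n \times c$ count table `x` whose rows all sum to the same number of judges $k$:

```python
def fleiss_kappa(x):
    """Fleiss' Kappa for an n x c table of category counts x[i][j];
    every row is assumed to sum to the same number of judges k."""
    n, c = len(x), len(x[0])
    k = sum(x[0])
    # q_j: overall proportion of all judgments falling in category j
    q = [sum(row[j] for row in x) / (k * n) for j in range(c)]
    # P_o: observed agreement
    p_o = (sum(v * v for row in x for v in row) - k * n) / (k * n * (k - 1))
    # P_e: agreement expected by chance
    p_e = sum(qj * qj for qj in q)
    return (p_o - p_e) / (1 - p_e)
```

As a cross-check, the same coefficient is also implemented as `fleiss_kappa` in `statsmodels.stats.inter_rater`.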

For the coefficient $\hat \kappa$ the standard error $SE$ can be determined, which allows its statistical significance to be tested and asymptotic confidence intervals to be determined.

The Z test for the significance of Fleiss' Kappa coefficient ($\hat \kappa$) (Fleiss, 2003)2) is used to verify the hypothesis that the ratings of several judges are concordant, and is based on the coefficient $\hat \kappa$ calculated for the sample.

Basic assumptions:

  • measurement on a nominal scale – possible category ordering is not taken into account.

Hypotheses:

\begin{array}{cl}
\mathcal{H}_0: & \kappa= 0, \\
\mathcal{H}_1: & \kappa \ne 0.
\end{array}

The test statistic has the form:

\begin{displaymath}
Z=\frac{\hat \kappa}{SE}.
\end{displaymath}

The $Z$ statistic asymptotically (for large sample sizes) follows a normal distribution.

The p-value, determined on the basis of the test statistic, is compared with the significance level $\alpha$:

\begin{array}{ccl}
$ if $ p \le \alpha & \Longrightarrow & $ reject $ \mathcal{H}_0 $ and accept $ \mathcal{H}_1, \\
$ if $ p > \alpha & \Longrightarrow & $ there is no reason to reject $ \mathcal{H}_0. \\
\end{array}
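
The whole test can be sketched as follows, using the large-sample standard error of $\hat \kappa$ under the null hypothesis; the function name `fleiss_kappa_test` is illustrative and this is not PQStat's exact implementation:

```python
import math

def fleiss_kappa_test(x):
    """Z test of H0: kappa = 0 for an n x c count table x, using the
    large-sample standard error of kappa under the null hypothesis
    (Fleiss, 1971). A sketch assuming an equal number of judges k
    per object."""
    n, c = len(x), len(x[0])
    k = sum(x[0])
    q = [sum(row[j] for row in x) / (k * n) for j in range(c)]
    p_o = (sum(v * v for row in x for v in row) - k * n) / (k * n * (k - 1))
    p_e = sum(qj * qj for qj in q)
    kappa = (p_o - p_e) / (1 - p_e)
    # standard error of kappa under H0
    s = sum(qj * (1 - qj) for qj in q)
    t = sum(qj * (1 - qj) * (1 - 2 * qj) for qj in q)
    se = math.sqrt(2.0 / (k * n * (k - 1))) * math.sqrt(s * s - t) / s
    z = kappa / se
    # two-sided p-value from the standard normal distribution
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return kappa, se, z, p
```

An asymptotic confidence interval can then be formed as $\hat \kappa \pm z_{1-\alpha/2}\cdot SE$, bearing in mind that the $SE$ used in this sketch is the one valid under $\mathcal{H}_0$.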

Note

The determination of Fleiss' Kappa coefficient is conceptually similar to the Mantel-Haenszel method. The overall Kappa is a general measure summarizing the concordance of all the judges' ratings and can be viewed as built from individual strata, where each stratum corresponds to one rating category (Fleiss, 2003)3). Therefore, in addition to the overall summary, the judges' concordance (a Kappa coefficient) can be determined separately for each possible rating category.
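
A sketch of such a per-category (per-stratum) Kappa, under the same assumptions as before; the formula follows Fleiss (1971), where agreement on category $j$ is treated as "category $j$ versus all others":

```python
def category_kappa(x, j):
    """Kappa for a single rating category j (one 'stratum') of the
    n x c count table x (Fleiss, 1971). A sketch assuming an equal
    number of judges k per object."""
    n = len(x)
    k = sum(x[0])
    qj = sum(row[j] for row in x) / (k * n)
    # pairs of judges who disagree on whether an object belongs to category j
    disagreeing_pairs = sum(row[j] * (k - row[j]) for row in x)
    return 1.0 - disagreeing_pairs / (n * k * (k - 1) * qj * (1.0 - qj))
```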

The settings window with the test of the Fleiss' Kappa significance can be opened in Statistics menu → NonParametric tests → Fleiss Kappa.

EXAMPLE (temperament.pqs file)

20 volunteers take part in a game to determine their personality type. Each volunteer is rated by 7 different observers (usually people from their close circle or family). Each observer has been introduced to the basic traits describing temperament for each personality type: choleric, phlegmatic, melancholic, sanguine. We examine the observers' concordance in assigning personality types. An excerpt of the data is shown in the table below.

Hypotheses:

\begin{array}{cl}
\mathcal{H}_0: & \kappa= 0, \\
\mathcal{H}_1: & \kappa \ne 0.
\end{array}

We observe an unimpressive but statistically significant Kappa coefficient of 0.24 (p<0.0001), indicating non-random agreement between the judges' ratings. The significant concordance applies to each rating category, as evidenced by the concordance summary report for each stratum (each rating category) and the graph showing the individual Kappa coefficients together with the overall Kappa.

It may be interesting to note that the highest concordance is for the evaluation of phlegmatics (Kappa=0.48).

With a small number of people observed, it is also useful to make a graph showing how observers rated each person.

In this case, only person no. 14 received an unambiguous personality type rating – sanguine. Persons no. 13 and 16 were assessed as phlegmatic by 6 of the 7 observers. For the remaining persons there was slightly less agreement in the ratings. The personality type of the last person seems the most difficult to determine, as this person received the most diverse set of ratings.

1)
Fleiss J.L. (1971), Measuring nominal scale agreement among many raters. Psychological Bulletin, 76 (5): 378–382
2), 3)
Fleiss J.L., Levin B., Paik M.C. (2003), Statistical methods for rates and proportions. 3rd ed. (New York: John Wiley), 598–626