Some Coding Notes (extended) from Responses to Sexism journal
Procedural Memo, October 22, 1997: Unitizing Reliability: Figures and Formulas

SUMMARY OF SOURCES

Holsti, O. R. (1969). Content analysis for the social sciences and humanities. Reading, MA: Addison-Wesley.
HOLSTI (pp. 138-141) provides a couple of fairly basic formulas for determining intercoder reliability. One follows the formula:

C.R. = 2M / (N1 + N2),

where “M is the number of coding decisions on which the two judges are in agreement, and N1 and N2 refer to the number of coding decisions made by judges 1 and 2, respectively” (p. 140).
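For quick checking, here is a minimal sketch of this formula in Python; the function name and the sample figures are my own, for illustration only.

    def holsti_cr(m, n1, n2):
        """Holsti's coefficient of reliability: C.R. = 2M / (N1 + N2)."""
        return (2 * m) / (n1 + n2)

    # Hypothetical figures: 80 agreements out of 100 coding decisions by each judge.
    print(holsti_cr(80, 100, 100))  # 0.8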
“This formula has been criticized, however, because it does not take into account the extent of inter-coder agreement which may result from chance (Bennett, Alpert, and Goldstein, 1954). By chance alone, agreement should increase as the number of categories decreases” (p. 140).
Scott’s pi corrects not only for the number of categories in the category set, but also for the probable frequency with which each is used (reword this before it goes to press; several phrases are word for word):

pi = (% observed agreement - % expected agreement) / (1 - % expected agreement)
% expected agreement is found by “finding the proportion of items falling into each category of a category set, and summing the square of those proportions” (p. 140). Holsti gives an example, but it seems to reflect the categories of only one of the two coders in his comparison (or it might reflect both; in his example, both coders have used each of the 4 categories the same number of times, which I don’t think would always happen in real life). Holsti (1969) gives a third method, but it does not seem to apply to the case at hand.
While there are several other methods cited (from 1940s and 1950s), Holsti (1969) sees Scott’s pi as a good conservative estimate.
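A minimal Python sketch of Scott’s pi; the 90% observed agreement and the four equally used categories are made-up figures for illustration.

    def scotts_pi(observed, proportions):
        """Scott's pi = (observed - expected agreement) / (1 - expected agreement),
        where expected agreement is the sum of the squared category proportions."""
        expected = sum(p ** 2 for p in proportions)
        return (observed - expected) / (1 - expected)

    # Hypothetical figures: 90% observed agreement, four categories each used 25% of the time.
    print(round(scotts_pi(0.90, [0.25, 0.25, 0.25, 0.25]), 3))  # 0.867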
However, note that the problem with the first formula (henceforth called Holsti’s formula) is that it capitalizes on chance *when there is a small number of categories*.
Guetzkow, H. (1950). Unitizing and categorizing problems in coding qualitative data. Journal of Clinical Psychology, 6, 47-58.
            CODER A    CODER B
Totals        289        312

These figures were determined by adding the total number of units each coder saw across 5 questions for 20 surveys.
U = (O1 - O2) / (O1 + O2)
where O1 is the number of units Observer 1 sees in a text, and O2 is the number of units Observer 2 sees in the same text.
U = (289 - 312) / (289 + 312) = -23 / 601 = -.038
Since U is actually a measure of disagreement (Folger, Hewes, & Poole), we could say that there is an agreement of .962 (that is, 1 minus the absolute value of U).
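A minimal Python sketch of this calculation, using the totals from the table above (the function name is mine):

    def guetzkow_u(o1, o2):
        """Guetzkow's U = (O1 - O2) / (O1 + O2), an index of unitizing disagreement."""
        return (o1 - o2) / (o1 + o2)

    u = guetzkow_u(289, 312)
    print(round(abs(u), 3))      # 0.038 (disagreement)
    print(round(1 - abs(u), 3))  # 0.962 (agreement)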
One of the problems with this figure is that it can obscure many differences. For
example, in our data set there were several occasions when Coder A saw one more unit than Coder B did, and others where Coder B saw one more than Coder A. Depending on how the units of a “text” are calculated, these differences can cancel each other out, giving an inflated reliability.
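To illustrate with made-up figures: if Coder A saw 10 units on one question where Coder B saw 11, and B saw 10 on another question where A saw 11, the two coders’ totals are identical (21 each) and U comes out as 0, even though they disagreed on both questions.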
Folger et al. state the problem clearly:
“Although Guetzkow’s index is certainly useful, it falls short of being ideal. To be ideal, an index of unitizing reliability should estimate the degree of agreement between two or more coders in identifying specific segments of text. That is, an ideal index should quantify the unit-by-unit agreement between two or more coders. Neither U nor his more sophisticated index based on U does this. Guetzkow’s indices only show the degree to which two coders identify the same number of units in a text of fixed length, not whether those units were in fact the same units.” (p. 120).
At the same time, Folger et al. suggest the index may be appropriate in certain situations. They suggest (but do not demonstrate) a way of looking at agreement in each objective segment (in our case, the amount of agreement for each question?) and then calculating across segments. They refer the reader to Hewes et al. (1980), Newtson & Engquist (1976), Newtson et al. (1977), and Ebbeson & Allen (1979) for examples.
I will later check on these cites. One of them is:
Hewes, D. E., Planalp, S. K., & Streibel, M. (1980). Analyzing social interaction: Some excruciating models and exhilarating results. Communication Yearbook, 4, 123-144.
Folger et al. ask:
“Is it always necessary to go to so much work to provide evidence of unitizing reliability? Probably not in all cases. If one is using an exhaustive coding system, i.e., a coding system in which each and every act is coded, and Guetzkow’s U is quite low, perhaps .10 or below, it may prove unnecessary to perform a unit-by-unit [segment by segment?] analysis. Similarly, if the actual unit is relatively objective and easily coded, Guetzkow’s indices may suffice. On the other hand, if the units are subjective, the coding scheme is not exhaustive, or the data are to be used for sequential analysis (lagged-sequential analysis, Markov process, etc.), unit-by-unit analysis is essential. In any event some measure of unitizing reliability should be reported in any quantitative study of social interaction.” (p. 121).
Procedural Memo, September 29, 1997: Categorizing Reliability

The following is a step-by-step summary for calculating Cohen’s Kappa. This coefficient is considered by many to be superior to Krippendorff’s Alpha, Scott’s Pi, and Holsti’s formula (no Greek letter given) in that it provides a more conservative estimate. The others, Cohen argues, capitalize on chance.
The step-by-step is based on Cohen’s own explication (Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46) and on a Web site for a Geology course at the University of Texas (http://www.utexas.edu/ftp/pub/grg/gcraft/notes/manerror/html/kappa.html).
I will base the suggested steps on the Web site, since it is much easier to follow. For the figures used here, the two versions of Cohen’s Kappa come out essentially the same; I am not enough of a mathematician to know whether the simpler version always produces the same result as Cohen’s original.
1. Create a matrix comparing the classification of items by the two coders. This step assumes that the coders have agreed upon a coding scheme prior to the testing of reliability. [This would be done through either imposing a scheme on the data or inductively deriving a set of categories from the data itself.] I have given the coders names to keep from confusing letters or numbers later. Example:

                                     Coder Roberto
Coder Maria                       A     B     C     D     E    Maria’s Total
A. Abusive language               2     2     0     0     0         4
B. Sexual innuendo                3     2     0     0     0         5
C. Inappropriate gaze             0     0     2     1     0         3
D. Unwanted romantic attention    0     0     2     2     2         6
E. Sexual touch                   0     0     0     0     2         2
Roberto’s Total                   5     4     4     3     4        20

Observed agreement (total along the diagonal) = 10
What I will provide here that the site doesn’t is a brief explanation of the chart.

a. In this case, each coder was given the same 20 units, with five categories.

b. The column at the far right represents how many items Maria placed in each category; the numbers in the bottom row indicate how many Roberto placed in each.

c. The numbers within the chart represent how the two coders coded the same items. For example, both coders coded a total of 9 items in the first two categories. Of Roberto’s 5 “Abusive language” items, Maria sees only 2 of them as abusive language and 3 of them as sexual innuendo. This example shows a strength of such an analysis. If we looked merely at the number of units, one would say that both coders have nearly the same number of units for Category A and Category B; on closer inspection, one realizes that they are not the same units. For another example, we can look at the category of “Unwanted romantic attention.” Maria places 6 items in this category, with fewer items in the gaze (3) and touch (2) categories. Roberto makes more differentiation, seeing 4 as gaze, 4 as touch, and 3 as unwanted romantic attention. If this were real data, we could begin to see from the chart where some of the categories might need clearer definition for the coders.

The diagonal represents identical agreement on each category. Thus, as the coders approach total agreement, more and more of the frequencies would fall along the diagonal. As is, only 10 of the 20 items are coded identically (counting the numbers along the diagonal).

d. It would be tempting at this point to jump from an agreement based on total numbers (where Maria and Roberto would appear to have perfect agreement on the first two categories) to an agreement based on the number of units categorized identically versus the total units in the category (where they would have 2 / 5 for each, or 40%). However, this does not allow for the chance introduced by the number of categories or by the frequency with which these categories are chosen as opposed to others. Thus, more sophisticated reasoning is needed.
2. Calculate “q,” the number of cases expected in the diagonal cells by chance alone. To do this, multiply Maria’s total for a category by Roberto’s total for the same category and divide by the total number of units (20):

q = SUM [ n(row) x n(col) / N ]

A = 5 x 4 / 20 = 1.00
B = 4 x 5 / 20 = 1.00
C = 4 x 3 / 20 = .60
D = 3 x 6 / 20 = .90
E = 4 x 2 / 20 = .40

q = 1.00 + 1.00 + .60 + .90 + .40 = 3.9

3. Calculate Kappa:

Kappa = (d - q) / (N - q),

where
d = the observed agreement (here, 10 cases)
N = the total number of units sorted (here, 20)
q = the agreement expected by chance (see above)
Kappa = (10 - 3.9) / (20 - 3.9) = 6.1 / 16.1 = .3789
So, while by raw numbers alone it looks like we have 50% reliability (10 / 20), that figure is only simple agreement. The reliability, based on Cohen’s Kappa, would be only .38. The closer Kappa is to 1.0, the higher the reliability between coders.
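To check the arithmetic, here is a minimal Python sketch of steps 1-3 run on the Maria/Roberto chart above; the function and variable names are my own illustration, not anything from Cohen or the Web site.

    def cohens_kappa(matrix):
        """Cohen's Kappa from a square frequency matrix (one coder's categories
        as rows, the other's as columns)."""
        n = sum(sum(row) for row in matrix)                 # N, total units sorted
        d = sum(matrix[i][i] for i in range(len(matrix)))   # observed agreement (the diagonal)
        row_totals = [sum(row) for row in matrix]
        col_totals = [sum(col) for col in zip(*matrix)]
        q = sum(r * c / n for r, c in zip(row_totals, col_totals))  # agreement expected by chance
        return (d - q) / (n - q)

    # Maria's categories as rows, Roberto's as columns (A-E), from the chart above.
    matrix = [
        [2, 2, 0, 0, 0],
        [3, 2, 0, 0, 0],
        [0, 0, 2, 1, 0],
        [0, 0, 2, 2, 2],
        [0, 0, 0, 0, 2],
    ]
    print(round(cohens_kappa(matrix), 4))  # 0.3789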
FOR THE STATISTICALLY HEARTY ONLY….
Now, I will run the same matrix through Cohen’s original formula, which uses
percentages throughout, rather than raw numbers. I will start with the same matrix, removing the labels, and will only report the steps.
                                  Coder Roberto
Coder Maria       A             B             C             D             E            Total
A            2  .10 (.05)   2  .10        0  .00        0  .00        0  .00        4   .20
B            3  .15         2  .10 (.05)  0  .00        0  .00        0  .00        5   .25
C            0  .00         0  .00        2  .10 (.03)  1  .05        0  .00        3   .15
D            0  .00         0  .00        2  .10        2  .10 (.045) 2  .10        6   .30
E            0  .00         0  .00        0  .00        0  .00        2  .10 (.02)  2   .10
Roberto’s
Total        5  .25         4  .20        4  .20        3  .15        4  .20       20  1.00
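Here each cell shows the raw count and its proportion of the 20 units, and the figures in parentheses on the diagonal are the proportions expected in those cells by chance (the row proportion times the column proportion). Completing the arithmetic with Cohen’s proportion-based formula (my own working-through of these figures, using p(o) for the observed and p(e) for the chance-expected proportion of agreement):

p(o) = .10 + .10 + .10 + .10 + .10 = .50
p(e) = .05 + .05 + .03 + .045 + .02 = .195
Kappa = (p(o) - p(e)) / (1 - p(e)) = (.50 - .195) / (1 - .195) = .305 / .805 = .3789

which matches the figure obtained with the raw-number version above.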