When statistically analyzing listening test data based on a small number of trials or test subjects (as in the Koya study) at the conventional significance level (α = 0.05), two categories of error must be considered: Type 1, concluding that inaudible differences are audible; and Type 2, concluding that audible differences are inaudible. In studies such as this one, the α = 0.05 significance level usually makes the type 2 error larger than the type 1 error. Since it is desirable to keep both errors as small as possible, equalizing the two usually means reducing the type 2 error.
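As a sketch of what the two error types mean numerically, assume the standard model for forced-choice listening tests (it is not spelled out in the text, but it reproduces the table further down): each of N trials is independent, a listener who hears no difference is correct with probability 0.5, and a difference is declared audible when at least r trials are correct. The symbols X (number of correct responses) and β_p (type 2 error at a given p) are labels introduced here for illustration.

$$\alpha = P(X \ge r \mid p = 0.5) = \sum_{k=r}^{N} \binom{N}{k}\, 0.5^{N}$$

$$\beta_p = P(X < r \mid p) = \sum_{k=0}^{r-1} \binom{N}{k}\, p^{k} (1-p)^{N-k}$$

Raising r lowers α but raises β_p, which is why the two errors have to be traded against each other.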
There are three ways of reducing type 2 error in a listening test:
1. Increase N, the number of trials. This is the preferred (but not always available) method of decreasing type 2 error.
2. Increase p, the probability that a listener answers a trial correctly. Careful test design can increase this.
3. Increase type 1 error. This is a method of last resort, but it may be necessary to avoid large type 2 errors when the p value of interest is just slightly above chance.
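The effect of each of the three options can be checked numerically. The following is a minimal sketch, assuming an independent-trials binomial model with a chance level of 0.5; the function names are illustrative and do not come from Leventhal's paper.

```python
from math import comb

def upper_tail(n, r, p):
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

def smallest_criterion(n, alpha=0.05):
    """Smallest r whose type 1 error P(X >= r | p = 0.5) is at most alpha.

    Returns n + 1 when no attainable score satisfies alpha.
    """
    for r in range(n + 2):
        if upper_tail(n, r, 0.5) <= alpha:
            return r

def type2_error(n, r, p):
    """P(X < r | p): chance of missing a difference heard with probability p."""
    return 1.0 - upper_tail(n, r, p)

r15 = smallest_criterion(15)
print("baseline    N=15, p=0.6:", type2_error(15, r15, 0.6))

# Option 1: more trials at the same alpha.
r100 = smallest_criterion(100)
print("increase N  N=100, p=0.6:", type2_error(100, r100, 0.6))

# Option 2: a test design that raises p.
print("increase p  N=15, p=0.8:", type2_error(15, r15, 0.8))

# Option 3: accept a larger type 1 error by lowering the criterion to r = 9.
print("lower r     N=15, r=9, p=0.6:", type2_error(15, 9, 0.6))
```

In each of the three variants the type 2 error should come out lower than the baseline figure.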
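Equalizing the probabilities of both types of error for a particular p value of interest and a criterion r (the number of correct responses required) calls for a metric by which the degree of equalization between the two error types can be assessed. In this case, the fairness coefficient, FCp, provides a useful figure of merit, where:

$$FC_p = \frac{\min(\alpha, \beta_p)}{\max(\alpha, \beta_p)}$$

with α the type 1 error and β_p the type 2 error at the chosen p, that is, the smaller of the two errors divided by the larger. Here, an FCp = 1 represents an ideal, perfectly fair study.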
From: Leventhal, L.: "Type 1 and Type 2 Errors in Statistical Analysis of Listening Tests," Journal of the Audio Engineering Society, Vol. 34, pp. 437-453 (1986 Jun.) we have the following table:
<pre>N    r    Type 1 error (α),    Type 2 error (β)
          actual value         p = 0.6    p = 0.7    p = 0.75   p = 0.8
15   14   0.0005               0.9948     0.9647     0.9198     0.8329
     13   0.0037               0.9729     0.8732     0.7639     0.6020
     12   0.0176               0.9095     0.7031     0.5387     0.3518
     11   0.0592               0.7827     0.4845     0.3135     0.1642
     10   0.1509               0.5968     0.2784     0.1484     0.0611
     9    0.3036               0.3902     0.1311     0.0566     0.0181
     8    0.5000               0.2131     0.0500     0.0173     0.0042
     7    0.6964               0.0950     0.0152     0.0042     0.0008
     6    0.8491               0.0338     0.0037     0.0008     0.0001</pre>
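The table entries can be regenerated from binomial tail probabilities. The sketch below assumes N = 15 independent trials, a chance level of 0.5, and r as the minimum number of correct responses required to conclude that a difference is audible; the helper names are hypothetical.

```python
from math import comb

def upper_tail(n, r, p):
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

def error_row(n, r, ps=(0.6, 0.7, 0.75, 0.8)):
    """Return (type 1 error, [type 2 error for each p in ps]) for criterion r."""
    type1 = upper_tail(n, r, 0.5)                      # false alarm under pure guessing
    type2 = [1.0 - upper_tail(n, r, p) for p in ps]    # miss when the true rate is p
    return type1, type2

# Print one line per criterion, in the same order as the table.
for r in range(14, 5, -1):
    type1, type2 = error_row(15, r)
    print(r, f"{type1:.4f}", " ".join(f"{b:.4f}" for b in type2))
```

The printed rows should agree with the table to four decimal places, which supports reading r as "at least r correct out of 15".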
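Selecting r = 9 for p = 0.6, the table gives a type 1 error of 0.3036 and a type 2 error of 0.3902, and the ratio of the smaller error to the larger gives:

$$FC_{0.6} = \frac{0.3036}{0.3902} \approx 0.7781$$

Keeping in mind that an FCp value of 1 is the ideal, 0.7781 indicates a high degree of fairness.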
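To see how r = 9 compares with the other possible criteria, one can scan every r for N = 15 and p = 0.6 and keep the one with the highest fairness coefficient. This is only a sketch: it assumes the binomial model behind the table and takes FCp to be the smaller error divided by the larger, the reading that reproduces the 0.7781 figure. The function names are illustrative.

```python
from math import comb

def upper_tail(n, r, p):
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

def fairness(n, r, p):
    """Assumed fairness coefficient: smaller error divided by larger error."""
    type1 = upper_tail(n, r, 0.5)
    type2 = 1.0 - upper_tail(n, r, p)
    return min(type1, type2) / max(type1, type2)

def fairest_criterion(n, p):
    """Criterion r in 1..n with the highest fairness coefficient."""
    return max(range(1, n + 1), key=lambda r: fairness(n, r, p))

best_r = fairest_criterion(15, 0.6)
print(best_r, fairness(15, best_r, 0.6))
```

Under these assumptions the scan selects r = 9, the same criterion adopted in the study.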
As to why the researchers chose a p value of 0.6, we find in the text of the Koya study:
"Although a p of 0.6 may seem as a low criterion, it was chosen so that the subtle effects of the audibility of phase distortion were uncovered in the analysis. Therefore, for this study, anything above 9 correct responses (r) out of 15 will be considered statistically significant for p = 0.6".
References
Koya, Daisuke: "Aural Phase Distortion Detection", Masters dissertation, Master of Science in Music Engineering Technology, University of Miami, Coral Gables, Fla., May 2000.
Leventhal, L.: "Type 1 and Type 2 Errors in Statistical Analysis of Listening Tests," Journal of the Audio Engineering Society, Vol. 34, pp. 437-453, June 1986.
- Posted for Mark Sanfilipo (inserted chart and formulas)