I said I would post this when I got a chance, so here we go:
On page 452 of "Loudspeakers and Rooms for Sound Reproduction: A Scientific Review", published by Floyd Toole in 2006, we find a great review of the early research into in-room speaker measurements and how they correlated with perceived sound quality. Remember that the premise of my paper was not to condemn measurements (as I stated there) but to note that in-room measurements are a poor indicator of sound quality.
First, Toole notes a BBC finding that "we measure differences that we seem not to hear", which is an important statement about the relationship between in-room measurements and sound quality. As noted previously, in-room measurements contain a lot of information that our ears can filter out, but that the measurements cannot filter out without losing too much resolution and becoming hopelessly uninterpretable. (Removing all room reflections, including very early ones, would require an exceedingly short time window, leaving the measurement with usable frequency resolution only at very high frequencies.)
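The window trade-off in the parenthetical above can be put into rough numbers. This is only a back-of-the-envelope sketch, not something taken from Toole's paper: it assumes a speed of sound of about 343 m/s and the common rule of thumb that a time window of T seconds cannot resolve detail finer than roughly 1/T Hz. The function names and path lengths are made up for illustration.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value, air at roughly room temperature


def reflection_free_window_s(direct_path_m, reflected_path_m):
    """Time gap between the direct sound and a reflection arriving at the mic."""
    return (reflected_path_m - direct_path_m) / SPEED_OF_SOUND_M_S


def lowest_resolved_frequency_hz(window_s):
    """Rule of thumb: a window of T seconds gives about 1/T Hz resolution."""
    return 1.0 / window_s


if __name__ == "__main__":
    # Hypothetical geometry: mic 3 m from the speaker, with reflections that
    # travel 1.0 m, 0.3 m, and 0.1 m farther than the direct sound.
    for extra_path_m in (1.0, 0.3, 0.1):
        window = reflection_free_window_s(3.0, 3.0 + extra_path_m)
        limit = lowest_resolved_frequency_hz(window)
        print(f"extra path {extra_path_m} m: window {window * 1000:.2f} ms, "
              f"data only meaningful above about {limit:.0f} Hz")
```

Under these assumptions, a reflection that travels 1 m farther than the direct sound allows a window of roughly 2.9 ms, so the gated measurement says little below about 343 Hz. Gating out a reflection with only 0.1 m of extra path pushes that limit up by a factor of ten.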
He goes on to say:
"These observations imply that some of the problem lies
in our interpretations of measurements made in small
rooms. The horrendously irregular steady-state
“room curves” that we see simply do not correspond to what we
hear. Did our problems begin when we started to make
measurements? Are we incapable of hearing these things?
Or is it that we hear them, but they simply become part of
the acoustical context within which other acoustical events
occur, and we have some ability to separate the two? The
answer turns out to be some of each."
This is basically what I said in the article: the measurements contain a great deal of information that does not seem to correlate with what we perceive as good sound. They often look horrendous but don't sound it. This is a separate matter from the notion that a flat anechoic response is the biggest predictor of sound quality. A flat anechoic response does not necessarily translate to a flat in-room response, though it certainly makes a good in-room response more likely (the in-room response will always be less flat and smooth than the anechoic one; reflections do that).
But I also made the point that prior research had suggested that the in-room response, absent knowledge of the anechoic response, is not a good predictor of sound quality. While an obviously poor in-room response most likely means that the anechoic response is poor, the opposite is not true. A flat in-room response is not a guarantee of good sound or of a flat anechoic response, since we can artificially create that flat in-room response, and since a speaker can measure flat anechoically on the listening axis but not on the first reflection axis.
In defense of the need for good psychoacoustic research, Toole notes:
"The performance of a loudspeaker is much more
complex than anything revealed in an on-axis anechoic
measurement. The perceptual processes of two ears and a
brain are vastly more complex than anything revealed in a
room curve or a reverberation time".
This in many ways was the point of my article. Not that measurements can't tell us a lot about good sound, but that measurements in the absence of this science are not valuable, and that this science has been done, to a point at least. This science showed that in-room measurements are a poor indicator of sound quality, but that specific anechoic ones tell us at least most of what we perceive as good sound. I still find, from time to time, speakers that suggest this science may be incomplete, and when you scour the many forum and blog posts by folks like Toole and Olive, you often find them indicating that perhaps there is more to it.
For example, could speaker directivity actually play a big role in sound quality? Most of the work published by Toole and Olive suggested that the ideal response was flat on the listening axis, with a downward tilt on the early reflection axis and, further, a steep downward tilt in the power response, and that this correlated highest with good sound. Here's the thing: it is likely that Toole and Olive never measured a speaker that met the smoothness/flatness priority across all angles but also had very wide dispersion (a speaker meeting the above criteria must be somewhat directional). What if a speaker's forward radiation is smooth, flat, and uniform out to an unusually wide angle? In such a scenario, the in-room response would be flat, not downwardly tilted, because the combined total energy wouldn't be stronger at LFs; it would remain flat. Exceedingly few speakers meet these criteria. I know of only two, and I am confident that Harman has never tested either in their labs.
OK, back to Toole's notes. On page 467, figure 17, he shows that the in-room response is best predicted by some combination of the early reflected sound and the power response. The in-room measurement looks most like the early reflected sound, though the HFs clearly roll off more like the power response. This is likely because most small rooms actually have quite a bit of HF absorption, including that of the air itself. This is why I noted above that a speaker whose listening axis and early reflected sound responses are flat would have a flat in-room response.
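To make the "combination" idea concrete, here is a minimal sketch that estimates an in-room curve as an energy-weighted blend of the on-axis, early reflection, and power responses. The weights and the three-point curves are hypothetical placeholders chosen for illustration, not values from Toole's paper, and the sketch ignores the room's own HF absorption mentioned above.

```python
import math

# Hypothetical weights for (on-axis, early reflections, sound power).
ILLUSTRATIVE_WEIGHTS = (0.2, 0.4, 0.4)


def estimate_in_room_db(on_axis_db, early_reflections_db, sound_power_db,
                        weights=ILLUSTRATIVE_WEIGHTS):
    """Blend three curves (dB per frequency point) on an energy basis."""
    total_weight = sum(weights)
    estimate = []
    for levels in zip(on_axis_db, early_reflections_db, sound_power_db):
        # Convert each dB level to relative energy, then take the weighted mean.
        energy = sum(w * 10 ** (level / 10) for w, level in zip(weights, levels))
        estimate.append(10 * math.log10(energy / total_weight))
    return estimate


if __name__ == "__main__":
    # Made-up levels at three frequencies (say 100 Hz, 1 kHz, 10 kHz).
    flat = [0.0, 0.0, 0.0]

    # Conventional, somewhat directional speaker: off-axis curves tilt down.
    directional = estimate_in_room_db(flat, [0.0, -2.0, -4.0], [0.0, -5.0, -10.0])

    # Very wide, uniform dispersion: every curve is flat.
    wide = estimate_in_room_db(flat, flat, flat)

    print("directional speaker:", [round(x, 1) for x in directional])
    print("wide-dispersion speaker:", [round(x, 1) for x in wide])
```

Whatever weights are chosen, a blend of curves that are all flat stays flat, while tilted early reflection and power responses pull the estimate downward toward the treble. That is the argument made above about a speaker with unusually wide, uniform dispersion.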
On page 472, Toole notes that above the transition frequency (in what is known as the Stochastic zone) the sound of the speaker and the room are intertwined and inseparable. Equalization based on in-room measurements in this region risks applying the wrong correction to the problem. Namely, we can't know whether the problem is the speaker setup, the room acoustics, or the speaker itself, and the fix for each of these three is different. Auto-EQ can't tell either, so there is a risk in using an automated room EQ system that is unable to know. The best fix for people wanting to use proper forms of room EQ is to only use speakers that have a flat and smooth anechoic response on the listening axis, with early reflection and power responses which remain flat but have lower output in the midrange and treble. In this scenario, room EQ should not have any issues, but then again, it also may not provide any benefit.
I want to highlight a specific claim he makes, however, one that I've been echoing here and that I think is quite important to this discussion:
"Comprehensive high-resolution anechoic frequency-response
data on loudspeakers contain sufficient information
to permit remarkably good predictions of subjective
preference ratings based on listening in a normal
room. Single measures, such as the on-axis frequency
response, sound-power response or steady-state in-room
curves are less reliable."
He states that, based on the research he and others have conducted, in-room steady-state measurements (that is, the frequency response measurements we typically use) are an unreliable predictor of good sound.
As for why Toole states this (beyond his own work), there is older work which Toole cites in a prior paper. "Loudspeaker Measurements and their Relationship to Listener Preferences: Part 1" (AES, 1986) lays out a history of how we got to where we are today. It notes that early research into which measurements correlate best with perceptions of sound quality suggested that the power response may be a good indicator. However, some researchers found that it was possible to be fooled, in that "a non-smooth power response measured in the reverberation room indicates a similarly irregular frequency response as measured in the anechoic chamber. While it cannot be stated that a speaker system that shows a smooth response in the reverberation room will necessarily sound good or have a smooth pressure response, the reverse is true" (Brociner and Von Recklinghausen, quoted on page 229).
That is, a speaker that measures poorly in a normal small room will likely measure poorly in an anechoic chamber. A speaker that measures well in a normal small room will not necessarily sound good or measure well in an anechoic chamber. So we can't use in-room measurements, in isolation, as any indicator of good sound.
See
https://pdfs.semanticscholar.org/7b0e/3101e1788608d75d024ac926d25a077b85bc.pdf
and
http://www.mariobon.com/Articoli_storici_AES/Toole/AES_1986_Toole_01.pdf