"If it sounds good and measures bad, then you're measuring the wrong thing."
I used to believe that, and at one time it made sense to, but technology has finally caught up, so I no longer think it's true. There is a measurement for everything now.
"Not everything that counts can be counted, and not everything that can be counted counts."
I think this one is still true.
There is still much to be learned in measurement. The main issue is that we can "count" a lot that may or may not "count," and we don't really know yet which is which. For example, there is a lot still to be learned about dispersion. We also don't really know exactly which aspects of measured performance are the most important. A lot of this stuff was studied using "end run" research, and as a result we know how it works only at a fairly gross level. We don't, for example, know exactly how much variation, at what frequencies, is transparent versus perceptible. There is a big difference between the JND of a pure tone and the JND of music played back through a speaker with these errors. When this was studied, it was done using fairly crude methods. At the end of the day, research is expensive and time-consuming. You can only look at so much, so you pick what seems most important given your time and budget.
I think the main reason people think a bad-measuring speaker sounds good is that they either have hearing problems or simply don't know better. I doubt anyone, under blinded conditions, would prefer the sound of a really flawed/colored speaker. On the other hand, I do think a speaker could be flawed and have those flaws not matter. When Toole and Olive did a lot of this work, it took on a kind of smoother-is-better approach to the Spin data. But there has to be a limit past which it doesn't matter anymore. We have measured a lot of speakers that had a good response for the most part, with what were in our opinion some minor issues, and we thought they sounded really good. We didn't hear evidence of those flaws. We would question how audible they are, and we are pretty sure they aren't something that has been extensively studied. An example here would be the Polks I recently reviewed, where I found that the tweeter beamed excessively. It was still smooth where it counted. Is it really possible to hear that a speaker is down 30 dB at 30 degrees off-axis at 15 kHz? I don't think so, or at least I don't think it's all that audible or matters all that much.
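To put the -30 dB figure in perspective, here is a minimal sketch that converts a level difference in decibels to linear ratios using the standard definitions (20·log10 for sound pressure, 10·log10 for power). The function names are made up for illustration, and this says nothing about audibility by itself; it only shows how little energy is left that far off-axis.

```python
def db_to_pressure_ratio(db: float) -> float:
    """Convert a level difference in dB to a sound-pressure (amplitude) ratio."""
    return 10 ** (db / 20)


def db_to_power_ratio(db: float) -> float:
    """Convert a level difference in dB to a power (energy) ratio."""
    return 10 ** (db / 10)


# Level of the off-axis response relative to on-axis, taken from the example above.
off_axis_db = -30.0

pressure = db_to_pressure_ratio(off_axis_db)
power = db_to_power_ratio(off_axis_db)

print(f"pressure ratio: {pressure:.4f} ({pressure * 100:.1f}% of on-axis)")
print(f"power ratio:    {power:.4f} ({power * 100:.1f}% of on-axis)")
```

In other words, a response that is 30 dB down carries roughly 3% of the on-axis pressure amplitude and about 0.1% of the power.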
What about some disturbances in the integration due to directivity mismatch? We see that a lot. If a speaker gets everything else right, including a very smooth listening window, does it matter? The research doesn't really say, because those kinds of issues weren't tested (what was tested was far more extreme/flawed than what we typically see today). We've discussed this and question, again, whether there is some limit below which the amount of error is minor enough not to matter.