We are really wandering off of the original thread topic now. But hey, it's interesting so far.
Double-blind testing of speakers is laborious, but it is possible and is done routinely. At the NRC and at Harman, this is done by placing speakers on large turntables behind an acoustically transparent curtain. The different speakers are rotated into the same position in real time by these devices. These studies are, of course, level matched. The studies I reference are peer reviewed by the world's leading acoustics and loudspeaker design experts. The JAES is the standard publication for these works. It is like a medical journal, but for audio engineering.
The loudspeaker studies use a very large number of subjects and speakers in order to minimize isolated errors. The largest such radiation/response test conducted at the NRC, by Ian Paisley of Mirage a number of years ago, used almost a thousand test subjects and many different speaker systems in order to systematically deduce what was favored and what was not under blinded conditions.
Some studies now use binaural recordings made in real room acoustics, so that tests can be conducted over headphones via software.
Personally, I do not [yet] have a turntable system. When I compare modifications or new ideas, I use a control and a variable: two units that are identical except for the variable under test. Using a calibrated measurement microphone, I record one unit in a small anechoic chamber (or in a real room, if the test is supposed to take room effects into account), and then I record the second unit. Or I may use the same unit, before and after modification. I then level match and sync the recordings in software and load them into ABX software in order to judge for differences under blinded conditions. I use an extremely linear headphone to compare the recordings.
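For anyone who wants to try the level-match, sync and ABX steps themselves, here is a rough Python sketch. It is not the software I actually use. The function names (`rms`, `best_lag`, `align_and_match`, `abx_p_value`) are made up for illustration, the alignment is a brute-force cross-correlation that only suits short clips, and the scoring assumes a plain one-sided binomial test against coin-flip guessing.

```python
import math

def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def best_lag(ref, other, max_lag):
    """Lag in samples at which `other` best lines up with `ref`.

    Positive means `other` starts late by that many samples.
    Brute-force cross-correlation: simple, but slow on long recordings.
    """
    def score(lag):
        return sum(ref[i] * other[i + lag]
                   for i in range(len(ref))
                   if 0 <= i + lag < len(other))
    return max(range(-max_lag, max_lag + 1), key=score)

def align_and_match(ref, other, max_lag):
    """Time-align two recordings, trim to equal length, match RMS level."""
    lag = best_lag(ref, other, max_lag)
    if lag >= 0:
        other = other[lag:]   # drop the late start of `other`
    else:
        ref = ref[-lag:]      # drop the late start of `ref`
    n = min(len(ref), len(other))
    ref, other = ref[:n], other[:n]
    gain = rms(ref) / rms(other)  # scale `other` to the level of `ref`
    return ref, [s * gain for s in other], lag, gain

def abx_p_value(correct, trials):
    """Chance of scoring at least `correct` out of `trials` by pure guessing."""
    hits = sum(math.comb(trials, k) for k in range(correct, trials + 1))
    return hits / 2 ** trials

# Synthetic stand-ins for two recordings (hypothetical data, not measurements):
# the second is the first delayed by 5 samples and recorded 6 dB lower.
ref = [math.sin(0.3 * i) * math.exp(-0.02 * i) for i in range(200)]
other = [0.0] * 5 + [0.5 * s for s in ref]
a, b, lag, gain = align_and_match(ref, other, max_lag=20)
```

With real recordings you would read the samples from the WAV files first and write the matched pair back out for the ABX program. As a feel for the scoring, 12 correct out of 16 trials works out to a guessing probability of about 0.038.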
As for interaction with components, what interaction? A good power amplifier with low source impedance and adequate output devices to supply sufficient current into the particular load makes this a non-issue. A good amplifier has no audible differences with differing loads (within reason). Only poor-quality or flawed designs exhibit such behaviour. Of course, I refer once again to the perceptual research on the audibility of measured amplifier artifacts. No one has yet shown credible data that suggests otherwise.
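To put rough numbers on the source impedance point, here is a small sketch that treats the amplifier and speaker as a simple voltage divider. The figures are hypothetical: a speaker whose impedance swings between 3 and 30 ohms, and source impedances of 0.05 and 2 ohms. It also assumes purely resistive impedances, which real speakers are not, so take it as an order-of-magnitude illustration only.

```python
import math

def level_drop_db(load_ohms, source_ohms):
    """Level at the speaker terminals relative to an ideal 0 ohm source, in dB.

    Models amplifier and speaker as a resistive voltage divider (an assumption).
    """
    return 20.0 * math.log10(load_ohms / (load_ohms + source_ohms))

def response_spread_db(load_min, load_max, source_ohms):
    """Level difference between the highest and lowest load impedance."""
    return (level_drop_db(load_max, source_ohms)
            - level_drop_db(load_min, source_ohms))

# Hypothetical speaker whose impedance varies between 3 and 30 ohms.
low_z_amp = response_spread_db(3.0, 30.0, source_ohms=0.05)
high_z_amp = response_spread_db(3.0, 30.0, source_ohms=2.0)
print(f"0.05 ohm source: {low_z_amp:.2f} dB spread")
print(f"2.00 ohm source: {high_z_amp:.2f} dB spread")
```

Under these assumptions the low-impedance source changes level by roughly 0.1 dB across the whole impedance swing, while the 2 ohm source changes it by nearly 4 dB.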
-Chris