In theory it depends not on who survives a duel but:
- the AVR, prepro, preamp, amp in use.
- the crossover design of the speakers.
- the quality of the speakers overall.
- your discerning capability, not just your hearing but how good are you in discerning what should be audible.
- how prone you are to Placebo effect, we all are, but some more prone than others.
A list like this makes doing listening tests seem so complex that it might discourage people from trying. I'm not sure if that was your intent, but that was my reaction on reading it.
The purpose of the scientific method is to eliminate variables, one by one, to narrow down the possible conclusions. Of the first 3 items on your list, only one is varied at a time while the rest are held constant throughout the test. For example (I'll keep this simple by assuming the speakers are 2-way with passive analog crossovers), if you are testing bi-amping, you would keep everything constant except the amplification: either one amp drives both drivers in a speaker, or two amps each drive one driver separately. Nothing else changes. I thought this might be too obvious to point out, but maybe not.
The last two items on your list, an individual's discerning capability and one's proneness to the placebo effect, are critically important. I'm glad you pointed them out, because if they are ignored, no listening test can have useful conclusions. They should be addressed by control experiments built into the listening tests.
How prone a listener is to the placebo effect can easily be determined by exposing listeners to tests where nothing is different, an A vs. A test (also known as a negative control). Not everyone reports hearing no difference – this can be, and should be, measured. Floyd Toole & Sean Olive made an important contribution to the science of listening tests by showing conclusively that the results of this kind of A vs. A test were different depending on whether tests were done sighted or blind. (Obviously, all these tests must be done under blinded conditions.)
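To make the A vs. A idea concrete, here is a minimal sketch of how the result could be scored per listener. The function name and the response data are hypothetical, made up purely for illustration; they are not results from any real test.

```python
def false_alarm_rate(responses):
    """Fraction of A vs. A trials on which a listener reported a difference.

    `responses` holds one boolean per trial: True means the listener said
    "different" even though both presentations were identical.
    """
    if not responses:
        raise ValueError("need at least one trial")
    return sum(responses) / len(responses)


# Made-up responses from two hypothetical listeners, 10 blind A vs. A trials each.
listener_1 = [True, True, False, True, False, True, True, False, True, True]
listener_2 = [False, True, False, False, False, True, False, False, False, False]

for name, responses in (("listener 1", listener_1), ("listener 2", listener_2)):
    print(f"{name}: reported a difference on {false_alarm_rate(responses):.0%} of identical pairs")
```

A listener with a high rate here is reporting differences that cannot exist, which is worth knowing before interpreting that listener's answers in the real test.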
The other item on your list, a listener's discerning capability, is probably the most difficult to address. This would be a test of just what kind of subtle sound quality differences listeners can actually hear. It can be considered a positive control. As a suitable positive control, think of a short passage (about 1 minute long or less) of cleanly recorded music. Make additional digital copies and add various amounts of pink noise to the recording, so you have a series of short passages of the same music with 0%, 5%, 10%, etc., added noise. Test each listener to see what level of added noise is easily heard as different from no added noise. This would serve as an internal calibration curve.
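As a rough sketch of how such a noise series could be prepared, the snippet below mixes approximate pink noise into a signal. Several things here are my assumptions, not part of the proposal above: "N% added noise" is taken to mean noise RMS equal to N% of the music RMS, the pink noise comes from the simple Voss method (summing white-noise sources refreshed at halving rates), and a sine tone stands in for the music. Real use would load the recording's samples and guard against clipping.

```python
import math
import random


def pink_noise(n, rows=16, seed=0):
    """Approximate pink (1/f) noise: source k is refreshed every 2**k samples."""
    rng = random.Random(seed)
    sources = [0.0] * rows
    out = []
    for i in range(n):
        for k in range(rows):
            if i % (1 << k) == 0:
                sources[k] = rng.uniform(-1.0, 1.0)
        out.append(sum(sources))
    return out


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def add_noise(music, percent, seed=0):
    """Return a copy of `music` with pink noise added.

    ASSUMPTION: `percent` is the noise RMS as a percentage of the music RMS.
    """
    if percent == 0:
        return list(music)
    noise = pink_noise(len(music), seed=seed)
    mean = sum(noise) / len(noise)
    noise = [x - mean for x in noise]  # remove any DC offset
    gain = (percent / 100.0) * rms(music) / rms(noise)
    return [m + gain * x for m, x in zip(music, noise)]


# Stand-in for the music: one second of a 440 Hz tone sampled at 8 kHz.
music = [0.5 * math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]

# The series of test passages: same music, increasing amounts of added noise.
series = {percent: add_noise(music, percent) for percent in (0, 5, 10, 20)}
```

Each entry in `series` would then be written out as its own file and presented blind against the 0% version.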
If, for example, a group of listeners could reliably hear a difference between 0% and 10% noise, it also tells you something useful about whether or not they can hear differences between single-amping and bi-amping:
- If they can hear a difference between single-amping and bi-amping, that difference in sound quality can be considered at least equivalent to the difference between 0% and 10% added noise.
- If they can't, you can conclude that under conditions where listeners could reliably hear a difference between 0% and 10% added noise, they couldn't hear differences between single-amping and bi-amping.
That would be worth knowing.
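For what "reliably hear" might mean in practice, one common approach is a one-sided binomial test against guessing. This is only a sketch under my own assumptions: forced-choice trials where guessing is right half the time, a 5% significance threshold, and made-up trial counts. The function names are hypothetical.

```python
from math import comb


def p_value_at_least(correct, trials, chance=0.5):
    """Probability of getting `correct` or more right out of `trials` by guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


def reliably_heard(correct, trials, alpha=0.05):
    """True if the score is unlikely (p < alpha) to come from guessing alone."""
    return p_value_at_least(correct, trials) < alpha


def interpret(noise_heard, amp_heard):
    """Combine the positive control with the bi-amping result."""
    if not noise_heard:
        return "positive control not heard: the bi-amping result is uninformative"
    if amp_heard:
        return "bi-amping difference is at least as audible as the added noise"
    return "listeners who heard the added noise could not hear a bi-amping difference"


# Made-up scores out of 16 trials: 0% vs. 10% noise, then single- vs. bi-amping.
noise_heard = reliably_heard(14, 16)
amp_heard = reliably_heard(9, 16)
print(interpret(noise_heard, amp_heard))
```

With these invented scores, 14 of 16 is very unlikely by guessing while 9 of 16 is not, so the sketch lands in the second case from the list above.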