The lack of statistical significance is usually caused by far too few participants in the listening test. For example, if two items are compared in a blind A/B listening test to determine whether people can hear a difference, it takes many listeners before a clear, statistically significant answer emerges. At least 100! With numbers in that range, a simple yes/no answer could be valid. A much larger trial, with 300-1000 people, would be required if you want to estimate what percentage of listeners can hear a difference. Yes, measuring human perception is an inherently noisy business, and large numbers of participants are needed to deal with that.
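To make those numbers concrete, here is a minimal power calculation in Python. Everything in it is an assumption for illustration, not something from a real test: chance performance is 50% correct, a modest real ability to hear a difference is modeled as 60% correct, the significance level is 5% (one-sided), and each participant answers once.

```python
"""Sample-size sketch for a blind A/B listening test.

All numbers here are illustrative assumptions: chance is 50% correct,
a modest real effect is modeled as 60% correct, alpha is 5% one-sided,
and each participant contributes one answer.
"""
from scipy.stats import binom

ALPHA = 0.05   # one-sided significance level (assumed)
P_NULL = 0.50  # pure guessing
P_ALT = 0.60   # assumed hit rate if the difference is audible

def binomial_test_power(n: int) -> float:
    """Exact power of the one-sided binomial test with n participants."""
    # Smallest number of correct answers that is significant under the
    # guessing null at level ALPHA.
    k_crit = int(binom.ppf(1 - ALPHA, n, P_NULL)) + 1
    # Probability of reaching that threshold when the true rate is P_ALT.
    return binom.sf(k_crit - 1, n, P_ALT)

for n in (20, 50, 100, 150, 200):
    print(f"n = {n:3d}: power = {binomial_test_power(n):.2f}")

# Sample size needed to *estimate* what percentage of listeners hear a
# difference, to within +/- E (normal approximation, worst case p = 0.5):
for E in (0.05, 0.03):
    n_est = 1.96 ** 2 * 0.25 / E ** 2
    print(f"estimating to +/- {E:.0%} needs about {n_est:.0f} listeners")
```

Under these assumptions, the yes/no test only reaches the conventional 80% power mark somewhere past 100 participants, and the estimation question lands roughly in the 400-1000 range, consistent with the figures above.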
Such a test would also have to include built-in controls: one arm that estimates how many people falsely identify identical items as sounding different (an A/A comparison, a measure of false positives), and another that estimates how many people fail to identify two items that are well known to sound different (a measure of false negatives). I've never seen a listening test use both of those controls.
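For completeness, here is one way those two control arms could be reported. The counts are entirely hypothetical, and the Wilson score interval is simply a standard choice for putting error bars on a proportion, not anything prescribed by the tests described above.

```python
"""Sketch of analyzing the two control arms of a listening test.

Hypothetical counts: in an A/A arm (identical items), some listeners
still report a difference (false positives); in a known-different arm,
some fail to report one (false negatives).
"""
from math import sqrt

def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion hits/n."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical control-arm results (numbers are illustrative only).
fp = wilson_interval(hits=18, n=100)  # 18/100 "heard" a difference in A/A
fn = wilson_interval(hits=12, n=100)  # 12/100 missed a known difference
print(f"false-positive rate: 18% (95% CI {fp[0]:.0%}-{fp[1]:.0%})")
print(f"false-negative rate: 12% (95% CI {fn[0]:.0%}-{fn[1]:.0%})")
```

The point of publishing both intervals is that the main A/B result can then be read against the baseline rates of people "hearing" things that aren't there and missing things that are.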