To understand why it's so hard for systems to image in a way that "matches" a live performance, you have to consider the entire chain.
Let's simplify it with an example: a live concert at which a recording is being made. You sit in a seat in the audience and listen to the concert. You are hearing sound from individual instruments arriving at your ears from particular locations on stage. What your ears get is the direct sound from each instrument, plus a lot of reflected sound from every surface in the venue. Your ears localize the instrument, yet hear the room as a three-dimensional acoustic space.
Now consider the recording chain. There are mics, which don't have the directional characteristics of a pair of ears on a head on a neck on shoulders, etc. Most mics are actually mono pickups with either a cardioid or omnidirectional pattern. No single mic can sample the 3D sound field, and even the few that claim to cannot do it in a way that matches the ears of every possible listener. There will be many mics: some spotted on individual instruments, some as spaced pairs, some as closely spaced X/Y pairs, and perhaps combinations of pairs on trees. If it's not an orchestra you're listening to, the recording will likely use fewer stereo pairs and more "spot" (mono) mics. But none of the mics will be at your head in your seat; they will probably be in locations you could never occupy.
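To make the "mono pickup with a pattern" idea concrete, here's a minimal sketch (my own illustrative numbers, nothing from an actual recording) of how first-order mic patterns weight sound by arrival angle, and how a coincident X/Y pair of cardioids encodes a source's stage position purely as a left/right level difference. The 90-degree splay and the example source angles are assumptions for illustration only.

```python
# Sketch of first-order mic pickup patterns and a coincident X/Y pair.
# All angles and the 90-degree splay are illustrative assumptions.
import math

def pattern_gain(kind: str, angle_deg: float) -> float:
    """Relative sensitivity of a first-order mic at a given off-axis angle."""
    c = math.cos(math.radians(angle_deg))
    if kind == "omni":
        return 1.0                # picks up equally from all directions
    if kind == "cardioid":
        return 0.5 + 0.5 * c      # full on-axis, null at 180 degrees
    if kind == "figure8":
        return c                  # nulls at 90/270 degrees
    raise ValueError(kind)

def xy_levels(source_angle_deg: float, splay_deg: float = 90.0):
    """Two cardioids at the same point, aimed +/- half the splay off center."""
    half = splay_deg / 2.0
    left = pattern_gain("cardioid", source_angle_deg + half)   # mic aimed at -half
    right = pattern_gain("cardioid", source_angle_deg - half)  # mic aimed at +half
    return left, right

if __name__ == "__main__":
    for src in (-60, -30, 0, 30, 60):   # hypothetical stage positions, degrees
        l, r = xy_levels(src)
        diff_db = 20 * math.log10(l / r)
        print(f"source at {src:+4d} deg -> L/R level difference {diff_db:+5.1f} dB")
```

The point of the sketch is only that a coincident pair captures direction as level differences between two mono signals, which is already a very different encoding than what two ears on a head receive in the hall.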
So far, the electrical signals the mics produce bear no resemblance to the acoustic signals arriving at your head in the audience, because the mics aren't where you are and are of a totally different design than your ears. Then these signals get mixed down to two or more channels, and we'll ignore for now the fact that the monitor system used in the mix won't match your system.
Now, you play that recording on your system in your room. What you hear is sound coming from your speakers, which aren't on a stage in a large 3D acoustic space; they're in your room. Their locations can't possibly match the locations of the original instruments in even the most basic azimuth and elevation. You're now hearing just two (or more) discrete sources. Again, ignore that their method of sound generation isn't the same as the original, and crash forward. Your ears hear the sound from the speakers, but also all the reflections coming off every surface in your room. None of those reflections existed in the original space. You also have the "mask" of the speakers' response, which didn't exist in the original. If you could remove all reflections from your room and just listen to two perfect speakers, you would have some darn nice imaging, but it would still be wrong. Why? Both of your ears hear both speakers, which means the best you can do for a phantom image is to get your head exactly on the center line between them. But even then a phantom center signal won't be "real", because each ear is still hearing both speakers: effectively a crosstalk "reflection" arriving a fraction of a millisecond after the direct sound. So the most tangible sound image occurs when a sound comes from only one speaker (which doesn't happen much these days). All other sound positions will be very different in size and location from the original soundfield.
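Here's a rough back-of-the-envelope sketch of that crosstalk point, using assumed speaker spacing, listening distance, and ear spacing (none of these numbers come from the post; they're just plausible room geometry). It shows that even for a perfectly centered listener, each ear gets the far speaker's copy of the signal a short time after the near speaker's, which a real center-stage instrument would never produce.

```python
# Why a phantom center isn't a real source: each ear hears both speakers.
# Layout numbers below are illustrative assumptions, in meters.
import math

SPEED_OF_SOUND = 343.0   # m/s, approximate at room temperature

def arrival_time(src, ear):
    """Straight-line propagation time from a source position to an ear (seconds)."""
    return math.dist(src, ear) / SPEED_OF_SOUND

# Hypothetical layout: speakers 2 m apart, listener 2.5 m back, ears 0.15 m apart.
left_spk, right_spk = (-1.0, 2.5), (1.0, 2.5)
left_ear = (-0.075, 0.0)          # the right ear is symmetric

direct = arrival_time(left_spk, left_ear)      # left speaker -> left ear
crosstalk = arrival_time(right_spk, left_ear)  # right speaker -> left ear

delay_us = (crosstalk - direct) * 1e6
print(f"crosstalk arrives {delay_us:.0f} microseconds after the direct sound")
# A real center instrument would give each ear one direct arrival from its
# actual position; with two speakers, each ear always gets two arrivals.
```

With this assumed geometry the delay comes out to roughly 150-200 microseconds, which is why the phantom center is a perceptual construction rather than something your ears could mistake for a single source on stage.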
Without going any further, you can see that it's pointless to ever think we can replicate the original. And that's not what anybody is trying to do anyway. What recording engineers and producers are aiming for is an acceptable and pleasing rendition of the original performance. Sort of like an impressionist painting: it doesn't try for accuracy or realism, but it's still pleasing.
Better imaging in a room can be had by eliminating as many early reflections as possible, either by treating the reflective surface or by directing sound from the speaker away from it. But even at its best, you can't ever replicate the original space, especially with just two channels. Front imaging with a 5.1 system is different: the phantom center is no longer an issue almost anywhere in the room. But you are still dealing with an impression of the original. With more channels come more variables, and more chances for the recording engineer to create something more dimensional. But even 5.1 is limited in its ability to localize, basically to a distorted circle that connects the speakers. Height and width channels help a lot in creating a 3D space, but so far we have no accessible content recorded that way, and must therefore depend on a spatial synthesis system to create artificial height and width.
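For the early-reflection point, here's a small sketch of the usual image-source geometry: mirror the speaker across the sidewall, and the straight line from that mirror image to the listener gives the reflected path. The room coordinates are hypothetical; the point is just that the first sidewall reflection arrives only a few milliseconds after the direct sound, well inside the window where the ear fuses it with the direct arrival and the image gets smeared.

```python
# First sidewall reflection via the image-source method.
# Room coordinates are hypothetical, in meters.
import math

SPEED_OF_SOUND = 343.0  # m/s

speaker = (1.0, 1.0)    # x = distance from the left sidewall, y = along the room
listener = (2.0, 4.0)
wall_x = 0.0            # left sidewall plane

# Reflect the speaker across the sidewall to get the image source.
image = (2 * wall_x - speaker[0], speaker[1])

direct_m = math.dist(speaker, listener)
reflected_m = math.dist(image, listener)

delay_ms = (reflected_m - direct_m) / SPEED_OF_SOUND * 1e3
print(f"direct path    {direct_m:.2f} m")
print(f"reflected path {reflected_m:.2f} m")
print(f"early reflection arrives {delay_ms:.1f} ms after the direct sound")
# Absorbing the wall at that reflection point, or using a speaker whose
# directivity keeps energy off the wall, removes or weakens this arrival.
```

Finding that mirror-image point on the wall is also exactly how the "mirror trick" for placing absorption panels works: wherever you can see the speaker's reflection from the listening seat is where the panel goes.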
There's no blaming the problem on the mics only, or the mix only, or the speakers or room only, or, especially, the recording medium, which of all the links in the chain has the least impact on the signal. But to maximize imaging, you have to eliminate as many extra influences as possible that weren't present in the original, and be certain that all front speakers match in character. You won't do that with a different, special "center" speaker. And the recording should have been produced specifically for your speaker layout. No fair re-mixing stereo to 5.1 if you want "reality", though some stereo-to-5.1 processes can produce a very pleasing impression. Speakers that image best tend to have smaller or more unified sources, at the expense of sounding "large". They also tend to have more controlled coverage patterns (dispersion, or directivity) that avoid directing sound at potentially reflective surfaces.
There are some speakers that are designed to sound big and spacious but break all the rules. These would be the bipolar or omnidirectional types, or types that deliberately point sound at reflective surfaces in the room. People confuse large, spacious sound with "imaging". It's very different, and though such speakers can be quite pleasing, there won't be much in the way of a tangible image. Again, that's not really "wrong" or "right"; if it's pleasing and you like it, that's what counts.