One problem is novelty effects. A major Achilles' heel of A/B tests is the short time frame over which they tend to be run, and this causes a couple of problems. First of all, there might be longer-term effects of the change that you're not going to measure, but there is also a certain effect from something simply being different on the website.
For instance, maybe your customers are used to seeing orange buttons on the website all the time, so when a blue button comes up, it catches their attention just because it's different. However, new customers who have never seen your website before won't notice it as being different, and over time even your old customers get used to the new blue button. It could very well be that if you ran the same test a year later, there would be no difference, or the results might even go the other way.
I could very easily see a situation where you test the orange button versus the blue button, and in the first two weeks the blue button wins. People buy more because they're more attracted to it, because it's different. But let a year go by, and I could probably run another web lab that puts that blue button against an orange button, and this time the orange button would win, again simply because the orange button is now the different one, and it's new and catches people's attention for that reason alone.
For that reason, if you do have a change that is somewhat controversial, it's a good idea to rerun that experiment later on and see if you can actually replicate the results. That's really the only way I know of to account for novelty effects: actually measure the effect again when it's no longer novel, when it's no longer just a change that might capture people's attention simply because it's different.
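To make that concrete, here's a minimal sketch of what re-checking a result might look like. It uses a two-proportion z-test from statsmodels to compare the two button variants in the original test and again in a hypothetical follow-up a year later; the conversion counts are made up purely for illustration and are not from any real experiment.

```python
# Sketch: re-checking a "winning" variant after the novelty has worn off.
# All numbers below are invented for illustration only.
from statsmodels.stats.proportion import proportions_ztest

def compare_buttons(conversions, visitors, label):
    """Two-proportion z-test: blue button vs. orange button."""
    z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
    print(f"{label}: z = {z_stat:.2f}, p = {p_value:.4f}")

# Original two-week test: blue looks like a clear winner.
compare_buttons(conversions=[620, 540], visitors=[10000, 10000],
                label="Initial test (blue vs. orange)")

# Re-run a year later, once blue is the familiar color: if the difference
# vanishes (or reverses), that points to a novelty effect rather than a real win.
compare_buttons(conversions=[555, 550], visitors=[10000, 10000],
                label="Follow-up test (blue vs. orange)")
```

If the second run shows no significant difference, the original "win" was probably just people noticing something new, not a genuine improvement.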
And I really can't overstate the importance of understanding this. Novelty effects can really skew a lot of results; they bias you toward attributing positive changes to things that don't really deserve the credit. Being different in and of itself is not a virtue, at least not in this context.