A lot of issues with this article. I’ll try to point out a few of the major ones and ask the author a few questions along the way.
“Only then we can “safely” disregard the inferior scenario or variant — e.g. ad — and thus maximize positive results.” — How is “safely” defined in each situation? Why is p = 0.0658 unsafe? Later on you experiment with different p-value thresholds, but never justify any of them.
“This can be expensive since we might be paying to deliver an ad impression and/or because of the opportunity cost induced by not maximizing our return due to a “bad” choice of scenario(s).” — Other costs are typically involved as well in scenarios where an A/B test is used. Ad testing is perhaps one of the few where such outside costs can be negligible (but even that is not necessarily the case!).
“So, until we reach the desired confidence in our decision to move entirely to the designated superior scenario, we are purely paying to explore;” — Not true: the users who are served a variant during the test are already in the exploitation phase, for better or worse (what if the variant is actually the inferior one?).
“Remember we want to more gradually and earlier move towards the scenarios we believe might turn out to be superior and thus reduce exploration costs and increase exploitation returns — a classic tradeoff.” — Do we? This has to square with actual decision-making in the business world, where external considerations dictate how long we are willing to stay in the exploration phase. Again, ad testing is a scenario where such considerations might be relaxed, but even there we are often forced to choose one option or the other, either because the test informs external decisions or because we need to launch with one ad or the other on a given date (a major product launch, etc.).
“We are going to test two different methods (the chi-squared split test — “Split” hereafter — and the Thompson beta bandit — “Bandit”)” — The frequentist test you use has little practical value in modern A/B testing, where data can and should be evaluated multiple times over the course of an experiment (with external validity considerations taken into account when determining a proper period between assessments). If you compare against a bandit, which uses information as it accrues, you should compare it to a frequentist method that also allows for sequential monitoring of the data (the class of sequential designs). Search for “AGILE statistical method” for an example from online A/B testing in particular.
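For readers who have not seen it, the entire allocation step of a Thompson beta bandit is one posterior draw per arm. A minimal sketch in Python (the counts and the flat Beta(1, 1) prior are my own illustrative choices, not taken from the article):

```python
import numpy as np

rng = np.random.default_rng(42)

def thompson_choose_arm(successes, failures):
    """Draw one sample from each arm's Beta posterior and serve the arm
    with the highest sampled conversion rate."""
    samples = rng.beta(np.asarray(successes) + 1, np.asarray(failures) + 1)
    return int(np.argmax(samples))

# Hypothetical counts: ad 0 has 30/1000 conversions, ad 1 has 38/1000
next_arm = thompson_choose_arm([30, 38], [970, 962])
print("Serve ad", next_arm)
```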
The Basic Simulation spreadsheet uses a two-tailed test instead of a one-tailed one. This is almost never justified in a real A/B test, including in ad testing. The calculated p-values also lack the appropriate adjustment for multiple analyses (repeated significance testing), so the error rate is not controlled (the classic “peeking” problem). This makes it an invalid frequentist test, as it does not provide the nominal error guarantees (it seems you are aiming for 0.05 here).
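To see why unadjusted repeated testing breaks the error guarantee, it is enough to simulate an A/A setup (no true difference between the ads) and count how often any interim chi-squared test dips below 0.05. A rough sketch; the number of looks, traffic per look, and base rate are arbitrary choices of mine:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
p_true = 0.03      # identical conversion rate in both arms (A/A test)
looks = 10         # number of interim analyses ("peeks")
per_look = 1000    # impressions added to each arm between looks
sims = 2000

false_positives = 0
for _ in range(sims):
    conv = np.zeros(2)
    n = np.zeros(2)
    for _ in range(looks):
        conv += rng.binomial(per_look, p_true, size=2)
        n += per_look
        table = np.stack([conv, n - conv], axis=1)  # rows: arms, cols: conv / no-conv
        _, p, _, _ = chi2_contingency(table)
        if p < 0.05:               # unadjusted threshold at every look
            false_positives += 1
            break

print("False positive rate with peeking:", false_positives / sims)
# With ~10 unadjusted looks this typically lands well above the nominal 5%.
```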
P-value changes section — essentially shows that running a bandit is equivalent to running a poorly designed A/B test with a tolerance for uncertainty higher than 0.05. I think a split test with a p-value threshold of 0.5 will roughly match the bandit in most scenarios (switch it to 0.33 in this particular scenario and it will outperform the bandit :-))). Why use a bandit then, rather than a split test with a very high tolerance for uncertainty? Why make a remark about false positives and then back away from it, saying nothing about how (or whether) a bandit controls false positives? I also don't see any examination of the no-difference scenario: how often would you be misled by the bandit if there is no real difference? (A simulation along the lines sketched below would answer that.)
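The no-difference question is straightforward to examine: run the bandit on two identical arms many times and count how often its posterior ends up confidently declaring a winner. A sketch under assumptions of my own (the horizon, the number of runs, and the 95% posterior-probability decision cutoff are all arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
p_true = 0.03        # identical true conversion rate for both ads
horizon = 5000       # impressions per simulated experiment
sims = 200

misled = 0
for _ in range(sims):
    s = np.zeros(2)  # successes per arm
    f = np.zeros(2)  # failures per arm
    for _ in range(horizon):
        # Thompson sampling allocation: one Beta draw per arm
        arm = int(np.argmax(rng.beta(s + 1, f + 1)))
        if rng.random() < p_true:
            s[arm] += 1
        else:
            f[arm] += 1
    # Monte Carlo estimate of P(arm 0 beats arm 1) under the final posteriors
    post = rng.beta(s[:, None] + 1, f[:, None] + 1, size=(2, 2000))
    prob = np.mean(post[0] > post[1])
    if prob > 0.95 or prob < 0.05:   # bandit "confidently" picks a winner
        misled += 1

print("Fraction of A/A runs ending with a confident (wrong) winner:", misled / sims)
```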
The first graph under Uncertainty has a different Y-axis starting point on the left side than on the right, which may lead some readers to misjudge the comparison between the two parts.
More Options — has any correction for multiple comparisons been applied to the chi-square calculations, e.g. Dunnett's? I'm fairly sure it hasn't, which again makes this an invalid frequentist test.
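For reference, a minimal multiplicity adjustment for several variant-vs-control chi-squared comparisons could look like the sketch below. I use a Holm correction for illustration (Dunnett's is sharper because it exploits the shared control, but needs more machinery); all counts are made up:

```python
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

# Hypothetical (conversions, impressions) for a control and three variants
control = (50, 10000)
variants = [(55, 10000), (68, 10000), (49, 10000)]

pvals = []
for conv, n in variants:
    table = [[control[0], control[1] - control[0]],
             [conv, n - conv]]
    _, p, _, _ = chi2_contingency(table)   # each variant vs the shared control
    pvals.append(p)

# Holm step-down adjustment keeps the family-wise error rate at alpha
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(list(zip(p_adj.round(4), reject)))
```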
Finally, and most crucially, the real-world example at the end seems to show the split test dominating the bandit over the medium and long term, even though it is executed badly as presented in the article (or perhaps because of it, given the particular scenario?). How do you then reach the conclusion you do? If a decision is needed in the short term, a split test will simply be run with a higher acceptable uncertainty, and that's it; it's not as if the bandit will miraculously provide lower uncertainty from the same data. A shorter period in which to decide and act is no impediment to using a frequentist test; that is exactly what they were designed for, which is why frequentist tests offer finite-sample guarantees, unlike Bayesian approaches, which mostly work asymptotically.
“An obvious conclusion and start for further research/testing could therefore be to combine both methods in a hybrid model, using the Bandit until the Split confidence is large enough” — are you describing Adaptive Sequential Designs here? There is significant literature on the topic, including several papers demonstrating that an adaptive sequential design (ASD) performs at most as well as an equivalent fixed-allocation sequential design, and most often performs slightly worse.