Research Paper · Priya Venkatesh · Marcus Feldstein · 2026-01-20
Wheel Bias Detection: Statistical Power Requirements and Type I Error Rates
The detection of meaningful wheel bias requires a carefully designed statistical test with adequate power, a pre-registered hypothesis, and appropriate correction for multiple comparisons. We derive theoretical power requirements for detecting biases of different magnitudes across European and American wheel configurations, and we simulate Type I error rates under a range of testing approaches. Our central finding is that claims of detected wheel bias based on samples of fewer than 10,000 spins are statistically unreliable at conventional significance levels, even when the apparent pocket frequency deviation is visually striking. We propose a minimum sample size standard of 15,000 spins per wheel for any publication claiming to document meaningful bias.
The statistical detection of wheel bias is a problem with a clear mathematical structure but a long history of procedural failures. Claims of biased wheels have appeared regularly in the roulette literature since the nineteenth century, and the majority share a common methodological weakness: insufficient sample size relative to the magnitude of bias being claimed.
We begin with a theoretical analysis. Consider a European wheel with 37 pockets. The null hypothesis is that each pocket occurs with probability 1/37. An alternative hypothesis posits that a specific pocket occurs with probability 1/37 + δ for some positive δ. The question is: for a given δ, how many spins are required to detect this bias with 80% power at a significance level of 0.05, after Bonferroni correction for 37 simultaneous tests?
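Written out, the testing problem described above takes the following form. The symbols p_i, j, and the per-test threshold are notation introduced here for illustration only; the Bonferroni threshold is simply the family-wise level divided by the number of pockets tested.

```latex
\begin{align*}
H_0 &: p_i = \tfrac{1}{37} \quad \text{for every pocket } i = 0, 1, \dots, 36, \\
H_1 &: p_j = \tfrac{1}{37} + \delta \quad \text{for one pocket } j,\ \delta > 0, \\
\alpha_{\text{per test}} &= \frac{0.05}{37} \approx 0.00135 .
\end{align*}
```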
Our derivation shows that for a bias of δ = 0.01 (one additional percentage point of landing probability, raising that pocket's probability from about 2.7% to about 3.7%), approximately 42,000 spins are required to achieve 80% power. For δ = 0.02, the requirement drops to approximately 11,000 spins. For δ = 0.05, which would represent an enormous mechanical defect, approximately 2,000 spins suffice. The practical implication is that only the most extreme biases are detectable with the sample sizes available to most observers.
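For readers who want to experiment with how the required sample size scales with δ, the following is a minimal sketch using the textbook normal approximation for a one-sided test of a single proportion with a Bonferroni-adjusted threshold. The function name `spins_required`, the one-sided test, and the normal approximation are all assumptions made for illustration. This is not the derivation used here, and its output should not be expected to reproduce the figures above, which depend on details of that derivation.

```python
from math import ceil, sqrt
from statistics import NormalDist


def spins_required(delta, pockets=37, alpha=0.05, power=0.80):
    """Approximate spins needed to detect an excess of `delta` on one pocket.

    Assumption: one-sided z-test on a single pocket's proportion, with the
    family-wise `alpha` split evenly across `pockets` tests (Bonferroni).
    """
    p0 = 1.0 / pockets  # pocket probability on a fair wheel
    p1 = p0 + delta     # pocket probability under the alternative
    if delta <= 0 or p1 >= 1:
        raise ValueError("delta must be positive and keep p1 below 1")

    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / pockets)  # adjusted critical value
    z_power = std_normal.inv_cdf(power)                  # quantile for target power

    # Standard sample-size formula for a one-sample proportion test.
    numerator = z_alpha * sqrt(p0 * (1.0 - p0)) + z_power * sqrt(p1 * (1.0 - p1))
    return ceil((numerator / delta) ** 2)


if __name__ == "__main__":
    for delta in (0.01, 0.02, 0.05):
        print(f"delta = {delta:.2f}: about {spins_required(delta):,} spins")
```

Passing `pockets=38` gives the corresponding approximation for an American wheel. Because the requirement scales roughly with 1/δ², halving the bias roughly quadruples the number of spins needed.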
We then examined Type I error rates (the probability of incorrectly concluding that bias exists when the wheel is fair) under three testing approaches commonly encountered in the literature: uncorrected chi-square tests, visual inspection of frequency histograms, and informal comparison of the most frequent pocket to the theoretical expectation. All three approaches produce substantially elevated Type I error rates relative to their nominal levels when applied to small samples. Visual inspection is the worst offender, with a Type I error rate exceeding 60% for samples of 500 spins.
To validate our theoretical results, we simulated one million fair wheels, each generating 5,000 spins, and applied all three testing approaches. The uncorrected chi-square test declared 'bias' in 8.3% of simulations, meaningfully above the nominal 5% due to the multiple comparison problem. The visual inspection method declared 'bias' in 58.7% of simulations. These results illustrate concretely why the literature on wheel bias requires stricter methodological standards.
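A small-scale version of this kind of simulation can be sketched as follows. How each testing approach was operationalized is not specified here, so the two decision rules below are assumptions: an omnibus chi-square goodness-of-fit test at the 5% level (with the critical value taken from the Wilson-Hilferty approximation rather than a table), and a naive rule that tests only the most frequent pocket at an uncorrected one-sided 5% level. Visual inspection is not modeled. The sketch uses far fewer wheels than the one million reported above, and its output need not match the reported rates.

```python
import random
from collections import Counter
from math import sqrt
from statistics import NormalDist

POCKETS = 37  # European wheel


def chi2_critical(df, alpha=0.05):
    """Upper-tail chi-square critical value (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * sqrt(c)) ** 3


def simulate_false_positives(n_wheels, n_spins, seed=0):
    """Return (chi-square rate, top-pocket rate) of 'bias' calls on fair wheels."""
    rng = random.Random(seed)
    p0 = 1.0 / POCKETS
    expected = n_spins * p0                 # expected count per pocket
    sd = sqrt(n_spins * p0 * (1.0 - p0))    # binomial sd of one pocket's count
    chi2_cut = chi2_critical(POCKETS - 1)
    z_cut = NormalDist().inv_cdf(0.95)      # uncorrected one-sided 5% cutoff

    chi2_hits = 0
    top_hits = 0
    for _ in range(n_wheels):
        counts = Counter(rng.choices(range(POCKETS), k=n_spins))
        observed = [counts.get(k, 0) for k in range(POCKETS)]

        # Assumed rule 1: omnibus chi-square goodness-of-fit test.
        chi2 = sum((o - expected) ** 2 / expected for o in observed)
        if chi2 > chi2_cut:
            chi2_hits += 1

        # Assumed rule 2: test only the most frequent pocket, no correction.
        if (max(observed) - expected) / sd > z_cut:
            top_hits += 1

    return chi2_hits / n_wheels, top_hits / n_wheels


if __name__ == "__main__":
    chi2_rate, top_rate = simulate_false_positives(n_wheels=200, n_spins=5000)
    print(f"chi-square rule:  {chi2_rate:.1%} of fair wheels flagged")
    print(f"top-pocket rule:  {top_rate:.1%} of fair wheels flagged")
```

Picking the most frequent pocket after seeing the data and then testing it as if it had been chosen in advance is the selection effect that a Bonferroni correction is meant to account for, which is why the second rule flags fair wheels far more often than its nominal level suggests.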
Our proposed minimum of 15,000 spins per wheel is conservative by design. At this sample size, the uncorrected chi-square test achieves its nominal Type I error rate, and biases of δ ≥ 0.015 can be detected with 80% power. We acknowledge that collecting 15,000 spins on a single wheel is demanding for individual researchers; we therefore encourage collaborative data collection among community members and the sharing of raw data for secondary analysis.