Accurate Brute Force Binomial Proportion Confidence Interval Calculator

Use it to calculate, for example, confidence intervals for task success rates in user studies. The method assumes a uniform prior: with no other information, every success rate from 0% to 100% is equally likely. I think it makes fewer extra assumptions than other methods (see below).


Why and how?

Say you have a weighted coin that lands heads with some probability p. You flip it 10 times and it lands heads 2 times. Naively, p=0.2. But what's the 95% confidence interval for p?

If you look up how to calculate a confidence interval for success/failure trials, you'll be confronted with a dizzying assortment of fancy math with unclear tradeoffs. The simple normal approximation is clearly wrong because it can yield numbers not between 0 and 1. But which of the many alternatives is best? What if you have a small sample size? What if p is near 0% or close to 100%?
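To make the problem with the normal approximation concrete, here is a minimal Python sketch of the textbook formula applied to the 2-heads-in-10-flips example. The function name `normal_approx_interval` and the z value of 1.96 for 95% are my own choices for illustration, not part of the calculator.

```python
import math

def normal_approx_interval(successes, trials, z=1.96):
    """Textbook normal-approximation interval: p_hat +/- z * standard error."""
    p_hat = successes / trials
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - half_width, p_hat + half_width

low, high = normal_approx_interval(2, 10)
print(low, high)  # the lower bound comes out below 0
```

For 2 heads out of 10 the lower bound is about -0.048, which is not a possible probability.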

I'm not smart enough to know, but I don't think you need to know. You can quickly brute-force a confidence interval to three or more decimal places.

This calculator works by assuming a uniform prior: before any coin flips, each p between 0 and 1 is equally likely to be the true p. It then tests a bunch of different p's (that's the brute force part) to see which p's are most likely to produce the observed results.

More specifically, using the example of 2 heads out of 10 flips:

  1. The calculator creates 101 different coins. The first flips heads with probability p=0.00, the second with p=0.01, the third with p=0.02, and so on up to p=1.00.
  2. For each of the 101 coins it computes the probability that the coin would produce 2 heads if flipped 10 times.
  3. It adds up these probabilities across all the coins to get a total probability mass. (The sum is done in a numerically stable way to handle the vast differences between probabilities.)
  4. It sorts the coins from least to most probable, then discards the least probable coins that together account for 5% of the probability mass.
  5. The remaining coins hold the most probable 95% of the probability mass. The p's of the lowest and highest remaining coins are the 95% confidence interval boundaries.
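The five steps above can be sketched in a few lines of Python. This is a rough illustration, not the calculator's actual code: the function name `brute_force_interval`, the tie-breaking order, and the choice to keep the coin that straddles the 5% boundary are all assumptions on my part.

```python
import math

def brute_force_interval(successes, trials, coins=101, mass=0.95):
    # Step 1: evenly spaced coins, p = 0.00, 0.01, ..., 1.00 for coins=101.
    ps = [i / (coins - 1) for i in range(coins)]

    # Step 2: probability that each coin produces the observed result.
    failures = trials - successes
    ways = math.comb(trials, successes)
    likes = [ways * p**successes * (1 - p)**failures for p in ps]

    # Step 3: total probability mass across all coins.
    total = math.fsum(likes)

    # Step 4: walk from least to most probable, discarding coins until
    # discarding one more would exceed 5% of the total mass (assumption:
    # the coin that straddles the boundary is kept).
    budget = (1 - mass) * total
    discarded = 0.0
    kept = set(range(coins))
    for i in sorted(range(coins), key=lambda i: likes[i]):
        if discarded + likes[i] > budget:
            break
        discarded += likes[i]
        kept.discard(i)

    # Step 5: the lowest and highest surviving p's are the boundaries.
    return ps[min(kept)], ps[max(kept)]

low, high = brute_force_interval(2, 10)
print(low, high)
```

With only 101 coins the boundaries can only land on multiples of 0.01, which is why a much finer grid is needed for three decimal places.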

In practice the calculator creates far more than 101 coins: up to 1,000,001. This works up to about 1000 trials, after which the inaccuracies of floating point math overwhelm the computation. Don't worry, though: the calculator will not report a result when it detects that floating point round-off changes the result.
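I won't reproduce the calculator's exact summation here, but one common way to keep this kind of sum stable is to work with logarithms of the probabilities and use the log-sum-exp trick. The sketch below assumes that approach; the names `log_likelihood`, `log_sum_exp` and `normalized_weights` are hypothetical, and the default of 10,001 coins is just to keep the pure-Python example fast.

```python
import math

def log_likelihood(p, successes, trials):
    """Log-probability of the observed result for one coin.

    The binomial coefficient is the same for every coin, so it is
    omitted; it cancels when the weights are normalized.
    """
    failures = trials - successes
    if (p == 0.0 and successes > 0) or (p == 1.0 and failures > 0):
        return -math.inf  # this coin cannot produce the observed result
    result = 0.0
    if successes:
        result += successes * math.log(p)
    if failures:
        result += failures * math.log1p(-p)
    return result

def log_sum_exp(values):
    """log(sum(exp(v))) computed relative to the largest value."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(math.fsum(math.exp(v - peak) for v in values))

def normalized_weights(successes, trials, coins=10_001):
    """Each coin's share of the total probability mass (shares sum to 1)."""
    ps = [i / (coins - 1) for i in range(coins)]
    logs = [log_likelihood(p, successes, trials) for p in ps]
    total = log_sum_exp(logs)
    return ps, [math.exp(v - total) for v in logs]

ps, weights = normalized_weights(200, 1000)
print(ps[weights.index(max(weights))])  # the most probable coin
```

Subtracting the largest log-probability before exponentiating means the most probable coins are represented near 1.0 while hopeless coins quietly round to 0, rather than everything being squeezed toward the bottom of the floating point range.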

Compared to traditional methods, this way:

Wolfram Alpha will let you try the traditional calculation methods if you want to compare.