Uncertainty in MLB Predictions: Confidence Bands and Variance
A prediction model says Team A has a 58 percent chance of winning tonight's game. What does that actually mean? If the model is well-calibrated, it means that across all the games where the model assigns a 58 percent probability, the favored team wins approximately 58 percent of the time. But the single number, 58 percent, conceals an enormous amount of information about how confident the model is in that estimate, how wide the plausible range of outcomes stretches, and whether the prediction is driven by strong signal or by thin, uncertain inputs.
Expressing uncertainty is not a weakness of a prediction system. It is a feature. A model that says "55 percent, plus or minus 3 percent" is telling you something fundamentally different from one that says "55 percent, plus or minus 12 percent." Both arrive at the same point estimate, but the second is far less sure of itself, and that difference in confidence changes how the prediction should be interpreted and used.
The Problem with Point Estimates
Most prediction outputs are communicated as single numbers: a win probability, an expected run total, a projected ERA. These point estimates are useful as summaries, but they are lossy compressions of the underlying prediction distribution. When a model predicts an expected game total of 8.5 runs, the actual distribution of possible outcomes might range from 2 to 16 runs. Two games with the same expected total of 8.5 can have very different distributions: one might be tightly clustered around 8 to 9 runs (two mediocre offenses against two mediocre pitchers), while another might be spread widely from 3 to 14 (a volatile bullpen game with extreme platoon splits and wind blowing out).
Point estimates strip away the shape of the distribution. They tell you where the center is but not how wide the spread is, whether the distribution is symmetric or skewed, or whether there are fat tails representing low-probability but high-impact scenarios. For downstream applications that depend on understanding the full range of outcomes, point estimates are insufficient.
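To make the lossy-compression point concrete, here is a minimal sketch in Python. The run totals are drawn from a right-skewed gamma distribution as a stand-in for a real simulator's output; the shape and scale values are illustrative assumptions, not fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(11)

# Synthetic distribution of game run totals with a long right tail.
# Shape and scale are chosen so the mean lands near 8.5 runs (an assumption).
run_totals = rng.gamma(shape=6.0, scale=1.42, size=20_000)

point_estimate = run_totals.mean()
p5, p50, p95 = np.percentile(run_totals, [5, 50, 95])

print(f"point estimate (mean): {point_estimate:.1f}")
print(f"median:                {p50:.1f}")
print(f"5th-95th percentile:   {p5:.1f} to {p95:.1f}")
print(f"P(total >= 13):        {np.mean(run_totals >= 13):.3f}")  # tail scenario
```

The single summary number hides the asymmetry and the tail probability entirely; the percentiles recover part of what was discarded.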
Aleatoric vs. Epistemic Uncertainty
Not all uncertainty is the same. Prediction science distinguishes between two fundamentally different sources of uncertainty, and understanding the distinction is critical for building honest prediction systems.
Aleatoric Uncertainty
Aleatoric uncertainty is the inherent randomness in the system. Baseball is a stochastic sport. A batter with a .300 batting average fails 70 percent of the time. A pitcher with a 3.00 ERA still gives up runs. Even if you had perfect information about every variable, the outcome of a single plate appearance, inning, or game would still be uncertain because the physical processes involved (bat-ball contact angle, seam orientation, tiny variations in pitch release) contain irreducible randomness.
Aleatoric uncertainty cannot be reduced by gathering more data or building better models. It is a fundamental property of the game. A perfectly calibrated model that assigns a 60 percent win probability to Team A is acknowledging that Team B wins 40 percent of the time due to the inherent randomness of competitive baseball, not due to the model's ignorance.
Epistemic Uncertainty
Epistemic uncertainty arises from imperfect knowledge. The model does not have complete information about every relevant variable. It does not know the exact state of a pitcher's fatigue. It does not know whether a batter tweaked his hamstring in batting practice. It does not know the precise wind conditions at every point in the stadium. It is working with estimates, approximations, and statistical inferences derived from finite data.
Unlike aleatoric uncertainty, epistemic uncertainty can be reduced. More data, better sensors, more sophisticated modeling, and higher-quality information all narrow the epistemic component. The goal of model improvement is to convert epistemic uncertainty into knowledge, but the aleatoric floor remains. In baseball, that floor is high: even the most informed prediction system in the world cannot justify single-game win probabilities much beyond 60 to 65 percent for the strongest favorites.
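The two components can be separated explicitly when a model is run as an ensemble. The sketch below assumes a set of win-probability estimates from hypothetical model reruns (the values are invented) and applies the law of total variance for a Bernoulli outcome: the expected within-run variance is the aleatoric piece, and the spread across runs is the epistemic piece.

```python
import numpy as np

# Hypothetical win-probability estimates for Team A from repeated model runs
# (bootstrap refits, perturbed inputs, etc.); the values are invented.
ensemble_probs = np.array([0.55, 0.61, 0.58, 0.52, 0.64, 0.57, 0.60, 0.54])

# Law of total variance for a Bernoulli outcome Y with uncertain parameter p:
#   Var(Y) = E[p(1 - p)]  +  Var(p)
#            (aleatoric)     (epistemic)
aleatoric = np.mean(ensemble_probs * (1 - ensemble_probs))
epistemic = np.var(ensemble_probs)

print(f"mean win probability: {ensemble_probs.mean():.3f}")
print(f"aleatoric variance:   {aleatoric:.4f}")  # large: the game is genuinely random
print(f"epistemic variance:   {epistemic:.4f}")  # small: the model runs mostly agree
```

Even when the ensemble agrees closely, the aleatoric term dominates, which is exactly the high floor described above.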
Monte Carlo Simulations as Uncertainty Engines
Monte Carlo simulation models produce uncertainty quantification as a natural byproduct of their methodology. By simulating a game thousands of times with stochastic variation in each plate appearance, the simulation generates not a single outcome but a full distribution of outcomes. Team A wins in 5,800 of 10,000 simulations. The combined run total exceeds 9 in 4,200 simulations. Team A wins by 3 or more runs in 1,900 simulations.
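A deliberately simplified sketch of this idea, assuming each team's runs per inning follow a Poisson distribution with invented scoring rates (a real simulator would model individual plate appearances), shows how the full outcome distribution falls out of repeated simulation:

```python
import numpy as np

rng = np.random.default_rng(7)
N_SIMS = 10_000
RATE_A, RATE_B = 0.55, 0.48  # assumed mean runs per inning for each team

def simulate_game():
    """Simulate one game: nine Poisson innings per team, crude extra innings on a tie."""
    runs_a = rng.poisson(RATE_A, size=9).sum()
    runs_b = rng.poisson(RATE_B, size=9).sum()
    while runs_a == runs_b:
        runs_a += rng.poisson(RATE_A)
        runs_b += rng.poisson(RATE_B)
    return runs_a, runs_b

results = np.array([simulate_game() for _ in range(N_SIMS)])
team_a, team_b = results[:, 0], results[:, 1]

print(f"Team A win probability: {(team_a > team_b).mean():.3f}")
print(f"P(total runs > 9):      {(team_a + team_b > 9).mean():.3f}")
print(f"P(Team A wins by 3+):   {(team_a - team_b >= 3).mean():.3f}")
```

Every probability reported here is just a count over the simulated games; any other event of interest can be read off the same set of outcomes.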
This distribution contains far more information than any point estimate summarized from it. The width of the win-probability distribution reveals how sensitive the prediction is to random variation. The shape of the run-total distribution reveals whether high-scoring or low-scoring outcomes are more likely. The tail behavior reveals the probability of extreme outcomes: blowouts, shutouts, extra-inning marathons.
The simulation's variance is itself a predictive variable. Games with wider outcome distributions are inherently harder to predict than games with narrow distributions. A model that can distinguish between high-variance and low-variance games is providing useful meta-information about its own reliability.
Confidence Intervals for Win Probability
When a model reports that Team A has a 55 percent win probability, the honest version of that statement is something like "our estimate of Team A's win probability is 55 percent, with a 90 percent confidence interval of 47 to 63 percent." The confidence interval reflects the epistemic uncertainty in the estimate: if we reran the model with slightly different input assumptions, slightly different training data, or slightly different parameter estimates, the win probability would move within that range.
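One hedged way to produce such an interval is to perturb the inputs and rerun the model. In the sketch below the "model" is a toy logistic function of a team-strength differential, and the differential's mean and spread are made-up numbers standing in for real epistemic uncertainty:

```python
import numpy as np

rng = np.random.default_rng(0)

def win_probability(strength_diff):
    """Toy model: logistic map from a team-strength differential to a win probability."""
    return 1.0 / (1.0 + np.exp(-strength_diff))

# Assumed epistemic uncertainty in the input: the differential is estimated at
# 0.20 with a standard error of 0.15 (both numbers are invented).
perturbed_inputs = rng.normal(loc=0.20, scale=0.15, size=5_000)
perturbed_probs = win_probability(perturbed_inputs)

point_estimate = win_probability(0.20)
lo, hi = np.percentile(perturbed_probs, [5, 95])
print(f"point estimate: {point_estimate:.3f}")  # about 0.55
print(f"90% interval:   {lo:.3f} to {hi:.3f}")  # roughly 0.49 to 0.61
```

The same point estimate would come with a much wider interval if the input's standard error were larger, which is the entire message of the interval.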
Narrow confidence intervals indicate that the model is working with strong, consistent signals. The pitching matchup clearly favors one side. The home/road split is large. The bullpen quality differential is unambiguous. Wide confidence intervals indicate that the model's inputs are conflicting or thin. The starting pitcher has only 30 innings of data. The team's recent performance contradicts its season-long metrics. The bullpen status is uncertain due to recent overuse.
Communicating these intervals is valuable because it allows downstream consumers to weight predictions appropriately. A 55 percent estimate with tight confidence bounds is a very different signal than a 55 percent estimate with wide bounds, even though the point estimate is identical.
Run Distribution Variance
Two games can have the same expected run total but very different variance in their run distributions. Consider two games, both with an expected total of 8.0 runs. In Game A, two solid mid-rotation starters face each other in a neutral park with mild weather. The run distribution is approximately normal, clustered around 7 to 9 runs, with modest tails. In Game B, an elite ace faces a struggling lineup, but the bullpen behind him is depleted from a 14-inning game yesterday. The run distribution is bimodal: there is a significant probability of a low-scoring gem (2 to 4 runs) if the ace dominates, and a significant probability of a high-scoring game (11 to 15 runs) if the bullpen enters early and collapses.
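The contrast can be sketched with synthetic distributions: both "games" below have an expected total of 8.0 runs, but one is roughly normal while the other is a bimodal mixture. The component means and spreads are illustrative assumptions, not fits to real data.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50_000

# Game A: roughly normal, clustered around 7 to 9 runs.
game_a = rng.normal(loc=8.0, scale=1.2, size=N)

# Game B: bimodal mixture (ace dominates for a low total, or the bullpen
# collapses for a high one), with equal probability of either regime.
ace_gem = rng.normal(loc=3.0, scale=1.0, size=N)
bullpen_collapse = rng.normal(loc=13.0, scale=1.5, size=N)
game_b = np.where(rng.random(N) < 0.5, ace_gem, bullpen_collapse)

for name, totals in [("Game A", game_a), ("Game B", game_b)]:
    print(f"{name}: mean={totals.mean():.1f}, std={totals.std():.1f}, "
          f"P(total <= 4)={np.mean(totals <= 4):.2f}, "
          f"P(total >= 11)={np.mean(totals >= 11):.2f}")
```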
These two games have the same expected total but fundamentally different shapes. The variance of the run distribution, not just its mean, contains predictive information. High-variance games are inherently less predictable on a single-game basis. They are also the games where the model's confidence should be stated most cautiously.
Information Quality and Prediction Width
The confidence of a prediction should be a function of the quality and quantity of information available. Early-season predictions, when starters have thrown only 20 to 30 innings and team-level metrics are based on 15 to 20 games, carry substantially more uncertainty than late-season predictions based on 150 games and 180 innings of starter data. A responsible model acknowledges this by producing wider confidence bands in April than in September.
Several factors affect information quality and, consequently, prediction width. Pitcher sample size is the most significant: a starter with 500 innings of major-league Statcast data provides a much tighter signal than a rookie making his fifth career start. Lineup stability matters: a team that has run the same batting order for 30 games provides cleaner offensive projections than one making three lineup changes due to injuries. Environmental conditions also affect prediction width, because the uncertainty around weather forecasts (wind speed, precipitation probability) propagates into the game-level prediction.
Models that maintain a constant confidence level regardless of information quality are implicitly overclaiming precision when data is thin and underclaiming it when data is rich. Adaptive confidence intervals that expand and contract based on input quality produce better-calibrated uncertainty estimates.
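A simple way to see this behavior is a Beta posterior over a team's underlying win rate, used here as a stand-in for a fuller model. Holding the observed win fraction fixed at .600 and varying only the number of games played shows how the interval contracts as information accumulates (the uniform prior is an assumption of the sketch):

```python
import numpy as np

rng = np.random.default_rng(5)

def win_rate_interval(games_played, observed_rate=0.60, level=0.90):
    """Interval for the underlying win rate via a Beta posterior under a uniform prior."""
    wins = observed_rate * games_played
    losses = games_played - wins
    draws = rng.beta(wins + 1, losses + 1, size=50_000)
    lo, hi = np.percentile(draws, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
    return lo, hi

for games in (15, 60, 150):
    lo, hi = win_rate_interval(games)
    print(f"{games:3d} games: 90% interval {lo:.3f} to {hi:.3f} (width {hi - lo:.3f})")
```

The April interval is several times wider than the September interval even though the observed rate never changes, which is exactly the adaptive behavior a responsible model should exhibit.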
Calibration Analysis
The most important test of a model's uncertainty quantification is calibration: do the stated confidence levels actually correspond to observed outcomes? If the model says "Team A wins 60 percent of the time," does Team A actually win 60 percent of those games? If the model says its 90 percent confidence interval for the run total is 5 to 12, does the actual run total fall within that range in 90 percent of games?
Calibration analysis is the diagnostic that separates honest uncertainty quantification from theatrical confidence intervals. A model can be overconfident (its 70 percent predictions win only 62 percent of the time, meaning it assigns too much certainty to its estimates) or underconfident (its 55 percent predictions win 60 percent of the time, meaning it is leaving information on the table by being too cautious).
Perfect calibration is the ideal but is rarely achieved in practice. Most models show slight overconfidence at the extremes (games rated at 65 percent or higher tend to win slightly less than advertised) and reasonable calibration near the center (games rated at 50 to 55 percent). Monitoring calibration over time and across different game contexts (home/road, divisional/interleague, day/night) reveals whether the model's uncertainty estimates are reliable or need recalibration.
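A basic calibration check bins historical predictions by stated probability and compares each bin's average prediction to the observed win rate. The sketch below uses randomly generated predictions and outcomes that are calibrated by construction, so it only demonstrates the bookkeeping; a real prediction log would surface the over- and underconfidence patterns described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated prediction log: stated win probabilities for the favored team and
# binary outcomes that are perfectly calibrated by construction.
predicted = rng.uniform(0.45, 0.70, size=20_000)
outcomes = (rng.random(20_000) < predicted).astype(int)

bin_edges = np.linspace(0.45, 0.70, 6)  # 0.45, 0.50, ..., 0.70
bin_index = np.digitize(predicted, bin_edges) - 1

print("bin          mean predicted   observed win rate       n")
for b in range(len(bin_edges) - 1):
    mask = bin_index == b
    if not mask.any():
        continue
    print(f"{bin_edges[b]:.2f}-{bin_edges[b + 1]:.2f}       "
          f"{predicted[mask].mean():.3f}            "
          f"{outcomes[mask].mean():.3f}          {mask.sum():5d}")
```

When the two columns diverge systematically in a bin, that bin's predictions need recalibration.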
The Base Rate of Upsets
Baseball has the highest upset rate of any major professional sport. In a typical MLB season, the home team wins approximately 53 to 54 percent of games. The best team in baseball finishes with a .600 to .620 winning percentage, meaning they lose 62 to 65 games. The worst team still wins 50 to 60 games. No team is ever a reliable single-game lock.
This base rate of unpredictability is the aleatoric floor that all prediction models must respect. A model that frequently assigns win probabilities above 70 percent in regular-season MLB games is almost certainly overconfident, because the structural randomness of the sport (the nine-inning sample of plate appearances, the high failure rate of hitting, the variance in pitcher performance) prevents even the most lopsided matchups from reaching that level of certainty with any regularity.
Understanding this base rate is essential context for interpreting any MLB prediction. When a model says 58 percent, it is not expressing lukewarm confidence. It is expressing one of the stronger signals the sport allows. The fact that 58 percent "feels" close to a coin flip is a reflection of baseball's inherent uncertainty, not a failure of the model.
Why High Uncertainty Is Not Bad Prediction
There is a common misconception that wide confidence intervals or near-50-50 predictions indicate a poor model. The opposite can be true. A model that assigns 50-50 to a game where the information genuinely does not favor either side is being honest. A model that forces a 58 percent prediction on that same game to appear more decisive is being dishonest and, over time, will show poor calibration.
High uncertainty, properly communicated, is itself valuable information. It tells the consumer that this is a game where the margin is thin, the inputs are conflicting, and confident claims of superiority are not supported by the data. Forcing false precision onto inherently uncertain situations is one of the most common failures in prediction systems, and it erodes trust far more than honestly stating "we don't have a strong read on this one."
Uncertainty in Downstream Analysis
The way a model quantifies and communicates uncertainty has direct consequences for how its outputs are used in model-versus-market comparisons. A point estimate of 55 percent compared to a market-implied probability of 52 percent looks like a three-point edge. But if the model's 90 percent confidence interval spans 47 to 63 percent, the apparent edge is well within the range of the model's own uncertainty. The comparison becomes meaningful only when the model's confidence interval does not overlap with the market price.
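A minimal sketch of that overlap test, using the illustrative numbers from the paragraph above (a 55 percent model estimate, a 47 to 63 percent interval, and a 52 percent market-implied probability):

```python
def edge_is_meaningful(model_prob, ci_low, ci_high, market_prob):
    """Treat an apparent edge as meaningful only if the market-implied
    probability falls outside the model's own confidence interval."""
    edge = model_prob - market_prob
    outside_interval = market_prob < ci_low or market_prob > ci_high
    return edge, outside_interval

edge, meaningful = edge_is_meaningful(
    model_prob=0.55, ci_low=0.47, ci_high=0.63, market_prob=0.52
)
print(f"apparent edge: {edge:+.2f}, outside model interval: {meaningful}")
# -> apparent edge: +0.03, outside model interval: False
```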
This framework, evaluating edges relative to the model's own uncertainty bands rather than against its point estimate, produces more conservative but more reliable signals. It acknowledges that a three-point edge in a high-confidence game is worth more than a five-point edge in a low-confidence game, even though the raw numbers suggest the opposite.
The models that perform best over long horizons are not the ones that generate the most aggressive predictions. They are the ones that correctly distinguish between games where they have a reliable signal and games where they do not. Uncertainty quantification is the mechanism that makes this distinction possible.