MLB Game Simulation Models: Monte Carlo Methods in Baseball
A baseball prediction model can estimate that a team has a 58% chance of winning. But where does that number come from? The most rigorous approach is Monte Carlo simulation: modeling the game from first pitch to final out, thousands of times, and counting how often each team wins. The result is not a single estimate but a full distribution of possible outcomes, capturing the inherent randomness of baseball with statistical precision.
Monte Carlo methods have been used in physics, finance, and engineering for decades. Their application to baseball leverages the sport's uniquely sequential, discrete structure. Each plate appearance has a finite set of possible outcomes. Each outcome changes the game state in a predictable way. By chaining these transitions together across nine or more innings, a simulation engine can generate realistic game trajectories that reflect the true variance in baseball outcomes.
What Monte Carlo Simulation Means in Baseball
The core idea is straightforward. Instead of computing a closed-form probability (which is mathematically intractable for a system as complex as a baseball game), you simulate the game many times with random sampling and observe the distribution of results. If you simulate 10,000 games between Team A and Team B, and Team A wins 5,800 of them, you estimate Team A's win probability at 58%.
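The counting step can be sketched in a few lines. This is a minimal illustration, not a production engine: `estimate_win_probability` and `toy_game` are hypothetical names, and `toy_game` replaces the full game simulation with a fixed 58% coin flip purely so the example runs on its own.

```python
import random

def estimate_win_probability(simulate_game, n_sims=10_000, seed=0):
    """Fraction of simulated games won by Team A.

    simulate_game is a hypothetical callable that takes a random.Random
    instance and returns True when Team A wins that simulated game.
    """
    rng = random.Random(seed)
    wins = sum(simulate_game(rng) for _ in range(n_sims))
    return wins / n_sims

def toy_game(rng):
    # Stand-in for a full game engine (assumption): Team A wins 58% of the time.
    return rng.random() < 0.58

estimate = estimate_win_probability(toy_game)
print(f"Estimated win probability: {estimate:.3f}")
```

In a real model, `toy_game` would be replaced by a function that plays out every plate appearance; the estimator itself stays the same.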
The power of this approach lies in its generality. The simulation does not assume a particular scoring distribution. It does not assume that runs arrive according to a Poisson process or any other simplified model. It builds the scoring distribution organically from the sequence of individual events that constitute the game. If the interaction between two specific lineups and pitching staffs produces a bimodal scoring distribution (one cluster of low-scoring games and another cluster of blowouts, with relatively few games in between), the simulation will capture that structure. A parametric model using a single distribution shape might miss it entirely.
The Simulation Loop
A single simulation of a baseball game proceeds as follows. The simulation begins at the top of the first inning with nobody on and nobody out. The away team's leadoff hitter faces the home team's starting pitcher. The model draws a plate appearance outcome from the probability distribution for that specific batter-pitcher matchup. The outcome updates the base-out state: perhaps the batter singled and is now on first with nobody out.
The next batter in the lineup order then faces the same pitcher, and another outcome is drawn. This continues until three outs are recorded, at which point the half-inning ends and any runs scored are tallied. The home team then bats in the bottom of the first, following the same process against the away team's starting pitcher. This alternation continues through nine innings, with the home team's bottom of the ninth omitted if they are already ahead. If the score is tied after nine innings, the simulation continues into extra innings until the tie is broken.
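The loop described above can be sketched as follows. This is a deliberately simplified illustration under loud assumptions: every batter shares one hypothetical outcome distribution (`WEIGHTS`), runners advance exactly as many bases as the batter, there are no pitching changes, and a walk-off does not cut the final half-inning short. The function names are invented for this sketch.

```python
import random

# Hypothetical outcome probabilities shared by every batter (illustrative only).
OUTCOMES = ["out", "walk", "single", "double", "triple", "home_run"]
WEIGHTS = [0.68, 0.08, 0.15, 0.05, 0.005, 0.035]
HIT_BASES = {"single": 1, "double": 2, "triple": 3, "home_run": 4}

def advance(bases, outcome):
    """Return (new_bases, runs); bases = [first, second, third] occupancy flags."""
    first, second, third = bases
    if outcome == "walk":  # only forced runners move up
        runs = 1 if (first and second and third) else 0
        return [True, first or second, (first and second) or third], runs
    n = HIT_BASES[outcome]
    runners = [i + 1 for i, occupied in enumerate(bases) if occupied] + [0]  # 0 = batter
    new_bases, runs = [False, False, False], 0
    for pos in runners:
        if pos + n >= 4:
            runs += 1
        else:
            new_bases[pos + n - 1] = True
    return new_bases, runs

def simulate_half_inning(rng, start_on_second=False):
    """Play plate appearances until three outs; return runs scored."""
    outs, runs, bases = 0, 0, [False, start_on_second, False]
    while outs < 3:
        outcome = rng.choices(OUTCOMES, weights=WEIGHTS)[0]
        if outcome == "out":
            outs += 1
        else:
            bases, scored = advance(bases, outcome)
            runs += scored
    return runs

def simulate_game(rng):
    """Return (away_runs, home_runs) for one simulated game."""
    away = home = 0
    inning = 1
    while True:
        extra = inning > 9  # assumed extra-inning rule: runner starts on second
        away += simulate_half_inning(rng, start_on_second=extra)
        # The home half is skipped when the home team already leads in the ninth or later.
        if not (inning >= 9 and home > away):
            home += simulate_half_inning(rng, start_on_second=extra)
        if inning >= 9 and home != away:
            return away, home
        inning += 1

rng = random.Random(42)
results = [simulate_game(rng) for _ in range(2000)]
home_win_rate = sum(h > a for a, h in results) / len(results)
print(f"Home win rate: {home_win_rate:.3f}")
```

Because both teams draw from the same distribution here, the home win rate should land near 50%; matchup-specific distributions would replace `WEIGHTS` in a real engine.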
Inning-Level Mechanics
Within each half-inning, the simulation tracks the base-out state, which determines the run expectancy context for each plate appearance. When a batter singles with a runner on second, the simulation must resolve whether the runner scores, advances to third, or is thrown out at the plate. These base-running outcomes are themselves probabilistic, depending on the runner's speed, the outfielder's arm, and the base-out state.
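As a rough illustration of a probabilistic base-running step, the sketch below resolves a runner on second after a single. Every number in it is a made-up placeholder, `runner_speed` is a hypothetical 0-to-1 rating, and the outfielder's arm is left out for brevity.

```python
import random
from collections import Counter

def resolve_runner_from_second(rng, runner_speed=0.5, outs=0):
    """Return 'scores', 'holds_at_third', or 'out_at_home' (hypothetical rates)."""
    # Runners go on contact with two outs, so they score more often.
    p_score = 0.45 + 0.30 * runner_speed + (0.10 if outs == 2 else 0.0)
    p_out_at_home = 0.03
    p_hold = 1.0 - p_score - p_out_at_home
    return rng.choices(
        ["scores", "holds_at_third", "out_at_home"],
        weights=[p_score, p_hold, p_out_at_home],
    )[0]

rng = random.Random(7)
tally = Counter(resolve_runner_from_second(rng, runner_speed=0.8) for _ in range(1000))
print(tally)
```

A fitted model would estimate these probabilities from tracking data rather than hard-coding them.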
The simulation also tracks the pitch count of the starting pitcher and the associated fatigue effects. As a starter's pitch count climbs, his outcome distribution shifts: walk rates increase, strikeout rates decrease, and hard-contact rates rise. The model uses historical pitch count performance curves to adjust the starter's effectiveness as the game progresses.
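One simple way to express such a fatigue shift is to scale the affected outcome rates and renormalize. The sketch below assumes a linear effect that starts at a hypothetical `onset` of 75 pitches; the outcome categories, base rates, and the `per_pitch` slope are all illustrative, not fitted performance curves.

```python
# Hypothetical fresh-arm outcome rates for one starter (must sum to 1).
BASE_PROBS = {"strikeout": 0.24, "walk": 0.08, "hard_contact": 0.25, "other": 0.43}

def fatigue_adjusted(probs, pitch_count, onset=75, per_pitch=0.004):
    """Shift outcome probabilities once pitch_count passes onset, then renormalize."""
    extra = max(0, pitch_count - onset)
    factor = 1 + per_pitch * extra
    adjusted = dict(probs)
    adjusted["walk"] = probs["walk"] * factor                  # walks rise
    adjusted["hard_contact"] = probs["hard_contact"] * factor  # hard contact rises
    adjusted["strikeout"] = probs["strikeout"] / factor        # strikeouts fall
    total = sum(adjusted.values())
    return {k: v / total for k, v in adjusted.items()}

tired = fatigue_adjusted(BASE_PROBS, pitch_count=100)
print({k: round(v, 3) for k, v in tired.items()})
```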
Bullpen Transitions
One of the most consequential decisions in a baseball game is when to remove the starting pitcher and which reliever to bring in. The simulation models this transition using a combination of pitch count thresholds, performance triggers (consecutive baserunners, for example), and inning-based norms. When the starter exits, the simulation selects from the available bullpen arms based on the game state and the manager's likely usage pattern.
Reliever selection is not random. High-leverage relievers (closers, setup men) are deployed in close games during the late innings. Middle relievers handle lower-leverage situations. Long relievers enter when the starter exits early. The simulation encodes these usage patterns as probabilistic rules, then draws the reliever's performance from his individual outcome distribution, adjusted for platoon matchup and recent workload.
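Probabilistic usage rules of this kind might be encoded as role weights that depend on game state. The roles, thresholds, and weights below are invented for illustration, and `lead` is assumed to be the pitching team's run margin.

```python
import random

def reliever_role_weights(inning, lead, starter_innings):
    """Hypothetical weights over bullpen roles for the current game state."""
    if starter_innings < 4:
        # Starter left early: a long reliever is the likely choice.
        return {"long": 0.7, "middle": 0.3, "high_leverage": 0.0}
    if inning >= 8 and 0 <= lead <= 3:
        # Close and late: closers and setup men.
        return {"long": 0.0, "middle": 0.15, "high_leverage": 0.85}
    return {"long": 0.1, "middle": 0.7, "high_leverage": 0.2}

def pick_role(rng, inning, lead, starter_innings):
    """Draw one role according to the state-dependent weights."""
    weights = reliever_role_weights(inning, lead, starter_innings)
    roles = list(weights)
    return rng.choices(roles, weights=[weights[r] for r in roles])[0]

rng = random.Random(3)
print(pick_role(rng, inning=9, lead=1, starter_innings=6.0))
```

Once a role is drawn, a fuller model would pick a specific pitcher within that role and adjust for platoon matchup and recent workload.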
Win Probability Distributions
After simulating a game thousands of times, the model produces a distribution of outcomes, not a single number. The win probability is simply the fraction of simulations won by each team. But the distribution contains far more information than that headline number.
Consider two games, both projected at 55-45 in favor of Team A. In Game 1, the distribution is tight: most simulated outcomes cluster near a 4-3 or 5-4 final score. In Game 2, the distribution is wide: some simulations end 1-0, others end 12-8. The win probability is the same, but the run distributions are radically different. This distinction matters for understanding confidence in the projection and for modeling total run outcomes.
What Simulation Outputs Capture
| Output | What It Tells You | Derivation |
|---|---|---|
| Win Probability | Likelihood of each team winning | Fraction of simulations won |
| Run Distribution (Home) | Probability of scoring 0, 1, 2, ... N runs | Histogram of runs scored by the home team across simulations |
| Run Distribution (Away) | Same for away team | Histogram of runs scored by the away team across simulations |
| Total Runs Distribution | Probability of combined total | Sum of both teams' runs per simulation |
| Run Line Coverage | How often each team covers a spread | Margin of victory distribution |
| Starter Innings Distribution | How deep starters go | Inning of starter removal in each simulation (pitch count and removal triggers) |
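Several of the outputs in the table can be derived directly from a list of simulated final scores. The sketch below assumes results are stored as `(away_runs, home_runs)` tuples; the `summarize` helper and the five-game `sample` are hypothetical and only meant to show the bookkeeping.

```python
from collections import Counter

def distribution(values):
    """Map each observed value to its relative frequency."""
    n = len(values)
    return {v: c / n for v, c in sorted(Counter(values).items())}

def summarize(results):
    """results: list of (away_runs, home_runs), one tuple per simulated game."""
    n = len(results)
    return {
        "home_win_prob": sum(h > a for a, h in results) / n,
        "home_run_dist": distribution([h for _, h in results]),
        "away_run_dist": distribution([a for a, _ in results]),
        "total_run_dist": distribution([a + h for a, h in results]),
        # Run line coverage for a home -1.5 spread: home must win by 2 or more.
        "home_covers_minus_1_5": sum(h - a >= 2 for a, h in results) / n,
    }

sample = [(3, 4), (5, 2), (1, 2), (0, 6), (4, 5)]  # tiny made-up sample
summary = summarize(sample)
print(summary["home_win_prob"], summary["total_run_dist"])
```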
Run Distribution Curves
One of the most valuable outputs of a simulation model is the run distribution for each team. Unlike a simple "Team A is expected to score 4.3 runs," the full distribution shows the probability of each possible run total. This allows for much richer analysis.
Baseball run distributions are not normally distributed. They are right-skewed: there is a hard floor at zero (a team cannot score fewer than zero runs) but no hard ceiling. A team expected to score 4 runs might score zero in 8% of simulations, 1 run in 12%, 2 runs in 15%, 3 runs in 17%, 4 runs in 16%, 5 runs in 12%, and then a long tail stretching out to 10 or more runs in rare cases. The shape of this distribution depends on the lineup's variance characteristics. A lineup with many high-power, high-strikeout hitters will produce a wider distribution (more shutouts and more blowouts) than a contact-oriented lineup, even if both have the same expected total.
The run distribution is also asymmetric between teams in the same game. The home team's run distribution is shaped differently because they bat last and may not complete the bottom of the ninth if ahead. In extra innings, both distributions extend into additional frames with modified rules (under current MLB regular-season rules, each extra inning starts with a runner on second), which the simulation must model explicitly.
Starter Game Length Prediction
How long the starting pitcher lasts determines when the bullpen takes over, which in turn determines the composition and quality of the pitching the opposing lineup faces in the later innings. Predicting starter game length is therefore a critical input to the simulation.
Models predict game length using the starter's historical pitch efficiency (pitches per batter faced), his typical workload capacity (pitch count at which his performance degrades), and the opposing lineup's propensity to drive up pitch counts (foul ball rate, walk rate, pitches per plate appearance). A starter who averages 3.8 pitches per batter against a patient lineup will typically reach the 90-pitch threshold after about 24 batters, in the fifth or sixth inning. The same pitcher against an aggressive lineup that averages 3.3 pitches per PA would need about 27 batters to get there and might pitch into the seventh.
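The arithmetic behind those inning estimates can be checked quickly. The sketch assumes roughly 4.3 batters faced per inning, which is a typical league-wide figure supplied here as an assumption, not a value from the model; `innings_to_threshold` is a hypothetical helper.

```python
BATTERS_PER_INNING = 4.3  # assumed league-typical value, not taken from the model

def innings_to_threshold(pitches_per_batter, threshold=90):
    """Approximate innings pitched before the pitch-count threshold is reached."""
    batters = threshold / pitches_per_batter
    return batters / BATTERS_PER_INNING

patient = innings_to_threshold(3.8)     # patient lineup
aggressive = innings_to_threshold(3.3)  # aggressive lineup
print(f"patient: {patient:.1f} innings, aggressive: {aggressive:.1f} innings")
```

Under that assumption the patient-lineup case crosses 90 pitches during the sixth inning and the aggressive-lineup case during the seventh, consistent with the ranges given above.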
The simulation uses a probabilistic model for starter removal rather than a fixed pitch count threshold. The probability of removal increases continuously as the pitch count rises, with additional triggers for poor performance (consecutive walks, multiple hard-hit balls in an inning). This approach captures the realistic variance in starter game lengths, where the same pitcher in the same matchup might throw 95 pitches one day and 75 the next, depending on how the game unfolds.
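A removal probability that rises continuously with pitch count is often written as a logistic curve. The sketch below is one such form, checked after every batter; the `midpoint`, `scale`, and `trouble_bump` parameters are hypothetical, and pitches per batter is held fixed at 3.8 for simplicity.

```python
import math
import random

def removal_probability(pitch_count, recent_baserunners=0,
                        midpoint=100.0, scale=8.0, trouble_bump=0.15):
    """Per-batter chance the starter is pulled (hypothetical logistic form)."""
    base = 1 / (1 + math.exp(-(pitch_count - midpoint) / scale))
    # Performance trigger: each recent baserunner raises the chance of a hook.
    return min(1.0, base + trouble_bump * recent_baserunners)

def simulate_exit_pitch_count(rng, pitches_per_batter=3.8):
    """Advance batter by batter until a removal draw succeeds."""
    pitches = 0.0
    while True:
        pitches += pitches_per_batter
        if rng.random() < removal_probability(pitches):
            return round(pitches)

rng = random.Random(11)
exits = [simulate_exit_pitch_count(rng) for _ in range(1000)]
print(min(exits), sum(exits) / len(exits), max(exits))
```

Even with identical inputs, the simulated exit points spread over a range of pitch counts, which is the variance a fixed threshold would miss.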
Calibration: Matching Simulated Outcomes to Observed Base Rates
A simulation model is only as good as its calibration. If the model says Team A has a 60% win probability, then across all games where the model assigns a 60% probability, the favored team should win approximately 60% of the time. If Team A actually wins 55% of such games, the model is overconfident. If they win 65%, the model is underconfident.
Calibration requires checking the model against historical outcomes across multiple dimensions:
- Win probability calibration: Do 60% favorites win 60% of the time?
- Run total calibration: Does the distribution of simulated run totals match observed scoring patterns?
- Margin of victory calibration: Are simulated margins realistic (not too clustered or too spread)?
- Base rate calibration: Do individual component rates (strikeout frequency, walk frequency, home run frequency) match observed levels?
Miscalibration typically stems from one of two sources. First, the component inputs may be biased. If the plate appearance model systematically overestimates home run rates, the simulation will produce inflated run totals and skewed win probabilities. Second, the simulation may fail to capture important correlations, such as the tendency for scoring to cluster in certain innings or the interaction between pitcher fatigue and bullpen quality. Regular backtesting against historical data identifies and corrects these issues.
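For each probability bucket with expected win rate P (e.g., 55-60%) containing N games:
Observed Win Rate should approximate P
Deviation = |Observed - Expected| / sqrt(P*(1-P)/N)
Here Expected = P, and Deviation is measured in standard errors of the observed win rate.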
Large deviations indicate miscalibration. The model should be re-examined when observed rates consistently diverge from expected rates by more than two standard deviations.
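The bucket check can be sketched as a short function. The bucket counts used below are made up for illustration, and `calibration_deviation` is a hypothetical name; the formula itself is the one given above.

```python
import math

def calibration_deviation(expected_p, wins, games):
    """Absolute gap between observed and expected win rate, in standard errors."""
    observed = wins / games
    standard_error = math.sqrt(expected_p * (1 - expected_p) / games)
    return abs(observed - expected_p) / standard_error

# Hypothetical buckets: (expected win rate, favorite wins, games in bucket).
buckets = [(0.575, 214, 400), (0.60, 550, 1000)]
for p, wins, games in buckets:
    dev = calibration_deviation(p, wins, games)
    flag = "re-examine" if dev > 2 else "ok"
    print(f"P={p:.3f} observed={wins / games:.3f} deviation={dev:.2f} -> {flag}")
```

A single bucket exceeding two standard errors can happen by chance; the signal to act on is a consistent pattern across buckets or over time.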
How Many Simulations Are Enough?
Monte Carlo estimates converge toward the true probability as the number of simulations increases. The practical question is: how many simulations do you need before the estimate is stable enough to be useful?
The standard error of a Monte Carlo win probability estimate is approximately sqrt(p*(1-p)/N), where p is the true probability and N is the number of simulations. For a 55% win probability with 10,000 simulations, the standard error is about 0.5 percentage points. About 95% of repeated runs will therefore land within roughly one percentage point (two standard errors) of the true value, which is precise enough for most purposes.
For run distributions and more granular outputs, more simulations are needed. Estimating the probability of a specific score (say, exactly 3-2 in favor of the home team) requires more samples because the probability itself is small, so a given absolute error is a much larger share of the estimate. With 10,000 simulations, a 4% probability has a standard error of about 0.2 percentage points, or roughly 5% of the estimate. With 50,000 simulations, it drops to about 0.09 points, or roughly 2% of the estimate.
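These figures follow directly from the sqrt(p*(1-p)/N) formula, as the short check below shows. The `standard_error` helper is a hypothetical name; the relative error is simply the standard error divided by p.

```python
import math

def standard_error(p, n_sims):
    """Standard error of a Monte Carlo probability estimate."""
    return math.sqrt(p * (1 - p) / n_sims)

for p, n in [(0.55, 10_000), (0.04, 10_000), (0.04, 50_000)]:
    se = standard_error(p, n)
    print(f"p={p:.2f} N={n:>6}: SE={100 * se:.2f} pts, relative={100 * se / p:.1f}%")
```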
Convergence by Simulation Count
| Simulations | Win Prob Std Error | Run Total Std Error | Typical Use Case |
|---|---|---|---|
| 1,000 | ~1.6% | ~0.07 runs | Quick screening, rough estimates |
| 5,000 | ~0.7% | ~0.03 runs | Daily model runs |
| 10,000 | ~0.5% | ~0.02 runs | Standard production runs |
| 50,000 | ~0.2% | ~0.01 runs | High-precision analysis, research |
| 100,000 | ~0.16% | ~0.007 runs | Publication-quality estimates |
Win probability standard errors are in percentage points and assume a win probability near 50%. Extreme probabilities (e.g., 80-20) have smaller absolute standard errors at the same simulation count.
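Inverting the standard error formula gives the simulation count needed for a target precision: N = p*(1-p) / SE^2. The sketch below applies it to a 50-50 and an 80-20 game; `simulations_needed` is a hypothetical helper and the 0.5-point target is just an example.

```python
import math

def simulations_needed(p, target_se):
    """Smallest simulation count whose standard error is at or below target_se."""
    return math.ceil(p * (1 - p) / target_se ** 2)

even_game = simulations_needed(0.5, 0.005)      # roughly 10,000
lopsided_game = simulations_needed(0.8, 0.005)  # roughly 6,400
print(even_game, lopsided_game)
```

The lopsided matchup needs noticeably fewer simulations for the same absolute precision because p*(1-p) is largest at p = 0.5.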
Most production models run 10,000 to 20,000 simulations per game. This provides sufficient precision for win probabilities (a standard error of about half a percentage point or less) while remaining computationally feasible for a full daily slate of 15 games. Research applications or model validation exercises might run 100,000 or more to minimize sampling noise when comparing model variants or testing calibration.
The Relationship Between Components
A game simulation model is a system of interconnected components, not a monolithic algorithm. The plate appearance outcome model provides the atomic-level probabilities for each batter-pitcher confrontation. The run expectancy framework translates those outcomes into inning-level scoring through base-out state transitions. The bullpen usage model determines which pitchers appear in which game states. The base-running model resolves runner advancement on contact. Each component feeds into the simulation loop, and errors in any one component propagate through the entire system.
This modular structure is both a strength and a vulnerability. It is a strength because each component can be tested, validated, and improved independently. If the plate appearance model is miscalibrating home run rates, that can be fixed without touching the bullpen model. It is a vulnerability because interactions between components can produce emergent errors that are difficult to diagnose. A small bias in strikeout rates might interact with the bullpen model to produce a larger bias in late-inning scoring, which in turn distorts the win probability distribution. Systematic backtesting across all components and their interactions is essential.
The simulation approach, despite its computational cost, remains the gold standard for baseball prediction because it respects the game's sequential, state-dependent structure. Baseball is not a continuous flow of play with normally distributed outcomes. It is a series of discrete, high-variance events, chained together by precise rules of state transition, played within an asymmetric framework where the home team bats last. Monte Carlo simulation is the natural mathematical language for this kind of system, and done well, it produces the most accurate and informative forecasts available.