Market vs. Model: Where MLB Predictions Diverge
Every prediction model produces a probability. Every market produces a price. When these two numbers agree, there is nothing interesting to say. When they disagree, something worth understanding is happening. Either the model is capturing signal the market is ignoring, or the market is incorporating information the model cannot access, or some combination of both. The study of where and why these divergences occur is one of the most instructive exercises in applied prediction science, because it reveals the structural strengths and blind spots of both approaches simultaneously.
The relationship between model outputs and market-implied probabilities is not adversarial. It is complementary. Markets aggregate diverse information from thousands of participants, including insiders, statistical modelers, casual observers, and institutional money. Models apply systematic, quantitative frameworks to measurable variables. Each approach has domains where it excels and domains where it is weak. Understanding these domains is essential for anyone building or evaluating a prediction system.
Converting Market Prices to Implied Probabilities
Before comparing model output to market expectations, the market price must be translated into a probability. This requires removing the overround, also called vigorish or vig, which is the margin built into prices that ensures the market maker retains an edge regardless of the outcome. In a two-way market with prices of -130 and +110, the raw implied probabilities are 56.5% and 47.6%, which sum to 104.1%. The excess of 4.1 percentage points is the overround.
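As a concrete illustration, here is a minimal Python sketch that converts the American odds above into raw implied probabilities and measures the overround; the function name american_to_implied is illustrative rather than a reference to any particular library.

```python
# Minimal sketch: American odds -> raw implied probabilities -> overround,
# using the -130 / +110 example from the text.

def american_to_implied(odds: int) -> float:
    """Convert American odds to a raw (vig-inclusive) implied probability."""
    if odds < 0:
        return -odds / (-odds + 100)   # favorite, e.g. -130 -> 130 / 230
    return 100 / (odds + 100)          # underdog, e.g. +110 -> 100 / 210

home_raw = american_to_implied(-130)   # ~0.565
away_raw = american_to_implied(+110)   # ~0.476
overround = home_raw + away_raw - 1.0  # ~0.041

print(f"{home_raw:.3f} + {away_raw:.3f} = {home_raw + away_raw:.3f} "
      f"(overround {overround:.1%})")
```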
Several methods exist for removing the overround, and the choice of method matters for the comparison. The simplest is multiplicative normalization: divide each raw implied probability by their total so the probabilities sum to 100%, which spreads the overround in proportion to each side's probability. A more sophisticated alternative is the power method, which raises each raw probability to a common exponent chosen so the adjusted probabilities sum to 100%; this removes relatively more of the overround from the underdog than from the favorite, consistent with the favorite-longshot bias. The Shin method models the overround as partly the product of insider trading and likewise assigns a larger share of the vig to the longshot side, where the information asymmetry is assumed to be greatest.
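The sketch below illustrates the first two methods under stated assumptions: the function names are hypothetical, and the power method is shown in one common formulation, solving for the exponent k that makes the powered probabilities sum to one.

```python
# Hedged sketch of two overround-removal methods. Uses scipy only to solve
# for the power-method exponent; the helper names are illustrative.

from scipy.optimize import brentq

def multiplicative(raw_probs):
    """Divide each raw probability by the total so the set sums to 1."""
    total = sum(raw_probs)
    return [p / total for p in raw_probs]

def power_method(raw_probs):
    """Find an exponent k such that the powered probabilities sum to 1, then apply it."""
    k = brentq(lambda k: sum(p ** k for p in raw_probs) - 1.0, 1.0, 10.0)
    return [p ** k for p in raw_probs]

raw = [0.565, 0.476]        # the -130 / +110 example above
print(multiplicative(raw))  # ~[0.543, 0.457]
print(power_method(raw))    # ~[0.546, 0.455] -- a slightly different split of the vig
```

On a two-way example the methods differ by only a few tenths of a percentage point, consistent with the magnitudes discussed in the next paragraph.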
The differences between these methods are small for individual games, typically within 0.5 to 1.0 percentage points, but they accumulate across a season of analysis. A model that uses multiplicative normalization will systematically interpret market-implied probabilities slightly differently than one using the Shin method, which can shift the apparent direction and magnitude of model-market divergence. Consistency in methodology is more important than the specific method chosen, because the comparative analysis requires a stable baseline.
Systematic Divergence Patterns
When a model and a market disagree on a game's probability, the divergence may be idiosyncratic, driven by specific circumstances of that particular game, or it may be systematic, reflecting a structural difference in how the model and market process certain types of information. Systematic divergences are far more informative because they reveal repeatable patterns rather than one-off noise.
Several categories of systematic divergence appear consistently in the baseball domain.
Bullpen-Driven Divergence
Models that incorporate detailed bullpen fatigue and availability tracking often diverge from market prices in situations involving depleted bullpens. A team whose top three relievers are all unavailable due to consecutive-day usage might look fine in the market, which tends to price teams based on season-level bullpen quality. A model that tracks daily availability and projects which specific relievers will pitch will assign a lower win probability, reflecting the degraded bullpen configuration. This divergence is most pronounced during stretches without off-days, when fatigue accumulates and the gap between full-strength and depleted bullpen quality widens.
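As a rough illustration of that kind of daily-availability tracking, the sketch below treats a reliever as unavailable after pitching on back-to-back days; the two-consecutive-day rule, the names, and the dates are illustrative assumptions, not a real team's usage policy.

```python
# Hypothetical availability tracker: a reliever who pitched on each of the
# two previous days is treated as unavailable for today's game.

from datetime import date, timedelta

def available_relievers(appearances: dict, game_day: date) -> set:
    """Return relievers who did not pitch on both of the two previous days."""
    available = set()
    for name, days_pitched in appearances.items():
        back_to_back = (
            (game_day - timedelta(days=1)) in days_pitched
            and (game_day - timedelta(days=2)) in days_pitched
        )
        if not back_to_back:
            available.add(name)
    return available

usage = {
    "Closer A": [date(2024, 6, 10), date(2024, 6, 11)],  # back-to-back -> down today
    "Setup B":  [date(2024, 6, 11)],
    "Middle C": [date(2024, 6, 9)],
}
print(available_relievers(usage, date(2024, 6, 12)))  # {'Setup B', 'Middle C'}
```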
Weather-Driven Divergence
Markets adjust for weather in broad strokes, primarily through the total (over/under) market. Extreme wind or temperature shifts totals noticeably. But markets tend to adjust less on the moneyline side, where weather effects are asymmetric depending on team construction. A model with detailed environmental features might recognize that a particular wind pattern at a particular park disproportionately benefits one team's fly-ball-heavy lineup while neutralizing the opponent's power game. This type of team-specific weather interaction is difficult for markets to price precisely because it requires matching park geometry, wind vectors, and lineup construction simultaneously.
Roster-Change-Driven Divergence
When a team's roster changes mid-season, whether through a call-up from the minor leagues, a trade acquisition, or a key injury, there is a period during which the model and the market may disagree about the impact. Models update systematically based on projection systems that assign the new player a specific value. Markets may overreact to the narrative significance of the move (a big-name trade acquisition) or underreact to its statistical impact (a lesser-known prospect who projects to be a significant upgrade over the player he replaces). The divergence during roster-transition periods is measurable and often persists for several days until the market price fully incorporates the player's actual contribution.
Travel-Driven Divergence
Cross-country travel, particularly westbound teams playing day games after arriving late the previous night, creates fatigue effects that appear in performance data but may not be fully priced by markets. A model that tracks travel distance, time zone changes, and game start times relative to the team's body clock can identify games where the traveling team faces a physiological disadvantage that extends beyond what the market reflects. The effect is subtle, perhaps 1 to 2 percentage points of win probability, but it is systematic and directional, which makes it relevant over a full season of games.
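A minimal sketch of such travel features follows; the field names, the 12-hour turnaround threshold, and the per-time-zone weight are assumptions chosen purely for illustration, not calibrated values.

```python
# Illustrative travel-fatigue features: short turnaround since arrival plus
# time-zone disruption. All thresholds and weights are assumed, not fitted.

from dataclasses import dataclass

@dataclass
class TravelContext:
    time_zones_crossed: int        # signed: negative = westbound
    arrival_hour_local: float      # e.g. 2.5 = 2:30 AM local arrival
    first_pitch_hour_local: float  # e.g. 13.1 = 1:05 PM local start

def travel_fatigue_score(ctx: TravelContext) -> float:
    """Crude additive score: penalty for short rest plus a per-zone disruption term."""
    hours_of_rest = ctx.first_pitch_hour_local - ctx.arrival_hour_local
    short_turnaround = max(0.0, 12.0 - hours_of_rest)    # penalty below 12 hours of rest
    zone_disruption = abs(ctx.time_zones_crossed) * 1.5  # assumed weight per zone crossed
    return short_turnaround + zone_disruption

# A team that landed at 2:30 AM local before a 1:05 PM day game, three zones west.
print(travel_fatigue_score(TravelContext(-3, 2.5, 13.1)))  # ~5.9
```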
What Markets Know That Models Miss
It would be a mistake to assume that model-market divergence always favors the model. Markets incorporate categories of information that statistical models typically cannot access, and in these domains, the market is usually more accurate.
Injury information is the clearest example. A player may be listed as "available" on the injury report but be playing through a nagging issue that limits his effectiveness. The market, which aggregates information from beat reporters, clubhouse insiders, and observational accounts, can price in this diminished capacity before a model that relies on publicly available performance data registers the decline. Similarly, when a player is about to be scratched from the lineup because of a pregame injury, the scratch is often reflected in market price movements before the official lineup announcement, because the information leaks to connected market participants.
Clubhouse dynamics, managerial decision-making tendencies, and situational motivation are all information categories that markets can, at least in principle, price but models cannot easily quantify. A team that has just fired its manager might see a short-term motivational boost or a period of confusion. A team that has clinched a playoff spot might rest regulars in late-season games. These factors are legible to human observers and can move market prices, but they are difficult to encode as systematic features in a quantitative model.
What Models Know That Markets Underweight
The symmetric case is equally important. Models have structural advantages in processing certain types of information that markets tend to underweight, either because the information is difficult for humans to aggregate mentally or because it requires computational intensity to track properly.
Fatigue accumulation is a primary example. The cumulative effect of a heavy workload on a starting pitcher, measured through pitch counts, innings logged, and days of rest over rolling windows, is a continuous variable that models can track precisely. Markets tend to respond to discrete events (a pitcher is "on short rest" or "has a high pitch count from last start") but may not fully price the interaction between multiple fatigue indicators across several weeks. A pitcher who has thrown 200 innings by early September with three starts on short rest in the last month may not trigger any individual alarm for market participants but represents a measurable decline in expected performance that a model can quantify.
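The sketch below shows how rolling workload features of this kind might be assembled with pandas, assuming a per-start table with game_date, pitches, innings, and days_rest columns; the 30-day window and the fewer-than-four-days-rest cutoff are illustrative choices.

```python
# Hedged sketch of rolling workload features. Column names and thresholds
# are assumptions; innings are expressed as decimals for simplicity.

import pandas as pd

def workload_features(starts: pd.DataFrame) -> pd.DataFrame:
    """Add 30-day rolling pitch and innings totals and a count of short-rest starts."""
    starts = starts.sort_values("game_date").set_index("game_date")
    starts["pitches_30d"] = starts["pitches"].rolling("30D").sum()
    starts["innings_30d"] = starts["innings"].rolling("30D").sum()
    starts["short_rest_30d"] = (starts["days_rest"] < 4).astype(int).rolling("30D").sum()
    return starts.reset_index()

starts = pd.DataFrame({
    "game_date": pd.to_datetime(["2024-08-01", "2024-08-06", "2024-08-10", "2024-08-15"]),
    "pitches":   [102, 98, 105, 93],
    "innings":   [6.0, 5.7, 7.0, 5.3],
    "days_rest": [5, 4, 3, 4],
})
print(workload_features(starts))
```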
Pitch quality decay is another area where models often hold an advantage. Statcast data reveals that a pitcher's average fastball velocity, spin rate, and extension can decline gradually across a season in ways that precede a decline in ERA or other traditional performance metrics. A model that ingests pitch-level quality data can identify a pitcher whose stuff is degrading before that degradation manifests in his results, creating a window where the model projects a lower performance level than the market price implies.
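As an illustration of how that decay could be quantified, the sketch below fits a linear trend to a pitcher's per-start average fastball velocity; the numbers are hypothetical and the approach is a sketch rather than a description of any production feature.

```python
# Fit a least-squares line through recent per-start velocity averages; the
# slope (mph per start) is a simple measure of gradual decay.

import numpy as np

def velocity_trend(avg_velo_by_start: list[float]) -> float:
    """Slope in mph per start of a linear fit through recent average velocities."""
    x = np.arange(len(avg_velo_by_start))
    slope, _intercept = np.polyfit(x, avg_velo_by_start, 1)
    return slope

recent = [95.8, 95.6, 95.7, 95.1, 94.9, 94.6]          # hypothetical per-start averages
print(f"{velocity_trend(recent):+.2f} mph per start")  # about -0.25
```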
Market Efficiency in Baseball
The efficient market hypothesis, adapted from finance, asks whether market prices fully reflect all available information. In baseball, the evidence suggests that markets are quite efficient but not perfectly so. The closing price, which incorporates the full day's information flow including lineup announcements and late injury news, is generally well-calibrated. Games priced as 60% favorites win approximately 60% of the time. Games priced as 70% favorites win approximately 70% of the time. The calibration curve hugs the identity line closely across the full range of implied probabilities.
However, "well-calibrated in aggregate" does not mean "perfectly efficient in every instance." Systematic biases, though small, have been documented. Markets have historically slightly overvalued teams with strong recent performance (recency bias), slightly undervalued the impact of bullpen fatigue in specific schedule spots, and shown small inefficiencies around roster transition periods. None of these biases are large enough to overcome the overround on their own, but they represent real information that a model can capture and that the market processes imperfectly.
The degree of efficiency also varies by market type. The moneyline market, which attracts the most volume and sharpest participants, is the most efficient. The run-line (spread) market is slightly less efficient, and the totals market shows the most exploitable patterns, partly because total outcomes are influenced by weather, bullpen, and game-state variables that are harder for markets to price than the binary win/loss outcome.
Calibration Analysis: Are Models Honest About Their Accuracy?
A critical evaluation of any prediction model is calibration: when the model says a team has a 60% chance of winning, does that team actually win 60% of the time? Calibration is distinct from discrimination (the model's ability to separate winners from losers) and from raw accuracy (the percentage of correctly predicted outcomes). A model can be well-calibrated but have poor discrimination if it assigns every game a probability close to 50%. A model can have high discrimination but poor calibration if it consistently overestimates or underestimates probabilities.
Evaluating calibration requires a substantial sample size. With individual game probabilities ranging from roughly 35% to 70% in most MLB games, detecting a 2 to 3 percentage point calibration error requires hundreds of games at each probability level. A single season of 2,430 regular-season games provides enough data for broad calibration assessment but not enough for fine-grained analysis at specific probability ranges.
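A basic calibration check bins predicted probabilities and compares each bin's average prediction to its observed win rate, as in the sketch below; the simulated season is synthetic and constructed to be perfectly calibrated, so it demonstrates only the mechanics.

```python
# Bin model probabilities, then compare mean prediction to observed win rate
# within each bin. Inputs are arrays of probabilities and 0/1 outcomes.

import numpy as np

def calibration_table(probs, outcomes, n_bins=10):
    """Return (mean prediction, observed win rate, count) for each occupied bin."""
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((probs[mask].mean(), outcomes[mask].mean(), int(mask.sum())))
    return rows

rng = np.random.default_rng(0)
probs = rng.uniform(0.35, 0.70, size=2430)                 # one season of game probabilities
wins = (rng.uniform(size=probs.size) < probs).astype(int)  # simulated, perfectly calibrated
for pred, obs, n in calibration_table(probs, wins):
    print(f"predicted {pred:.2f}  observed {obs:.2f}  n={n}")
```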
Models built on explicit uncertainty frameworks tend to be better calibrated than models that produce point estimates and then convert to probabilities ad hoc. When a model's probability output is derived from a distribution over possible outcomes, the width of that distribution naturally moderates extreme predictions, pulling them toward 50% in proportion to the model's uncertainty. This self-regulating property makes the model's claimed probabilities more likely to match observed frequencies, which is the definition of good calibration.
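The sketch below illustrates the mechanism: simulate run totals for both teams from an assumed distribution and count how often each side comes out ahead. The Poisson run model and the even split of ties are simplifying assumptions made only to show the idea; real run distributions are not Poisson and real games cannot end tied.

```python
# Distribution-derived win probability via simulation. The Poisson run model
# and the 50/50 tie split are simplifying assumptions for illustration.

import numpy as np

def win_probability(home_exp_runs, away_exp_runs, n_sims=100_000, seed=1):
    rng = np.random.default_rng(seed)
    home = rng.poisson(home_exp_runs, n_sims)
    away = rng.poisson(away_exp_runs, n_sims)
    wins = (home > away).sum() + 0.5 * (home == away).sum()  # split ties evenly
    return wins / n_sims

print(win_probability(4.8, 4.2))  # roughly 0.57-0.58 under these assumptions
```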
Expected Value vs. Directional Accuracy
A subtle but important distinction in model evaluation is the difference between being right on expected value and being right on direction. A model that assigns a 52% probability to one side is claiming a very slight edge. If that team wins, the model was "directionally correct," but the implied edge was so thin that the outcome was essentially a coin flip. If the model assigned 52% and the market implied 50%, the model was claiming a 2 percentage point edge in expected value. Being "right" on expected value means that, across many such games, the model's 52% assessments correspond to win rates that are closer to 52% than to 50%.
Directional accuracy (picking the winning side more than 50% of the time) is a necessary but insufficient condition for model quality. In baseball, where the best teams win only about 60% of their games and the worst teams win about 40%, the range of outcomes is compressed relative to sports like basketball or football. A model that picks winners at a 55% rate is performing well, but distinguishing that performance from chance variation around 50% requires thousands of games.
Expected value accuracy, by contrast, is measurable through calibration analysis and does not require the model to "pick sides." It asks whether the model's probability assignments are reliable estimates of true outcome frequencies. This framing is more useful for prediction science because it evaluates the model's information content rather than its performance in a simplified binary classification task.
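To make the contrast concrete, the sketch below computes directional accuracy and a probability-quality metric, the Brier score, on the same set of predictions; the data are simulated, and the 0.5 cutoff for picking a side is an illustrative convention.

```python
# Directional accuracy vs. Brier score on the same predictions. A Brier score
# of 0.25 corresponds to always predicting a coin flip; lower is better.

import numpy as np

def directional_accuracy(probs, outcomes):
    """Fraction of games in which the side the model favored actually won."""
    picks = (probs >= 0.5).astype(int)
    return (picks == outcomes).mean()

def brier_score(probs, outcomes):
    """Mean squared error of the probability assignments."""
    return ((probs - outcomes) ** 2).mean()

rng = np.random.default_rng(7)
probs = rng.uniform(0.35, 0.70, size=2430)
wins = (rng.uniform(size=probs.size) < probs).astype(int)
print(f"directional accuracy: {directional_accuracy(probs, wins):.3f}")
print(f"Brier score:          {brier_score(probs, wins):.3f}")
```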
What Divergence Reveals About Model Architecture
The patterns of divergence between a model and the market are diagnostic of the model's architecture. A model that systematically diverges from the market on games involving bullpen-heavy projected innings likely has better (or worse) bullpen modeling than the market consensus. A model that diverges primarily on games at Coors Field may be over- or under-adjusting for altitude effects. A model that diverges on day games following night games may be incorporating a travel-fatigue feature that the market does not fully price.
This diagnostic use of divergence analysis is arguably more valuable than the divergence signals themselves. By examining where the model disagrees with the market and then tracking whether those disagreements resolve in the model's favor, modelers can identify which components of their system are adding value and which are introducing noise. Components that produce systematic divergence in the correct direction are genuine information advantages. Components that produce divergence but no predictive improvement are adding complexity without benefit and should be simplified or removed.
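One way to operationalize this diagnostic, sketched below under assumed column names, is to tag each divergent game with the component believed to drive the disagreement and then compare observed win rates against both the model's and the market's average probabilities within each tag.

```python
# Group divergent games by an assumed divergence_tag column and compare the
# observed win rate to the model's and the market's average probabilities.

import pandas as pd

def divergence_report(games: pd.DataFrame) -> pd.DataFrame:
    """Per tag: mean model probability, mean market probability, observed win rate, count."""
    return games.groupby("divergence_tag").agg(
        model_prob=("model_prob", "mean"),
        market_prob=("market_prob", "mean"),
        observed=("won", "mean"),
        games=("won", "size"),
    )

games = pd.DataFrame({
    "divergence_tag": ["bullpen", "bullpen", "travel", "travel", "weather"],
    "model_prob":     [0.58, 0.61, 0.54, 0.52, 0.49],
    "market_prob":    [0.54, 0.55, 0.52, 0.51, 0.53],
    "won":            [1, 1, 0, 1, 0],
})
print(divergence_report(games))
```

In practice this table is only meaningful over hundreds of games per tag; the five-row example exists solely to show the shape of the output.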
The ideal prediction system is one that agrees with the market on everything the market prices correctly and diverges only where it has a genuine informational or analytical edge. Achieving this requires continuous calibration against market prices, honest evaluation of which divergences are profitable and which are noise, and the discipline to trust the market on questions where the model has no structural advantage.