GM's OFFICE: Re-validating our Matchup Scores in this pitching-friendly environment

One of our most-trafficked (and most commented-upon) in-season tools is our Starting Pitcher Matchups tool. Each day, you can find our starting pitcher ratings under Tools/Today's SP on the top menu. The day's ratings also serve as the jumping-off point for written analysis in our Daily Matchups column (which runs seven days a week, and is a free read today), and on our Daily Dashboard page (which also features team lineups as they are released each day).

That's a lot of space devoted to one set of metrics, which inevitably leads you, our astute readers, to periodically ask the key question: "Hey, does this Matchup Score really work?"

Truth be told, whether or not you're asking that question, we're keeping an eye on it over the long term. But it's actually been quite a while since we've shown our work there, so this week seemed like a good time to do some public back-testing of this prominent tool in our analytic arsenal.


Where we started

For a number of years, we based our single-game SP analysis on a simple metric that Ron Shandler and I concocted a long time ago: we leaned on our Pure Quality Starts metric as the basis, and simply compared a pitcher's own average PQS score to the average PQS score "allowed" by the opposing offense. We had some basic splits in there (home/road, RH/LH pitcher), but that "my PQS score vs. what today's opponent allows" comparison was the basis of the whole thing. It was fairly simple, but met the need for a number of years.

In 2017, though, our crack researcher Arik Florimonte decided to tackle this problem, and came up with a much better solution. We fully implemented that new formula across the site for Opening Day 2018. I did some validation of those scores a couple of times in 2018, first a preliminary report and then a more comprehensive back-test of a half-season's worth of data.

Where we are now

In the 2019-20 offseason, Arik did a study on home vs. road performance of starting pitchers, and out of that work made some changes to the Matchup Scores formulas to better account for home vs. road performance. Funny thing about that research: we had the resulting formula revisions in place for Opening Day 2020, but after the pandemic delayed Opening Day, we flat-out forgot to publicly announce those changes. And given how screwy the 2020 season was, we decided not to use that data for back-testing those changes.

Not only did we revise the formula, but the MLB environment since 2018 has been, shall we say, dynamic. Changes to the baseball, changes to starting pitcher usage patterns, to strikeout and velocity rates, reduced SP workloads and the rise of openers... there are a lot of variables in play.

All of which is to say, we are long overdue to revisit how the scoring system is performing.


So, that's what I did this week: pulled all of the Matchup Scores, as they were published the day of each game, for all April/May 2021 games. I cross-referenced that data with the PQS logs, so I had the actual results from each start. In total, the data set is a little more than 1,500 pitcher-starts, where the pitcher scored before the game matched the pitcher who actually made the start. Fortunately for me, I still had the spreadsheet I used for that mid-season 2018 analysis, so I was able to easily produce side-by-side comparisons of how the test went in 2018, vs. today.
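For readers who want to try a similar back-test, the matching step described above can be sketched roughly as follows. This is a minimal illustration, not the actual spreadsheet workflow: the record layouts, field names, team codes, and pitcher names are all hypothetical.

```python
from typing import NamedTuple


class PublishedScore(NamedTuple):
    """A pre-game rating as published (hypothetical layout)."""
    game_date: str
    team: str
    pitcher: str
    matchup_score: float


class StartLog(NamedTuple):
    """The actual result of a start (hypothetical layout)."""
    game_date: str
    team: str
    pitcher: str
    pqs: int
    dk_points: float


def match_starts(scores, logs):
    """Pair each published score with the actual start by the same pitcher.

    Scores whose listed pitcher did not make the start are dropped.
    """
    actual = {(log.game_date, log.team, log.pitcher): log for log in logs}
    matched = []
    for score in scores:
        log = actual.get((score.game_date, score.team, score.pitcher))
        if log is not None:
            matched.append((score, log))
    return matched


# Made-up sample rows: the second listed starter was scratched.
scores = [
    PublishedScore("2021-04-01", "AAA", "Pitcher A", 1.2),
    PublishedScore("2021-04-01", "BBB", "Pitcher B", -0.4),
]
logs = [
    StartLog("2021-04-01", "AAA", "Pitcher A", 4, 22.5),
    StartLog("2021-04-01", "BBB", "Pitcher C", 2, 9.0),
]
matched = match_starts(scores, logs)
print(len(matched), "matched start(s)")
```

Keying on date, team, and pitcher together means a scratched starter simply fails to match, and a doubleheader does not cause two starts to collide.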


Several people had observed that there seem to be more strong start ratings this year than there have been in the past. Without taking this deep dive, that seemed entirely plausible to me, just given the relative dominance of pitchers in MLB circa 2021. But that was the first thing I wanted to check. Sure enough:

Matchup score range    2021 Apr-May    2018 Apr-June
===================    ============    =============
Rating 2.0+                 8%              5%
Rating 1.5 to 1.99          7%              5%
Rating 1.0 to 1.49         12%             10%
Rating 0.5 to 0.99         17%             15%
Rating 0.0 to 0.49         19%             19%
Rating -0.50 to -0.01      16%             21%
Rating -1 to -0.51         12%             16%
Rating -1.5 to -1.01        6%              7%
Rating -2.0 to -1.51        2%              2%
Rating -2.0-                1%              1%

That's still more or less a bell-curve distribution, but there is definitely an upward shift in the 2021 data. Rolling that into the wider buckets we use to sort starts gives a clearer perspective:

Matchup score range            2021 Apr-May    2018 Apr-June
===================            ============    =============
Strong starts (0.5+)                44%             35% 
Judgment calls (-0.5 to 0.5)        35%             40%
Weak starts (-0.5-)                 21%             26%

That's roughly a 25% relative gain in Strong Start ratings (44% of starts vs. 35%, a nine-point jump), drawn about equally from the Judgment Call and Weak Start tiers, which each lost about five points. As stated above, this made intuitive sense to me given the unbalanced state of the pitcher/batter dynamic right now, but it's still fairly striking to see it spelled out this way.
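To make the arithmetic behind that shift explicit, here is a small sketch that separates the percentage-point change from the relative change for each tier. The inputs are the rounded percentages from the table above, so the outputs are approximate; the function name is just an illustrative choice.

```python
def share_change(new_pct, old_pct):
    """Return (percentage-point change, relative change in percent)."""
    points = new_pct - old_pct
    relative = points / old_pct * 100
    return points, relative


# (2021 share, 2018 share), rounded percentages from the table above.
tiers = {
    "Strong starts": (44, 35),
    "Judgment calls": (35, 40),
    "Weak starts": (21, 26),
}

for name, (pct_2021, pct_2018) in tiers.items():
    points, relative = share_change(pct_2021, pct_2018)
    print(f"{name}: {points:+d} points, {relative:+.1f}% relative")
```

The two five-point losses sum to ten rather than nine only because the published shares are rounded to whole percentages.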

Before we get into what to do with that information, let's check in on how the scores are doing at predicting outcomes.

Comparing Matchup Scores to Matchup Outcomes

As I did in those 2018 exercises, I compared the pre-start Matchup Score to the post-start PQS scores and DraftKings SP point totals.


Matchup score range      Count    AvgPQS    AvgDK
===================      =====    ======    =====
Rating 2.0+                121      3.31    24.60
Rating 1.5 to 1.99         103      2.97    20.98
Rating 1.0 to 1.49         193      2.61    16.65
Rating 0.50 to 0.99        264      2.31    14.82
Rating 0.0 to 0.49         292      2.37    14.33
Rating -0.50 to -0.01      247      2.01    12.69
Rating -1 to -0.51         189      2.12    10.97
Rating -1.5 to -1.01        93      1.84     9.79
Rating -2.0 to -1.51        34      1.38     7.48
Rating -2.0-                12      1.75     9.79
Total                     1548
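A table like the one above can be produced by bucketing each start by its pre-game rating and averaging the outcomes within each bucket. The following is a minimal sketch of that aggregation, assuming each start is a simple (matchup score, PQS, DraftKings points) tuple; the sample rows are made up and are not from the real data set.

```python
from collections import defaultdict

# (inclusive lower bound, label) for each rating range, highest first.
BUCKETS = [
    (2.0, "2.0+"),
    (1.5, "1.5 to 1.99"),
    (1.0, "1.0 to 1.49"),
    (0.5, "0.50 to 0.99"),
    (0.0, "0.0 to 0.49"),
    (-0.5, "-0.50 to -0.01"),
    (-1.0, "-1 to -0.51"),
    (-1.5, "-1.5 to -1.01"),
    (-2.0, "-2.0 to -1.51"),
]


def bucket_label(score):
    """Return the label of the first range whose lower bound the score meets."""
    for floor, label in BUCKETS:
        if score >= floor:
            return label
    return "below -2.0"


def summarize(starts):
    """Map each range label to (count, average PQS, average DK points)."""
    groups = defaultdict(list)
    for score, pqs, dk in starts:
        groups[bucket_label(score)].append((pqs, dk))
    summary = {}
    for label, rows in groups.items():
        n = len(rows)
        avg_pqs = sum(pqs for pqs, _ in rows) / n
        avg_dk = sum(dk for _, dk in rows) / n
        summary[label] = (n, avg_pqs, avg_dk)
    return summary


# Made-up sample starts: (matchup score, PQS, DK points).
sample = [(2.1, 5, 30.0), (2.4, 3, 20.0), (-0.2, 2, 12.0)]
for label, (n, avg_pqs, avg_dk) in summarize(sample).items():
    print(f"{label}: n={n}, AvgPQS={avg_pqs:.2f}, AvgDK={avg_dk:.2f}")
```

Treating every lower bound as inclusive matches the range labels in the table, where a rating of exactly -0.50 falls in the "-0.50 to -0.01" row.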

At this point, we have to acknowledge that we're testing multiple variables here. The chart above is a very strong result, even a bit better than it was in 2018 (more on that in a minute), but we can't say for sure whether the gains are a result of the 2020 formula changes, or just a case where the system is working a little better in pitcher-friendly 2021 than it did in somewhat less-pitcher-friendly 2018.

But overall, the news is really good.

Making this actionable

Let's step back and see how these numbers compare to the 2018 study, in search of some better decision-making criteria:

Matchup score range      AvgPQS    PQS vs. 2018    AvgDK    DK vs. 2018
===================      ======    ============    =====    ===========
Rating 2.0+                3.31           -0.03    24.60          +1.32
Rating 1.5 to 1.99         2.97           -0.06    20.98          -0.22
Rating 1.0 to 1.49         2.61            0.00    16.65          -1.25
Rating 0.50 to 0.99        2.31           -0.15    14.82          -1.08
Rating 0.0 to 0.49         2.37           -0.18    14.33          +0.55
Rating -0.50 to -0.01      2.01           -0.15    12.69          -0.88
Rating -1 to -0.51         2.12           +0.01    10.97          -0.19
Rating -1.5 to -1.01       1.84           +0.10     9.79          +0.16
Rating -2.0 to -1.51       1.38           -0.58     7.48          -4.20
Rating -2.0-               1.75           +0.06     9.79          +1.82
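If you want to see the implied 2018 baselines rather than just the movement, they can be backed out of the table above. This sketch assumes each delta column is the 2021 value minus the 2018 value, which is my reading of the column headers rather than something the table states outright. The results are only as precise as the two-decimal figures shown.

```python
# (rating range, 2021 AvgPQS, PQS delta, 2021 AvgDK, DK delta),
# copied from the table above.
ROWS = [
    ("2.0+", 3.31, -0.03, 24.60, 1.32),
    ("1.5 to 1.99", 2.97, -0.06, 20.98, -0.22),
    ("1.0 to 1.49", 2.61, 0.00, 16.65, -1.25),
    ("0.50 to 0.99", 2.31, -0.15, 14.82, -1.08),
    ("0.0 to 0.49", 2.37, -0.18, 14.33, 0.55),
    ("-0.50 to -0.01", 2.01, -0.15, 12.69, -0.88),
    ("-1 to -0.51", 2.12, 0.01, 10.97, -0.19),
    ("-1.5 to -1.01", 1.84, 0.10, 9.79, 0.16),
    ("-2.0 to -1.51", 1.38, -0.58, 7.48, -4.20),
    ("-2.0-", 1.75, 0.06, 9.79, 1.82),
]


def implied_2018(avg_2021, delta):
    """Back out the 2018 average, ASSUMING delta = 2021 value - 2018 value."""
    return round(avg_2021 - delta, 2)


for label, pqs, d_pqs, dk, d_dk in ROWS:
    pqs_2018 = implied_2018(pqs, d_pqs)
    dk_2018 = implied_2018(dk, d_dk)
    print(f"{label:>15}  implied 2018 AvgPQS {pqs_2018:.2f}  AvgDK {dk_2018:.2f}")
```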

Capturing the 2018-to-2021 movement in this chart starts to yield some actionable takeaways:

1. With more strong starts available, it makes sense to be pickier about chasing them. Specifically, starts rated 1.0 and higher are doing a (somewhat) better job of delivering good outcomes than the 0.50-0.99 tier. That was always the case, but those truly elite starts are holding steady relative to the 2018 study, whereas the lower-positive ratings are showing more erosion.

That lower tier has always been a softer endorsement from us, as we characterize it every day in the Daily Matchups column:

(top tier starts rated >1.0; starts in 0.5-0.9 range also favorable)

Since we're flush with 1.0+ ratings in 2021, it makes sense to use that line as more of a hard cutoff for what we consider a "Strong Start", especially in your shallower league formats.

2. As we get into the real depth of the season and your individual team standings and circumstances evolve, the middle tier "Judgment Calls" section remains just that: a place for you to apply your individual judgment. That might relate to your individual team needs and context, or the additional factors beyond the raw Matchup Score that we try to illuminate with the written commentary of the Daily Matchups column. Within the range of -0.5 to +0.99, the aggregate differences in outcomes are noticeable but certainly not massive. So, if you're debating lineup decisions between a couple of guys in that range, it makes sense to take a deeper dive into recent trends for the pitcher and their opponent rather than just blindly trusting the difference between, say, a +0.3 rating and a -0.2.

3. A lot of the improvement in this chart, compared to 2018, comes at the bottom end. The lower tail of the distribution curve wasn't as clean as we would like in the old version, but it's much more normal here. We can say this more strongly than ever: if you're defying the scores and deploying a starter who has a -1.0 or worse Matchup Score, you had better have a very good reason for defying that rating.

If you need a topical example of the perils of defying that score: Carlos Martínez rated a -1.0 for his epic 0.2 IP, 10 ER disaster at LA the other night. That line obviously isn't an every-night occurrence, but that makes a good litmus test for starting someone with such a rating: am I ok risking the Carlos Martínez outcome here?
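The three takeaways above can be condensed into a rough rule of thumb. The sketch below is one way to encode them, not an official part of the Matchup Scores tool: the function name, the tier labels, and the `hard_cutoff` flag (for shallower leagues that want to treat 1.0 as the Strong Start line) are all illustrative choices.

```python
def classify_start(score, hard_cutoff=False):
    """Translate a Matchup Score into a rough lineup recommendation.

    hard_cutoff=True treats 1.0 as the Strong Start line, as suggested
    for shallower leagues; otherwise 0.5-0.99 stays "favorable".
    """
    if score >= 1.0:
        return "strong"
    if score >= 0.5:
        return "judgment call" if hard_cutoff else "favorable"
    if score >= -0.5:
        return "judgment call"
    if score > -1.0:
        return "weak"
    # -1.0 or worse: start only with a very good reason.
    return "avoid"


for rating in (2.1, 0.7, 0.3, -0.2, -0.8, -1.0):
    print(f"{rating:+.1f}: {classify_start(rating)}")
```

Anything labeled "judgment call" is where recent trends, team needs, and the written commentary should drive the decision rather than the raw number.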


Next steps

We'll stop here for today, but there is a lot more we can do in this area:

• I'd like to extend this re-validation into the component ratings, especially the Wins rating, as that can have a lot of utility in the second half of the season for players who are chasing that category.

• And there's another question that has come up repeatedly, which is whether we can separately validate how "lesser" pitchers perform when they pick up an unexpected strong start rating, as compared to the more regular residents of the top-tier ratings. That should be doable; it will just take a little more Excel magic than I had in me this week. But it's definitely a topic worth exploring.

If there are other types of validations you'd like to see, leave them in the comments and I'll see what I can do for next time...



  For more information about the terms used in this article, see our Glossary Primer.