I am getting really tired of having this argument about advanced statistics, and I find that having it point-by-point on Twitter or G-chat or wherever ends up leading down all kinds of tributaries to nowhere and people start adopting this posture that is somehow both defensive and pompous and I’ve just had it.
So I’m going to just write it all out. That way it will be here on the Internet and I can just send out this link whenever I get involved in one of these things.
Like 12 years ago Brad Pitt and fat Jonah Hill figured out that baseball scouts were evaluating players based on all kinds of ridiculous myths like how hot their girlfriends were. It should have been obvious this was nonsense, but because baseball is more stuck in the past than a Mississippi diner, this stuff was allowed to pass as real scouting for like 3,000 years. Fortunately, fat Jonah Hill was so smart he realized the most important thing was to get on base and it didn’t really matter how you got there. He then expanded this kind of thinking to every aspect of baseball, and destroyed a million stupid myths. Brad Pitt applied all of this to the Oakland A’s, pissed off everybody in baseball, and won 500 games in a row with a bad team. This changed baseball forever.
No rational person who understands the subject matter would argue this is not the correct way to analyze baseball. Baseball was not originally designed to be a statistician’s game, of course, but it is the ultimate statistician’s game. Almost everything that happens in a baseball game can be isolated and, therefore, converted into a probability. If you don’t understand this, you can’t speak intelligently about the game, and this has become obvious to most baseball fans.
As a youngster, I even applied an extremely rudimentary level of this kind of analysis to my own baseball performance. My mom would keep track of the data, and then we would do the best we could on the back of an envelope to look past things like ERA and batting average to see a more precise truth. If you are in the business of trying to win baseball games or are trying to have a baseball career, you are a fool to not look at the sport this way.
That all said, I am not in the baseball business, I don’t care who wins baseball games and I don’t personally find these kinds of conversations to be interesting. They make the sport less fun to me, because they make the games themselves feel empty and somehow even more pedantic than they already are. But many other people love all this and that’s fine. They are talking about a sport they like in a smart way.
We good? Everybody OK with this? All right.
What happened next was that people started applying this kind of advanced statistical analysis to other sports. Much of this was logical to the point of being obvious. An example of this is “effective field goal percentage” (eFG%), which makes the simple observation that if you’re shooting 3-pointers you don’t need to make as many as you do if you’re shooting 2-pointers. This doesn’t need to be explained to most people, but eFG% is nonetheless a nice neat little way of expressing that idea with a specific number, if that’s what makes you feel good. There are lots of other examples of things like this. We have statistics that measure not just how many rebounds a player gets per game, but what percentage of the available rebounds he gets, which eliminates numerous variables (pace of play, shooting percentages, etc.) to give us a more precise measure of how well a certain player performs a task. A lot of this stuff, to me, seems to be an exercise in quantifying the obvious (for example, there’s one in which somebody will watch a game with a 4-point differential with 12 minutes left and calculate the trailing team has a 48 percent chance of winning or something), but whatever.
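For anybody who wants to see those two numbers spelled out, here is a rough sketch. The eFG% formula is the standard one (a made three counts as one and a half made shots). The rebound percentage is one common formulation, scaled by minutes played, and I am assuming that is close enough to what most sites publish. The function names and the sample stat lines are made up for illustration.

```python
def effective_fg_pct(fgm, fg3m, fga):
    """eFG% = (FGM + 0.5 * 3PM) / FGA, so a made three counts as 1.5 makes."""
    if fga == 0:
        return 0.0
    return (fgm + 0.5 * fg3m) / fga


def rebound_pct(player_reb, player_min, team_min, team_reb, opp_reb):
    """Estimated share of available rebounds a player grabbed while on the floor.

    team_min is total team minutes (240 for a regulation NBA game), so
    team_min / 5 is the length of the game in minutes.
    """
    available = team_reb + opp_reb
    if player_min == 0 or available == 0:
        return 0.0
    return (player_reb * (team_min / 5)) / (player_min * available)


# Hypothetical shooters who each take 10 shots.
two_point_shooter = effective_fg_pct(fgm=5, fg3m=0, fga=10)    # makes 5 twos
three_point_shooter = effective_fg_pct(fgm=4, fg3m=4, fga=10)  # makes 4 threes

# Hypothetical rebounder: 10 boards in 30 minutes, 80 rebounds in the game.
share = rebound_pct(player_reb=10, player_min=30, team_min=240,
                    team_reb=40, opp_reb=40)
```

The three-point shooter made fewer shots but scored 12 points to the other guy's 10, and eFG% says the same thing: 0.6 against 0.5.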
Great. Love it. Go forth and prosper.
We run into a problem here, though, because unlike in baseball, the individual performances of basketball players can’t be isolated from each other. If you are (1) not an idiot and (2) somebody who has watched both sports, the reason for this doesn’t need to be explained to you, so I’m not going to bother. The point is, the basketball metrics aren’t as precise as the baseball metrics and never will be because of the nature of the sports.
But they’re still pretty good. They’re pretty good at telling us what happened. It’s up to us to figure out the Why but at least we are pretty close to knowing the What.
So this seems like it’s working out, and now we start trying to do this kind of thing with teams. Except we aren’t just tabulating what happened in their games, we’re trying to compare them to other teams. When they all play each other, as they do in the NBA, this seems to work out OK. Every NBA team plays every other NBA team multiple times per season. And there aren’t very many NBA teams, but there are an absolute buttload of NBA games. Further, the level of talent in the NBA has very little variance from team to team, in part because there are only 450 playing jobs available and in part because, like most professional sports leagues, the NBA is structured to create as much parity as possible. The worst teams get the best draft picks, there is a salary cap, there is free agency, and so on.
Now, if you’ve ever attempted to lay wood flooring (which, I’m sure, is totally all of you), you’ll be able to naturally conceptualize what I’m about to describe. But even if you haven’t attempted to lay wood flooring, you should be able to get this. Ready? OK: You lay the first plank, and it’s straight on the line. Then you lay the next one, and butt it right against the other and it looks pretty much perfect, but it’s off just a tiny bit, so little you can’t even see it. And then you lay the next one, and it’s a little bit off too, and this goes on and on until you get to the other side of the room and, oh no, you’re off by six inches and you stand up and the lines in your floor are fanning out and you’ve got this weird triangular shape that can’t possibly be filled with boards and you realize you were off just a little in the beginning but it compounded and even though you hammered those boards in one-by-one just like the book said, you aren’t even close.
Well, we’ve started laying flooring with college basketball, and we’ve got big problems, starting with the two factors I mentioned about the NBA. Because there are more than 5,000 scholarship Division I basketball players out there, the talent disparity is enormous. And because there are 347 Division I teams out there, the schedule doesn’t even come close to pitting every team against every other team.
The metric that is easiest to understand and therefore cited the most is the RPI, which attempts to measure the strength of a team’s schedule and its performance against that schedule. This would be a wonderful thing to know, because it would basically end every argument about which college basketball teams are deserving of which seeds.
Unfortunately, it’s Utopia.
The basis of the RPI, Strength of Schedule (SOS), is a simple formula — two thirds of it is the winning percentage of the teams you played, and the other third is the winning percentage of the teams those teams played. (I’m sure some calculate it with fifths or fourths or whatever but it’s still arbitrary and that’s not the point anyway). That’s pretty logical. I mean, you can see how the first board could get off by half a centimeter here, because a third of the formula is based on the transitive property, but at least we’re trying, right?
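If you want to see how little is going on under the hood, here is a minimal sketch of that two-thirds/one-third formula. It is a simplification, not the official calculation: as I understand it, the real RPI leaves out an opponent's games against the team being rated, and this version does not bother. The function names and the four-team schedule are invented for illustration.

```python
from collections import defaultdict


def win_pcts(games):
    """Winning percentage per team from a list of (winner, loser) pairs."""
    wins = defaultdict(int)
    played = defaultdict(int)
    for winner, loser in games:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
    return {team: wins[team] / played[team] for team in played}


def opponents(games):
    """Map each team to the list of teams it played (repeats included)."""
    opp = defaultdict(list)
    for winner, loser in games:
        opp[winner].append(loser)
        opp[loser].append(winner)
    return opp


def mean(values):
    return sum(values) / len(values) if values else 0.0


def strength_of_schedule(team, games):
    """SOS = 2/3 * opponents' win pct + 1/3 * opponents' opponents' win pct."""
    wp = win_pcts(games)
    opp = opponents(games)
    owp = mean([wp[o] for o in opp[team]])
    oowp = mean([mean([wp[oo] for oo in opp[o]]) for o in opp[team]])
    return (2 / 3) * owp + (1 / 3) * oowp


# Hypothetical mini-season: each pair is (winner, loser).
games = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
sos_a = strength_of_schedule("A", games)
```

Notice the function takes nothing but wins and losses. There is no place to put anything you knew before the first game was played.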
Now, remember how this is all an attempt to shatter myths? Well, in this case the big myth is the Top 25 poll, in which coaches or media just sort of observe basketball and rank the teams based on their impressions of them. It’s easy to see how this could be problematic. There are all kinds of potential biases at play, but maybe the biggest one is actually built into the system: Everything begins with a preseason poll. Thus, the polls are making a baseline assumption at the beginning of the season that we know EVERYTHING about all the teams and only adjust as we are proven wrong. Team X is the best until proven otherwise. This, of course, is absurd.
But in order to bust this myth, someone created a system (the SOS) which begins with an assumption we know NOTHING about any of these teams until the games are played. All 347 Division I basketball teams are assumed to be equal when the season begins, and only as they play each other does this begin to be sorted out. This is equally absurd. We know a whole heck of a lot about college basketball teams before the season begins. We know who the coaches are and how they have performed in the past. We have a pretty good idea who the best players are and who has them. We know who has the best home-court advantages. We know how teams performed against each other the previous season. We know how the NCAA Tournament from the previous 2,000 years has gone. We can accurately identify styles of play for a gigantic number of teams. The average college basketball fan has in his brain a huge amount of information about the sport — some quantifiable, some not — before a single game is played.
Of course, some of what we think we know is really just some kind of bias, but here’s the point: It is just as illogical to assume we know nothing about college basketball as it is to assume we know everything about college basketball.
This problem is compounded by there being an enormous number of teams and a small number of games. It’s not just that everybody doesn’t play everybody, it’s that you have to play Six Degrees of Robert Morris in order to compare Duke’s performance to New Mexico’s.
Old Dominion beating Santa Clara is the mythical butterfly that flaps its wings in China and causes a hurricane or gets Gonzaga a No. 1 seed. We are asking numbers to do things numbers aren’t capable of doing. The data are spread too thin. We’re trying to paint an entire landscape with two drops of lacquer.
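The Six Degrees game can be made literal. Here is a rough sketch that treats the schedule as a graph, with a line between any two teams that played, and counts how many games you have to chain together to connect one team to another. The function name and the team labels are placeholders I made up, not a real schedule.

```python
from collections import defaultdict, deque


def degrees_of_separation(games, start, goal):
    """Fewest games linking start to goal, or None if no chain exists.

    games is a list of (team, team) pairs; who won does not matter here.
    """
    graph = defaultdict(set)
    for a, b in games:
        graph[a].add(b)
        graph[b].add(a)

    if start == goal:
        return 0

    # Breadth-first search reaches teams in order of distance from start.
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        team, dist = queue.popleft()
        for nxt in graph[team]:
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical schedule with placeholder team names.
schedule = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
chain_a_to_d = degrees_of_separation(schedule, "A", "D")
chain_a_to_e = degrees_of_separation(schedule, "A", "E")
```

In this toy schedule A and D never played, so comparing them means going through two middlemen, and A and E can't be connected at all. Every extra link in that chain is another board laid a little bit crooked.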
And whenever I try to express this, somebody comes along and treats me like I just said the earth was flat, like I’m dragging brontosaurus bones into my cave and ranting about Aaron Craft’s grit. I’m not the idiot, here. I’m the one applying critical thought to the matter. Citing KenPom doesn’t mean you’re smart; it means he’s smart and you know how to read. Congratulations. I don’t want to hear somebody say, “It’s just math,” like I’m trying to argue 3 x 3 = 40. Multiplication only helps you if you’ve counted correctly in the first place.
I want to discuss this stuff just as intelligently as the next guy. I’m not some crusty baseball scout trying to do things the way they’ve always been done because I know change will make me obsolete. I’ve got nothing at stake, I just want to be right.
For some reason, everybody thinks you have to pick a side. You are either a Stats Guy or a Traditional Guy, and the Stats Guys get super defensive. They sound like Kip Dynamite when Napoleon doubts the effectiveness of his time machine. And the Traditional Guy gets defensive because he feels like people are telling him what he just saw didn’t really happen. And they both just start defending their sides to their own detriment, and it ends with both of them sounding like they don’t know how to think for themselves.
I think the stats are good. They’re useful tools and I think we should put them up to the X-Ray light and put our subjective observations up to the X-Ray light and if they don’t show the same thing, we should be open-minded enough to try to figure out which is wrong without just assuming anything with a decimal point in it is objective reality.
I just want to be able to point out a gap in the floor when I see one.