## Tuesday, 31 May 2016

### Probability, Bayesian Boundaries and Player Recruitment

I find probability, randomness and statistics to be one of the most inherently ordered and satisfying areas of mathematics. Although a lot of people have a vague understanding of the topic, I’ve also discovered that this understanding is rife with crucial misconceptions.

Let’s illustrate one of the most common misconceptions through an example:

Suppose 10 people are going to be asked to insert their hand into a bag and take out a coloured ball without looking. Inside the bag are 3 blue balls and 7 red balls for a total of 10 balls. Once each person has chosen his ball, the colour he selected is logged, the ball is placed back inside the bag and the next person chooses.

Now, after the 10 people have chosen their balls and the colours have been registered (you were not present while this was done and therefore don’t know how many times each colour came out), the gamemaster comes to you and asks: “Do you want to bet that blue balls were chosen exactly 3 times and red balls exactly 7 times? If this was the case, I’ll give you 100 dollars, but if not, you give me 100 dollars.”

Do you take the bet?

A lot of people would take the bet, meaning a lot of people would lose 100 dollars. The probability of you winning that bet is only about 27%.
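That figure comes from the binomial distribution: each draw is independent (the ball is put back) and has a 3-in-10 chance of being blue. A minimal sketch of the calculation follows; the function name `binomial_pmf` is my own and is only there for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent draws,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# 10 draws with replacement; 3 of the 10 balls are blue, so p = 0.3.
p_exactly_three_blue = binomial_pmf(3, 10, 0.3)
print(f"{p_exactly_three_blue:.4f}")  # 0.2668
```

Exactly 3 blue balls is still the single most likely outcome, but it is far from a sure thing: the other ten possible counts (0, 1, 2, 4, ..., 10) share the remaining 73% or so between them.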

One of the most important theorems in probability, the Law of Large Numbers, states that if you keep asking more and more people to select balls, the proportion of blue balls drawn converges to 30% as the number of draws goes towards infinity. It guarantees nothing about the exact count in a small, finite number of draws.
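A quick simulation makes the convergence visible. This is only a sketch: the helper `blue_proportion` and the fixed seed are my own choices, and the exact numbers printed depend on the random number generator.

```python
import random


def blue_proportion(n_draws: int, p_blue: float = 0.3, seed: int = 0) -> float:
    """Simulate n_draws draws with replacement and return the share of blue balls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    blues = sum(rng.random() < p_blue for _ in range(n_draws))
    return blues / n_draws


# The proportion can be well off 30% for small samples and settles down as n grows.
for n in (10, 100, 1_000, 100_000):
    print(n, round(blue_proportion(n), 4))
```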

Most people wrongly interpret the probability associated with the example above as: “out of 10 cases, 3 will be blue and 7 will be red”. They fail to grasp that probability only balances itself out in the long run.

Do you think casino owners look through the CCTV biting their nails and cursing each time the house loses a hand? Probably not. They understand the Law of Large Numbers and that, in the long run, the odds built into their games mean they earn money no matter what happens in a single hand or even a whole night.
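To put rough numbers on this, here is a sketch that assumes a specific game the post does not name: an even-money bet on a European roulette wheel, where the player wins on 18 of the 37 pockets. It computes the exact probability that the house is losing money after a given number of one-unit bets. The function names are hypothetical.

```python
from math import exp, lgamma, log


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the binomial probability of exactly k successes in n trials."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_comb + k * log(p) + (n - k) * log(1 - p)


def prob_house_behind(n_bets: int, p_player: float = 18 / 37) -> float:
    """Chance the house has a net loss after n_bets even-money unit bets,
    i.e. the player has won more than half of them."""
    first_losing_count = n_bets // 2 + 1
    return sum(
        exp(log_binom_pmf(k, n_bets, p_player))
        for k in range(first_losing_count, n_bets + 1)
    )


# Assumed game: even-money roulette bet, player wins 18 of 37 pockets.
for n in (1, 100, 10_000):
    print(n, prob_house_behind(n))
```

On a single bet the house is behind almost half the time, which is why one hand tells you nothing. As the number of bets grows, the chance of the house being behind shrinks towards zero.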

One of the areas of football that most excites and interests me is player recruitment and talent identification. Why is understanding probability important for this? Every big club in the world wants to identify the superstars of the future at a young age and bring them into its academy; to do so, it must scan millions of young players across the whole world and have a system in place that correctly identifies future talents. However, it doesn’t always work out for all those recruited players, does it? Many simply play their way through the academy system without their careers amounting to anything of note. How should football clubs approach their selection to maximise the number of future stars they recruit rather than unremarkable players?

Suppose then that every player falls into one of two categories: destined to be a future star or destined for an unremarkable career. The ultimate truth will only be known after his career unfolds and he retires, but the key conceptual element is to view him as having been in his respective category all along. This is obviously oversimplified, and in general you want a model involving more than these two categories, but it is enough for the point I want to get across.

What separates the two categories can sometimes be something as soft as “divorced parents”, but this doesn’t make it any less appropriate to be studied by mathematics. In fact, many soft variables such as these are used in banks’ models for predicting whether a person is in the “will pay loan” or “won’t pay loan” category.

However, sometimes these subtle differences aren’t captured in the information available on each player. Perhaps the scout trying to decide who to recruit doesn’t know whose parents are divorced, and judged only on the information that is available to him (in-game stats of the players, perhaps), two players from the two different categories can look exactly the same.

The typical example in mathematics textbooks is classifying a new fish whose species you don’t know to be either salmon or trout, based solely on some variables such as its length and width.

The scenario can look like this:

Blue points represent previously recorded instances of trout for these two variables, and red points represent previous instances of salmon. When a new fish comes along, you can’t be 100% sure of which category it belongs to, but you can make an educated prediction by assigning the category that seems more likely from your previous observations. In the example, the black line divides the space into two areas: one where you predict “salmon”, and another where you predict “trout”.

Since for the available information the two categories “overlap”, you must accept a degree of error in your prediction, unless you include more variables that can better distinguish the categories such as length of fins or divorced parents.

This “inevitable error” is called the Bayes error, and the line dividing your classification into categories is called the Bayes (or Bayesian) decision boundary. Classifying with the Bayes boundary minimises the probability of getting it wrong for this type of problem, and the Bayes error is the lowest level of error any model can aspire to with the variables available.
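To make this concrete, here is a minimal one-variable sketch (length only, rather than the two variables described above). All the numbers are hypothetical: I assume trout and salmon lengths are normally distributed with the same spread, and that both species are equally common. Under those assumptions the Bayes boundary is simply the midpoint between the two means, and the Bayes error is the share of each distribution that falls on the wrong side of it.

```python
from statistics import NormalDist

# Hypothetical length distributions in cm (not real fish data).
trout = NormalDist(mu=30.0, sigma=5.0)
salmon = NormalDist(mu=40.0, sigma=5.0)
PRIOR_TROUT = 0.5   # assumed: both species equally common
PRIOR_SALMON = 0.5


def classify(length_cm: float) -> str:
    """Bayes rule: pick the category with the larger prior * likelihood."""
    score_trout = PRIOR_TROUT * trout.pdf(length_cm)
    score_salmon = PRIOR_SALMON * salmon.pdf(length_cm)
    return "salmon" if score_salmon > score_trout else "trout"


# With equal priors and equal sigma, the boundary is the midpoint of the means.
boundary = (trout.mean + salmon.mean) / 2

# Bayes error: trout longer than the boundary plus salmon shorter than it.
bayes_error = (
    PRIOR_TROUT * (1 - trout.cdf(boundary))
    + PRIOR_SALMON * salmon.cdf(boundary)
)

print(boundary)               # 35.0
print(round(bayes_error, 4))  # 0.1587
```

Even with the best possible rule, about 16% of fish are misclassified in this made-up setting, because the two length distributions overlap. Only a new variable that separates the species better can push that figure down.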

Bayesian decision boundaries can be used to make informed decisions on player recruitment. If we gather as much information as we can collect on players, such as in-game stats and even softer variables about their personal lives, we can use the info on previous “stars” or “unremarkables” to produce a Bayes decision model and assign new players to one of these two categories. This sort of approach has the advantage that it can process millions of players quickly, economically and efficiently, without the need for a huge scouting network or machinery. Most clubs would rebuff the idea of handing their scouting process over 100% to these methodologies, but at least it can be used first as a filter to narrow down millions of players into a selected handful that the human component of the scouting process can analyse.
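The filtering step could look something like the sketch below. Everything in it is hypothetical: the `Candidate` type, the player names and the `p_star` values stand in for whatever probability of being a “star” a fitted model would output.

```python
import heapq
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    p_star: float  # model's estimated probability of the "star" category


def shortlist(candidates: list[Candidate], size: int) -> list[Candidate]:
    """Keep the `size` candidates with the highest estimated star probability."""
    return heapq.nlargest(size, candidates, key=lambda c: c.p_star)


# Made-up pool; in practice this would be millions of scored players.
pool = [
    Candidate("player_a", 0.02),
    Candidate("player_b", 0.41),
    Candidate("player_c", 0.17),
    Candidate("player_d", 0.63),
]

for_scouts = shortlist(pool, size=2)
print([c.name for c in for_scouts])  # ['player_d', 'player_b']
```

The model only decides who gets looked at; the final call on the shortlisted handful stays with the human scouts.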

The problem these approaches face is that when things are new to town they have a point to prove, the opposite of “innocent until proven guilty”. Disregarding the Bayes error associated with scouting, these methods are expected to be magic crystal balls and are only given a couple of chances before their fate is sealed by those outcomes. It’s as if a casino group came to you proposing you invest in a new casino and you said to them: “OK, get me one of your croupiers and let’s play a hand of blackjack, I want to see if it’s profitable”. If the house loses the hand, do you decide against the investment?

On the other hand, I read on the news the other day that Arsenal signed the Leicester scout that recruited Jamie Vardy, as if this single outcome is a sure sign of sustainable success.

I can’t assure you that you won’t miss out on a great player by using this approach, or that every player you sign will be fantastic; but the Law of Large Numbers ensures that these approaches will deliver sustainable success in the long run.

The true problem is convincing such a short-term industry to embrace it.