
Tuesday 31 May 2016

Probability, Bayesian Boundaries and Player Recruitment

I find probability, randomness and statistics to be among the most inherently ordered and satisfying areas of mathematics; and although it is a topic most people have at least a vague notion of, I’ve also discovered that their understanding of it is rife with crucial misconceptions.

Let’s illustrate one of the most common misconceptions through an example:

Suppose 10 people are going to be asked to insert their hand into a bag and take out a coloured ball without looking. Inside the bag are 3 blue balls and 7 red balls for a total of 10 balls. Once each person has chosen his ball, the colour he selected is logged, the ball is placed back inside the bag and the next person chooses.

Now, after the 10 people have chosen their balls and had their colours registered (you were not present while this was done, so you don’t know how many times each colour came out), the gamemaster comes to you and asks: “Do you want to bet that blue balls were chosen 3 times and red balls 7 times? If that turns out to be the case, I’ll give you 100 dollars; if not, you give me 100 dollars.”

Do you take the bet?

A lot of people would take the bet, meaning a lot of people would lose 100 dollars. The probability of winning that bet is merely around 27% (26.7%, to be precise).
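
For anyone wanting to check that figure, the exact value comes from the binomial distribution: with 10 independent draws and a 30% chance of blue each time, the probability of exactly 3 blues is C(10,3) · 0.3^3 · 0.7^7 ≈ 0.267. A minimal Python sketch (my own illustration, not part of the original example) confirms it:

from math import comb

n, k, p = 10, 3, 0.3  # 10 draws, exactly 3 blue, P(blue) = 0.3

# Binomial probability: C(n, k) * p**k * (1 - p)**(n - k)
prob = comb(n, k) * p**k * (1 - p) ** (n - k)
print(f"P(exactly 3 blue in 10 draws) = {prob:.3f}")  # prints 0.267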

One of the most important theorems in probability, the Law of Large Numbers, states that if you keep asking more and more people to select balls, then as the number of people tends to infinity the proportion of blue balls drawn will converge to 30%; it says nothing about any finite number of draws.
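
To see the Law of Large Numbers at work, here is a small simulation (again just an illustrative sketch, with arbitrary draw counts): for a handful of draws the observed proportion of blue balls is all over the place, and only for very large numbers of draws does it settle near 30%.

import random

random.seed(42)
p_blue = 0.3

for n in (10, 100, 10_000, 1_000_000):
    blues = sum(random.random() < p_blue for _ in range(n))
    print(f"{n:>9} draws: proportion of blue = {blues / n:.4f}")

# The proportion only hugs 0.30 for the largest draw counts;
# with just 10 draws it is routinely 0.2, 0.4 or worse.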

Most people wrongly interpret the probability in the example above as: “out of 10 cases, 3 will be blue and 7 will be red”. They fail to grasp that probability only balances itself out in the long run.

Do you think casino owners look through the CCTV biting their nails and cursing each time the house loses a hand? Probably not. They understand the Law of Large Numbers: in the long run, the probability layout of their games means they earn money no matter what happens in a single hand or even a whole night.


One of the areas of football that most excites and interests me is player recruitment and talent identification. Why is understanding probability important here? Every big club in the world wants to identify the superstars of the future at a young age and bring them into its academy; and to do so it must scan millions of young players across the whole world and have a system in place that correctly identifies future talents. However, it doesn’t always work out for all those recruited players, does it? Many simply play their way through the academy system without their careers amounting to anything of note. How should football clubs approach their selection to maximise the number of future stars they recruit rather than the unremarkable?

Suppose then that every player falls into one of two categories: destined to be a future star, or destined for an unremarkable career. The ultimate truth will only be known after his career unfolds and he retires; but the key conceptual element is to view him as having been in his respective category all along. This is obviously oversimplified, and in general you would want a model with more than these two categories, but for the sake of the point I want to get across, let’s consider this simplified case.

The difference between the two categories can come down to something as subjective as “divorced parents”, but that doesn’t make it any less appropriate for mathematical study. In fact, many soft variables like these are used in banks’ models for predicting whether a person is in the “will pay loan” or “won’t pay loan” category.

However, sometimes these subtle differences aren’t part of the information available for each player. Perhaps the scout trying to decide whom to recruit doesn’t know whose parents are divorced, and in the light of the information that is available to him (the players’ in-game stats, perhaps), two players from the two different categories can look exactly the same.

The typical example in mathematics textbooks is classifying a new fish whose species you don’t know to be either salmon or trout, based solely on some variables such as its length and width.

The scenario can look like this:


Blue points represent previously recorded instances of trout for these two variables, and red points represent previous instances of salmon. When a new fish comes along, you can’t be 100% sure which category it belongs to, but you can make an educated prediction by assigning the category that seems more likely given your previous observations. In the example, the black line divides the space into two areas: one where you predict “salmon”, and another where you predict “trout”.

Since the two categories “overlap” for the available information, you must accept a degree of error in your prediction, unless you include more variables that better distinguish the categories, such as length of fins or divorced parents.

This “inevitable error” is called the Bayes error, and the line dividing your classification into categories is the Bayes decision boundary. Making decisions using the Bayes boundary is the best decision-making model for this type of problem, and the Bayes error is the lowest level of error any model can aspire to.
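
To make the idea concrete, here is a minimal sketch of a Bayes decision rule for the fish example. The Gaussian class-conditional densities, their parameters and the priors are all invented for illustration; with real data you would estimate them from the recorded fish, and the boundary in the figure is exactly the set of points where the two weighted densities are equal.

from scipy.stats import multivariate_normal

# Illustrative (assumed) class-conditional densities over (length, width) in cm
trout  = multivariate_normal(mean=[40, 10], cov=[[16, 2], [2, 4]])
salmon = multivariate_normal(mean=[55, 14], cov=[[25, 3], [3, 5]])
prior_trout, prior_salmon = 0.5, 0.5  # assumed equally common

def classify(fish):
    # Bayes decision rule: choose the class with the larger posterior,
    # i.e. the larger prior * likelihood
    score_trout  = prior_trout * trout.pdf(fish)
    score_salmon = prior_salmon * salmon.pdf(fish)
    return "trout" if score_trout > score_salmon else "salmon"

print(classify([42, 11]))  # falls on the trout side of the boundary
print(classify([57, 15]))  # falls on the salmon side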

Bayes decision boundaries can be used to make informed decisions on player recruitment. If we include as much information as we can collect on players, such as in-game stats and even softer variables about their personal lives, we can use the data on previous “stars” and “unremarkables” to produce a Bayes decision model and classify new players into one of these two categories. This sort of approach has the advantage that it can process millions of players quickly, economically and efficiently, without the need for a huge scouting network or machinery. Most clubs would rebuff the idea of delegating their scouting process 100% to these methodologies, but at the very least it can be used as a first filter to narrow down millions of players to a selected handful that the human component of the scouting process can analyse.

The problem these approaches face is that when things are new in town they have a point to prove: the opposite of “innocent until proven otherwise”. Disregarding the Bayes error inherent in scouting, these methods are expected to be magic crystal balls and are given only a couple of chances before their fate is sealed by those outcomes. It’s as if a casino group came to you proposing you invest in a new casino and you said to them: “OK, get me one of your croupiers and let’s play a hand of blackjack; I want to see if it’s profitable.” If the house loses the hand, do you decide against the investment?

On the other hand, I read in the news the other day that Arsenal signed the Leicester scout who recruited Jamie Vardy, as if this single outcome were a sure sign of sustainable success.

I can’t assure you that you won’t miss out on a great player by using this approach, or that every player you sign will be fantastic; but the Law of Large Numbers ensures that these approaches will deliver sustainable success in the long run.


The true problem is convincing such a short-term industry to embrace it.

Wednesday 25 May 2016

Data Mining: Funneling Football Information

Data Mining is the process of discovering and extracting concrete knowledge from large data sets in a way that is understandable, compact and applicable to real-life problems. Fayyad, Piatetsky-Shapiro and Smyth (1996) succinctly refer to it as “pattern and knowledge discovery in databases”. A powerful image I like to use is to think of Data Mining as a funnel of information: it processes amounts of data and information that a person cannot normally observe or understand in their entirety, and produces a compact, summarised version that extracts the key knowledge in a way the human user can understand, so that he can then claim to have informed his decisions with as much information as was available.

You may not know it yet, but Data Mining methods are currently changing almost every industry in the world. Science most notably, of course, but also very “real-life”, day-to-day industries such as medicine, marketing and even street-light coordination to optimise traffic flow. Banks use Data Mining to decide whether to give someone a loan, and Facebook’s facial recognition software that predicts which friend you want to tag in a picture has Data Mining at its core. Even post offices use Data Mining methods to decode hand-written addresses and remove the need for a human to read them. The fourth season of Netflix’s House of Cards seems to imply that Data Mining applied to politics can win elections.

Football is one such industry, and it is just beginning to feel its way around the new paradigms. There is a massive collection of data going on at the moment, with companies such as OPTA or Prozone recording millions of events, in-game statistics and other information on the game and the surrounding industry; far more data than a single person can look at and decipher. Who can come to football’s aid?

The involvement of mathematical methods in the game has become a misleading, media-driven debate, not helped by the “Hollywood-isation” of Moneyball, which oversimplifies the role of Data Mining into a magic crystal ball for discovering talent. Sceptics seem to think that involving maths in the game will damage it, replacing the subtleties and flowing, aesthetic nature of football with something radically different and rather cold, based on staring incomprehensibly at thousands of numbers on spreadsheets or “number crunching” on calculators.

Let me assure you now: as a mathematician, I haven’t spent a single minute of my life poring over Excel spreadsheets of thousands of numbers or “crunching” numbers into a calculator (I don’t even own one). There’s no knowledge to be gained by doing these things; my brain is incapable of interpreting data that way, not without the help of much more elegant methods that know how to extract the information from these large databases and present it to me in compact and useful forms that I can actually understand.

In truth, mathematics is enhancing the game, not replacing anything. On the contrary: the wisdom and intuitive expertise of experienced football men, for example at recognising technically gifted young players or designing successful game tactics, can be studied and codified into these techniques so that we can build upon their knowledge. In fact, this is what most Data Mining techniques rely on: using previous successes, knowledge and expectations to design their own criteria for what to expect from their own performance, which is what methods such as supervised machine learning are ultimately all about. However, just as valuable expertise gained organically can be integrated into these techniques to make them richer, it must also be said that human intuition is inevitably flawed and prone to mistakes. As a trivialised example, the judgement of a football scout can be thrown off by a player’s good looks, or he can be subconsciously biased towards liking left-footed strikers more, potentially leading him to produce inaccurate valuations. We all know this to be true of ourselves and of our judgements, and anyone who claims not to be a victim of tangible biases when making daily decisions is not being honest with himself. Data Mining, on the other hand, has no preconceived ideas, prejudices or biases. I obviously cannot yet claim to understand the inner workings of football clubs nearly well enough to pass judgement on their performance, but a simple review of a handful of studies by respected academics in the football industry (Anderson and Sally, 2013; Kuper and Szymanski, 2009) suggests that there are still plenty of inefficiencies to be addressed. I believe that methods that can identify and address these inefficiencies should be greeted with enthusiasm by those who love the game, not mistrust.
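
As a hedged illustration of what “supervised” means here, the sketch below fits a simple classifier to a handful of hypothetical, already-labelled scouting records. The feature names, numbers and labels are entirely invented, and scikit-learn’s logistic regression is just one of many possible methods; the point is only that the criteria are learned from previously judged cases rather than written by hand.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical records: [pass accuracy %, sprints per 90, chances created per 90]
X = np.array([
    [88, 22, 2.1],
    [74, 30, 0.4],
    [91, 18, 1.8],
    [70, 25, 0.3],
    [85, 28, 1.5],
    [72, 20, 0.5],
])
# Labels encode what is known in hindsight: 1 = "became a top player", 0 = "did not"
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# A new, unseen player described with the same variables
new_player = np.array([[86, 24, 1.7]])
print(model.predict_proba(new_player))  # estimated probability of each category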

Data Mining is not necessarily a noisy new neighbour disrupting everyone’s way of life. It is not a revolutionary new way of doing things that pushes tradition off a cliff; rather, it is the gradual and natural continuation of humanity’s essential practice of interpreting information and creating knowledge to inform decision-making, simply adapted to the 21st century, where globalisation and technological development exponentially increase the scale of available information. Our understanding of topics as subjective as human behaviour, for example, has been greatly enriched by this trend, with behavioural economics now dominating decisions in government policy, marketing, investment banking, etc. Football men now have many more sources of information than they have ever had before to inform their decisions and shape their actions. Arguably, so much information is available that it is beyond a single person to store, codify and aggregate it so that tangible recommendations can be drawn out; and this is why mathematics must be called into action (to funnel the information). Opting against using mathematics to tap into the full potential of the knowledge held in these huge databases makes no sense, and those who make that decision will inevitably fall behind, missing out on the competitive advantage of knowing more than their opponents.

Many sceptics throw around phrases such as “numbers can’t tell you everything” and jump at the opportunity to single out instances where statistics-led approaches have failed, such as Damien Comolli’s transfer dealings while Director of Football at Liverpool. To them I would like to point out this: even Data Mining methods have a human component; there is a person behind the design and implementation of the methods used, and subsequently on the receiving end, interpreting and making use of the results. Applied poorly, some techniques can reveal nothing worthwhile; but applied with creativity and skill, these methods can produce some truly revolutionary discoveries. Ultimately, football clubs must ensure they hire good mathematicians to tap into these benefits.

I do not like football any less by observing it under the lens of mathematics; on the contrary, each day I feel like I gain a richer understanding of it that motivates and captivates me even more. I hope that this blog will do the same for you.

REFERENCES

1. Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), pp. 27-34.

2. Anderson, C. and Sally, D. (2013). The Numbers Game: Why Everything You Know About Football Is Wrong. Penguin UK.

3. Kuper, S. and Szymanski, S. (2009). Soccernomics. New York: Nation Books.

Tuesday 17 May 2016

Exploring High-Dimensional Data Sets in Football: Topological Data Analysis

Data mining has revolutionised the world of business. The increasing availability of data, combined with the globalisation of markets, means that the discovery of the smallest competitive advantage can produce returns worth millions. The business of football could not be left behind.

On the playing field, footballers perform thousands of actions per game. Companies specialising in football statistics, such as OPTA or Prozone, employ an army of ‘data inputters’ who, for meagre compensation, jump at the opportunity to make a living by watching video after video of football matches and recording as many events as possible. For each player in each match they log the number of times he passed the ball (with each leg, of course), the number of times he touched the ball, how many times he took a throw-in or how many times he controlled the ball with his left thigh. Literally thousands of events are logged.

These companies then sell massive statistical packages to the clubs, who, while recognising the need to try to obtain a competitive advantage from this new wave of information, are still mostly in the dark about how precisely to do so.


It’s simple to envisage extracting information from a single statistic. A typical process a football fan performs is looking at the leading goalscorers chart and assuming that, by seeing which strikers have scored the most goals, something can be said about their quality.

What happens when we have two statistics? Again, it doesn’t seem too difficult. Consider the table below, where the 25 players with the highest ‘goals + assists’ value for the 2015-16 Premier League season are displayed:

Note: as a caveat, the postponed Manchester United vs. Bournemouth game hasn’t been played at the time of writing, so any goals scored in that match aren’t reflected in the table. Should Manchester United beat Bournemouth by the 19-goal margin needed to qualify for the Champions League next season, surely at least one of their players will make the top 25. Interestingly, only 5 other teams aren’t represented in this table: West Brom, Crystal Palace, Bournemouth, Norwich and Aston Villa, with an average league finish of 16.8 among them (West Brom finished highest of these, at 14th) compared with Manchester United’s 5th or 6th.

[Table: the 25 Premier League players with the highest ‘goals + assists’ totals for the 2015-16 season]

Once again we can easily imagine extracting information from this ‘two-dimensional’ data. We can differentiate three types of players contributing to this combined total: players who score a lot but don’t assist much, like Aguero and Kane; players who assist a lot but don’t score much, like Ozil; and players who contribute in both departments, like the surprising Mahrez and Vardy of Leicester City. It’s a simple conclusion, almost trivial. The representation of data in two dimensions leaves the information there for the taking, readily available. The analytical process of interpreting this type of information and extracting concrete meaning is extremely natural for us.

What happens, then, when we have over 200 statistics? OPTA, in association with Manchester City, made a database for the 2011-12 Premier League season available to the general public, collecting over 200 in-game statistics for each player in each match of that season. Can we extract information or meaning in the same way from this type of data?

Topological Data Analysis (TDA) is a mathematical tool whose broad objective is precisely that: extracting qualitative information and meaning from high-dimensional data. It has previously been used, for example, to successfully analyse genetic data from cancer patients and discover patterns amongst groups of survivors. The key concept is perceiving the data as being located in a 200-dimensional space. This may seem overwhelming, but it’s actually quite natural. In the ‘goals + assists’ example, each player could be seen as a point in the plane, with two coordinates. If we had added a third statistic, ‘passes’ for instance, each player could then be seen as a point in three-dimensional space with three coordinates (goals, assists and passes). This conceptualisation can be extrapolated to spaces with 200 coordinates, and there is an area of pure mathematics dedicated to describing geometric notions such as shape, closeness and regions in these spaces.

Applying TDA to a data set yields a two-dimensional representation of the originally high-dimensional data in the form of a graph, where each node represents a grouping of the data points and edges are determined in such a way that the distances and shape of the data in its original high-dimensional space are, under certain parameters, conserved in the new visual representation. By looking at the result we can deduce certain information about the original layout of the high-dimensional data. The mathematical detail of the method is quite technical, but the important property to keep in mind is that the graph representation carries through some basic geometric properties of the data set.
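
For readers curious about the mechanics, the construction behind such graphs can be sketched in a few lines of Python. This is a bare-bones version of the Mapper idea (overlapping intervals of a one-dimensional projection, a clustering step in each interval, and an edge wherever two clusters share points); it is not the exact pipeline used to produce the figures below, and the parameters and random stand-in data are mine.

import numpy as np
from sklearn.cluster import DBSCAN

def mapper_graph(X, projection, n_intervals=10, overlap=0.3, eps=0.5):
    # Bare-bones Mapper: cover the projected values with overlapping intervals,
    # cluster the points falling in each interval, and connect clusters
    # that share at least one data point.
    lo, hi = projection.min(), projection.max()
    length = (hi - lo) / n_intervals
    nodes = []

    for i in range(n_intervals):
        a = lo + i * length - overlap * length
        b = lo + (i + 1) * length + overlap * length
        idx = np.where((projection >= a) & (projection <= b))[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X[idx])
        for lab in set(labels) - {-1}:          # -1 marks unclustered noise
            nodes.append(set(idx[labels == lab]))

    # Two nodes are joined by an edge if their member points overlap
    edges = {(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]}
    return nodes, edges

# Toy usage: rows = players, columns = in-game statistics (random stand-in data)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))     # 300 "players", 200 statistics each
projection = X.mean(axis=1)         # a simple one-dimensional lens
nodes, edges = mapper_graph(X, projection, eps=25.0)  # eps tuned to this data's scale
print(len(nodes), "nodes and", len(edges), "edges")

In a real application the 200 columns would be the OPTA statistics, and the choice of lens, clustering method and parameters would need far more care than this toy sketch suggests.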


The following images show the application of TDA to the database made public by OPTA:


Each node represents a group of players, and its colour is determined by their playing position, as the legend indicates. We can observe at first sight how the “geometry” of the more than 200 coordinates recognises different playing positions and styles of play among the players. It differentiates clearly between defenders, midfielders and strikers, and can even distinguish more specific subcategories such as the ones we have circled. The same graph can be “coloured” according to different parameters:


In this image, each node is coloured according to the final league position of the team its players belong to. Once again, we can appreciate that this information is available in the geometry of the coordinates, because nodes of different colours appear in different ‘regions’ of the graph instead of being mixed up. This means that there is something in the combination of all the considered statistics that sets players from “top” teams apart from players from “lesser” teams. Assuming that in a league as sophisticated as the Premier League, differentiating between the quality of teams is synonymous with differentiating between the quality of players, we can begin to anticipate the potential of this type of information when applied to young players from unknown leagues, with a view to better informing a club’s recruitment strategy.

This last point is important: this methodology allows us to detect which qualities are supported by the information contained in the “geometric layout” of the more than 200 statistics we are considering simultaneously. The next example can illustrate this point further:


This graph is the result of applying the technique to a slightly different dataset, where each node is composed of entire team performances in a match, rather than individual players. We can appreciate that, of the two colourings, the one on the left (the team’s final position in the league table) appears much less ‘mixed up’ than the one on the right (the specific result of that match). While performances from “top” teams occur in certain ‘areas’ or ‘zones’ of the 200-dimensional space, the same cannot be said of performances that ended in victory, for example, which appear mixed up with performances leading to draws or defeats. This is interpreted as follows: while the information about a team’s quality (seen as its final position in the league table) is ‘codified’ in the combination of the considered statistics and can be found in their geometric layout (which data points are close to which others), the same cannot be said of a match’s specific result. The considered statistics cannot accurately predict what the final result of a match will be; that can come down to chance events such as a penalty or a shot hitting the woodwork. What they can ‘predict’ is the final league position of a team; this underlying information on “performance quality” is available when analysing in-game statistics.

Other relevant conclusions can be drawn from applying this methodology; for example, the fact that in the 2011-12 season Chelsea and Fulham (6th and 9th) employed a similar style of play.

This type of information proves very important for football because it establishes a middle ground where the massive collection of data can meet ‘footballistically’ valuable applications. We have established that there are techniques that can take advantage of large data sets which seem overwhelming and useless to traditional analytical methods, and that the information extracted can be used to revolutionise exciting areas of football such as player recruitment and talent identification.