Data mining has revolutionised the world of business. The increasing availability of data, combined with the globalisation of markets, means that discovering even the smallest competitive advantage can produce returns worth millions. The business of football could not be left behind.
On the playing field, footballers perform thousands of actions per game. Companies specialised in providing football statistics, such as OPTA or Prozone, employ an army of ‘data inputters’ who, for meagre compensation, jump at the opportunity to make a living by watching video after video of football matches and recording as many events as possible. For each player in each match they log the number of times he passed the ball (with each foot, of course), the number of times he touched the ball, how many times he took a throw-in or how many times he controlled the ball on his left thigh. Literally thousands of events are logged.
These companies then sell massive statistical packages to the clubs, who, while recognising the need to obtain a competitive advantage from this new wave of information, are still mostly in the dark about how precisely to do this.
It’s simple to envisage extracting information from a single statistic. A typical football fan looks at the leading goalscorers chart and assumes that seeing which strikers have scored the most goals says something about their quality.
What happens when we have two statistics? Again, it doesn’t seem too difficult. Consider the table where the 25 players with the highest ‘goals+assists’ value for the 2015-16 Premier League season are displayed:
Once again we can easily imagine extracting information from this ‘two-dimensional’ data. We can differentiate three types of players who contribute to this total: players who score a lot but don’t assist much, like Aguero and Kane; players who assist a lot but don’t score much, like Ozil; and players who contribute on both counts, like the surprising Mahrez and Vardy of Leicester City. It’s a simple conclusion, almost trivial. Representing data in two dimensions leaves the information there for the taking. Interpreting this type of information and extracting concrete meaning from it comes naturally to us.
What happens then when we have over 200 statistics? OPTA in association with Manchester City made a database for the 2011-12 Premier League season available to the general public, in which they collect over 200 in-game statistics for each player in each match of that season. Can we extract information or meaning in the same way from this type of data?
Topological Data Analysis (TDA) is a mathematical tool whose broad objective is precisely that: extracting qualitative information and meaning from high-dimensional data. It has previously been used, for example, to analyse genetic data from cancer patients and discover patterns amongst groups of survivors. The key concept is perceiving the data as being located in a 200-dimensional space. This may seem overwhelming, but it’s actually quite natural. In the ‘goals+assists’ example, each player could be seen as a point in the plane, with two coordinates. If we had added a third statistic, like ‘passes’ for instance, each player could then be seen as a point in three-dimensional space with its three coordinates (goals, assists and passes). This conceptualisation can be extrapolated to spaces with 200 coordinates, and there is an area of pure mathematics dedicated to describing geometric notions such as shape, closeness and regions in these spaces.
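To make the idea of ‘closeness’ concrete, here is a minimal Python sketch that treats each player as a point with three coordinates and measures the straight-line (Euclidean) distance between players. The three players and their numbers are invented purely for illustration and come from no real dataset, and the function name `euclidean_distance` is our own. The same function works unchanged for points with 200 coordinates.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical players, invented for illustration only.
# Coordinates are (goals, assists, passes).
player_a = (20, 5, 600)    # a high-scoring striker
player_b = (18, 7, 650)    # another striker with a similar profile
player_c = (2, 15, 2100)   # a creative midfielder

d_ab = euclidean_distance(player_a, player_b)
d_ac = euclidean_distance(player_a, player_c)

# The two strikers sit much closer together than the striker and the midfielder.
print(d_ab < d_ac)
```

Note that ‘passes’ takes far larger values than goals or assists and so dominates the distance here. In practice one would probably rescale every statistic to a comparable range before measuring distances, although the original analysis does not say how this was handled.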
Applying TDA to a data set produces a two-dimensional representation, in the form of a graph, of the originally high-dimensional data. Each node represents a grouping of data points, and the edges between nodes are determined in such a way that the distances and shape of the data in its original high-dimensional space are preserved, under certain parameters, in the new visual representation. By looking at the result we can deduce certain information regarding the original layout of the high-dimensional data. The mathematical detail of the method is quite technical, but the important property to keep in mind is that the graph carries through some basic geometrical properties of the data set.
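The description above matches a construction commonly known as Mapper. The article does not name the exact algorithm or software behind its images, so treating it as Mapper is an assumption on our part. The sketch below is a deliberately simplified version: it slices the data into overlapping intervals along a chosen ‘lens’ value, clusters the points inside each slice, turns every cluster into a node, and joins two nodes with an edge whenever they share a data point. All names (`mapper`, `cluster`, `n_intervals`, `overlap`, `eps`) and the toy points are hypothetical.

```python
import math
from itertools import combinations

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster(indices, points, eps):
    """Single-linkage clustering: group points linked by chains of steps no longer than eps."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in remaining if distance(points[i], points[j]) <= eps}
            remaining -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters

def mapper(points, lens, n_intervals, overlap, eps):
    """Return (nodes, edges); each node is a frozenset of point indices."""
    values = [lens(p) for p in points]
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals
    pad = length * overlap  # how far each interval extends into its neighbours
    nodes = []
    for k in range(n_intervals):
        start = lo + k * length - pad
        end = lo + (k + 1) * length + pad
        inside = [i for i, v in enumerate(values) if start <= v <= end]
        nodes.extend(cluster(inside, points, eps))
    # Two nodes are joined when they have at least one data point in common.
    edges = {(a, b) for a, b in combinations(range(len(nodes)), 2) if nodes[a] & nodes[b]}
    return nodes, edges

# Toy data, invented for illustration: two well-separated groups of points.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0), (11, 0), (12, 0)]
nodes, edges = mapper(points, lens=lambda p: p[0], n_intervals=3, overlap=0.3, eps=1.5)

print(len(nodes), edges)  # 3 {(0, 1)}
```

On this toy input the first two nodes overlap and are joined by an edge, while the third node stays disconnected, so the graph reflects the fact that the original points form two separate groups. Real applications use richer lens functions and clustering methods, and the choice of parameters strongly affects the picture.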
The following images show the application of TDA to the database made public by OPTA:
Each node represents a group of players, and its colour is determined by their playing position, as the legend indicates. We can observe at first sight how the “geometry” of the more than 200 coordinates recognises different playing positions and styles of play. It differentiates clearly between defenders, midfielders and strikers, and can even differentiate between more specific subcategories, such as the ones we have circled. The same graph can be “coloured” according to different parameters:
For this image, each node is coloured according to the final league position of the team for which its players play. Once again, we can appreciate that this information is available in the geometry of the coordinates, because nodes of different colours appear in different ‘regions’ of the graph instead of being mixed up. This means that there is something in the combination of all the considered statistics of players from “top” teams that sets them apart from players from “lesser” teams. Assuming that in a league as sophisticated as the Premier League, differentiating between the quality of teams is synonymous with differentiating between the quality of their players, we can begin to anticipate the potential of discovering this type of information about young players from unknown leagues, with a view to better informing a club’s recruitment strategy.
This last point is important: this methodology allows us to detect which qualities are supported by the information contained in the “geometric layout” of the more than 200 statistics we are considering simultaneously. The next example illustrates this point further:
This graph is the result of applying the technique to a slightly different dataset, where each node is composed of entire team performances in a match, rather than of individual players. We can appreciate that of the two colourings, the one on the left (final position in the league table of that team) appears to be much less ‘mixed up’ than the one on the right (the specific result of that match). While the performances of “top” teams occur in certain ‘areas’ or ‘zones’ of the 200-dimensional space, the same cannot be said of performances that ended in victory, for example, which occur mixed up with performances leading to draws or defeats. This is interpreted as follows: while the information about a team’s quality (seen as its final position in the league table) is ‘codified’ inside the combination of the considered statistics and can be found in its geometric layout (which data points are close to which others), the same cannot be said of a match’s specific result. The considered statistics cannot accurately predict the final result of a match, which can turn on chance events such as a penalty or a shot hitting the woodwork. What they can ‘predict’ is the final league position of a team; this underlying information about “performance quality” is mostly available when analysing in-game statistics.
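The comparison above is made by eye. One hypothetical way to put a number on how ‘mixed up’ a colouring is would be to check, for each data point, what fraction of its nearest neighbours carry the same label. The sketch below does this in plain Python; it is our own illustrative proxy and not the method used to produce the images, and the points, labels and the name `neighbour_agreement` are all invented.

```python
import math

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def neighbour_agreement(points, labels, k):
    """Average fraction of each point's k nearest neighbours that share its label."""
    total = 0.0
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: distance(p, points[j]),
        )
        nearest = others[:k]
        total += sum(labels[j] == labels[i] for j in nearest) / k
    return total / len(points)

# Toy performances, invented for illustration: two tight groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# A labelling that follows the geometry (like league position in the article)...
tier = ["top", "top", "top", "lower", "lower", "lower"]
# ...and one that ignores it (like individual match results).
result = ["win", "loss", "win", "loss", "win", "loss"]

separated = neighbour_agreement(points, tier, k=2)
mixed = neighbour_agreement(points, result, k=2)

print(separated > mixed)
```

A score near 1 means points with the same label cluster together, while a low score means the labels are scattered across the space. Under this toy setup the ‘tier’ labelling scores higher than the ‘result’ labelling, mirroring the visual contrast between the two colourings.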
Other relevant conclusions can be drawn from applying this methodology; for example, that in the 2011-12 season Chelsea and Fulham (6th and 9th) employed a similar style of play.