Data mining has
revolutionised the world of business. The increasing availability of data
combined with the globalisation of markets means that the discovery of the
smallest competitive advantage can produce returns worth millions. The business of
football could not be left behind.
On the playing
field, footballers perform thousands of actions per game. Companies specialised
in providing football statistics, such as OPTA or Prozone, employ an army of ‘data inputters’ who, for meagre
compensation, jump at the opportunity to make a living by watching video after
video of football matches and recording as many events as possible. For each
player in each match they log the number of times he passed the ball (with each
leg, of course), the number of times he touched the ball, how many times he took
a throw-in and how many times he controlled the ball on his left thigh. Literally
thousands of events are logged.
These companies
then sell massive statistical packages to the clubs, who, while recognising the
need to obtain a competitive advantage from this new wave of information,
are still mostly in the dark about precisely how to do this.
It’s simple to
envisage extracting information from a single statistic. A football fan does
this routinely when looking at the leading goalscorers chart and assuming
that seeing which strikers have scored the most goals says something
about their quality.
What happens
when we have two statistics? Again, it doesn’t seem too difficult. Consider the
table showing the 25 players with the highest ‘goals+assists’ value for the
2015-16 Premier League season:
Once again we
can easily imagine extracting information from this ‘two-dimensional’ data. We
can differentiate three types of players who contribute to this total: players who score a
lot but don’t assist much, like Aguero and Kane; players who assist a lot but
don’t score much, like Ozil; and players who contribute on both counts, like the
surprising Mahrez and Vardy of Leicester City. It’s a simple conclusion, almost
trivial. The representation of data in two dimensions leaves information there
for the taking, readily available. The analytical process of interpreting
this type of information and extracting concrete meaning is extremely natural
for us.
What happens
then when we have over 200 statistics? OPTA in association with Manchester City
made a database for the 2011-12 Premier League season available to the general
public, in which they collect over 200 in-game statistics for each player in
each match of that season. Can we extract information or meaning in the same
way from this type of data?
Topological Data Analysis
(TDA) is a mathematical tool whose broad objective is precisely that: extracting
qualitative information and meaning from high-dimensional data. It has
previously been used, for example, to analyse genetic data from cancer
patients and discover patterns amongst groups of survivors. The key concept is
to perceive the data as being located in a 200-dimensional space. This may seem
overwhelming, but it’s actually quite natural. In the ‘goals+assists’ example, each player could be seen as a point in the
plane, with two coordinates. If we added a third statistic, ‘passes’ for instance, each player could
then be seen as a point in three-dimensional space with its three coordinates
(goals, assists and passes). This conceptualisation can be extrapolated to
spaces with 200 coordinates, and there is an area of pure mathematics dedicated
to describing geometric notions such as shape, closeness, regions, etc. in these
spaces.
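To make the idea of “closeness” between players concrete, here is a minimal Python sketch. The three players and their numbers are invented purely for illustration (they are not real statistics), and rescaling each statistic to the range 0 to 1 is an assumption of this sketch, not something the database prescribes. The same distance function works unchanged for 3 coordinates or for 200.

```python
import math

# Hypothetical players with made-up values for (goals, assists, passes).
# These numbers are illustrative only, not real statistics.
players = {
    "Player A": (24.0, 2.0, 400.0),
    "Player B": (22.0, 3.0, 450.0),
    "Player C": (5.0, 18.0, 2100.0),
}

def rescale(points):
    """Rescale every coordinate to [0, 1] so that no single statistic
    (here, passes) dominates the distance. This is an assumed choice."""
    columns = list(zip(*points.values()))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return {
        name: tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(vec, lows, highs)
        )
        for name, vec in points.items()
    }

def distance(p, q):
    """Euclidean distance; valid for any number of coordinates."""
    return math.dist(p, q)

scaled = rescale(players)
d_ab = distance(scaled["Player A"], scaled["Player B"])
d_ac = distance(scaled["Player A"], scaled["Player C"])

# The two goal-heavy players sit closer together than A and the playmaker C.
print(d_ab < d_ac)  # True
```

With 200 statistics per player, each tuple would simply have 200 entries; the notion of two players being “close” carries over directly.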
Applying TDA
to a data set produces a two-dimensional representation, in the form of a graph, of the
originally high-dimensional data. Each node represents a grouping of the
data points, and edges are determined in such a way that the distances and
shape of the data in its original high-dimensional space are preserved, under
certain parameters, in the new visual representation. By looking at the result we
can deduce certain information regarding the original layout of the
high-dimensional data. The mathematical detail of the method is actually quite
technical, but the important point to keep in mind is that the
representation in the form of a graph retains some of the basic geometric
properties of the data set.
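The description above matches the construction usually called Mapper in the TDA literature. Assuming that is the method in play, the following is a rough Python sketch of the idea and not the implementation behind the images in this post. The function names, the choice of “lens”, the number of intervals, the overlap and the clustering threshold `eps` are all assumptions made for illustration.

```python
import math
from itertools import combinations

def cluster(indices, points, eps):
    """Single-linkage clustering: connected components of the graph
    that links any two points closer than eps."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in remaining
                    if math.dist(points[i], points[j]) < eps}
            remaining -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters

def mapper(points, lens, n_intervals=4, overlap=0.5, eps=1.0):
    """Toy Mapper: slice the data by a lens value into overlapping
    intervals, cluster each slice, and join clusters sharing points."""
    values = [lens(p) for p in points]
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_intervals or 1.0
    extra = step * overlap / 2  # how far each interval is widened per side
    nodes = []
    for k in range(n_intervals):
        a = lo + k * step - extra
        b = lo + (k + 1) * step + extra
        members = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(cluster(members, points, eps))
    # An edge joins two nodes whenever they have a data point in common.
    edges = {(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]}
    return nodes, edges

# Toy data: seven points along a straight line (illustrative only).
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.5, 0.0),
          (2.0, 0.0), (2.5, 0.0), (3.0, 0.0)]
nodes, edges = mapper(points, lens=lambda p: p[0],
                      n_intervals=3, overlap=0.5, eps=0.6)
print(sorted(edges))  # [(0, 1), (1, 2)]
```

The seven points collapse into three nodes joined in a chain, so the graph keeps the “line” shape of the original data. With real player data each point would have some 200 coordinates and the lens would be a summary of them, but the steps would be the same.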
The following
images show the application of TDA to the database made public by OPTA:
Each
node represents a group of players and is coloured according to their playing
position, as the legend indicates. We can observe at first sight how the
“geometry” of the more than 200 coordinates recognises different playing
positions and styles of play among the players. It differentiates clearly
between defenders, midfielders and strikers, and can even differentiate between
more specific subcategories such as the ones we have circled. The same graph can be
“coloured” according to different parameters:
For this image,
each node is coloured according to the final league position of the team
for which its players play. Once again, we can
see that this information is present in the geometry of the
coordinates, because nodes of different colours appear in different ‘regions’
of the graph instead of being mixed up. This means that there is something in
the combination of all the considered statistics of players from “top” teams
that sets them apart from players from “lesser” teams. Assuming that, in a league
as sophisticated as the Premier League, differentiating between the quality of
teams is synonymous with differentiating between the quality of their players,
we can begin to anticipate the potential of discovering this type of
information when the method is applied to young players from unknown leagues, with a view to
better informing a club’s recruitment strategy.
This last point is important: this methodology
allows us to detect which qualities are supported by the information contained
in the “geometric layout” of the more than 200 statistics we are considering
simultaneously. The next example illustrates this point further:
This graph is
the result of applying the technique to a slightly different dataset, where
the nodes group entire team performances in a match, rather than
individual players. We can see that, of the two
colourings, the one on the left (final position in the league table of that
team) appears to be much less ‘mixed up’ than the one on the right
(the specific result of that match). While the performances of “top” teams
occur in certain ‘areas’ or ‘zones’ of the 200-dimensional space, the same
cannot be said of performances that ended in victory, for example, which
occur mixed up with performances leading to draws or defeats. This is
interpreted as follows: while the information about a team’s quality (seen as
its final position in the league table) is
‘codified’ inside the combination of the considered statistics and can be found
in its geometric layout (which data points are close to which others), the same
cannot be said of a match’s specific result. The considered statistics cannot
accurately predict what the final result of a match will be, since that may turn on
chance events such as a penalty or a shot hitting the woodwork. What they can ‘predict’ is the
final league position of a team; this underlying information about “performance
quality” is largely available
in the in-game statistics.
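One simple way to put a number on how ‘mixed up’ a colouring is would be to count the share of edges that join two nodes of the same colour. The sketch below is an assumed measure chosen for illustration, not the one used to produce the images. The six-node graph and both colourings are invented, and each node is given a single colour label for simplicity.

```python
def same_colour_fraction(edges, colour):
    """Share of edges whose two endpoints carry the same colour label.
    Values near 1 mean colours form separate regions; values near 0
    mean the colours are thoroughly mixed."""
    edges = list(edges)
    if not edges:
        return 0.0
    same = sum(1 for u, v in edges if colour[u] == colour[v])
    return same / len(edges)

# Hypothetical toy graph: a chain of six nodes (not real data).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

# Made-up colouring by league-table band: similar nodes sit together.
by_league_band = {0: "top", 1: "top", 2: "top",
                  3: "lower", 4: "lower", 5: "lower"}

# Made-up colouring by match result: neighbours rarely agree.
by_match_result = {0: "win", 1: "loss", 2: "win",
                   3: "draw", 4: "win", 5: "loss"}

league_score = same_colour_fraction(edges, by_league_band)
result_score = same_colour_fraction(edges, by_match_result)
print(league_score)  # 0.8
print(result_score)  # 0.0
```

A high score for the league-position colouring and a low one for the match-result colouring would correspond to the visual impression described above.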
Other types of
relevant conclusions can be drawn from applying this methodology; for
example, the fact that in the 2011-12 season Chelsea and Fulham (6th
and 9th respectively) employed a similar style of play.