The Open Football: 2016

Wednesday 30 November 2016

Player Passing Motif Style Application: Most Distinctive Players and Best Recruitment Opportunities

I was recently invited to send some of my work on passing motifs to be used for a Fink Tank column in The Times, but of course a dendrogram such as the one I linked in my previous entry wouldn’t cut it in printed media. Therefore, I thought the best thing to do was set out to answer some concrete applied questions the methodology might answer, which would be easy to display but interesting nevertheless. The content of this entry was the result:

I started by thinking about “distinctive” players; players that couldn’t be replaced easily. Remember that there is solid evidence that the methodology outlined in the previous entry picks up some underlying information on player passing style, and two players are considered similar if their vectors are “close” to each other. With this in mind, I computed the average distance of the 10 closest players for each player. The players for whom this number was highest were considered the most distinctive. The following table shows the top 5 along with their closest neighbours (this whole entry is based on data from the 2015-16 Premier League, without goalkeepers):

First of all, I strongly think that Ulloa being there is a bit of an oddity: most of his appearances were late substitutions when Leicester were holding onto a 1-0 lead and therefore his stats are representative of this unique predicament.

Ozil is the most distinctive player in the league. Looking through his closest players, Nacho Monreal and Bacary Sagna may raise a few eyebrows, but even they are considerably far away from his style and therefore don’t reveal much about him. It’s a bit like the US mainland and Australia being amongst the closest countries to Hawaii, so treat that with due suspicion. The rest of the players seem to make good footballing sense. I’ll leave it to the readers to read through the results and make their own judgements.
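The distinctiveness score described above (the average distance to a player's nearest neighbours) can be sketched in a few lines. This is only an illustration under my own assumptions: Euclidean distance, a hypothetical `distinctiveness` helper, and made-up 2-dimensional toy vectors rather than the real player data. It needs Python 3.8 or later for `math.dist`.

```python
import math

def distinctiveness(vectors, k=10):
    """Mean Euclidean distance from each player to his k nearest neighbours."""
    scores = {}
    for name, vec in vectors.items():
        # Distances to every other player, smallest first.
        dists = sorted(
            math.dist(vec, other)
            for other_name, other in vectors.items()
            if other_name != name
        )
        nearest = dists[:k]
        scores[name] = sum(nearest) / len(nearest)
    return scores

# Toy 2-dimensional vectors (invented numbers, not real player data).
toy = {
    "A": (0.0, 0.0),
    "B": (0.0, 1.0),
    "C": (1.0, 0.0),
    "D": (10.0, 10.0),  # sits far away from everyone else
}
scores = distinctiveness(toy, k=2)
most_distinctive = max(scores, key=scores.get)
print(most_distinctive, round(scores[most_distinctive], 2))
```

With only four toy players I used k=2; the entry itself uses the 10 closest players.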

Another interesting question I thought of was this: which players represent the best recruitment opportunities, in the sense that they have similar styles to players who play for much better ranked teams? Something like Sunderland players having similar styles to players from Arsenal, Manchester City or Tottenham. There are several ways to answer this question. Let’s start with the simplest: for each player I computed the average final league position of the teams of his 10 closest players, and subtracted that number from his own team’s final position. The players for whom this number was highest can be considered to represent the “best” opportunities. The following table shows the top 10:

I didn’t watch Aston Villa much last season and have no opinion on Ashley Westwood to be honest. It’s good to see Nathan Redmond and Idrissa Gueye on there though, considering they have since moved to Southampton and Everton respectively, proving they were in fact capable of playing for better teams. The problem with this methodology, though, is that it assumes that the difference in player quality is linear in league position. That is to say, the difference in quality between a player from the 16th-placed team and one from the 20th-placed team is assumed to be the same as the difference between a player from the champions and a player from the 5th-placed team, when in truth there is no solid basis for this assumption.
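For concreteness, the league position gap described above could be computed along these lines. This is a rough sketch, not the code behind the table: the function name `position_gap`, the team names and the precomputed neighbour lists are all hypothetical.

```python
def position_gap(player_team, neighbours, final_position):
    """Own team's final position minus the mean position of the neighbours' teams."""
    gaps = {}
    for player, team in player_team.items():
        positions = [final_position[player_team[n]] for n in neighbours[player]]
        gaps[player] = final_position[team] - sum(positions) / len(positions)
    return gaps

# Hypothetical toy data: three players with two neighbours each.
player_team = {"P1": "Lowly FC", "P2": "Top FC", "P3": "Mid FC"}
final_position = {"Top FC": 1, "Mid FC": 10, "Lowly FC": 18}
neighbours = {"P1": ["P2", "P3"], "P2": ["P1", "P3"], "P3": ["P1", "P2"]}

gaps = position_gap(player_team, neighbours, final_position)
best_opportunity = max(gaps, key=gaps.get)
print(best_opportunity, gaps[best_opportunity])
```

A large positive gap means the player's team finished well below the teams of the players he resembles, which is the sense of "opportunity" used here.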

One way to deal with this is to apply an increasing concave function to league position so that the same differences in position lower down the table are weighted less than for higher placed teams. I tried out a few functions like log, square root, cube root, etc., and the results vary marginally but the same core of names seems to pop out for most of them. As an example, the table below shows the results for this methodology applying the fourth root to league position:
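A quick numerical check of what the fourth root does to gaps in league position. It is only arithmetic on the positions mentioned earlier (16th against 20th, and 1st against 5th); the helper name is my own.

```python
def fourth_root(position):
    """Increasing, concave transform of a league position (1 = champions)."""
    return position ** 0.25

# The same four-place gap at the bottom and at the top of the table.
gap_bottom = fourth_root(20) - fourth_root(16)
gap_top = fourth_root(5) - fourth_root(1)

print(round(gap_bottom, 3), round(gap_top, 3))
```

After the transform, the four places separating 16th from 20th count for much less than the four places separating the champions from 5th, which is the intended effect.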

Some more satisfying names show up on there now (Ashley Williams who moved to Everton and Jason Puncheon who is pretty good), but it’s still unclear whether this methodology is properly representing the differences in quality required to play for different teams. Perhaps a better way to look at it is by points obtained rather than league position. Is the ratio of points a good representation of the difference in quality? The idea would be something like if a team obtains 90 points then its players must be 3 times as good as those of a team which obtains 30 points. The following table shows the top ten players such that the ratio between their teams’ points and the average points of the teams of their 10 closest players is greatest:

NOTE: I had to exclude Aston Villa players here because they obtained so few points that right away the method assumed it was twice as hard to play for Norwich as for Villa (Aston Villa earned 17 points and Norwich 34), and obviously all the Villa players dominated the rankings.

Nathan Redmond is on the list again, which is good to see, as well as Moussa Sissoko, who moved to Tottenham this season. M’Vila, Cisse and Watmore are other players on the list whom I rate highly, but again, let’s allow the reader to make his own judgements.

To wrap this theme up, the final way of answering this question I used combines some of the best elements of the previous two ideas: for each player we compute the difference between the average squared points of the 10 closest players and his own team’s final points squared. The square is taken to compensate for the fact that the quality required to go from a 60-points team to an 80-points team is higher than the quality required to go from a 30-points team to a 50-points team. The table below shows the results:
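To see why squaring changes the ranking, here is a small sketch comparing the points ratio with the squared points difference. The reading of "average squared points" as the mean of the squares (rather than the square of the mean) is my assumption, and `points_scores` is a hypothetical helper.

```python
def points_scores(own_points, neighbour_points):
    """Return (points ratio, squared points difference) for one player."""
    n = len(neighbour_points)
    mean_points = sum(neighbour_points) / n
    mean_squared = sum(p * p for p in neighbour_points) / n  # assumed: mean of squares
    ratio = mean_points / own_points
    squared_diff = mean_squared - own_points ** 2
    return ratio, squared_diff

# A player at a 30-point team whose neighbours play for a 50-point team...
ratio_low, sq_low = points_scores(30, [50])
# ...against a player at a 60-point team whose neighbours play for an 80-point team.
ratio_high, sq_high = points_scores(60, [80])

print(round(ratio_low, 2), sq_low)
print(round(ratio_high, 2), sq_high)
```

Both players are 20 points short of their neighbours. The ratio favours the first player, while the squared difference favours the second, in line with the idea that the step from 60 to 80 points is the harder one.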

There are some good names on there. Redmond definitely seems to be a good catch by Southampton. Now, remember that this methodology isn’t meant to be a magic crystal ball. Some people who know I do this type of work constantly ask me: “So, who’s going to win the league? Who’s the next Messi?” They fail to understand the subtlety of what there is to actually gain from data mining. Ashley Westwood might be great, but then again he might not. Nevertheless, some players from last season whom traditional scouting methods seem to like, and who were recruited by larger teams, are also liked by this method.

It’s pretty remarkable that this methodology seems to be so rich when it is ignoring a lot of relevant information such as shots, goals, tackles, etc. It only sees what is visible in the passing network, which seems to be enough to make some decisions that very informed professional recruitment makes like picking up Gueye, Redmond, Sissoko, Ashley Williams, etc. That doesn't mean that it has all the answers. For example, it might like a centre back who is good at playing the ball out from the back but it has no way of knowing if he is also defensively sound. If complemented with more sources of information (such as direct traditional scouting), however, this type of work can be very useful for clubs.

Finally and on a bit of a sourer note, I thought it might also be interesting to look at some of the “worst” opportunity players; that is to say players who are similar to players from much worse teams than their own. I had to exclude Leicester players because their players completely dominated the top ten in most lists I drew up; even Kante, Mahrez and Vardy. I’m not sure what to make of this, because even though they were the champions, their players aren’t particularly similar to the players from other top teams. Just so it doesn’t seem like I’m slipping something past you, here are the 10 closest players to Kante, Mahrez and Vardy:

Now, excluding players from Leicester, here are the “worst” opportunities using the “squared points difference” metric outlined above:

Make of it what you will, but keep in mind that everything has a context and even if I claim this method sees a lot, I also recognise it doesn't know everything. If you see a name in there you don't like, keep calm, take a deep breath, look through the closest players and have a think about what might be going on.

Monday 31 October 2016

Passing Motifs at a Player Level: Player Passing Style

This is a pretty exciting entry, so bear with me if it gets a bit long; I think it’s worth it…

Ever since the first entry on Passing Motifs I mentioned the potential of extrapolating the methodology to study passing styles at a player level. That first entry mentioned the idea set forth by Javier Lopez and Raul Sanchez to answer the question “Who can replace Xavi?”. Nevertheless, that particular example always left me wanting for more because the outcome was noticeably skewed towards players from Barcelona and a few other teams like Arsenal and even Swansea surprisingly. It made me think that the methodology was ignoring individual player traits and rather picking up stats that are reflective of the team the player plays for, not of the player himself.

I’ve been thinking ever since what the best way to extract player passing style from passing motifs is. Here are some of the ideas I’ve had:

A first objective is to neutralise the effect of the team passing style on a player. If a team proportionately uses ABAB a lot, then inevitably so will the players. Therefore, if you put Fernandinho in Barcelona, his motif frequencies will start to resemble those of the whole team without it having been something inherent to him all along. The idea I had was to view how a player’s relative motif frequencies diverged from his team’s frequencies in each match. That is to say, if in a match Arsenal performed 40% of its motifs as ABAC and 43% of the motifs Coquelin was involved in were ABAC, then Coquelin had a +3% for that motif for that match. Averaging over the whole season, Coquelin could be seen as a 5-dimensional vector where each entry corresponds to his average divergence for each of the 5 motifs. When the performance of this vectorisation is measured through the methodology outlined in my previous entry using data from the 2014-15 and 2015-16 seasons of the Premier League (only players who had at least 18 appearances to avoid outliers), this was the result:

The fairly negative z-scores reveal that this methodology has an agreeable stability for those two seasons and is therefore picking up on some underlying quality of the players’ playing style.
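A minimal sketch of this divergence vectorisation, reusing the Coquelin numbers from the example (40% for the team, 43% for the player). The motif names ABCB and ABCD are my assumption for the two motifs not named in this entry, and all the counts are invented.

```python
MOTIFS = ("ABAB", "ABAC", "ABCA", "ABCB", "ABCD")  # last two names are assumed

def relative_frequencies(counts):
    """Turn raw motif counts into shares that add up to 1."""
    total = sum(counts.values())
    return {m: counts[m] / total for m in MOTIFS}

def match_divergence(player_counts, team_counts):
    """Player share minus team share for each motif, in one match."""
    p = relative_frequencies(player_counts)
    t = relative_frequencies(team_counts)
    return {m: p[m] - t[m] for m in MOTIFS}

def season_average(per_match):
    """Average the per-match dictionaries motif by motif."""
    return {m: sum(d[m] for d in per_match) / len(per_match) for m in MOTIFS}

# Match 1: the team uses ABAC 40% of the time, the player 43% of the time.
team_1 = {"ABAB": 10, "ABAC": 40, "ABCA": 20, "ABCB": 10, "ABCD": 20}
player_1 = {"ABAB": 10, "ABAC": 43, "ABCA": 17, "ABCB": 10, "ABCD": 20}
# Match 2: invented counts.
team_2 = {"ABAB": 20, "ABAC": 30, "ABCA": 20, "ABCB": 10, "ABCD": 20}
player_2 = {"ABAB": 10, "ABAC": 25, "ABCA": 10, "ABCB": 2, "ABCD": 3}

d1 = match_divergence(player_1, team_1)
d2 = match_divergence(player_2, team_2)
season_vector = season_average([d1, d2])
print({m: round(v, 3) for m, v in season_vector.items()})
```

Because both sets of shares add up to 1, the five divergences of a match always add up to zero. The `relative_frequencies` helper on its own gives the plain percentage representation of a player's match.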

Just as we did for team motifs, instead of considering the raw values of motifs a player performed, we consider each performance in a match by a player as a 5-dimensional vector in which each entry is the percentage of the player’s total motifs that that motif corresponds to. So we can represent a match played by Romelu Lukaku as 5% ABAB, 13% ABAC, 25% ABCA, etc. Averaging over a whole season, each player is represented by a 5-dimensional vector.

Once again, we’re reasonably happy that this vectorisation is picking up on stable player qualities.

Another way of seeing that data which I felt might be useful is seeing each player’s match as the proportion of each motif his team performed that he participated in. That is to say, if Southampton completed 50 instances of ABAB in a match, and Jordy Clasie participated in 25 of those, he would have a 50% score for ABAB in that match. If in that same match Southampton completed 80 instances of ABAC and Clasie participated in 20, he would have a 25% score for that motif. Applying this logic to the 5 different motifs and averaging over the whole season, each player is once again represented by a 5-dimensional vector. This is how well it performs:

Out of the three 5-dimensional vectorisations we have shown so far, this is by some margin the one which performs the best. Both its z-scores are considerably lower than those of the other two, meaning it’s capturing pretty stable information for each player.
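The share-of-team-motifs idea is simple enough to write out directly. The sketch below reuses the Clasie numbers from the example; the helper name `participation_share` is hypothetical, and only two motifs are shown to keep it short.

```python
def participation_share(player_counts, team_counts):
    """Fraction of the team's instances of each motif that the player took part in."""
    shares = {}
    for motif, team_total in team_counts.items():
        involved = player_counts.get(motif, 0)
        # Guard against a motif the team never completed in this match.
        shares[motif] = involved / team_total if team_total else 0.0
    return shares

team = {"ABAB": 50, "ABAC": 80}
player = {"ABAB": 25, "ABAC": 20}
share = participation_share(player, team)
print(share)
```

Unlike the percentage representations, these shares need not add up to 1 across motifs, so a heavily involved player scores high on every entry.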

In the first entry regarding passing motifs we mentioned how the motifs could be vectorised in a 15-dimensional vector for players. To refresh your memory, for an ABAC sequence a player could participate as the A player, the B player or the C player. It’s straightforward to count that looking at all 5 motifs there are 15 “participation” possibilities for each player. If we count how many times each player was each letter in each of the 5 motifs, we are left with a 15-dimensional vector representing each player. This is basically the methodology used in the “Who can Replace Xavi?” article.

Comparing things in different dimensions is rather difficult and not too standardised in mathematics, but I would dare say that it performs worse than the previous 5-dimensional vectorisation, especially considering Z-Score 1, which is the most important indicator.
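To make the count of 15 participation possibilities concrete, here is a sketch that relabels a three-pass sequence as a motif and records which letter a given player was. The motif names ABCB and ABCD, the helper names and the sequences are my own assumptions, and the sequences are invented rather than taken from match data.

```python
from collections import Counter

MOTIFS = ("ABAB", "ABAC", "ABCA", "ABCB", "ABCD")  # last two names are assumed
# One slot per (motif, letter) pair: 2 + 3 + 3 + 3 + 4 = 15 slots.
ROLE_SLOTS = [(m, letter) for m in MOTIFS for letter in sorted(set(m))]

def motif_and_roles(sequence):
    """Relabel four consecutive ball holders as letters in order of appearance."""
    labels = {}
    for player in sequence:
        if player not in labels:
            labels[player] = "ABCD"[len(labels)]
    return "".join(labels[p] for p in sequence), labels

def role_vector(sequences, player):
    """Count how often the player filled each of the 15 slots."""
    counts = Counter()
    for seq in sequences:
        motif, labels = motif_and_roles(seq)
        if player in labels:
            counts[(motif, labels[player])] += 1
    return [counts[slot] for slot in ROLE_SLOTS]

# Invented sequences; consecutive names are assumed to differ (no self-passes).
sequences = [
    ("Coutinho", "Lallana", "Coutinho", "Firmino"),  # ABAC, Lallana is B
    ("Lallana", "Coutinho", "Lallana", "Coutinho"),  # ABAB, Lallana is A
    ("Firmino", "Lallana", "Milner", "Firmino"),     # ABCA, Lallana is B
]
vector = role_vector(sequences, "Lallana")
print(len(ROLE_SLOTS), vector)
```

Dividing each entry by the sum of the vector would give relative frequencies instead of raw counts.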

Finally, we can take this 15-dimensional idea and slightly alter it to not count the total of each pseudo-motif but rather what their relative frequencies are, so once again do something like if Dimitri Payet performed the B in an ABAC 15 times out of 100 total motifs he participated in, that pseudo-motif has a score of 15%. Once again, each player is represented by a 15-dimensional vector:

Immediately we appreciate that this is the best performing of all the vectorisations we have seen.

Now, the first thing we must say is that all the 5 different ways of obtaining player vectors shown here show evidence of uncovering some stable and underlying qualities of players’ passing style. We have used the indicators to compare them and discuss which might be better, but there is no way of determining whether some information which one of them is picking up on is missed by another.

Here’s the advantage: there is no downside to combining them all. If we simply glue together all these representations to make one long 45-dimensional (5+5+5+15+15) vector representation for players, then all the qualities that each methodology picked up on are represented to some degree. If two players were similar across all representations, they will be similar in the long one as well; if two players were similar across some of the representations but not others, then they will be mildly similar depending on how dissimilar they were in the others; etc.
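Gluing the representations together is just concatenation. A minimal sketch, assuming each representation is stored as a dictionary from player name to a list of numbers (the storage format, the helper name and the toy values are mine):

```python
def concatenate(*representations):
    """Join several per-player vectors into one long vector per player."""
    # Keep only players present in every representation.
    common = set(representations[0])
    for rep in representations[1:]:
        common &= set(rep)
    return {
        player: [x for rep in representations for x in rep[player]]
        for player in sorted(common)
    }

# Toy representations with the dimensions used here: 5, 5, 5, 15 and 15.
divergence = {"X": [0.1] * 5, "Y": [0.2] * 5}
percentages = {"X": [0.2] * 5, "Y": [0.2] * 5}
team_share = {"X": [0.3] * 5, "Y": [0.1] * 5}
role_counts = {"X": [1.0] * 15, "Y": [2.0] * 15}
role_shares = {"X": [0.05] * 15}  # "Y" is missing here, so he is dropped

long_vectors = concatenate(divergence, percentages, team_share, role_counts, role_shares)
print(list(long_vectors), len(long_vectors["X"]))
```

One caveat I would flag: raw counts and percentages live on very different scales, so some per-coordinate normalisation before measuring distances is probably needed for the smaller-scale blocks to contribute at all.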

Here is the performance of this long 45-dimensional vectorisation:

The results are very satisfying and it proves to be a robust vectorisation for player playing style, more than 1 standard deviation below the mean distance between all players and more than 4 standard deviations below the Gaussian distances, even in this very high dimensional space.

This vectorisation will surely provide me with a lot of material to explore for a good while; it’s even a little frustrating not finding an easy visual way in which to convey it to the readers. Let’s settle for now on a hierarchical clustering dendrogram as a visualisation tool.

Below is a link for the pdf for the hierarchical clustering dendrogram applied to the data set for the 2015-16 season of the Premier League (only players who played in over 18 matches). Since there are 279 players, the tree labels are really tiny so the image couldn't be uploaded onto the blog directly, but on the pdf you can use your explorer's zoom to explore the results.

https://drive.google.com/file/d/0Bzvjb5fnv1HtZjFtRDJjUVBua0E/view?usp=sharing

If you'd rather not, here is a selection of the methodology's results:

Mesut Ozil has one of the most distinctive passing styles in the league. Cesc Fabregas is the player closest to him and together they form a subgroup with Juan Mata, Ross Barkley, Yaya Toure and Aaron Ramsey.
Alexis Sanchez is in a league of his own but the players with the most similar passing style are Payet, Moussa Sissoko, Jesus Navas, Sterling and Martial.
Troy Deeney is in the esteemed company of Aguero, De Bruyne, Oscar and Sigurdsson.
David Silva, Willian, Eden Hazard and Christian Eriksen are all pretty similar.
Nemanja Matic, Eric Dier and Gareth Barry have a similar passing style.
M’Vila, Lanzini, Capoue, Puncheon, Ander Herrera and Drinkwater are all similar, pretty good and perhaps underrated.
Walcott, Iheanacho, Scott Sinclair, Jefferson Montero, Wilfried Zaha, Bakary Sako, Albrighton, Bolasie and Michail Antonio form a subgroup of similar wingers.
Giroud is more similar to some rather underwhelming strikers such as Gomis, Cameron Jerome and Papiss Cisse than to world class strikers. The same can be said of Harry Kane being similar to Arouna Kone, Son and Marc Pugh. Maybe the methodology is not as convincing for strikers?
Shane Long and Odion Ighalo are good alternatives to Jamie Vardy.
Diego Costa and Lukaku are similar to Rooney.
Victor Moses, Aaron Lennon and Jordon Ibe are similar.
Mahrez is similar to Sessegnon, Nathan Redmond and Jesse Lingard. Did Southampton know this?
Matt Ritchie (ex-Bournemouth now at Newcastle) is in a group with Lallana, Alli, Pedro and Lamela. An opportunity for the taking?
Angel Rangel has (and has always had) unusual stats for a full-back.
The methodology recognises who the goalkeepers are and sets them apart without this information being explicitly available in the datasets. The same applies to many other players from similar positions who are grouped together, like the CBs and full-backs.

This is a poor man’s substitute to actually exploring the dendrogram yourselves. Not to mention that a clustering dendrogram is not even the most faithful representation of the information being collected by this vectorisation, but I’m more than happy with the results and feel there is some real promise to the methodology. If I can come up with some better visualisations for the results I’ll post those later on.

Please have a look through the results from the dendrogram and comment on whether you feel we’re getting close to convincingly capturing player passing style through passing motifs.

Distinguishing Quality from Random Noise: How do we know we’re getting valuable information?

One of the main challenges for football analysts is ensuring that their manipulation of the available data is in fact uncovering underlying “qualities” of teams and players, instead of just randomly picking up statistical noise or irrelevant facts. I could certainly use the available data and assign a number to each player by summing up the number of blocked shots plus the square root of the number of headed shots inside the area divided by the goal difference his team obtained with him on the pitch multiplied by his number of interceptions. Can I use this number in any way to advise a club on whether they should buy him? Probably not. How can I know what is valuable?

Recall from the previous entries on team passing motifs that a main reason why I stated that the methodology was picking up on a stable quality of passing style was the fact that it was stable for consecutive seasons. If the methodology was just randomly assigning motif distributions, then surely there would be no consistency between different seasons.

The implication then is this: if a certain vectorisation of the data is in a certain sense “stable” across seasons, then this vectorisation is representative of an underlying quality of the data observations. Metrics intended to measure qualities which one would expect to be stable over seasons such as “playing style” or “potential” should be able to be validated in this way.

The question then is how would the details of this validation go. In this entry, I’ll go through a “validating methodology” that I’ve been working on lately:

Take a vector representing a team or a player for a given season (something like the 5-dimensional vector representing a team in the passing motifs methodology). If my reasoning above is correct, if the vector contains valuable information regarding that player/team, then if I consider the equivalent vector for the season directly before, they should in theory be in some sense “close” to each other. The “closeness” of two vectors is of course a relative concept, so this should be measured in relation to the average distance between any pair of vectors.

As an example: If Juan Mata’s vector for 2014-15 is at a distance of 2.3 from his vector for 2015-16, and on average the distance between any two player vectors (not necessarily from the same player) in this context is 9.5, then we can say with reasonable certainty that Juan Mata’s vectors are “close” to each other.

The method which I wrote out takes as parameters the two vectorisation matrices for two consecutive seasons, normalises them, considers only players who have played at least 18 matches in each season; and prints out the following:

Here’s what we want to look at on this table: First of all, the lower the mean distance between the two vectors of each player, the better our methodology is according to the reasoning above. However, a “low” number is a very relative concept, so we need a reference with which to measure how low this number actually is. This methodology provides two such references:

The first and most important one is the mean distance between all vectors, not just between the two corresponding to each player. This gives us an idea of how far apart any two vectors in this context are and whether the “closeness” of vectors of the same player is significant. Z-Score 1 is the mean distance between the vectors from the same player minus the mean distance between all vectors, divided by the standard deviation of the distance between all vectors. The lower this number (accepting negative values of course), the better.
The second reference provided is the mean distance between Gaussian vectors in the dimension of the problem. Z-Score 2 is the mean distance between the vectors from the same player minus the mean distance between the simulated Gaussian vectors, divided by the standard deviation of the distance between the Gaussian vectors. I feel this is also an important frame of reference because it gives a measure of just how “normalised” the scaled problem is. It also provides important “dimensional” context: if our vectorisation is in 15 dimensions as opposed to 5 dimensions, then the raw distances will increase, but this does not necessarily mean that the higher dimensional vectorisation is less valuable. It simply means that the numbers we deal with in higher dimensions are naturally larger, and we need to know this to know how “low” our mean distance number really is. Hence the importance of the Z-Scores.
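The two indicators can be sketched roughly as follows. This is my own reconstruction under stated assumptions, not the actual method: each coordinate is z-scored across both seasons, distances are Euclidean, the Gaussian reference is simulated with as many vectors as there are in the data, and the sign is chosen so that lower (more negative) is better. The function names and the synthetic data are hypothetical, and the filter on 18 appearances is left out.

```python
import math
import random
import statistics
from itertools import combinations

def standardise(rows):
    """Z-score every coordinate (column) across all rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stds)) for row in rows]

def z_score(same_player_mean, reference_distances):
    """(same-player mean - reference mean) / reference std; negative is good."""
    ref_mean = statistics.fmean(reference_distances)
    ref_std = statistics.pstdev(reference_distances)
    return (same_player_mean - ref_mean) / ref_std

def stability_report(season_a, season_b, seed=0):
    players = sorted(set(season_a) & set(season_b))
    n = len(players)
    rows = standardise([season_a[p] for p in players] + [season_b[p] for p in players])
    # Distance between the two seasons of the same player.
    same = [math.dist(rows[i], rows[i + n]) for i in range(n)]
    # Reference 1: distance between any two vectors in the data.
    all_pairs = [math.dist(u, v) for u, v in combinations(rows, 2)]
    # Reference 2: distance between simulated standard Gaussian vectors.
    rng = random.Random(seed)
    dim = len(rows[0])
    gauss = [tuple(rng.gauss(0, 1) for _ in range(dim)) for _ in range(len(rows))]
    gauss_pairs = [math.dist(u, v) for u, v in combinations(gauss, 2)]
    mean_same = statistics.fmean(same)
    return {
        "mean_same_player": mean_same,
        "mean_all_pairs": statistics.fmean(all_pairs),
        "z_score_1": z_score(mean_same, all_pairs),
        "z_score_2": z_score(mean_same, gauss_pairs),
    }

# Synthetic check: 30 made-up players with a stable style plus small seasonal noise.
rng = random.Random(1)
base = {f"player_{i}": [rng.gauss(0, 1) for _ in range(5)] for i in range(30)}
season_a = {p: [x + rng.gauss(0, 0.1) for x in v] for p, v in base.items()}
season_b = {p: [x + rng.gauss(0, 0.1) for x in v] for p, v in base.items()}
report = stability_report(season_a, season_b)
print({k: round(v, 2) for k, v in report.items()})
```

On this synthetic data, where each player really does have a stable underlying vector, both z-scores come out clearly negative. Replacing `season_b` with freshly drawn random vectors should push them back towards zero.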

This entry was a bit technical and perhaps less interesting for the average football fan, but I thought it was important to explain it because it’s what I’ve been using to understand how to best translate the passing motifs problem to a player context. I’m looking to follow up this entry with an applied example comparing different vectorisations of passing motifs at a player level very soon (2-3 days hopefully if I can find the time), so stay tuned!

Monday 10 October 2016

New Season, New Ideas

I spent the majority of the summer off in Colombia and then Croatia and took a bit of a break from football and math. But now I’m back in London and settling into my routine, and even though I didn’t spend much time in front of the computer over the summer and have no finished results to show you yet, even on holiday I can’t stop my mind from drifting off towards football and new ideas that could be applied. I’m going to use this entry to tell you about the plans I made to explore some of these ideas during this new “Analytics Season” leading up to the 2017 OptaPro Forum in February where I’m hoping to get the chance to present them.

Up until this point, I’ve spent most of the entries speaking about team passing sequences and the results of their quantification through network motifs. This is a very interesting topic, and I think there is still more to come from this. These are some areas where I still hope to do some more work:

The vectorisation through motif frequencies can be refined with some more information. For example, I’ve been thinking that different instances of ABAC represent very different kinds of combination play. An ABAC passing sequence can be composed of a short one-two between players A and B, and in the third and final pass A plays a long ball to player C. Alternatively, it can simply be composed of 3 short passes. The distance of the third pass should be the main factor differentiating different types of ABAC, because in the vast majority of cases the ‘ABA’ part will be made up of short passes (If Coutinho gives it to Lallana and Lallana gives it back to Coutinho, we don’t expect either of those passes to have been long otherwise Coutinho would have to run a large distance after his first pass). When the weights of the first principal component of the lengths of each pass in all instances of ABAC are looked at, effectively almost 97% of the variance is on the length of the third pass. It remains to be seen whether this ‘refinement’ can be used to further discern and distinguish team playing style.

NOTE: Think of Principal Component Analysis as a method assigning coefficients to what features contain the most variance in a set of data. If we have the height and weight of a population of hippos and a population of zebras, the height of the whole set is roughly the same but the weight differs a lot, and Principal Components Analysis tells us precisely this: the weight is where the variance is.
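The hippo and zebra picture can be checked numerically. The sketch below computes the first principal component of 2-dimensional data in closed form, using the eigen-decomposition of the 2x2 covariance matrix. The animal measurements are invented, and the helper name is mine.

```python
import math
import statistics

def first_principal_component(xs, ys):
    """Unit weights and explained-variance share of the first PC of 2-D data."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(xs)
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Matching eigenvector; the axis-aligned case is handled separately.
    if b != 0:
        vx, vy = b, lam - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam / (a + c)

# Invented measurements: four hippos followed by four zebras.
heights_cm = [150, 155, 148, 152, 140, 145, 138, 142]
weights_kg = [1500, 1600, 1400, 1550, 350, 380, 320, 360]

component, share = first_principal_component(heights_cm, weights_kg)
print([round(w, 4) for w in component], round(share, 5))
```

Almost all of the weight of the first component falls on the second coordinate (the animal's weight). Note that this runs on raw, unstandardised values, so the result depends on the units chosen.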

Even though we’ve spoken about “playing style” convincingly through this methodology, we still haven’t related this vectorisation to “success on the field” yet, which is actually what ultimately matters. It would be interesting for example to use Topological Data Analysis (you can read about it in my previous entry) to map out the motif vectorisation and discover where success in the league is being accumulated. We can also fit a probability distribution that gives the probability of a certain motif structure leading to a top 4 finish for instance. In this sense, we could potentially advise clubs on what they need to change in their passing play to increase their probability of finishing in the top 4, or of not being relegated, or of winning the league, etc.
As I said before and showed with the Xavi example, this ‘passing motifs’ idea can be simply extrapolated to a player level by vectorising each player’s frequency of participation in the different motifs. Once players are represented as vectors in this high-dimensional space, we can apply a whole arsenal of methodologies to answer questions such as which player is better suited to replace an outgoing player, which players have a similar style, how individual players affect the passing motif structures of their teams. It remains to be seen whether this approach will quantify a meaningful underlying quality in players as we have shown it does for teams, but certainly, more information is preferable to less information.
These topics (team style, player style and recruitment) can be combined; so for example we can advise clubs on how the recruitment of a certain player will affect their passing motif structure, and whether this change will improve their probability of a top 4 finish.

These are ambitious plans, but I think that they are important ones because I must admit that a (fair) criticism that could be thrown my way is that so far the results are very interesting theoretically but difficult to translate into actual practical and applied contexts within the industry. This is fair, but it doesn’t take away in the least bit from the value of the results obtained. The passing motifs methodology has a lot going for it. It proved to be consistent across different seasons, which is strong evidence that it identifies some underlying inherent property which we called “passing style”, instead of just randomly picking up statistical noise. It was also used to identify a passing style unique to Leicester City which was present even before their title winning season, something that no one could have predicted or expected. As I said, it has a lot going for it.

The key counter-argument for this criticism is this (opinion): there doesn’t have to be an obvious, direct and immediate practical application for theoretical work to be valuable for a field. I strongly suggest those with a true interest in the topic of Football Analytics read this fascinating entry from Statsbomb author Marek Kwiatkowski. Here’s an excerpt if you can’t be bothered to read the whole thing:

“(I) believe that we have now reached the point where all obvious work has been done, and to progress we must take a step back and reassess the field as a whole. I think about football analytics as a bona fide scientific discipline: quantitative study of a particular class of complex systems. Put like this it is not fundamentally different from other sciences like biology or physics or linguistics. It is just much less mature. And in my view we have now reached a point where the entire discipline is held back by a key aspect of this immaturity: the lack of theoretical developments. Established scientific disciplines rely on abstract concepts to organise their discoveries and provide a language in which conjectures can be stated, arguments conducted and findings related to each other. We lack this kind of language for football analytics. We are doing biology without evolution; physics without calculus; linguistics without grammar. As a result, instead of building a coherent and ever-expanding body of knowledge, we collect isolated factoids.”

When looking for conceptual theoretical developments, passing network motifs fit the bill of a consistent and robust concept with a clear underlying motivation (representation of “passing style”). Practical applications will inevitably follow from this maturation of the discipline, and I have already outlined above some much more practical approaches which can be looked into.

Finally, and before this entry gets any longer, this approach has given me valuable insight into a type of conceptual processing that can be done to raw football data in order to obtain a meaningful representation. Football events during a match are very dynamic, complex and interdependent, but they codify all the necessary information to determine results, quality, potential, etc. The network motifs approach suggests taking the constituent blocks of the passing events graph and applying an equivalence relation on the identity of the nodes in order to study their nature (this simply means that instead of focusing on the specific players performing the passes, we consider for instance any occurrence of a one-two as belonging to the same “class” of pass motif regardless of who the specific players were). It has made me think: why not attempt this with other types of events? Consider for example having a directed graph representing a team’s performance, but instead of the nodes representing players and the edges passes, each node can be seen to represent an “area” of the pitch and an edge is simply the act of going from one area to another through a pass, dribble, etc. A sequence in this network is simply a movement between different areas. The ‘equivalence relation’ on the identity of the nodes which I think would be useful for this approach would work something like this: a play starting in one area, then moving to an area three spaces to the right through a pass, and then forward 2 spaces through a dribble would be classified exactly the same regardless of the players involved and of whether it started inside our own penalty box or on the halfway line.
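One way this translation-invariant classification might look in code. Everything here is an assumption for illustration: the pitch is a grid of (column, row) cells, "right" is +1 in the column index, "forward" is +1 in the row index, and `play_class` is a hypothetical helper.

```python
from collections import Counter

def play_class(cells, actions):
    """Describe a play by its relative moves, ignoring where it started."""
    steps = []
    for (x0, y0), (x1, y1), action in zip(cells, cells[1:], actions):
        steps.append((x1 - x0, y1 - y0, action))
    return tuple(steps)

# The same shape of play: three cells to the right with a pass,
# then two cells forward with a dribble, from two different starting areas.
from_own_box = play_class([(2, 0), (5, 0), (5, 2)], ["pass", "dribble"])
from_halfway = play_class([(1, 6), (4, 6), (4, 8)], ["pass", "dribble"])
other_play = play_class([(3, 3), (3, 4)], ["pass"])

motif_counts = Counter([from_own_box, from_halfway, other_play])
print(motif_counts.most_common(1))
```

Counting how often each class occurs for a team or player would then give a frequency vector in the same spirit as the passing motifs.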

Vectorising team or player performance through the frequency of the motifs in this context could lead to a very robust quantification of playing style, performance metric, probability of success… who knows?! I don’t yet, but let’s hope I can find out from here to February.

Thursday 21 July 2016

Quantifying Passing Subsequences: The Mysterious Case of Leicester City Part 2

In the previous entry we left off having seen the result of a clustering dendrogram for the 5-dimensional representation of teams corresponding to the ratio at which they use the 5 passing sequences using data from the 2015-2016 Premier League season.

It came as a surprise that Leicester was singled out by the method as the team with the most distinctive passing style in the league. But then again, Leicester were eventually crowned champions, so surely something qualitative is there to be found. The problem is untangling the true causality relation of what is being discovered. Saying “Leicester were champions because they earned the highest number of points” is a bit moronic. Something like “Leicester were champions because they scored the third-highest number of goals and conceded the second-fewest” or “Leicester were champions because they were able to name an unchanged starting XI the most times in the season” can provide a bit more insight, but ultimately, when using data from the same season it can be difficult to decipher the true causality of discoveries; i.e., did X happen because Leicester had the potential to be champions, or did Leicester have the potential to be champions because of X?

The essential question, then, that the whole football world wants answered is whether Leicester’s championship run could have been predicted BEFORE the campaign kicked off. Surely the sports trading community would be interested.

To investigate deeper, we went back to the data from the 2014-2015 season, when Leicester were almost certainly doomed to relegation but miraculously went on an incredible winning run in the season’s final stretch that saw them go from being 7 points from safety in April to ending comfortably 6 points above the relegation zone. No team in the history of the Premier League had ever remained in the first division having fewer than 20 points by the 29th fixture (Leicester had 19).

Could anyone have predicted Leicester’s exploits back then? Should we have known?

This was the resulting dendrogram for the data from the 2014-2015 season using the same methodology from the previous entry:

There are several important things to say regarding the results. First of all, forgetting about Leicester for a minute, it’s very satisfying to see that many of the same pairings from the 2015-2016 season are maintained. Arsenal-Manchester City, Tottenham-Chelsea and Crystal Palace-Sunderland are all examples of pairings that arise in both cases. There are other general trends that are respected, like Liverpool, Southampton and Swansea being similar, and Leicester, Arsenal and Manchester City forming the leftmost group in both cases with either Watford or Aston Villa. This is important because the probability of this happening (similar groups for both seasons) if the method were randomly pairing teams would be extremely low. This means that the method is identifying something (which I will call passing style) which is consistent in teams across a pair of consecutive seasons. This ‘satisfying consistency’ can also be seen in data for the 2013-14 and 2012-13 seasons, for which I also replicated the method.

Let’s return now to Leicester. Just as in the case of the 2015-16 season, Leicester is the team that joins a subgroup highest up the clustering tree, meaning its passing style has the weakest bond to any other group of teams, i.e. it is the most distinctive. There is a very important caveat so we don’t get carried away: “being distinctive” is in no way equivalent to “being successful”. In fact, the second most “distinctive” team is Burnley, who were relegated at the end of 2014-15. Both Leicester and Burnley completed a relatively low total number of motifs, but this doesn’t necessarily explain their distinctiveness, since both QPR and Crystal Palace completed fewer motifs than them and have relatively “strong” bonds with other teams. Also, a truly fascinating characteristic of Leicester’s results is that in both seasons Leicester’s passing style forms a subgroup with those of Arsenal and Manchester City, arguably the “passing powerhouses” of the Premier League.

To answer the question posed before regarding whether we should have known about Leicester, I would cautiously say “No”. No, I very much doubt any concrete methodology would have pointed to Leicester as the eventual winner. However, keeping in mind that “being distinctive” is not synonymous with “being successful” (poor relegated Burnley), the truth is that with this data before the start of the 2015-16 season I could have said to pundits: ‘Hey, keep an eye on Leicester, there’s something interesting going on there (they are distinctive and are close to Arsenal and Manchester City)’. Moreover, I would also predict that if Leicester keep their players over the summer, this “style” which has led them to be distinctive in both the 2014-15 and 2015-16 seasons will still be there and could once again lead them to success. I wouldn’t go as far as saying they’ll win it again, but I think they’ll be in the contest. Then again, I could be completely wrong and Leicester’s fortunes could fall off a cliff in the upcoming season; but I know better than to think that means that everything I’ve said here is wrong. I hope the readers do as well (if Leicester end up doing well again, I would also be cautious about the omnipotence of my methods; statistics and probability are all about being better informed about the chaotic randomness of the world, not about fortune telling…).

I’ll keep on trying to see what else this methodology has to give. I suspect some sort of “tree/dendrogram” method could be used to quantify how much success (higher finishes in the league table) is being accumulated in what areas of the tree and what a team’s position in the dendrogram says about its final league position. Also, as I mentioned a couple of entries back when I first spoke about this methodology, the really interesting bit could be extrapolating the method to discover how well prospective recruitments will fit within a team’s passing style. I also hope to have a go at this. Finally, some additional variables could also be integrated into the methodology to further distinguish passing sequences. For example, a completely vertical instance of ABCD is very different from a sequence of ABCD composed of horizontal square passes. Integrating this is also something I’m working on.

Keep an eye on the blog to see how it all unfolds.

Friday 8 July 2016

Quantifying Passing Subsequences: The Mysterious Case of Leicester City

This entry follows up with the previous entry's idea for quantifying teams' and players' passing styles through 3-passes long motifs (if you haven't read it I recommend you do so before reading this one).

Now, I decided to attempt to replicate the results shown previously from the Spanish La Liga using last season's Premier League data. In my application, I quantify the raw passing data by counting the number of times each motif occurs for each team in each match. The table below shows the total number of times each team performed each of the five motifs throughout the whole season.

As we can appreciate, Arsenal and Manchester City are either 1st or 2nd for every motif category. However, since teams like Arsenal and Manchester City complete the most passes in a season, it is to be expected that they also complete the most motifs in each category.

A different way of looking at this data then is to analyse the relative frequencies of the motifs as a percentage of the total number of motifs completed by a team during a match. That is to say, regardless of how many motifs were completed in total by a team, we want to look at what percentage of them were ABAB, what percentage of them were ABAC, etc.
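As a minimal sketch of that normalisation for a single match (the function name and the counts below are made up purely for illustration, not taken from the real data):

```python
def motif_shares(counts):
    """Convert one match's motif counts into percentages of that match's total."""
    total = sum(counts.values())
    if total == 0:
        # A match with no completed motifs has no meaningful shares.
        return {motif: 0.0 for motif in counts}
    return {motif: 100.0 * n / total for motif, n in counts.items()}

# Hypothetical counts for one team in one match.
match_counts = {"ABAB": 10, "ABAC": 30, "ABCA": 20, "ABCB": 15, "ABCD": 25}
shares = motif_shares(match_counts)

for motif, share in shares.items():
    print(f"{motif}: {share:.1f}%")
```

Two teams with very different passing volumes can then be compared directly, since each match's shares always add up to 100%.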

The following boxplots show the distribution of the relative frequency of each motif during each match for each team of the 2015-16 Premier League season:

Now, isn't that interesting?! Leicester emerge as a team with a distinctive playing style now. If you return for a moment to the previous entry you can see that both Barcelona and Leicester “win” in the ABAB and ABAC categories and noticeably “lose” in the ABCD category (this similarity isn’t there in the other two categories). I'm obviously not claiming that Leicester and Barcelona have a similar style, I'm sure I would lose all credibility with football fans and might as well just close the blog. The main difference is that Barcelona don't only win the relative frequency battle, but also the overall total usually completing many more passes than their opposition. Leicester have a much more modest return in overall motifs completed (i.e. many fewer passes completed). In fact they complete the second lowest amount of motifs overall (second only to WBA and only marginally below Sunderland), but for the amount of motifs they do complete, there seems to be something there in the sense that they tend to proportionally perform a distinctive choice of passing sequences/motifs.

In fact, forget about the whole Barcelona thing for a moment. Even without having ever seen those results for La Liga, the methodology is pointing towards Leicester as the team with the unique style in the Premier League. The following figure shows the Clustering Dendrogram for the data viewing each team as a vector in R^5 where each feature is the mean percentage each motif constitutes of the total for each match of the season:

NOTE: It’s not easy to explain briefly what a clustering dendrogram is or means so please refer to any of the good sources on mathematics widely available (Wikipedia is pretty good), but basically it represents how teams are sequentially grouped according to their similarity (distance in R^5). For example, we can see that Leicester, Watford, Arsenal and Manchester City are a “group” but within that group there are two subgroups consisting of Leicester on its own and the other three, and similarly Arsenal is more tightly grouped with Manchester City than with Watford. The higher in the tree a grouping is made, the “weaker” its bond is. With that in mind, Leicester is the team with the “weakest bond” to any other subgroup of teams.
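For readers who prefer code to prose, here is a minimal sketch of how such a tree is built. It assumes Euclidean distance and single linkage (the distance between two groups is the distance between their two closest members); this entry does not state which linkage was actually used, so treat that choice as an assumption. The team names and vectors are invented.

```python
from itertools import combinations
from math import dist

def single_linkage_merges(points):
    """Agglomerative clustering with single linkage.

    Returns (height, cluster_a, cluster_b) tuples in merge order, where
    height is the smallest distance between members of the two clusters.
    """
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        height, a, b = min(
            (
                (min(dist(points[p], points[q]) for p in x for q in y), x, y)
                for x, y in combinations(clusters, 2)
            ),
            key=lambda item: item[0],
        )
        merges.append((height, a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Made-up mean motif percentages (ABAB, ABAC, ABCA, ABCB, ABCD).
teams = {
    "Team A": (10, 30, 20, 15, 25),
    "Team B": (11, 29, 20, 15, 25),
    "Team C": (10, 28, 21, 16, 25),
    "Team D": (16, 34, 18, 14, 18),
}
merges = single_linkage_merges(teams)
for height, a, b in merges:
    print(f"{height:.2f}: {sorted(a)} + {sorted(b)}")
```

The later (higher) a team is merged, the weaker its bond to the rest. In this toy data Team D is the last to join, which is the sense in which Leicester is the most distinctive team in the real dendrogram.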

Honestly, I don't know any more than you what this means (yet); but it's very interesting that something pointing towards Leicester came up when I wasn't even looking for it. This was a simple probing methodology which pinpointed Leicester on its own, without me asking: “Is Leicester distinctive?” or “What sets Leicester apart?”.

It would be important to validate whether there is any sort of linearity between the frequency of each motif and the total number of motifs completed; if that were the case, Leicester’s low number of motifs could explain why it is being set apart, but I don’t think this would tell the whole story, as then WBA and Sunderland would also be singled out in a way.

In the coming weeks I’ll try to get to the bottom of this…

Thursday 9 June 2016

Quantifying Passing Subsequences: A Way to Identify Teams and Players’ Playing Style?

‘Passes’ dominate the bulk of recordable events of a football match, usually numbering in the hundreds as compared to ‘goals’ or ‘shots’ for example, which rarely ever surpass 7 and 30 respectively. We football fans have become accustomed to seeing references to so-called “passing statistics” such as Xabi Alonso completing a record number of passes with Bayern Munich or Sergio Busquets having 99% passing accuracy in a match. When Arsenal recently signed Granit Xhaka, the media’s coverage was dominated by figures of how he was amongst the top 5 of the Bundesliga’s “Completed Passes” tables. That’s all very well; but with the richness of information available, are we truly limited to such (no offense) obvious conclusions and interpretations?

The recorded information on passes is now greatly detailed, with not only the “passer” and “recipient” being recorded, but also the time at which it occurred, the starting and finish coordinates, the type of pass (long, short, aerial, through ball, etc.), etc. Surely there is more information to be uncovered.

Gyarmati, Kwak and Rodriguez (2014) have a creative take on the problem. They organise the pass data by what they call 3 passes-long motifs, defined as distinct sequences of 3 passes between players regardless of their identity. In total, there are 5 motifs:

ABCD
ABCB
ABCA
ABAC
ABAB

This may seem a bit confusing at the moment but if you bear with my explanation for a moment, it’s actually pretty simple: In an England match, a sequence of 3 passes of the type ABCB for example would be Kane passing to Barkley (A→B), then Barkley passing to Wilshere (B→C) and then Wilshere passing back to Barkley (C→B). If in that same match there is an instance of Vardy passing to Rooney (A→B), Rooney passing to Walker (B→C) and then Walker passing back to Rooney (C→B), this would once again be another instance of ABCB in this game even though it wasn’t the same players involved, since in their methodology Gyarmati, Kwak and Rodriguez don’t distinguish between the identity of the players, just the different motifs possible. In their investigation, a sequence of six passes for example would tally four motif instances, as the figure illustrates:

Take a moment to make sure you understand exactly how the information is being processed. In the end, the authors are left with a count tally for each match counting how many times each of the 5 motifs occurred.
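The tallying can be sketched in a few lines. This is only an illustration of the idea, not the authors’ code: it assumes an unbroken passing chain is given as the list of players who held the ball (so n passes means n + 1 names, and nobody passes to himself), and the six-pass chain below is invented.

```python
from collections import Counter

MOTIFS = ("ABAB", "ABAC", "ABCA", "ABCB", "ABCD")

def motif_of(players):
    """Relabel four consecutive ball-holders as A, B, C, D in order of first appearance."""
    labels = {}
    for p in players:
        if p not in labels:
            labels[p] = "ABCD"[len(labels)]
    return "".join(labels[p] for p in players)

def count_motifs(chain):
    """Count the motif of every window of three consecutive passes in a chain."""
    counts = Counter({m: 0 for m in MOTIFS})
    for i in range(len(chain) - 3):
        counts[motif_of(chain[i:i + 4])] += 1
    return counts

# The two examples from the text fall in the same class.
print(motif_of(["Kane", "Barkley", "Wilshere", "Barkley"]))  # ABCB
print(motif_of(["Vardy", "Rooney", "Walker", "Rooney"]))     # ABCB

# A hypothetical six-pass chain (seven ball-holders) gives four windows.
chain = ["Kane", "Barkley", "Wilshere", "Barkley", "Rooney", "Kane", "Walker"]
counts = count_motifs(chain)
print(dict(counts))
```

Sliding the window one pass at a time is what makes six passes produce four motif instances: passes 1-3, 2-4, 3-5 and 4-6.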

The authors’ reasoning is that by understanding the motifs’ distribution for different teams, inherent information about a team’s playing style will become apparent. It seems like a reasonable intuition, if we consider for example that ABCD is a direct build-up passing sequence involving 4 different players, while ABAB most likely reveals a patient build up where 2 players give the ball back and forth in the style we usually attribute to Barcelona or Bayern Munich.

They then applied their ideas to passing data from the 2012-13 Spanish La Liga season. The following figures (taken directly from their paper) show the z-score for each motif for each team in that season:

Indeed, Barcelona seem to make extensive use of “patient” sequences like ABAC, ABCB and ABAB when compared to other teams; and significantly less use of the “direct” motif ABCD.

Another interesting way to use this data is through cluster analysis or clustering. This is a technique designed to discover natural groupings inside data. As a toy example, if you feed the following data points to a clustering algorithm, it will tell you: “There are 4 different groups, and these are the points in each of the groups.”

NOTE: Remember that when things are in 2 dimensions we can visualise them and naturally observe these kinds of results, but the true value of these methods comes when analysing data in more than 3 dimensions.

Viewing each team as a vector in 5 dimensions, where each entry corresponds to the z-score of one of the five motifs, the authors performed cluster analysis and their result yielded 4 natural groups of teams:

NOTE: The final league standings are in parentheses for context

Lopez and Sanchez (2015) build on the previous passing motifs approach in their article “Who can replace Xavi?”, and turn the attention to players specifically by looking at which roles each player is fulfilling within the team’s passing sequences. There are now 15 possible roles for a player:

He can either be the “A” in an ABAB sequence (Xavi→Iniesta→Xavi→Iniesta), or he can be the “B” (Iniesta→Xavi→Iniesta→Xavi). Similarly, he can be the “A” in an ABAC sequence (Xavi→Busquets→Xavi→Messi), or he can be the “B” in an ABAC sequence (Busquets→Xavi→Busquets→Messi), etc. In total there are 15 roles each player can be for all the motifs we already discussed.

This interpretation allows each player to be viewed as a vector in a 15 dimensional space; and the geometry (distances between players) of this space allows for plenty of questions to be answered. A cluster analysis can again be performed, and in this way you can see which players can fulfil similar roles within a team’s passing combinations. Think of the applications this has for recruitment!
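A rough sketch of how such a 15-dimensional vector could be assembled follows. It is an illustration under my own assumptions rather than the article’s implementation: raw counts are used without any normalisation the authors may apply, closeness is plain Euclidean distance, and all function names are hypothetical.

```python
from collections import Counter
from math import dist

MOTIFS = ("ABAB", "ABAC", "ABCA", "ABCB", "ABCD")
# One role per distinct letter in each motif: 2 + 3 + 3 + 3 + 4 = 15.
ROLES = [(m, letter) for m in MOTIFS for letter in sorted(set(m))]

def role_counts(chain):
    """Count, per player, how often he fills each (motif, letter) role in a chain."""
    counts = Counter()
    for i in range(len(chain) - 3):
        window = chain[i:i + 4]
        labels = {}
        for player in window:
            labels.setdefault(player, "ABCD"[len(labels)])
        motif = "".join(labels[p] for p in window)
        for player, letter in labels.items():
            counts[(player, motif, letter)] += 1
    return counts

def player_vector(counts, player):
    """Lay a player's role counts out in the fixed order given by ROLES."""
    return [counts[(player, m, letter)] for m, letter in ROLES]

def closest(vectors, target, k):
    """Names of the k players whose vectors are nearest to the target's."""
    others = [name for name in vectors if name != target]
    return sorted(others, key=lambda n: dist(vectors[target], vectors[n]))[:k]

counts = role_counts(["Xavi", "Iniesta", "Xavi", "Iniesta"])
xavi = player_vector(counts, "Xavi")
print(len(ROLES), xavi)
```

In the example chain Xavi is the “A” of a single ABAB, so his vector has a 1 in that position and zeros elsewhere. With vectors for a whole league, `closest` returns the kind of “nearest players” list discussed next.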

In answering their question regarding who can replace Xavi within Barcelona’s passing combinations, Lopez and Sanchez draw up a list of the 20 players geometrically closest to Xavi’s 15-dimensional passing motif feature vector (using data from the previous 3 and 5 seasons of La Liga and the Premier League respectively):

Image taken directly from Lopez and Sanchez (2015)

I really enjoyed the two articles I presented here, and not because I believe their methodology to be the “be all and end all” of investigating playing style or player recruitment, even though they do provide some valuable insight. The true reason I really like them is because they provide the perfect example of how applying math to football problems is not about the brute computing force of computers and algorithms as some might think; but rather it requires skill, creativity and even good old fashioned “football” sense to understand how to quantify and aggregate raw data into useful and manageable ways, and then apply methodologies whose outcomes can be tangibly interpreted in the football context.

I think there is more to come from these recent approaches. Stay tuned.

REFERENCES:

López Peña, J. and Sánchez Navarro, R., 2015. Who can replace Xavi? A passing motif analysis of football players. arXiv preprint arXiv:1506.07768.
Gyarmati, L., Kwak, H. and Rodriguez, P., 2014. Searching for a unique style in soccer. arXiv preprint arXiv:1409.0308.
