I spent most of the summer in Colombia and then Croatia, and took a bit of a break from football and math. Now I’m back in London and settling into my routine. I didn’t spend much time in front of the computer over the summer and have no finished results to show you yet, but even on holiday I can’t stop my mind from drifting towards football and new ideas that could be applied to it. I’m going to use this entry to tell you about my plans to explore some of these ideas during this new “Analytics Season” leading up to the 2017 OptaPro Forum in February, where I’m hoping to get the chance to present them.
Up until this point, I’ve spent most of the entries speaking about team passing sequences and the results of their quantification through network motifs. This is a very interesting topic, and I think there is still more to come from this. These are some areas where I still hope to do some more work:
- The vectorisation through motif frequencies can be refined with some more information. For example, I’ve been thinking that different instances of ABAC represent very different kinds of combination play. An ABAC passing sequence can be a short one-two between players A and B followed by a long ball from A to player C. Alternatively, it can simply be three short passes. The distance of the third pass should be the main factor differentiating types of ABAC, because in the vast majority of cases the ‘ABA’ part will be made up of short passes (if Coutinho gives it to Lallana and Lallana gives it back to Coutinho, we don’t expect either of those passes to have been long, otherwise Coutinho would have had to run a large distance after his first pass). Looking at the weights of the first principal component of the pass lengths across all instances of ABAC, almost 97% of the variance sits in the length of the third pass. It remains to be seen whether this ‘refinement’ can be used to further distinguish team playing styles.
NOTE: Think of Principal Component Analysis as a method that assigns coefficients to features according to how much of the variance in a data set they contain. If we have the height and weight of a population of hippos and a population of zebras, the height across the whole set is roughly the same but the weight differs a lot, and Principal Component Analysis tells us precisely this: the weight is where the variance is.
- Even though we’ve spoken about “playing style” convincingly through this methodology, we still haven’t related this vectorisation to “success on the field” yet, which is actually what ultimately matters. It would be interesting for example to use Topological Data Analysis (you can read about it in my previous entry) to map out the motif vectorisation and discover where success in the league is being accumulated. We can also fit a probability distribution that gives the probability of a certain motif structure leading to a top 4 finish for instance. In this sense, we could potentially advise clubs on what they need to change in their passing play to increase their probability of finishing in the top 4, or of not being relegated, or of winning the league, etc.
- As I said before and showed with the Xavi example, this ‘passing motifs’ idea can be extended to the player level by vectorising each player’s frequency of participation in the different motifs. Once players are represented as vectors in this high-dimensional space, we can apply a whole arsenal of methodologies to answer questions such as which player is best suited to replace an outgoing player, which players have a similar style, and how individual players affect the passing motif structures of their teams. It remains to be seen whether this approach will quantify a meaningful underlying quality in players as we have shown it does for teams, but certainly, more information is preferable to less.
- These topics (team style, player style and recruitment) can be combined; so for example we can advise clubs on how the recruitment of a certain player will affect their passing motif structure, and whether this change will improve their probability of a top 4 finish.
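To make the Principal Component Analysis remark in the first bullet more concrete, here is a minimal sketch in plain Python. The pass lengths are synthetic numbers I made up for illustration (two short passes plus a third pass that may be short or long); they are not the real data behind the figure quoted above, and the function names are hypothetical. It estimates the first principal component by power iteration on the covariance matrix and reports the squared weight on each of the three pass lengths.

```python
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric tuples."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centred = [[r[j] - means[j] for j in range(dims)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


def first_principal_component(cov, iterations=200):
    """Approximate the leading eigenvector of cov by power iteration."""
    dims = len(cov)
    v = [1.0 / dims ** 0.5] * dims  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# ASSUMPTION: synthetic ABAC instances, lengths in metres.
# Passes 1 and 2 (the 'ABA' one-two) are short; pass 3 can be short or long.
rng = random.Random(0)
abac_lengths = [
    (rng.uniform(5, 15), rng.uniform(5, 15), rng.uniform(5, 60))
    for _ in range(500)
]

pc1 = first_principal_component(covariance_matrix(abac_lengths))
weights = [x * x for x in pc1]  # squared loadings; they sum to 1

for label, weight in zip(("pass 1", "pass 2", "pass 3"), weights):
    print(f"{label}: {weight:.3f}")
```

On data shaped like this the third weight dominates, which is the same qualitative picture as in the hippos and zebras note: the component points at the feature where the variance is.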
These are ambitious plans, but I think they are important ones, because I must admit that a (fair) criticism that could be thrown my way is that so far the results are very interesting theoretically but difficult to translate into practical, applied contexts within the industry. This is fair, but it doesn’t take away in the least from the value of the results obtained. The passing motifs methodology has a lot going for it. It proved to be consistent across different seasons, which is strong evidence that it identifies some underlying inherent property, which we called “passing style”, instead of just picking up statistical noise. It was also used to identify a passing style unique to Leicester City which was present even before their title-winning season, something that no one could have predicted or expected.
The key counter-argument to this criticism is this (opinion): there doesn’t have to be an obvious, direct and immediate practical application for theoretical work to be valuable to a field. I strongly suggest those with a true interest in the topic of Football Analytics read this fascinating entry from Statsbomb author Marek Kwiatkowski. Here’s an excerpt if you can’t be bothered to read the whole thing:
“(I) believe that we have now reached the point where all obvious work has been done, and to progress we must take a step back and reassess the field as a whole. I think about football analytics as a bona fide scientific discipline: quantitative study of a particular class of complex systems. Put like this it is not fundamentally different from other sciences like biology or physics or linguistics. It is just much less mature. And in my view we have now reached a point where the entire discipline is held back by a key aspect of this immaturity: the lack of theoretical developments. Established scientific disciplines rely on abstract concepts to organise their discoveries and provide a language in which conjectures can be stated, arguments conducted and findings related to each other. We lack this kind of language for football analytics. We are doing biology without evolution; physics without calculus; linguistics without grammar. As a result, instead of building a coherent and ever-expanding body of knowledge, we collect isolated factoids.”
When looking for conceptual theoretical developments, passing network motifs fit the bill of a consistent and robust concept with a clear underlying motivation (representation of “passing style”). Practical applications will inevitably follow from this maturation of the discipline, and I have already outlined above some much more practical approaches which can be looked into.
Finally, and before this entry gets any longer, this approach has given me valuable insight into a type of conceptual processing that can be applied to raw football data in order to obtain a meaningful representation. Football events during a match are very dynamic, complex and interdependent, but they encode all the information needed to determine results, quality, potential, etc. The network motifs approach suggests taking the constituent blocks of the passing events graph and applying an equivalence relation on the identity of the nodes in order to study their nature (this simply means that instead of focusing on the specific players performing the passes, we consider, for instance, any occurrence of a one-two as belonging to the same “class” of pass motif regardless of who the specific players were).
It has made me think: why not attempt this with other types of events? Consider for example a directed graph representing a team’s performance in which, instead of the nodes representing players and the edges passes, each node represents an “area” of the pitch and an edge is simply the act of going from one area to another through a pass, dribble, etc. A sequence in this network is simply a movement between different areas. The ‘equivalence relation’ on the identity of the nodes which I think would be useful for this approach would work something like this: a play starting in one area, then moving to an area three spaces to the right through a pass, and then forward two spaces through a dribble would be classified exactly the same regardless of the players involved and of whether it started inside our own penalty box or on the halfway line.
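As a rough sketch of what that equivalence could look like in code, the snippet below reduces a play to the actions and relative displacements between areas, so the starting area and the players drop out. Everything in it is an assumption for illustration: the pitch is taken to be a grid of zones indexed as (forward, right), the function name `movement_motif` is hypothetical, and the three plays are invented.

```python
from collections import Counter


def movement_motif(start, steps):
    """Describe a play by its actions and relative zone displacements.

    start: (forward, right) index of the zone where the play begins.
    steps: list of (action, zone) pairs, where zone is the area reached.
    Returns a tuple of (action, d_forward, d_right), which does not depend
    on the starting zone or on who the players were.
    """
    motif = []
    previous = start
    for action, zone in steps:
        motif.append((action, zone[0] - previous[0], zone[1] - previous[1]))
        previous = zone
    return tuple(motif)


# The play from the text: a pass three spaces to the right, then a dribble
# two spaces forward. Once starting in our own box, once on the halfway line.
from_own_box = movement_motif((0, 1), [("pass", (0, 4)), ("dribble", (2, 4))])
from_halfway = movement_motif((5, 0), [("pass", (5, 3)), ("dribble", (7, 3))])

# A different play: a single pass one space forward.
short_forward = movement_motif((5, 0), [("pass", (6, 0))])

# Counting motifs over many plays gives the frequency vector for a team or player.
motif_counts = Counter([from_own_box, from_halfway, short_forward])
print(from_own_box == from_halfway)
print(motif_counts[from_own_box])
```

The first two plays collapse into the same class because only the displacements are kept. How coarse the grid should be, and whether direction should be mirrored for left and right flanks, are open design choices that this sketch does not settle.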
Vectorising team or player performance through the frequency of the motifs in this context could lead to a very robust quantification of playing style, a performance metric, a probability of success… who knows?! I don’t yet, but let’s hope I can find out between now and February.