The Open Football: 2017

Sunday, 14 May 2017

Paper: 'Flow Network Motifs Applied to Football Passing Data'

I wrote this paper to be presented at the MathSport 2017 Conference at Padua University in June. It's a bit heavy on the mathematics in chapter 2 but should be fairly readable from there on. Here's the abstract so you can decide whether you're interested in reading it before opening the whole document:

"Network Motifs are important local properties of networks, and have lately drawn increasing attention as promising concepts to unearth structural design characteristics of complex networks. In this document, we push the boundaries of the existing body of literature which has used this theory to study soccer passing networks by attempting to uncover unique team passing network structure, and make a rigorous attempt to formalise a theoretical framework in which to carry out and evaluate these analyses. We contribute to the existing body of knowledge by proposing a framework based on repeatability in which to establish the ideal length of flow motifs with which to study soccer passing networks, and also by considering spatial classifications of flow motifs to achieve greater precision in our claim to discover unique team passing network style."

Anyways, here's the link to the pdf on Google Drive:

https://drive.google.com/open?id=0Bzvjb5fnv1HtVFNNWXpQb2dLNTQ

Thursday, 13 April 2017

What's the ideal length of passing motifs?

In my last entry I pledged to answer this question using the 'repeatability' methodology I presented there. This will be a quick entry to confirm that luckily, we've been right all along and 3 is the ideal length to consider for passing motifs.

The number of passes considered in a passing motifs analysis is a clear instance of the trade-off between detail and comparable structure that we discussed in the previous entry. The Figure below shows the number of motif types that occurred in the 2015-16 season of the Premier League depending on the number 'k' of passes we choose as 'length' from 3 to 7:

When we choose to consider 3 passes, there are 5 motif types which I hope all my readers know by heart by now: ABAB, ABAC, ABCA, ABCB and ABCD. If we choose to go up to 4 and consider one extra pass, there are 15 different types: ABABA, ABABC, ABACA, ABACB, ABACD, ABCAB, etc.

For 5-pass-long motifs there are 52 types (all of which occurred at some point in the 2015-16 season), and for 6-pass-long motifs there are 203 (of which only 187 occurred in the 2015-16 data). There were 759 different types of 7-pass motifs in the data. The number of motif types grows quickly with the number of passes we consider, which is precisely how structure gets lost in the noisy haze of excessive detail.
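For readers who want to check those counts, here is a minimal sketch that enumerates the possible motif types for a given number of passes. The function names `motif_label` and `all_motif_types` are my own for illustration, and the only rule assumed is that a player never passes to himself, so consecutive letters must differ.

```python
from itertools import product


def motif_label(players):
    """Relabel a sequence of players as A, B, C... in order of first appearance."""
    seen = {}
    letters = []
    for player in players:
        if player not in seen:
            seen[player] = chr(ord("A") + len(seen))
        letters.append(seen[player])
    return "".join(letters)


def all_motif_types(k):
    """All distinct motif labels for k passes, i.e. k + 1 consecutive players."""
    n = k + 1
    types = set()
    # Brute force: try every sequence of n player ids and keep the valid ones.
    for seq in product(range(n), repeat=n):
        # A pass always goes to a different player, so neighbours must differ.
        if all(a != b for a, b in zip(seq, seq[1:])):
            types.add(motif_label(seq))
    return sorted(types)


print(motif_label(["Ozil", "Alexis", "Ozil", "Giroud"]))  # ABAC
for k in (3, 4, 5, 6):
    print(k, len(all_motif_types(k)))  # 5, 15, 52 and 203 types
```

The brute-force search is fine up to 6 passes. The same enumeration gives 877 possible types for 7 passes, although it takes noticeably longer to run.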

The Figure below shows the number of motifs for the 2015-16 season for each number of passes 'k' considered:

There were 138,432 3-pass-long sequences compared to 45,820 7-pass-long sequences. While there are considerably fewer of the latter, the data set is still of a decent size to believe that we can extract meaningful conclusions.

Finally, the figure below shows the repeatability percentage as per the methodology of the previous entry for each choice of length from 3 to 7 (as before, we consider relative frequencies of the different categories rather than raw amounts):

3-pass long motifs have a repeatability of about 82.7%; while 6 and 7-pass long motifs have 57.8% and 52.3% respectively. Considering that as per our methodology random methods that carry no structure would have 50% repeatability (equivalent to randomly assigning style), these figures mean that by then we've lost almost all structure.

It's an interesting conclusion that sequences of 3 passes carry unique team structure better than longer sequences, and are therefore the ideal length to consider. Considering that passes constitute the grand majority of events on a football pitch, it's a far-reaching conclusion. It provides insight into breaking down the sequentiality of football matches into representative constituent blocks: looking at blocks of about the size corresponding to 3 passes should be best practice.

Wednesday, 5 April 2017

Passing Network Autographs and Overshooting Style

At the end of the last entry I touched on the trade-off between comparable structure and ‘granularity’ or ‘level of detail’ of football data. Imagine this: you have a player who has the ability to pick a certain type of between-lines pass that greatly increases your team’s chance of scoring in that play. With passing data, we could try and identify how this pass is represented in the data and then use the data to identify other players in the world that are also good at making this kind of pass. To do this, we will need to break the data down in a detailed way, and differentiate this type of pass from other vertical passes that perhaps aren’t as effective. We might want to consider where in the pitch this type of pass comes from or where it finishes, what action it is preceded by, where the defenders are, what happens after the pass is made, etc.; all of this in the hope that we can clearly identify the type of pass we’re looking for and tell it apart from other passes. However, what happens when we go too far and use too much detail? It is unlikely that each time our player performs this type of pass he does it in exactly the same way, in the same coordinates of the pitch and in the same conditions. If we start using too much detail, we might actually start classifying instances of this type of pass as different when in reality they correspond to the same sought-after ‘type’. Once this happens we are no longer capable of identifying the “type of pass” we were initially looking for, and now have hundreds of different passes that at this level of detail are all different from each other. We can no longer identify players who can play this pass because this “type” was obliterated in the detail.

I actually began thinking about this issue when reading Dustin Ward’s piece on clustering different types of passes. He decides to take 100 clusters or types of passes and see how often each team or player completes each of those types of passes. This is a good example of the trade-off mentioned above. 100 seems like a good number, it certainly reveals more info about a team than if we considered just 1 (which would basically be like looking at overall Pass Success percentages) or 2 types/clusters of passes.

Choosing 100 is also better than choosing 100,000. If we chose 100,000, then each player or team would perform at most 1 instance in a season of each highly detailed, highly differentiated type of pass. We wouldn’t be able to use this information to compare teams or players in any way. But is choosing 100 better than choosing 120? Or than choosing 80? How do we know when this trade-off is striking the right balance?

The key is having something against which to measure ‘balance’, something we want to optimise. In this entry I’ll show you an example of how this something could be ‘repeatability’:

For a while I’ve been wanting to push the Passing Motifs methodology a bit further and include some spatial information about the passes to see what else it can tell us about teams’ passing networks. Below is an example of two very different instances of ABAC.

The question I wanted to answer was this: will we gain any additional valuable information about teams by differentiating different ‘types’ of motifs according to their angles, distances and coordinates on the pitch? Crucially, I also wanted to know where the right balance would be when doing this differentiation in light of our structure-detail trade-off.

There are two ways of looking at spatial variables associated to motifs that I felt could be revealing:

x-y Coordinates of Passes: In Opta’s data files, each pass has a ‘Start x-y’ Coordinate and an ‘End x-y’ Coordinate, meaning each pass has four variables in terms of coordinates. A 3-pass long motif would therefore have a set of 12 variables representing where its passes began and ended.

Angles and Lengths: Another way of looking at it is by the ‘angles’ and ‘lengths’ of passes in a motif. The figure below illustrates how these are found.

With this idea we would have six variables associated with a motif: the angle of each of the 3 passes of the motif and the length of each of the 3 passes as well.
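As a rough sketch of how those six variables could be computed from the pass coordinates: the function name `angles_and_lengths` is hypothetical, and measuring each angle against the pitch's x-axis with `atan2` is my assumption, since the exact convention comes from the figure rather than the text.

```python
import math


def angles_and_lengths(passes):
    """Turn a motif's passes into [angle1, angle2, angle3, length1, length2, length3].

    passes: list of (start_x, start_y, end_x, end_y) tuples, one per pass.
    Angles are in radians relative to the x-axis (assumed convention);
    lengths are in whatever units the coordinates use.
    """
    angles, lengths = [], []
    for start_x, start_y, end_x, end_y in passes:
        dx, dy = end_x - start_x, end_y - start_y
        angles.append(math.atan2(dy, dx))
        lengths.append(math.hypot(dx, dy))
    return angles + lengths


motif = [(0, 0, 3, 4), (3, 4, 3, 10), (3, 10, 0, 10)]
features = angles_and_lengths(motif)

# Moving the whole motif elsewhere on the pitch leaves the features unchanged.
shifted = [(sx + 40, sy + 20, ex + 40, ey + 20) for sx, sy, ex, ey in motif]
print(features == angles_and_lengths(shifted))
```

The shifted copy gives the same six numbers, which is the property described in the note that follows: only the geometric shape of the motif matters, not where on the pitch it happened.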

NOTE: The thing I like about this ‘angles+lengths’ idea is that it doesn’t “care” where in the pitch a motif happened, only its geometric structure. I like this because if it has ‘structure’ or ‘insight’ into teams’ styles it will not be as heavily determined by whether the team dominates the opposition or not: if we only look at pitch coordinates of motifs then top teams like Chelsea or Manchester City will perform all of their motifs high up the pitch. Therefore, the method would be biased towards saying they perform the same ‘types’ of motifs, namely “high up the pitch” motifs. I’m not saying that this isn’t meaningful, but it is information we all know by simply looking at the league table and knowing these teams play deeper into their opposition’s half. However, if we discover structure that is independent of the league table from the geometric shape of motifs, it makes it interesting in the sense that perhaps it wasn’t correlated with "obvious" aspects.

Whichever of the two ideas we go for, there is going to be a set of variables associated to each motif, which we can then feed into k-means clustering to classify the motifs into a certain number of different types. Our intuition from the trade-off tells us that there is an intermediate number of categories that has the best representation of “style”. The problem is that to use a k-means clustering algorithm, we need to manually tell the algorithm how many different categories we want before knowing this optimum number.

Consider this: for each choice of number of categories, once we have determined the number of categories and classified the different motifs into the category they correspond to, we can use the best practice we know from the original passing motif methodology and look at what percentage of each motif category (in the ABAC sense) corresponds to which ‘type’ (in either an x-y coordinate or an angles+lengths sense). So as an example, if we had chosen to have 3 different types of motifs, then for each team we would have this set of numbers: what percentage of the team’s ABAB motifs are type 1, what percentage of the ABABs are type 2, what percentage of the ABABs are type 3, what percentage of the team’s ABAC motifs are type 1, what percentage of the ABACs are type 2, etc. What we’ll have is a vector representing each team.
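To make the vector concrete, here is a small sketch assuming the clustering has already been done and each motif comes as a (label, type) pair. The name `team_vector`, the zero-based type numbering and the choice to return shares between 0 and 1 rather than percentages are all mine.

```python
from collections import Counter

MOTIFS = ["ABAB", "ABAC", "ABCA", "ABCB", "ABCD"]


def team_vector(motifs, n_types):
    """Build one team's vector from (motif_label, spatial_type) pairs.

    spatial_type is the cluster index (0 .. n_types - 1) assigned to the motif.
    Returns 5 * n_types numbers: for each motif label, the share of that
    label's occurrences falling into each spatial type.
    """
    pair_counts = Counter(motifs)
    label_totals = Counter(label for label, _ in motifs)
    vector = []
    for label in MOTIFS:
        total = label_totals[label]
        for spatial_type in range(n_types):
            # A label the team never produced contributes zeros.
            share = pair_counts[(label, spatial_type)] / total if total else 0.0
            vector.append(share)
    return vector


sample = [("ABAB", 0), ("ABAB", 0), ("ABAB", 1), ("ABAC", 2)]
print(team_vector(sample, n_types=3))
```

With 3 types the vector has 15 entries, and with the 5 motif labels kept as they are it grows by 5 entries for every extra type we allow.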

Now suppose we randomly divide each team’s motifs into two different sets, so now we have Arsenal’s A motifs and Arsenal’s B motifs as if we were artificially considering each as the motifs of different teams. If choosing this number of categories reveals teams’ structure or style, then the style attributed to Arsenal’s A vector should be very similar to the style attributed to Arsenal’s B vector. The more underlying structure we’re capturing, the more this effect should be obvious. If on the other hand we’ve gone too far and now the extreme detail is overshooting the underlying structure we want to discover, then Arsenal’s A vector will not necessarily be similar to Arsenal’s B vector because the extreme detail is damaging the comparability of styles. This is what I mean by “repeatability”.

The following graph reveals how “repeatable” each choice of number of categories is for both the ‘x-y coordinates’ idea and the ‘angles+lengths’ idea:

The methodology is as follows: for each number from 2 to 50, we create that number of motif categories using k-means clustering and assign each motif to a category. We then randomly divide each team’s motifs into two different sets to have a vector A and a vector B for each team. Then we check how “repeatable” the methodology is by checking on average how close each team’s A vector is to its B vector in comparison to the rest of the vectors representing other teams; and this process (since it involves both the randomness of a k-means algorithm and the division of a team’s motifs into two sets) is repeated a hundred times for each number. The graph shows as a percentage the average ‘relative closeness’ over the hundred trials as follows: I took each team’s two vectors and determined on a scale of 1 to 39 how close the team’s A vector was to its B vector. Since there are 20 teams and I divided each one into two vectors, there will be a total of 40 vectors representing 'styles'. Considering as a focal point a team’s A vector, its B vector could either be the closest of the other 39 vectors (1), the second closest (2), all the way up to the farthest away (39). I did this for every team and averaged these numbers, and finally expressed the outcome as a percentage of 39 (this was done using passing data for the 2015-16 Premier League season).
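A minimal sketch of the ranking step could look like the following. The function names are illustrative, Euclidean distance is an assumption, and so is the exact formula that turns the average rank into a percentage: I've picked one where an average rank of 1 gives 100% and a random layout gives 50%, which matches the 50% baseline mentioned for structureless methods.

```python
import math
import random


def random_halves(motifs, rng):
    """Randomly split one team's motifs into an A set and a B set."""
    shuffled = list(motifs)
    rng.shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


def rank_of_partner(vectors, team):
    """Rank (1 = closest) of a team's B vector among all vectors other than its A."""
    focal = vectors[(team, "A")]
    others = sorted(
        (math.dist(focal, vec), key)
        for key, vec in vectors.items()
        if key != (team, "A")
    )
    return 1 + [key for _, key in others].index((team, "B"))


def repeatability(vectors):
    """vectors: {(team, 'A' or 'B'): list of floats}. Returns a percentage."""
    teams = sorted({team for team, _ in vectors})
    n_others = 2 * len(teams) - 1  # 39 when there are 20 teams
    mean_rank = sum(rank_of_partner(vectors, t) for t in teams) / len(teams)
    # Assumed normalisation: rank 1 -> 100%, rank n_others -> 0%.
    return 100.0 * (n_others - mean_rank) / (n_others - 1)


half_a, half_b = random_halves(["ABAB", "ABAC", "ABCA", "ABCD"], random.Random(1))
toy = {
    ("Arsenal", "A"): [0.0, 0.0], ("Arsenal", "B"): [0.1, 0.0],
    ("Chelsea", "A"): [5.0, 5.0], ("Chelsea", "B"): [5.0, 5.1],
}
print(repeatability(toy))  # 100.0, every B vector is the closest to its A
```

In the full procedure this would sit inside a loop over the number of categories and over the hundred random repetitions, with the results averaged for each number.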

Right off the bat we find evidence of the balance we’ve been speaking of. When we start increasing the number of categories we start obtaining more repeatability, meaning we can more closely recognise two vectors as being the A and B vectors of the same team because they are similar (i.e. close) to each other. I like to interpret this as uncovering more underlying information that uniquely identifies a team’s passing network style: no matter how we randomly divide a team’s motifs into two sets, we roughly still know which sets correspond to the same team because we know the “style”. We then reach an optimum number of categories for which this repeatability is optimised: for the ‘x-y coordinates’ idea it’s at 9 and for the ‘angles+lengths’ idea it’s at 13. After this, the repeatability starts to decrease meaning that a team’s A and B vectors start to not be as similar to each other because they’re made up of highly detailed motifs that are overshooting the underlying “style” of what it actually is that a team inherently does with its passing networks.

We have answered our initial question: The original passing motif methodology (found in this entry) in which we took the 5 different motifs and compared teams according to how much they relatively used each motif had about 83.3% repeatability as per our methodology. By breaking motifs down into an optimum number of categories for the ‘x-y coordinates’ and ‘angles+lengths’ ideas (9 and 13 respectively), we were able to increase our repeatability to 94.3% and 84.4% respectively (evidently the 'x,y Coordinates' has better repeatability than 'Angles+Lengths', but as we said before any structure from a purely geometrical classification is interesting).

Below is a set of boxplots illustrating what the 9 categories represent in terms of the different 'x-y' coordinates:

As an illustration, if you look at categories 4 and 8, they both begin a bit past the halfway line really close to the left touchline, but while in Category 4 motifs the three-pass sequence ends a bit further up but still on the left hand touchline, the Category 8 motifs made their way across the pitch to finish closer to the right hand touchline.

The 94.3% repeatability of the 'x-y coordinates 9 category' vectorisation is incredibly high. In fact, if we remove Sunderland and West Bromwich which for some reason only have 80.2% and 83.5% repeatability respectively, the other teams have an average repeatability of 95.7%!

These results mean that we’ve managed to pin down an underlying structure in teams’ passing networks that allows us to identify unique team styles (let’s call them "Passing Network Autographs") with a high degree of confidence. We’re at the point where if we’re given a set of motifs we could have a robust educated guess at which team they correspond to and most likely be right (except perhaps if they belong to West Brom or Sunderland for some reason). As an example, below is a comparison of Arsenal's autograph versus Bournemouth's (the team whose 'autograph' most differs from Arsenal's):

Perhaps some readers might be unimpressed with this rather theoretical and un-applied result, but although I admit that in its raw form this seems a bit unmanageable, I would advise them to keep an open mind and think of the potential. For example, having such a reliable ‘passing network autograph’ for teams, we can look through players from outside the Premier League and find those whose current passing network best fits within a team’s autograph. We could also use our measure of team style to try and predict which styles are more effective against each other, or which defenses are the best at interrupting the attacking flow through a team’s passing network. These possibilities probably sound more appealing to most readers, but in order to do them in a meaningful way they must be underpinned by theoretical confidence that we are indeed identifying team styles. I will try and follow up this theoretical entry with a more applied one exploring some of these possibilities later this month.

I want to finish this entry off by highlighting the important potential of generalisation these ideas have. I feel they’ve helped me establish best practice when it comes to breaking passing motifs into different categories according to their spatial properties (and by best practice I mean knowing how many categories I should break it up into); but the method can also be used to determine best practice in other ideas currently being explored by football analysts. For example, during my Opta Forum Presentation’s Q&A, Marek Kwiatkowski asked whether the passing motif methodology could be generalised to motifs of more than 3 passes. The answer is that it can, but we run the risk of going too far and starting to overshoot the structure that the methodology helped us identify as team and player passing style: for 3-pass-long motifs we had 5 motif types, while only going up to 5 or 6-pass-long motifs we’re already at 52 and 203 types respectively with wacky things like ABCADBA. The ideas presented here can help us answer the question of whether it’s worth looking at longer motifs (another entry soon perhaps?). They can also help Dustin Ward to establish exactly how many types of passes he should consider. In general, they help us to establish standardised best practice that the whole of football analytics will benefit from and that is currently distinctly lacking. Echoing Marek’s piece on the state of analytics: “Established scientific disciplines rely on abstract concepts to organise their discoveries and provide a language in which conjectures can be stated, arguments conducted and findings related to each other. We lack this kind of language for football analytics”. We need common-ground theory in which our public work can be related and compared, and its worth truly understood. The lack of it is holding back all of us who have an active interest in the field really taking off.
I hope this approach to improve our understanding of our ideas and take steps towards enhancing them and establishing best practice can inspire other public (and even private) analysts to attempt similar things in their work and establish bridges through which we can compare and complement our work. Valuable applications will inevitably flow from robust, interconnected theory.

MATHEMATICAL FOOT-NOTE: Comparing the distance between A and B vectors as their position from 1 to 39 on the closest-farthest away scale may seem a bit unorthodox and one might consider simply using the z-score of the distance between teams’ A and B vectors in the context of all the distances between all 40 vectors. However, the reason I don’t do this is that for each different choice of the number of categories, the dimension in which these vectors are is different, and on a personal mathematical note I have deep mistrust in comparing distances between things that are in different dimensions.

P.S.: I want to give a brief mention to Ben Torvaney who gave me a small but meaningful contribution which I feel greatly enhanced the results of this entry.

Tuesday, 14 March 2017

VIDEO: OptaPro Forum 2017 Talk on Passing Motifs

A talk on some of my passing motifs work was selected for the OptaPro Forum 2017 which took place in Central London on February 8th. You can see the full video (except the Q&A) below:

Most of the content I'd already written entries about: For the general overview on the passing motifs methodology for teams read this entry. For my applied results on teams from the Premier League (and a take on what the hell happened with Leicester) read this and this entry. Below are the images for the hierarchical clustering dendrograms and the principal component analysis graphs of these 5-dimensional representations of Premier League teams for the 2014-15 and 2015-16 seasons.

Moving on to a player level, you can read up on the general methodology in this entry, and then on the specific scoring system that I presented at the forum to create those lists of players in this entry. Below are some of the lists I had on display which are respectively: Premier League 15-16 using 'Key Passes' to award points, Bundesliga 15-16 using 'Key Passes' to award points and finally Premier League using 'Expected Assists (xA)' to award points. For that last one I used Opta's xA numbers which give account for the probability of a pass turning into a shot with a certain xG value.

Finally, I also did a bit on using Topological Data Analysis (TDA) to explore the results for players which I hadn't done before; although to read up on the general methodology of TDA you can read this entry (wow how things have changed since that entry! I of course now know that Opta doesn't really log 'controls with left thigh'. Don't be fooled by how assuredly I wrote about the analytics industry back then, I honestly didn't know half of what I know now about that world just 10 months later... hopefully my future self in another 10 months will also look back with pity at my current self's ignorance).

Below is the image from the forum:

Finally, I want to use this (non) write-up on the presentation as a platform to discuss some more general reflections about analytics. 'Operationalising' is a hideous sounding word which was horribly difficult to repeatedly say in front of 300 people; but it actually is very important. There is so much complexity in raw football data that those of us who do analytics really need to broaden our scope when thinking how we will represent this raw information in numbers, vectors or variables that will help us uncover the rich underlying information that is there for the taking. The 'passing motif' operationalisation of raw passing network data is 'neat'; it seems harmless when you first see it and I wouldn't blame you for doubting that those 5/45 numbers attributed to teams/players will actually say much about them, but evidently they do. I think that what it's got going for it is that it helps to account for the sequentiality of the raw events, something which most of the work I encounter out there fails to do. As I said in the presentation, we're a bit too focused on events when it's actually the sequences of these events that matter.

There is a classic problem though (akin to the overfitting problem of modelling) when trying to account for larger and larger sequences: If we become too granular and for example don't do this methodology for 3-pass-long motifs but rather for 10-pass-long motifs, then the occurrences will become so specific that we actually lose out on comparable capacity in the structure of our information. Alexis would have such a specific distribution of highly differentiated sequences that he would have no neighbours to reward him for their key passes! We need to strike a balance between sequentiality and, let's call it, "non-granularity" (this was actually one of the questions at the forum: can/should the methodology be generalised for more than 3 passes?).

Finding the correct concepts that strike this balance is the challenge of analytics. Passing motifs are "neat"; but even I recognise that they are nowhere near the ambitions of what I would hope to achieve in analytics. Exciting years to come!

Monday, 16 January 2017

Player Vectorised Representations: What player lists can we draw up with confidence?

I love drawing up lists and rankings of players (who doesn’t?) and giving myself a big “confirmation bias” pat on the back when I see players on the list which I like while casually either ignoring as a mistake of the method or updating my bias for the players on the list which I don’t particularly rate highly. However, the very exercise of drawing up lists and rankings can be misleading for the probabilistically-illiterate because it seems to imply set-in-stone certainty about who the best player is, who the second best is, etc.; and this rigid numbering masks the underlying concepts of probability. And yet, drawing up player lists is key for the recruitment workflows of clubs, be it in drafts or transfer windows, or even just to set up a schedule for their scouts. You definitely don’t have to see the rankings as set in stone, but I can imagine clubs would definitely want to have things like 15-men shortlists with 2 or 3 ‘favourite choices’. In this entry I’m going to show you a couple of lists I drew up and how we can go about our list-making with confidence with vectorised representations of players.

I drew up lists for this entry using the player passing motif ideas from previous entries. The passing motif methodology produces a vectorised representation of players, which basically means that each player is represented by a vector of numbers. In the passing motif methodology I’ve used so far, the vector representing each player has 45 entries or numbers. The key conceptual bit is that when you have this type of vectorised representation, you can imagine each player as being in a “space” of some sort. To imagine it, suppose that instead of 45 you simply had 3 numbers representing each player, something like age, height and weight. If this was the case, you could imagine each player as being represented by a dot in a 3-dimensional space much like your living room. Some players would be closer to others, some would be farther away. Perhaps all the senior, tall and heavy centre-backs are located around your TV, while the shorter and lighter second strikers are hovering around your dining room table. This is just how I conceptualise the result of the 45-dimensional passing motif methodology. It makes it more abstract to picture, but just as in the 3-dimensional case, there are distances, certain dots closer to each other or concentrated around certain areas, etc.

The list I drew up basically took all the players who had at least 18 appearances in last year’s Premier League, and gave them “points” according to how many key passes they made AND how many key passes the players around them made. The closer to a player you are, the more “points” his amount of key passes awards you; the farther away the less. I tried this out in a few ways but that’s the basic idea. The idea is that if you happened to make few key passes in the season but all the players whose motif vector is close to yours made a lot, you should still have a high score. If the information contained in the motif vectorisation is at all useful to recognise players with creative potential, then the best scored players should in a way be the best creative passers in a more profound way than simply looking at the Key Passes Table. The question is precisely, how do we know the vectorisation’s layout of players has anything to do with their “key passing ability” (i.e., players with high ability cluster around certain areas of this “space” and are in general closer to each other)? Let’s look at the list before we begin to answer this question so everybody gets a bit excited before it dawns on them that I’m actually rambling on about some technical stuff.
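To give a flavour of the scoring idea, here is one possible sketch. The weighting is purely an assumption for illustration: each player's key passes are shared out with weight 1 / (1 + distance), so a player counts fully towards his own score and less and less towards players farther away. The function name `neighbour_scores` and the toy numbers are made up, and this is only one of several variations that could be tried.

```python
import math


def neighbour_scores(vectors, key_passes):
    """Score each player from his own and his neighbours' key passes.

    vectors: {player: motif vector}; key_passes: {player: season total}.
    Assumed weighting: 1 / (1 + Euclidean distance between motif vectors).
    """
    scores = {}
    for player, vec in vectors.items():
        scores[player] = sum(
            key_passes[other] / (1.0 + math.dist(vec, other_vec))
            for other, other_vec in vectors.items()
        )
    return scores


# Toy data in 2 dimensions instead of 45.
vectors = {"a": [0, 0], "b": [0, 1], "c": [3, 4], "d": [50, 50]}
key_passes = {"a": 0, "b": 10, "c": 10, "d": 0}
scores = neighbour_scores(vectors, key_passes)
for player in sorted(scores, key=scores.get, reverse=True):
    print(player, round(scores[player], 2))
```

In the toy data players "a" and "d" both made zero key passes, but "a" sits right next to two creative players and so scores far higher than the isolated "d".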

Remark: Notice how this list isn’t strictly correlated with key passes. Drinkwater is better ranked than Eriksen even though the latter had many more key passes. This means that if the list is sound (big if), it’s picking up on information that wasn’t immediately and explicitly available in the key passes tables.

My confirmation bias seems to like that list quite a lot, there are a lot of good names up there. Most readers probably follow the Premier League closely and know that those are all good attacking creative players, arguably the best in the league. Now imagine that instead of the Premier League, we drew up an equivalent list using data from leagues where we didn’t already know the players, and had confidence that just as in the case of the Premier League, we were definitely getting out a list of most of the best players. Should be useful huh?

There are also some notable absentees. Coutinho comes to mind as a player who is widely agreed to be amongst the best in the league and yet isn’t on the list. Why should we trust a list that claims to rank the top 15 creative players in the league but leaves out Coutinho?

As I said before, I think of the vectorised representation as encoding the information regarding players’ key passing ability if players who tend to have a higher number of key passes are more or less clustered together as opposed to randomly located mixed with all the other players. If this is a general trend, then we know that there is a relationship between a player’s key passing ability and his location in the 45-dimensional space we are imagining. Even if a player happened to not have many key passes in a season (this can happen just as strikers have goal droughts or perhaps because a player’s teammates don’t make good runs), we should still pick up on this “ability”.

What we would need then to justify our faith in the list is some sort of indicator which specifies just how “clustered together” players with a higher number of key passes are. There are many ways to approach this problem in mathematics. For those readers who have mathematical backgrounds we could try to fit a model and assess the goodness of fit, or apply some sort of multi-dimensional Kolmogorov-Smirnov technique comparing the actual distribution of vectors and key passes with one where the key passes were distributed randomly. However, all these tests are a bit technical and hard to apply in high dimensions, and all in all we really want an indicator more than a model of “Expected Key Passes”. Here’s a simpler validating technique:

For each player, take his K (in my list K=10) closest neighbours and compute the standard deviation of their key passes. Once we’ve done this for every player, we can compute the mean of the standard deviation of key passes in each of these K-player “neighbourhoods” (let’s call it the ‘mean of neighbourhoods variation’ number, or MNV). If in each neighbourhood the players have a relatively similar number of key passes, then the MNV should be comparatively low. The important question is: what do we compare it to in order to know if it’s low or not?

I feel that there are two important numbers to compare this number to. The first would be simulating many (many) scenarios where the key passes are randomly permuted amongst the players and comparing the real MNV number to the average of these simulated cases. The second number would be the minimum MNV of any of the simulated scenarios. If the MNV of our actual vectorised representation is “low” in comparison to these simulated scenarios, then we know that the players’ layout in this imaginary 45-dimensional space clusters the key passers of the ball closer together than random distributions; which in turn would mean that the logic applied to obtain the list has a robust underlying reasoning because a player’s location in the 45-dimensional space should have something to do with his “key passing ability” (I fear I may have lost half the readers by this point…).
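For the technically inclined, the MNV and its two comparison numbers can be sketched as follows. The function names are illustrative, and a few details are my assumptions rather than anything stated above: Euclidean distance, a neighbourhood that excludes the player himself, and the population standard deviation.

```python
import math
import random
import statistics


def neighbourhoods(vectors, k):
    """For each player, his k closest other players by Euclidean distance."""
    players = list(vectors)
    return {
        p: sorted(
            (q for q in players if q != p),
            key=lambda q: math.dist(vectors[p], vectors[q]),
        )[:k]
        for p in players
    }


def mnv(neigh, values):
    """Mean over all neighbourhoods of the std dev of the neighbours' values."""
    spreads = [statistics.pstdev([values[q] for q in qs]) for qs in neigh.values()]
    return statistics.mean(spreads)


def permutation_baseline(neigh, values, n_sims, seed=0):
    """Average and minimum MNV when values are shuffled amongst the players."""
    rng = random.Random(seed)
    players = list(values)
    pool = [values[p] for p in players]
    sims = []
    for _ in range(n_sims):
        rng.shuffle(pool)
        sims.append(mnv(neigh, dict(zip(players, pool))))
    return statistics.mean(sims), min(sims)


# Toy layout: 20 players on a line, the last ten being the creative ones.
vectors = {i: [float(i)] for i in range(20)}
key_passes = {i: (0 if i < 10 else 100) for i in range(20)}
neigh = neighbourhoods(vectors, k=3)
real = mnv(neigh, key_passes)
average, lowest = permutation_baseline(neigh, key_passes, n_sims=200)
print(real < average)  # the real layout is tighter than a typical shuffle
```

The neighbourhoods depend only on the vectors, so they are computed once and reused for every shuffle, which keeps a large number of simulations affordable.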

Here are some results:

Of the 100,000 simulations, the lowest MNV was 14.62 while the actual MNV is 11.86. This means that if we randomly assigned the players to a position in the 45-dimensional space 100,000 times, none of those simulations has the key passers clustered together better than our actual passing motif representation. This is quite promising, but I suspected that it might simply be because the method clearly recognises the difference between defensive players and attacking players, and attacking players are much more likely to get more key passes. So I repeated the validation using only attacking midfielders, wingers and strikers:

The results are less overwhelmingly positive, but even when just looking at attacking players, the actual layout surpasses any random distributions of the players after 100,000 simulations. To appreciate the value of this method and what information this is actually giving us, let’s compare with an equivalent list drawn up using ‘goals’ to award points rather than ‘key passes’ (using only attacking players again for the same reason as before).

The MNV numbers are naturally smaller because players score far fewer goals per season than they make key passes, so the overall scale of the problem is smaller. We can see that even though the real MNV is smaller than the average of the simulations, it’s actually relatively large when compared to the minimum MNV obtained through random simulations (notice how important it is to have a frame of reference to know when the number is small and when it is large in each specific context). This means that random simulations can cluster the goalscorers together in the 45-dimensional space considerably better than the passing motif vectorised representation does. As opposed to the ‘key passes’ case, this vectorisation doesn’t encode much information pertaining to “goalscoring capability”. This actually makes sense though, since the passing motif methodology is designed using only information from the passing network, which doesn’t necessarily contain information regarding finishing. Therefore, the list made using ‘goals’ is much less reliable.

Coming back to Coutinho’s absence from the original list, it’s important to understand that I’m not claiming that the list is a know-it-all oracle for creative talent, or that this talent can be rigidly ordered. What this entry tried to show is that there is solid evidence that a player’s position in the 45-dimensional space determined by the passing motif methodology encodes a good amount of information which determines how many key passes he ought to have given a sort of “passing ability”. That doesn’t mean it encodes all the information. Perhaps the vectorised representation is missing out on what it is that makes Coutinho great. Nevertheless, once we’ve accepted and understood what the list will offer us, I doubt any club could claim that a list like this from different leagues from around the world is of no use to their organisation just because they might miss out on the Serbian League’s Coutinho (sadly, such is the ‘glass half empty’ prejudice that analytics faces).

Finally, this way of looking at the problem of rating players opens the door to a host of possibilities. When I was doing my bachelor’s in pure mathematics I was actually more interested in differential geometry and topology courses than statistics courses, which is why I tend to think of data observations as vectors in high-dimensional spaces and think that their positions in those spaces encode valuable information. This entry began by taking a vectorised representation (passing motif vectors) and established that if we look at the number of key passes each player made, the position of the players’ vectors in this space seemed to encode this info. On the other hand, it didn’t seem to encode the information pertaining to goalscoring. That isn’t to say it might not encode information regarding other metrics. Expected Assists maybe? It also doesn’t mean that other vector representations don’t encode some of this information better than my own passing motifs representation. It’s a bit of a 3-step thing really: 1. Find a vector representation, 2. Check what sort of information it seems to encode well (especially information that isn’t explicitly available elsewhere), and 3. Find a way to give players a rating using this fact.

I hope this way of thinking encourages other analysts out there to try their hand at this sort of work!
