## Monday 31 October 2016

### Distinguishing Quality from Random Noise: How do we know we’re getting valuable information?

One of the main challenges of football analytics is ensuring that our manipulation of the available data actually uncovers underlying “qualities” of teams and players, instead of just picking up statistical noise or irrelevant facts. I could certainly use the available data to assign a number to each player by summing his blocked shots plus the square root of his headed shots inside the area, divided by the goal difference his team obtained with him on the pitch, multiplied by his number of interceptions. Could I use this number in any way to advise a club on whether they should buy him? Probably not. So how can I know what is valuable?

Recall from the previous entries on team passing motifs that a main reason I claimed the methodology was picking up a stable quality of passing style was precisely its stability across consecutive seasons. If the methodology were just randomly assigning motif distributions, there would surely be no consistency between different seasons.

The implication then is this: if a certain vectorisation of the data is in some sense “stable” across seasons, then that vectorisation represents an underlying quality of the observations. Metrics intended to measure qualities which one would expect to persist over seasons, such as “playing style” or “potential”, should be possible to validate in this way.

The question, then, is what the details of this validation would look like. In this entry, I’ll go through a “validating methodology” that I’ve been working on lately:

Take a vector representing a team or a player for a given season (something like the 5-dimensional vector representing a team in the passing motifs methodology). If my reasoning above is correct and the vector contains valuable information about that player/team, then the equivalent vector for the season directly before should in theory be in some sense “close” to it. The “closeness” of two vectors is of course a relative concept, so it should be measured relative to the average distance between any pair of vectors.

As an example: If Juan Mata’s vector for 2014-15 is at a distance of 2.3 from his vector for 2015-16, and on average the distance between any two player vectors (not necessarily from the same player) in this context is 9.5, then we can say with reasonable certainty that Juan Mata’s vectors are “close” to each other.
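To make the idea concrete, here is a minimal sketch in Python of this relative-closeness check. The vectors and the population below are made-up illustrative numbers, not real passing-motif data:

```python
import numpy as np

# Hypothetical 5-dimensional player vectors for two consecutive
# seasons (illustrative values, not real data)
v_season_1 = np.array([0.8, 1.2, 0.5, 2.1, 0.9])
v_season_2 = np.array([1.0, 1.1, 0.7, 2.3, 0.8])

# Distance between the same player's two season vectors
same_player_dist = np.linalg.norm(v_season_1 - v_season_2)

# A made-up population of 50 player vectors, providing the frame
# of reference: the mean distance between ANY two vectors
all_vectors = np.random.default_rng(0).normal(size=(50, 5))
diffs = all_vectors[:, None, :] - all_vectors[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))
mean_pairwise = dists[np.triu_indices(50, k=1)].mean()

# The player's vectors are "close" only relative to this average
print(same_player_dist, mean_pairwise)
```

The point is that `same_player_dist` on its own is meaningless; only the comparison against `mean_pairwise` tells us whether the two season vectors are unusually close.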

The method I wrote takes as parameters the two vectorisation matrices for two consecutive seasons, normalises them, considers only players who have played at least 18 matches in each season, and prints out the following:

Here’s what we want to look at in this table: first of all, the lower the mean distance between the two vectors of each player, the better our methodology is doing according to the reasoning above. However, “low” is a very relative concept, so we need a reference against which to measure how low this number actually is. The methodology provides two such references:

• The first and most important one is the mean distance between all pairs of vectors, not just between the two corresponding to each player. This gives us an idea of how far apart any two vectors in this context are, and whether the “closeness” of vectors of the same player is significant. Z-Score 1 is the mean distance between the vectors of the same player minus the mean distance between all vectors, divided by the standard deviation of the distances between all pairs of vectors. The lower this number (accepting negative values, of course), the better.
• The second reference provided is the mean distance between simulated Gaussian vectors of the same dimension as the problem. Z-Score 2 is the mean distance between the vectors of the same player minus the mean distance between the simulated Gaussian vectors, divided by the standard deviation of the distances between the Gaussian vectors. I feel this is also an important frame of reference because it gives a measure of just how “normalised” the scaled problem is. It also provides important “dimensional” context: if our vectorisation is in 15 dimensions rather than 5, the raw distances will increase, but this does not necessarily mean the higher-dimensional vectorisation is less valuable; the numbers we deal with in higher dimensions are simply naturally larger, and we need to account for this to know how “low” our mean distance number really is. Hence the importance of the Z-Scores.
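Putting the two references together, the validation could be sketched as follows. This is my own minimal reading of the method described above: the function name, the exact normalisation (zero mean, unit variance per feature), and the Gaussian simulation size are assumptions, and it presumes the at-least-18-matches filter has already been applied:

```python
import numpy as np

def validation_z_scores(season_a, season_b, n_sim=10000, seed=0):
    """Rows are players (the same player occupies the same row in
    both matrices), columns are the vectorisation's features."""
    rng = np.random.default_rng(seed)

    # Normalise each feature to zero mean and unit variance across
    # both seasons, so no single dimension dominates the distances
    stacked = np.vstack([season_a, season_b])
    stacked = (stacked - stacked.mean(axis=0)) / stacked.std(axis=0)
    a, b = np.split(stacked, 2)

    # Mean distance between each player's two season vectors
    same_player = np.linalg.norm(a - b, axis=1)

    # Distances between ALL pairs of vectors, regardless of player
    diffs = stacked[:, None, :] - stacked[None, :, :]
    all_d = np.sqrt((diffs ** 2).sum(axis=-1))
    all_pairs = all_d[np.triu_indices(len(stacked), k=1)]

    # Distances between pairs of simulated standard Gaussian vectors
    # of the same dimension, as a dimensional frame of reference
    dim = stacked.shape[1]
    g = rng.normal(size=(n_sim, 2, dim))
    gauss = np.linalg.norm(g[:, 0] - g[:, 1], axis=1)

    z1 = (same_player.mean() - all_pairs.mean()) / all_pairs.std()
    z2 = (same_player.mean() - gauss.mean()) / gauss.std()
    return z1, z2
```

With a vectorisation that really captures a stable quality, both returned Z-Scores should come out clearly negative: same-player distances sit well below the all-pairs average and below what pure chance (the Gaussian reference) would produce.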

This entry was a bit technical and perhaps less interesting for the average football fan, but I thought it was important to explain because it’s what I’ve been using to work out how best to translate the passing motifs methodology to a player context. I’m looking to follow up this entry with an applied example comparing different vectorisations of passing motifs at the player level very soon (2-3 days hopefully, if I can find the time), so stay tuned!