Wednesday, 25 May 2016

Data Mining: Funneling Football Information

Data Mining is the process of discovering and extracting concrete knowledge from large data sets in a way that is understandable, compact and applicable to real-life problems. Fayyad, Piatetsky-Shapiro and Smyth (1996) succinctly refer to it as “pattern and knowledge discovery in databases”. A powerful image I like to use is that of Data Mining as a funnel of information: it takes in large amounts of data that no person could normally observe and understand in its entirety, and produces a compact, summarised version that presents the key knowledge in a way the human user can understand. That user can then claim to have based their decisions on as much information as was available.

You may not know it yet, but Data Mining methods are currently changing almost every industry in the world. Science is the most notable example, of course, but so are very “real-life”, day-to-day fields such as medicine, marketing and even street-light coordination to optimise traffic flow. Banks use Data Mining to decide whether to give someone a loan, and Facebook’s facial recognition software, which predicts which friend you want to tag in a picture, has Data Mining at its core. Even post offices use Data Mining methods to codify hand-written addresses, removing the need for a human to read them. The 4th season of Netflix’s House of Cards seems to imply that Data Mining applied to politics can win elections.

Football is one such industry that is beginning to feel its way around the new paradigms. A massive amount of data is being collected at the moment, with companies such as OPTA or Prozone recording millions of events, in-game statistics and other information on the game and the surrounding industry: far more data than a single person can look at, let alone decipher. Who can come to football’s aid?

The involvement of mathematical methods in the game has become a misleading, media-driven debate, not helped by the “Hollywood-isation” of Moneyball, which oversimplifies Data Mining into a magic crystal ball for discovering talent. Sceptics seem to think that bringing maths into the game will damage it, replacing the subtleties and the flowing, aesthetic nature of football with something radically different and rather cold, based on staring uncomprehendingly at thousands of numbers on spreadsheets or “number crunching” on calculators.

Let me assure you now: as a mathematician, I haven’t spent a single minute of my life staring at Excel spreadsheets of thousands of numbers or “crunching” numbers into a calculator (I don’t even own one). There’s no knowledge to be gained by doing these things. My brain cannot interpret raw numbers that simply, not without the help of much more elegant methods that know how to extract the information from these large databases and present it to me in compact and useful ways that I can actually understand.

In truth, mathematics is enhancing the game, not replacing anything. The wisdom and intuitive expertise of experienced football men, for example in recognising technically gifted young players or designing successful game tactics, can be studied and codified into these techniques so that we can build upon their knowledge. In fact, this is what most Data Mining techniques rely on: using previous successes, knowledge and expectations to derive their own criteria for what to expect, which is what methods such as supervised machine learning are ultimately all about.

However, just as valuable expertise gained organically can be integrated into these techniques to make them richer, it must also be said that human intuition is inevitably flawed and prone to mistakes. As a simplified example, the judgement of a football scout can be thrown off by a player’s good looks, or he can be subconsciously biased towards left-footed strikers, potentially causing him to produce inaccurate valuations. We all know this to be true of ourselves and of our judgements, and anyone who claims not to fall victim to biases while making decisions on a daily basis is not being honest with himself. Data Mining, on the other hand, has no pre-conceived ideas, no prejudices or biases of its own. I obviously cannot yet claim to understand the inner workings of football clubs nearly well enough to pass judgement on their performance, but a simple review of a handful of studies by respected academics in the football industry (Anderson and Sally, 2013; Kuper and Szymanski, 2009) seems to reveal that there are still plenty of inefficiencies to be addressed. I believe that methods that can identify and address these inefficiencies should be greeted with enthusiasm by those who love the game, not mistrust.
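For readers curious what "learning from previous judgements" can look like in practice, here is a minimal sketch of supervised learning, assuming a very simple nearest-neighbour approach. Everything in it is hypothetical and made up for illustration: the two player statistics, the numbers, the labels and the function name `predict` do not come from any real club or data provider.

```python
from collections import Counter
from math import dist

# Hypothetical, made-up examples for illustration only:
# (passes completed per 90 min, dribbles won per 90 min) -> label a scout gave in the past.
past_assessments = [
    ((62.0, 3.1), "promising"),
    ((58.0, 2.7), "promising"),
    ((55.0, 3.4), "promising"),
    ((31.0, 0.8), "not yet"),
    ((28.0, 1.1), "not yet"),
    ((35.0, 0.5), "not yet"),
]


def predict(features, examples, k=3):
    """Label a new player by majority vote among the k most similar past examples."""
    # Sort past examples by Euclidean distance to the new player's statistics.
    # (A real application would rescale the statistics first so that no single
    # one dominates the distance; this sketch skips that step.)
    nearest = sorted(examples, key=lambda ex: dist(ex[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # A new, unseen player described by the same two hypothetical statistics.
    print(predict((60.0, 2.9), past_assessments))
```

The method never receives a rule for what makes a player "promising"; it only sees past assessments and applies them to a new case. I chose nearest neighbours because it fits in a few lines, not because it is what any club necessarily uses.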

Data Mining is not necessarily a noisy new neighbour in town disrupting everyone’s way of life. It is not a revolutionary new way of doing things that pushes tradition off a cliff, but rather the gradual and natural continuation of humanity’s essential practice of interpreting information and creating knowledge to inform decision making, simply adapted to the 21st century, where globalisation and technological development exponentially increase the scale of available information. Our understanding of topics as subjective as human behaviour, for example, has been greatly enriched by this trend, with behavioural economics now dominating decisions in government policy, marketing, investment banking, etc. Football men now have many more sources of information than they have ever had before to inform their decisions and shape their actions. Arguably, so much information is available that it is beyond a single person to store, codify and aggregate it so that tangible recommendations can be drawn out, and this is why mathematics must be called into action to funnel the information. Opting against using mathematics to tap into the full potential of the knowledge held in these huge databases makes no sense, and those who make this decision will inevitably fall behind, missing out on the competitive advantage of knowing more than their opponents.

Many sceptics throw around phrases such as “numbers can’t tell you everything” and jump at the opportunity to single out instances where statistics-led approaches have failed, such as Damien Comolli’s transfer dealings while Director of Football at Liverpool. To them I would like to point out this: even Data Mining methods have a human component. There is a person who designs and implements the methods, and a person on the receiving end who interprets and makes use of the results. Applied poorly, these techniques can reveal nothing worthwhile; applied with creativity and skill, they can produce some truly revolutionary discoveries. Ultimately, football clubs must ensure they hire good mathematicians to tap into these benefits.

I do not like football any less for observing it through the lens of mathematics; on the contrary, each day I feel I gain a richer understanding of it that motivates and captivates me even more. I hope that this blog will do the same for you.


1. Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), pp. 27-34.

2. Anderson, C. and Sally, D. (2013). The Numbers Game: Why Everything You Know About Football Is Wrong. Penguin UK.

3. Kuper, S. and Szymanski, S. (2009). Soccernomics. New York: Nation Books.
