

Ever see all these "npxG/90 baselines" and wonder what on Earth is going on?

In this very late 2nd installment of my #FPL spready series I explain how I get my baselines





As well as being behind on threads, this was my motivation for investigating weighting:
https://twitter.com/FPLRoosta/status/1344726987655569410?s=19









Suppose Salah's baseline is 0.7 npxG/90. This means in an average PL fixture{3} I expect his npxG to be 0.7*M/90 if he plays for M minutes.
Per90 is the most accessible/popular, but in my spready I use exact mins{4} (includes injury time) & npxG/min
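A minimal sketch of the two ways of scaling a baseline by minutes. The function names are made up for illustration and the 0.7 figure is just the example above, not real data:

```python
def expected_npxg_from_per90(rate_per90: float, minutes: float) -> float:
    """Expected npxG in an average fixture from a per-90 baseline."""
    return rate_per90 * minutes / 90


def expected_npxg_from_per_min(rate_per_min: float, exact_minutes: float) -> float:
    """Same idea with a per-minute baseline and exact mins (incl. injury time)."""
    return rate_per_min * exact_minutes


# Example baseline of 0.7 npxG/90
full_match = expected_npxg_from_per90(0.7, 90)   # plays a nominal 90
subbed_off = expected_npxg_from_per90(0.7, 60)   # taken off on the hour
per_min_version = expected_npxg_from_per_min(0.7 / 90, 60)
```

Both versions agree when the minutes fed in are the same. The difference only shows up when exact mins differ from the nominal ones used to build the per-90 number.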



Often it won't matter, but if a player is usually subbed on or usually subbed off & their situation then changes so they play more/fewer mins, your data will be out

https://twitter.com/theFPLkiwi/status/1288400761760686081?s=19
The benefit is ~small but so is the extra effort



Baselines are estimates using past data for a player. I pull GW data from @FFScout then put npxG in line with @fbref{5}. Some other data sources:

https://twitter.com/uncertainty_pod/status/1309517194733248517?s=19
(give @uncertainty_pod a listen, especially if you want to do your own spready)



In "data" is npxG & mins for every player in every PL game since the start of 17/18 (36021 npxG data points).
In "fix_list" is team DEF strength (npxGA/game v an avg opponent) for both teams in every PL game

They are current beliefs of past strength


I'll cover these another time. They are similar to e.g. @rogue_wee (https://twitter.com/rogue_wee/status/1346867567114260481?s=19) except those are past beliefs of past strength, i.e. what he thought at the time from data before that game. Please don't spam him with questions as I believe he's taking a step back.


Let's use Salah as an example.
For each of the 131 LIV PL games since joining{7} I have:
d = Fixture difficulty (opponent DEF strength * home/away multiplier)
m = Exact mins
x = @StatsBomb npxG via fbref{6}
So I consider performance to be a function of x/(dm).



Let the i'th game's data be d_i, m_i, x_i for i=1,...,131.
Salah's overall performance (npxG/min vs an avg team) is:
Σx_i / Σd_i*m_i {8}.
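As a sketch of that sum in code. The three games below are invented numbers, and I'm assuming d is scaled so that 1.0 is an average fixture and a bigger d means more npxG is expected:

```python
# Each game is (d, m, x): fixture difficulty, exact mins, npxG.
# Invented numbers for illustration only.
games = [
    (1.2, 95.0, 0.9),
    (0.8, 93.0, 0.4),
    (1.0, 0.0, 0.0),   # did not play: adds nothing to either sum
]


def overall_rate(games):
    """npxG per minute vs an average team: sum(x) / sum(d * m)."""
    total_x = sum(x for _, _, x in games)
    total_dm = sum(d * m for d, m, _ in games)
    return total_x / total_dm


rate_per_min = overall_rate(games)
rate_per_90 = rate_per_min * 90
```

Note the ratio of sums is not the same as averaging x/(dm) game by game. Long appearances count for more, and games with zero mins drop out naturally.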
For a baseline we value recent data more highly, e.g. 20/21 > 17/18, so I apply a weighting w_i to each game


This gives:
Σx_i*w_i / Σd_i*m_i*w_i
(w_i>w_j for all i>j)
Many when starting (inc. me) apply the same weight to all games from the same part of a season, e.g. pre-covid w=0.2, post-covid w=0.5, 20/21 w=1.
Now I make w an exponential function of time! But why?



It means the relative weight of 2 games depends only on the time between them.
E.g. the Boxing Day games 2017-19:
w_57/w_20 = w_95/w_57 = c (a constant).
Let t_i be time in years{9} since game i. The final formula is:
Σx_i*c^-t_i / Σd_i*m_i*c^-t_i
(I use c=2)
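The final formula can be sketched like this. The tuple layout and function name are my own choice for illustration, and the two games are made up:

```python
def weighted_baseline(games, c=2.0):
    """games: iterable of (d, m, x, t), with t = years since the game.

    Returns sum(x * c^-t) / sum(d * m * c^-t), i.e. npxG per minute
    vs an average team with recent games weighted more heavily.
    """
    games = list(games)
    num = sum(x * c ** -t for _, _, x, t in games)
    den = sum(d * m * c ** -t for d, m, _, t in games)
    return num / den


# Two invented games exactly a year apart, both average fixtures.
two_games = [
    (1.0, 90.0, 1.0, 0.0),   # today
    (1.0, 90.0, 0.0, 1.0),   # one year ago, gets weight 1/c
]
with_decay = weighted_baseline(two_games, c=2.0)
no_decay = weighted_baseline(two_games, c=1.0)
```

With c=1 every weight is 1, so this collapses back to the plain Σx_i / Σd_i*m_i. With c=2 the year-old game counts half as much as today's.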



The goal now is to find the best c for predicting future npxG

I'll be using fbref npxG as it's the most predictive{10}, via @fbref scraped data{11} from @FF_Trout, who has been a great sounding board for this work along with @fplreview




I tried using all the data but the noise resulted in nonsense (-ve weighting) so...

I took the PL data (every match from 17/18 to now) & removed players with <1500 mins (weighting irrelevant) or <0.1npxG/game (wouldn't pick in FPL)


This leaves 11,717 rows of data, of which 7,236 have a player with >=1500 mins to form a baseline from.
I used my own fixture ratings (DEF strength * home advantage) as I don't know of any public DEF ones, but @FiveThirtyEight do have overall team ratings {12}.
Using the formula
Σx_i*c^-t_i / Σd_i*m_i*c^-t_i
for each performance I calculated a prior baseline & multiplied by mins & fixture to give an npxG prediction.
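A rough sketch of one such prediction. I'm assuming t is measured in years before the game being predicted, and the names and numbers are invented:

```python
def predict_npxg(prior_games, d, m, c=2.0):
    """Predict npxG for a game with fixture difficulty d and exact mins m.

    prior_games: (d_i, m_i, x_i, t_i) for games played BEFORE this one,
    with t_i = years between game i and the game being predicted.
    """
    prior_games = list(prior_games)
    num = sum(x_i * c ** -t_i for _, _, x_i, t_i in prior_games)
    den = sum(d_i * m_i * c ** -t_i for d_i, m_i, _, t_i in prior_games)
    prior_baseline = num / den          # npxG per min vs an average team
    return prior_baseline * d * m


# One invented prior game: 0.9 npxG in 90 mins of an average fixture.
history = [(1.0, 90.0, 0.9, 0.5)]
prediction = predict_npxg(history, d=1.2, m=60.0)
```

Only games before the one being predicted go into the baseline, otherwise the prediction would be peeking at its own answer.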
I did this for many values of c, and for each one calculated the rmse{13} of the 7,236 predictions.
I started with a wide range of c values and homed in on the best.
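A toy version of that search, as a sketch only. It runs on one invented player rather than the pooled PL data, the function names are mine, and it uses days/365 for years:

```python
import math


def rmse(errors):
    """Root mean squared error, where error = prediction - value."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def rmse_for_c(history, c, min_prior_mins=1500.0):
    """history: one player's games as (day, d, m, x), sorted by day."""
    errors = []
    for k, (day, d, m, x) in enumerate(history):
        prior = history[:k]
        if sum(pm for _, _, pm, _ in prior) < min_prior_mins:
            continue  # not enough mins yet to form a baseline
        weights = [c ** -((day - pday) / 365) for pday, _, _, _ in prior]
        num = sum(w * px for w, (_, _, _, px) in zip(weights, prior))
        den = sum(w * fd * pm for w, (_, fd, pm, _) in zip(weights, prior))
        errors.append(num / den * d * m - x)
    return rmse(errors)


def best_c(history, candidates):
    """Candidate c with the lowest rmse."""
    return min(candidates, key=lambda c: rmse_for_c(history, c))


# Invented player: 20 weekly games at 0.2 npxG, then 20 at 0.6.
toy = [(7 * k, 1.0, 90.0, 0.2 if k < 20 else 0.6) for k in range(40)]
coarse_best = best_c(toy, [1.0, 2.0, 4.0])
```

On this toy data the largest c wins because the player's rate jumps and never comes back. Real data is far noisier, which is why a second pass over a finer grid around the coarse winner is worth doing.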



I also considered variations on weighting:
- Extra fake days in between seasons due to big changes such as transfers
- Do this to the lockdown period instead of Aug 2020
- Include both lockdown & Aug
- Subtract days instead (as footballers aren't playing/training).
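One way these variations could be coded, as a hypothetical sketch. The break dates and the number of extra days below are placeholders I picked, not the values tested:

```python
from datetime import date


def effective_years(game_day, today, breaks, extra_days):
    """Time since a game in years, padded by fake days per break crossed.

    breaks: dates marking a break (e.g. a summer window or lockdown).
    extra_days: fake days added per break; negative subtracts instead.
    """
    crossed = sum(1 for b in breaks if game_day < b <= today)
    return ((today - game_day).days + crossed * extra_days) / 365


# Placeholder break date and padding, for illustration only.
summer_2020 = [date(2020, 8, 1)]
padded = effective_years(date(2019, 12, 26), date(2020, 12, 26), summer_2020, 100)
vanilla = effective_years(date(2019, 12, 26), date(2020, 12, 26), summer_2020, 0)
```

The padded time then replaces t_i in c^-t_i, so games from before a break get discounted more (or less, with negative padding) than the calendar alone would suggest.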








For this method and data, the result is clear - the lowest rmse is achieved with a decay rate of:
c = 2.1
This represents a rate of decay of 2.1 per year, or halving the weight of a game every ~340 days.
You can see this result in the graphs below
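The ~340 days figure follows from solving c^-t = 1/2 for t. A quick check, with a function name of my own and a 365-day year assumed:

```python
import math


def half_life_days(c, days_per_year=365.0):
    """Days for a game's weight c^-t to halve: t = ln(2) / ln(c) years."""
    return days_per_year * math.log(2) / math.log(c)


fitted = half_life_days(2.1)
previous = half_life_days(2.0)   # c=2 halves the weight every year
```

So moving from c=2 to c=2.1 shortens the half-life by a little over three weeks.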




The variations on weight did not make much difference, and all performed worse than the vanilla model.
While this matches my intuition, I'm surprised how closely! I would invite anyone with a different intuition to repeat this as a check against any possible bias.





On average, it is more predictive to use long term player data rather than the last few games or only including this season.
However, @analytic_fpl makes a point here{14} that FPL is a game of identifying outliers - so this result is to be used carefully



- the same position
- in the same formation
- with the same teammates
- for the same coach
- where they have "something to play for"
In general checking all these is difficult & noisy



In my model I will be setting c = 2.1 from now on. I hope to run a similar experiment later on team strengths, which will then allow me to check again at the player-level for the other top 5 leagues. I can then check assists & other points-scoring actions




{1} Not notes about feet.
{2} Pens done separately.
{3} Average over all fixtures including v LIV despite this being impossible for Salah.
{4} May be elsewhere - I use #FFScout membership https://www.fantasyfootballscout.co.uk/2020/04/20/new-per-90-stat-available-in-ffscout-members-area/
{5} https://fbref.com/en/comps/9/stats/Premier-League-Stats
{6} I get more precise values than the 1d.p. for each game on fbref by taking snapshots of the 2d.p. per90 player numbers.
{7} Includes games Salah didn't play.
{8} "Σ" means sum. This method values 3 goals vMCI & 0vWBA the same as 3vWBA & 0vMCI (as both score the same pts). Don't know whether this is more predictive, so could be investigated another time.
{9} Due to leap years I use days in my spready, but won't overcomplicate.
{10} Many examples e.g. @thesignigame: https://twitter.com/thesignigame/status/1341050217467142152?s=19
{11} https://twitter.com/FF_Trout/status/1347668718856368130?s=19
{12} https://github.com/fivethirtyeight/data/tree/master/soccer-spi
{13} Rmse = root mean squared error - a standard tool to evaluate predictions. Error = prediction - value.
{14} https://twitter.com/analytic_fpl/status/1344655689944281096?s=19
Thanks to:
@FF_Trout
@fplreview
@analytic_fpl
@FiveThirtyEight
@thesignigame
@fbref
@FFScout
@StatsBomb
@wee_rogue
@uncertainty_pod
@FPLRoosta
Kiwi out

