This is a Twitter series on #FoundationsOfML. Today, I want to talk about another fundamental question:
What makes a metric useful for Machine Learning?
Let's take a look at some common evaluation metrics and their most important caveats...



Remember our purpose is to find some optimal program P for solving a task T, by maximizing a performance metric M using some experience E. https://twitter.com/AlejandroPiad/status/1348840452670291969?s=20
We've already discussed different modeling paradigms and different types of experiences.
But arguably, the most difficult design decision in any ML process is which evaluation metric(s) to use.

There are many reasons why choosing the right metric is crucial.
If you cannot measure progress, you cannot objectively decide between different strategies.
This is true when solving any problem, but in ML the consequences are even bigger:



Remember that "solving" a problem in ML is actually about *searching*, among all possible models, for the one that maximizes a given metric. If the metric is poorly chosen, the search will happily optimize for the wrong thing.

Let's talk about some common metrics, focusing on classification for simplicity (for now):


The first one is Accuracy: the fraction of elements that are classified correctly. It's probably the most commonly used metric in the most common type of ML problem.



Accuracy implicitly assumes that all errors are equally costly, which is often not the case. It can be far worse to tell a sick person to go home than to tell a healthy person to take the treatment (depending on the treatment, of course).

You can get >99% accuracy if you just tell everyone you find on the street that they don't have COVID.
The problem with Accuracy is that it smooths away different types of errors under the same number.
If you care about one specific class more than the rest, measure Precision and Recall instead, as they tell you more about the nature of the mistakes you're making.
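
Here's a minimal sketch of that trap (assuming scikit-learn is available; the screening numbers are made up):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical screening data: 1 = sick, 0 = healthy.
# Out of 1000 people, only 5 are actually sick (heavily imbalanced).
y_true = [1] * 5 + [0] * 995

# A "classifier" that tells everyone they are healthy.
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))                    # 0.995 -> looks great
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0   -> never right about the sick
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0   -> finds none of the sick
```

Accuracy alone would have you ship that classifier.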



In general, there are two types of errors we can make when deciding the category C of an element:
We can say an element belongs to C when it doesn't (a false positive, or type I error).
We can fail to say an element belongs to C when it does (a false negative, or type II error).
How about we measure both separately:
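
A rough sketch of both, in plain Python, using hypothetical counts for a single class C:

```python
# Hypothetical counts for one class C:
tp = 40  # true positives:  predicted C, actually C
fp = 10  # false positives: predicted C, actually not C (type I errors)
fn = 20  # false negatives: predicted not C, actually C (type II errors)

# Precision: of everything we labeled as C, how much really was C?
precision = tp / (tp + fp)  # 0.8

# Recall: of everything that really was C, how much did we find?
recall = tp / (tp + fn)     # ~0.67
```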





By looking at these metrics separately, you can better identify what kind of error you're making.
If you still want a kind of average that weights both, you can use the F-Measure, which allows you to prioritize precision vs recall to any desired degree.
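
A small sketch of the general F-measure (the F-beta score), reusing the hypothetical precision and recall values from above; beta > 1 favors recall, beta < 1 favors precision, and beta = 1 gives the usual F1:

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.8, 0.67))             # F1   ~ 0.73 (balanced)
print(f_beta(0.8, 0.67, beta=2.0))   # F2   ~ 0.69 (recall weighs more)
print(f_beta(0.8, 0.67, beta=0.5))   # F0.5 ~ 0.77 (precision weighs more)
```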

Precision and Recall are also very intuitive to interpret, but they still don't tell us the whole story.
When we have more than two categories, we can fail at any one of them by confusing it with any other. Here, again, per-class precision and recall are too coarse: they don't tell us *which* classes we're confusing.


That's where the confusion matrix comes in: a table that counts, for each true category, how many elements were predicted in each of the categories. It looks something like this:
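
For instance, a tiny sketch with scikit-learn and a made-up three-class problem:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem.
y_true = ["cat", "cat", "cat", "dog", "dog", "bird", "bird", "bird", "bird"]
y_pred = ["cat", "cat", "dog", "dog", "cat", "bird", "bird", "cat", "bird"]

print(confusion_matrix(y_true, y_pred, labels=["cat", "dog", "bird"]))
# Rows are true classes, columns are predicted classes:
# [[2 1 0]    <- cats:  2 right, 1 mistaken for a dog
#  [1 1 0]    <- dogs:  1 mistaken for a cat, 1 right
#  [1 0 3]]   <- birds: 1 mistaken for a cat, 3 right
```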

Accuracy, precision, and recall are easy to compute from the confusion matrix (I'll leave you that as an exercise).



The story we've seen here is common all over Machine Learning.
We can have simple, high-level, interpretable metrics that hide away the nuance.
Or we can have low-level metrics that paint a fuller picture but require more effort to interpret.
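
To make the contrast concrete, a quick sketch (scikit-learn assumed, same made-up labels as before): the same predictions summarized as one number versus a per-class breakdown:

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels for a 3-class problem.
y_true = ["cat", "cat", "cat", "dog", "dog", "bird", "bird", "bird", "bird"]
y_pred = ["cat", "cat", "dog", "dog", "cat", "bird", "bird", "cat", "bird"]

# High-level: one easy-to-read number that hides the details.
print(accuracy_score(y_true, y_pred))  # ~0.67

# Low-level: per-class precision, recall, and F1, which take longer to digest.
print(classification_report(y_true, y_pred))
```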


There is a lot more to tell about metrics and evaluation in general, and we've just focused on a very small part of the problem.
Some of the issues that need to be kept in mind:





Almost every alignment problem in AI can be traced back to a poorly defined metric. For example, maximizing engagement is arguably a large part of the reason why social media is as broken as it is.

