I was an eng leader on Facebook’s NewsFeed and my team was responsible for the feed ranking platform.
Every few days an engineer would get paged that a metric, e.g., “likes” or “comments,” was down.
It usually translated to a Machine Learning model performance issue. /thread
2/ The engineer’s typical workflow was to first check our internal monitoring system, Unidash, to confirm the alert was real, and then dive into Scuba to investigate further.
3/ Scuba is a real-time analytics system that stored all the prediction logs and made them available for slicing and dicing. It only supported filter and group-by queries and was very fast.
https://research.fb.com/wp-content/uploads/2016/11/scuba-diving-into-data-at-facebook.pdf
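To make that query model concrete, here is a minimal sketch of the kind of filter and group-by query engineers would run over prediction logs. This is not Scuba’s actual API; it is just an illustration in pandas with made-up column names.

```python
import pandas as pd

# Hypothetical prediction-log rows; real Scuba tables had far more columns.
logs = pd.DataFrame({
    "event":      ["like", "like", "comment", "like", "like"],
    "country":    ["US", "IN", "US", "BR", "IN"],
    "story_type": ["photo", "video", "photo", "link", "video"],
    "count":      [120, 95, 40, 60, 80],
})

# Scuba-style query: filter on an attribute, then group by and aggregate.
likes_by_country = (
    logs[logs["event"] == "like"]          # filter
    .groupby("country")["count"].sum()     # group by + aggregate
    .sort_values(ascending=False)
)
print(likes_by_country)
```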
4/ Engineers would load up the Scuba dashboard for a given time window and start slicing data on a variety of attributes.
For example - Are likes down for all types of news feed stories? Are they down only within a particular country or region?
5/ If I were on-call and I got an alert that likes dropped by a stat-sig amount, the first thing I would do is go into Scuba.
I would zoom into likes over the last day, compare them with the same period last week, and add filters like country, etc., to find out which slice had the biggest deviation.
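As a rough pandas sketch of that workflow (not our actual tooling, and the counts and country slices are made up): compare yesterday’s like counts against the same day last week and rank countries by deviation.

```python
import pandas as pd

# Hypothetical per-country like counts for two comparable windows.
yesterday = pd.Series({"US": 900, "IN": 450, "BR": 300}, name="yesterday")
last_week = pd.Series({"US": 1000, "IN": 460, "BR": 310}, name="last_week")

slices = pd.concat([yesterday, last_week], axis=1)
slices["pct_change"] = (slices["yesterday"] - slices["last_week"]) / slices["last_week"]

# The slice with the biggest drop is the first place to look.
print(slices.sort_values("pct_change"))
```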
6/ Most Machine Learning model performance issues stemmed from data pipeline problems.
For example, a developer introduced a logging bug that sent bad feature data to the model, or a piece of the data pipeline broke because of a system error.
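A hedged sketch of the kind of check that catches such issues: flag any feature whose null rate suddenly spikes relative to a baseline. The threshold and column names here are illustrative, not what Facebook used.

```python
import pandas as pd

def null_rate_alerts(baseline: pd.DataFrame, current: pd.DataFrame, max_increase: float = 0.05):
    """Return features whose null rate rose by more than `max_increase` vs. baseline."""
    alerts = {}
    for col in baseline.columns:
        base_rate = baseline[col].isna().mean()
        curr_rate = current[col].isna().mean()
        if curr_rate - base_rate > max_increase:
            alerts[col] = (base_rate, curr_rate)
    return alerts

# Toy example: a logging bug starts dropping the "age" feature.
baseline = pd.DataFrame({"age": [25, 30, 41], "clicks": [3, 7, 1]})
current  = pd.DataFrame({"age": [None, None, 38], "clicks": [2, 5, 4]})
print(null_rate_alerts(baseline, current))  # {'age': (0.0, 0.666...)}
```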
7/ Another set of issues came from ML models that had not been updated for a while even as user behavior changed.
This usually resulted in the on-call opening a ticket for the model owner to retrain the model.
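One common way to decide a model has gone stale (my illustration, not necessarily how the team did it) is to compare the distribution of recent scores or features against the training-time distribution, e.g., with a population stability index:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a recent sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_scores  = rng.beta(2, 5, 10_000)   # score distribution at training time
recent_scores = rng.beta(3, 4, 10_000)   # user behavior has shifted
drift = psi(train_scores, recent_scores)
print(drift, "-> file a retraining ticket" if drift > 0.2 else "-> OK")
```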
8/ Facebook continuously retrained some models, and these models had reproducibility challenges because they were updated every few hours.
9/ Another big use case for Scuba was champion/challenger testing.
Engineers would run lots of A/B tests and use Scuba metrics to figure out which model was performing best before bringing that model’s dashboard to a launch review.
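A minimal sketch of that comparison, assuming per-variant like and impression counts; the two-proportion z-test here is just an illustration, not Facebook’s launch criteria.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B test counts: likes out of impressions for champion vs. challenger.
likes       = [10_400, 10_950]
impressions = [1_000_000, 1_000_000]

stat, p_value = proportions_ztest(count=likes, nobs=impressions)
print(f"challenger like rate: {likes[1] / impressions[1]:.4%}, p-value: {p_value:.4f}")
```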
10/ Finally, all of this was enabled by a fantastic set of explainability tools that helped us debug models both during experimentation and in production.
Some of these tools were integrated into internal and external versions of the Facebook app. https://about.fb.com/news/2019/03/why-am-i-seeing-this/
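Those tools are internal, but as a toy illustration of a per-prediction explanation: for a linear ranking score, each feature’s contribution is its weight times its value. The feature names and weights below are invented; real feed models and the “Why am I seeing this?” feature are far more sophisticated.

```python
# Toy linear "ranking model": score = sum(weight * feature_value).
weights  = {"friend_affinity": 2.0, "content_freshness": 1.2, "past_like_rate": 3.5}
features = {"friend_affinity": 0.8, "content_freshness": 0.1, "past_like_rate": 0.6}

contributions = {name: weights[name] * features[name] for name in weights}
score = sum(contributions.values())

# Rank features by how much each contributed to this one prediction.
for name, contrib in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>18}: {contrib:+.2f} ({contrib / score:+.0%} of score)")
```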
11/ Monitoring, analysis, and explainability of models are must-haves for teams that want to operationalize ML at scale and in a trustworthy manner.
This is why we created @fiddlerlabs. #MLOps #Monitoring #ExplainableAI