NannyML – Shaping the Future of Model Supervision

Machine learning models can excel at extracting insights from data, leading to better recommendations, better churn prediction, and better business decisions. But how do you know your model performs well post-deployment when dealing with new real-world data? 

As models get deployed everywhere, data scientists – and more so decision-makers – need to know when they can trust a model prediction and when it has gone off track. And that’s when they need a “nanny” to take care of their models! NannyML detects silent model failures that happen post-deployment. 

Founded by Hakim Elakhrass, Wojtek Kuberski, and Wiljan Cools, NannyML raised a 1M Euro pre-seed round in October 2020, led by Lunar Ventures and Volta Ventures, with the participation of prominent angels, including Stijn Christiaens (Co-Founder of Collibra), Jonathan Cornelissen (Co-founder of Datacamp), Lieven Danneels (CEO of Televic), and others – to provide supervision for machine learning models. 

Learn more about the future of model supervision from our interview with the CEO Hakim Elakhrass:

Why did you start NannyML?

My co-founders and I initially met at a hackathon in 2015 – we simply stayed friends, not knowing that we would found a startup together later on. Then in 2018, my co-founder Wojtek and I started a machine learning consultancy, primarily helping with deploying machine learning models in the real world. 

Our consultancy grew quickly – with projects across various industries – and Wiljan joined us in the summer of 2019. We focused on models in production, where “production” could mean many different things: from a command-line tool to delivering a desktop machine running the machine learning model to the client. 

Most of the value of machine learning only materializes once a model is used in production. But that’s also where most of the effort goes: maintaining models in production. And how do you know you can trust a model? We couldn’t find any satisfactory solution to this problem – so we decided to tackle it ourselves and build a product more scalable than pure consulting.

How does it work?

Model performance changes for two reasons: either a bug (a coding issue or a data quality problem) or a change in the underlying system that generates the data. NannyML focuses on the second case: when the distribution that generates the data changes. Here, the questions are: How does the data change? And is it a material change?
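As a toy illustration of the first question, a two-sample Kolmogorov–Smirnov test can flag when a feature’s distribution has shifted between a reference window and production data. The sketch below is a minimal pure-Python version for illustration only; it is not NannyML’s implementation, and real drift detection involves chunking, thresholds, and multivariate methods.

```python
def ks_statistic(reference, production):
    """Two-sample KS statistic: max distance between empirical CDFs."""
    all_values = sorted(set(reference) | set(production))
    max_dist = 0.0
    for v in all_values:
        # Fraction of each sample <= v (empirical CDF at v)
        cdf_ref = sum(1 for x in reference if x <= v) / len(reference)
        cdf_prod = sum(1 for x in production if x <= v) / len(production)
        max_dist = max(max_dist, abs(cdf_ref - cdf_prod))
    return max_dist

# Identical distributions -> statistic 0; fully shifted -> statistic 1
reference = [0.1, 0.2, 0.3, 0.4, 0.5]
shifted = [1.1, 1.2, 1.3, 1.4, 1.5]
print(ks_statistic(reference, reference))  # 0.0
print(ks_statistic(reference, shifted))    # 1.0
```

A large statistic says the feature distribution moved, but – as the interview stresses – not every shift is material, which is why drift alone is not enough.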

For example, if a model detects customer churn, you want to know when customer behavior changes and prevent your churn from going through the roof. But customer behavior is volatile and extremely noisy, so we knew from the beginning that just looking at data variability wouldn’t cut it. 

Also, you can only benchmark a model against historical data. This means that to calculate the actual model performance, you would need to know the truth about what has happened: Did the customer actually churn? We need to estimate model performance before that ground truth becomes available.

Estimating model performance without ground truth is a hard data science problem in its own right, and one we had to research ourselves. We found a way to estimate model performance directly and robustly, and defined a material change as a change in the input data that translates into a change in the estimated model performance. 
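The core intuition behind estimating performance without labels can be sketched as follows: if a binary classifier’s predicted probabilities are well calibrated, each probability already tells you how likely the prediction is to be correct, so expected accuracy can be computed from the scores alone. This is a deliberately simplified sketch of that idea; the function name is ours, the calibration assumption is strong, and NannyML’s actual estimator handles more metrics and more nuance.

```python
def estimate_accuracy(predicted_probas, threshold=0.5):
    """Estimate expected accuracy from calibrated probabilities alone.

    If p is the calibrated probability of the positive class, the
    prediction (p >= threshold) is correct with probability p when we
    predict positive, and with probability 1 - p when we predict negative.
    No ground-truth labels are needed.
    """
    expected_hits = 0.0
    for p in predicted_probas:
        expected_hits += p if p >= threshold else 1.0 - p
    return expected_hits / len(predicted_probas)

# Confident, calibrated scores imply high expected accuracy ...
print(estimate_accuracy([0.95, 0.05, 0.9, 0.1]))  # 0.925
# ... while scores near 0.5 imply performance close to a coin flip
print(estimate_accuracy([0.55, 0.45, 0.5, 0.6]))  # 0.55
```

Tracking this estimate over production data – and comparing it against the value on a reference set – is one way to surface a “material” change before any labels arrive.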

As of now, we focus on tabular data. But in the future, we may target NLP as well: The stability of machine learning models goes from very stable e.g. for computer vision (a dog will always look like a dog), to very unstable e.g. for tabular data (capturing noisy consumer behavior). NLP may be somewhere in between, where the process that generates language may be volatile. 

How did you evaluate your startup idea?

From our experience in consultancy, we know that machine learning is going to be omnipresent and that model monitoring will play an essential role. And monitoring is also just a small part of it – it’s really about post-deployment data science.

We didn’t rigorously evaluate the market size; we had the gut feeling it would be large enough. However, we got some estimates from how much companies spent in other complex industries on maintenance, e.g. the maintenance of an airplane, and then took a geometric average across different industries.

Our vision is to go from machine learning to automated systems in general and link their performance to business impact. We believe all enterprises will operate in the future like a quant hedge fund: Evaluating tons of data to derive business decisions. 

Companies make decisions based on complex models, but most don’t have a risk department. With NannyML, we give them the ability to do what financial institutions have long done – and give superpowers to decision-makers. 

Who should contact you?

We have just launched on Product Hunt, and we’re still looking for people to join the beta program for our open-source library that estimates real-world model performance. 

Please check out our GitHub – especially if you’re dealing with binary or multi-class classification problems. We’re looking for design partners and teams interested in model performance who would like to collaborate. 

We’re happy to talk to investors and potential users of NannyML at any time. Feel free also to join our Slack community.

Further Reading

The Curse of Delayed Performance – Great post by NannyML with lots of further resources.

How to Detect Silent Model Failures? – A technical talk by Wojtek Kuberski, NannyML’s CTO, at PyData Global.

Predict your model’s performance (without waiting for the control group) – An article about NannyML’s algorithm by Samuele Mazzanti on Towards Data Science. 

Intro to Post-deployment model performance – Article by Gema Parreno Piqueras on Medium, published in MLearning.Ai. 

Monitoring AI in Production: Introduction to NannyML – Medium post by Adnan Karol.

Open-Source Spotlight – Wiljan Cools, NannyML’s co-founder, speaking at DataTalks.Club about NannyML.

AI monitoring company NannyML raises €1M – Press release on NannyML’s pre-seed round by Start it @KBC – Belgium’s largest accelerator program.

Detecting and Correcting for Label Shift with Black Box Predictors – An arXiv paper that mimics real-world scenarios where you do not have access to labels immediately after making the prediction.

Estimating the performance of an ML model in the absence of ground truth – Article by Eryk Lewinson on Towards Data Science. 

Estimating model performance without ground truth – Article by Michał Oleszak on Medium.