The work to nurture a great ML product

What’s really needed to build an ML product?

Catherine Breslin
Feb 16

Problem identification

When building any product, whether or not it includes ML, the first step is to identify the problem you’re trying to solve. ML is a great tool for solving some problems, but there are many where it’s best to start simpler.

In this post, let’s consider working for a company building a hypothetical product for automatically transcribing university lectures. We’re going to build an automatic speech recognition (ASR) system which is tuned to work well for lectures — this is something that definitely needs machine learning at its core. The product team have decided to start small and focus initially on just Physics lectures as a proof of concept.

I’ve built many ASR systems, and the story below is loosely based on real situations and challenges that came up.

Data acquisition

Once it’s decided that machine learning is the way to go, we need to find the right data to use for training a model. In the speech recognition field there are some publicly available datasets, but many of them have non-commercial licenses. Also, they’re things like audiobook narration and telephone conversations between friends, which are not at all matched to our Physics lectures. The vocabulary is different, the microphones are mismatched, and the style of speech is quite unlike a lecture. We need to commission or buy more suitable audio data.

After some internal back-and-forth to align on what’s needed, we agree on what kind of data we’d like and how much of it to collect. We track down a corpus that we’re able to license, and convince our boss that it’s worth paying for. It needs transcribing though, so we spend some more time figuring out the right transcription guidelines and waiting for the transcription work to be completed.

Data cleaning

The data we’ve acquired is mostly good, but a quick review shows a few mistakes in the transcriptions. We notice that some of the transcription guidelines were ambiguous and need reworking. There are also some systematic mistakes that need fixing: the transcribers weren’t Physics experts and so got a few technical terms wrong, which we can correct in bulk with a script. The script is short, so it sits somewhere alongside the data on S3 so we don’t forget about it.
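
For a sense of scale, that script might be little more than a handful of regular expressions. Here’s a minimal sketch, where the corrections and file paths are made up for illustration:

```python
# A minimal sketch of the bulk-correction script. The corrections and
# file paths below are made up for illustration.
import re
from pathlib import Path

# Technical terms the transcribers commonly got wrong -> corrected form
CORRECTIONS = {
    r"\bhiggs bosun\b": "Higgs boson",
    r"\bplank constant\b": "Planck constant",
    r"\bquark's\b": "quarks",
}

def clean_transcript(text: str) -> str:
    """Apply every correction pattern across a whole transcript."""
    for pattern, replacement in CORRECTIONS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

if __name__ == "__main__":
    for path in Path("transcripts").glob("*.txt"):
        cleaned = clean_transcript(path.read_text())
        path.with_suffix(".cleaned.txt").write_text(cleaned)
```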

Domain expertise

There are some requirements about the system that we’ll need to know from domain experts. For example, how close to real-time does our automated transcription need to be? Should the output have any punctuation or formatting applied, to make it easier to read? These requirements affect the design of the system.

We plan to go ahead with a system that doesn’t work in real-time, but processes the audio after the lecture has finished, and applies some minimal formatting to make the automated transcript more readable.

Model building & evaluation

Now that our data is ready, we can get into the real machine learning and build the model! There are plenty of open source toolkits available that we’ll use, to save us writing too much of our own code. The data is split into train/dev/test sets, we build the model, evaluate the word error rate (WER), and do some hyperparameter tuning to find the best model candidate.
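
For anyone unfamiliar with WER: it’s the word-level edit distance between the reference transcript and the system output, divided by the number of words in the reference. The toolkits report it for us, but it’s simple enough to sketch:

```python
# Word error rate = word-level edit distance between reference and hypothesis,
# divided by the number of words in the reference.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ('bosun') and one deletion ('in') over 7 reference words ≈ 0.29
print(word_error_rate("the higgs boson was discovered in 2012",
                      "the higgs bosun was discovered 2012"))
```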

But the training is taking too long and we’re wasting time waiting for it to complete. We decide to parallelise the model training across multiple GPUs, which means updating the training code to support parallel training and spinning up a cluster of GPUs to use.
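
Assuming a PyTorch setup, parallel training typically means wrapping the model in DistributedDataParallel and launching one process per GPU with something like torchrun. A rough sketch, with a placeholder model and dataset standing in for the real ASR system:

```python
# Minimal sketch of multi-GPU data-parallel training with PyTorch DDP.
# Launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
# The model and dataset here are placeholders, not the real ASR system.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(80, 40).cuda(local_rank)   # stand-in for the ASR model
    model = DDP(model, device_ids=[local_rank])
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)

    dataset = TensorDataset(torch.randn(1024, 80), torch.randn(1024, 40))
    sampler = DistributedSampler(dataset)              # shards the data across processes
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(3):
        sampler.set_epoch(epoch)
        for features, targets in loader:
            features, targets = features.cuda(local_rank), targets.cuda(local_rank)
            loss = torch.nn.functional.mse_loss(model(features), targets)
            optimiser.zero_grad()
            loss.backward()                            # gradients sync across GPUs here
            optimiser.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```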

We get the model training to speed up, but now the GPU machine has crashed! It needs restarting, but unfortunately it’s in the basement of the US office and it’s the middle of the night there. We spend the next few days twiddling our thumbs and writing documentation until the GPU machines are operational again.

A new version of PyTorch is released. We want to upgrade to take advantage of a new feature, but a few code changes are needed and the right tests aren’t in place to be sure we aren’t breaking anything. After taking some time to improve the test suite, we’re confident in the upgrade and can go ahead.
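
The first tests we add are simple smoke tests that load a model and check its output is sane, so an upgrade that silently changes behaviour gets caught early. A sketch, where the model-loading helper and model path are hypothetical:

```python
# A smoke test written before the PyTorch upgrade, so a change in behaviour
# shows up immediately. The model-loading helper and path are hypothetical.
import torch
from our_asr.model import load_model   # hypothetical helper in our codebase

def test_forward_pass_is_sane():
    model = load_model("models/physics-asr-latest.pt")   # hypothetical model path
    model.eval()
    dummy_features = torch.randn(1, 500, 80)   # (batch, frames, mel bins)
    with torch.no_grad():
        output = model(dummy_features)
    assert output.shape[0] == 1
    assert torch.isfinite(output).all()
```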

Finally, our model is ready to deploy, though it took us a bit longer than anticipated. To wrap up the work, we write a report characterising the performance across different dimensions and demographics, and give the sales team some material for setting expectations with customers.

Run-time infrastructure

With our model ready to use, we need to build out a runtime engine and integrate it with the rest of the product. We can reuse a lot of the open source code from training, but there are a few issues to address first. We need to add some specific error handling so that the system fails gracefully for our customers rather than hanging indefinitely when there’s a problem. And one part of the open source engine doesn’t quite run fast enough for our use case, so we’ll replace that part.
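
The error handling might look something like this: run the recogniser with a hard timeout and return a structured error rather than letting a request hang. The engine call and module name here are hypothetical:

```python
# Sketch of a graceful-failure wrapper around the recognition engine: enforce a
# hard timeout and return a structured error instead of hanging indefinitely.
# The engine call and module name are hypothetical.
import concurrent.futures
import logging

from our_asr.engine import transcribe_audio   # hypothetical engine API

logger = logging.getLogger("asr-runtime")
TIMEOUT_SECONDS = 600   # give up after 10 minutes rather than hanging forever

def transcribe_with_fallback(audio_path: str) -> dict:
    """Return a transcript, or a structured error the product can surface nicely."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(transcribe_audio, audio_path)
        transcript = future.result(timeout=TIMEOUT_SECONDS)
        return {"status": "ok", "transcript": transcript}
    except concurrent.futures.TimeoutError:
        logger.error("Transcription timed out for %s", audio_path)
        return {"status": "error", "reason": "timeout"}
    except Exception:
        logger.exception("Transcription failed for %s", audio_path)
        return {"status": "error", "reason": "engine_failure"}
    finally:
        pool.shutdown(wait=False)   # don't block on a stuck worker thread
```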

After these updates, we’re ready to launch our product, and run on real lectures from our customers.

Product Launch

The product launch is a success! Customers are excited, and our product allows Physics students to access their lectures more effectively.

The company decides to add Chemistry and Biology lectures to the product. Unfortunately, initial tests show that the speech recognition performance on these is much worse than on Physics lectures. It’s back to the data collection & curation stages with these new subjects, to build and deploy an improved ASR model.

New Features

Now one customer alerts us to some expletives in the system output. They’re unhappy with this, even though the lecturer really did swear! But some of our other customers are unhappy with the idea of censorship and would prefer not to go down that route. After some internal debate about where and how to apply such filtering, we decide to give customers an option to turn on a filter. It takes some time to convince everyone in the company that this is a better solution than simply censoring the data that the model was built on.
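
The important design choice is that filtering happens as an opt-in post-processing step, per customer, rather than by censoring the training data. A minimal sketch, with an illustrative word list:

```python
# Sketch of the opt-in expletive filter: masking happens in post-processing,
# per customer, not in the training data. The word list is a stand-in.
EXPLETIVES = {"damn", "hell"}   # stand-in list; the real one would be longer

def apply_expletive_filter(transcript: str, filter_enabled: bool) -> str:
    """Mask expletives only for customers who have opted in."""
    if not filter_enabled:
        return transcript
    masked = []
    for word in transcript.split():
        if word.lower().strip(".,!?") in EXPLETIVES:
            masked.append(word[0] + "*" * (len(word) - 1))
        else:
            masked.append(word)
    return " ".join(masked)
```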

Customers like using our product, but it looks like there’s actually a bigger market for a live transcription system than for an offline one. We have to tune the speech recognition models again to make them work in real-time, and update the runtime engine to support a live transcription mode.

Bugs and Regressions

Things are still going well with customers, and we update our models again to support university lectures in all subjects. But after doing this, the performance on Physics and Chemistry lectures gets substantially worse. It was working last week, so we have to figure out what’s happened. After a deep dive, we find errors in the text processing part of the system. One was a rule someone inadvertently wrote to correct all occurrences of ‘gluon’ to ‘glue’, because ‘gluon’ appeared as a typo in one of the History lectures. During this deep dive we also learn that no one can track down the original data correction script on S3.

To make sure this particular regression doesn’t happen again in future deployments, we add a test case. The performance on Physics lectures still isn’t as good as in the past though, and we have many discussions about whether or not to create subject-specific ASR models. While it would make performance better, the maintenance would be too much, and so a single ASR model is chosen as the way forward.
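
The regression test might look something like this, with a hypothetical name standing in for our text post-processing step:

```python
# Sketch of the regression test added after the 'gluon' -> 'glue' incident.
# normalise_transcript is a hypothetical name for our text post-processing step.
from our_asr.text import normalise_transcript   # hypothetical module

def test_subject_terms_survive_text_processing():
    # Vocabulary from one subject must not be "corrected" away by rules
    # written with a different subject in mind.
    physics = "the gluon mediates the strong force between quarks"
    assert "gluon" in normalise_transcript(physics)

    chemistry = "benzene has a delocalised pi bond system"
    assert "benzene" in normalise_transcript(chemistry)
```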

After all of these updates, the model training code is getting unwieldy and requires significant engineering effort to use & maintain. There are a handful of manual steps in the training process too. Despite a deployment checklist on the wiki, some mistakes are being made. Just last week the wrong model was deployed and needed a roll-back. We choose to invest time in refactoring and simplifying parts of the model training code, introducing more automation, to make it easier and less error-prone to work with.

Now we’re hearing reports that performance is suddenly bad for one of our key customers. We need to investigate. After some dead ends in the investigation, listening to the audio reveals that one of their new lecturers had a bad microphone setup. The audio recording quality was well below what we can reliably handle. We feed this back to the customer with some suggestions for how to improve the audio quality, which they take on board.
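
A quick automated check on incoming audio is the sort of thing that might catch a problem like this sooner. A sketch, with illustrative thresholds:

```python
# Sketch of a quick audio health check to flag bad recordings before they reach
# the recogniser. Thresholds are illustrative, not tuned values.
import numpy as np
import soundfile as sf

def audio_health_report(path: str) -> dict:
    audio, sample_rate = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)   # mix down to mono
    rms = float(np.sqrt(np.mean(audio ** 2)))
    clipping = float(np.mean(np.abs(audio) > 0.99))
    return {
        "sample_rate": sample_rate,
        "rms_level": rms,                 # very low -> quiet or distant microphone
        "clipping_fraction": clipping,    # high -> input gain set far too hot
        "looks_ok": rms > 0.01 and clipping < 0.001,
    }
```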

Scaling up

Things are finally calming down. Customers are delighted, bugs are under control and there’s a good model release cadence. Until late one afternoon we get a call from the product manager — “We’re going international, starting small of course, so how quickly can we get German, Spanish and French up and running…?”

You get the idea! Building a machine learning product is far more than building a machine learning model. And while this story is based on experiences building automatic speech recognition products, there are similar challenges in other domains and with other machine learning models.

Companies need to combine expertise from ML scientists, software engineers, linguists, domain specialists, product managers and more to successfully overcome the challenges inherent in building ML products.
