Building applications that require training and tuning large language models can be overwhelming to those not familiar with machine learning tooling. There is a growing list of open-source models to choose from, as well as a large number of platforms and tools. This post is an example of how one might navigate this landscape. It describes all the tools, models and platforms that we tried over a period of several months for a specific project. We have strong opinions on them, so stay tuned!
In early 2024 we had a client with an ambitious goal: they wanted a native voice-to-voice model. The premise is that when humans converse naturally, we don't transcribe what we hear in our heads, generate a written response and then read it aloud. We can speak without even knowing how to read or write. It makes sense that a model could do the same, hopefully with better efficiency. Nothing like that was publicly available at the time (this was before ChatGPT's advanced voice mode), so we started by reading some papers and doing research.
First try: reproducing the SpeechGPT paper
Our first attempt was to implement a paper entitled SpeechGPT. The idea behind SpeechGPT was to convert speech into discrete units using a model called mHuBERT, then train a Llama model to understand those units as new tokens, and finally to generate sequences of those tokens that a vocoder would convert back into speech. These tokens did not correspond to anything in the space of text. We won't dive into the different training phases, but you can read the paper if you'd like to know more.
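To make the "new tokens" part concrete, here is a rough sketch (not the authors' code) of what adding discrete speech units to a Llama vocabulary looks like with the Hugging Face libraries. The checkpoint name and the codebook size are placeholders, not the values we used.

```python
# Sketch: extend a Llama checkpoint with discrete speech-unit tokens so that
# HuBERT unit IDs become part of the vocabulary. Names/sizes are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "meta-llama/Llama-2-7b-hf"   # assumption: any Llama-family checkpoint
NUM_UNITS = 1000                          # assumption: size of the unit codebook

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# One new token per discrete unit, e.g. "<unit_42>" for unit 42.
unit_tokens = [f"<unit_{i}>" for i in range(NUM_UNITS)]
tokenizer.add_tokens(unit_tokens, special_tokens=False)

# Grow the embedding matrix so the new tokens get trainable rows.
model.resize_token_embeddings(len(tokenizer))

# A clip's unit sequence can now be rendered as text and tokenized like any prompt.
units = [17, 42, 512]                      # toy example of unit-extractor output
prompt = "".join(f"<unit_{u}>" for u in units)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
```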
We first attempted to train on a single Nvidia A100 GPU with 80 gigabytes of memory. The first runs were on spot instances that we could get from vast.ai, because they were cheap. We quickly ran into a snag: these machines would come and go. We would turn off a machine and never be able to turn it on again. It was not worth the overhead, so we decided to use a more reliable service even if it cost a little more.
We had some Azure credits, so we built a Docker container with all the tools needed to fine-tune a model and launched the training by hand. Azure was definitely better, in that we could provision the machines we wanted when we wanted them. We realized that we could parallelize the training runs across several GPUs, and we ended up training with eight of them via torchrun. This involved some trial and error to find the best training hyperparameters (e.g. batch size), and we quickly came up with a process to measure the outcome of training runs.
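For reference, this is the general shape of a script launched with something like `torchrun --nproc_per_node=8 train.py`. The model and data below are placeholders rather than our actual fine-tuning code; the per-GPU batch size is exactly the kind of hyperparameter we tuned by trial and error.

```python
# Minimal multi-GPU training sketch for torchrun (placeholder model and data).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Placeholder model/dataset; in practice this was the LLM fine-tuning loop.
    model = torch.nn.Linear(512, 512).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(4096, 512), torch.randn(4096, 512))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)  # per-GPU batch size

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(3):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```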
While we were at it, we tried Azure AI Studio, and we quickly found out that we were not its target audience. It had a steep learning curve, and the abstraction level seemed too high. Later our client provided us access to GPU machines on Google Cloud Platform. From a usability perspective, we felt it was a step down from Azure, and we started to experience inconveniences similar to those we had with vast.ai. For example, we had a server in the Asia Southeast region that was working great. We turned it off one night and could never get an available server in that region again. This meant we had to access the disk attached to the server and move the data across the globe to another region so we could continue training.
Reducing training loss
Once we managed to reproduce the SpeechGPT paper, we discovered that the performance of the model was not good enough. It often did not understand clear voice samples that were trivial for speech recognizers such as Whisper. So we tested several ideas to see how much we could improve it. We will not go into deep detail about everything we tried (that is for another post), but we did manage to reduce the model loss somewhat (e.g. by using sin/cos positional encoding in the embeddings, as in the chart below).
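For readers unfamiliar with it, this is the classic fixed sin/cos encoding from the original Transformer paper. Here is a minimal sketch; exactly how we wired it into the embeddings is out of scope for this post.

```python
# Standard sinusoidal positional encoding (fixed, not learned).
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Return a (seq_len, dim) tensor of sin/cos positional encodings (dim even)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# Added to the token embeddings before the first transformer layer, e.g.:
# hidden = token_embeddings + sinusoidal_positional_encoding(seq_len, hidden_dim)
```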
At this point, the client asked if we could replicate the same results with a non-Llama model, so we started experimenting with Mistral. Porting everything was quite difficult, mainly because the SpeechGPT scripts, and our hyperparameters, were heavily tied to Llama. We managed to train a Mistral model, but training times got worse.
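In hindsight, keeping the scripts model-agnostic would have made the swap cheaper. A minimal sketch using the transformers Auto classes (the checkpoint names are illustrative):

```python
# Loading through Auto classes so the rest of the pipeline does not care which
# architecture is underneath.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"   # or "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Everything downstream (unit tokens, collators, hyperparameters) should only
# depend on `tokenizer` and `model`, not on a specific architecture.
```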
At this stage, we thought it would be a good idea to see if the model could answer yes/no questions. We started experimenting with a Google dataset called BoolQ: we used TTS (text-to-speech) to generate audio files from the BoolQ questions. We could dive deeper into this too, but it is beyond the scope of this post.
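As an illustration, here is roughly how one could turn BoolQ questions into audio with an off-the-shelf TTS model. This is not our exact pipeline; the Bark checkpoint, the split, and the file names are assumptions for the sake of a runnable example.

```python
# Generate spoken versions of BoolQ questions with a Hugging Face TTS pipeline.
import numpy as np
import soundfile as sf
from datasets import load_dataset
from transformers import pipeline

boolq = load_dataset("boolq", split="validation")
tts = pipeline("text-to-speech", model="suno/bark-small")  # assumption: any TTS model works

for i, row in enumerate(boolq.select(range(10))):  # first 10 questions as a demo
    speech = tts(row["question"])
    audio = np.squeeze(np.asarray(speech["audio"]))
    sf.write(f"boolq_{i}.wav", audio, speech["sampling_rate"])
```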
We also thought that, instead of extending Llama’s vocabulary to include Hubert tokens, we could reuse Llama tokens that were rarely used. We started experimenting with that approach, but we didn’t finish it.
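For the curious, the sketch below shows the general shape of that unfinished idea: count how often each token ID appears in a reference text corpus, then repurpose the least-used IDs as stand-ins for speech units. The corpus, checkpoint and cutoff are illustrative assumptions, not values we settled on.

```python
# Find rarely used token IDs in a Llama tokenizer by counting over a text corpus.
from collections import Counter
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumption
corpus = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")  # assumption

counts = Counter()
for row in corpus:
    counts.update(tokenizer(row["text"], add_special_tokens=False).input_ids)

# The 1000 least-frequent IDs become candidates to stand in for speech units.
rare_ids = sorted(range(len(tokenizer)), key=lambda i: counts.get(i, 0))[:1000]
unit_to_token_id = {unit: tok_id for unit, tok_id in enumerate(rare_ids)}
```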
At that point, we were asked to evaluate Gazelle, an alternative voice-to-voice model published by Chris Hua around that time.
Word Error Rates
Before a voice-to-voice model can be used in production, it is important to know whether it can understand and transcribe speech correctly. For this, we measured word error rates on speech with known transcriptions. It was clear that our model was not on par with the state of the art (OpenAI's Whisper, for example), and it was also not yet close to Gazelle 0.1 and 0.2.
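Computing WER itself is straightforward once you have reference and predicted transcripts; here is a minimal example with the jiwer package (the strings are made up).

```python
# Word error rate between reference transcripts and model transcripts.
import jiwer

references = ["turn off the lights in the kitchen", "what is the capital of france"]
hypotheses = ["turn of the light in the kitchen", "what is the capital of france"]

wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.2%}")
```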
Switching to MosaicML, Neptune and Weights and Biases
By now, we had lost count of how many training runs we had launched. Up to this point, our process had entailed manually running training scripts from a Linux terminal and processing the output to create charts. This was too ad hoc and error-prone, so we started automating part of the process. For training, we switched to Databricks' MosaicML. We also connected our training runs to neptune.ai, an experiment tracker that allowed us to compare runs (even in real time) without having to examine the logs by hand. We tried Weights and Biases as well, but the functionality we needed was well covered by Neptune, so we settled on MosaicML + neptune.ai for the remainder of the training cycle. This accelerated our iteration speed. We did not exploit the full potential of these tools, but it is clear that a small organization with limited time is better off not using GPU instances directly; that is better left to larger companies that need to focus on lowering hardware costs rather than developer effort.
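To give a flavor of the experiment-tracking side, here is a minimal sketch of per-step logging with neptune.ai. The project name, parameters and loss values are placeholders; the API token is assumed to come from the NEPTUNE_API_TOKEN environment variable.

```python
# Log hyperparameters and a per-step metric to neptune.ai.
import neptune

run = neptune.init_run(project="workspace/voice-to-voice")  # assumption: project slug
run["parameters"] = {"batch_size": 16, "lr": 1e-4, "model": "llama"}

for step, loss in enumerate([2.31, 1.98, 1.75]):  # stand-in for real training losses
    run["train/loss"].append(loss, step=step)

run.stop()
```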
Conclusions
When experimenting at the cutting edge of AI, it is fundamental to iterate very fast. This means one must be extremely comfortable with the tools and process. If we had to do this again, we would probably start with MosaicML, because it saved us a fair amount of manual work. If we had to use one of the main cloud providers, we would pick Azure and launch our own Docker containers with the training environment that we know and love. We would stay away from Google Cloud Platform if possible. There is still a place for vast.ai: if you want to do a quick proof of concept on the cheap, you can always opportunistically grab a GPU for a throwaway experiment.
Do you have a project that involves training or fine-tuning open-source models? Contact us!