Author: Diego Basch

  • Why you want to use gen AI for your business

    When I tell people what I do, I often get this question: “I know I should be using gen AI, but besides [ChatGPT and its ilk] I’m not sure what I should be doing.” When I start exploring their needs, I realize that most people have a distorted idea of what generative AI is. In this post I will explain it as simply as I can. Maybe this will give you ideas, or perhaps you will realize that this technology does not apply to your business (at least for now). Let’s go.

    The thing that prompted the explosion of AI is the concept of the Transformer. The idea is very simple; pay attention now. As you ingest text, you do not pay attention to one word at a time. You are aware that I asked you to pay attention as you read this, for example. Your mind remembers that I am talking about the Transformer, and each word has a relation to every other word in the context. For example, you know that the explosion I mentioned earlier is not literal; you did not imagine a scene from a Michael Bay movie. The same goes for the word Transformer, even though Transformers is indeed a Michael Bay movie. Context is vital.
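
    If you are curious about the machinery, the core operation is surprisingly small. Here is a minimal sketch of scaled dot-product attention with toy numbers (not a real model): every token computes how relevant every other token in the context is, and mixes in their information accordingly.

      import numpy as np

      def attention(queries, keys, values):
          """Scaled dot-product attention: each token attends to every token in the context."""
          d = queries.shape[-1]
          scores = queries @ keys.T / np.sqrt(d)                       # pairwise relevance
          weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
          weights = weights / weights.sum(axis=-1, keepdims=True)      # softmax over the context
          return weights @ values                                      # context-aware token vectors

      # Toy example: 4 "tokens", each represented by an 8-dimensional vector.
      rng = np.random.default_rng(0)
      x = rng.normal(size=(4, 8))
      print(attention(x, x, x).shape)  # (4, 8)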

    The name of the Transformer should be a hint for what it does. The original motivation was to improve machine translation. If you want to translate a concept from, say, Spanish to English, you do not translate each word independently. Instead you look at a whole phrase in Spanish, and come up with another in English that conveys the same meaning. It turns out that humans express meaning online in countless ways besides pure text (images are the obvious example). You don’t even need to switch languages. You can take an idea and expand on it, summarize it, make it more or less formal. And the way this happens is what puts the “generative” in gen AI.

    Can AI be leftist? [of course it can, it can be anything you want]

    Suppose I start a phrase with “to whom it may” and ask you to guess what comes next. You would bet the farm on “concern.” That one was easy. If I said “your sister called and” then you have more options. “Said” is a good candidate, “petrichor” is not. But if I said “your sister called and said that it’s about to rain. It made me think of the comforting scent of” then petrichor is much more likely. This is what all the chatbots do when you talk to them. They are asked to continue a passage that starts with a prompt: “you are a helpful bot, and here is something the person said.” Then what you said follows, and then something like “this is what you respond to the user.” You could play a social game using this mechanism. To make it fun, you could make a player lose if they mention certain words or topics. Try asking ChatGPT about any of the topics that OpenAI likes to avoid, and you will experience all the mechanisms they had to build in order to restrict the model. But this is not inherent to LLMs. You could build a system to talk about the topics you want and avoid others, and they do not need to be the same ones Claude or ChatGPT deal with.
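
    To make this concrete, here is a tiny sketch of how a chat exchange is flattened into a single passage for the model to continue. The exact template varies by model; this one is only illustrative.

      # The model's only job is to continue this text from "Assistant:" onward.
      SYSTEM = "You are a helpful bot for Widgets, Inc. Politely decline to discuss other topics."

      def build_prompt(user_message: str) -> str:
          return (
              f"{SYSTEM}\n"
              f"User: {user_message}\n"
              "Assistant:"
          )

      print(build_prompt("Your sister called and said it's about to rain."))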

    By now I imagine you’re thinking “that’s nice, but what can I do with gen AI? How does it help my business?” This part is easy for us, because every day we run into situations in which we think “this interaction would be so much better if only…” Here are some examples.

    Onboarding Buddy

    It is your first day at Widgets, Inc. Perhaps you were given some quick orientation, and now you have to start doing useful work. You need to know about some customer requirements. Where are they? You ask a coworker, and they tell you to search Notion. It wasn’t there, but you eventually find it somehow. The next time, you ask another coworker. After a while you have an idea of where everything is. But what if you could have an omniscient buddy with infinite patience and time for you, so you wouldn’t have to disturb your coworkers? Now we can finally have a useful intranet. A chatbot augmented with retrieval (what is known as RAG, or retrieval-augmented generation) can do this. We have built instances of this, for example with LangChain and custom tools to search Gmail, Slack channels, Notion, etc. Given an information request, the chatbot decides which resources to explore, and then inserts the relevant facts into the context for response generation. One of our cofounders (Diego Basch) has decades of experience in information retrieval, having sold his SaaS search company to LinkedIn. We can help you assess how to best take advantage of your proprietary information to make it helpful to your employees while keeping it on a secure server (for example, using an open-source model like Llama 3 with full control of the information flow).
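
    Stripped of all the plumbing, the retrieve-then-generate loop looks roughly like this. The search helpers and the llm() call below are placeholders for whatever tools and model you actually wire in (LangChain tools, a hosted API, a self-hosted Llama 3, and so on).

      def search_notion(query: str) -> list[str]:
          # Placeholder: swap in a real Notion search tool.
          return ["Customer requirements doc: SSO and a 99.9% uptime SLA."]

      def search_slack(query: str) -> list[str]:
          # Placeholder: swap in a real Slack search tool.
          return ["#sales thread: the customer asked about SAML support."]

      def llm(prompt: str) -> str:
          # Placeholder: swap in a real model call.
          return "(model response goes here)"

      def answer(question: str) -> str:
          snippets = search_notion(question) + search_slack(question)
          context = "\n".join(snippets[:5])              # keep only the most relevant facts
          prompt = (
              "Answer the new employee's question using only the context below.\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
          )
          return llm(prompt)

      print(answer("What are the customer's security requirements?"))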

    Automated Lead Verifier

    A car dealer has ten thousand leads. Each one is the phone number of a person who expressed interest in buying a car. Calling each person individually is expensive. Instead, the dealer can use a voice-powered agent to validate these numbers. The agent calls each person and says “hi, I’m calling from [dealership]. I wanted to check that you are still interested in buying a car, is this correct?” The person doesn’t need to respond yes or no; they can ask follow-up questions such as “wait, what dealership?” or “who is this?” and the agent will be able to engage in an unstructured conversation. If the person confirms interest, the agent thanks them for their time and tells them that a person will follow up. And of course it can leave voicemail. This can whittle down the leads to a manageable number for a fraction of the cost of a call center.
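
    The conversational core of such an agent is, again, prompted text continuation, wrapped in speech-to-text and text-to-speech. A rough sketch of a single turn (the llm() call is a placeholder; the telephony and speech components are not shown):

      def llm(prompt: str) -> str:
          # Placeholder: swap in a real model call.
          return "Yes, this is [dealership]. Are you still interested in the car you asked about?"

      SYSTEM = (
          "You are calling on behalf of [dealership]. Confirm whether the person is still "
          "interested in buying a car, answer their questions briefly and politely, and say "
          "that a human will follow up if they are interested."
      )

      def agent_turn(conversation: list[str], caller_said: str) -> str:
          conversation.append(f"Caller: {caller_said}")
          prompt = SYSTEM + "\n" + "\n".join(conversation) + "\nAgent:"
          reply = llm(prompt)
          conversation.append(f"Agent: {reply}")
          return reply

      print(agent_turn([], "Wait, what dealership is this?"))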

    Fashion Assistant

    Imagine you have an event coming up—a wedding, a business meeting, or a casual hangout. You want to look appropriate and stylish, but you’re not quite sure what to wear. You could spend hours scrolling through inspiration boards or online shops, but what if you had a personal fashion advisor who instantly understood your preferences, the event type, and even what’s trending in your social circle?

    With generative AI, this is possible. A fashion assistant powered by AI can suggest outfit ideas based on the occasion, current fashion trends, your style history, and even the weather forecast. It can combine these factors with items you already own or suggest pieces from online stores that match your wardrobe, making it easy to find the right look with minimal effort. Imagine this assistant as your style-conscious friend who’s always on call, helping you to dress confidently for any occasion.

    Your use case here. Contact us!

  • What we learned building voice-to-voice models: platforms, tools, technologies

    Building applications that require training and tuning large language models can be overwhelming to those not familiar with machine learning tooling. There is a growing list of open-source models to choose from, as well as a large number of platforms and tools. This post is an example of how one might navigate this landscape. It describes all the tools, models and platforms that we tried over a period of several months for a specific project. We have strong opinions on them, so stay tuned!

    In early 2024 we had a client with an ambitious goal: they wanted a native voice-to-voice model. The premise is that when humans converse naturally, they don’t transcribe what they hear in their heads, compose a written response, and then read it aloud. We can speak without even knowing how to read or write. It makes sense that a model could do this too, and hopefully with better efficiency. Nothing like that was publicly available at the time (this was before ChatGPT’s advanced voice mode), so we started by reading some papers and doing research.

    First try: reproducing the SpeechGPT paper

    Our first attempt was to implement a paper entitled SpeechGPT. The idea behind SpeechGPT was to convert speech into discrete units using a model called mHuBERT, then train a Llama model to understand those units as new tokens, and finally to generate sequences of those tokens that would be converted back to speech using a vocoder. These tokens did not correspond to anything in the space of text. We won’t dive into the different training phases, but you can read the paper if you’d like to know more.
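
    In outline, the pipeline looks something like the sketch below. The function names are ours, not the paper’s, and the bodies are placeholders; the point is the shape of the data flow: audio to discrete units, units to new LLM tokens, generated tokens back to audio.

      def speech_to_units(waveform) -> list[int]:
          """Placeholder for mHuBERT + clustering: audio in, a sequence of discrete unit ids out."""
          ...

      def units_to_speech(unit_ids: list[int]):
          """Placeholder for the vocoder: unit ids in, a waveform out."""
          ...

      def respond(waveform, llm):
          input_units = speech_to_units(waveform)
          # The Llama vocabulary is extended so each unit id has its own token, e.g. "<unit_123>".
          prompt_tokens = [f"<unit_{u}>" for u in input_units]
          output_tokens = llm.generate(prompt_tokens)      # the model continues in unit-token space
          output_units = [int(t[6:-1]) for t in output_tokens if t.startswith("<unit_")]
          return units_to_speech(output_units)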

    We first attempted to train on a single Nvidia A100 GPU with 80 gigabytes of memory. The first runs were on spot instances that we could get from vast.ai, because they were cheap. We quickly ran into a snag: these machines would come and go. We would turn off a machine and never be able to turn it on again. It was not worth the overhead, so we decided to use a more reliable service even if it cost a little more.

    We had some Azure credits, so we built a Docker container with all the tools needed to fine-tune a model and launched the training by hand. Azure was definitely better, in that we could provision the machines we wanted when we wanted them. We realized that we could parallelize the training runs across several GPUs, and we ended up training with eight of them via torchrun. This involved some trial and error to find the best training hyperparameters (e.g. batch size), and we quickly came up with a process to measure the outcome of training runs.
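
    For readers who haven’t done multi-GPU training before, the skeleton of such a run is roughly the following, launched with something like torchrun --nproc_per_node=8 train.py. The model and data here are toy stand-ins, not our actual fine-tuning code.

      # train.py
      import os
      import torch
      import torch.distributed as dist
      from torch.nn.parallel import DistributedDataParallel as DDP

      def main():
          dist.init_process_group("nccl")              # torchrun sets the rank/world-size env vars
          local_rank = int(os.environ["LOCAL_RANK"])
          torch.cuda.set_device(local_rank)

          model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for the real model
          model = DDP(model, device_ids=[local_rank])
          optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

          for step in range(100):                      # stand-in for the real data loader
              batch = torch.randn(8, 4096, device=f"cuda:{local_rank}")
              loss = model(batch).pow(2).mean()
              loss.backward()
              optimizer.step()
              optimizer.zero_grad()

          dist.destroy_process_group()

      if __name__ == "__main__":
          main()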

    While we were at it we tried Azure AI Studio, and we quickly found out that we were not the target audience for it. It had a steep learning curve, and the abstraction level seemed too high. Later our client provided us access to GPU machines on Google Cloud Platform. From the usability perspective, we felt it was a step down from Azure. We started to experience inconveniences similar to those of Vast. For example, we had a server in the Asia South East region that was working great. We turned it off one night, and we could never get an available server again in the same region. This meant that we needed to access the disk attached to the server, and move the data across the globe to another region so we could continue training.

    Reducing training loss

    Once we managed to reproduce the SpeechGPT paper, we discovered that the performance of the model was not good enough. It often did not understand clear voice samples that were trivial for speech recognizers such as Whisper. So we tested several ideas to see how much we could improve it. We will not go into deep detail about everything we tried (that is for another post), but we did manage to reduce the model loss somewhat (e.g. using sin/cos positional encoding in the embeddings, as in the chart below).
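
    For reference, sin/cos positional encoding is the fixed scheme from the original Transformer paper. A minimal sketch (not our exact training code) looks like this; the result is added to the token embeddings before they enter the model.

      import torch

      def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
          """Fixed sin/cos positional encodings, as in 'Attention Is All You Need'."""
          positions = torch.arange(seq_len).unsqueeze(1)   # (seq_len, 1)
          freqs = torch.exp(torch.arange(0, dim, 2) * (-torch.log(torch.tensor(10000.0)) / dim))
          enc = torch.zeros(seq_len, dim)
          enc[:, 0::2] = torch.sin(positions * freqs)
          enc[:, 1::2] = torch.cos(positions * freqs)
          return enc

      # embeddings = token_embeddings + sinusoidal_positions(seq_len, hidden_dim)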

    At this point, the client asked if we could try to replicate the same results with a non-Llama model, so we started experimenting with Mistral. It was quite difficult to port everything, mainly because the SpeechGPT scripts were heavily tied to Llama, and our hyperparameters were as well. We managed to train a Mistral model, but the training times worsened.

    At this stage, we thought it would be a good idea to see if the model could answer yes/no questions. We started experimenting with a Google dataset called BoolQ. We used text-to-speech (TTS) to generate audio files from the BoolQ questions. We could dive deeply into this as well, but it exceeds the scope of this post.
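
    Generating that evaluation set is straightforward. A sketch using the Hugging Face copy of BoolQ and gTTS (chosen here purely for illustration; any TTS system works) might look like this.

      from datasets import load_dataset
      from gtts import gTTS

      boolq = load_dataset("boolq", split="validation")
      for i, example in enumerate(boolq.select(range(100))):
          gTTS(text=example["question"]).save(f"boolq_{i:04d}.mp3")
          # example["answer"] (True/False) becomes the expected yes/no label
          # when the audio is later fed to the voice model.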

    We also thought that, instead of extending Llama’s vocabulary to include HuBERT tokens, we could reuse Llama tokens that are rarely used. We started experimenting with that approach, but we didn’t finish it.
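
    The idea, sketched below under our own assumptions (placeholder checkpoint and corpus), is to count how often each token appears in a representative text corpus and repurpose the least-used token ids as speech-unit tokens, so the embedding matrix never has to grow.

      from collections import Counter
      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder checkpoint
      counts = Counter()
      for line in open("corpus.txt"):                                        # placeholder corpus
          counts.update(tokenizer(line)["input_ids"])

      n_units = 1000                                                         # number of speech units to map
      rarely_used = sorted(range(tokenizer.vocab_size), key=lambda t: counts[t])[:n_units]
      unit_to_token = {unit: token for unit, token in enumerate(rarely_used)}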

    At that point, we were asked to evaluate Gazelle, an alternative voice-to-voice model published by Chris Hua around that time.

    Word Error Rates

    Before a voice-to-voice model can be used in production, it is important to know if it is capable of understanding speech correctly and transcribing it. For this, we measured the word error rates when given speech with known transcriptions. It was clear that our model was not on par with the state of the art (OpenAI’s Whisper, for example), and it was not yet close to Gazelle 0.1 and 0.2 either.
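
    As a refresher, word error rate is the number of word-level substitutions, insertions and deletions needed to turn the model’s transcription into the reference, divided by the number of words in the reference. A small self-contained implementation:

      def wer(reference: str, hypothesis: str) -> float:
          """Word error rate via word-level edit distance."""
          ref, hyp = reference.split(), hypothesis.split()
          d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
          for i in range(len(ref) + 1):
              d[i][0] = i
          for j in range(len(hyp) + 1):
              d[0][j] = j
          for i in range(1, len(ref) + 1):
              for j in range(1, len(hyp) + 1):
                  cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                  d[i][j] = min(d[i - 1][j] + 1,          # deletion
                                d[i][j - 1] + 1,          # insertion
                                d[i - 1][j - 1] + cost)   # substitution or match
          return d[len(ref)][len(hyp)] / len(ref)

      print(wer("turn on the living room lights", "turn the living room light"))  # 2 errors / 6 words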

    Switching to MosaicML, Neptune, Weights and Biases

    By now, we had lost count of how many training runs we had launched. Up to this point, our process had entailed manually running training scripts from a Linux terminal and processing the output to create charts. This was too ad hoc and error-prone, so we started automating part of the process. For training, we switched to Databricks’ MosaicML. We also connected our training runs to neptune.ai, an experiment tracker that allowed us to compare runs (even in real time) without having to examine the logs by hand. We tried Weights and Biases as well, but the functionality we needed was well covered by Neptune. As a result, we settled on MosaicML + neptune.ai for the remainder of the training cycle, which sped up our iteration considerably. We did not exploit the full potential of what these tools can offer, but it is clear that a small organization with limited time is better off not using GPU instances directly. That is better left to larger companies that need to focus on lowering hardware costs rather than developer effort.
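
    Wiring an experiment tracker into a training loop takes very little code. A sketch of how metrics end up in Neptune, with a placeholder project name and a fake training step (the API token is read from the NEPTUNE_API_TOKEN environment variable), looks roughly like this.

      import neptune

      def train_one_step(step: int) -> float:
          # Placeholder for the real training step; returns a fake, decreasing loss.
          return 2.0 / (step + 1)

      run = neptune.init_run(project="workspace/voice-to-voice")   # placeholder project name
      run["parameters"] = {"batch_size": 8, "lr": 1e-4, "base_model": "llama"}

      for step in range(1000):
          loss = train_one_step(step)
          run["train/loss"].append(loss)                           # live loss curves, comparable across runs

      run.stop()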

    Conclusions

    When experimenting at the cutting edge of AI, it is essential to iterate very fast. This means one must be extremely comfortable with the tools and the process. If we had to do this again, we would probably start with MosaicML, because it did save us a fair amount of manual work. If we had to use one of the main cloud providers, we would pick Azure and launch our own Docker containers with the training environment we know and love. We would stay away from Google Cloud Platform if possible. There is still a place for vast.ai: if you want to do a quick proof of concept on the cheap, you can always opportunistically grab a GPU for a throwaway experiment.

    Do you have a project that involves training or fine-tuning open-source models? Contact us!