Blog

  • How ChatGPT and Claude Convince You They’re Almost Human

    If you have seen the movie Memento, you may remember a scene in which the protagonist is running in the street and has an internal monologue. It goes like this: “so… what am I doing? Oh, I’m chasing this guy. No, he’s chasing me.” Neither we nor he has any context of what is happening besides what we see in that instant. This is very similar to what happens to an LLM when it generates a completion.

    Is ChatGPT an LLM? No, it’s much more

    An LLM is a machine that can continue a sequence of words. By now, most people have interacted not with an LLM directly but with an agent such as ChatGPT or Claude. However, you can go to a playground and make an LLM continue any text, like this (generated by Llama 3 8B, in case you’re curious):

    Prompt: Am I chasing this guy? No,
    Completion: 100 percent not. I’m just doing what I love to do and that’s play baseball.” The only thing keeping this from being a no-doubt pick for the Hall of Fame is the fact that he didn’t play in the World Series.
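
    You can reproduce this kind of raw continuation yourself. Here is a minimal sketch using the Hugging Face transformers library; the checkpoint name is just an example, and any base (non-chat) model would behave similarly:

        # Raw text continuation: no chat template, no system prompt, no history.
        from transformers import pipeline

        generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B")
        out = generator("Am I chasing this guy? No,", max_new_tokens=60, do_sample=True)
        print(out[0]["generated_text"])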

    In this case, all the context the LLM has is that there was a phrase that ends in a comma, and it must pick up from there and generate more words that make sense. But this is not what Claude and ChatGPT are. They are applications that:

    • know the history of their interactions with you
    • may know your current time and location
    • have the capability of searching the internet and custom resources to enrich the context

    Have you ever heard of RAG? It’s probably not what you think

    RAG stands for Retrieval-Augmented Generation. It’s a fancy way of saying that a system that uses an LLM needs a way to insert new information into the LLM’s context. This is because the LLM is frozen in time. You can download an LLM made several months ago, and it will not have any recent information in its training data. Imagine you wake up in a room and it’s the year 2032. Someone asks you: what do you think of the president of the New Soviet Union? Obviously you have no idea if the New Soviet Union even exists, let alone who governs it. But there is a sign in front of you that says: “if someone asks for current events, enter a query here.” So you type the question, you learn what you need, and you answer confidently. Your interrogator has no idea that you just looked this up.

    In early 2023, for some reason people started using vector databases as the main way to retrieve information for this purpose. The idea is very simple: you can think of a piece of information as if it were located somewhere in the space of meaning. For example, the concept of a limousine is relatively close to the concept of a taxi, and not so close to that of a pitchfork. So if you take, say, all your emails and place them in their proper regions of semantic space, you might expect that a question whose answer is contained in one message would also be close to it in that space. For example, “can you find the invite to the party” might find emails that mention saving the date, invitations, birthdays, gifts, even if the words “party” or “invite” are not present.
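
    To make this concrete, here is a toy sketch of semantic retrieval with the sentence-transformers library; the model name and the sample emails are made up, and any embedding model plus vector store would work the same way:

        # Embed a few emails, embed the query, and rank by cosine similarity.
        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("all-MiniLM-L6-v2")

        emails = [
            "Save the date! Ana's birthday dinner is on the 14th.",
            "Your March invoice is attached.",
            "Notes from the quarterly roadmap review.",
        ]
        query = "can you find the invite to the party"

        email_vecs = model.encode(emails, convert_to_tensor=True)
        query_vec = model.encode(query, convert_to_tensor=True)

        scores = util.cos_sim(query_vec, email_vecs)[0]
        print(emails[scores.argmax().item()])   # typically the birthday email, with no word overlap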

    Relevance and similarity are different things

    Of course, it could be that the most similar email to that question is from 2009. That doesn’t make it relevant. A less similar email from three weeks ago is much more relevant. Relevance is determined by a large number of signals that include recency, but also location, keywords, the context of the query, and so on. This is why the way we currently do RAG involves giving the LLM access to a bunch of bespoke tools, and a long prompt explaining how to use them. For example, if you ask ChatGPT what to wear tonight, it might look up the weather for your location.
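
    As an illustration of the similarity-versus-recency point, here is a hypothetical scoring function; the weights and the decay constant are invented for the example:

        # Blend semantic similarity with an exponential recency decay.
        import math
        from datetime import datetime, timedelta, timezone

        def relevance(similarity: float, sent_at: datetime, half_life_days: float = 30.0) -> float:
            age_days = (datetime.now(timezone.utc) - sent_at).days
            recency = math.exp(-math.log(2) * age_days / half_life_days)  # 1.0 today, 0.5 after one half-life
            return 0.7 * similarity + 0.3 * recency

        now = datetime.now(timezone.utc)
        old = relevance(0.92, now - timedelta(weeks=800))  # very similar, but ancient
        new = relevance(0.75, now - timedelta(weeks=3))    # less similar, but recent
        assert new > old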

    For the weather example, ChatGPT already knows where I am as I write this. I would guess this is part of a tool that contains user context; try asking it what it knows about you. It also has access to web search. If it didn’t, it might invent plausible-looking search results that don’t actually exist.

    But wait, how does the system know which tools to use?

    This is indeed an extremely difficult problem. If you have ever used LangChain, you know the basic idea: the agent performs a loop in which it evaluates the current state of the conversation and decides what to do. It might have a prompt along these lines:

    You are an AI-powered assistant that excels at retrieving, analyzing, and generating information. You have access to the following tools:
    
    Python: For running calculations, data analysis, and generating visualizations.
    Web Search: For finding up-to-date information from the internet.
    Document Reader: For extracting and summarizing content from uploaded documents.
    Chat History Analysis: For understanding user preferences and tailoring responses based on past interactions.
    Memory Management: For storing, retrieving, and modifying user-specific context to maintain continuity across conversations.
    ...
    Use these tools effectively to assist users with their requests. Always choose the most efficient tool for a given task, combining multiple tools if necessary. Provide clear explanations of your actions when you employ tools.

    When we build an agent like this, we hope that the LLM predicts that a subsequent response should contain the results of a call to one of these tools. But of course, the LLM has very limited reasoning capabilities and is somewhat unpredictable. A slight variation in wording might make it choose one tool over another. It might tell you “sorry, I cannot assist you with this because I don’t know what to do with this file,” or it might trigger a call to the document tool. In the first case, you might say “I have seen your prompt, and I know you have a Document tool.” The agent would probably respond “oh, pardon my mistake. Of course you are correct” and proceed to analyze the document.
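
    Stripped of the framework, the loop looks roughly like this. The sketch assumes an OpenAI-style chat API with tool calling; the tool schema, the model name, and the run_tool dispatcher are placeholders rather than anyone’s production code:

        # One iteration: ask the model, execute any tool calls, feed results back.
        import json
        from openai import OpenAI

        client = OpenAI()
        tools = [{
            "type": "function",
            "function": {
                "name": "web_search",
                "description": "Find up-to-date information on the internet.",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        }]

        messages = [{"role": "user", "content": "What should I wear tonight?"}]
        while True:
            reply = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
            msg = reply.choices[0].message
            if not msg.tool_calls:              # the model answered directly
                print(msg.content)
                break
            messages.append(msg)                # keep the tool call in the transcript
            for call in msg.tool_calls:
                args = json.loads(call.function.arguments)
                result = run_tool(call.function.name, args)   # your own dispatcher (not shown)
                messages.append({"role": "tool", "tool_call_id": call.id, "content": result})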

    Does building an agent need to be this complicated?

    For the time being, I’m afraid so. The state of the art for building intelligent agents relies heavily on heuristics, clever workarounds, and tools to sidestep the inherent limitations of LLMs. It seems that the reasoning capabilities have reached a plateau. More data does not make an LLM capable of better reasoning. Rather, it makes it more likely to be correct when asked about obscure facts. This is why effective agents depend not only on the power of LLMs but also on thoughtful prompt engineering, and the careful integration of clever tools to fill in the gaps.

    Nobody knows if another breakthrough is near. For now, crafting an effective agent is both engineering and art. We are having a blast doing it, and if you’re thinking about creating a custom agent for whatever purpose, we’d love to chat.

    Now, where was I.

  • Why you want to use gen AI for your business

    When I tell people what I do, I often get the question: I know I should be using gen AI, but besides [ChatGPT and its ilk], I’m not sure what I should be doing. When I start exploring their needs, I realize that most people have a distorted idea of what generative AI is. In this post I will explain it as simply as I can. Maybe this will give you ideas, or perhaps you will realize that this technology does not apply to your business (at least for now). Let’s go.

    The thing that prompted the explosion of AI is the concept of the Transformer. The idea is very simple: pay attention now. As you ingest text, you do not pay attention to one word at a time. You are aware that I asked you to pay attention as you read this, for example. Your mind remembers that I am talking about the Transformer, and each word has a relation to every other word in the context. For example, you know that the explosion I mentioned earlier is not literal; you did not imagine a scene from a Michael Bay movie. Same as when I said the word Transformer, even though Transformers is indeed a Michael Bay movie. Context is vital.

    The name of the Transformer should be a hint of what it does. The original motivation was to improve machine translation. If you want to translate a concept from, say, Spanish to English, you do not translate each word independently. Instead you look at a whole phrase in Spanish, and come up with another in English that conveys the same meaning. It turns out that humans express meaning online in countless ways besides pure text (images are the obvious example). You don’t even need to switch languages. You can take an idea and expand on it, summarize it, make it more or less formal. And the way this happens is what puts the “generative” in gen AI.

    Can AI be leftist? [of course it can, it can be anything you want]

    Suppose I start a phrase with “to whom it may” and ask you to guess what comes next. You would bet the farm on “concern.” That one was easy. If I said “your sister called and,” then you have more options. “Said” is a good candidate; “petrichor” is not. But if I said “your sister called and said that it’s about to rain. It made me think of the comforting scent of,” then petrichor is much more likely. This is what all the chatbots do when you talk to them. They are asked to continue a passage that starts with a prompt: “you are a helpful bot, and here is something the person said.” Then what you said follows, and then something like “this is what you respond to the user.” You could play a social game using this mechanism. To make it fun, you could make a player lose if they mention certain words or topics. Try asking ChatGPT about any topics that OpenAI likes to avoid, and you will experience all the mechanisms that they had to build in order to restrict the model. But this is not inherent to LLMs. You could build a system to talk about the topics you want and avoid others, and they do not need to be the same ones Claude or ChatGPT deal with.
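
    To show what “continuing a passage” means in practice, here is an illustrative sketch of how a chat turn can be flattened into plain text for the model to continue; the template and the topic restriction are made up:

        # Flatten system instructions, history, and the new message into one prompt.
        SYSTEM = (
            "You are a helpful bot for a cooking site. "
            "If the user brings up politics, steer the conversation back to recipes."
        )

        def build_prompt(history: list[tuple[str, str]], user_message: str) -> str:
            lines = [f"System: {SYSTEM}"]
            for role, text in history:
                lines.append(f"{role.capitalize()}: {text}")
            lines.append(f"User: {user_message}")
            lines.append("Assistant:")   # the LLM simply continues from here
            return "\n".join(lines)

        print(build_prompt([], "What do you think of the president?"))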

    By now I imagine you’re thinking “that’s nice, but what can I do with gen AI? How does it help my business?” This is easy for us, because every day we run into situations in which we think “this interaction would be so much better if only…” Some examples.

    Onboarding Buddy

    It is your first day at Widgets, Inc. Perhaps you were given some quick orientation, and now you have to start doing useful work. You need to know about some customer requirements. Where are they? You ask a coworker, and they tell you to search Notion. It isn’t there, but you eventually find it somehow. The next time, you ask another coworker. After a while you have an idea of where everything is. But what if you could have an omniscient buddy with infinite patience and time for you, so you wouldn’t have to disturb your coworkers? Now we can finally have a useful intranet. A chatbot augmented with retrieval (what is known as RAG, or retrieval-augmented generation) can do this. We have built instances of this, for example with LangChain and custom tools to search Gmail, Slack channels, Notion, etc. Given an information request, the chatbot decides which resources to explore, and then inserts the relevant facts into the context for response generation. One of our cofounders (Diego Basch) has decades of experience in information retrieval, having sold his SaaS search company to LinkedIn. We can help you assess how to best take advantage of your proprietary information to make it helpful to your employees while keeping it on a secure server (for example, using an open-source model like Llama 3 with full control of the information flow).
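
    For a feel of what such a chatbot looks like in code, here is a minimal sketch assuming a recent LangChain/LangGraph setup; the search_notion and search_slack bodies call hypothetical internal helpers, and the exact constructors may differ between releases:

        # Give the agent custom retrieval tools; it decides which ones to call.
        from langchain_core.tools import tool
        from langchain_openai import ChatOpenAI
        from langgraph.prebuilt import create_react_agent

        @tool
        def search_notion(query: str) -> str:
            """Search the company Notion workspace and return matching snippets."""
            return notion_search(query)      # hypothetical internal helper

        @tool
        def search_slack(query: str) -> str:
            """Search indexed Slack channels and return matching messages."""
            return slack_search(query)       # hypothetical internal helper

        agent = create_react_agent(ChatOpenAI(model="gpt-4o"), [search_notion, search_slack])
        result = agent.invoke(
            {"messages": [("user", "Where are the customer requirements for the Widgets X launch?")]}
        )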

    Automated Lead Verifier

    A car dealer has ten thousand leads. Each one is the phone number of a person who expressed interest in buying a car. Calling each person individually is expensive. Instead, the dealer can use a voice-powered agent to validate these numbers. The agent calls each person and says “hi, I’m calling from [dealership]. I wanted to check that you are still interested in buying a car, is this correct?” The person doesn’t need to respond yes or no; they can ask follow-up questions such as “wait, what dealership?” or “who is this?” and the agent will be able to engage in an unstructured conversation. If the person confirms interest, the agent thanks them for their time and tells them that a person will follow up. And of course it can leave voicemail. This can whittle down the leads to a manageable number for a fraction of the cost of a call center.

    Fashion Assistant

    Imagine you have an event coming up—a wedding, a business meeting, or a casual hangout. You want to look appropriate and stylish, but you’re not quite sure what to wear. You could spend hours scrolling through inspiration boards or online shops, but what if you had a personal fashion advisor who instantly understood your preferences, the event type, and even what’s trending in your social circle?

    With generative AI, this is possible. A fashion assistant powered by AI can suggest outfit ideas based on the occasion, current fashion trends, your style history, and even the weather forecast. It can combine these factors with items you already own or suggest pieces from online stores that match your wardrobe, making it easy to find the right look with minimal effort. Imagine this assistant as your style-conscious friend who’s always on call, helping you to dress confidently for any occasion.

    Your use case here. Contact us!

  • What we learned building voice-to-voice models: platforms, tools, technologies

    Building applications that require training and tuning large language models can be overwhelming to those not familiar with machine learning tooling. There is a growing list of open-source models to choose from, as well as a large number of platforms and tools. This post is an example of how one might navigate this landscape. It describes all the tools, models, and platforms that we tried over a period of several months for a specific project. We have strong opinions on them; stay tuned!

    In early 2024 we had a client with an ambitious goal: they wanted a native voice-to-voice model. The premise is that when humans converse naturally, they don’t transcribe what they hear in their heads, generate a written response, and then speak it. We can speak without even knowing how to read or write. It makes sense that a model could do this too, and hopefully with better efficiency. Nothing like that was publicly available at the time (this was before ChatGPT’s advanced voice mode), so we started by reading some papers and doing research.

    First try: reproducing the SpeechGPT paper

    Our first attempt was to implement a paper entitled SpeechGPT. The idea behind SpeechGPT was to convert speech into discrete units using a model called mHubert, then train a Llama model to understand those units as new tokens, and finally to generate sequences of those tokens that would be converted to speech again using a vocoder. These tokens did not correspond to anything in the space of text. We won’t dive into the different training phases, but you can read the paper if you’d like to know more.
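
    The vocabulary-extension step is the heart of that recipe. Here is a rough sketch of it with the transformers library; the checkpoint name and the number of discrete units are assumptions (SpeechGPT-style setups typically use on the order of a thousand units):

        # Add one new token per discrete speech unit and grow the embedding matrix.
        from transformers import AutoModelForCausalLM, AutoTokenizer

        base = "meta-llama/Llama-2-7b-hf"          # illustrative; any Llama-style model
        tokenizer = AutoTokenizer.from_pretrained(base)
        model = AutoModelForCausalLM.from_pretrained(base)

        unit_tokens = [f"<unit_{i}>" for i in range(1000)]   # one token per mHubert cluster
        tokenizer.add_tokens(unit_tokens)
        model.resize_token_embeddings(len(tokenizer))        # new rows get trained during fine-tuning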

    We first attempted to train on a single Nvidia A100 GPU with 80 gigabytes of memory. The first runs were on spot instances that we could get from vast.ai, because they were cheap. We quickly ran into a snag: these machines would come and go. We would turn off a machine and never be able to turn it on again. It was not worth the overhead, so we decided to use a more reliable service even if it cost a little more.

    We had some Azure credits, so we built a Docker container with all the tools needed to fine-tune a model, and launched the training by hand. Azure was definitely better, in that we could provision the machines we wanted when we wanted them. We realized that we could parallelize the training runs across several GPUs, and we ended up training with eight of them via torchrun. This involved some trial and error to find the best training hyperparameters (e.g. batch size), and we quickly came up with a process to measure the outcome of training runs.
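
    For context, the per-process setup under torchrun looks roughly like this; the build_model call and the details are illustrative, not our actual training script:

        # Launched as: torchrun --nproc_per_node=8 train.py
        import os
        import torch
        import torch.distributed as dist
        from torch.nn.parallel import DistributedDataParallel as DDP

        dist.init_process_group("nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = build_model().to(local_rank)         # hypothetical model constructor
        model = DDP(model, device_ids=[local_rank])  # gradients sync across the 8 GPUs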

    While we were at it we tried Azure AI Studio, and we quickly found out that we were not the target audience for it. It had a steep learning curve, and the abstraction level seemed too high. Later our client provided us with access to GPU machines on Google Cloud Platform. From a usability perspective, we felt it was a step down from Azure. We started to experience inconveniences similar to those of Vast. For example, we had a server in the Asia Southeast region that was working great. We turned it off one night, and we could never get an available server again in the same region. This meant that we needed to access the disk attached to the server and move the data across the globe to another region so we could continue training.

    Reducing training loss

    Once we managed to reproduce the SpeechGPT paper, we discovered that the performance of the model was not good enough. It often did not understand clear voice samples that were trivial for speech recognizers such as Whisper. So we tested several ideas to see how much we could improve it. We will not go into deep detail about everything we tried (that is for another post), but we did manage to reduce the model loss somewhat (e.g. using sin/cos positional encoding in the embeddings, as in the chart below).
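
    For reference, the sin/cos (sinusoidal) positional encoding we tried follows the original Transformer formulation; this is a generic sketch with illustrative dimensions rather than our exact configuration:

        # Classic sinusoidal positional encoding: even dims get sin, odd dims get cos.
        import math
        import torch

        def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
            pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
            div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
            pe = torch.zeros(seq_len, dim)
            pe[:, 0::2] = torch.sin(pos * div)
            pe[:, 1::2] = torch.cos(pos * div)
            return pe

        pe = sinusoidal_positions(seq_len=2048, dim=4096)   # added to the token embeddings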

    At this point, the client asked if we could try to replicate the same results with a non-Llama model, so we started experimenting with Mistral. It was quite difficult to port everything, mainly because the SpeechGPT scripts were heavily tied to Llama, and our hyperparameters were as well. We managed to train a Mistral model, but training times worsened.

    At this stage, we thought it would be a good idea to see if the model could answer yes/no questions, so we started experimenting with a Google dataset called BoolQ. We used TTS (text-to-speech) to generate audio files from the BoolQ questions. We could also dive deeply into this, but it exceeds the scope of this post.
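
    The gist of that evaluation setup, sketched with the Hugging Face datasets library and gTTS as an example TTS engine (not necessarily the exact tools we used):

        # Turn BoolQ questions into audio clips to probe yes/no answering.
        from datasets import load_dataset
        from gtts import gTTS

        boolq = load_dataset("boolq", split="validation")
        for i, row in enumerate(boolq.select(range(100))):
            gTTS(row["question"]).save(f"boolq_{i}.mp3")
            # each clip is then converted to discrete units, fed to the model,
            # and the generated answer is compared against row["answer"] (True/False)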

    We also thought that, instead of extending Llama’s vocabulary to include Hubert tokens, we could reuse Llama tokens that were rarely used. We started experimenting with that approach, but we didn’t finish it.

    At that point, we were asked to evaluate Gazelle, an alternative voice-to-voice model published by Chris Hua around that time.

    Word Error Rates

    Before a voice-to-voice model can be used in production, it is important to know whether it is capable of understanding speech correctly and transcribing it. For this, we measured word error rates on speech samples with known transcriptions. It was clear that our model was not on par with the state of the art (OpenAI’s Whisper, for example), and still not close to Gazelle 0.1 and 0.2.
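
    Word error rate itself is simple to compute; here is a minimal sketch with the jiwer library and a made-up example:

        # WER = (substitutions + deletions + insertions) / words in the reference.
        from jiwer import wer

        reference  = "turn off the lights in the living room"
        hypothesis = "turn of the light in the living room"
        print(wer(reference, hypothesis))   # two substitutions out of eight words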

    Switching to MosaicML, Neptune, Weights and Biases

    By now, we had lost count of how many training runs we had launched. Up to this point, our process had entailed manually running training scripts from a Linux terminal and processing the output to create charts. This was too ad hoc and error-prone, so we started automating part of the process. For training, we switched to Databricks’ MosaicML. We also connected our training runs to neptune.ai, an experiment tracker that allowed us to compare runs (even in real time) without having to examine the logs by hand. We tried Weights and Biases as well, but the functionality we needed was well covered by Neptune. As a result, we settled on MosaicML + neptune.ai for the remainder of the training cycle, which accelerated our iteration speed. We did not exploit the full potential of what these tools can offer, but it is clear that a small organization with limited time is better off not using GPU instances directly. That is better left to larger companies that need to focus on lowering hardware costs as opposed to developer effort.
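
    As a taste of what the tracking looked like, here is a minimal sketch of logging a run to neptune.ai; the project name and the logged fields are illustrative:

        # Create a run, record hyperparameters, and stream the training loss.
        import neptune

        run = neptune.init_run(project="workspace/voice-to-voice")
        run["parameters"] = {"lr": 2e-5, "batch_size": 16, "base_model": "llama"}

        for step, loss in enumerate(training_losses):   # produced by the training loop (not shown)
            run["train/loss"].append(loss)

        run.stop()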

    Conclusions

    When experimenting at the cutting edge of AI, it is essential to iterate very fast. This means one must be extremely comfortable with the tools and process. If we had to do this again, we would probably start with MosaicML, because it did save us a fair amount of manual work. If we had to use one of the main cloud providers, we would pick Azure and launch our own Docker containers with the training environment that we know and love. We would stay away from Google Cloud Platform if possible. There is still a place for vast.ai: if you want a quick, inexpensive proof of concept, you can always opportunistically grab a cheap GPU for a throwaway experiment.

    Do you have a project that involves training or fine-tuning open-source models? Contact us!