If you have seen the movie Memento, you may remember a scene in which the protagonist is running in the street and has an internal monologue. It goes like this: “so… what am I doing? Oh, I’m chasing this guy. No, he’s chasing me.” Neither we nor he has any context of what is happening beyond what we see in that instant. This is very similar to what happens to an LLM when it runs inference.
Is ChatGPT an LLM? No, it’s much more
An LLM is a machine that can continue a sequence of words. Most people, by now, have interacted not with an LLM directly but with an agent such as ChatGPT or Claude. However, you can go to a playground and make an LLM continue any text, like this (generated by Llama 3 8B, in case you’re curious):
Prompt: Am I chasing this guy? No,
Completion: 100 percent not. I’m just doing what I love to do and that’s play baseball.” The only thing keeping this from being a no-doubt pick for the Hall of Fame is the fact that he didn’t play in the World Series.
In this case, all the context the LLM has is that there was a phrase that ends in a comma, and it must pick up from there and generate more words that make sense. But this is not what Claude and ChatGPT are. They are applications that:
- know the history of their interactions with you
- may know your current time and location
- have the capability of searching the internet and custom resources to enrich the context
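If you want to see the raw, application-free behavior for yourself, a few lines in a notebook are enough. This is only a sketch: the model name is illustrative (Llama 3 8B happens to be what produced the completion above, and it is gated on Hugging Face), and any base, non-chat model will show the same thing.

```python
# A sketch of raw text continuation, the way a playground would run it.
# The model name is illustrative; any base (non-chat) model behaves the same way.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B")
out = generator("Am I chasing this guy? No,", max_new_tokens=60, do_sample=True)
print(out[0]["generated_text"])  # the model just keeps writing whatever fits
```

There is no memory, no instructions, no persona here: just a sequence of words and a machine that extends it.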
Have you ever heard of RAG? It’s probably not what you think
RAG stands for Retrieval-Augmented Generation. It’s a fancy way of saying that a system that uses an LLM needs a way to insert new information into the LLM’s context. This is because the LLM is frozen in time. You can download an LLM made several months ago, and it will not have any recent information in its training data. Imagine you wake up in a room and it’s the year 2032. Someone asks you: what do you think of the president of the New Soviet Union? Obviously you have no idea whether the New Soviet Union even exists, let alone who governs it. But there is a sign in front of you that says: “if someone asks for current events, enter a query here.” So you type the question, you learn what you need, and you answer confidently. Your interrogator has no idea that you just looked this up.
In early 2023, for some reason, people started using Vector Databases as the main way to retrieve information for this purpose. The idea is very simple: you can think of a piece of information as being located somewhere in a space of meaning. For example, the concept of a limousine is relatively close to the concept of a taxi, and not so close to a pitchfork. So if you take, say, all your emails and place them in their proper regions of semantic space, you might expect that a question whose answer is contained in one message would also sit close to that message in the space. For example, “can you find the invite to the party” might surface emails that mention saving the date, invitations, birthdays, or gifts, even if the words “party” or “invite” are never present.
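Here is a minimal sketch of that idea, assuming the sentence-transformers library (the emails and the model name are just placeholders; a vector database does essentially this at scale, with an index instead of a brute-force dot product):

```python
# Place a few emails in "semantic space" and find the ones nearest to a query.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

emails = [
    "Save the date! Maria's birthday dinner is on the 14th.",
    "Your invoice for March is attached.",
    "Don't forget to bring a gift, the celebration starts at 8pm.",
]

# Embed every email once, then embed the query at search time.
email_vecs = model.encode(emails, normalize_embeddings=True)
query_vec = model.encode("can you find the invite to the party", normalize_embeddings=True)

# With normalized vectors, the dot product is the cosine similarity.
scores = email_vecs @ query_vec
for score, email in sorted(zip(scores, emails), reverse=True):
    print(f"{score:.2f}  {email}")
```

The birthday emails score highest even though neither contains the words “party” or “invite”: that is the whole appeal of semantic search.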
Relevance and similarity are different things
Of course, it could be that the most similar email to that question is from 2009. That doesn’t make it relevant. A less similar email from three weeks ago is much more relevant. Relevance is determined by a large number of signals that include recency, but also location, keywords, the context of the query, and so on. This is why the way we currently do RAG involves giving the LLM access to a bunch of bespoke tools, and a long prompt explaining how to use them. For example, if you ask ChatGPT what to wear tonight, it might look up the weather for your location.
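To make the distinction concrete, here is one (entirely made-up) way to blend similarity with a recency signal; the weights and the half-life are arbitrary, and real systems use many more signals than these two:

```python
# Illustrative only: similarity alone is not relevance.
import math
from datetime import datetime, timedelta, timezone

def relevance(similarity: float, sent_at: datetime, half_life_days: float = 30.0) -> float:
    """Blend cosine similarity with an exponential recency decay (hypothetical weighting)."""
    age_days = (datetime.now(timezone.utc) - sent_at).days
    recency = math.exp(-math.log(2) * age_days / half_life_days)  # 1.0 today, 0.5 after a month
    return 0.7 * similarity + 0.3 * recency  # arbitrary weights, for illustration

old = relevance(0.92, datetime(2009, 5, 1, tzinfo=timezone.utc))
recent = relevance(0.80, datetime.now(timezone.utc) - timedelta(weeks=3))
print(old, recent)  # the recent email wins despite its lower similarity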

In this case, ChatGPT already knows where I am as I write this. I would guess this is part of a tool that holds user context. Try asking it what it knows about you. It also has access to web search. If it didn’t, it might make up plausible-looking search results that don’t exist.
But wait, how does the system know which tools to use?
This is indeed an extremely difficult problem. If you have ever used LangChain, you know the basic idea: the agent runs a loop in which it evaluates the current state of the conversation and decides what to do. It might have a prompt along these lines:
You are an AI-powered assistant that excels at retrieving, analyzing, and generating information. You have access to the following tools:
- Python: For running calculations, data analysis, and generating visualizations.
- Web Search: For finding up-to-date information from the internet.
- Document Reader: For extracting and summarizing content from uploaded documents.
- Chat History Analysis: For understanding user preferences and tailoring responses based on past interactions.
- Memory Management: For storing, retrieving, and modifying user-specific context to maintain continuity across conversations.
...
Use these tools effectively to assist users with their requests. Always choose the most efficient tool for a given task, combining multiple tools if necessary. Provide clear explanations of your actions when you employ tools.
When we build an agent like this, we hope that the LLM predicts that a subsequent response should contain the results of a call to one of these tools. But of course, the LLM has very limited reasoning capabilities and is somewhat unpredictable. A slight variation in wording might make it choose one tool over another. It might tell you “sorry, I cannot assist you with this because I don’t know what to do with this file,” or it might trigger a call to the document tool. In the first case, you might say “I have seen your prompt, and I know you have a Document tool.” The agent would probably respond “oh, pardon my mistake. Of course you are correct” and proceed to analyze the document.
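Stripped of frameworks, the loop looks roughly like this. The `call_llm` function and the tools are hypothetical stand-ins; LangChain and similar libraries add structured tool schemas, retries, and memory on top of essentially this pattern:

```python
# A bare-bones sketch of the agent loop described above.
import json

TOOLS = {
    "web_search": lambda query: f"(search results for {query!r})",
    "document_reader": lambda path: f"(summary of {path})",
}

def run_agent(call_llm, user_message: str, max_steps: int = 5) -> str:
    messages = [
        {"role": "system", "content": "You may answer directly, or reply with "
         'JSON like {"tool": "web_search", "input": "..."} to use a tool.'},
        {"role": "user", "content": user_message},
    ]
    for _ in range(max_steps):
        reply = call_llm(messages)           # the LLM just continues the conversation
        try:
            action = json.loads(reply)       # did it decide to call a tool?
        except json.JSONDecodeError:
            return reply                     # plain text: treat it as the final answer
        if not isinstance(action, dict) or action.get("tool") not in TOOLS:
            return reply
        result = TOOLS[action["tool"]](action["input"])
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "Gave up after too many steps."
```

Everything hinges on that `json.loads` branch: whether the model emits a tool call or a plain answer is a prediction, not a decision, which is exactly why a slight change in wording can send it down a different path.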
Does building an agent need to be this complicated?
For the time being, I’m afraid so. The state of the art for building intelligent agents relies heavily on heuristics, clever workarounds, and tools that sidestep the inherent limitations of LLMs. It seems that LLM reasoning capabilities have reached a plateau. More data does not make an LLM capable of better reasoning; rather, it makes it more likely to be correct when asked about obscure facts. This is why effective agents depend not only on the power of LLMs but also on thoughtful prompt engineering and the careful integration of clever tools to fill in the gaps.
Nobody knows if another breakthrough is near. For now, crafting an effective agent is both engineering and art. We are having a blast doing it, and if you’re thinking about creating a custom agent, for whatever purpose, we’d love to chat.
Now, where was I.