The Rise of LLMs and AI Hype Cycles
Eric Laufer, Data Scientist Tech Lead
June 29, 2023
It is undeniable that we are in yet another AI hype cycle. Hype cycles are not inherently bad; underlying them are scientific and engineering breakthroughs, and they generate investment, opportunities and, hopefully, better products for end users. Despite the curve of the same name, each hype cycle leaves us with higher expectations than the last. The current enthusiasm is driven mainly by two classes of models: large transformer language models and image generation models. The first group includes GPT and its variants, LLaMA, PaLM, Claude, and so on; the second is characterized by Stable Diffusion and proprietary text-to-image models from Midjourney, OpenAI and others. Adding to the excitement is the rapid rate at which new models get open-sourced and their inference optimized, models whose performance matches that of proprietary counterparts from six to twelve months prior. Peritus is more involved with the first group of models, which will be our main reference for the viewpoints discussed here.
To be sure, all of these are incredible engineering breakthroughs, but in the current wave of AI enthusiasm it can be difficult to distinguish unbridled optimism from reasonable business ventures and to maintain a realistic view of technological progress. With every AI hype cycle, the same question arises: where and how can we use this awesome technology to build new products and improve, or even replace, existing ones? It is sometimes said that too many degrees of freedom limit creativity, and while the promise of LLMs currently seems boundless, we need to take a hard look at popular use-cases and technical limitations, lest we drown in that sea of possibilities.
Product and technology
A friend in the AI world once told me, "it's way easier to build ML tooling than actual good products that use ML; that's why everyone defaults to becoming an ML tooling service". One symptom that the market does not yet quite know what to make of the current technology is the number of new platform and tooling companies relative to the number of actual new products or features built on it. Platforms and tools are still essential for companies to integrate new technology; once enough people know that gold can be found, others start selling shovels. This tooling expansion echoes the deep learning frameworks developed over the last decade or two: many people knew deep learning was going to be the bleeding edge of AI development, so we witnessed the creation of PyTorch, TensorFlow, Caffe, CNTK, Theano and Chainer, among others. Libraries were then built on top of some of these, along with satellite tooling for managing training and experimentation.
However, without the help of hindsight, this alone would have told us little about the types of products and features we see today. Some ML tasks, once solved, constitute a product in themselves because they address a specific or widespread use-case; audio transcription is a good example. Other ML tasks may only serve internal processes or replace programming logic that would otherwise be impossible to express without machine learning, such as image classification. In terms of product strategy, it's useful to understand how products position themselves with regard to new AI technology. We can see three broad categories:
- First are "single-model" products, which provide access to a specific, high-quality model that solves an important or complicated task. We use quotation marks here because these may not be a single model in the strictest sense, but there is usually a single API to interact with. For example, GPT-4 is rumored to be several models under the hood, but for the end user this makes no difference in terms of interaction.
- Second are tools and platforms that enable the development and management of the models themselves. This includes some of the tools mentioned in our previous blog post, related technologies like vector databases (which we have discussed in another blog post and will comment on further below), and model training and monitoring toolkits.
- Finally are integrated products, which use either of the first two to enable various features within a larger application context. This can potentially include any existing product of sufficient complexity; as an example, imagine the plethora of auto-suggestions and improved help functionality that will progressively appear in many of the applications we already use.
Although many internal business meetings were probably premised on the question "How can we use ChatGPT?", there are no easy conclusions. Showing users ChatGPT outputs (or any other LLM output) is like handing the steering wheel to a robot, and it requires some amount of injected context to be relevant within an application. In addition, ChatGPT is such a strong standalone product that its user-facing presence in any application is almost unmistakable today, and using it directly is as easy as a Google search. Furthermore, the general disadvantages of LLMs, which we will discuss later, may be why those meetings have yet to reach conclusions. Blindly plugging ChatGPT into any product, as many have now concluded, is not a sound strategy.
Patterns of usage
What happened next is that many patterns of usage emerged for LLMs. Given that their sole functionality is to generate text, we can distinguish between generating text that a human will read and generating text that a machine will use. The latter pattern, generating text that humans will not read, includes but is not limited to tool usage, zero-shot classification of various kinds, and virtually any other standard NLP task. As the variety, availability and quality of models evolve, this type of usage may increase, because today running GPT over a large number of documents can be prohibitively costly and fragile: any change in the prompt or in the input documents may require reprocessing every existing document. Using GPT to perform a complex classification task on data and then indexing on the newly created field is not always feasible in practice.
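To illustrate the machine-consumed pattern, here is a minimal sketch of zero-shot classification via the OpenAI chat API as it looked in mid-2023 (openai-python pre-1.0); the label set and the `classify` helper are hypothetical choices for illustration, not a standard recipe.

```python
import openai

LABELS = ["bug report", "feature request", "question"]

def classify(text: str) -> str:
    """Ask the model to map a document onto one label from a fixed set."""
    prompt = (
        f"Classify the following text as exactly one of {LABELS}. "
        "Respond with the label only.\n\n"
        f"Text: {text}"
    )
    response = openai.ChatCompletion.create(  # pre-1.0 openai-python API
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the output as deterministic as possible
    )
    return response["choices"][0]["message"]["content"].strip()
```

The output is never shown to a human; it becomes a field to index on, which is precisely where the reprocessing cost bites whenever the prompt or the corpus changes.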
As for generating text that a human will read directly, many use-cases center around a few techniques and patterns. Beyond the initial context-less chatbot that LLMs offer, they can be enhanced with context, help edit and create text content, or summarize existing text. Most notable is the retrieval-augmented generation (RAG) paradigm, whose definition has shifted slightly from its original 2020 description to the following general recipe:
- fetch documents using any search or information retrieval algorithm,
- fill an engineered prompt with the top results,
- and ask for a single, synthesized answer.
This approach works fairly well for adding a layer of polish to any search or question-answering feature, and was quickly implemented by many companies big and small, such as Bing, You.com and Phind.
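To make the recipe concrete, here is a minimal RAG sketch under the same mid-2023 OpenAI API assumptions; the `search` function is a stand-in for whatever retrieval backend you have, and the prompt template is just one illustrative option among many.

```python
import openai

PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context is insufficient, say so.

Context:
{context}

Question: {question}
Answer:"""

def rag_answer(question: str, search, k: int = 3) -> str:
    # 1. fetch documents using any search or information retrieval algorithm
    docs = search(question, k=k)  # `search` is a placeholder backend
    # 2. fill an engineered prompt with the top results
    prompt = PROMPT_TEMPLATE.format(context="\n---\n".join(docs),
                                    question=question)
    # 3. ask for a single, synthesized answer
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]
```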
From a product development point of view, here's why RAG became so popular. The initial GPT hype convinced people that everything warrants a natural language interface or response. People then realized that GPT's knowledge is limited (and sometimes hallucinated), that context matters, and that prompt engineering cannot solve everything. The solution, then, is to retrieve as much relevant data as possible and include it in the prompt. Along the way, vector search became strangely linked with RAG, for no apparent reason other than OpenAI offering the text-embedding-ada-002 model and the existence of other embedding-centric transformers like sentence-BERT or INSTRUCTOR.
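For reference, producing such embeddings takes only a few lines with the sentence-transformers library; the model name below is a popular sentence-BERT style default, chosen here purely for illustration.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "How do I reset my password?",
    "Quarterly revenue grew 4% year over year.",
]
# Normalized vectors make cosine similarity a plain dot product later on.
embeddings = model.encode(docs, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per document
```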
With vector search comes the need for vector databases, and so we saw options and products multiply in the hope of capturing the market's newly created need for approximate nearest neighbor search (to be fair, vector search is also used in image, video and audio retrieval). Vector search works quite well with the models mentioned here, but it does not necessarily replace standard information retrieval techniques in all contexts. Nor does it replace techniques for structured question answering. Nor does it make the quality of customized information retrieval any less important.
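Stripped of indexing machinery, vector search boils down to something like the sketch below: with normalized embeddings, cosine similarity is a dot product, and vector databases add approximate nearest neighbor indexes (HNSW, IVF and friends) on top so this scales beyond brute force.

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5):
    """Exact nearest neighbors: cosine similarity via dot product."""
    scores = doc_matrix @ query_vec   # one similarity score per document
    best = np.argsort(-scores)[:k]    # indices of the k highest scores
    return best, scores[best]

# Reusing `model` and `embeddings` from the previous sketch:
# query = model.encode(["password reset"], normalize_embeddings=True)[0]
# print(top_k(query, embeddings, k=1))
```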
As everyone plays around with RAG, search and recommendation quality become even more important, for summarizing irrelevant results does not make them relevant. Vector search also encourages lock-in to a single embedding model: on one hand, using one implies you have a decent volume of data; on the other, switching from one embedding model to another can be costly, whether it is a custom model or an embedding-as-a-service offering.
Setting aside standalone single-model products like GitHub Copilot and ChatGPT, we are left with a handful of practical usage patterns for LLMs, revolving around contextual text editing and summarization in the form of RAG, whose performance still depends on search quality.
Elephants in the room
Even though everyone is now talking about LLMs, the truth is that OpenAI is still leading the hype. As is fashionable among prominent tech entrepreneurs, Sam Altman went on a world tour to speak about the dangers of the technology his company created and how "this is only the beginning". Google and Facebook initially claimed that such a technology would put their reputations too much at stake, that no one really had any moat, and that the technology wasn't even that impressive. That did not prevent them, or other companies and collaborative projects, from frantically building and releasing LLMs of their own. Leaderboards have recently sprung up to track which models are best, with the GPT family still occupying the top slots. Companies that want to use LLMs still face a careful decision: go with OpenAI for an immediate edge, or invest in new models and infrastructure to future-proof themselves and hedge against potential monopolies.
Despite their initial dazzle, LLMs remain a somewhat cumbersome technology for a few reasons. Many new features and products are being built around LLMs, but that does not prevent them from introducing new problems such as hallucinations, being difficult to test, and remaining costly and relatively slow to use at scale. Methods like distillation and lower floating-point precision will prove even more important in the future for enabling widespread usage. Even promising techniques for improving LLM outputs work against easy adoption: tricks like Reflexion and chain-of-thought prompting simply add to the number of calls and tokens sent, resulting in greater cost and latency. LLMs are a bit like propulsion engines: you can probably build something impressive with them, but you still need a reasonable plan and expertise to do so. To continue the analogy, everyone is still figuring out which nails warrant a propulsion-engine-powered hammer.
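To put the cost point in rough numbers, here is a back-of-the-envelope sketch using the tiktoken tokenizer; the two prompts are invented for illustration, but the asymmetry they show is the general pattern.

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

direct = "Q: Is 17077 prime? Answer yes or no."
cot = direct + " Let's think step by step, checking divisibility before answering."

for name, prompt in [("direct", direct), ("chain-of-thought", cot)]:
    print(f"{name}: {len(enc.encode(prompt))} prompt tokens")
# The completion side grows far more: a bare "yes"/"no" versus paragraphs of
# intermediate reasoning, so each such trick multiplies both cost and latency,
# since API pricing and generation time are both per token.
```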
The push for widespread integration of these new models is in tension with practical considerations and their situational usefulness. Combine dead-internet conspiracy theories, the proliferation of bots and fake accounts, 'fake news,' and ever more eloquent spam emails, and we face a complex landscape of information credibility. Add AI fear-mongering and shrinking attention spans to the mix, and it's not obvious that everyone will readily accept being fed AI-generated text in any context, or welcome natural language interfaces for every task. If anything, we may begin to ignore anything written in ChatGPT's generically neutral and arid tone, unless we specifically queried it ourselves.
Parting thoughts
We are still far from seeing the full impact of LLMs, despite their almost vertical trajectory in popularity and buzz. Though some of Sam Altman's claims may be a bit sensationalist and marketing-driven, his point that the future holds even more promise rings true as the market adjusts to this new technology. What drives innovation may not be the current or next generation of GPT models, but smaller, more specialized, open-sourced models that the collective AI community can tweak and tailor to specific use cases. Altman himself says people are overly focused on parameter counts, and recent research suggests that higher data quality can produce LLMs that are both useful and much smaller. As LLMs become more and more commoditized, their use will become more transparent, and we may not even notice their presence as much. Sensible use of machine learning often involves picking the simplest model, for the sake of scalability and simplicity; or, as they say, avoiding a bazooka to kill a fly.