Building LLM tooling
Zsolt Tanko, ML Engineer
June 15, 2023
Since our last blog post about GPT-4, the LLM dust has begun to settle and some broad features of the landscape we're now inhabiting have become visible. Many existing organizations working with text data have moved towards deploying LLM technology, mostly in the form of OpenAI's GPT family of models and related text embeddings. Beyond that, an ecosystem of startups has emerged serving novel products that weren't possible pre-LLMs: chatbots and personal assistants leveraging semantic search. There are very, very many services offering embedding-backed natural language interfaces for searching podcasts, for example.
These latter contexts, where the capacities of the new technology are most brightly on display, get much of the spotlight. The hype at the level of public sentiment has its reflection on the ground, too, where data scientists and engineers are building with LLMs. Today's blog post gives a short overview of how we've been building with LLMs here at Peritus, in particular for our Security Copilot offering, with some gentle commentary on how tooling has been evolving and where it might go in the near future.
LLM tooling
Developers have, on balance, been allocating their time in alignment with the trends we've highlighted above. New libraries offering infrastructure for prompting LLMs, like langchain and Microsoft's guidance, have seen rapid development and adoption. The AutoGPT project — an LLM-powered, general-purpose agent able to retrieve data from the internet, write code, and execute commands — has gone from curious experiment to serious technological effort. So many vector database services and libraries have appeared to support semantic search that our experiences with them need to be sequestered to a separate, upcoming blog post.
All that is gold does not glitter, though. Much of the value to be captured in deploying LLM technology lives in discovering value-add for existing data-centric businesses, where chatbots and agents can only be a relatively small part of the value proposition and, therefore, of the engineering efforts that transmute data into business-relevant outcomes.
Here at Peritus, a busy year has been spent discovering what LLMs have to offer and making it a reality. Our earlier blog post focused on the discovery part; today's is about what making it real has looked like. For us, adopting existing open-source LLM infrastructure has often felt clumsy, opaque, and forced, all gentle hints that doing so might be wielding a hammer in search of a nail.
The dividends of building infrastructure
This sense was slow to arrive but obvious in retrospect: the tools most aggressively developed by the early adopters of LLMs in the open source community are aimed at the most novel applications. Those tools may well be fit to purpose there, but we shouldn’t expect that to be true in other settings. Some of our most significant use-cases for prompting GPT, for example, are focused on extracting structured data from GPT’s raw text response. The best approaches we’ve found are the result of careful and detailed prompt engineering combined with custom response processing, both the result of much empirical work with exactly the data and problems we have at hand. In other words, our tooling is custom and, so far, our sense is that it needs to be.
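To make that concrete, here is a minimal sketch of the pattern rather than our production code: the task, the `extract_tags` helper, the prompt wording, and the model choice are all illustrative assumptions. The idea is to ask the model for a machine-readable format and still parse its reply defensively, since the raw text often arrives wrapped in prose or code fences.

```python
import json
import re

import openai  # pre-1.0 OpenAI Python client, as available mid-2023


def extract_tags(document: str) -> list[str]:
    """Ask GPT for structured output, then defensively parse the raw text reply."""
    prompt = (
        "Extract the security-relevant topics from the text below.\n"
        "Respond with a JSON array of short strings and nothing else.\n\n"
        f"Text:\n{document}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    raw = response["choices"][0]["message"]["content"]

    # The model sometimes wraps the array in prose or code fences,
    # so pull out the first JSON-looking span before parsing.
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if match is None:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
```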
Another central challenge in incorporating prompts into our services has been API response times: currently the API for GPT-4 averages about 20 seconds to respond to a single prompt. In settings where data processing is served in real time, that means incorporating OpenAI’s most powerful model as-is is a no-go. Here too we’ve rolled out solutions specific to our use-cases: developing prompts designed to yield the relevant output early so that GPT responses can be streamed and terminated early, and carefully balancing which OpenAI model we use in a given context. Too powerful and the latency is unworkable; not powerful enough and the response is unusable.
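A rough sketch of the streaming side of this, again illustrative rather than our actual code: it assumes the prompt has been engineered so that the label we need (here a hypothetical `VERDICT:` marker) appears at the start of the response, which lets us stop reading the stream as soon as that line is complete instead of paying for the full generation time.

```python
import openai  # pre-1.0 OpenAI Python client, as available mid-2023


def classify_streamed(prompt: str, stop_marker: str = "VERDICT:") -> str:
    """Stream a chat completion and stop once the part we care about has arrived.

    Assumes the prompt asks the model to lead with a line like "VERDICT: relevant",
    so the stream can be abandoned as soon as that first line is finished.
    """
    stream = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # a lighter model where GPT-4 latency is unworkable
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        stream=True,
    )
    collected = ""
    for chunk in stream:
        delta = chunk["choices"][0]["delta"]
        collected += delta.get("content", "")
        # Stop once the marker has appeared and the line carrying it has ended.
        if stop_marker in collected and "\n" in collected.split(stop_marker, 1)[1]:
            break
    return collected
```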
In each of these cases, the value-add was realized through experimentation: trial and error, and building careful intuition for what works in our contexts. Pre-made solutions, when taken at face value, are able to solve some of these problems, but our experience has been that tinkering, trying, exploring, hacking, building really matters. Engineering matters.
What’s next
Dust somewhat settled, we feel we have the lay of the land. And, more importantly, as the tooling around LLM technology inevitably matures, we feel well positioned to leverage new developments, having spent our time accruing experience in the LLM trenches. Fittingly, we were recently granted access to Anthropic's Claude model, which promises significantly better latencies, isn't far off from GPT-4 in capability, and offers a massive context window for prompting. That might solve some problems for us, but our capacity to deploy an entirely new LLM backend efficiently and effectively relies on what we've learned. And indeed, new models and functionality will continue to appear that deus ex machina some technical problems away, add new capacities, and bring new idiosyncrasies. OpenAI's recently announced function calling feature is a perfect example: new affordance, new quirks. Incorporating these developments will always demand fresh tooling, and we're confident that familiarity and experience in developing that tooling is a permanently valuable resource.
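For the curious, here is a hedged sketch of what that new affordance looks like, with a hypothetical function name and schema standing in for anything we actually run: instead of coaxing JSON out of the raw completion text, the model is pointed at a declared schema, though the returned arguments still arrive as a string that can fail to parse, hence "new quirks."

```python
import json

import openai  # pre-1.0 OpenAI Python client with the June 2023 function-calling models


def extract_topics_via_function_call(document: str) -> list[str]:
    """Use function calling to get schema-shaped output instead of free text."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": f"List the topics in this text:\n{document}"}],
        functions=[{
            "name": "record_topics",  # hypothetical function name, for illustration only
            "description": "Record the topics mentioned in a document.",
            "parameters": {
                "type": "object",
                "properties": {
                    "topics": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["topics"],
            },
        }],
        function_call={"name": "record_topics"},  # force the structured path
    )
    message = response["choices"][0]["message"]
    # The arguments come back as a JSON string and can still be malformed,
    # so the defensive parsing habit doesn't disappear entirely.
    try:
        return json.loads(message["function_call"]["arguments"]).get("topics", [])
    except (KeyError, json.JSONDecodeError):
        return []
```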
Next week, we’ll continue the theme of LLM tooling with a more technical post outlining our experiences with vector database technologies.