As large-scale language model (LLM) technology becomes increasingly mature, various industries have accelerated the pace of LLM application implementation. In order to improve the practical application effect of LLM, the industry has made many efforts. Recently, the LinkedIn team shared their valuable experience in building generative AI products. LinkedIn says building products based on generative AI hasn't been smooth sailing, and they've encountered difficulties in a number of areas. The following is the original text of the LinkedIn blog. Over the past six months, our team at LinkedIn has been hard at work developing a new AI experience that attempts to reimagine how our members apply for jobs and browse professional content. The explosive growth of generative AI makes us stop and think about what is now possible that was impossible a year ago. We tried a lot of ideas without success and ultimately found that the product needed key things like: faster access to information, such as getting key points from posts or staying up to date on company updates. Connect the dots of information, such as assessing your suitability for a position. Get advice, such as improving your profile or preparing for an interview. ...
Example: How a newly developed system worksWe use a real-life scenario to show how a newly developed system works. Imagine you are scrolling through your LinkedIn feed and stumble upon an interesting post about accessibility in design. In addition to this article, you will also find some introductory questions to delve deeper into the topic and you are curious, such as "What are some examples of accessibility driving business value in technology companies?"
Systems Background operations:
-
Choose the right agent: The system takes your problem and decides which AI agent is best suited to handle it. In this case, it identifies your interest in accessibility within tech companies and routes your query to an AI agent specialized in performing general knowledge searches.
-
Gather information: AI agents call a combination of internal APIs and Bing, searching for specific examples and case studies that highlight how accessibility in design contributes to business value in technology.
-
Formulate a reply: With the necessary information, the agent can now compose a reply. It filters and synthesizes data into coherent, information-rich answers, giving you clear examples of how accessibility initiatives deliver business value to technology companies. To make the experience more interactive, internal API calls are made to use attachments such as links to articles or profiles of people mentioned in posts.
Interactivity:
You may ask "How can I turn my career to this field", then the system will repeat the above process, but now transfer you to career and job (career and job) AI agent. With just a few clicks, you can delve deeper into any topic, gain actionable insights, or find your next job opportunity.
Technical basis:
Most of the new features are made possible with the help of LLM technology.
Overall design:
The system pipeline follows Retrieval Augmented Generation (RAG), which is a common design pattern for generative artificial intelligence systems. Surprisingly, building the pipeline was less of a headache than we expected. In just a few days, we had the basic framework up and running:
-
Routing: Decide if a query is in range and which AI agent to forward it to.
-
Retrieval: Recall-oriented step, the AI agent decides which services to call and how to call them (such as LinkedIn people search, Bing API, etc.).
-
Generate: A precision-oriented step that sifts through the retrieved noisy data, filters it and generates the final response. 查 Figure 1: Simplify PIPELINE of user query. KSA stands for "Knowledge Sharing Agent" and is one of dozens of agents that can handle user queries.
Key designs include:
Fixed three-step pipeline;
Small model for routing/retrieval, for larger models generated;
Embedding-based retrieval (EBR), powered by an in-memory database, injecting response examples directly into Prompt;
Specific evaluation pipeline at each step, especially for routing/retrieval.
Development Speed
We decided to split the development tasks into developing independent agents by different people: common sense, job evaluation, job points, etc.
By parallelizing development tasks, we increase development speed, but this comes at the cost of "fragmentation". Maintaining a unified user experience becomes challenging when subsequent interactions are with assistants that are managed through different models, prompts, or tools.
To solve this problem, we adopted a simple organizational structure:
A small "horizontal" engineering pod that handles common components and focuses on the overall experience, which includes:
Services to host the product
Evaluation/testing tools
Global prompt templates used by all verticals (e.g. global identity of the agent, conversation history, jailbreak defense, etc.)
Shared UX components for iOS/Android/Web clients
Server-driven UI framework for publishing new UI changes without changing or releasing client code.
Key designs include:
Divide and conquer, but limit the number of agents;
Centralized evaluation pipeline with multiple rounds of dialogue;
Shared prompt templates (e.g. “identity” definition), UX templates, tools and instrumentation
Evaluation
Facts Assessing the quality of responses proved more difficult than expected. These challenges can be broadly divided into three areas: developing guidelines, extending annotations, and automated evaluation.
Developing a guideline is the first obstacle. Take job evaluations, for example: there's not much use in clicking "Evaluate my suitability for this job" and getting "You're a great fit." We want responses to be both authentic and empathetic. Some users may be considering a career change into a field for which they are not currently well suited and need help understanding the gaps and next steps. Ensuring that these details are consistent is critical to the annotator.
Extending comments is the second step. We need consistent and diverse annotators. Our in-house team of linguists built tools and processes to evaluate up to 500 daily conversations and capture relevant metrics: overall quality score, hallucination rate, AI violations, coherence, style, and more.
Automatic evaluation work is still in progress. Without automated evaluation, engineers could only visually inspect the results and test on a limited set of examples, with a delay of more than 1 day before understanding the metrics. We are building model-based evaluators to evaluate the above metrics and are working to achieve some success in hallucination detection, and the end-to-end automated evaluation pipeline will enable faster iterations.
Call internal API
LinkedIn has a wealth of unique data about people, companies, skills, courses, and more that is critical to building products that provide differentiated value. - However, the LLM has not been trained on this information and therefore cannot use it to reason and generate responses.
- The standard pattern for solving this problem is to set up a Retrieval Augmented Generation (RAG) pipeline through which an internal API is called and its response is injected into subsequent LLM prompts to provide additional context to support the response.
- A lot of this data is exposed internally via RPC APIs in various microservices.
- We solve this problem by wrapping "skills" around these APIs. Each skill has the following components:
-
A human-friendly description of what the API does and when to use it - Configuration for calling the RPC API (endpoints, input modes, output modes, etc.)
- LLM-friendly input and output modes
- Primitive type (String/Boolean/Number) values
- Input and output schema description for JSON schema
- Business logic for mapping between LLM friendly schema and real RPC schema
-
We write prompts that ask the LLM to decide what skills to use to solve a specific job (skill selection via planning), and then output parameters to invoke the skills (function calls). Since the parameters of the call must match the input pattern, we ask the LLM to output them in a structured way. Most LLMs are trained on YAML and JSON for structured output. We chose YAML because it is less verbose and therefore consumes fewer tokens than JSON.
One of the challenges we encountered is that while about 90% of the time the LLM response contains correctly formatted parameters, about 10% of the time the LLM gets it wrong and often outputs invalidly formatted data, or worse is not even valid YAML.
These errors are trivial to humans, but can cause the code that parses them to crash. 10% is a high enough number that we can't simply ignore it, so we set out to address it.
The standard way to resolve this issue is to detect it and then re-prompt LLM asking it to correct the error and provide some additional guidance. While this approach works, it adds considerable latency and consumes valuable GPU capacity due to additional LLM calls. To circumvent these limitations, we ended up writing an in-house defensive YAML parser.
Through analysis of various payloads, we identified common mistakes made by LLM and wrote code to appropriately detect and patch these errors before parsing. We've also modified the hints to inject hints for some of these common errors to improve patching accuracy. We were ultimately able to reduce the incidence of these errors to about 0.01%.
We are currently building a unified skills registry for dynamic discovery and invocation of APIs/agents packaged as LLM-friendly skills in our generative AI products.
Capacity and Latency
Capacity and latency are always the primary considerations. Here are some considerations:
- Quality and Latency: Technologies such as Chain of Thoughts (CoT) are very effective for improving quality and reducing illusions, but they need to never be token, thus increasing latency.
- Throughput and Latency: When running large generative models, it is common to see TimeToFirstToken (TTFT) and TimeBetweenTokens (TBT) increase as utilization increases.
- Cost: GPU clusters are not easily available and costly. We even had to set a schedule for testing the product in the beginning because too many tokens would be consumed.
- End-to-end streaming: A complete answer can take several minutes to complete, so we stream all requests to reduce perceived latency. What's more, we actually do streaming end-to-end in the pipeline. For example, the LLM response that determines which APIs to call is parsed incrementally, and once the parameters are ready, the API call is triggered without waiting for the full LLM response. The final comprehensive response is also transmitted all the way to the client using real-time messaging infrastructure and incrementally processed based on "responsible AI" and more.
- Asynchronous non-blocking pipeline: Because LLM calls can take a long time to process, we optimized service throughput by building a fully asynchronous non-blocking pipeline that does not waste resources by blocking I/O threads. Interested readers can read the original text of the blog to learn more about the research content. Original link: https://www.linkedin.com/blog/engineering/generative-ai/musings-on-building-a-generative-ai-product
The above is the detailed content of The error rate dropped from 10% to 0.01%, and LinkedIn fully shared its experience in implementing LLM applications. For more information, please follow other related articles on the PHP Chinese website!