🐦 Twitter Post Details

@cwolferesearch

One of the best ways to reduce hallucinations with LLMs is by retrieving useful, factual information and injecting it into the LLM’s prompt as added context. Although this might sound complicated, it’s actually quite easy to implement with standard vector search functionality…

Why do we need this? All LLMs have a fixed context length, so the amount of information we can include in a prompt is inherently limited. As such, we need to be selective about the context that we provide to our model. If we want to provide useful context that can reduce hallucinations and improve the model’s output, one of the best approaches is to retrieve relevant information from an external (vector) database.

Retrieval framework for LLMs. Assuming that we have a lot of relevant textual data that can be used by an LLM, we can’t just inject all of this data into the model’s prompt every time we perform inference. Rather, we need to do the following:

1. Break our data into textual chunks
2. Vectorize each of the textual chunks
3. Store these vectors (with their data) in a vector database
4. Find relevant data at inference time using vector search
5. Add relevant data to our prompt to provide more context to the LLM

We use the same embedding model to vectorize these chunks and to generate the query vectors used for search (a minimal end-to-end sketch follows after this post).

Storing data in a vector db. The first step in the above framework is to chunk our data. Typically, we use chunks of ~200 tokens, but the optimal chunk size is a hyperparameter that can change depending on the application. Then, we use an embedding model to vectorize each of these chunks and store them, along with their text data, in a vector database (e.g., Redis, Weaviate, Pinecone, Qdrant, etc.).

Retrieving relevant context. When we want to retrieve relevant textual data from our vector db, we just i) create a query embedding based on our prompt (possibly including the chat history) and ii) run a vector search for relevant documents (see the second sketch below). This way, we can use semantic search to identify portions of data that are relevant to include as context within the LLM’s prompt.

Creating the query embedding. There are many different ways to create a query embedding for searching our vector db. The simplest approach is to truncate our chat history or prompt and pass it directly into the embedding model. If this is too long, we can ask the LLM to summarize the chat history or prompt before embedding it, or even to convert it into a list of search keywords (see the third sketch below).

Picking an embedding model. To make sure this works well, we need a good embedding model that captures the semantic similarity between our queries and textual chunks. There are a variety of good embedding models publicly available via SentenceTransformers and HuggingFace. To find one that works for you, I’d recommend taking a look at the Massive Text Embedding Benchmark (hosted on HuggingFace).

The result. We can use the approach described above to power retrieval-augmented generation (RAG), which is one of the best ways to reduce hallucinations and improve the output of LLMs. Given that this approach can be implemented without significant effort via tools like Pinecone / Weaviate and HuggingFace / SentenceTransformers, it is undoubtedly one of the most useful practical tools for building with LLMs.
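To make steps 1-3 concrete, here is a minimal Python sketch using the sentence-transformers package mentioned in the post. The model name, the word-based chunker, and the NumPy matrix standing in for a real vector database are all illustrative assumptions, not choices made in the post itself.

import numpy as np
from sentence_transformers import SentenceTransformer

# Any good embedding model works here; all-MiniLM-L6-v2 is just a small,
# popular SentenceTransformers default chosen for the sketch.
model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text, size=200):
    # Step 1: break the data into ~200-unit chunks. Splitting on words is a
    # crude stand-in for a real tokenizer; tune `size` per application.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

documents = ["...your corpus here..."]  # placeholder data
chunks = [c for d in documents for c in chunk_text(d)]

# Steps 2-3: vectorize every chunk and "store" the vectors next to their text.
# A NumPy matrix stands in for a real vector database (Redis, Weaviate, etc.).
vectors = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, dim)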
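Continuing the same sketch, steps 4-5 reduce to embedding the query with the same model and taking the top-scoring chunks by cosine similarity. The prompt template at the end is a made-up example of injecting the retrieved context.

def retrieve(query, k=3):
    # Step 4: embed the query with the SAME model used for the chunks, then
    # rank chunks by cosine similarity (dot product of unit-normalized vectors).
    q = model.encode([query], normalize_embeddings=True)
    scores = (vectors @ q.T).ravel()
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

# Step 5: inject the retrieved chunks into the LLM's prompt as added context.
context = "\n\n".join(retrieve("What is retrieval-augmented generation?"))
prompt = f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: ..."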
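Finally, a sketch of the query-embedding options described above. Here `summarize` is a hypothetical callable wrapping whatever LLM you use; it is not part of any library named in the post.

def embed_query(chat_history, model, summarize=None, max_words=200):
    words = chat_history.split()
    if len(words) <= max_words:
        text = chat_history  # short enough: embed the history directly
    elif summarize is not None:
        # Ask the LLM to compress the history into a summary or search keywords.
        text = summarize("Condense this conversation into search keywords:\n" + chat_history)
    else:
        text = " ".join(words[-max_words:])  # fallback: truncate to the most recent words
    return model.encode([text], normalize_embeddings=True)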

🔧 Raw API Response

{
  "user": {
    "created_at": "2021-08-11T22:32:35.000Z",
    "default_profile_image": false,
    "description": "Director of AI @RebuyEngine • PhD @optimalab1 • I make AI understandable",
    "fast_followers_count": 0,
    "favourites_count": 4508,
    "followers_count": 12600,
    "friends_count": 443,
    "has_custom_timelines": true,
    "is_translator": false,
    "listed_count": 419,
    "location": "",
    "media_count": 564,
    "name": "Cameron R. Wolfe, Ph.D.",
    "normal_followers_count": 12600,
    "possibly_sensitive": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/1425585940542763010/1670938083",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1425586035455582208/ikonUjoO_normal.jpg",
    "screen_name": "cwolferesearch",
    "statuses_count": 2136,
    "translator_type": "none",
    "url": "https://t.co/j75fAdLpp8",
    "verified": false,
    "withheld_in_countries": [],
    "id_str": "1425585940542763010"
  },
  "id": "1695160469588177354",
  "conversation_id": "1695160469588177354",
  "full_text": "One of the best ways to reduce hallucinations with LLMs is by retrieving useful, factual information and injecting it into the LLM’s prompt as added context. Although this might sound complicated, it’s actually quite easy to implement with standard vector search functionality…\n\nWhy do we need this? All LLMs have a fixed context length. So, the amount of information we can include in a prompt is limited by nature! As such, we need to be selective about the context that we provide to our model. If we want to provide useful context that can reduce hallucinations and improve the model’s output, one of the best approaches is to retrieve relevant information from an external (vector) database.\n\nRetrieval framework for LLMs. Assuming that we have a lot of relevant textual data that can be used by an LLM, we can’t just inject all of this data into the model’s prompt every time that we perform inference. Rather, we need to do the following:\n\n1. Break our data into textual chunks\n2. Vectorize each of the textual chunks\n3. Store these vectors (with their data) in a vector database\n4. Find relevant data at inference time using vector search\n5. Add relevant data to our prompt to provide more context to the LLM\n\nWe will use the same embedding model to vectorize these chunks and to generate query vectors that can be used for search.\n\nStoring data in a vector db. The first step in the above framework is to chunk our data. Typically, we will use chunks of ~200 tokens. However, the optimal chunk size is a hyperparameter that can change depending on the application. Then, we use an embedding model to vectorize each of these chunks, and we can store them, along with their text data, in a vector database (e.g., Redis, Weaviate, Pinecone, Qdrant, etc.).\n\nRetrieving relevant context. When we want to retrieve relevant textual data from our vector db, we should just i) create a query embedding based on our prompt (possibly including the chat history) and ii) run a vector search for relevant documents. This way, we can use semantic search to identify portions of data that are relevant to include as context within the LLM’s prompt.\n\nCreating the query embedding. There are a ton of different ways we can create query embedding for searching our vector db. The simplest approach would be to truncate our chat history or prompt and pass this directly into the embedding model. But, if this is too long, we could ask the LLM to summarize our chat history or prompt before embedding it, or even to convert the chat history or prompt into a list of search keywords.\n\nPicking an embedding model. To make sure this works well, we need a good embedding model that captures the semantic similarities between our queries and textual chunks. There are a variety of good embedding models publicly available via SentenceTransformers and HuggingFace. To find one that works for you, I’d recommend taking a look at the Massive Text Embedding Benchmark (hosted on HuggingFace).\n\nThe result. We can use the approach described above to power retrieval-augmented generation (RAG), which is one of the best ways to reduce hallucinations and improve the output of LLMs. Given that this approach can be implemented without significant effort via tools like Pinecone / Weaviate and HuggingFace / SentenceTransformers, it is undoubtedly one of the most useful practical tools for building with LLMs.",
  "reply_count": 34,
  "retweet_count": 183,
  "favorite_count": 1297,
  "hashtags": [],
  "symbols": [],
  "user_mentions": [],
  "urls": [],
  "media": [
    {
      "media_url": "https://pbs.twimg.com/media/F4ZrZBGWEAA7nQv.jpg",
      "type": "photo"
    }
  ],
  "url": "https://twitter.com/cwolferesearch/status/1695160469588177354",
  "created_at": "2023-08-25T19:45:25.000Z",
  "#sort_index": "1695160469588177354",
  "view_count": 324574,
  "quote_count": 14,
  "is_quote_tweet": false,
  "is_retweet": false,
  "is_pinned": false,
  "is_truncated": true,
  "startUrl": "https://twitter.com/cwolferesearch/status/1695160469588177354"
}