Building and Running a RAG model completely Offline using your Gaming GPU

In this guide we will configure a RAG system that reads information from a document and answers our queries, built with LangChain, FAISS, LocalAI and Streamlit.


In this blog, we will create our own local RAG system that can answer questions using knowledge from a document. We will write the code in Python using the LangChain framework.

FAISS is a library that finds similar items quickly in large piles of data. Think of it as a very fast search engine for things that are alike.

The documents themselves are turned into special numbers called vectors by an embedding model, and FAISS indexes and searches those vectors. An embedding model is a type of machine learning model that converts input data, such as words, sentences, or documents, into dense numerical vectors. These vectors are like codes that capture what a piece of text is about.

When you want to find information similar to what you already have, FAISS compares the vector of your search query against the vectors of the document chunks and returns the closest matches.

We will use FAISS together with “all-MiniLM-L6-v2” as our embedding model to vectorize our document.
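As a rough, standalone illustration (not part of the app we will build), this is what embedding a few texts and searching them looks like when using sentence-transformers and FAISS directly; the example texts and query are placeholders:

# Standalone sketch: embed a few texts and search them with FAISS (illustrative only)
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [
    "FAISS indexes vectors for fast similarity search.",
    "Streamlit turns Python scripts into web apps.",
    "LocalAI serves models behind an OpenAI-compatible API.",
]
vectors = model.encode(texts)                     # each text becomes a 384-dimensional vector

index = faiss.IndexFlatL2(vectors.shape[1])       # exact L2-distance index
index.add(vectors)

query_vector = model.encode(["How do I search vectors quickly?"])
distances, ids = index.search(query_vector, k=2)  # the two closest texts
print([texts[i] for i in ids[0]])

Our application will not call FAISS this directly; LangChain wraps these steps for us, as we will see later.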

LocalAI is like a free and open-source version of OpenAI. It exposes the same API as OpenAI’s system but runs on your own computer instead of in the cloud. We can describe a model in a YAML file and run that model locally for inference.

We will use LocalAI to run the quantized “stablelm-2-zephyr-1_6b” model with CUDA enabled.
These are the specifications of the device I will be running it on:

Processor: AMD Ryzen 5 3600 6-Core Processor (12 CPUs), ~3.6GHz
RAM: 16384MB
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU Dedicated Memory: 3943 MB

Streamlit is a nifty tool that makes building and sharing data apps a breeze. With Streamlit, you can create interactive web applications for data science and machine learning projects using simple Python scripts.
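As a quick, hypothetical taste of how little code a Streamlit app needs (the file name hello.py and its contents are made up for this example), the following renders a text box in the browser and echoes whatever you type:

# hello.py - a minimal Streamlit sketch; run with: streamlit run hello.py
import streamlit as st

name = st.text_input("Your name")   # renders a text box in the browser
if name:
    st.write(f"Hello, {name}!")     # shows output once something is typed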

Before starting on the code, let’s visualize how the system fits together: the document is split and embedded into a FAISS index, a Streamlit front end takes the user’s query, FAISS retrieves the most similar chunks, and the LocalAI-hosted LLM uses them to write the final answer.

We will do the following, step by step:

1. Use LocalAI to host our LLM locally with GPU support.

2. Use FAISS and an embedding model to vectorize our document and store it.

3. Create a RAG chain that takes a query from the user, uses FAISS similarity_search to find relevant chunks, sends them to the LLM to compose the answer, and returns the result to the user.

4. Use Streamlit to create a web app that takes the user’s query and shows the response in the browser.

We are going to use docker-compose to simplify deployment. Let’s start by defining the model:

  1. Create a folder called models.
  2. cd models
  3. Create stablelm-1.6b.yaml and insert the following code. It defines the specification of our model, which LocalAI will use to download and load it:
name: stablelm-1.6
context_size: 2048   # context window, in tokens
f16: true            # load weights in 16-bit precision
gpu_layers: 90       # number of layers to offload to the GPU
mmap: true           # memory-map the model file
trimsuffix:          # strip these suffixes from the model output
- "\n"
parameters:
  model: huggingface://brittlewis12/stablelm-2-zephyr-1_6b-GGUF/stablelm-2-zephyr-1_6b.Q8_0.gguf
  temperature: 0.2
  top_k: 40
  top_p: 0.95
  seed: -1
mirostat: 2          # mirostat sampling (version 2)
mirostat_eta: 1.0
mirostat_tau: 1.0
template:
  chat: &template |-
    Instruct: {{.Input}}
    Output:
  completion: *template

usage: |
      To use this model, interact with the API (in another terminal) with curl for instance:
      curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
          "model": "stablelm-1.6",
          "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
      }'
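With this template, a user message such as “How are you doing?” is rendered for the model as the two lines “Instruct: How are you doing?” and “Output:”, and whatever the model writes after “Output:” becomes the reply.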

Now we need to create a docker-compose.yml file. Make sure you have an NVIDIA GPU with CUDA support before running this docker-compose file, and don’t forget to install the CUDA drivers for your GPU.

Go to your root folder and create docker-compose.yml and insert the following code:

version: '3.6'

services:
  api:
    image: localai/localai:v2.13.0-cublas-cuda12-core   # CUDA 12 build of LocalAI
    ports:
      - 8080:8080
    environment:
      - MODELS_PATH=/models
      - DEBUG=true
    volumes:
      - ./models:/models:cached          # mount our model definition into the container
    command:
      - stablelm-1.6                     # preload the model defined above
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia             # reserve one NVIDIA GPU for the container
              count: 1
              capabilities: [gpu]

Run:

docker-compose up --build

LocalAI will pull the image and download the model on first start, which can take a while; once it is ready, the API listens on port 8080.

If you don’t have a GPU, you can choose another Docker image that suits your setup from here: https://localai.io/docs/getting-started/run-other-models/

For example, localai/localai:v2.14.0-ffmpeg-core: use it as the image in docker-compose.yml and remove the deploy section.
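Once the container is up, you can also sanity-check the endpoint from Python; this is simply the curl command from the model’s usage note translated into a requests call:

# Quick smoke test for the LocalAI endpoint (mirrors the curl example above)
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "stablelm-1.6",
        "messages": [{"role": "user", "content": "How are you doing?"}],
        "temperature": 0.1,
    },
    timeout=600,  # the first request can be slow while the model warms up
)
print(resp.json()["choices"][0]["message"]["content"])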

Now that our LocalAI instance is up and running, we can set up the rest of our application.

Create a file called “document_to_embedding.py”.

Insert the following code. Check the comments for an explanation:

#import Essential dependencies

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings


if __name__=="__main__":
        
        # Path to store faiss vector database
        DB_FAISS_PATH = 'vectorstore/db_faiss'
        
        # loads the pdf 
        loader=PyPDFLoader("./documents/Milan_Mahat.pdf")
        docs=loader.load()
        
        # The text_splitter divides the text into chunks of 1000 characters, with a 200-character
        # overlap between adjacent chunks so that context is preserved across chunk boundaries.
        # This keeps each piece small enough to embed and process within memory limits.
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        splits = text_splitter.split_documents(docs)
        
        # Define the path to the pre-trained model you want to use
        modelPath = "sentence-transformers/all-MiniLM-l6-v2"
        # Create a dictionary with model configuration options; here we use the GPU ('cuda') for computations
        model_kwargs = {'device':'cuda'}
        # Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
        encode_kwargs = {'normalize_embeddings': False}

        # Initialize an instance of HuggingFaceEmbeddings with the specified parameters
        embeddings = HuggingFaceEmbeddings(
            model_name=modelPath,     # Provide the pre-trained model's path
            model_kwargs=model_kwargs, # Pass the model configuration options
            encode_kwargs=encode_kwargs # Pass the encoding options
        )
        try:
            # Embed the split chunks and store the resulting vectors in a FAISS index
            vectorstore = FAISS.from_documents(splits, embeddings)
            vectorstore.save_local(DB_FAISS_PATH)
            print("FAISS index created")

        except Exception as e:
            print("FAISS store failed \n", e)

Note: I am using “cuda” for model_kwargs; use “cpu” if you don’t have CUDA support.

In this case, I’m using my resume as the source of information. Before running the script, make sure you have installed the necessary dependencies (pip install -r requirements.txt) and that your Python version is at least 3.9.7.
Check this link for requirements.txt:
https://github.com/LordMilan/DocumentGPT/blob/main/requirements.txt

After installing dependencies, run the script:

python .\document_to_embedding.py
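This writes the FAISS index to vectorstore/db_faiss. If you want to confirm it worked before building the web app, a quick throwaway check (the query string here is just a placeholder) could look like this:

# Throwaway check: load the saved index and run a test query
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-l6-v2",
                                   model_kwargs={'device': 'cpu'})
db = FAISS.load_local('vectorstore/db_faiss', embeddings, allow_dangerous_deserialization=True)

for doc in db.similarity_search("work experience", k=2):
    print(doc.page_content[:200], "\n---")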

Now create another file called “rag_with_streamlit.py”.

Let’s start with a function that loads the vector store we created above.

#import Essential dependencies
import os
import streamlit as sl
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


#function to load the vectordatabase
def load_knowledgeBase():
        
        # Define the path to the pre-trained model you want to use
        modelPath = "sentence-transformers/all-MiniLM-l6-v2"
        # Create a dictionary with model configuration options; here we use the GPU ('cuda') for computations
        model_kwargs = {'device':'cuda'}
        # Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
        encode_kwargs = {'normalize_embeddings': False}

        # Initialize an instance of HuggingFaceEmbeddings with the specified parameters
        embeddings = HuggingFaceEmbeddings(
            model_name=modelPath,     # Provide the pre-trained model's path
            model_kwargs=model_kwargs, # Pass the model configuration options
            encode_kwargs=encode_kwargs # Pass the encoding options
        )
        DB_FAISS_PATH = 'vectorstore/db_faiss'
        db = FAISS.load_local(DB_FAISS_PATH, embeddings, allow_dangerous_deserialization=True)
        return db
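Note that FAISS.load_local deserializes a pickled file from disk, which is why LangChain requires the explicit allow_dangerous_deserialization=True flag; that is fine here because we are loading an index we created ourselves.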

Now let’s create a function to load the LLM hosted by LocalAI:

#function to load the LLM served by LocalAI via its OpenAI-compatible API
def load_llm():
        os.environ['OPENAI_API_BASE']  = "http://localhost:8080"
        llm = ChatOpenAI(model="stablelm-1.6", api_key="xxxx" )
        return llm

Here, we set ‘OPENAI_API_BASE’ to our LocalAI API endpoint, i.e. http://localhost:8080.

We use ChatOpenAI to connect to LocalAI; since LocalAI is designed to replicate the real OpenAI API, any placeholder API key will do.
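If you want to verify the connection before building the chain, a quick throwaway check (run separately, not part of the final script) could look like this:

# Throwaway connectivity check - not part of rag_with_streamlit.py
llm = load_llm()
print(llm.invoke("Reply with the word 'ready'.").content)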

We will use ChatPromptTemplate to create our prompt for the LLM:

#creating prompt template using langchain
def load_prompt():
        prompt = """ You need to answer the question using information from the context. 
        Context and question of the user is given below: 
        context: {context}
        question: {question}
         """
        prompt = ChatPromptTemplate.from_template(prompt)
        return prompt
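To see how this template gets filled in at query time, you can format it by hand; the context and question below are placeholders:

# Example of filling in the prompt template (placeholder values)
messages = load_prompt().format_messages(
    context="(text of the retrieved document chunks goes here)",
    question="(the user's question goes here)",
)
print(messages[0].content)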

We will use the function below to join all the search results from our FAISS similarity_search:

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

This is our main code, which wires together the RAG chain and the FAISS similarity_search:

if __name__=='__main__':
    knowledgeBase=load_knowledgeBase()
    llm=load_llm()

    prompt=load_prompt()   
    print("Faiss index loaded ")

    sl.header("Welcome to DocumentGPT 📄.")
    sl.write("You can ask me questions related to the document.")


    query=sl.text_input('What do you wanna know?')

    if(query):
        similar_documents=knowledgeBase.similarity_search(query)
        
        modelPath = "sentence-transformers/all-MiniLM-l6-v2"
        # Create a dictionary with model configuration options; here we use the GPU ('cuda') for computations
        model_kwargs = {'device':'cuda'}
        # Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
        encode_kwargs = {'normalize_embeddings': False}

        # Initialize an instance of HuggingFaceEmbeddings with the specified parameters
        embeddings = HuggingFaceEmbeddings(
            model_name=modelPath,     # Provide the pre-trained model's path
            model_kwargs=model_kwargs, # Pass the model configuration options
            encode_kwargs=encode_kwargs # Pass the encoding options
        )

        similar_embeddings=FAISS.from_documents(similar_documents, embeddings)
        print("Similar_embeddings is loaded")

        #creating the chain that integrates the retriever, prompt, LLM and StrOutputParser
        retriever = similar_embeddings.as_retriever()
        rag_chain = (
                {"context": retriever | format_docs, "question": RunnablePassthrough()}
                | prompt
                | llm
                | StrOutputParser()
            )

        response=rag_chain.invoke(query)

        sl.write(response)

In the code above, Streamlit takes the user’s query, which is then used by FAISS for the similarity search.

FAISS returns the most similar chunks, which format_docs joins into a single string; that string is passed into the prompt along with the query, and the prompt is sent to the LLM, which composes the final answer. This is how the RAG chain works here.
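For readers new to LangChain’s pipe syntax, the chain above is roughly equivalent to the following step-by-step version (illustrative only, with query being the user’s question):

# What the chain does, written out step by step (equivalent, illustrative only)
docs = retriever.invoke(query)                                       # 1. fetch similar chunks
context = format_docs(docs)                                          # 2. join them into one string
messages = prompt.format_messages(context=context, question=query)   # 3. fill in the prompt
answer = StrOutputParser().invoke(llm.invoke(messages))              # 4. ask the LLM and parse to text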

Finally, the answer is shown in the Streamlit web app.

Let’s run the script and see it in action:

streamlit run .\rag_with_streamlit.py

User Interface:

It can take a while to load at first, depending on your device’s specifications.

This model does a pretty good job considering it is a quantized version that is only about 1.7 GB in size.

That’s it! If you did everything right, you have successfully set up a RAG system that answers queries using information from your document, and it runs completely offline once the necessary dependencies and models have been downloaded!

Github: https://github.com/LordMilan/DocumentGPT