templates: add RAG template for Intel Xeon Scalable Processors (langchain-ai#18424)

**Description:**
This template uses Chroma and TGI (Text Generation Inference) to run RAG on Intel Xeon Scalable Processors. It serves as a demonstration for users, illustrating how to deploy the RAG service on Intel Xeon Scalable Processors and showcasing the resulting performance enhancements.

**Issue:**
None

**Dependencies:**
The template contains the Poetry project requirements needed to run it.
CPU TGI batching is a work in progress (WIP).

**Twitter handle:**
None

---------

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
3 people authored Mar 29, 2024
1 parent d4673a3 commit 0175906
Showing 9 changed files with 6,027 additions and 0 deletions.
97 changes: 97 additions & 0 deletions templates/intel-rag-xeon/README.md
@@ -0,0 +1,97 @@
# RAG example on Intel Xeon
This template performs RAG using Chroma and Text Generation Inference on Intel® Xeon® Scalable Processors.
Intel® Xeon® Scalable processors feature built-in accelerators for more performance per core and unmatched AI performance, with advanced security technologies for the most in-demand workload requirements, all while offering the greatest cloud choice and application portability. For details, see [Intel® Xeon® Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html).

## Environment Setup
To use [🤗 text-generation-inference](https://github.com/huggingface/text-generation-inference) on Intel® Xeon® Scalable Processors, please follow these steps:


### Launch a local server instance on an Intel Xeon server
```bash
model=Intel/neural-chat-7b-v3-3
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
```

For gated models such as `LLAMA-2`, you will have to pass `-e HUGGING_FACE_HUB_TOKEN=<token>` to the `docker run` command above with a valid Hugging Face Hub read token.

Please follow these instructions on [Hugging Face tokens](https://huggingface.co/docs/hub/security-tokens) to get an access token, and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with the token.

```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
```
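For example, a gated model could be served like this (an illustrative sketch: the model id is only an example, and reusing the exported `HUGGINGFACEHUB_API_TOKEN` as the Hub token is one option):

```bash
# Illustrative only: serve a gated model, reusing the token exported above
model=meta-llama/Llama-2-7b-chat-hf  # example gated model id
volume=$PWD/data

docker run --shm-size 1g -p 8080:80 -v $volume:/data \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
  ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
```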

Send a request to check if the endpoint is working:

```bash
curl localhost:8080/generate -X POST -d '{"inputs":"Which NFL team won the Super Bowl in the 2010 season?","parameters":{"max_new_tokens":128, "do_sample": true}}' -H 'Content-Type: application/json'
```
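The same check can be done from Python if that is more convenient (a minimal sketch; it assumes the `requests` package is installed):

```python
# Minimal sketch: query the TGI /generate endpoint from Python.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Which NFL team won the Super Bowl in the 2010 season?",
        "parameters": {"max_new_tokens": 128, "do_sample": True},
    },
    timeout=60,
)
print(resp.json())
```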

For more details, please refer to [text-generation-inference](https://github.com/huggingface/text-generation-inference).


## Populating with data

If you want to populate the DB with some example data, you can run the commands below:
```shell
poetry install
poetry run python ingest.py
```

The script processes and stores sections from the Edgar 10-K filing for Nike (`nke-10k-2023.pdf`) in a Chroma database.

## Usage

To use this package, you should first have the LangChain CLI installed:

```shell
pip install -U langchain-cli
```

To create a new LangChain project and install this as the only package, you can do:

```shell
langchain app new my-app --package intel-rag-xeon
```

If you want to add this to an existing project, you can just run:

```shell
langchain app add intel-rag-xeon
```

And add the following code to your `server.py` file:
```python
from intel_rag_xeon import chain as xeon_rag_chain

add_routes(app, xeon_rag_chain, path="/intel-rag-xeon")
```

(Optional) Let's now configure LangSmith. LangSmith will help us trace, monitor, and debug LangChain applications. LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/). If you don't have access, you can skip this section.

```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```

If you are inside this directory, then you can spin up a LangServe instance directly by running:

```shell
langchain serve
```

This will start the FastAPI app with a server running locally at
[http://localhost:8000](http://localhost:8000).

We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs).
We can access the playground at [http://127.0.0.1:8000/intel-rag-xeon/playground](http://127.0.0.1:8000/intel-rag-xeon/playground).

We can access the template from code with:

```python
from langserve.client import RemoteRunnable

runnable = RemoteRunnable("http://localhost:8000/intel-rag-xeon")
```
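
Once connected, the runnable behaves like any other chain. For example (illustrative questions, assuming the example Nike 10-K data has been ingested):

```python
from langserve.client import RemoteRunnable

runnable = RemoteRunnable("http://localhost:8000/intel-rag-xeon")

# Ask a question answerable from the ingested Nike 10-K filing.
print(runnable.invoke("What was Nike's revenue in 2023?"))

# Streaming should also work, since the chain ends with a string output parser.
for chunk in runnable.stream("How many employees work at Nike?"):
    print(chunk, end="", flush=True)
```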
Binary file added templates/intel-rag-xeon/data/nke-10k-2023.pdf
Binary file not shown.
49 changes: 49 additions & 0 deletions templates/intel-rag-xeon/ingest.py
@@ -0,0 +1,49 @@
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document


def ingest_documents():
    """
    Ingest PDFs from the data/ directory, which contains
    Edgar 10-K filing data for Nike, into Chroma.
    """
    # Load the first (and only) PDF in the data directory
    data_path = "data/"
    doc = [os.path.join(data_path, file) for file in os.listdir(data_path)][0]

    print("Parsing 10k filing doc for NIKE", doc)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500, chunk_overlap=100, add_start_index=True
    )
    loader = UnstructuredFileLoader(doc, mode="single", strategy="fast")
    chunks = loader.load_and_split(text_splitter)

    print("Done preprocessing. Created", len(chunks), "chunks of the original pdf")

    # Create vectorstore
    embedder = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )

    documents = []
    for chunk in chunks:
        doc = Document(page_content=chunk.page_content, metadata=chunk.metadata)
        documents.append(doc)

    # Add to vectorDB
    _ = Chroma.from_documents(
        documents=documents,
        collection_name="xeon-rag",
        embedding=embedder,
        persist_directory="/tmp/xeon_rag_db",
    )


if __name__ == "__main__":
    ingest_documents()
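
A quick way to verify the ingestion is to reopen the persisted collection (a minimal sketch, assuming the same embedding model, collection name, and persist directory as above):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Reopen the persisted collection written by ingest.py.
embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma(
    persist_directory="/tmp/xeon_rag_db",
    collection_name="xeon-rag",
    embedding_function=embedder,
)
print(db.similarity_search("What was Nike's revenue in 2023?", k=1)[0].page_content)
```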
62 changes: 62 additions & 0 deletions templates/intel-rag-xeon/intel_rag_xeon.ipynb
@@ -0,0 +1,62 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "681a5d1e",
"metadata": {},
"source": [
"## Connect to RAG App\n",
"\n",
"Assuming you are already running this server:\n",
"```bash\n",
"langserve start\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d774be2a",
"metadata": {},
"outputs": [],
"source": [
"from langserve.client import RemoteRunnable\n",
"\n",
"gaudi_rag = RemoteRunnable(\"http://localhost:8000/intel-rag-xeon\")\n",
"\n",
"print(gaudi_rag.invoke(\"What was Nike's revenue in 2023?\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "07ae0005",
"metadata": {},
"outputs": [],
"source": [
"print(gaudi_rag.invoke(\"How many employees work at Nike?\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
3 changes: 3 additions & 0 deletions templates/intel-rag-xeon/intel_rag_xeon/__init__.py
@@ -0,0 +1,3 @@
from intel_rag_xeon.chain import chain

__all__ = ["chain"]
72 changes: 72 additions & 0 deletions templates/intel-rag-xeon/intel_rag_xeon/chain.py
@@ -0,0 +1,72 @@
from langchain.callbacks import streaming_stdout
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceEndpoint
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.vectorstores import VectorStoreRetriever


# Make this look better in the docs.
class Question(BaseModel):
    __root__: str


# Init Embeddings
embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Open the Chroma collection persisted by ingest.py
knowledge_base = Chroma(
    persist_directory="/tmp/xeon_rag_db",
    embedding_function=embedder,
    collection_name="xeon-rag",
)

# Import-time sanity check: run a single similarity search against the store
query = "What was Nike's revenue in 2023?"
docs = knowledge_base.similarity_search(query)
print(docs[0].page_content)

retriever = VectorStoreRetriever(
    vectorstore=knowledge_base, search_type="mmr", search_kwargs={"k": 1, "fetch_k": 5}
)

# Define our prompt
template = """
Use the following pieces of context from the retrieved
dataset to answer the question. Do not make up an answer if there is no
context provided to help answer it.
Context:
---------
{context}
---------
Question: {question}
---------
Answer:
"""


prompt = ChatPromptTemplate.from_template(template)


ENDPOINT_URL = "http://localhost:8080"
callbacks = [streaming_stdout.StreamingStdOutCallbackHandler()]
model = HuggingFaceEndpoint(
    endpoint_url=ENDPOINT_URL,
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
    streaming=True,
)

# RAG Chain
chain = (
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    | prompt
    | model
    | StrOutputParser()
).with_types(input_type=Question)
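
With the database populated and the TGI endpoint running, the chain can also be invoked directly, without going through LangServe (a minimal sketch, assuming the package is installed and `ingest.py` has been run):

```python
# Minimal local sketch: import the chain exported by the package and query it.
# Assumes ingest.py has populated /tmp/xeon_rag_db and TGI is serving on port 8080.
from intel_rag_xeon import chain

print(chain.invoke("What was Nike's revenue in 2023?"))
```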