templates: add RAG template for Intel Xeon Scalable Processors (langchain-ai#18424)

**Description:**
This template uses Chroma and TGI (Text Generation Inference) to run RAG on Intel Xeon Scalable Processors. It serves as a demonstration for users, illustrating how to deploy the RAG service on Intel Xeon Scalable Processors and showcasing the resulting performance enhancements.

**Issue:**
None

**Dependencies:**
The template contains the Poetry project requirements needed to run it.
CPU TGI batching is a work in progress (WIP).

**Twitter handle:**
None

---------

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
3 people authored Mar 29, 2024
1 parent d4673a3 commit 0175906
Showing 9 changed files with 6,027 additions and 0 deletions.
97 changes: 97 additions & 0 deletions templates/intel-rag-xeon/README.md
@@ -0,0 +1,97 @@
# RAG example on Intel Xeon
This template performs RAG using Chroma and Text Generation Inference on Intel® Xeon® Scalable Processors.
Intel® Xeon® Scalable processors feature built-in accelerators for more performance per core and unmatched AI performance, with advanced security technologies for the most in-demand workload requirements, all while offering the greatest cloud choice and application portability. For details, see [Intel® Xeon® Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html).

## Environment Setup
To use [🤗 text-generation-inference](https://github.com/huggingface/text-generation-inference) on Intel® Xeon® Scalable Processors, please follow these steps:


### Launch a local server instance on an Intel Xeon server
```bash
model=Intel/neural-chat-7b-v3-3
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
```

For gated models such as `LLAMA-2`, you will have to pass `-e HUGGING_FACE_HUB_TOKEN=<token>` to the `docker run` command above with a valid Hugging Face Hub read token.

Please follow these instructions on [Hugging Face tokens](https://huggingface.co/docs/hub/security-tokens) to get an access token, and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with the token.

```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
```
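For example, a gated model could be served like this (an illustrative sketch: the model id is only an example, and reusing the exported `HUGGINGFACEHUB_API_TOKEN` as the Hub token is one option):

```bash
# Illustrative only: serve a gated model, reusing the token exported above
model=meta-llama/Llama-2-7b-chat-hf  # example gated model id
volume=$PWD/data

docker run --shm-size 1g -p 8080:80 -v $volume:/data \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
  ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
```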

Send a request to check if the endpoint is working:

```bash
curl localhost:8080/generate -X POST -d '{"inputs":"Which NFL team won the Super Bowl in the 2010 season?","parameters":{"max_new_tokens":128, "do_sample": true}}' -H 'Content-Type: application/json'
```
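The same check can be done from Python if that is more convenient (a minimal sketch; it assumes the `requests` package is installed):

```python
# Minimal sketch: query the TGI /generate endpoint from Python.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Which NFL team won the Super Bowl in the 2010 season?",
        "parameters": {"max_new_tokens": 128, "do_sample": True},
    },
    timeout=60,
)
print(resp.json())
```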

For more details, please refer to [text-generation-inference](https://github.com/huggingface/text-generation-inference).


## Populating with data

If you want to populate the DB with some example data, you can run the commands below:
```shell
poetry install
poetry run python ingest.py
```

The script processes and stores sections from the Edgar 10-K filing for Nike (`nke-10k-2023.pdf`) in a Chroma database.

## Usage

To use this package, you should first have the LangChain CLI installed:

```shell
pip install -U langchain-cli
```

To create a new LangChain project and install this as the only package, you can do:

```shell
langchain app new my-app --package intel-rag-xeon
```

If you want to add this to an existing project, you can just run:

```shell
langchain app add intel-rag-xeon
```

And add the following code to your `server.py` file:
```python
from intel_rag_xeon import chain as xeon_rag_chain

add_routes(app, xeon_rag_chain, path="/intel-rag-xeon")
```

(Optional) Let's now configure LangSmith. LangSmith will help us trace, monitor, and debug LangChain applications. LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/). If you don't have access, you can skip this section.

```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```

If you are inside this directory, then you can spin up a LangServe instance directly by running:

```shell
langchain serve
```

This will start the FastAPI app with a server running locally at
[http://localhost:8000](http://localhost:8000).

We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs).
We can access the playground at [http://127.0.0.1:8000/intel-rag-xeon/playground](http://127.0.0.1:8000/intel-rag-xeon/playground).

We can access the template from code with:

```python
from langserve.client import RemoteRunnable

runnable = RemoteRunnable("http://localhost:8000/intel-rag-xeon")
```
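
Once connected, the runnable behaves like any other chain. For example (illustrative questions, assuming the example Nike 10-K data has been ingested):

```python
from langserve.client import RemoteRunnable

runnable = RemoteRunnable("http://localhost:8000/intel-rag-xeon")

# Ask a question answerable from the ingested Nike 10-K filing.
print(runnable.invoke("What was Nike's revenue in 2023?"))

# Streaming should also work, since the chain ends with a string output parser.
for chunk in runnable.stream("How many employees work at Nike?"):
    print(chunk, end="", flush=True)
```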
Binary file added templates/intel-rag-xeon/data/nke-10k-2023.pdf
Binary file not shown.
49 changes: 49 additions & 0 deletions templates/intel-rag-xeon/ingest.py
@@ -0,0 +1,49 @@
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document


def ingest_documents():
    """
    Ingest PDFs from the data/ directory, which contains
    Edgar 10-K filing data for Nike, into Chroma.
    """
    # Load the first (and only) PDF in the data directory
    data_path = "data/"
    doc = [os.path.join(data_path, file) for file in os.listdir(data_path)][0]

    print("Parsing 10k filing doc for NIKE", doc)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500, chunk_overlap=100, add_start_index=True
    )
    loader = UnstructuredFileLoader(doc, mode="single", strategy="fast")
    chunks = loader.load_and_split(text_splitter)

    print("Done preprocessing. Created", len(chunks), "chunks of the original pdf")

    # Create vectorstore
    embedder = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )

    documents = []
    for chunk in chunks:
        doc = Document(page_content=chunk.page_content, metadata=chunk.metadata)
        documents.append(doc)

    # Add to vectorDB
    _ = Chroma.from_documents(
        documents=documents,
        collection_name="xeon-rag",
        embedding=embedder,
        persist_directory="/tmp/xeon_rag_db",
    )


if __name__ == "__main__":
    ingest_documents()
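
A quick way to verify the ingestion is to reopen the persisted collection (a minimal sketch, assuming the same embedding model, collection name, and persist directory as above):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Reopen the persisted collection written by ingest.py.
embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma(
    persist_directory="/tmp/xeon_rag_db",
    collection_name="xeon-rag",
    embedding_function=embedder,
)
print(db.similarity_search("What was Nike's revenue in 2023?", k=1)[0].page_content)
```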
62 changes: 62 additions & 0 deletions templates/intel-rag-xeon/intel_rag_xeon.ipynb
@@ -0,0 +1,62 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "681a5d1e",
"metadata": {},
"source": [
"## Connect to RAG App\n",
"\n",
"Assuming you are already running this server:\n",
"```bash\n",
"langserve start\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d774be2a",
"metadata": {},
"outputs": [],
"source": [
"from langserve.client import RemoteRunnable\n",
"\n",
"gaudi_rag = RemoteRunnable(\"http://localhost:8000/intel-rag-xeon\")\n",
"\n",
"print(gaudi_rag.invoke(\"What was Nike's revenue in 2023?\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "07ae0005",
"metadata": {},
"outputs": [],
"source": [
"print(gaudi_rag.invoke(\"How many employees work at Nike?\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
3 changes: 3 additions & 0 deletions templates/intel-rag-xeon/intel_rag_xeon/__init__.py
@@ -0,0 +1,3 @@
from intel_rag_xeon.chain import chain

__all__ = ["chain"]
72 changes: 72 additions & 0 deletions templates/intel-rag-xeon/intel_rag_xeon/chain.py
@@ -0,0 +1,72 @@
from langchain.callbacks import streaming_stdout
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceEndpoint
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.vectorstores import VectorStoreRetriever


# Make this look better in the docs.
class Question(BaseModel):
    __root__: str


# Init Embeddings
embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Open the Chroma collection persisted by ingest.py
knowledge_base = Chroma(
    persist_directory="/tmp/xeon_rag_db",
    embedding_function=embedder,
    collection_name="xeon-rag",
)

# Import-time sanity check: run a single similarity search against the store
query = "What was Nike's revenue in 2023?"
docs = knowledge_base.similarity_search(query)
print(docs[0].page_content)

retriever = VectorStoreRetriever(
    vectorstore=knowledge_base, search_type="mmr", search_kwargs={"k": 1, "fetch_k": 5}
)

# Define our prompt
template = """
Use the following pieces of context from the retrieved
dataset to answer the question. Do not make up an answer if there is no
context provided to help answer it.
Context:
---------
{context}
---------
Question: {question}
---------
Answer:
"""


prompt = ChatPromptTemplate.from_template(template)


ENDPOINT_URL = "http://localhost:8080"
callbacks = [streaming_stdout.StreamingStdOutCallbackHandler()]
model = HuggingFaceEndpoint(
    endpoint_url=ENDPOINT_URL,
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
    streaming=True,
)

# RAG Chain
chain = (
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    | prompt
    | model
    | StrOutputParser()
).with_types(input_type=Question)
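
With the database populated and the TGI endpoint running, the chain can also be invoked directly, without going through LangServe (a minimal sketch, assuming the package is installed and `ingest.py` has been run):

```python
# Minimal local sketch: import the chain exported by the package and query it.
# Assumes ingest.py has populated /tmp/xeon_rag_db and TGI is serving on port 8080.
from intel_rag_xeon import chain

print(chain.invoke("What was Nike's revenue in 2023?"))
```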