forked from langchain-ai/langchain
Commit
Add new types of document transformers (langchain-ai#7379)
- Description: Add two new document transformers that translate documents into different languages and convert documents into Q&A format to improve vector search results. Uses OpenAI function calling via the [doctran](https://github.com/psychic-api/doctran/tree/main) library.
- Issue: N/A
- Dependencies: `doctran = "^0.0.5"`
- Tag maintainer: @rlancemartin @eyurtsev @hwchase17
- Twitter handle: @psychicapi or @jfan001

Notes
- Adheres to the `DocumentTransformer` abstraction set by @dev2049 in langchain-ai#3182
- Refactored `EmbeddingsRedundantFilter` to put it in a file under a new `document_transformers` module
- Added basic docs for `DocumentInterrogator`, `DocumentTransformer` as well as the existing `EmbeddingsRedundantFilter`

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
Co-authored-by: Bagatur <baskaryan@gmail.com>
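To make the `DocumentTransformer` abstraction mentioned in the notes more concrete, here is a minimal sketch of the general shape: documents go in, new documents come out, with a sync and an async entry point. It is not LangChain's code. `Document`, `BaseTransformer`, and `WordCountTransformer` are hypothetical stand-ins defined here so the example runs without any dependencies, and the word count replaces the OpenAI call that a Doctran-backed transformer would make.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Document:
    """Hypothetical stand-in for a document: text plus free-form metadata."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class BaseTransformer(ABC):
    """Hypothetical interface: take documents in, return new documents."""

    @abstractmethod
    def transform_documents(self, documents: Sequence[Document]) -> List[Document]:
        ...

    async def atransform_documents(self, documents: Sequence[Document]) -> List[Document]:
        # Default async path: run the sync implementation in a worker thread.
        return await asyncio.to_thread(self.transform_documents, documents)


class WordCountTransformer(BaseTransformer):
    """Toy transformer that records a word count in each document's metadata."""

    def transform_documents(self, documents: Sequence[Document]) -> List[Document]:
        # Build new Document objects instead of mutating the inputs.
        return [
            Document(
                page_content=doc.page_content,
                metadata={**doc.metadata, "word_count": len(doc.page_content.split())},
            )
            for doc in documents
        ]


docs = [Document(page_content="Dear Team, I hope this email finds you well.")]
# Outside a notebook there is no top-level await, so drive the coroutine explicitly.
result = asyncio.run(WordCountTransformer().atransform_documents(docs))
print(result[0].metadata)  # prints {'word_count': 9}
```

Returning new `Document` objects rather than editing the inputs keeps the original documents available for comparison, which is how the notebook in this commit presents its output.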
1 parent f11d845, commit 8effd90
Showing 17 changed files with 985 additions and 6 deletions.
1 change: 1 addition & 0 deletions
...skeleton/docs/modules/data_connection/document_transformers/text_splitters/_category_.yml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,2 @@ | ||
label: 'Text splitters' | ||
position: 0 |
1 change: 1 addition & 0 deletions
docs/extras/modules/data_connection/document_transformers/integrations/_category_.yml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
label: 'Integrations' |
269 changes: 269 additions & 0 deletions
...dules/data_connection/document_transformers/integrations/doctran_extract_properties.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,269 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Doctran Extract Properties\n", | ||
"\n", | ||
"We can extract useful features of documents using the [Doctran](https://github.com/psychic-api/doctran) library, which uses OpenAI's function calling feature to extract specific metadata.\n", | ||
"\n", | ||
"Extracting metadata from documents is helpful for a variety of tasks, including:\n", | ||
"* Classification: Classify documents into different categories\n", | ||
"* Data mining: Extract structured data that can be used for data analysis\n", | ||
"* Style transfer: Change the way text is written to more closely match expected user input, improving vector search results" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"! pip install doctran" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"scrolled": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import json\n", | ||
"from langchain.schema import Document\n", | ||
"from langchain.document_transformers import DoctranPropertyExtractor" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"True" | ||
] | ||
}, | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"from dotenv import load_dotenv\n", | ||
"\n", | ||
"load_dotenv()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Input\n", | ||
"This is the document we'll extract properties from." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"[Generated with ChatGPT]\n", | ||
"\n", | ||
"Confidential Document - For Internal Use Only\n", | ||
"\n", | ||
"Date: July 1, 2023\n", | ||
"\n", | ||
"Subject: Updates and Discussions on Various Topics\n", | ||
"\n", | ||
"Dear Team,\n", | ||
"\n", | ||
"I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.\n", | ||
"\n", | ||
"Security and Privacy Measures\n", | ||
"As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.\n", | ||
"\n", | ||
"HR Updates and Employee Benefits\n", | ||
"Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).\n", | ||
"\n", | ||
"Marketing Initiatives and Campaigns\n", | ||
"Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.\n", | ||
"\n", | ||
"Research and Development Projects\n", | ||
"In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.\n", | ||
"\n", | ||
"Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.\n", | ||
"\n", | ||
"Thank you for your attention, and let's continue to work together to achieve our goals.\n", | ||
"\n", | ||
"Best regards,\n", | ||
"\n", | ||
"Jason Fan\n", | ||
"Cofounder & CEO\n", | ||
"Psychic\n", | ||
"jason@psychic.dev\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"sample_text = \"\"\"[Generated with ChatGPT]\n", | ||
"\n", | ||
"Confidential Document - For Internal Use Only\n", | ||
"\n", | ||
"Date: July 1, 2023\n", | ||
"\n", | ||
"Subject: Updates and Discussions on Various Topics\n", | ||
"\n", | ||
"Dear Team,\n", | ||
"\n", | ||
"I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.\n", | ||
"\n", | ||
"Security and Privacy Measures\n", | ||
"As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.\n", | ||
"\n", | ||
"HR Updates and Employee Benefits\n", | ||
"Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).\n", | ||
"\n", | ||
"Marketing Initiatives and Campaigns\n", | ||
"Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.\n", | ||
"\n", | ||
"Research and Development Projects\n", | ||
"In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.\n", | ||
"\n", | ||
"Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.\n", | ||
"\n", | ||
"Thank you for your attention, and let's continue to work together to achieve our goals.\n", | ||
"\n", | ||
"Best regards,\n", | ||
"\n", | ||
"Jason Fan\n", | ||
"Cofounder & CEO\n", | ||
"Psychic\n", | ||
"jason@psychic.dev\n", | ||
"\"\"\"\n", | ||
"print(sample_text)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"documents = [Document(page_content=sample_text)]\n", | ||
"properties = [\n", | ||
" {\n", | ||
" \"name\": \"category\",\n", | ||
" \"description\": \"What type of email this is.\",\n", | ||
" \"type\": \"string\",\n", | ||
" \"enum\": [\"update\", \"action_item\", \"customer_feedback\", \"announcement\", \"other\"],\n", | ||
" \"required\": True,\n", | ||
" },\n", | ||
" {\n", | ||
" \"name\": \"mentions\",\n", | ||
" \"description\": \"A list of all people mentioned in this email.\",\n", | ||
" \"type\": \"array\",\n", | ||
" \"items\": {\n", | ||
" \"name\": \"full_name\",\n", | ||
" \"description\": \"The full name of the person mentioned.\",\n", | ||
" \"type\": \"string\",\n", | ||
" },\n", | ||
" \"required\": True,\n", | ||
" },\n", | ||
" {\n", | ||
" \"name\": \"eli5\",\n", | ||
" \"description\": \"Explain this email to me like I'm 5 years old.\",\n", | ||
" \"type\": \"string\",\n", | ||
" \"required\": True,\n", | ||
" },\n", | ||
"]\n", | ||
"property_extractor = DoctranPropertyExtractor(properties=properties)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Output\n", | ||
"After extraction, the result is returned as a new document with the extracted properties stored in its metadata under the `extracted_properties` key." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"extracted_document = await property_extractor.atransform_documents(\n", | ||
" documents, properties=properties\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{\n", | ||
" \"extracted_properties\": {\n", | ||
" \"category\": \"update\",\n", | ||
" \"mentions\": [\n", | ||
" \"John Doe\",\n", | ||
" \"Jane Smith\",\n", | ||
" \"Michael Johnson\",\n", | ||
" \"Sarah Thompson\",\n", | ||
" \"David Rodriguez\",\n", | ||
" \"Jason Fan\"\n", | ||
" ],\n", | ||
" \"eli5\": \"This is an email from the CEO, Jason Fan, giving updates about different areas in the company. He talks about new security measures and praises John Doe for his work. He also mentions new hires and praises Jane Smith for her work in customer service. The CEO reminds everyone about the upcoming benefits enrollment and says to contact Michael Johnson with any questions. He talks about the marketing team's work and praises Sarah Thompson for increasing their social media followers. There's also a product launch event on July 15th. Lastly, he talks about the research and development projects and praises David Rodriguez for his work. There's a brainstorming session on July 10th.\"\n", | ||
" }\n", | ||
"}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print(json.dumps(extracted_document[0].metadata, indent=2))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
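Because the extracted values come back from a language model, it can be worth checking them against the property schema before relying on them. The sketch below is one way to do that, assuming the schema fields used in the notebook (`name`, `type`, `enum`, `required`). `check_extracted` is a hypothetical helper written for this example, not part of Doctran or LangChain, and it only handles the `string` and `array` types that appear above.

```python
from typing import Any, Dict, List


def check_extracted(properties: List[Dict[str, Any]], extracted: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the values fit the schema."""
    # Only the two schema types used in the notebook are mapped to Python types.
    type_map = {"string": str, "array": list}
    problems = []
    for prop in properties:
        name = prop["name"]
        if name not in extracted:
            if prop.get("required"):
                problems.append(f"missing required property: {name}")
            continue
        value = extracted[name]
        expected = type_map.get(prop["type"])
        if expected is not None and not isinstance(value, expected):
            problems.append(f"{name}: expected {prop['type']}")
            continue
        if "enum" in prop and value not in prop["enum"]:
            problems.append(f"{name}: {value!r} not in enum")
    return problems


# Schema condensed from the notebook's `properties` list.
properties = [
    {
        "name": "category",
        "type": "string",
        "enum": ["update", "action_item", "customer_feedback", "announcement", "other"],
        "required": True,
    },
    {"name": "mentions", "type": "array", "required": True},
    {"name": "eli5", "type": "string", "required": True},
]

# Values shaped like the notebook's `extracted_properties` output (eli5 shortened).
extracted = {
    "category": "update",
    "mentions": ["John Doe", "Jane Smith", "Michael Johnson"],
    "eli5": "This is an email from the CEO giving updates about the company.",
}

print(check_extracted(properties, extracted))  # prints []
```

With the notebook's variables, the call would look like `check_extracted(properties, extracted_document[0].metadata["extracted_properties"])`.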