
Add new types of document transformers #7379

Merged 43 commits · Jul 13, 2023
Conversation

jasonwcfan
Contributor

  • Description: Add two new document transformers: one translates documents into different languages, and one converts documents into Q&A format to improve vector search results. Both use OpenAI function calling via the doctran library.
  • Issue: N/A
  • Dependencies: doctran = "^0.0.5"
  • Tag maintainer: @rlancemartin @eyurtsev @hwchase17
  • Twitter handle: @psychicapi or @jfan001

Notes

  • Adheres to the DocumentTransformer abstraction set by @dev2049 in Add document transformer abstraction #3182
  • Refactored EmbeddingsRedundantFilter into a file under a new document_transformers module
  • Added basic docs for DocumentInterrogator and DocumentTransformer, as well as the existing EmbeddingsRedundantFilter


@dosubot added labels on Jul 7, 2023: 🤖:docs (changes to documentation and examples), 🤖:enhancement (large net-new component, integration, or chain), 🤖:improvement (medium size change to existing code to handle new use-cases)
@rlancemartin rlancemartin self-assigned this Jul 7, 2023

@rlancemartin rlancemartin left a comment


Great! We should add a notebook e.g., doctran.ipynb that shows usage similar to the one below:

https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/data_connection/document_loaders/integrations/source_code.ipynb

Cristobal got very good engagement on his related thread, and there's a lot of interest in this theme:

https://twitter.com/cristobal_dev/status/1675745314592915456?s=20

from doctran import Doctran


class DocumentInterrogator(BaseDocumentTransformer, BaseModel):
Collaborator


We might consider making this a BaseBlobParser similar to what we do here.

And then it is called easily w/o any boilerplate loader code:

loader = GenericLoader.from_filesystem(
    "./example_data/pds",
    glob="*",
    suffixes=[".pdf"],
    parser=DocumentTranslator()
)
docs = loader.load()

Contributor Author


Sounds good, I'll take a look at adding that. These transformers depend on an OpenAI call, so they need to run async - would that cause any issues when used as a parser?

Collaborator

@rlancemartin rlancemartin Jul 8, 2023


No, we have a similar parser that uses OpenAI Whisper.

See example here:
https://github.com/hwchase17/langchain/blob/master/langchain/document_loaders/parsers/audio.py
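Since lazy_parse is synchronous, one way to wrap an async call (such as the OpenAI request doctran makes) is to drive the coroutine with asyncio.run per item. A minimal sketch with a stubbed coroutine standing in for the network call - the names here are illustrative, not the PR's implementation:

```python
import asyncio
from typing import Iterator, List


# Stand-in for an async network call (e.g. an OpenAI request via doctran).
async def atranslate(text: str, language: str) -> str:
    await asyncio.sleep(0)  # placeholder for real I/O
    return f"[{language}] {text}"


def lazy_parse(texts: List[str], language: str) -> Iterator[str]:
    """Synchronous generator that drives the async call for each item."""
    for text in texts:
        # asyncio.run works here because no event loop is already running.
        yield asyncio.run(atranslate(text, language))


results = list(lazy_parse(["hello", "world"], "fr"))
```

Note that asyncio.run would fail if called from inside an already-running event loop (e.g. Jupyter), which is worth keeping in mind for notebook examples.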

@jasonwcfan
Contributor Author

Added the notebook examples - let me know if it's good to merge @rlancemartin and I can add the blob stuff in a followup PR, or if you want me to include it into this one!

from doctran import Doctran


class DocumentInterrogator(BaseDocumentTransformer, BaseModel):
Collaborator


Maybe we call this QA_generator as a more descriptive name?

from doctran import Doctran


class DocumentInterrogator(BaseDocumentTransformer, BaseModel):
Collaborator


It should take minimal changes to make this a parser.

1/ Inherit from BaseBlobParser
2/ Rename atransform_documents to lazy_parse (as shown here) to accept blobs
3/ It should then work out-of-the-box w/ the blob loaders

The UX in the notebook is then simple:

loader = GenericLoader.from_filesystem(
    "./example_data/pdf",
    glob="*",
    suffixes=[".pdf"],
    parser=QA_generator()
)
docs = loader.load()

self, documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]:
"""Asynchronously convert documents into Q&A format with doctran and OpenAI functions."""
openai_api_key = get_from_dict_or_env(kwargs, "openai_api_key", "OPENAI_API_KEY")
Collaborator


Can make api_key optional, as done here.

@rlancemartin
Collaborator

> Added the notebook examples - let me know if it's good to merge @rlancemartin and I can add the blob stuff in a followup PR, or if you want me to include it into this one!

Thanks! Great! Just added comments.

IMO, easier to make it a parser. Minimal code changes. I can help if needed.

Will simplify UX and be consistent w/ other recent transformers.

@rlancemartin
Collaborator

Also, functionality here is:
1/ QA
2/ Translate

Looks like Doctran can do more in terms of metadata extraction.

Plans to add that in a follow-up? We can discuss, too.

QA generation is cool, but impact is somewhat mild b/c we already have QAGeneration chain.

Translation is neat.

Adding context to chunks (in metadata) is the high impact addition that would be great to get in.

@jasonwcfan
Contributor Author

@rlancemartin ack, will add this in as a parser tomorrow.

Re: naming, I avoided including "QA" because there are already a lot of modules in LangChain with some variation of "QA" in their names, so I thought it might just increase confusion - but happy to call it whatever you think makes sense!

Re: additional functionality, yeah, I can add those too in a follow-up. We can discuss on Monday to see which ones are most useful and the best way to include them. I'd also be curious to understand what the vision is for the document transformers module - I assumed it would be for these kinds of transformations.

@jasonwcfan
Contributor Author

jasonwcfan commented Jul 9, 2023

Added translate and QA as parsers, as well as metadata extraction since you mentioned that was valuable. Will clean up the old files and update the notebooks early this week.

I ran into a couple of issues though:

Passing arguments to parsers
The BaseBlobParser class doesn't support additional arguments when passed into loaders. It seems that other parsers don't need to accept additional arguments, but it's necessary for translation (need to specify the language) and metadata extraction (need to pass in a JSON schema to describe the metadata to extract).

I can extend BaseBlobParser and BaseLoader to accept **kwargs if we want to keep these modules as parsers, but that might make the scope of these changes further reaching than anticipated.

class DoctranTranslateParser(BaseBlobParser):
    """Translates text documents into other languages using doctran."""

    def lazy_parse(self, blob: Blob, language: str, **kwargs) -> Iterator[Document]:
        """Lazily parse the blob."""
        openai_api_key = get_from_dict_or_env(kwargs, "openai_api_key", "OPENAI_API_KEY")
        doctran = Doctran(openai_api_key=openai_api_key)
        doctran_doc = (
            doctran.parse(content=blob.as_string()).translate(language=language).execute()
        )
        # lazy_parse returns Iterator[Document], so yield a Document, not a list
        yield Document(
            page_content=doctran_doc.properties_as_yaml()
            if doctran_doc.extracted_properties
            else doctran_doc.transformed_content
        )

Splitting
When splitting documents, the metadata for the parent document is copied to all child documents. This means any extracted properties (including QA pairs) will be copied over to all chunks, rather than mapped to only the relevant chunks that contain those properties. Therefore vector retrieval using these metadata properties won't be able to retrieve the correct documents if the document gets split.

I can:

  1. Keep the functionality as is, but that would make metadata extraction useless if the user intends to split the document after parsing.
  2. Keep the metadata extraction/QA generation step as a document transformation so it can happen after splitting
  3. Add functionality to handle document splitting while mapping properties to the correct chunks. Not sure if there's a great way to do this since I'm not super familiar with the LangChain codebase yet.

Let me know what you think! Can also discuss more on our call but it's a lot to cover in 30 min haha.
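Option 2 above can be sketched as a tiny pipeline: split first, then run extraction per chunk, so extracted properties attach to the chunk they actually came from. The splitter and extractor below are stubbed stand-ins for a real text splitter and a doctran/OpenAI call, purely for illustration:

```python
from typing import Any, Dict, List


def split(text: str, size: int) -> List[str]:
    """Stand-in for a text splitter: fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def extract_properties(chunk: str) -> Dict[str, Any]:
    """Stand-in for a doctran/OpenAI property-extraction call."""
    return {"length": len(chunk)}


chunks = split("abcdefgh", 3)
# Each chunk gets its own metadata, instead of inheriting the parent's.
enriched = [(chunk, extract_properties(chunk)) for chunk in chunks]
```

With extraction run post-split, vector retrieval on the extracted metadata can point back to the specific chunk rather than every child of the parent document.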

@rlancemartin
Collaborator

rlancemartin commented Jul 9, 2023

> Added translate and QA as parsers, as well as metadata extraction since you mentioned that was valuable. […]

Thanks! Hmm, you should be able to pass args to the parser. I'll have a close look later today! See here for an example of the Grobid parser accepting args and applying metadata to chunks.

loader = GenericLoader.from_filesystem(
    "../Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=False)
)
docs = loader.load()

@jasonwcfan
Contributor Author

re: arguments, ah yeah, nevermind - I didn't notice that the loader accepts an instance of a parser rather than the class, so I can just set them in the constructor 👍
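A minimal sketch of that constructor-argument approach: configuration (here, the target language) lives on the parser instance, so the loader only ever calls lazy_parse(blob) with no extra arguments. Blob and the translation step are stubbed for illustration - the real class would subclass BaseBlobParser and call doctran:

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Blob:
    """Stubbed blob; the real one comes from langchain's blob loaders."""
    text: str

    def as_string(self) -> str:
        return self.text


class TranslateParser:
    def __init__(self, language: str):
        # Configuration is stored at construction time, so the loader
        # needs no knowledge of parser-specific arguments.
        self.language = language

    def lazy_parse(self, blob: Blob) -> Iterator[str]:
        # Stand-in for the doctran translate call.
        yield f"({self.language}) {blob.as_string()}"


parser = TranslateParser(language="es")
docs = list(parser.lazy_parse(Blob("hola")))
```

This is why GenericLoader taking a parser *instance* (not a class) resolves the **kwargs question: the instance already carries its own settings.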

@rlancemartin
Collaborator

> re: arguments, ah yeah, nevermind - I didn't notice that the loader accepts an instance of a parser rather than the class, so I can just set them in the constructor 👍

Nice! Ya. And I think chunk-wise metadata should work (let's take a quick skim of Grobid as an example). I'll look when back at my computer later.

pyproject.toml Outdated
@@ -116,6 +116,7 @@ streamlit = {version = "^1.18.0", optional = true, python = ">=3.8.1,<3.9.7 || >
psychicapi = {version = "^0.8.0", optional = true}
cassio = {version = "^0.0.7", optional = true}
rdflib = {version = "^6.3.2", optional = true}
doctran = {version = "^0.0.5", optional = true}
Collaborator


Just include this in the notebook rather than adding to pyproject.toml

! pip install doctran

Contributor Author


How does it get loaded in the parser modules if it's not included as a dependency?

"outputs": [],
"source": [
"from langchain.schema import Document\n",
"from langchain.document_transformers import DocumentInterrogator"
Collaborator

@rlancemartin rlancemartin Jul 10, 2023


Running the notebook (Python 3.9.16), I see an import error, which appears to occur on Py < 3.10:

   6 from langchain.utils import get_from_dict_or_env
   7 from langchain.schema import BaseDocumentTransformer, Document
>  8 from doctran import Doctran
   9
  10
  11 class DocumentInterrogator(BaseDocumentTransformer, BaseModel):

/Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/doctran/__init__.py:1 in <module>

>  1 from .doctran import Doctran, Document, DoctranConfig, ContentType, ExtractProperty, Rec

/Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/doctran/doctran.py:14 in <module>

  11 from typing import List, Optional, Dict, Any, Literal
  12 from pydantic import BaseModel
  13
> 14 class ExtractProperty(BaseModel):
  15     name: str
  16     description: str
  17     type: Literal["string", "number", "boolean", "array", "object"]

/Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/doctran/doctran.py:18 in ExtractProperty

  15     name: str
  16     description: str
  17     type: Literal["string", "number", "boolean", "array", "object"]
> 18     items: Optional[List | Dict[str, Any]]
  19     enum: Optional[List[str]]
  20     required: bool = True
  21
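The failure appears to come from the `List | Dict[str, Any]` annotation on line 18 of doctran.py: the PEP 604 `X | Y` union syntax is only valid at runtime on Python >= 3.10, and pydantic evaluates field annotations at class-creation time, so the import fails on 3.9. A 3.8/3.9-compatible spelling uses typing.Union - a sketch, not the doctran source:

```python
from typing import Any, Dict, List, Optional, Union


def describe(items: Optional[Union[List, Dict[str, Any]]]) -> str:
    # This annotation runs on Python 3.8+, unlike `List | Dict[str, Any]`,
    # which raises TypeError when evaluated on Python < 3.10.
    return "empty" if not items else "populated"
```

The fix would belong upstream in doctran (or LangChain could gate the integration to Python >= 3.10).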

"DoctranPropertyExtractor",
"EmbeddingsClusteringFilter",
"EmbeddingsRedundantFilter",
"_DocumentWithState",
Collaborator


no need to export private stuff

Collaborator


ya, legacy from the initial commit. pulling them out now.

Contributor Author


I did so because I was previously running into failures: there are tests that import these internal methods and classes, e.g. test_embeddings_filter.py

But not sure if there's another way to run the tests that makes these methods available. Should I go ahead and remove them?

Collaborator

@rlancemartin rlancemartin Jul 12, 2023


ya, may need them after all:

tests/unit_tests/test_document_transformers.py:2: error: Module "langchain.document_transformers" has no attribute "_filter_similar_embeddings"  [attr-defined]
langchain/retrievers/document_compressors/embeddings_filter.py:8: error: Module "langchain.document_transformers" has no attribute "_get_embeddings_from_stateful_docs"  [attr-defined]
langchain/retrievers/document_compressors/embeddings_filter.py:8: error: Module "langchain.document_transformers" has no attribute "get_stateful_documents"  [attr-defined]

Collaborator


we should just update those import statements to import from the specific files instead of from the document_transformers __init__

Collaborator


ya ill clean this all up.

) -> Sequence[Document]:
"""Extracts properties from text documents using doctran."""

properties = kwargs.get("properties", None)
Collaborator


should this and openai_api_key just be passed in during init?

Collaborator

@rlancemartin rlancemartin Jul 12, 2023


lint error from adding new explicit param from:

    @abstractmethod
    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:

Collaborator


as in to the class __init__ method, and accessed as instance attributes in the function

Collaborator


ya that's reasonable; none of these classes were using init methods.

@jasonwcfan will take care of this.

]
for i, doc in enumerate(doctran_docs):
    doctran_docs[i] = await doc.translate(language=language).execute()
return [Document(page_content=doc.transformed_content) for doc in doctran_docs]
Collaborator


is dropping original metadata intentional?

Contributor Author


Nope, thanks for catching that. Will add the fix

Collaborator


@jasonwcfan had this initially; I'm testing add-back now.
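The fix under discussion amounts to carrying each source document's metadata onto its transformed counterpart instead of constructing a bare Document. A minimal sketch with a stubbed Document dataclass (illustrative only, not the PR's code):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Stubbed document; the real one is langchain.schema.Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def rebuild(originals: List[Document], transformed_texts: List[str]) -> List[Document]:
    # Pair each original with its transformed text and copy the
    # original's metadata forward, so nothing is silently dropped.
    return [
        Document(page_content=text, metadata=orig.metadata)
        for orig, text in zip(originals, transformed_texts)
    ]


docs = rebuild([Document("hi", {"source": "a.txt"})], ["bonjour"])
```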

@rlancemartin
Collaborator

rlancemartin commented Jul 12, 2023

This is good to go in once @jasonwcfan adds init methods and does a final sweep of Notebooks w/ init.

@baskaryan minor point that Vercel deploy test is failing (non blocking).

@baskaryan
Collaborator

can fix vercel and lint issues and land from here @jasonwcfan!

@rlancemartin
Collaborator

Thanks @baskaryan! Great. Should we wait for the Vercel fix?

Collaborator


@rlancemartin can this be deleted?

Collaborator


oh, ya! weird, not sure how that got added here!

@baskaryan baskaryan merged commit 8effd90 into langchain-ai:master Jul 13, 2023
3 participants