
Add new types of document transformers #7379

Merged 43 commits · Jul 13, 2023
Conversation

jasonwcfan
Contributor

  • Description: Add two new document transformers: one translates documents into different languages, and one converts documents into Q&A format to improve vector search results. Both use OpenAI function calling via the doctran library.
  • Issue: N/A
  • Dependencies: doctran = "^0.0.5"
  • Tag maintainer: @rlancemartin @eyurtsev @hwchase17
  • Twitter handle: @psychicapi or @jfan001

Notes

  • Adheres to the DocumentTransformer abstraction set by @dev2049 in Add document transformer abstraction #3182
  • Refactored EmbeddingsRedundantFilter into a file under a new document_transformers module
  • Added basic docs for DocumentInterrogator and DocumentTransformer, as well as the existing EmbeddingsRedundantFilter


@dosubot added labels on Jul 7, 2023: 🤖:docs (changes to documentation and examples), 🤖:enhancement (large net-new component, integration, or chain), 🤖:improvement (medium size change to existing code to handle new use-cases)
@rlancemartin rlancemartin self-assigned this Jul 7, 2023

@rlancemartin rlancemartin left a comment


Great! We should add a notebook e.g., doctran.ipynb that shows usage similar to the one below:

https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/data_connection/document_loaders/integrations/source_code.ipynb

Cristobal got very good engagement on his related thread, and there's a lot of interest in this theme:

https://twitter.com/cristobal_dev/status/1675745314592915456?s=20

from doctran import Doctran


class DocumentInterrogator(BaseDocumentTransformer, BaseModel):
Collaborator


We might consider making this a BaseBlobParser similar to what we do here.

And then it is called easily w/o any boilerplate loader code:

loader = GenericLoader.from_filesystem(
    "./example_data/pds",
    glob="*",
    suffixes=[".pdf"],
    parser=DocumentTranslator()
)
docs = loader.load()

Contributor Author


Sounds good, I'll take a look at adding that. These transformers depend on an OpenAI call, so they need to run async - would that cause any issues when used as a parser?

Collaborator

@rlancemartin rlancemartin Jul 8, 2023


No, we have a similar parser that uses OpenAI Whisper.

See example here:
https://github.com/hwchase17/langchain/blob/master/langchain/document_loaders/parsers/audio.py
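Since lazy_parse is synchronous, one way to wrap an async call (such as the OpenAI request doctran makes) is to drive the coroutine with asyncio.run per item. A minimal sketch with a stubbed coroutine standing in for the network call - the names here are illustrative, not the PR's implementation:

```python
import asyncio
from typing import Iterator, List


# Stand-in for an async network call (e.g. an OpenAI request via doctran).
async def atranslate(text: str, language: str) -> str:
    await asyncio.sleep(0)  # placeholder for real I/O
    return f"[{language}] {text}"


def lazy_parse(texts: List[str], language: str) -> Iterator[str]:
    """Synchronous generator that drives the async call for each item."""
    for text in texts:
        # asyncio.run works here because no event loop is already running.
        yield asyncio.run(atranslate(text, language))


results = list(lazy_parse(["hello", "world"], "fr"))
```

Note that asyncio.run would fail if called from inside an already-running event loop (e.g. Jupyter), which is worth keeping in mind for notebook examples.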

@jasonwcfan
Contributor Author

Added the notebook examples - let me know if it's good to merge @rlancemartin and I can add the blob stuff in a followup PR, or if you want me to include it into this one!

from doctran import Doctran


class DocumentInterrogator(BaseDocumentTransformer, BaseModel):
Collaborator


Maybe we call this QA_generator as a more descriptive name?

from doctran import Doctran


class DocumentInterrogator(BaseDocumentTransformer, BaseModel):
Collaborator


It should take minimal changes to make this a parser.

1/ Inherit from BaseBlobParser
2/ Rename atransform_documents to lazy_parse (as shown here) to accept blobs
3/ It should then work out-of-the-box w/ the blob loaders

The UX in the notebook is then simple:

loader = GenericLoader.from_filesystem(
    "./example_data/pdf",
    glob="*",
    suffixes=[".pdf"],
    parser=QA_generator()
)
docs = loader.load()

self, documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]:
"""Asynchronously convert documents into Q&A format with doctran and OpenAI functions."""
openai_api_key = get_from_dict_or_env(kwargs, "openai_api_key", "OPENAI_API_KEY")
Collaborator


Can make api_key optional, as done here.

@rlancemartin
Collaborator

> Added the notebook examples - let me know if it's good to merge @rlancemartin and I can add the blob stuff in a followup PR, or if you want me to include it into this one!

Thanks! Great! Just added comments.

IMO, easier to make it a parser. Minimal code changes. I can help if needed.

Will simplify UX and be consistent w/ other recent transformers.

@rlancemartin
Collaborator

Also, functionality here is:
1/ QA
2/ Translate

Looks like Doctran can do more in terms of metadata extraction.

Plans to add that in a follow-up? We can discuss, too.

QA generation is cool, but impact is somewhat mild b/c we already have QAGeneration chain.

Translation is neat.

Adding context to chunks (in metadata) is the high impact addition that would be great to get in.

@jasonwcfan
Contributor Author

@rlancemartin ack, will add this in as a parser tomorrow.

Re: naming, I avoided including "QA" because there are already a lot of modules in LangChain with some variation of "QA" in their names, so I thought it might just increase confusion - but happy to call it whatever you think makes sense!

Re: additional functionality, yeah, I can add those too in a follow-up. We can discuss on Monday to see which ones are most useful and the best way to include them. I'd also be curious to understand what the vision is for the document transformers module - I assumed it would be for these kinds of transformations.

@jasonwcfan
Contributor Author

jasonwcfan commented Jul 9, 2023

Added translate and QA as parsers, as well as metadata extraction since you mentioned that was valuable. Will clean up the old files and update the notebooks early this week.

I ran into a couple of issues though:

Passing arguments to parsers
The BaseBlobParser class doesn't support additional arguments when passed into loaders. It seems that other parsers don't need to accept additional arguments, but it's necessary for translation (need to specify the language) and metadata extraction (need to pass in a JSON schema to describe the metadata to extract).

I can extend BaseBlobParser and BaseLoader to accept **kwargs if we want to keep these modules as parsers, but that might make the scope of these changes further reaching than anticipated.

class DoctranTranslateParser(BaseBlobParser):
    """Translates text documents into other languages using doctran."""

    def lazy_parse(self, blob: Blob, language: str, **kwargs) -> Iterator[Document]:
        """Lazily parse the blob."""
        openai_api_key = get_from_dict_or_env(kwargs, "openai_api_key", "OPENAI_API_KEY")
        doctran = Doctran(openai_api_key=openai_api_key)
        doctran_doc = (
            doctran.parse(content=blob.as_string()).translate(language=language).execute()
        )
        # lazy_parse returns Iterator[Document], so yield a Document, not a list
        yield Document(
            page_content=doctran_doc.properties_as_yaml()
            if doctran_doc.extracted_properties
            else doctran_doc.transformed_content
        )

Splitting
When splitting documents, the metadata for the parent document is copied to all child documents. This means any extracted properties (including QA pairs) will be copied over to all chunks, rather than mapped to only the relevant chunks that contain those properties. Therefore vector retrieval using these metadata properties won't be able to retrieve the correct documents if the document gets split.

I can:

  1. Keep the functionality as is, but that would make metadata extraction useless if the user intends to split the document after parsing.
  2. Keep the metadata extraction/QA generation step as a document transformation so it can happen after splitting
  3. Add functionality to handle document splitting while mapping properties to the correct chunks. Not sure if there's a great way to do this since I'm not super familiar with the LangChain codebase yet.

Let me know what you think! Can also discuss more on our call but it's a lot to cover in 30 min haha.
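Option 2 above can be sketched as a tiny pipeline: split first, then run extraction per chunk, so extracted properties attach to the chunk they actually came from. The splitter and extractor below are stubbed stand-ins for a real text splitter and a doctran/OpenAI call, purely for illustration:

```python
from typing import Any, Dict, List


def split(text: str, size: int) -> List[str]:
    """Stand-in for a text splitter: fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def extract_properties(chunk: str) -> Dict[str, Any]:
    """Stand-in for a doctran/OpenAI property-extraction call."""
    return {"length": len(chunk)}


chunks = split("abcdefgh", 3)
# Each chunk gets its own metadata, instead of inheriting the parent's.
enriched = [(chunk, extract_properties(chunk)) for chunk in chunks]
```

With extraction run post-split, vector retrieval on the extracted metadata can point back to the specific chunk rather than every child of the parent document.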

@rlancemartin
Collaborator

rlancemartin commented Jul 9, 2023

> Added translate and QA as parsers, as well as metadata extraction since you mentioned that was valuable. […]

Thanks! Hmm, you should be able to pass args to the parser. I'll have a close look later today! See here for an example of the Grobid parser accepting args and applying metadata to chunks.

loader = GenericLoader.from_filesystem(
    "../Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=False)
)
docs = loader.load()

@jasonwcfan
Contributor Author

re: arguments, ah yeah, nevermind - I didn't notice that the loader accepts an instance of a parser rather than the class, so I can just set them in the constructor 👍
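A minimal sketch of that constructor-argument approach: configuration (here, the target language) lives on the parser instance, so the loader only ever calls lazy_parse(blob) with no extra arguments. Blob and the translation step are stubbed for illustration - the real class would subclass BaseBlobParser and call doctran:

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Blob:
    """Stubbed blob; the real one comes from langchain's blob loaders."""
    text: str

    def as_string(self) -> str:
        return self.text


class TranslateParser:
    def __init__(self, language: str):
        # Configuration is stored at construction time, so the loader
        # needs no knowledge of parser-specific arguments.
        self.language = language

    def lazy_parse(self, blob: Blob) -> Iterator[str]:
        # Stand-in for the doctran translate call.
        yield f"({self.language}) {blob.as_string()}"


parser = TranslateParser(language="es")
docs = list(parser.lazy_parse(Blob("hola")))
```

This is why GenericLoader taking a parser *instance* (not a class) resolves the **kwargs question: the instance already carries its own settings.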

@rlancemartin
Collaborator

> re: arguments, ah yeah, nevermind - I didn't notice that the loader accepts an instance of a parser rather than the class, so I can just set them in the constructor 👍

Nice! Ya. And I think chunk-wise metadata should work (let's take a quick skim of Grobid as an example). I'll look when back at my computer later.

pyproject.toml Outdated
@@ -116,6 +116,7 @@ streamlit = {version = "^1.18.0", optional = true, python = ">=3.8.1,<3.9.7 || >
psychicapi = {version = "^0.8.0", optional = true}
cassio = {version = "^0.0.7", optional = true}
rdflib = {version = "^6.3.2", optional = true}
doctran = {version = "^0.0.5", optional = true}
Collaborator


Just include this in the notebook rather than adding to pyproject.toml

! pip install doctran

Contributor Author


How does it get loaded in the parser modules if it's not included as a dependency?

"outputs": [],
"source": [
"from langchain.schema import Document\n",
"from langchain.document_transformers import DocumentInterrogator"
Collaborator

@rlancemartin rlancemartin Jul 10, 2023


Running the notebook (Python 3.9.16), I see an import error, which appears to occur on Py < 3.10:

   6 from langchain.utils import get_from_dict_or_env
   7 from langchain.schema import BaseDocumentTransformer, Document
>  8 from doctran import Doctran
   9
  10
  11 class DocumentInterrogator(BaseDocumentTransformer, BaseModel):

/Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/doctran/__init__.py:1 in <module>

>  1 from .doctran import Doctran, Document, DoctranConfig, ContentType, ExtractProperty, Rec

/Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/doctran/doctran.py:14 in <module>

  11 from typing import List, Optional, Dict, Any, Literal
  12 from pydantic import BaseModel
  13
> 14 class ExtractProperty(BaseModel):
  15     name: str
  16     description: str
  17     type: Literal["string", "number", "boolean", "array", "object"]

/Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/doctran/doctran.py:18 in ExtractProperty

  15     name: str
  16     description: str
  17     type: Literal["string", "number", "boolean", "array", "object"]
> 18     items: Optional[List | Dict[str, Any]]
  19     enum: Optional[List[str]]
  20     required: bool = True
  21
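The failure appears to come from the `List | Dict[str, Any]` annotation on line 18 of doctran.py: the PEP 604 `X | Y` union syntax is only valid at runtime on Python >= 3.10, and pydantic evaluates field annotations at class-creation time, so the import fails on 3.9. A 3.8/3.9-compatible spelling uses typing.Union - a sketch, not the doctran source:

```python
from typing import Any, Dict, List, Optional, Union


def describe(items: Optional[Union[List, Dict[str, Any]]]) -> str:
    # This annotation runs on Python 3.8+, unlike `List | Dict[str, Any]`,
    # which raises TypeError when evaluated on Python < 3.10.
    return "empty" if not items else "populated"
```

The fix would belong upstream in doctran (or LangChain could gate the integration to Python >= 3.10).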

"DoctranPropertyExtractor",
"EmbeddingsClusteringFilter",
"EmbeddingsRedundantFilter",
"_DocumentWithState",
Collaborator


no need to export private stuff

Collaborator


ya, legacy from the initial commit. pulling them out now.

Contributor Author


I did so because I was previously running into failures: there are tests that import these internal methods and classes, e.g. test_embeddings_filter.py

But not sure if there's another way to run the tests that makes these methods available. Should I go ahead and remove them?

Collaborator

@rlancemartin rlancemartin Jul 12, 2023


ya, may need them after all:

tests/unit_tests/test_document_transformers.py:2: error: Module "langchain.document_transformers" has no attribute "_filter_similar_embeddings"  [attr-defined]
langchain/retrievers/document_compressors/embeddings_filter.py:8: error: Module "langchain.document_transformers" has no attribute "_get_embeddings_from_stateful_docs"  [attr-defined]
langchain/retrievers/document_compressors/embeddings_filter.py:8: error: Module "langchain.document_transformers" has no attribute "get_stateful_documents"  [attr-defined]

Collaborator


we should just update those import statements to import from the specific files instead of from the document_transformers __init__

Collaborator


ya ill clean this all up.

) -> Sequence[Document]:
"""Extracts properties from text documents using doctran."""

properties = kwargs.get("properties", None)
Collaborator


should this and openai_api_key just be passed in during init?

Collaborator

@rlancemartin rlancemartin Jul 12, 2023


lint error from adding new explicit param from:

    @abstractmethod
    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:

Collaborator


as in to the class __init__ method, and accessed as instance attributes in the function

Collaborator


ya that's reasonable; none of these classes were using init methods.

@jasonwcfan will take care of this.

]
for i, doc in enumerate(doctran_docs):
    doctran_docs[i] = await doc.translate(language=language).execute()
return [Document(page_content=doc.transformed_content) for doc in doctran_docs]
Collaborator


is dropping original metadata intentional?

Contributor Author


Nope, thanks for catching that. Will add the fix

Collaborator


@jasonwcfan had this initially; I'm testing add-back now.
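The fix under discussion amounts to carrying each source document's metadata onto its transformed counterpart instead of constructing a bare Document. A minimal sketch with a stubbed Document dataclass (illustrative only, not the PR's code):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Stubbed document; the real one is langchain.schema.Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def rebuild(originals: List[Document], transformed_texts: List[str]) -> List[Document]:
    # Pair each original with its transformed text and copy the
    # original's metadata forward, so nothing is silently dropped.
    return [
        Document(page_content=text, metadata=orig.metadata)
        for orig, text in zip(originals, transformed_texts)
    ]


docs = rebuild([Document("hi", {"source": "a.txt"})], ["bonjour"])
```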

@rlancemartin
Collaborator

rlancemartin commented Jul 12, 2023

This is good to go in once @jasonwcfan adds init methods and does a final sweep of Notebooks w/ init.

@baskaryan minor point that Vercel deploy test is failing (non blocking).

@baskaryan
Collaborator

can fix vercel and lint issues and land from here @jasonwcfan!

@rlancemartin
Collaborator

Thanks @baskaryan! Great. Should we wait for the Vercel fix?

Collaborator


@rlancemartin can this be deleted?

Collaborator


oh, ya! weird, not sure how that got added here!

@baskaryan baskaryan merged commit 8effd90 into langchain-ai:master Jul 13, 2023
3 participants