Add new types of document transformers #7379
Conversation
Great! We should add a notebook, e.g. doctran.ipynb, that shows usage similar to the one below. Cristobal got very good engagement on his related thread, and there's a lot of interest in this theme:
https://twitter.com/cristobal_dev/status/1675745314592915456?s=20
from doctran import Doctran
...
class DocumentInterrogator(BaseDocumentTransformer, BaseModel):
We might consider making this a BaseBlobParser, similar to what we do here. Then it is called easily w/o any boilerplate loader code:

loader = GenericLoader.from_filesystem(
    "./example_data/pdf",
    glob="*",
    suffixes=[".pdf"],
    parser=DocumentTranslator(),
)
docs = loader.load()
Sounds good, I'll take a look at adding that. These transformers depend on an OpenAI call, so they need to be run async. Would that cause any issues when used as a parser?
No, we have a similar parser that uses OpenAI Whisper.
See example here:
https://github.com/hwchase17/langchain/blob/master/langchain/document_loaders/parsers/audio.py
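For reference, a sync lazy_parse can simply drive the async call to completion itself. A minimal sketch, assuming execute() is a coroutine (as in the transformer code later in this thread); the class and helper names here are placeholders, not anything in the PR:

import asyncio
from typing import Iterator

from langchain.document_loaders.base import BaseBlobParser
from langchain.document_loaders.blob_loaders import Blob
from langchain.schema import Document


class AsyncBackedParser(BaseBlobParser):
    """Hypothetical parser that wraps an async transformation behind the sync parser API."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # Run the coroutine to completion inside the sync parser interface.
        content = asyncio.run(self._atransform(blob.as_string()))
        yield Document(page_content=content, metadata={"source": blob.source})

    async def _atransform(self, text: str) -> str:
        # Placeholder for the real async work, e.g.:
        #   doctran_doc = doctran.parse(content=text)
        #   result = await doctran_doc.translate(language="spanish").execute()
        #   return result.transformed_content
        return text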
Added the notebook examples. Let me know if it's good to merge @rlancemartin, and I can add the blob stuff in a follow-up PR, or if you want me to include it in this one!
from doctran import Doctran
...
class DocumentInterrogator(BaseDocumentTransformer, BaseModel):
Maybe we call this QA_generator, as a more descriptive name?
from doctran import Doctran
...
class DocumentInterrogator(BaseDocumentTransformer, BaseModel):
It should be minimal changes to make this a parser:
1/ Inherit from BaseBlobParser
2/ Just re-name atransform_documents to lazy_parse (as shown here) to accept blobs
3/ It should then work out-of-the-box w/ the blob loaders
The UX in the notebook is then simple (a skeleton of these steps follows the snippet below):
loader = GenericLoader.from_filesystem(
    "./example_data/pdf",
    glob="*",
    suffixes=[".pdf"],
    parser=QA_generator(),
)
docs = loader.load()
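And a skeleton of those three steps (QAGenerator here is just a placeholder name, per the naming discussion above):

from typing import Iterator

from langchain.document_loaders.base import BaseBlobParser
from langchain.document_loaders.blob_loaders import Blob
from langchain.schema import Document


class QAGenerator(BaseBlobParser):  # 1/ inherit from BaseBlobParser
    # 2/ lazy_parse replaces atransform_documents and accepts a Blob
    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # 3/ blob loaders (e.g. GenericLoader) hand each file here as a Blob
        yield Document(page_content=blob.as_string(), metadata={"source": blob.source})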
    self, documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]:
    """Asynchronously convert documents into Q&A format with doctran and OpenAI functions."""
    openai_api_key = get_from_dict_or_env(kwargs, "openai_api_key", "OPENAI_API_KEY")
Can make api_key optional, as done here.
Thanks!
Great! Just added comments. IMO, easier to make it a parser. Minimal code changes. I can help if needed. Will simplify UX and be consistent w/ other recent transformers.
Also, on the functionality here: looks like Doctran can do more in terms of metadata extraction. Plans to add that in a follow-up? We can discuss, too. QA generation is cool, but impact is somewhat mild b/c we already have the QAGeneration chain. Translation is neat. Adding context to chunks (in metadata) is the high-impact addition that would be great to get in.
@rlancemartin ack, will add this in as a parser tomorrow.

Re: naming, I avoided including [...].

Re: additional functionality, yeah, I can add those too in a follow-up; we can discuss on Monday to see which ones are most useful and the best way to include them. Would also be curious to understand what the vision is for the document transformer module. I assumed it would be for these kinds of transformations.
Added translate and QA as parsers, as well as metadata extraction since you mentioned that was valuable. Will clean up the old files and update the notebooks early this week. I ran into a couple of issues though:

Passing arguments to parsers: I can extend lazy_parse to accept them, e.g.:

class DoctranTranslateParser(BaseBlobParser):
    """Translates text documents into other languages using doctran."""

    def lazy_parse(self, blob: Blob, language: str, **kwargs) -> Iterator[Document]:
        """Lazily parse the blob."""
        openai_api_key = get_from_dict_or_env(kwargs, "openai_api_key", "OPENAI_API_KEY")
        doctran = Doctran(openai_api_key=openai_api_key)
        doctran_doc = doctran.parse(content=blob.as_string()).translate(language=language).execute()
        yield Document(
            page_content=doctran_doc.properties_as_yaml()
            if doctran_doc.extracted_properties
            else doctran_doc.transformed_content
        )

but lazy_parse is invoked by the loader, so it's not obvious how to supply arguments like language when loading.

Splitting: I can [...]

Let me know what you think! Can also discuss more on our call, but it's a lot to cover in 30 min haha.
Thanks! Hmm, you should be able to pass args to the parser. I'll have a close look later today! See here for an example: the Grobid parser accepting args and applying metadata to chunks.
Re: arguments, ah yeah, never mind. I didn't notice that the loader accepts an instance of a parser rather than the class, so I can just set them in the constructor 👍
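i.e. something like this (a sketch; the constructor signature for DoctranTranslateParser is an assumption):

from langchain.document_loaders.generic import GenericLoader

# GenericLoader takes a parser *instance*, so per-call options can be
# bound in the constructor instead of being threaded through lazy_parse.
parser = DoctranTranslateParser(language="spanish")  # hypothetical __init__ signature
loader = GenericLoader.from_filesystem(
    "./example_data/pdf",
    glob="*",
    suffixes=[".pdf"],
    parser=parser,
)
docs = loader.load()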
Nice! Ya. And I think chunk-wise metadata should work (let's have a quick skim at Grobid as an example). I'll look when back on my computer later.
pyproject.toml (Outdated)

@@ -116,6 +116,7 @@
 streamlit = {version = "^1.18.0", optional = true, python = ">=3.8.1,<3.9.7 || >
 psychicapi = {version = "^0.8.0", optional = true}
 cassio = {version = "^0.0.7", optional = true}
 rdflib = {version = "^6.3.2", optional = true}
+doctran = {version = "^0.0.5", optional = true}
Just include this in the notebook rather than adding to pyproject.toml:

! pip install doctran
How does it get loaded in the parser modules if it's not included as a dependency?
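One common pattern for optional dependencies is a guarded import at call time, so the package is only required when the parser actually runs. A sketch (the helper name is hypothetical):

def _get_doctran():
    """Import doctran lazily so it stays an optional dependency."""
    try:
        from doctran import Doctran
    except ImportError:
        raise ImportError(
            "doctran is required for this parser. Install it with `pip install doctran`."
        )
    return Doctran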
"outputs": [], | ||
"source": [ | ||
"from langchain.schema import Document\n", | ||
"from langchain.document_transformers import DocumentInterrogator" |
Running the notebook (Python 3.9.16), I see an import error, which appears to be an issue on Py < 3.10:
│ 6 from langchain.utils import get_from_dict_or_env │
│ 7 from langchain.schema import BaseDocumentTransformer, Document │
│ ❱ 8 from doctran import Doctran │
│ 9 │
│ 10 │
│ 11 class DocumentInterrogator(BaseDocumentTransformer, BaseModel): │
│ │
│ /Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/doctran/__init__.py:1 in <module> │
│ │
│ ❱ 1 from .doctran import Doctran, Document, DoctranConfig, ContentType, ExtractProperty, Rec │
│ 2 │
│ │
│ /Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/doctran/doctran.py:14 in <module> │
│ │
│ 11 from typing import List, Optional, Dict, Any, Literal │
│ 12 from pydantic import BaseModel │
│ 13 │
│ ❱ 14 class ExtractProperty(BaseModel): │
│ 15 │ name: str │
│ 16 │ description: str │
│ 17 │ type: Literal["string", "number", "boolean", "array", "object"] │
│ │
│ /Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/doctran/doctran.py:18 in │
│ ExtractProperty │
│ │
│ 15 │ name: str │
│ 16 │ description: str │
│ 17 │ type: Literal["string", "number", "boolean", "array", "object"] │
│ ❱ 18 │ items: Optional[List | Dict[str, Any]] │
│ 19 │ enum: Optional[List[str]] │
│ 20 │ required: bool = True │
│ 21
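The failing line is the annotation items: Optional[List | Dict[str, Any]]; the X | Y union syntax (PEP 604) is only valid at runtime on Python >= 3.10. A 3.9-compatible spelling, if this were fixed upstream in doctran, would be:

from typing import Any, Dict, List, Optional, Union

# Equivalent to `Optional[List | Dict[str, Any]]`, but valid on Python < 3.10:
items: Optional[Union[List, Dict[str, Any]]] = None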
"DoctranPropertyExtractor", | ||
"EmbeddingsClusteringFilter", | ||
"EmbeddingsRedundantFilter", | ||
"_DocumentWithState", |
no need to export private stuff
ya, legacy from the initial commit. pulling them out now.
I did so because previously I was running into failures: there are tests that import these internal methods and classes, e.g. test_embeddings_filter.py. But not sure if there's another way to run the tests that makes these methods available. Should I go ahead and remove them?
ya, may need them after all:
tests/unit_tests/test_document_transformers.py:2: error: Module "langchain.document_transformers" has no attribute "_filter_similar_embeddings" [attr-defined]
langchain/retrievers/document_compressors/embeddings_filter.py:8: error: Module "langchain.document_transformers" has no attribute "_get_embeddings_from_stateful_docs" [attr-defined]
langchain/retrievers/document_compressors/embeddings_filter.py:8: error: Module "langchain.document_transformers" has no attribute "get_stateful_documents" [attr-defined]
we should just update those import statements to import from the specific files instead of from the document_transformers init
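i.e. roughly (the exact module path is an assumption about where the filter ends up living):

# Before: importing via the package __init__, which forces the private
# names to stay exported from document_transformers/__init__.py
from langchain.document_transformers import _filter_similar_embeddings

# After: import directly from the defining module (hypothetical path)
from langchain.document_transformers.embeddings_redundant_filter import (
    _filter_similar_embeddings,
)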
ya, I'll clean this all up.
) -> Sequence[Document]:
    """Extracts properties from text documents using doctran."""

    properties = kwargs.get("properties", None)
should this and openai_api_key just be passed in during init?
lint error from adding a new explicit param, since the base signature is:

@abstractmethod
async def atransform_documents(
    self, documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]:
as in, to the class __init__ method, and accessed as instance attributes in the function
ya that's reasonable; none of these classes were using init methods.
@jasonwcfan will take care of this.
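A sketch of that shape, using names from the diff above (the exact __init__ signature and types are assumptions):

import os
from typing import Any, List, Optional, Sequence

from langchain.schema import BaseDocumentTransformer, Document


class DoctranPropertyExtractor(BaseDocumentTransformer):
    def __init__(
        self, properties: List[dict], openai_api_key: Optional[str] = None
    ) -> None:
        self.properties = properties
        # Fall back to the env var when the key is not passed explicitly.
        self.openai_api_key = openai_api_key or os.environ["OPENAI_API_KEY"]

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        raise NotImplementedError

    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        # Read self.properties / self.openai_api_key here instead of kwargs.
        raise NotImplementedError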
]
for i, doc in enumerate(doctran_docs):
    doctran_docs[i] = await doc.translate(language=language).execute()
return [Document(page_content=doc.transformed_content) for doc in doctran_docs]
is dropping original metadata intentional?
Nope, thanks for catching that. Will add the fix.
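e.g., against the loop quoted above, something like:

# Carry each input document's metadata through to the transformed output.
return [
    Document(page_content=doc.transformed_content, metadata=documents[i].metadata)
    for i, doc in enumerate(doctran_docs)
]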
@jasonwcfan had this initially; I'm testing add-back now.
This is good to go in once @jasonwcfan adds init methods and does a final sweep of the notebooks w/ init. @baskaryan minor point that the Vercel deploy test is failing (non-blocking).
Can fix Vercel and lint issues and land from here @jasonwcfan!
Thanks @baskaryan! Great. Should we wait for the Vercel fix?
@rlancemartin can this be deleted?
oh, ya! weird, not sure how that got added here!
doctran = "^0.0.5"
Notes
- Follows the DocumentTransformer abstraction set by @dev2049 in Add document transformer abstraction #3182
- Moved EmbeddingsRedundantFilter to put it in a file under a new document_transformers module
- Adds DocumentInterrogator, a new kind of DocumentTransformer, alongside the existing EmbeddingsRedundantFilter