Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds OpenAI functions powered document metadata tagger #7521

Conversation

jacoblee93
Copy link
Contributor

Adds a new document transformer that automatically extracts metadata for a document based on an input schema. I also moved document_transformers.py to document_transformers/__init__.py to group it with this new transformer - it didn't seem to cause issues in the notebook, but let me know if I've done something wrong there.

Also had a linter issue I couldn't figure out:

MacBook-Pro:langchain jacoblee$ make lint
poetry run mypy .
docs/dist/conf.py: error: Duplicate module named "conf" (also at "./docs/api_reference/conf.py")
docs/dist/conf.py: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#mapping-file-paths-to-modules for more info
docs/dist/conf.py: note: Common resolutions include: a) using `--exclude` to avoid checking one of them, b) adding `__init__.py` somewhere, c) using `--explicit-package-bases` or adjusting MYPYPATH
Found 1 error in 1 file (errors prevented further checking)
make: *** [lint] Error 2

@rlancemartin @baskaryan

@vercel
Copy link

vercel bot commented Jul 11, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Jul 13, 2023 4:50am

@dosubot dosubot bot added 🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder 🤖:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features labels Jul 11, 2023
@CodiumAI-Agent
Copy link

PR Analysis

  • 🎯 Main theme: Adding a new document transformer that automatically extracts metadata for a document based on an input schema
  • 🔍 Description and title: Yes
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: No
  • Minimal and focused: Yes, the PR is focused on adding a new feature of document metadata tagging and does not include unrelated changes.
  • 🔒 Security concerns: No, the PR does not introduce possible security concerns or issues. The changes are related to data processing and do not involve any security-sensitive operations.

PR Feedback

  • 💡 General PR suggestions: The PR is well-structured and the code changes are clear. However, it lacks tests for the new functionality. It's important to add tests to ensure the new feature works as expected and to prevent regressions in the future. Additionally, the linter issue mentioned in the PR description should be resolved.

  • 🤖 Code suggestions:

    • relevant file: langchain/document_transformers/init.py
      suggestion content: Consider adding type hints to the function _filter_similar_embeddings and _filter_cluster_embeddings for better code readability and maintainability. [important]

    • relevant file: langchain/document_transformers/openai_functions.py
      suggestion content: In the MetadataTagger class, the atransform_documents method raises a NotImplementedError. If this method is not intended to be used, consider removing it or adding a docstring to explain why it's not implemented. [medium]

    • relevant file: langchain/document_transformers/openai_functions.py
      suggestion content: In the create_metadata_tagger function, consider adding a docstring to explain the purpose of the function, its arguments, and its return value. This will improve code readability and maintainability. [medium]

    • relevant file: langchain/chains/openai_functions/tagging.py
      suggestion content: In the create_tagging_chain and create_tagging_chain_pydantic functions, consider adding a docstring to explain the purpose of the functions, their arguments, and their return values. This will improve code readability and maintainability. [medium]

How to use

Tag me in a comment '@CodiumAI-Agent' to ask for a new review after you update the PR.
You can also tag me and ask any question, for example '@CodiumAI-Agent is the PR ready for merge?'

@@ -2,9 +2,9 @@
import os
import time
import uuid
import numpy as np
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't change these manually, just ran make format.

Copy link
Collaborator

@baskaryan baskaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just one nit

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably move the stuff here to their own files. think #7379 already doing that so will wait till it lands to merge

@rlancemartin
Copy link
Collaborator

Nice. Yes. This is good. Consolidating document_transformers in a new dir will make additions easier going forward.

Like @baskaryan said, #7379 is going in soon.

@baskaryan baskaryan added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Jul 13, 2023
@baskaryan baskaryan merged commit cdb93ab into langchain-ai:master Jul 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder 🤖:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features lgtm PR looks good. Use to confirm that a PR is ready for merging.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants