feat: astra db chunks deletion based on metadata field #5537

Merged

Conversation

smatiolids
Contributor

Purpose
This PR addresses the need to reload specific documents without affecting others. To achieve this, a new option, "deletion_field", has been introduced.

Functionality

When "deletion_field" is set (e.g., "file_path"), the system will delete all documents in the target collection where metadata["file_path"] matches the corresponding value in the incoming documents.
This ensures that chunks from the specific file are removed before reloading it, preventing duplicates or conflicts.
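The delete-then-reload behaviour described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the component's actual code: `Doc` and `reload_documents` are hypothetical names, and an in-memory list stands in for the Astra DB collection, where the real component would issue a delete against the collection instead of filtering a list.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a stored chunk: text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def reload_documents(stored, incoming, deletion_field):
    """Return the collection contents after a delete-then-insert reload."""
    if incoming and deletion_field:
        # Distinct values of the deletion field across the incoming documents.
        delete_values = {
            doc.metadata[deletion_field]
            for doc in incoming
            if deletion_field in doc.metadata
        }
        # Drop every stored chunk whose metadata value matches one of them.
        stored = [
            d for d in stored
            if d.metadata.get(deletion_field) not in delete_values
        ]
    # Chunks from other files are untouched; the new chunks are appended.
    return stored + list(incoming)


stored = [
    Doc("old a1", {"file_path": "a.txt"}),
    Doc("old a2", {"file_path": "a.txt"}),
    Doc("b1", {"file_path": "b.txt"}),
]
incoming = [Doc("new a1", {"file_path": "a.txt"})]
result = reload_documents(stored, incoming, "file_path")
print([d.text for d in result])
```

Here both old chunks of `a.txt` are removed before the new one is added, while the `b.txt` chunk is left alone.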

… document management

- Introduced a new 'deletion_field' input to specify a metadata field for deleting documents before loading new data.
- Enhanced the _add_documents_to_vector_store method to handle document deletion based on the specified field, improving data management capabilities.
@dosubot dosubot bot added the size:S (This PR changes 10-29 lines, ignoring generated files.) and enhancement (New feature or request) labels Jan 3, 2025

codspeed-hq bot commented Jan 3, 2025

CodSpeed Performance Report

Merging #5537 will degrade performance by 25.38%

Comparing smatiolids:feat/astra_deletion_based_on_metadata (8a422f0) with main (c5528c6)

Summary

❌ 2 regressions
✅ 13 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | smatiolids:feat/astra_deletion_based_on_metadata | Change |
|---|---|---|---|
| test_successful_run_with_input_type_any | 258.8 ms | 344.3 ms | -24.84% |
| test_successful_run_with_output_type_debug | 249.6 ms | 334.4 ms | -25.38% |
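As a rough sanity check on the Change column, the reported figures appear consistent with the difference between the two timings divided by the branch timing. That formula is an assumption inferred from the numbers above, not something the report states, and `percent_change` is a hypothetical helper. Because the displayed timings are rounded, the recomputed values only agree with the report to within a few hundredths of a percentage point.

```python
def percent_change(base_ms: float, head_ms: float) -> float:
    """Change of the branch relative to main, measured against the branch time.

    Assumed formula; a negative result means the branch is slower.
    """
    return (base_ms - head_ms) / head_ms * 100.0


# (main, branch) timings in milliseconds, copied from the report.
rows = {
    "test_successful_run_with_input_type_any": (258.8, 344.3),
    "test_successful_run_with_output_type_debug": (249.6, 334.4),
}

changes = {
    name: round(percent_change(base, head), 2)
    for name, (base, head) in rows.items()
}

for name, change in changes.items():
    print(f"{name}: {change}%")
```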

smatiolids and others added 2 commits January 3, 2025 18:07
…ove readability.

- Optimized the deletion logic by using a set comprehension to eliminate duplicates when gathering delete values from documents.
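To illustrate why a set comprehension helps here: a file is usually split into many chunks, so collecting the deletion-field value from every incoming document with a list would repeat the same value once per chunk. The snippet below is a sketch with hypothetical sample data (plain dicts instead of the component's document objects).

```python
# Hypothetical incoming chunks: three from a.txt, one from b.txt.
documents = [
    {"text": "chunk 1", "metadata": {"file_path": "a.txt"}},
    {"text": "chunk 2", "metadata": {"file_path": "a.txt"}},
    {"text": "chunk 3", "metadata": {"file_path": "a.txt"}},
    {"text": "chunk 4", "metadata": {"file_path": "b.txt"}},
]
deletion_field = "file_path"

# List comprehension: one entry per chunk, duplicates included.
with_duplicates = [doc["metadata"][deletion_field] for doc in documents]

# Set comprehension: one entry per distinct value.
delete_values = {doc["metadata"][deletion_field] for doc in documents}

print(len(with_duplicates), len(delete_values))
```

The deduplicated values keep the delete filter small regardless of how many chunks each file produces.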
@smatiolids smatiolids changed the title Feat/astra deletion based on metadata feat/astra deletion based on metadata Jan 3, 2025
@smatiolids smatiolids changed the title feat/astra deletion based on metadata feat: astra db deletion chunks based on metadata field Jan 3, 2025
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Jan 3, 2025
@smatiolids smatiolids changed the title feat: astra db deletion chunks based on metadata field feat: astra db chunks deletion based on metadata field Jan 3, 2025
@ogabrielluiz ogabrielluiz requested a review from erichare January 6, 2025 12:05
@@ -607,6 +616,18 @@ def _add_documents_to_vector_store(self, vector_store) -> None:
                msg = "Vector Store Inputs must be Data objects."
                raise TypeError(msg)

        if documents and self.deletion_field:
            self.log(f"Deleting documents where {self.deletion_field}")
Collaborator

nit: should we remove this log line?

Collaborator

@msmygit I can see the argument for deleting it, for sure. I'd say we can keep it for now: I don't think it exposes any unnecessary information, and it might help debug some issues.

Co-authored-by: Madhavan <msmygit@users.noreply.github.com>
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Jan 6, 2025
Collaborator

@erichare erichare left a comment

Agreed with @msmygit's suggestions; I just left the log message in for now. I committed the suggested change to the error log message. Everything else looks great. I think this is a good addition, though it will eventually have to be nicely documented to be used to its full potential. But I'm going to approve! Thanks!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jan 6, 2025
Collaborator

erichare commented Jan 6, 2025

I should add: the team is actively investigating a better ingestion experience for situations like this, but I think this is a solid solution for the time being.

@github-actions github-actions bot removed the enhancement New feature or request label Jan 8, 2025
@github-actions github-actions bot added the enhancement New feature or request label Jan 8, 2025
@erichare erichare added this pull request to the merge queue Jan 8, 2025
Merged via the queue into langflow-ai:main with commit 3df8130 Jan 8, 2025
35 of 36 checks passed
ogabrielluiz pushed a commit that referenced this pull request Jan 8, 2025
* feat: Add deletion_field parameter to AstraDBVectorStoreComponent for document management

- Introduced a new 'deletion_field' input to specify a metadata field for deleting documents before loading new data.
- Enhanced the _add_documents_to_vector_store method to handle document deletion based on the specified field, improving data management capabilities.

* Merging with main

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes

* - Enhanced the info string for the 'deletion_field' parameter to improve readability.
- Optimized the deletion logic by using a set comprehension to eliminate duplicates when gathering delete values from documents.

* [autofix.ci] apply automated fixes

* Update src/backend/base/langflow/components/vectorstores/astradb.py

Co-authored-by: Madhavan <msmygit@users.noreply.github.com>

* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Hare <ericrhare@gmail.com>
Co-authored-by: Madhavan <msmygit@users.noreply.github.com>
ogabrielluiz pushed a commit to raphaelchristi/langflow that referenced this pull request Jan 10, 2025
Labels
enhancement (New feature or request), lgtm (This PR has been approved by a maintainer), size:S (This PR changes 10-29 lines, ignoring generated files.)

3 participants