Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add concurrent insertion of vector rows in the Cassandra Vector Store #6772

Conversation

hemidactylus
Copy link
Contributor

Using the newest cassIO, the Cassandra Vector Store can now perform concurrent DB insertion in its add_texts method, hence improving its performance.

Linter unrelated errors

Running the linter, the following (likely unrelated) was observed:

langchain/vectorstores/mongodb_atlas.py:137: error: Argument 1 to "insert_many" of "Collection" has incompatible type "List[Dict[str, Sequence[object]]]"; expected "Iterable[Union[MongoDBDocumentType, RawBSONDocument]]"  [arg-type]
langchain/vectorstores/mongodb_atlas.py:186: error: Argument 1 to "aggregate" of "Collection" has incompatible type "List[object]"; expected "Sequence[Mapping[str, Any]]"  [arg-type]
langchain/vectorstores/qdrant.py: error: Name "grpc" is not defined  [name-defined]

Happy to be pointed at a fix, if the above is somehow caused by the code in this PR!

@vercel
Copy link

vercel bot commented Jun 26, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Jun 30, 2023 9:46pm

@hemidactylus
Copy link
Contributor Author

This last commit adapts the Cassandra Vector Store to the newest cassIO library, which introduces a few improvements (including interface changes). Also the delete method is implemented.
Moreover, an important non-deterministic source of trouble is identified and healed in the associated integration test (related to the MMR search test). There, test coverage is also extended to the delete methods.

Sorry about the poetry.lock conflict :(

@jbellis
Copy link
Contributor

jbellis commented Jun 29, 2023

@rlancemartin could you take a look at this Cassandra followup?

@rlancemartin rlancemartin self-assigned this Jun 29, 2023
@rlancemartin
Copy link
Collaborator

Thanks! Same as 6671, can you please rebase on master, resolve merge conflict in pyproject.toml and then re-generate poetry.lock with poetry lock --no-update? I can also take this (I'm already doing it for the other PR).

One other note: are there any conflicts between changes here and 6771 that we can resolve early (e.g., maybe they could be tested together by stacking the PRs).

@jbellis jbellis force-pushed the SL-cassandra-vectorstore-2 branch 2 times, most recently from f98a913 to 007cd32 Compare June 29, 2023 21:04
@jbellis
Copy link
Contributor

jbellis commented Jun 29, 2023

Rebased. But when trying to run tests I get

$ poetry run pytest tests/integration_tests/vectorstores/test_cassandra.py 
Warning: Found deprecated key 'default' or 'secondary' in pyproject.toml configuration for source azure-sdk-dev. Please provide the key 'priority' instead. Accepted values are: 'default', 'primary', 'secondary', 'supplemental', 'explicit'.
Warning: Found deprecated priority 'secondary' for source 'azure-sdk-dev' in pyproject.toml. Consider changing the priority to one of the non-deprecated values: 'default', 'primary', 'supplemental', 'explicit'.
Warning: In a future version of Poetry, PyPI will be disabled automatically if at least one custom source is configured with another priority than 'explicit'. In order to avoid a breaking change and make your pyproject.toml forward compatible, add PyPI explicitly via 'poetry source add pypi'. By the way, this has the advantage that you can set the priority of PyPI as with any other source.
ImportError while loading conftest '/home/jonathan/Projects/langchain/tests/integration_tests/vectorstores/conftest.py'.
tests/integration_tests/vectorstores/conftest.py:7: in <module>
    from langchain.document_loaders import TextLoader
langchain/__init__.py:6: in <module>
    from langchain.agents import MRKLChain, ReActChain, SelfAskWithSearchChain
langchain/agents/__init__.py:2: in <module>
    from langchain.agents.agent import (
langchain/agents/agent.py:16: in <module>
    from langchain.agents.tools import InvalidTool
langchain/agents/tools.py:4: in <module>
    from langchain.callbacks.manager import (
langchain/callbacks/__init__.py:11: in <module>
    from langchain.callbacks.manager import (
langchain/callbacks/manager.py:35: in <module>
    from langchain.callbacks.tracers.langchain import LangChainTracer
langchain/callbacks/tracers/__init__.py:3: in <module>
    from langchain.callbacks.tracers.langchain import LangChainTracer
langchain/callbacks/tracers/langchain.py:11: in <module>
    from langchainplus_sdk import LangChainPlusClient
E   ModuleNotFoundError: No module named 'langchainplus_sdk'

Am I doing it wrong?

@jbellis
Copy link
Contributor

jbellis commented Jun 29, 2023

Test problem was due to me not re-running poetry install

@rlancemartin
Copy link
Collaborator

Test problem was due to me not re-running poetry install

Glad you figured it out! I just kicked off tests.

@rlancemartin
Copy link
Collaborator

@jbellis please run make format to resolve Lint error.

@hemidactylus
Copy link
Contributor Author

@rlancemartin Hello, the linter/code style should be happy now :)

@rlancemartin rlancemartin mentioned this pull request Jun 30, 2023
@rlancemartin rlancemartin force-pushed the SL-cassandra-vectorstore-2 branch from cd0429c to 035772c Compare June 30, 2023 21:34
@hemidactylus hemidactylus force-pushed the SL-cassandra-vectorstore-2 branch 2 times, most recently from 035772c to a5b206c Compare June 30, 2023 21:36
@rlancemartin
Copy link
Collaborator

The repeated conflict with poetry.lock is odd. My local master was up-to-date after 6728 went it, but I still saw this merge conflict reported in the PR after re-generating poetry.lock. It looks like you might be creating a new branch? Tag me on that and we should be able to get this in shortly.

@hemidactylus
Copy link
Contributor Author

@rlancemartin Sorry, I might have inadvertently done something untoward.
I wanted to sync my fork to upstream (including this branch), saw conflicts in the github UI and went with a force reset of upstream. That was a few minutes ago.
I think I zapped away the very content of this PR (looking at the code).

@vercel vercel bot temporarily deployed to Production June 30, 2023 21:45 Inactive
@rlancemartin
Copy link
Collaborator

No problem. You should have the branch locally, and can force push your local branch to recover all.

@hemidactylus
Copy link
Contributor Author

Thank you! I actually did a git reset --hard on my pc but @jbellis will do the forcepush in a few minutes. Sorry about this and thank you for helping!

@jbellis
Copy link
Contributor

jbellis commented Jun 30, 2023

I pushed it back to where I left it yesterday after the reformat. Doesn't look like that is showing up yet on this PR page but I confirmed that the branch at https://github.com/hemidactylus/langchain/tree/SL-cassandra-vectorstore-2 is where it should be.

@hemidactylus
Copy link
Contributor Author

@rlancemartin perhaps the PR should be re-opened somehow? I see "no new commits" so I cannot reopen with this comment (yet the branch is restored).

@rlancemartin
Copy link
Collaborator

rlancemartin commented Jun 30, 2023

@rlancemartin perhaps the PR should be re-opened somehow? I see "no new commits" so I cannot reopen with this comment (yet the branch is restored).

Try git push hemidactylus SL-cassandra-vectorstore-2

I get permission denied when I try to push from my local branch now (this worked previously) -

git push -f hemidactylus SL-cassandra-vectorstore-2
Enumerating objects: 415, done.
Counting objects: 100% (354/354), done.
Delta compression using up to 8 threads
Compressing objects: 100% (190/190), done.
Writing objects: 100% (263/263), 170.37 KiB | 14.20 MiB/s, done.
Total 263 (delta 172), reused 130 (delta 66), pack-reused 0
remote: Resolving deltas: 100% (172/172), completed with 67 local objects.
To github.com:hemidactylus/langchain.git
 ! [remote rejected]   SL-cassandra-vectorstore-2 -> SL-cassandra-vectorstore-2 (permission denied)
error: failed to push some refs to 'github.com:hemidactylus/langchain.git'

Can you re-open the PR?

If not, you should be able to simply open a new PR. If you do that, you probably want to change the branch name.

@hemidactylus
Copy link
Contributor Author

Thank you, I went with the "new PR" option (#7017). Thank you again!

rlancemartin added a commit that referenced this pull request Jul 1, 2023
…ndra Vector Store (#7017)

Retrying with the same improvements as in #6772, this time trying not to
mess up with branches.

@rlancemartin doing a fresh new PR from a branch with a new name. This
should do. Thank you for your help!

---------

Co-authored-by: Jonathan Ellis <jbellis@datastax.com>
Co-authored-by: rlm <pexpresss31@gmail.com>
vowelparrot pushed a commit that referenced this pull request Jul 4, 2023
…ndra Vector Store (#7017)

Retrying with the same improvements as in #6772, this time trying not to
mess up with branches.

@rlancemartin doing a fresh new PR from a branch with a new name. This
should do. Thank you for your help!

---------

Co-authored-by: Jonathan Ellis <jbellis@datastax.com>
Co-authored-by: rlm <pexpresss31@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants