-
Notifications
You must be signed in to change notification settings - Fork 890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add new agents docs embbeddings functionality #305
Conversation
1524faa
to
d3d60c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on d3d60c8
- Looked at
292
lines of code in4
files - Took 1 minute and 9 seconds to review
More info
- Skipped
1
files when reviewing. - Skipped posting
0
additional comments because they didn't meet confidence threshold of50%
.
Workflow ID: wflow_VNhbdixwnNT2lJOA
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment (or add @ellipsis-agent to your repo for autocomplete). We'll respond in a few minutes. Learn more here. You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
d3d60c8
to
4f35b96
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on 4f35b96
- Looked at
292
lines of code in4
files - Took 1 minute and 26 seconds to review
More info
- Skipped
1
files when reviewing. - Skipped posting
0
additional comments because they didn't meet confidence threshold of50%
.
Workflow ID: wflow_42mqCJQjm4hHzDEw
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment (or add @ellipsis-agent to your repo for autocomplete). We'll respond in a few minutes. Learn more here. You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
agents-api/migrations/migrate_1714566760_change_embeddings_dimensions.py
Outdated
Show resolved
Hide resolved
agents-api/migrations/migrate_1714566760_change_embeddings_dimensions.py
Show resolved
Hide resolved
4f35b96
to
a7e7e1d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on a7e7e1d
- Looked at
343
lines of code in5
files - Took 1 minute and 39 seconds to review
More info
- Skipped
1
files when reviewing. - Skipped posting
2
additional comments because they didn't meet confidence threshold of50%
.
1. /agents-api/agents_api/exceptions.py:17
:
- Assessed confidence :
100%
- Grade:
0%
- Comment:
Consider enhancing the error message ofUnknownTokenizerError
to include more context, such as the model name or other relevant details, to aid in debugging.
super().__init__("unknown tokenizer for the given model")
- Reasoning:
TheUnknownTokenizerError
exception is raised without any specific details about the context or the tokenizer that caused the error. Providing more context in the error message can help in debugging and understanding the issue more clearly.
2. /agents-api/migrations/migrate_1714566760_change_embeddings_dimensions.py:9
:
- Assessed confidence :
66%
- Grade:
0%
- Comment:
The use of?
in the SQL query is unclear. If it's intended for parameter substitution, ensure it's supported by the SQL engine or provide further clarification in the code to avoid confusion.
-- Ensure parameter substitution is supported or clarify the use of `?` in this context
- Reasoning:
The migration script uses a placeholder?
in the SQL query which might be intended for parameter substitution, but it's not clear if this is correctly implemented or if it's a typo. This could potentially lead to SQL errors or unexpected behavior during the migration process.
Workflow ID: wflow_GATvwEah8F0vHMsU
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
a7e7e1d
to
4613316
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on 4613316
- Looked at
343
lines of code in5
files - Took 1 minute and 47 seconds to review
More info
- Skipped
1
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. /agents-api/agents_api/exceptions.py:1
:
- Assessed confidence :
100%
- Grade:
0%
- Comment:
The custom exceptions should inherit fromException
instead ofBaseException
to avoid confusion and potential issues in exception handling.
class BaseException(Exception):
- Reasoning:
TheBaseException
class in Python is a built-in exception base class, and it's generally not recommended to use it as a base for custom exceptions because it can lead to confusion with the built-inException
class. Instead, custom exceptions should typically inherit fromException
directly to avoid potential issues and confusion in exception handling.
Workflow ID: wflow_1YPZlYF58cXpTvRu
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
4613316
to
4969436
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Performed an incremental review on 4969436
- Looked at
343
lines of code in5
files - Took 1 minute and 9 seconds to review
More info
- Skipped
1
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. agents-api/agents_api/embed_models_registry.py:20
:
- Assessed confidence :
66%
- Grade:
0%
- Comment:
Thenormalize_l2
function does not handle the case where all vectors in a 2D array are zero vectors. This might lead to incorrect normalization results. Consider adjusting the normalization logic to handle this edge case appropriately. - Reasoning:
Thenormalize_l2
function inembed_models_registry.py
is designed to normalize vectors to unit length. However, the function does not handle the case where the input is a list of lists (2D array) and all vectors are zero vectors. In such a case, thenp.where
condition will return the original zero vectors without normalization, which might not be the intended behavior since a zero vector cannot be normalized to a unit vector. This could potentially lead to incorrect embedding normalization in scenarios where all embeddings are zero vectors.
Workflow ID: wflow_sXhDeznbQvk8TGss
You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
4969436
to
2923232
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on 2923232
- Looked at
349
lines of code in5
files - Took 1 minute and 4 seconds to review
More info
- Skipped
1
files when reviewing. - Skipped posting
0
additional comments because they didn't meet confidence threshold of50%
.
Workflow ID: wflow_caKLxhg9TFLSu1LS
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
2923232
to
e212bbd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on e212bbd
- Looked at
349
lines of code in5
files - Took 2 minutes and 2 seconds to review
More info
- Skipped
1
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. /agents-api/agents_api/embed_models_registry.py:93
:
- Assessed confidence :
100%
- Grade:
0%
- Comment:
The use oflen(e)
in thenormalize
method may raise an error ife
is annp.NDArray
with undefined shape or is zero-dimensional. Consider checking the shape ofe
before applying length operations. - Reasoning:
Thenormalize
method in theEmbeddingModel
class inembed_models_registry.py
assumes that the length of each embedding vector (len(e)
) is always available. However, ifembeddings
containsnp.NDArray
objects, usinglen(e)
directly could raise an error becauselen
is not generally applicable to arrays without a defined shape or if the array is zero-dimensional.
Workflow ID: wflow_escqlc60cDCiQESa
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
5a9d799
to
70af41a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Performed an incremental review on 5a9d799
- Looked at
98
lines of code in1
files - Took 1 minute and 23 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. agents-api/agents_api/embed_models_registry.py:103
:
- Assessed confidence :
100%
- Grade:
0%
- Comment:
The PR description states that theEmbeddingModel
class is introduced in this PR, but the class already exists in the codebase. Please update the PR description to reflect the actual changes made. - Reasoning:
The PR description mentions that theEmbeddingModel
class is introduced inembed_models_registry.py
, but the diff shows that the class already exists and only modifications are made. This discrepancy might be a misunderstanding or an error in the PR description.
Workflow ID: wflow_EgKAozIEhu1YWyhp
You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Performed an incremental review on 70af41a
- Looked at
98
lines of code in1
files - Took 1 minute and 22 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. agents-api/agents_api/embed_models_registry.py:144
:
- Assessed confidence :
100%
- Grade:
0%
- Comment:
The same issue as in the previous comments applies here. TheTokenizer
class does not support thefrom_pretrained
method. Please correct this to usePreTrainedTokenizer
from thetransformers
library.
tokenizer=PreTrainedTokenizer.from_pretrained("BAAI/llm-embedder"),
- Reasoning:
The same issue with the incorrect use ofTokenizer.from_pretrained
method from thetokenizers
library is repeated for other models as well. This needs to be corrected to ensure the code does not raise an AttributeError.
Workflow ID: wflow_I1WFZmCOYHwweMre
You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on 75dac25
- Looked at
131
lines of code in1
files - Took 1 minute and 10 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
0
additional comments because they didn't meet confidence threshold of50%
.
Workflow ID: wflow_IBgdYwc43VIipcIq
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
|
||
def up(client): | ||
for q in queries_to_run: | ||
import pdb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of import pdb
and pdb.set_trace()
in a production migration script is inappropriate as it can halt the migration process requiring manual intervention. Consider removing or replacing it with proper error logging and handling.
Signed-off-by: Diwank Singh Tomer <diwank.singh@gmail.com>
75dac25
to
68ce7f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Performed an incremental review on 68ce7f5
- Looked at
114
lines of code in1
files - Took 59 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. agents-api/migrations/migrate_1714566760_change_embeddings_dimensions.py:133
:
- Assessed confidence :
0%
- Comment:
The order of operations inqueries_to_run
is correct, ensuring indices are dropped before changing dimensions and recreated afterwards. This avoids potential conflicts or errors during the schema update. - Reasoning:
The migration script seems to be dropping the existing indices before changing the dimensions of the embeddings, which is a good practice to avoid conflicts or errors during the schema update. However, the order of operations in thequeries_to_run
list is crucial. The script first drops the indices, then changes the dimensions, and finally recreates the indices with the new dimensions. This is the correct sequence to ensure that the indices are compatible with the new schema.
Workflow ID: wflow_KOJlfTYQRpL74SB5
You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on d762fa1
- Looked at
63
lines of code in3
files - Took 1 minute and 43 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
2
additional comments because they didn't meet confidence threshold of50%
.
1. agents-api/agents_api/clients/embed.py:8
:
- Assessed confidence :
100%
- Grade:
0%
- Comment:
Consider providing a specific default model name or handling the default case within the function to avoid potential runtime errors related toembedding_model_id
.
embed_model_name: str = "default_model_name",
- Reasoning:
The default value forembed_model_name
in theembed
function is set toembedding_model_id
, which is imported from the environment. This might lead to confusion or errors ifembedding_model_id
is not defined in the environment or if it's not intended to be used as a default model ID. It would be clearer to either provide a specific default model name or handle the default case within the function to avoid potential runtime errors.
2. agents-api/agents_api/embed_models_registry.py:81
:
- Assessed confidence :
100%
- Grade:
0%
- Comment:
The parameter name should bemodel_id
to match the expected parameter in theembed
function.
embeddings = await embed(input, model_id=self.embedding_model_name)
- Reasoning:
Theembed
function inembed_models_registry.py
is called with an incorrect parameter nameembed_model_name
. The correct parameter name should bemodel_id
as per the definition inagents_api/clients/embed.py
. This mismatch will cause a runtime error because theembed
function expects a parameter namedmodel_id
notembed_model_name
.
Workflow ID: wflow_oabailAEbPqlbrez
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
@@ -302,7 +305,13 @@ async def list_agents( | |||
@router.post("/agents/{agent_id}/docs", tags=["agents"]) | |||
async def create_docs(agent_id: UUID4, request: CreateDoc) -> ResourceCreatedResponse: | |||
doc_id = uuid4() | |||
content = [request.content] if isinstance(request.content, str) else request.content | |||
content = [ | |||
(c.model_dump() if isinstance(c, ContentItem) else c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure that ContentItem
has a model_dump()
method to avoid potential AttributeError. If it does not exist, consider implementing it or handling this scenario appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on 67476aa
- Looked at
26
lines of code in1
files - Took 44 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
0
additional comments because they didn't meet confidence threshold of50%
.
Workflow ID: wflow_nwe7eZ6pIDuVlQpt
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
@@ -238,7 +239,12 @@ async def list_users( | |||
@router.post("/users/{user_id}/docs", tags=["users"]) | |||
async def create_docs(user_id: UUID4, request: CreateDoc) -> ResourceCreatedResponse: | |||
doc_id = uuid4() | |||
content = [request.content] if isinstance(request.content, str) else request.content | |||
content = [ | |||
(c.model_dump() if isinstance(c, ContentItem) else c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of model_dump()
on instances of ContentItem
is incorrect as this method does not exist in the ContentItem
class. This will cause an AttributeError at runtime. Please revise the handling of ContentItem
objects to correctly extract the necessary data.
(c.model_dump() if isinstance(c, ContentItem) else c) | |
(c.content if isinstance(c, ContentItem) else c) |
67476aa
to
d2b8c32
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on d2b8c32
- Looked at
187
lines of code in5
files - Took 53 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
0
additional comments because they didn't meet confidence threshold of50%
.
Workflow ID: wflow_ChcyCDiQiJx6hUjQ
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
agents-api/docker-compose.yml
Outdated
@@ -73,6 +73,24 @@ services: | |||
count: all | |||
capabilities: [gpu] | |||
|
|||
docs-text-embeddings-inference: | |||
container_name: text-embeddings-inference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The container name text-embeddings-inference
is used for both the text-embeddings-inference
and docs-text-embeddings-inference
services. This could lead to conflicts. Consider using a unique container name for each service.
container_name: text-embeddings-inference | |
container_name: docs-text-embeddings-inference |
d2b8c32
to
e4704f9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on e4704f9
- Looked at
187
lines of code in5
files - Took 54 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
0
additional comments because they didn't meet confidence threshold of50%
.
Workflow ID: wflow_WFmywHjoc28UjQGi
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
@@ -5,6 +5,8 @@ | |||
async def embed( | |||
inputs: list[str], | |||
join_inputs=False, | |||
embedding_service_url: str = embedding_service_url, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default parameter embedding_service_url
shadows the imported embedding_service_url
from env.py
. Consider renaming the parameter to avoid confusion and potential bugs.
embedding_service_url: str = embedding_service_url, | |
embedding_service_url: str = None, |
@@ -5,6 +5,8 @@ | |||
async def embed( | |||
inputs: list[str], | |||
join_inputs=False, | |||
embedding_service_url: str = embedding_service_url, | |||
embedding_model_name: str = embedding_model_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default parameter embedding_model_name
shadows the imported embedding_model_id
from env.py
. Consider renaming the parameter to avoid confusion and potential bugs.
embedding_model_name: str = embedding_model_id, | |
embedding_model_name: str = None, |
e4704f9
to
f6443b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on f6443b1
- Looked at
229
lines of code in6
files - Took 1 minute and 1 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. /agents-api/agents_api/embed_models_registry.py:84
:
- Assessed confidence :
100%
- Grade:
0%
- Comment:
Ensure that theembed
function inclients/embed.py
is correctly handling theembedding_service_url
parameter to avoid unexpected behavior due to the shadowing issue in its default parameters. - Reasoning:
TheEmbeddingModel
class inembed_models_registry.py
has a methodembed
that uses theembed
function fromclients/embed.py
. Theembed
function is called withembedding_service_url
andembedding_model_name
parameters. However, theembed
function inclients/embed.py
has a bug where the default parameterembedding_service_url
shadows the importedembedding_service_url
fromenv.py
. This could lead to unexpected behavior if theembed
function is called without explicitly passing theembedding_service_url
parameter.
Workflow ID: wflow_hIChwDncEKaw4iGO
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
@@ -5,6 +5,8 @@ | |||
async def embed( | |||
inputs: list[str], | |||
join_inputs=False, | |||
embedding_service_url: str = embedding_service_url, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default parameter embedding_service_url
shadows the imported embedding_service_url
from env.py
. Consider renaming the parameter to avoid confusion and potential bugs.
embedding_service_url: str = embedding_service_url, | |
embedding_service_url: str = None, |
f6443b1
to
327d0c6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on 327d0c6
- Looked at
250
lines of code in6
files - Took 47 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
0
additional comments because they didn't meet confidence threshold of50%
.
Workflow ID: wflow_tuPmQhPAhWFcYTk3
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet
mode, and more. See docs.
@@ -5,6 +5,8 @@ | |||
async def embed( | |||
inputs: list[str], | |||
join_inputs=False, | |||
embedding_service_url: str = embedding_service_url, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default parameter embedding_service_url
shadows the import from env.py
. Consider renaming the parameter to avoid confusion and potential bugs.
embedding_service_url: str = embedding_service_url, | |
embedding_service_url: str = default_embedding_service_url, |
Summary:
This PR enhances document embedding functionality, updates the database schema, and includes new dependencies for improved data handling and machine learning.
Key points:
EmbeddingModel
class inembed_models_registry.py
routers.py
to useEmbeddingModel
for embedding document snippetsmigrate_1714566760_change_embeddings_dimensions.py
to update embedding dimensionsnumpy
andtransformers
dependencies inpyproject.toml
for enhanced data handling and machine learning capabilitiesEmbeddingModel
class inembed_models_registry.py
to support dynamic model selectionembed
function inclients/embed.py
to accept a model name parameterGenerated with ❤️ by ellipsis.dev