Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add new agents docs embbeddings functionality #305

Merged
merged 7 commits into from
May 4, 2024

Conversation

whiterabbit1983
Copy link
Contributor

@whiterabbit1983 whiterabbit1983 commented May 1, 2024

🚀 This description was created by Ellipsis for commit 327d0c6

Summary:

This PR enhances document embedding functionality, updates the database schema, and includes new dependencies for improved data handling and machine learning.

Key points:

  • Enhances document embedding functionality using various models
  • Updates database schema for new embedding dimensions
  • Includes necessary dependencies, focusing on agents API
  • Introduces EmbeddingModel class in embed_models_registry.py
  • Updates routers.py to use EmbeddingModel for embedding document snippets
  • Adds migration script migrate_1714566760_change_embeddings_dimensions.py to update embedding dimensions
  • Modifies embedding dimensions in the database
  • Updates and adds indices for embeddings in the database
  • Adds numpy and transformers dependencies in pyproject.toml for enhanced data handling and machine learning capabilities
  • Modifies EmbeddingModel class in embed_models_registry.py to support dynamic model selection
  • Updates embed function in clients/embed.py to accept a model name parameter

Generated with ❤️ by ellipsis.dev

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problems found on commit 1524faa.


You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Performed an incremental review on d3d60c8
  • Looked at 292 lines of code in 4 files
  • Took 1 minute and 9 seconds to review
More info
  • Skipped 1 files when reviewing.
  • Skipped posting 0 additional comments because they didn't meet confidence threshold of 50%.

Workflow ID: wflow_VNhbdixwnNT2lJOA


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment (or add @ellipsis-agent to your repo for autocomplete). We'll respond in a few minutes. Learn more here. You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Performed an incremental review on 4f35b96
  • Looked at 292 lines of code in 4 files
  • Took 1 minute and 26 seconds to review
More info
  • Skipped 1 files when reviewing.
  • Skipped posting 0 additional comments because they didn't meet confidence threshold of 50%.

Workflow ID: wflow_42mqCJQjm4hHzDEw


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment (or add @ellipsis-agent to your repo for autocomplete). We'll respond in a few minutes. Learn more here. You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Performed an incremental review on a7e7e1d
  • Looked at 343 lines of code in 5 files
  • Took 1 minute and 39 seconds to review
More info
  • Skipped 1 files when reviewing.
  • Skipped posting 2 additional comments because they didn't meet confidence threshold of 50%.
1. /agents-api/agents_api/exceptions.py:17:
  • Assessed confidence : 100%
  • Grade: 0%
  • Comment:
    Consider enhancing the error message of UnknownTokenizerError to include more context, such as the model name or other relevant details, to aid in debugging.
        super().__init__("unknown tokenizer for the given model")
  • Reasoning:
    The UnknownTokenizerError exception is raised without any specific details about the context or the tokenizer that caused the error. Providing more context in the error message can help in debugging and understanding the issue more clearly.
2. /agents-api/migrations/migrate_1714566760_change_embeddings_dimensions.py:9:
  • Assessed confidence : 66%
  • Grade: 0%
  • Comment:
    The use of ? in the SQL query is unclear. If it's intended for parameter substitution, ensure it's supported by the SQL engine or provide further clarification in the code to avoid confusion.
    -- Ensure parameter substitution is supported or clarify the use of `?` in this context
  • Reasoning:
    The migration script uses a placeholder ? in the SQL query which might be intended for parameter substitution, but it's not clear if this is correctly implemented or if it's a typo. This could potentially lead to SQL errors or unexpected behavior during the migration process.

Workflow ID: wflow_GATvwEah8F0vHMsU


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

agents-api/agents_api/exceptions.py Outdated Show resolved Hide resolved
agents-api/agents_api/exceptions.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Performed an incremental review on 4613316
  • Looked at 343 lines of code in 5 files
  • Took 1 minute and 47 seconds to review
More info
  • Skipped 1 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. /agents-api/agents_api/exceptions.py:1:
  • Assessed confidence : 100%
  • Grade: 0%
  • Comment:
    The custom exceptions should inherit from Exception instead of BaseException to avoid confusion and potential issues in exception handling.
class BaseException(Exception):
  • Reasoning:
    The BaseException class in Python is a built-in exception base class, and it's generally not recommended to use it as a base for custom exceptions because it can lead to confusion with the built-in Exception class. Instead, custom exceptions should typically inherit from Exception directly to avoid potential issues and confusion in exception handling.

Workflow ID: wflow_1YPZlYF58cXpTvRu


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me!

  • Performed an incremental review on 4969436
  • Looked at 343 lines of code in 5 files
  • Took 1 minute and 9 seconds to review
More info
  • Skipped 1 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. agents-api/agents_api/embed_models_registry.py:20:
  • Assessed confidence : 66%
  • Grade: 0%
  • Comment:
    The normalize_l2 function does not handle the case where all vectors in a 2D array are zero vectors. This might lead to incorrect normalization results. Consider adjusting the normalization logic to handle this edge case appropriately.
  • Reasoning:
    The normalize_l2 function in embed_models_registry.py is designed to normalize vectors to unit length. However, the function does not handle the case where the input is a list of lists (2D array) and all vectors are zero vectors. In such a case, the np.where condition will return the original zero vectors without normalization, which might not be the intended behavior since a zero vector cannot be normalized to a unit vector. This could potentially lead to incorrect embedding normalization in scenarios where all embeddings are zero vectors.

Workflow ID: wflow_sXhDeznbQvk8TGss


You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Performed an incremental review on 2923232
  • Looked at 349 lines of code in 5 files
  • Took 1 minute and 4 seconds to review
More info
  • Skipped 1 files when reviewing.
  • Skipped posting 0 additional comments because they didn't meet confidence threshold of 50%.

Workflow ID: wflow_caKLxhg9TFLSu1LS


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

agents-api/agents_api/exceptions.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Performed an incremental review on e212bbd
  • Looked at 349 lines of code in 5 files
  • Took 2 minutes and 2 seconds to review
More info
  • Skipped 1 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. /agents-api/agents_api/embed_models_registry.py:93:
  • Assessed confidence : 100%
  • Grade: 0%
  • Comment:
    The use of len(e) in the normalize method may raise an error if e is an np.NDArray with undefined shape or is zero-dimensional. Consider checking the shape of e before applying length operations.
  • Reasoning:
    The normalize method in the EmbeddingModel class in embed_models_registry.py assumes that the length of each embedding vector (len(e)) is always available. However, if embeddings contains np.NDArray objects, using len(e) directly could raise an error because len is not generally applicable to arrays without a defined shape or if the array is zero-dimensional.

Workflow ID: wflow_escqlc60cDCiQESa


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me!

  • Performed an incremental review on 5a9d799
  • Looked at 98 lines of code in 1 files
  • Took 1 minute and 23 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. agents-api/agents_api/embed_models_registry.py:103:
  • Assessed confidence : 100%
  • Grade: 0%
  • Comment:
    The PR description states that the EmbeddingModel class is introduced in this PR, but the class already exists in the codebase. Please update the PR description to reflect the actual changes made.
  • Reasoning:
    The PR description mentions that the EmbeddingModel class is introduced in embed_models_registry.py, but the diff shows that the class already exists and only modifications are made. This discrepancy might be a misunderstanding or an error in the PR description.

Workflow ID: wflow_EgKAozIEhu1YWyhp


You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me!

  • Performed an incremental review on 70af41a
  • Looked at 98 lines of code in 1 files
  • Took 1 minute and 22 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. agents-api/agents_api/embed_models_registry.py:144:
  • Assessed confidence : 100%
  • Grade: 0%
  • Comment:
    The same issue as in the previous comments applies here. The Tokenizer class does not support the from_pretrained method. Please correct this to use PreTrainedTokenizer from the transformers library.
tokenizer=PreTrainedTokenizer.from_pretrained("BAAI/llm-embedder"),
  • Reasoning:
    The same issue with the incorrect use of Tokenizer.from_pretrained method from the tokenizers library is repeated for other models as well. This needs to be corrected to ensure the code does not raise an AttributeError.

Workflow ID: wflow_I1WFZmCOYHwweMre


You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Performed an incremental review on 75dac25
  • Looked at 131 lines of code in 1 files
  • Took 1 minute and 10 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 0 additional comments because they didn't meet confidence threshold of 50%.

Workflow ID: wflow_IBgdYwc43VIipcIq


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.


def up(client):
for q in queries_to_run:
import pdb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of import pdb and pdb.set_trace() in a production migration script is inappropriate as it can halt the migration process requiring manual intervention. Consider removing or replacing it with proper error logging and handling.

Signed-off-by: Diwank Singh Tomer <diwank.singh@gmail.com>
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me!

  • Performed an incremental review on 68ce7f5
  • Looked at 114 lines of code in 1 files
  • Took 59 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. agents-api/migrations/migrate_1714566760_change_embeddings_dimensions.py:133:
  • Assessed confidence : 0%
  • Comment:
    The order of operations in queries_to_run is correct, ensuring indices are dropped before changing dimensions and recreated afterwards. This avoids potential conflicts or errors during the schema update.
  • Reasoning:
    The migration script seems to be dropping the existing indices before changing the dimensions of the embeddings, which is a good practice to avoid conflicts or errors during the schema update. However, the order of operations in the queries_to_run list is crucial. The script first drops the indices, then changes the dimensions, and finally recreates the indices with the new dimensions. This is the correct sequence to ensure that the indices are compatible with the new schema.

Workflow ID: wflow_KOJlfTYQRpL74SB5


You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Performed an incremental review on d762fa1
  • Looked at 63 lines of code in 3 files
  • Took 1 minute and 43 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 2 additional comments because they didn't meet confidence threshold of 50%.
1. agents-api/agents_api/clients/embed.py:8:
  • Assessed confidence : 100%
  • Grade: 0%
  • Comment:
    Consider providing a specific default model name or handling the default case within the function to avoid potential runtime errors related to embedding_model_id.
embed_model_name: str = "default_model_name",
  • Reasoning:
    The default value for embed_model_name in the embed function is set to embedding_model_id, which is imported from the environment. This might lead to confusion or errors if embedding_model_id is not defined in the environment or if it's not intended to be used as a default model ID. It would be clearer to either provide a specific default model name or handle the default case within the function to avoid potential runtime errors.
2. agents-api/agents_api/embed_models_registry.py:81:
  • Assessed confidence : 100%
  • Grade: 0%
  • Comment:
    The parameter name should be model_id to match the expected parameter in the embed function.
embeddings = await embed(input, model_id=self.embedding_model_name)
  • Reasoning:
    The embed function in embed_models_registry.py is called with an incorrect parameter name embed_model_name. The correct parameter name should be model_id as per the definition in agents_api/clients/embed.py. This mismatch will cause a runtime error because the embed function expects a parameter named model_id not embed_model_name.

Workflow ID: wflow_oabailAEbPqlbrez


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

@@ -302,7 +305,13 @@ async def list_agents(
@router.post("/agents/{agent_id}/docs", tags=["agents"])
async def create_docs(agent_id: UUID4, request: CreateDoc) -> ResourceCreatedResponse:
doc_id = uuid4()
content = [request.content] if isinstance(request.content, str) else request.content
content = [
(c.model_dump() if isinstance(c, ContentItem) else c)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure that ContentItem has a model_dump() method to avoid potential AttributeError. If it does not exist, consider implementing it or handling this scenario appropriately.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Performed an incremental review on 67476aa
  • Looked at 26 lines of code in 1 files
  • Took 44 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 0 additional comments because they didn't meet confidence threshold of 50%.

Workflow ID: wflow_nwe7eZ6pIDuVlQpt


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

@@ -238,7 +239,12 @@ async def list_users(
@router.post("/users/{user_id}/docs", tags=["users"])
async def create_docs(user_id: UUID4, request: CreateDoc) -> ResourceCreatedResponse:
doc_id = uuid4()
content = [request.content] if isinstance(request.content, str) else request.content
content = [
(c.model_dump() if isinstance(c, ContentItem) else c)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of model_dump() on instances of ContentItem is incorrect as this method does not exist in the ContentItem class. This will cause an AttributeError at runtime. Please revise the handling of ContentItem objects to correctly extract the necessary data.

Suggested change
(c.model_dump() if isinstance(c, ContentItem) else c)
(c.content if isinstance(c, ContentItem) else c)

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Performed an incremental review on d2b8c32
  • Looked at 187 lines of code in 5 files
  • Took 53 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 0 additional comments because they didn't meet confidence threshold of 50%.

Workflow ID: wflow_ChcyCDiQiJx6hUjQ


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

@@ -73,6 +73,24 @@ services:
count: all
capabilities: [gpu]

docs-text-embeddings-inference:
container_name: text-embeddings-inference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The container name text-embeddings-inference is used for both the text-embeddings-inference and docs-text-embeddings-inference services. This could lead to conflicts. Consider using a unique container name for each service.

Suggested change
container_name: text-embeddings-inference
container_name: docs-text-embeddings-inference

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Performed an incremental review on e4704f9
  • Looked at 187 lines of code in 5 files
  • Took 54 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 0 additional comments because they didn't meet confidence threshold of 50%.

Workflow ID: wflow_WFmywHjoc28UjQGi


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

@@ -5,6 +5,8 @@
async def embed(
inputs: list[str],
join_inputs=False,
embedding_service_url: str = embedding_service_url,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default parameter embedding_service_url shadows the imported embedding_service_url from env.py. Consider renaming the parameter to avoid confusion and potential bugs.

Suggested change
embedding_service_url: str = embedding_service_url,
embedding_service_url: str = None,

@@ -5,6 +5,8 @@
async def embed(
inputs: list[str],
join_inputs=False,
embedding_service_url: str = embedding_service_url,
embedding_model_name: str = embedding_model_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default parameter embedding_model_name shadows the imported embedding_model_id from env.py. Consider renaming the parameter to avoid confusion and potential bugs.

Suggested change
embedding_model_name: str = embedding_model_id,
embedding_model_name: str = None,

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Performed an incremental review on f6443b1
  • Looked at 229 lines of code in 6 files
  • Took 1 minute and 1 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. /agents-api/agents_api/embed_models_registry.py:84:
  • Assessed confidence : 100%
  • Grade: 0%
  • Comment:
    Ensure that the embed function in clients/embed.py is correctly handling the embedding_service_url parameter to avoid unexpected behavior due to the shadowing issue in its default parameters.
  • Reasoning:
    The EmbeddingModel class in embed_models_registry.py has a method embed that uses the embed function from clients/embed.py. The embed function is called with embedding_service_url and embedding_model_name parameters. However, the embed function in clients/embed.py has a bug where the default parameter embedding_service_url shadows the imported embedding_service_url from env.py. This could lead to unexpected behavior if the embed function is called without explicitly passing the embedding_service_url parameter.

Workflow ID: wflow_hIChwDncEKaw4iGO


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

@@ -5,6 +5,8 @@
async def embed(
inputs: list[str],
join_inputs=False,
embedding_service_url: str = embedding_service_url,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default parameter embedding_service_url shadows the imported embedding_service_url from env.py. Consider renaming the parameter to avoid confusion and potential bugs.

Suggested change
embedding_service_url: str = embedding_service_url,
embedding_service_url: str = None,

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Performed an incremental review on 327d0c6
  • Looked at 250 lines of code in 6 files
  • Took 47 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 0 additional comments because they didn't meet confidence threshold of 50%.

Workflow ID: wflow_tuPmQhPAhWFcYTk3


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with review rules, user-specific overrides, quiet mode, and more. See docs.

@@ -5,6 +5,8 @@
async def embed(
inputs: list[str],
join_inputs=False,
embedding_service_url: str = embedding_service_url,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default parameter embedding_service_url shadows the import from env.py. Consider renaming the parameter to avoid confusion and potential bugs.

Suggested change
embedding_service_url: str = embedding_service_url,
embedding_service_url: str = default_embedding_service_url,

@whiterabbit1983 whiterabbit1983 merged commit 968c446 into dev May 4, 2024
6 checks passed
@whiterabbit1983 whiterabbit1983 deleted the f/docs-ingestion branch May 4, 2024 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants