Expand the management function of the dataset - Narrowing down the search scope of the dataset based on paths and more metadata #10170

glacierck · 2024-11-01T10:18:23Z

Self Checks

I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

We are developing a dynamic knowledge base system that implements document question answering based on RAG (Retrieval-Augmented Generation). To improve retrieval efficiency and accuracy, we need to be able to restrict the search scope for vector retrieval to a path supplied by the user at runtime, rather than blindly searching the entire dataset.
The scenario I evaluated has 100,000 documents. Vector recall only returns the top-n matches, so running retrieval over the full dataset at that scale leads to performance issues and unreliable accuracy.

Specific Requirements:

Vector Retrieval within a Specified Path Range:

Users should be able to input a specific file path, such as '/dataset-1/dir-1/a.docx'.
The system needs to be able to perform vector retrieval within the specified range (i.e., the file or the directory and its subdirectories) based on this path.
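
The requirement above could be sketched roughly as follows. This is a minimal illustration, not Dify's implementation: the `Chunk` class, its `path` metadata field, and the `scoped_search` function are hypothetical names, and an in-memory cosine ranking stands in for a real vector store. The idea is to filter candidates by path first, then rank only what remains.

```python
from dataclasses import dataclass
from math import sqrt
from pathlib import PurePosixPath


@dataclass
class Chunk:
    path: str            # hypothetical: source document path stored as metadata
    text: str
    vector: list[float]  # embedding of the chunk


def in_scope(chunk_path: str, scope: str) -> bool:
    """True if chunk_path is the scope itself or lies anywhere beneath it."""
    p, s = PurePosixPath(chunk_path), PurePosixPath(scope)
    # Comparing whole path components avoids '/dir-1' matching '/dir-10'.
    return p == s or s in p.parents


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scoped_search(chunks, query_vector, scope, top_n=3):
    """Keep only chunks inside the scope, then return the top_n most similar."""
    candidates = [c for c in chunks if in_scope(c.path, scope)]
    candidates.sort(key=lambda c: cosine(c.vector, query_vector), reverse=True)
    return candidates[:top_n]
```

With this shape, passing `'/dataset-1/dir-1/a.docx'` limits ranking to that one file, while passing `'/dataset-1/dir-1'` also covers its subdirectories. In a real deployment the path filter would more likely be pushed down into the vector store as a metadata condition rather than applied in Python.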

Multi-level Precise Q&A Search:

Q&A within the Dataset: Retrieval is conducted across the entire dataset, but users can restrict the search scope through paths to improve efficiency.
Q&A within a Directory: Retrieval is conducted within a specific directory (and its subdirectories) specified by the user for more refined searching.
Q&A within a Document: Retrieval is conducted within a single document specified by the user for the most precise searching.
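
One way to picture the three levels is as a classification of the user-supplied path. The sketch below is only an assumption about how that might look: `ScopeLevel` and `classify_scope` are hypothetical names, and it assumes the first path component is the dataset name, as in the example path `'/dataset-1/dir-1/a.docx'`.

```python
from enum import Enum
from pathlib import PurePosixPath


class ScopeLevel(Enum):
    DATASET = "dataset"
    DIRECTORY = "directory"
    DOCUMENT = "document"


def classify_scope(scope: str, document_paths: set[str]) -> ScopeLevel:
    """Map a user-supplied path to one of the three Q&A levels.

    document_paths is the set of known document paths (hypothetical index).
    """
    p = PurePosixPath(scope)
    if str(p) in document_paths:
        return ScopeLevel.DOCUMENT
    # parts of '/dataset-1' are ('/', 'dataset-1'): root plus dataset name.
    if len(p.parts) <= 2:
        return ScopeLevel.DATASET
    return ScopeLevel.DIRECTORY
```

The returned level could then decide how retrieval is scoped: the whole dataset, a directory subtree, or a single document.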

Desired Outcomes:

By implementing the retrieval function within a specified path range, users can more flexibly control the scope and precision of their searches, thereby obtaining more efficient search results in different scenarios.
This function should significantly enhance the Q&A performance and user experience of the dynamic knowledge base system.

2. Additional context or comments

#5928: This issue provides a solution, but I think cross-dataset retrieval may make the list of datasets harder to manage. The same effect can be achieved by adding a directory management dimension within the dataset.

3. Can you help us with this feature?

I am interested in contributing to this feature.

crazywoola · 2024-11-01T11:20:35Z

You can discuss this with @Yawen-1010. She is the PM for RAG.

glacierck · 2024-11-04T08:39:37Z

@Yawen-1010 Is there a development plan or design blueprint for the dataset? I would like to know what is planned for dataset management and fragment retrieval. Our team is considering fully adopting Dify for our AI agents. Currently, the dataset-related functions are in a primitive state. We hope to join you and help move this area forward.

glacierck · 2024-11-04T08:44:07Z

@Yawen-1010 In any case, dataset retrieval in the workflow only accepts a single input parameter, 'query', which is too limited.

ZYW-Mia · 2024-11-07T05:48:22Z

Hi, @glacierck .
Thank you for your request, which is clear in goal, specific in scenario, and detailed in description. We have received many similar requests, and we are currently designing the metadata function of the knowledge base to solve this problem. We can discuss and communicate on this issue.

dosubot · 2024-12-08T16:05:19Z

Hi, @glacierck. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary:

Enhancement request for a dynamic knowledge base system to improve search scope narrowing using file paths and metadata.
@crazywoola suggested discussing with @Yawen-1010, the PM of the RAG.
You inquired about development plans or design blueprints for dataset management and retrieval.
@ZYW-Mia acknowledged the request and mentioned ongoing design of a metadata function to address similar requests.

Next Steps:

Please let us know if this issue is still relevant to the latest version of the Dify repository by commenting here.
If there is no further activity, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

dosubot bot added the 💪 enhancement label (New feature or request) Nov 1, 2024

glacierck changed the title ~~Expand the management function of the dataset~~ Expand the management function of the dataset - Narrowing down the search scope based on the path Nov 1, 2024

glacierck changed the title ~~Expand the management function of the dataset - Narrowing down the search scope based on the path~~ Expand the management function of the dataset - Narrowing down the search scope of the dataset based on paths and more metadata Nov 1, 2024

crazywoola assigned Yawen-1010 Nov 1, 2024

dosubot bot added the stale label (Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed) Dec 8, 2024

dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 23, 2024

dosubot bot removed the stale label Dec 23, 2024
