Markdown ingest is weird: text before first header is missing #11828

Open
5 tasks done
cpwan opened this issue Dec 19, 2024 · 1 comment
Labels
🐞 bug Something isn't working

Comments


cpwan commented Dec 19, 2024

Self Checks

  • This is only for bug reports. If you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.14.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Ingest the following markdown into "Knowledge"

Any text above the first # is gone!

# apple 

## banana

## Cherry

### dify

i am a bad boy

===

lol???

The rendered example looks like this: [image]

✔️ Expected Behavior

The text "Any text above the first # is gone!" should be ingested.

❌ Actual Behavior

It is gone. As a consequence, there is no chunk for any text ahead of the first #.

cpwan changed the title from "Markdown ingest is weird" to "Markdown ingest is weird: text before first header is missing" on Dec 19, 2024
dosubot bot added the "🐞 bug" (Something isn't working) label on Dec 19, 2024

dosubot bot commented Dec 19, 2024

The issue you're experiencing with markdown ingestion in Dify, where text before the first markdown header (#) is not being ingested, might be related to how the markdown processing is implemented. Unfortunately, I couldn't find specific details in the codebase about how text before the first header is handled during ingestion.

However, the MarkdownExtractor class is responsible for loading markdown files and converting them into Document objects, which involves parsing the markdown into tuples of headers and their corresponding text content [1]. This might be where the issue arises, as the text before the first header might not be included in these tuples.
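To make the suspected failure mode concrete, here is a minimal sketch of a header-based splitter. It is not Dify's actual MarkdownExtractor code; the function name `split_by_headers` and the `keep_preamble` flag are hypothetical and exist only for illustration. With `keep_preamble=False`, any text collected before the first header is dropped when that header is reached, which matches the behavior reported in this issue.

```python
import re
from typing import List, Optional, Tuple

# A line counts as a header if it starts with one or more '#' followed by whitespace.
HEADER_RE = re.compile(r"^#+\s")

# The markdown from "Steps to reproduce".
SAMPLE = (
    "Any text above the first # is gone!\n\n"
    "# apple \n\n## banana\n\n## Cherry\n\n### dify\n\n"
    "i am a bad boy\n\n===\n\nlol???\n"
)


def split_by_headers(
    markdown_text: str, keep_preamble: bool = True
) -> List[Tuple[Optional[str], str]]:
    """Hypothetical helper: split markdown into (header, text) tuples.

    Text that appears before the first header is emitted with header None
    when keep_preamble is True, and silently discarded otherwise.
    """
    sections: List[Tuple[Optional[str], str]] = []
    current_header: Optional[str] = None
    current_lines: List[str] = []

    def flush() -> None:
        text = "\n".join(current_lines).strip()
        if current_header is not None:
            sections.append((current_header, text))
        elif keep_preamble and text:
            # Preamble: content with no header above it.
            sections.append((None, text))

    for line in markdown_text.split("\n"):
        if HEADER_RE.match(line):
            flush()
            current_header = line.strip()
            current_lines = []
        else:
            current_lines.append(line)
    flush()
    return sections


with_preamble = split_by_headers(SAMPLE, keep_preamble=True)
without_preamble = split_by_headers(SAMPLE, keep_preamble=False)
```

In this sketch, the only difference between the two results is the extra leading `(None, ...)` tuple, so a fix along these lines would add one chunk for the preamble and leave the other sections unchanged.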

To address this, you might need to look into the MarkdownExtractor class and see if there's a way to modify it to include text before the first header. Alternatively, you could consider adding a header at the very beginning of your markdown content as a temporary workaround.
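The header workaround could be applied as a preprocessing step before upload. The sketch below assumes the file is edited outside Dify; the function name `ensure_leading_header` and the default title "Preamble" are hypothetical choices, not part of any Dify API.

```python
import re


def ensure_leading_header(markdown_text: str, title: str = "Preamble") -> str:
    """Hypothetical helper: prepend a header if the document does not start with one."""
    # Find the first line that is not blank.
    first = next((ln for ln in markdown_text.split("\n") if ln.strip()), None)
    # Leave empty documents and documents that already open with a header untouched.
    if first is None or re.match(r"^#+\s", first):
        return markdown_text
    return f"# {title}\n\n{markdown_text}"


patched = ensure_leading_header("Any text above the first # is gone!\n\n# apple\n")
```

Note that the placeholder header becomes part of the ingested content, so it may be worth choosing a title that is meaningful for retrieval.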
