Markdown ingest is weird: text before first header is missing #11828

cpwan · 2024-12-19T03:48:27Z

Self Checks

This is only for bug reports; if you would like to ask a question, please head to Discussions.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

Dify version

0.14.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Ingest the following markdown into "Knowledge"

Any text above the first # is gone!

# apple 

## banana

## Cherry

### dify

i am a bad boy

===

lol???

The rendered example looks like this

✔️ Expected Behavior

The text "Any text above the first # is gone!" should be ingested.

❌ Actual Behavior

It is gone. As a consequence, there is no chunk for any text ahead of the first #.


dosubot · 2024-12-19T03:51:34Z

The issue you're experiencing with markdown ingestion in Dify, where text before the first markdown header (#) is not being ingested, might be related to how the markdown processing is implemented. Unfortunately, I couldn't find specific details in the codebase about how text before the first header is handled during ingestion.

However, the MarkdownExtractor class is responsible for loading markdown files and converting them into Document objects, which involves parsing the markdown into tuples of headers and their corresponding text content ^[1]. This might be where the issue arises, as the text before the first header might not be included in these tuples.
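To illustrate how a header-based parser can lose leading text, here is a minimal sketch. It is not Dify's actual MarkdownExtractor code: the function name `split_by_headers` and the `keep_preamble` flag are hypothetical, and the header detection is deliberately simplistic (it would, for example, misread `# comment` lines inside fenced code blocks).

```python
import re
from typing import List, Optional, Tuple

# A line counts as a header if it starts with one or more '#' followed by whitespace.
HEADER_RE = re.compile(r"^#+\s")


def split_by_headers(
    markdown_text: str, keep_preamble: bool = True
) -> List[Tuple[Optional[str], str]]:
    """Split markdown into (header, body) tuples (hypothetical helper).

    Text that appears before the first header is emitted with header=None
    when keep_preamble is True, and silently dropped otherwise.
    """
    sections: List[Tuple[Optional[str], str]] = []
    current_header: Optional[str] = None
    current_lines: List[str] = []

    def flush() -> None:
        body = "\n".join(current_lines).strip()
        if current_header is not None:
            sections.append((current_header, body))
        elif keep_preamble and body:
            # Preamble: text with no header above it.
            sections.append((None, body))

    for line in markdown_text.split("\n"):
        if HEADER_RE.match(line):
            flush()
            current_header = line.strip()
            current_lines = []
        else:
            current_lines.append(line)
    flush()
    return sections
```

With the markdown from the reproduction steps, `keep_preamble=True` yields a first tuple of `(None, "Any text above the first # is gone!")`, while `keep_preamble=False` starts at `# apple`, which matches the reported symptom.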

To address this, you might need to look into the MarkdownExtractor class and see if there's a way to modify it to include text before the first header. Alternatively, you could consider adding a header at the very beginning of your markdown content as a temporary workaround.
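The temporary workaround could be applied as a preprocessing step before upload. The following is only a sketch under assumptions: `ensure_leading_header` and the `# Preamble` placeholder are hypothetical names, not part of Dify.

```python
import re


def ensure_leading_header(markdown_text: str, placeholder: str = "# Preamble") -> str:
    """Prepend a placeholder header if the document does not start with one.

    Hypothetical helper: the first non-blank line decides whether a header
    needs to be added, so leading text is no longer "above the first #".
    """
    for line in markdown_text.split("\n"):
        if not line.strip():
            continue  # skip leading blank lines
        if re.match(r"^#+\s", line):
            return markdown_text  # already starts with a header
        return f"{placeholder}\n\n{markdown_text}"
    return markdown_text  # empty or whitespace-only input is left unchanged
```

Running the file contents through this function before ingesting them into "Knowledge" should keep the leading text attached to a header, at the cost of an extra placeholder title in the first chunk.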



cpwan changed the title ~~Markdown ingest is weird~~ Markdown ingest is weird: text before first header is missing Dec 19, 2024

dosubot bot added the 🐞 bug Something isn't working label Dec 19, 2024
