[Discussion]: Most articles only have one section #310
Description
Problem statement
Most articles seem to have only one section. The tutorial explains the Article class body as a list of sections, where each section has a header and a list of paragraphs.
However, I could not easily find an example with more than one section, so I ran this code:
from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.us)
for article in crawler.crawl():
    print(".")  # progress marker, one dot per crawled article
    # print only articles whose body has more than one section
    if len(article.body.sections) > 1:
        print(article)
and it seems only a small minority of articles has more than one section.
Based on this, I wonder if the complexity:
- body is a list of sections
- a section has an optional header
- a section is a list of paragraphs
is necessary?
For instance, an article like this one has only one section.
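To put a number on "small minority", the check above could be turned into a tally of how many articles have 0, 1, 2, ... sections. The following is only a sketch: `section_histogram` is a hypothetical helper, not part of fundus, and it assumes nothing beyond the `article.body.sections` access already used above. The stand-in objects just mimic that shape so the snippet runs without crawling anything.

```python
from collections import Counter
from types import SimpleNamespace


def section_histogram(articles):
    """Map 'number of sections' to 'number of articles with that many sections'."""
    counts = Counter()
    for article in articles:
        body = getattr(article, "body", None)
        # Assumption: an article without a parsed body is counted as 0 sections.
        sections = body.sections if body is not None else []
        counts[len(sections)] += 1
    return counts


# Stand-in objects that only mimic the `.body.sections` shape.
# With fundus, the argument would be the iterator from crawler.crawl().
fake_articles = [
    SimpleNamespace(body=SimpleNamespace(sections=["s1"])),
    SimpleNamespace(body=SimpleNamespace(sections=["s1"])),
    SimpleNamespace(body=SimpleNamespace(sections=["s1", "s2", "s3"])),
    SimpleNamespace(body=None),
]
histogram = section_histogram(fake_articles)
print(sorted(histogram.items()))
```

Running this over a few hundred real articles per publisher would show whether multi-section bodies are rare everywhere or only for some publishers.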
Solution
Instead of: "Body is TextSequence of ArticleSection, each with a headline TextSequence and a paragraphs TextSequence (which again is a list of ArticleSection)"
A potentially simpler solution could be a flattened structure in which "Body is a list of Paragraph". A Paragraph would only hold a string and a boolean flag is_headline.
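A rough sketch of what the flattened structure might look like, to make the proposal concrete. Only the names `Paragraph` and `is_headline` come from the proposal above. `NestedSection` and `flatten` are hypothetical stand-ins for illustration and do not reproduce the actual fundus classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Paragraph:
    """Proposed flat unit: a string plus a flag marking headlines."""
    text: str
    is_headline: bool = False


@dataclass
class NestedSection:
    """Hypothetical stand-in for the current nested shape (not the fundus class)."""
    headline: Optional[str]
    paragraphs: List[str]


def flatten(sections: List[NestedSection]) -> List[Paragraph]:
    """Turn a list of sections into one flat list of paragraphs."""
    flat: List[Paragraph] = []
    for section in sections:
        if section.headline:
            # A section header becomes a paragraph flagged as a headline.
            flat.append(Paragraph(section.headline, is_headline=True))
        flat.extend(Paragraph(text) for text in section.paragraphs)
    return flat


nested = [
    NestedSection(None, ["Intro paragraph."]),
    NestedSection("Background", ["First detail.", "Second detail."]),
]
body = flatten(nested)
for paragraph in body:
    print(("# " if paragraph.is_headline else "") + paragraph.text)
```

In this sketch, section boundaries stay recoverable by splitting the list at every paragraph whose is_headline flag is set.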