[Discussion]: Most articles only have one section  #310

Closed
@alanakbik

Description

Problem statement

Most articles seem to have only one section. The tutorial describes the body of the Article class as a list of sections, where each section has a header and a list of paragraphs.

However, I could not easily find an example with more than one section, so I ran this code:

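from fundus import Crawler, PublisherCollection

# Crawl all US publishers supported by fundus
crawler = Crawler(PublisherCollection.us)

for article in crawler.crawl():
    # Progress marker: one dot per crawled article
    print(".")
    # Print only articles whose body has more than one section
    if len(article.body.sections) > 1:
        print(article)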

and it seems that only a small minority of articles have more than one section.
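To put a number on "a small minority", the check above could be turned into a small counting helper. This is only a sketch: `multi_section_share` and `fake_article` are hypothetical names, and the stand-in objects mimic nothing more than the `body.sections` attribute path used in the loop above, so the example runs without crawling anything.

```python
from types import SimpleNamespace
from typing import Iterable


def multi_section_share(articles: Iterable) -> float:
    """Return the fraction of articles whose body has more than one section."""
    total = 0
    multi = 0
    for article in articles:
        total += 1
        if len(article.body.sections) > 1:
            multi += 1
    # Avoid dividing by zero when no articles were seen
    return multi / total if total else 0.0


def fake_article(n_sections: int) -> SimpleNamespace:
    """Hypothetical stand-in exposing only `body.sections` (not a fundus class)."""
    return SimpleNamespace(body=SimpleNamespace(sections=[object()] * n_sections))


# Made-up sample: three single-section articles and one with three sections
sample = [fake_article(1), fake_article(1), fake_article(3), fake_article(1)]
print(multi_section_share(sample))  # 0.25
```

The same function should accept the iterator returned by `crawler.crawl()`, since it only relies on `article.body.sections`.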

Based on this, I wonder whether the following complexity is necessary:

  • body is a list of sections
  • a section has an optional header
  • a section is a list of paragraphs

For instance, an article like this one consists of only one section.

Solution

Instead of: "Body is TextSequence of ArticleSection, each with a headline TextSequence and a paragraphs TextSequence (which again is a list of ArticleSection)"

A potentially simpler solution could be a flattened structure in which "Body is list of Paragraph". A Paragraph would hold only a string and a boolean flag is_headline.
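A minimal sketch of what the flattened structure might look like, and how the nested layout could be converted into it. Only `Paragraph` and `is_headline` come from the proposal above; `Section`, the `text` field, and `flatten` are hypothetical names chosen for illustration and are not the existing fundus classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Section:
    """Hypothetical stand-in for the nested layout: optional headline plus paragraphs."""
    paragraphs: List[str] = field(default_factory=list)
    headline: Optional[str] = None


@dataclass
class Paragraph:
    """Proposed flat unit: a string plus a flag marking headlines."""
    text: str
    is_headline: bool = False


def flatten(sections: List[Section]) -> List[Paragraph]:
    """Convert nested sections into one flat list, keeping document order."""
    flat: List[Paragraph] = []
    for section in sections:
        # A headline, if present, becomes a flagged paragraph ahead of its section text
        if section.headline:
            flat.append(Paragraph(section.headline, is_headline=True))
        flat.extend(Paragraph(text) for text in section.paragraphs)
    return flat


body = flatten([
    Section(paragraphs=["Intro paragraph."]),
    Section(headline="Background", paragraphs=["First.", "Second."]),
])
for paragraph in body:
    prefix = "# " if paragraph.is_headline else ""
    print(prefix + paragraph.text)
```

One trade-off of the flat form is that section boundaries become implicit: a consumer that wants sections back has to regroup the list at each paragraph where `is_headline` is true.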

Additional Context

No response
