This repository has been archived by the owner on Jan 2, 2025. It is now read-only.

feat: index and search documentation #978

Merged Nov 8, 2023 (42 commits)

Conversation

@oppiliappan (Collaborator) commented Sep 21, 2023

effective changes:

  • introduces a multithreaded webcrawler specialized to scrape documentation
  • adds the following backend endpoints:
    • GET /docs/sync url==<string>: index a new doc source
    • GET|DELETE /docs/:id: crud ops on indexed doc sources
    • GET /docs/:id/resync: update an existing doc source
    • GET /docs/:id/fetch relative_url==<string>: fetch one webpage from a doc source; the page is broken down into headings & sections
    • GET /docs/:id/search: semantic search over a doc source; the granularity here is individual sections of each page
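
As a rough illustration of the endpoint list above, the sketch below models each route as an enum variant and builds its method and path. The `DocRequest` type, its method names, the `i64` id type, and the hand-rolled `percent_encode` helper are all assumptions made for this example, not code from the PR. The `url==<string>` notation is read here as a query-string parameter.

```rust
/// The doc endpoints listed above, modelled as a plain enum.
/// Illustrative only: this type does not come from the PR.
pub enum DocRequest<'a> {
    Sync { url: &'a str },
    Get { id: i64 },
    Delete { id: i64 },
    Resync { id: i64 },
    Fetch { id: i64, relative_url: &'a str },
    Search { id: i64 },
}

impl DocRequest<'_> {
    /// HTTP method for the request; only the delete variant is not a GET.
    pub fn method(&self) -> &'static str {
        match self {
            DocRequest::Delete { .. } => "DELETE",
            _ => "GET",
        }
    }

    /// Path plus query string, with parameter values percent-encoded.
    pub fn path_and_query(&self) -> String {
        match self {
            DocRequest::Sync { url } => {
                format!("/docs/sync?url={}", percent_encode(url))
            }
            DocRequest::Get { id } | DocRequest::Delete { id } => format!("/docs/{id}"),
            DocRequest::Resync { id } => format!("/docs/{id}/resync"),
            DocRequest::Fetch { id, relative_url } => format!(
                "/docs/{id}/fetch?relative_url={}",
                percent_encode(relative_url)
            ),
            DocRequest::Search { id } => format!("/docs/{id}/search"),
        }
    }
}

/// Percent-encodes every byte outside the RFC 3986 "unreserved" set.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{byte:02X}")),
        }
    }
    out
}
```

With these definitions, `DocRequest::Sync { url: "https://docs.rs/" }.path_and_query()` yields `/docs/sync?url=https%3A%2F%2Fdocs.rs%2F`. Any path prefix the server mounts these routes under is left out of the sketch.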

implementation details:

  • article scraper is a port of a python library: newspaper3k
  • websites are scraped and broken down into sections based on html header tags, using tree-sitter
  • a new qdrant collection is initialized on startup: web
  • document sources are tracked on sqlite, but text content of each page is stored solely in qdrant
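
The section-splitting step described above can be sketched as follows. This is a minimal sketch that assumes the page has already been parsed into a flat sequence of heading and text blocks; the parsing itself (done with tree-sitter in the PR) is not shown, and the `Block`, `Section`, and `split_into_sections` names are hypothetical.

```rust
/// A parsed page element. How the page gets parsed is out of scope here;
/// these types are illustrative and not taken from the PR.
#[derive(Debug, Clone, PartialEq)]
pub enum Block {
    Heading { level: u8, text: String },
    Text(String),
}

/// A run of content that sits under one heading.
#[derive(Debug, Clone, PartialEq)]
pub struct Section {
    /// Heading that opens the section; `None` for content before the first heading.
    pub header: Option<String>,
    pub body: String,
}

/// Groups blocks into sections: every heading starts a new section,
/// and text blocks are appended to the section currently open.
pub fn split_into_sections(blocks: &[Block]) -> Vec<Section> {
    let mut sections = Vec::new();
    let mut current = Section { header: None, body: String::new() };

    for block in blocks {
        match block {
            Block::Heading { text, .. } => {
                // Close the section in progress, unless it is still empty.
                if current.header.is_some() || !current.body.is_empty() {
                    sections.push(current);
                }
                current = Section { header: Some(text.clone()), body: String::new() };
            }
            Block::Text(text) => {
                if !current.body.is_empty() {
                    current.body.push('\n');
                }
                current.body.push_str(text);
            }
        }
    }

    // Flush the last section.
    if current.header.is_some() || !current.body.is_empty() {
        sections.push(current);
    }
    sections
}
```

Heading levels are carried but ignored in this sketch, so an `h2` under an `h1` becomes a sibling section rather than a nested one. Whether the PR nests sections is not stated in the description.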

@oppiliappan oppiliappan marked this pull request as draft September 21, 2023 13:43
@oppiliappan oppiliappan force-pushed the nerdypepper/doc-scraper branch 2 times, most recently from 3c66f23 to b7f6646 Compare September 22, 2023 11:24
@oppiliappan oppiliappan force-pushed the nerdypepper/doc-scraper branch 2 times, most recently from 99e1c2a to 96b0f18 Compare October 10, 2023 13:07
@ggordonhall ggordonhall added the feature A new feature label Oct 20, 2023
@oppiliappan oppiliappan force-pushed the nerdypepper/doc-scraper branch 4 times, most recently from 4a06a80 to 60ea06f Compare November 2, 2023 17:37
gitpod-io bot commented Nov 3, 2023

@oppiliappan oppiliappan force-pushed the nerdypepper/doc-scraper branch 3 times, most recently from f8ff0a7 to d210715 Compare November 6, 2023 11:45
@oppiliappan oppiliappan marked this pull request as ready for review November 6, 2023 18:56
studio crud ops
---------------

- introduces a new `doc_context` to studio snapshots and studios
- `doc_context` may be populated in a similar fashion to `context`,
  through the `patch` method, sample request:

    http PATCH :7878/api/studio/1
    {
        "doc_context": [
            {
                "doc_id": 3,
                "doc_source": "https://docs.rs/qdrant-client/latest/qdrant_client/",
                "relative_url": "qdrant/struct.PayloadIncludeSelector.html",
                "ranges": [
                    "eaefb40a-13a3-4c2e-a0b3-4ffa3670bfa4",
                    "9b0f63bb-9ecd-46ee-8315-23065df418ce"
                ],
                "hidden": false
            }
        ]
    }
- the uuids in the `ranges` field correspond to the sections in a
  webpage
- to display a webpage with its active and inactive sections, use the
  `fetch` endpoint, which lists every section of the page in order.
  among these, the active sections are those which are present in the
  studio context
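
To make the active/inactive distinction above concrete, here is a small sketch. The struct fields mirror the sample request, but the `DocContextEntry` type, the `mark_active` function, and the idea that section ids arrive as an ordered slice of strings are assumptions for illustration only.

```rust
use std::collections::HashSet;

/// One entry of `doc_context`, with the same fields as the sample request.
/// Illustrative only: the real type in the PR may differ.
pub struct DocContextEntry {
    pub doc_id: i64,
    pub doc_source: String,
    pub relative_url: String,
    /// Section uuids that are part of the studio context.
    pub ranges: Vec<String>,
    pub hidden: bool,
}

/// Pairs each section id of a page (in page order) with a flag saying
/// whether that section is active, i.e. present in the entry's `ranges`.
pub fn mark_active<'a>(
    page_section_ids: &[&'a str],
    entry: &DocContextEntry,
) -> Vec<(&'a str, bool)> {
    let active: HashSet<&str> = entry.ranges.iter().map(String::as_str).collect();
    page_section_ids
        .iter()
        .map(|id| (*id, active.contains(*id)))
        .collect()
}
```

A caller would take the ordered section ids from a `fetch` response and pass them together with the matching `doc_context` entry. The output keeps page order, so it can drive a section list directly.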

token counting
--------------

- token counts for docs are calculated separately and added under the
  `doc_context` field
- token counts include the headers for each section as well
- also add support for metadata scraping, yet to be integrated into the
  doc index however
also updates db to incorporate metadata changeset
- return stream errors as EventSource messages
- return end of stream only /after/ all tantivy commits have completed
- include link dedup check in `verify` endpoint
- track absolute url in the index instead of constructing on the fly
- attach newlines if not available in code fence blocks
- and several others, idk look at the diff
- different url schemes used in linking pages
- improve article main node detector
- also add Sse keep-alive to all streaming endpoints
- commit just once at the end of the index step
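
For the token counting notes above, the sketch below shows one way a per-doc count could include section headers alongside section text. The `DocSection` type and `doc_token_count` function are hypothetical, and the tokenizer is passed in as a closure because the description does not say which tokenizer is used.

```rust
/// A section of a doc page: its heading and the text under it.
/// Illustrative only: not a type from the PR.
pub struct DocSection {
    pub header: String,
    pub text: String,
}

/// Sums token counts over all sections, counting both the header and the
/// body of each one. `count_tokens` stands in for the real tokenizer.
pub fn doc_token_count<F>(sections: &[DocSection], count_tokens: F) -> usize
where
    F: Fn(&str) -> usize,
{
    sections
        .iter()
        .map(|section| count_tokens(&section.header) + count_tokens(&section.text))
        .sum()
}
```

Passing a whitespace counter such as `|s: &str| s.split_whitespace().count()` is enough to try it out; a real tokenizer would give different numbers.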
@oppiliappan oppiliappan force-pushed the nerdypepper/doc-scraper branch from 21202b3 to 7fe7b3b Compare November 7, 2023 12:41
@oppiliappan oppiliappan force-pushed the nerdypepper/doc-scraper branch from a31440e to 45eb337 Compare November 7, 2023 12:59
* doc indexing ui

* add doc sections panel

* styling for doc panel

* two-step popup

* use sse for syncing docs

* add filtering to doc providers on search

* search sections

* fix sections selection

* show a list of docs for provider

* add docs to studio context

* add tokens for docs

* ui fixes

* get token count for doc

* some design updates

* rework arrow key navigation in doc context modal

* update designs for doc panel

* add title and icon to docs in code studio

* minor fixes

* rework arrow navigation for repo context modal

* small fix for long doc urls

* show error if doc page in context is unavailable

* set max width for the info message and add three dots to very long urls to make it less jittery

* allow multiselect for context files

* add breadcrumbs for doc sections on search, fix scrolling issues in doc modal

* rework section search UI

* minor UI fixes

* change doc remove hotkey

* make sure buttons are not selectable

* disable html in markdown for parsed docs

* reset indexing state on SSE error

* add tooltip to breadcrumbs, fix clicking on section breadcrumbs, open link in doc panel

* manually handle links in docs panel

* use absolute url

* remove console log
anastasiya1155 and others added 5 commits November 7, 2023 10:42
- make doc_context non-null
- use sql transaction to prevent stream errors from landing us in an
  incongruent state
- do not check host str when making relative urls
- no more warnings
@oppiliappan oppiliappan merged commit 1ba867b into main Nov 8, 2023
8 checks passed
@oppiliappan oppiliappan deleted the nerdypepper/doc-scraper branch November 8, 2023 09:49
5 participants