feat: index and search documentation #978

oppiliappan · 2023-09-21T13:43:44Z

effective changes:

- introduces a multithreaded web crawler specialized to scrape documentation
- adds the following backend endpoints:
  - `GET /docs/sync url==<string>`: index a new doc source
  - `GET|DELETE /docs/:id`: crud ops on indexed doc sources
  - `GET /docs/:id/resync`: update an existing doc source
  - `GET /docs/:id/fetch relative_url==<string>`: fetch one webpage from a doc source; the page is broken down into headings & sections
  - `GET /docs/:id/search`: semantic search over a doc source; the granularity here is individual sections of each page
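
To make the endpoint list concrete, here is a minimal client-side sketch of how the methods and paths could be assembled. `DocRequest` and `encode_component` are hypothetical names for illustration, not types from this PR, and the query parameter for `search` is left out because the description does not name it.

```rust
/// Hypothetical client-side model of the doc endpoints; not the PR's own types.
#[derive(Debug, Clone, PartialEq)]
pub enum DocRequest {
    Sync { url: String },
    Get { id: i64 },
    Delete { id: i64 },
    Resync { id: i64 },
    Fetch { id: i64, relative_url: String },
    Search { id: i64 },
}

impl DocRequest {
    /// Every endpoint in the list is a GET except the delete half of `GET|DELETE /docs/:id`.
    pub fn method(&self) -> &'static str {
        match self {
            DocRequest::Delete { .. } => "DELETE",
            _ => "GET",
        }
    }

    /// Path plus query string, with query values percent-encoded.
    pub fn path(&self) -> String {
        match self {
            DocRequest::Sync { url } => format!("/docs/sync?url={}", encode_component(url)),
            DocRequest::Get { id } | DocRequest::Delete { id } => format!("/docs/{}", id),
            DocRequest::Resync { id } => format!("/docs/{}/resync", id),
            DocRequest::Fetch { id, relative_url } => format!(
                "/docs/{}/fetch?relative_url={}",
                id,
                encode_component(relative_url)
            ),
            DocRequest::Search { id } => format!("/docs/{}/search", id),
        }
    }
}

/// Percent-encode everything outside the RFC 3986 unreserved set.
fn encode_component(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for byte in value.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}
```

The `url==<string>` notation in the list is HTTPie's syntax for a query-string parameter, which is why the sketch places `url` and `relative_url` after a `?`.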

implementation details:

- the article scraper is a port of a python library: newspaper3k
- web pages are scraped and broken down into sections based on html heading tags, using tree-sitter
- a new qdrant collection is initialized on startup: `web`
- document sources are tracked in sqlite, but the text content of each page is stored solely in qdrant
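
The idea of breaking a page into heading-delimited sections can be sketched as follows. This is not the PR's implementation: the PR parses HTML with tree-sitter, whereas this sketch swaps in a plain line scan over Markdown-style `#` headings to keep the example dependency-free. `Section`, `split_sections` and `parse_heading` are hypothetical names.

```rust
/// One heading plus the text that follows it, up to the next heading.
#[derive(Debug, Clone, PartialEq)]
pub struct Section {
    /// Heading text; empty for content that appears before the first heading.
    pub header: String,
    /// Heading level (1 to 6), or 0 for the leading headerless section.
    pub level: usize,
    pub body: String,
}

/// Split a page into sections at each heading line.
/// Assumption: headings look like `# Title`; `#` lines inside code fences are not special-cased.
pub fn split_sections(page: &str) -> Vec<Section> {
    let mut sections = Vec::new();
    let mut current = Section { header: String::new(), level: 0, body: String::new() };

    for line in page.lines() {
        if let Some((level, title)) = parse_heading(line) {
            // Close the running section unless it is an empty leading one.
            if !current.header.is_empty() || !current.body.trim().is_empty() {
                sections.push(current);
            }
            current = Section { header: title.to_string(), level, body: String::new() };
        } else {
            current.body.push_str(line);
            current.body.push('\n');
        }
    }
    if !current.header.is_empty() || !current.body.trim().is_empty() {
        sections.push(current);
    }
    sections
}

/// Recognise `#` through `######` followed by a space; returns (level, title).
fn parse_heading(line: &str) -> Option<(usize, &str)> {
    let level = line.chars().take_while(|c| *c == '#').count();
    if level == 0 || level > 6 {
        return None;
    }
    // '#' is a single byte, so `level` is also a valid byte offset.
    let title = line[level..].strip_prefix(' ')?;
    Some((level, title.trim()))
}
```

Each returned `Section` corresponds to the unit that the `fetch` and `search` endpoints are described as operating on: one heading and the text beneath it.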

gitpod-io · 2023-11-03T11:56:34Z

server/bleep/src/indexes/doc.rs

server/bleep/src/collector/group.rs

server/bleep/src/scraper/article.rs

server/bleep/src/scraper.rs

server/bleep/src/webserver.rs

studio crud ops
---------------

- introduces a new `doc_context` to studio snapshots and studios
- `doc_context` may be populated in a similar fashion to `context`, through the `patch` method, sample request:

      http PATCH :7878/api/studio/1

      {
        "doc_context": [
          {
            "doc_id": 3,
            "doc_source": "https://docs.rs/qdrant-client/latest/qdrant_client/",
            "relative_url": "qdrant/struct.PayloadIncludeSelector.html",
            "ranges": [
              "eaefb40a-13a3-4c2e-a0b3-4ffa3670bfa4",
              "9b0f63bb-9ecd-46ee-8315-23065df418ce"
            ],
            "hidden": false
          }
        ]
      }

- the uuids in the `ranges` field correspond to the sections in a webpage
- to display a webpage with its active and inactive sections, use the `fetch` endpoint, which lists every section of the page in-order. among these, the active sections are those which are present in the studio context

token counting
--------------

- token counts for docs are calculated separately and added under the `doc_context` field
- token counts include the headers for each section as well
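
A rough sketch of how the active/inactive split and the token count could fit together, assuming the `fetch` endpoint yields sections in page order and `ranges` holds the ids of the active ones. `PageSection`, `mark_active`, `doc_token_count` and `approx_tokens` are hypothetical names, and whitespace splitting is only a stand-in for whatever tokenizer the server actually uses.

```rust
use std::collections::HashSet;

/// Hypothetical shape of one section as returned by the `fetch` endpoint.
pub struct PageSection {
    pub id: String,
    pub header: String,
    pub text: String,
}

/// Pair every section, in page order, with whether its id appears in `ranges`.
pub fn mark_active<'a>(
    sections: &'a [PageSection],
    ranges: &[String],
) -> Vec<(&'a PageSection, bool)> {
    let active: HashSet<&str> = ranges.iter().map(String::as_str).collect();
    sections
        .iter()
        .map(|section| (section, active.contains(section.id.as_str())))
        .collect()
}

/// Token count over active sections only, counting each section's header as well as its text.
pub fn doc_token_count(sections: &[PageSection], ranges: &[String]) -> usize {
    mark_active(sections, ranges)
        .into_iter()
        .filter(|(_, is_active)| *is_active)
        .map(|(section, _)| approx_tokens(&section.header) + approx_tokens(&section.text))
        .sum()
}

/// Stand-in tokenizer (assumption): one token per whitespace-separated word.
fn approx_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}
```

The order of ids inside `ranges` does not matter here; display order always follows the page, which matches the note that `fetch` lists sections in-order.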

- also add support for metadata scraping, yet to be integrated into the doc index however

also updates db to incorporate metadata changeset

- return stream errors as EventSource messages
- return end of stream only /after/ all tantivy commits have completed
- include link dedup check in `verify` endpoint
- track absolute url in the index instead of constructing on the fly
- attach newlines if not available in code fence blocks
- and several others, idk look at the diff
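
As an illustration of the link dedup item, here is a minimal order-preserving sketch. The diff is not shown here, so the rule used (two links are the same page if they differ only in their `#fragment`) is an assumption, and `dedup_links` and `strip_fragment` are hypothetical names.

```rust
use std::collections::HashSet;

/// Drop repeated links, keeping the first occurrence of each in its original position.
/// Assumption: links that differ only by `#fragment` point at the same page.
pub fn dedup_links(links: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    for link in links {
        let key = strip_fragment(link).to_string();
        // `insert` returns false when the key was already present.
        if seen.insert(key.clone()) {
            unique.push(key);
        }
    }
    unique
}

/// Everything before the first `#`, or the whole link if there is none.
fn strip_fragment(link: &str) -> &str {
    link.split('#').next().unwrap_or(link)
}
```

Keeping first-seen order, rather than collecting into a set alone, means the crawl frontier stays deterministic for a given page.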

- different url schemes used in linking pages
- improve article main node detector

- also add Sse keep-alive to all streaming endpoints
- commit just once at the end of the index step

* doc indexing ui
* add doc sections panel
* styling for doc panel
* two-step popup
* use sse for syncing docs
* add filtering to doc providers on search
* search sections
* fix sections selection
* show a list of docs for provider
* add docs to studio context
* add tokens for docs
* ui fixes
* get token count for doc
* some design updates
* rework arrow key navigation in doc context modal
* update designs for doc panel
* add title and icon to docs in code studio
* minor fixes
* rework arrow navigation for repo context modal
* small fix for long doc urls
* show error if doc page in context is unavailable
* set max width for the info message and add three dots to very long urls to make it less jittery
* allow multiselect for context files
* add breadcrumbs for doc sections on search, fix scrolling issues in doc modal
* rework section search UI
* minor UI fixes
* change doc remove hotkey
* make sure buttons are not selectable
* disable html in markdown for parsed docs
* reset indexing state on SSE error
* add tooltip to breadcrumbs, fix clicking on section breadcrumbs, open link in doc panel
* manually handle links in docs panel
* use absolute url
* remove console log

server/bleep/migrations/20230919100529_code_studio_docs.sql

server/bleep/src/webserver/studio.rs

- make doc_context non-null
- use sql transaction to prevent stream errors from landing us in an incongruent state
- do not check host str when making relative urls
- no more warnings
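
For readers unfamiliar with the relative/absolute URL split used by `doc_context`, here is a minimal sketch of one plausible reading: a page's relative URL is its absolute URL with the doc source prefix removed. The PR's actual logic is not shown in this conversation, so `relative_url` and `to_absolute` are hypothetical helpers, not functions from the codebase.

```rust
/// Relative URL of a page within a doc source, or None if the page lies outside it.
/// Assumption: plain prefix comparison, with no separate check on the host component.
pub fn relative_url<'a>(doc_source: &str, absolute_url: &'a str) -> Option<&'a str> {
    absolute_url.strip_prefix(doc_source)
}

/// Rebuild the absolute URL, joining the two parts with exactly one '/'.
pub fn to_absolute(doc_source: &str, relative_url: &str) -> String {
    format!(
        "{}/{}",
        doc_source.trim_end_matches('/'),
        relative_url.trim_start_matches('/')
    )
}
```

With the values from the sample `PATCH` request, the doc source `https://docs.rs/qdrant-client/latest/qdrant_client/` and the page `qdrant/struct.PayloadIncludeSelector.html` round-trip through both helpers.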

oppiliappan marked this pull request as draft September 21, 2023 13:43

oppiliappan force-pushed the nerdypepper/doc-scraper branch 2 times, most recently from 3c66f23 to b7f6646 Compare September 22, 2023 11:24

oppiliappan force-pushed the nerdypepper/doc-scraper branch 2 times, most recently from 99e1c2a to 96b0f18 Compare October 10, 2023 13:07

ggordonhall added the feature A new feature label Oct 20, 2023

oppiliappan force-pushed the nerdypepper/doc-scraper branch 4 times, most recently from 4a06a80 to 60ea06f Compare November 2, 2023 17:37

oppiliappan force-pushed the nerdypepper/doc-scraper branch 3 times, most recently from f8ff0a7 to d210715 Compare November 6, 2023 11:45

oppiliappan marked this pull request as ready for review November 6, 2023 18:56

oppiliappan requested review from calyptobai, ggordonhall and rsdy November 6, 2023 18:56

ggordonhall reviewed Nov 6, 2023

rsdy reviewed Nov 7, 2023

ggordonhall reviewed Nov 7, 2023

oppiliappan added 9 commits November 7, 2023 12:37

rework error handling, add verify endpoint

96e3424

introduce two-step search

63bae3d

interleave scraping and embedding

6e143c7

add sse to sync and resync endpoints

a2be4b1

continue crawling even if article cannot be parsed

a1a1845

introduce list_with_id to list pages in a provider

e73df9a

return placeholder for page title

2df7b0e

oppiliappan added 15 commits November 7, 2023 12:37

rip out qdrant

62bdbad

assortment of fixes

8603b52

remove chunking logic

bdf5ec7

rework log levels to work well with sentry

4ea529e

lower log level to trace

ac26035

clippy

81d01f5

fix bugs arising from indexing momentjs

e46bb4e

add analytics events to /sync

d097eae

fix studios

036b9a1

clippy

7ec7f9d

bug bug bug

a3c9ab4

more url-ness

65a2cfc

add absolute_url to doc-context-file

7005995

address review comments

7fe7b3b

oppiliappan force-pushed the nerdypepper/doc-scraper branch from 21202b3 to 7fe7b3b Compare November 7, 2023 12:41

oppiliappan requested review from ggordonhall and rsdy November 7, 2023 12:41

ggordonhall approved these changes Nov 7, 2023

undo accidental doc test

45eb337

oppiliappan force-pushed the nerdypepper/doc-scraper branch from a31440e to 45eb337 Compare November 7, 2023 12:59

calyptobai reviewed Nov 7, 2023

anastasiya1155 and others added 5 commits November 7, 2023 10:42

fix selecting section after search in doc modal

e79c09c

address review comments

749b1db

attempt to fix duplication

ffd5db7

fix issue with python tutorial page

f80d820

address clippy

727e0b0

oppiliappan merged commit 1ba867b into main Nov 8, 2023
8 checks passed

oppiliappan deleted the nerdypepper/doc-scraper branch November 8, 2023 09:49
