This repository has been archived by the owner on Jan 2, 2025. It is now read-only.
oppiliappan force-pushed the nerdypepper/doc-scraper branch 2 times, most recently from 3c66f23 to b7f6646 (September 22, 2023 11:24)
oppiliappan force-pushed the nerdypepper/doc-scraper branch 2 times, most recently from 99e1c2a to 96b0f18 (October 10, 2023 13:07)
oppiliappan force-pushed the nerdypepper/doc-scraper branch 4 times, most recently from 4a06a80 to 60ea06f (November 2, 2023 17:37)
oppiliappan force-pushed the nerdypepper/doc-scraper branch 3 times, most recently from f8ff0a7 to d210715 (November 6, 2023 11:45)
ggordonhall reviewed Nov 6, 2023
rsdy reviewed Nov 7, 2023
ggordonhall reviewed Nov 7, 2023
effective changes:
------------------

- introduces a multithreaded web crawler specialized to scrape documentation
- adds the following backend endpoints:
  * `GET /docs/sync url==<string>`: index a new doc source
  * `GET|DELETE /docs/:id`: crud ops on indexed doc sources
  * `GET /docs/:id/resync`: update an existing doc source
  * `GET /docs/:id/fetch relative_url==<string>`: fetch one webpage from a doc source, the page is broken down into headings & sections
  * `GET /docs/:id/search`: semantic search over a doc source, the granularity here is individual sections of each page

implementation details:
-----------------------

- article scraper is a port of a python library: newspaper3k
- websites are scraped and broken down into sections based on html header tags, using tree-sitter
- a new qdrant collection is initialized on startup: `web`
- document sources are tracked on sqlite, but text content of each page is stored solely in qdrant
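To illustrate what "broken down into sections based on html header tags" can look like, here is a minimal sketch. It is not the PR's implementation: the PR uses tree-sitter, while this sketch swaps in Python's standard `html.parser`, and the names `SectionSplitter`, `split_into_sections`, `heading` and `body` are made up for the example.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionSplitter(HTMLParser):
    """Collects {heading, body} sections; text before the first heading gets an empty heading."""

    def __init__(self):
        super().__init__()
        self.sections = []        # finished sections, in page order
        self._heading = ""        # heading text of the section being built
        self._body = []           # body text fragments of the section being built
        self._in_heading = False  # True while inside an <h1>..<h6> element

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            # A new heading closes the previous section.
            self._flush()
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self._heading += data
        else:
            self._body.append(data)

    def _flush(self):
        # Collapse runs of whitespace so the body is a single clean line.
        body = " ".join(" ".join(self._body).split())
        if self._heading or body:
            self.sections.append({"heading": self._heading.strip(), "body": body})
        self._heading, self._body = "", []

    def close(self):
        super().close()
        self._flush()  # emit the trailing section


def split_into_sections(html):
    """Hypothetical helper: split an HTML string into heading-delimited sections."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return parser.sections
```

Each returned section would then be the unit that gets embedded and searched, which matches the per-section granularity described for `GET /docs/:id/search`.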
studio crud ops
---------------

- introduces a new `doc_context` to studio snapshots and studios
- `doc_context` may be populated in a similar fashion to `context`, through the `patch` method, sample request:

    http PATCH :7878/api/studio/1
    {
      "doc_context": [
        {
          "doc_id": 3,
          "doc_source": "https://docs.rs/qdrant-client/latest/qdrant_client/",
          "relative_url": "qdrant/struct.PayloadIncludeSelector.html",
          "ranges": [
            "eaefb40a-13a3-4c2e-a0b3-4ffa3670bfa4",
            "9b0f63bb-9ecd-46ee-8315-23065df418ce"
          ],
          "hidden": false
        }
      ]
    }

- the uuids in the `ranges` field correspond to the sections in a webpage
- to display a webpage with its active and inactive sections, use the `fetch` endpoint, which lists every section of the page in-order. among these, the active sections are those which are present in the studio context

token counting
--------------

- token counts for docs are calculated separately and added under the `doc_context` field
- token counts include the headers for each section as well
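The relationship between `ranges`, the `fetch` output and the token count can be sketched as follows. This is only an illustration under assumptions: the section fields `id`, `header` and `text` are hypothetical (the actual response shape of `fetch` is not shown here), and whitespace splitting stands in for whatever tokenizer the backend really uses.

```python
def mark_active(sections, ranges):
    """Return the sections in page order, each tagged with whether it is in the studio context.

    `sections` is assumed to be the in-order list from the fetch endpoint, and
    `ranges` the list of section uuids stored in a `doc_context` entry.
    """
    selected = set(ranges)
    return [{**section, "active": section["id"] in selected} for section in sections]


def count_tokens(sections, tokenize=str.split):
    """Sum tokens over the active sections, counting each header as well as its text.

    `tokenize` defaults to whitespace splitting, which is a stand-in and not
    the real tokenizer.
    """
    total = 0
    for section in sections:
        if section["active"]:
            total += len(tokenize(section["header"])) + len(tokenize(section["text"]))
    return total
```

A client could call `mark_active` on the fetched page to render active and inactive sections, and the same selection drives the count reported for the doc.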
- also add support for metadata scraping, not yet integrated into the doc index
also updates db to incorporate metadata changeset
- return stream errors as EventSource messages
- return end of stream only /after/ all tantivy commits have completed
- include link dedup check in `verify` endpoint
- track absolute url in the index instead of constructing on the fly
- attach newlines if not available in code fence blocks
- and several others, idk look at the diff
- different url schemes used in linking pages
- improve article main node detector
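The link-related items in the commits above (link dedup, tracking absolute urls, differing url schemes) can be pictured with a small sketch. This is not the crawler's code: `absolute_links` is a hypothetical helper, and treating urls that differ only by fragment as duplicates is an assumption made for the example.

```python
from urllib.parse import urldefrag, urljoin


def absolute_links(page_url, hrefs):
    """Resolve hrefs against page_url, drop fragments, and keep the first occurrence of each url."""
    seen = set()
    out = []
    for href in hrefs:
        # urljoin handles relative paths, "../" segments and already-absolute hrefs alike.
        url, _fragment = urldefrag(urljoin(page_url, href))
        if url not in seen:
            seen.add(url)
            out.append(url)
    return out
```

Storing the resolved url once, rather than rebuilding it from a base and a relative path on every read, keeps the dedup check a plain set lookup.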
- also add Sse keep-alive to all streaming endpoints
- commit just once at the end of the index step
oppiliappan force-pushed the nerdypepper/doc-scraper branch from 21202b3 to 7fe7b3b (November 7, 2023 12:41)
ggordonhall approved these changes Nov 7, 2023
oppiliappan force-pushed the nerdypepper/doc-scraper branch from a31440e to 45eb337 (November 7, 2023 12:59)
* doc indexing ui
* add doc sections panel
* styling for doc panel
* two-step popup
* use sse for syncing docs
* add filtering to doc providers on search
* search sections
* fix sections selection
* show a list of docs for provider
* add docs to studio context
* add tokens for docs
* ui fixes
* get token count for doc
* some design updates
* rework arrow key navigation in doc context modal
* update designs for doc panel
* add title and icon to docs in code studio
* minor fixes
* rework arrow navigation for repo context modal
* small fix for long doc urls
* show error if doc page in context is unavailable
* set max width for the info message and add three dots to very long urls to make it less jittery
* allow multiselect for context files
* add breadcrumbs for doc sections on search, fix scrolling issues in doc modal
* rework section search UI
* minor UI fixes
* change doc remove hotkey
* make sure buttons are not selectable
* disable html in markdown for parsed docs
* reset indexing state on SSE error
* add tooltip to breadcrumbs, fix clicking on section breadcrumbs, open link in doc panel
* manually handle links in docs panel
* use absolute url
* remove console log
calyptobai reviewed Nov 7, 2023
- make doc_context non-null
- use sql transaction to prevent stream errors from landing us in an incongruent state
- do not check host str when making relative urls
- no more warnings
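The point of wrapping the sync in a sql transaction can be shown with a short sketch. It is an illustration only, using Python's standard `sqlite3` module and not the backend's actual code; the `docs` table, `open_db`, `sync_source` and the `index_page` callback are all hypothetical.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a database with a hypothetical `docs` table for tracking doc sources."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, url TEXT NOT NULL)")
    return conn


def sync_source(conn, url, page_stream, index_page):
    """Record a doc source and index its pages; undo the record if the stream fails.

    `page_stream` is any iterable of pages and may raise midway, like a crawl
    stream hitting an error. `index_page` is a hypothetical callback that
    stores one page's content elsewhere.
    """
    # The connection context manager commits on success and rolls back
    # if an exception escapes the block.
    with conn:
        doc_id = conn.execute("INSERT INTO docs (url) VALUES (?)", (url,)).lastrowid
        for page in page_stream:
            index_page(doc_id, page)
    return doc_id
```

If the stream raises partway through, the `docs` row is rolled back, so no doc source is left registered without its pages. Content already written by `index_page` to another store would still need its own cleanup.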