This repository has been archived by the owner on Jan 2, 2025. It is now read-only.

feat: index and search documentation #978

Merged Nov 8, 2023 (42 commits)
Commits
bfa1eb1
feat: index and search documentation
oppiliappan Sep 20, 2023
115842a
integrate docs into studios
oppiliappan Sep 22, 2023
96e3424
rework error handling, add verify endpoint
oppiliappan Sep 25, 2023
63bae3d
introduce two-step search
oppiliappan Sep 26, 2023
6e143c7
interleave scraping and embedding
oppiliappan Sep 29, 2023
a2be4b1
add sse to sync and resync endpoints
oppiliappan Oct 3, 2023
a1a1845
continue crawling even if article cannot be parsed
oppiliappan Oct 4, 2023
e73df9a
introduce `list_with_id` to list pages in a provider
oppiliappan Oct 5, 2023
2df7b0e
return placeholder for page title
oppiliappan Oct 5, 2023
aff57ff
restore search endpoint
oppiliappan Oct 5, 2023
7b83c04
add metadata to all scraped docs
oppiliappan Oct 5, 2023
acf5b95
fix total token counts
oppiliappan Oct 5, 2023
e30e9a3
bug fixes & endpoint improvements
oppiliappan Oct 6, 2023
6738a8a
include index status in response, minor fixes
oppiliappan Oct 9, 2023
b3a6fe5
assortment of fixes
oppiliappan Oct 10, 2023
15c4bd7
extract language name class hierarchy for code
oppiliappan Oct 11, 2023
e8e4586
implement content-addressed scheme in qdrant; bug fixes
oppiliappan Oct 11, 2023
a791073
handle redirects gracefully
oppiliappan Oct 12, 2023
5e55654
introduce tantivy index
oppiliappan Oct 26, 2023
f1a982c
clean up search, fix redir bug
oppiliappan Oct 30, 2023
62bdbad
rip out qdrant
oppiliappan Nov 2, 2023
8603b52
assortment of fixes
oppiliappan Nov 3, 2023
1db820d
assortment of bug fixes, unabridged edition
oppiliappan Nov 6, 2023
bdf5ec7
remove chunking logic
oppiliappan Nov 6, 2023
4ea529e
rework log levels to work well with sentry
oppiliappan Nov 6, 2023
ac26035
lower log level to trace
oppiliappan Nov 6, 2023
81d01f5
clippy
oppiliappan Nov 6, 2023
e46bb4e
fix bugs arising from indexing momentjs
oppiliappan Nov 6, 2023
d097eae
add analytics events to /sync
oppiliappan Nov 6, 2023
036b9a1
fix studios
oppiliappan Nov 6, 2023
7ec7f9d
clippy
oppiliappan Nov 6, 2023
a3c9ab4
bug bug bug
oppiliappan Nov 6, 2023
65a2cfc
more url-ness
oppiliappan Nov 6, 2023
7005995
add absolute_url to doc-context-file
oppiliappan Nov 6, 2023
7fe7b3b
address review comments
oppiliappan Nov 7, 2023
45eb337
undo accidental doc test
oppiliappan Nov 7, 2023
f671dbe
Doc indexing FE (#1118)
anastasiya1155 Nov 7, 2023
e79c09c
fix selecting section after search in doc modal
anastasiya1155 Nov 7, 2023
749b1db
address review comments
oppiliappan Nov 7, 2023
ffd5db7
attempt to fix duplication
oppiliappan Nov 7, 2023
f80d820
fix issue with python tutorial page
oppiliappan Nov 7, 2023
727e0b0
address clippy
oppiliappan Nov 8, 2023
fix issue with python tutorial page
oppiliappan committed Nov 7, 2023
commit f80d820c75a5386f32cb43fa50280b758e98d8b1
2 changes: 1 addition & 1 deletion server/bleep/src/scraper/article.rs
@@ -24,7 +24,7 @@ use std::{
};

static RE_BAD_NODES_ATTR: Lazy<Regex> = Lazy::new(|| {
-Regex::new(r###"(?mi)^side$|combx|retweet|mediaarticlerelated|menucontainer|navbar|storytopbar-bucket|utility-bar|inline-share-tools|comment|PopularQuestions|contact|foot(er|note)?|cnn_strycaptiontxt|cnn_html_slideshow|cnn_strylftcntnt|links|meta$|shoutbox|sponsor|tags|socialnetworking|socialNetworking|cnnStryHghLght|cnn_stryspcvbx|^inset$|pagetools|post-attributes|welcome_form|contentTools2|the_answers|communitypromo|runaroundLeft|subscribe|vcard|articleheadings|date|^print$|popup|author-dropdown|tools|socialtools|byline|konafilter|breadcrumbs|^fn$|wp-caption-text|legende|ajoutVideo|timestamp|js_replies|[^-]facebook(-broadcasting)?|google|[^-]twitter|styln-briefing-block|read-more-link|js-body-read-more"###).unwrap()
+Regex::new(r###"(?mi)^side$|combx|retweet|mediaarticlerelated|menucontainer|navbar|storytopbar-bucket|utility-bar|inline-share-tools|comment|PopularQuestions|contact|foot(er|note)?|cnn_strycaptiontxt|cnn_html_slideshow|cnn_strylftcntnt|links|meta$|shoutbox|sponsor|tags|socialnetworking|socialNetworking|cnnStryHghLght|cnn_stryspcvbx|^inset$|pagetools|post-attributes|welcome_form|contentTools2|the_answers|communitypromo|runaroundLeft|subscribe|vcard|articleheadings|date|^print$|popup|author-dropdown|socialtools|byline|konafilter|breadcrumbs|^fn$|wp-caption-text|legende|ajoutVideo|timestamp|js_replies|[^-]facebook(-broadcasting)?|google|[^-]twitter|styln-briefing-block|read-more-link|js-body-read-more"###).unwrap()
});
const PUNCTUATION: &str = r#",."'!?&-/:;()#$%*+<=>@[\]^_`{|}~"#;
const ARTICLE_BODY_ATTR: &[(&str, &str); 3] = &[
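The only change in this commit is dropping the bare `tools` alternative from `RE_BAD_NODES_ATTR`. Because that alternative is unanchored, it matches any attribute value that merely contains `tools`. The commit message does not say which element on the Python tutorial page was affected, so the following is only a rough sketch of the effect. It uses plain case-insensitive substring matching as a stand-in for the regex (it does not use the `regex` crate or model the anchored alternatives), and `is_bad_attr`, the token subsets, and the `dev-tools` value are all hypothetical.

```rust
/// Hypothetical stand-in for the unanchored alternatives in RE_BAD_NODES_ATTR:
/// a case-insensitive substring test against a list of lowercase tokens.
fn is_bad_attr(tokens: &[&str], attr: &str) -> bool {
    let attr = attr.to_lowercase();
    tokens.iter().any(|t| attr.contains(*t))
}

// Small subsets of the pattern's alternatives, before and after the commit.
const BEFORE: &[&str] = &["navbar", "pagetools", "tools", "socialtools", "byline"];
const AFTER: &[&str] = &["navbar", "pagetools", "socialtools", "byline"];

fn main() {
    // "dev-tools" and "article-body" are invented attribute values for illustration.
    for attr in ["dev-tools", "pagetools", "article-body"] {
        println!(
            "{attr}: before={} after={}",
            is_bad_attr(BEFORE, attr),
            is_bad_attr(AFTER, attr)
        );
    }
}
```

In this model, a value such as `dev-tools` is flagged as a bad node before the change and kept after it, while the more specific tokens that remain in the pattern (`pagetools`, `socialtools`) still match.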