Just writing down the current workflow – it could surely be improved.
1: Assemble the URL Lists

Use the warcbase workflow to extract URLs, akin to:
```scala
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("/directory/to/arc/file.arc.gz", sc)
  .keepValidPages()
  .map(r => r.getUrl)
  .saveAsTextFile("/path/to/export/directory/")
```
2: Put into format for Crawl Visualization

Right now, we've got a layout incompatibility. This (lazy person's) code works:
```shell
sed -i -- 's/((//g' *
sed -i -- 's/,/ /g' *
sed -i -- 's/)//g' *
```
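The same cleanup can be sketched in Python rather than with in-place sed edits (a minimal sketch; it only assumes the tuple-style markup being stripped is exactly what the three sed substitutions target):

```python
import pathlib

def clean_line(line: str) -> str:
    """Mirror the three sed substitutions: drop '((', turn commas
    into spaces, and drop ')'."""
    return line.replace("((", "").replace(",", " ").replace(")", "")

def clean_file(path: pathlib.Path) -> None:
    """Rewrite a URL-list file in place, like sed -i."""
    cleaned = [clean_line(line) for line in path.read_text().splitlines()]
    path.write_text("\n".join(cleaned) + "\n")
```

Running `clean_file` over every file in the export directory replicates the sed pass.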
3: Process

Run this script in the directory with the fixed URLs, changing paths as necessary.
```bash
#!/bin/bash
for filename in *.txt; do
  filenameshort=${filename/-all.txt/}
  echo $filenameshort
  cp $filename /Users/ianmilligan1/dropbox/git/WALK-CrawlVis/crawl-sites/raw.txt
  python2 /Users/ianmilligan1/dropbox/git/WALK-CrawlVis/crawl-sites/process.py > /Users/ianmilligan1/dropbox/git/WALK-CrawlVis/crawl-sites/$filenameshort.csv
  sed "s/data.csv/$filenameshort.csv/g" /Users/ianmilligan1/dropbox/git/WALK-CrawlVis/crawl-sites/index.html > /Users/ianmilligan1/dropbox/git/WALK-CrawlVis/crawl-sites/$filenameshort.html
done
```
What does it do? For each URL list, it copies the file into the visualization directory, runs the processing script to produce a data file (CSV), and writes an HTML page pointing at that data file.
Not pretty, but works right now.
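The loop above can also be sketched in Python; the paths are carried over from the bash script, and the behaviour of `process.py` (reading `raw.txt`, emitting CSV on stdout) is an assumption inferred from how the script invokes it:

```python
import pathlib
import shutil
import subprocess

# Base directory, copied from the paths in the bash script.
BASE = pathlib.Path("/Users/ianmilligan1/dropbox/git/WALK-CrawlVis/crawl-sites")

def short_name(filename: str) -> str:
    """Mirror bash's ${filename/-all.txt/}: strip the '-all.txt' suffix."""
    return filename.replace("-all.txt", "")

def process_all(source_dir: pathlib.Path) -> None:
    index_template = (BASE / "index.html").read_text()
    for url_list in sorted(source_dir.glob("*.txt")):
        name = short_name(url_list.name)
        # Stage the raw URL list where process.py (assumed) expects it.
        shutil.copy(url_list, BASE / "raw.txt")
        # Run the processing script and capture its CSV output.
        result = subprocess.run(
            ["python2", str(BASE / "process.py")],
            capture_output=True, text=True, check=True,
        )
        (BASE / f"{name}.csv").write_text(result.stdout)
        # Point a copy of index.html at the new data file.
        (BASE / f"{name}.html").write_text(
            index_template.replace("data.csv", f"{name}.csv"))
```

Same idea, one process per file; the only real difference is that the template substitution happens in memory instead of via sed.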
4: Results

https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/alberta_education_curriculum.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/alberta_floods_2013.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/alberta_oil_sands.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/canadian_business_grey_literature.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/elxn42.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/energy_environment.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/hcf_alberta_online_encyclopedia.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/health_sciences_grey_literature.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/heritage_community_foundation.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/humanities_computing.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/idle_no_more.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/lfrancophonie_de_louest_canadien.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/ottawa_shooting_october_2014.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/prarie_provinces.html
https://ianmilligan1.github.io/WALK-CrawlVis/crawl-sites/web_archive_general.html