Workflow #3

ianmilligan1 · 2016-08-05T18:36:12Z

Just writing down the current workflow – it could surely be improved.

1: Assemble the URL Lists

Use the warcbase workflow to extract URLs, akin to:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

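// Load the archive, keep only valid pages, and write one URL per line.
// saveAsTextFile returns nothing, so there is no result to assign.
RecordLoader.loadArchives("/directory/to/arc/file.arc.gz", sc)
  .keepValidPages()
  .map(r => r.getUrl)
  .saveAsTextFile("/path/to/export/directory/")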
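
Note that saveAsTextFile writes a directory of part-NNNNN files rather than a single text file, while the script in step 3 loops over files ending in -all.txt. A minimal sketch of one way to bridge the two, assuming one export directory per collection; the helper name merge_parts and the collection name are hypothetical and not part of the workflow above:

```shell
# Concatenate the part-* files Spark wrote into a single text file.
# merge_parts is a hypothetical helper name.
merge_parts() {
    export_dir=$1   # directory that was passed to saveAsTextFile
    out_file=$2     # e.g. mycollection-all.txt
    cat "$export_dir"/part-* > "$out_file"
}

# Usage (paths are placeholders):
# merge_parts /path/to/export/directory mycollection-all.txt
```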

2: Put into format for Crawl Visualization

Right now, we've got a layout incompatibility. This (lazy person's) code works:

sed -i -- 's/((//g' *
sed -i -- 's/,/ /g' *
sed -i -- 's/)//g' *
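
To see what those three substitutions do, here is a small sketch that applies them to a single line on stdin instead of editing files in place. The input line is an assumption about what a tuple-style export might look like, not actual output, and fix_layout is a hypothetical helper name:

```shell
# fix_layout is a hypothetical helper: it applies the same three
# substitutions to stdin rather than rewriting files with -i.
fix_layout() {
    sed -e 's/((//g' -e 's/,/ /g' -e 's/)//g'
}

# Hypothetical input line; the real export format may differ.
printf '((201508,example.com),12)\n' | fix_layout
# prints: 201508 example.com 12
```

One caveat, as far as I can tell: BSD/macOS sed treats the argument after -i as a backup suffix, so `sed -i --` there may leave backup copies whose names end in `--` next to the edited files.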

3: Process

Run this script in the directory with the fixed URLs, changing paths as necessary.

#!/bin/bash
# Location of the crawl-sites visualization; change as necessary.
crawlsites=/Users/ianmilligan1/dropbox/git/WALK-CrawlVis/crawl-sites
for filename in *.txt; do
    # Drop the "-all.txt" suffix to get the short collection name.
    filenameshort=${filename/-all.txt/}
    echo "$filenameshort"
    # Stage this list as raw.txt, run process.py, and save its output as CSV.
    cp "$filename" "$crawlsites/raw.txt"
    python2 "$crawlsites/process.py" > "$crawlsites/$filenameshort.csv"
    # Write a copy of index.html that points at this collection's CSV.
    sed "s/data.csv/$filenameshort.csv/g" "$crawlsites/index.html" > "$crawlsites/$filenameshort.html"
done

What does it do? For each URL list, it copies the file to raw.txt, runs process.py to create a CSV data file, and writes a copy of index.html that points at that data file.

Not pretty, but works right now.
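
The last step of the loop is plain text templating: every occurrence of data.csv in index.html is swapped for the collection's own CSV name. A minimal sketch of that substitution on its own, using a made-up template line (the real index.html is not shown here) and a hypothetical helper name render_index:

```shell
# render_index is a hypothetical helper mirroring the sed call in the loop:
# it rewrites every "data.csv" on stdin to "<name>.csv".
render_index() {
    sed "s/data.csv/$1.csv/g"
}

# Hypothetical template line, for illustration only.
printf 'd3.csv("data.csv", draw);\n' | render_index mycollection
# prints: d3.csv("mycollection.csv", draw);
```

Because the collection name is spliced straight into the sed expression, a name containing a slash or an ampersand would break the substitution; that seems unlikely for these file names but is worth knowing.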

4: Results
