Comparing Web Archival Collections with Count

Code primarily developed by Ryan Deschamps, a research assistant with the Web Archives for Historical Research Group, in collaboration with Ian Milligan.

This is a rough draft of the documentation. We are using this for scholarly research right now, so it will change as we work through test cases and beyond. Between this and the well-commented Jupyter notebook file, you should be good. If not, please feel free to open an issue.

What this allows you to do is take collections and see the overlap between them: which domains they share and which they do not. If you use more than three collections, it generates a correspondence analysis rather than a Venn diagram. Our goal is that this might help you in collections development, or just plain ol' figuring out what you have in a collection.
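The overlap idea can be sketched with plain Python sets. This is a minimal illustration and not the notebook's own code: the function name `venn_regions` and the sample domains are hypothetical, and it only covers the three-collection case that a Venn diagram can show.

```python
def venn_regions(a, b, c):
    """Split three sets of domains into the seven regions of a Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "only_a": a - b - c,
        "only_b": b - a - c,
        "only_c": c - a - b,
        "a_and_b": (a & b) - c,
        "a_and_c": (a & c) - b,
        "b_and_c": (b & c) - a,
        "all_three": a & b & c,
    }


# Made-up sample data, for illustration only.
collection_a = {"example.com", "example.org", "example.net"}
collection_b = {"example.org", "example.net", "example.edu"}
collection_c = {"example.net", "example.info"}

regions = venn_regions(collection_a, collection_b, collection_c)
for name, domains in regions.items():
    print(name, sorted(domains))
```

The size of each region is the number a Venn diagram would display in the matching area.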

Installation

Clone this repo, or simply download the Jupyter notebook found here.

This was tested using Anaconda. Its base packages cover most dependencies, but you will need to install mca and matplotlib_venn. On OS X, we used pip:

pip install mca
pip install matplotlib_venn
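To confirm that both packages are visible to the Python environment that runs the notebook, a quick check along these lines may help. It is only a sketch: `missing_packages` is a hypothetical helper, and it assumes the import names match the package names used in the pip commands.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the two extra packages mentioned above.
missing = missing_packages(["mca", "matplotlib_venn"])
if missing:
    print("Still need to install:", ", ".join(missing))
else:
    print("All extra packages are available.")
```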

Generating Input Data

This script currently compares collections by taking their domain coverage and finding overlaps and differences. You will need data.

By default, the script takes output from warcbase. Documentation can be found here. We generated domains using a script like so:

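import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val elxn42 =
  // load the archive files whose names contain 201511
  RecordLoader.loadArchives("/mnt/vol1/data_sets/cpp/cpp_warcs_accession_02/*201511*.gz", sc)
  // keep only records that are valid web pages
  .keepValidPages()
  // turn each record into a (crawl month, domain) pair
  .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl)))
  // count how often each pair occurs
  .countItems()
  // write the counts out as part files
  .saveAsTextFile("/mnt/vol1/derivative_data/cpp-counted-domains-month-201511")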

Results will come in a series of part files. Join them together:

cat part* > domains-counted-201511.txt
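If you want to inspect the joined file before opening the notebook, it can be parsed in a few lines of Python. This sketch assumes each line has the shape `((crawl_month,domain),count)`, which is what we would expect from the script above but which you should check against your own output. The function name `parse_counted_domains` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape: ((crawl_month,domain),count)
LINE_PATTERN = re.compile(r"^\(\((\d{6}),(.*)\),(\d+)\)$")


def parse_counted_domains(lines):
    """Sum the counts per domain across all crawl months."""
    totals = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip blank or unexpected lines
        _month, domain, count = match.groups()
        totals[domain] += int(count)
    return totals


# Made-up sample lines, for illustration only.
sample = [
    "((201511,example.com),120)",
    "((201511,example.org),45)",
    "((201512,example.com),30)",
    "",
]
totals = parse_counted_domains(sample)
print(totals.most_common())
```

To run it on real data, pass an open file instead of the sample list, for example `parse_counted_domains(open("domains-counted-201511.txt"))`.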

Generating Concordances

Open the Jupyter notebook and run it. You should only need to change the following line so that it points to the directory containing the txt files you generated in the step above.

#establish the data folder
path = "/Users/ianmilligan1/dropbox/git/WALK/Data/Domains/"
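Before running the notebook, it can be worth checking that the folder actually contains the txt files you expect. The sketch below is not taken from the notebook; `find_domain_files` is a hypothetical helper, and the throwaway folder stands in for your own data directory.

```python
import tempfile
from pathlib import Path


def find_domain_files(folder):
    """Map each .txt file's name (without extension) to its path, sorted by name."""
    return {p.stem: p for p in sorted(Path(folder).glob("*.txt"))}


# Demonstrate with a throwaway folder instead of a real data directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "domains-counted-201511.txt").write_text("")
    (Path(tmp) / "notes.md").write_text("")  # ignored: not a .txt file
    found = find_domain_files(tmp)
    names = list(found)

print(names)
```

With your own data, call `find_domain_files(path)` using the same `path` value you set in the notebook.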

For NER overlap, there is another line you will need to change:

#establish the data folder
loc_path = "../../NER/"

Change this to the directory with your NER output.

Everything else should be described in comments.

Acknowledgements

Support for this project comes from an Ontario Ministry of Research and Innovation Early Researcher Award and a Compute Canada Research Platforms and Portals award.
