internetarchive: An R client to the Internet Archive API

This API client for the Internet Archive is intended primarily for searching for items, retrieving metadata for items, and downloading the files associated with items. The functions can be used with the pipe operator (%>%) from magrittr and the data manipulation verbs in dplyr to create pipelines from searching to downloading. For the full details of what is possible with the Internet Archive API, see their advanced search help.

Installation

Install the development version from GitHub.

# install.packages("devtools")
devtools::install_github("ropensci/internetarchive", build_vignettes = TRUE)

Then load the package. We will also use dplyr for manipulating the retrieved data.

library("internetarchive")
library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> 
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> 
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Basic search and browse

The simplest way to search the Internet Archive is to use a keyword search. The ia_keyword_search() function searches for the given keywords in the most important metadata fields and returns a character vector of item identifiers.

ia_keyword_search("isaac hecker")
#> 19 total items found. This query requested 5 results.
#> [1] "TheLifeOfFatherHecker"  "fatherhecker01sedg"    
#> [3] "fatherhecker00sedggoog" "lifeoffatherheck01elli"
#> [5] "lifeoffatherheck00elli"

You can pass an item identifier to the ia_browse() function to open an item in your browser. If you pass this function multiple identifiers, it will open only the first one.

ia_browse("TheLifeOfFatherHecker")

Advanced search

Usually it is more useful to perform an advanced search. You can construct an advanced search as a named character vector, where the names correspond to the fields. The following search, for instance, looks for items published by the American Tract Society in 1864. Run the function ia_list_fields() to see the list of accepted metadata fields.

ats_query <- c("publisher" = "american tract society", "year" = "1864")
ia_search(ats_query, num_results = 20)
#> 3 total items found. This query requested 20 results.
#> [1] "vitalgodlinessa00plumgoog" "huguenotsfrance00martgoog"
#> [3] "sketcheseloquen00wategoog"

You can change the number of items returned by the search using the num_results = argument, and you can request subsequent pages of results with the page = argument.
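
As a rough illustration of how the two arguments fit together, the number of pages needed follows from the total reported by a search and the page size. This is only a sketch: it reuses the figure of 19 items from the keyword search above, assumes that pages are numbered from 1, and leaves the actual request commented out because it needs a network connection.

```r
# Total reported by a search ("19 total items found") and a chosen page size.
total_items <- 19
per_page <- 5

# Number of pages needed to see every result.
n_pages <- ceiling(total_items / per_page)
pages <- seq_len(n_pages)
print(pages)

# Each page could then be requested in turn (assumes numbering starts at 1):
# for (p in pages) ia_search(ats_query, num_results = per_page, page = p)
```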

Notice that ia_search() and ia_keyword_search() both return a character vector of identifiers, so both can be used in the same way at the beginning of a pipeline.
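
Because both functions return plain character vectors, results from several searches can be combined with base R before being passed down a pipeline. The sketch below uses hand-written vectors as stand-ins for live search results; the variable names and the overlap between the two vectors are illustrative only.

```r
# Stand-ins for the character vectors returned by ia_keyword_search()
# and ia_search(). The identifiers are copied from the outputs above,
# but the overlap between the two vectors is made up for illustration.
keyword_ids <- c("TheLifeOfFatherHecker", "fatherhecker01sedg")
advanced_ids <- c("fatherhecker01sedg", "vitalgodlinessa00plumgoog")

# Combine both result sets and drop duplicate identifiers.
all_ids <- unique(c(keyword_ids, advanced_ids))
print(all_ids)
```

The combined vector can be passed to ia_get_items() in the same way as the result of a single search.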

Dates

To search by a date range, use the date field with two years (or ISO 8601 dates) separated by TO. Here we search for publications by the American Tract Society dated from 1840 to 1850.

ia_search(c("publisher" = "american tract society", date = "1840 TO 1850"))
#> 88 total items found. This query requested 5 results.
#> [1] "scripturebiogra00hookgoog" "memoirmrssarahl00hookgoog"
#> [3] "historyreformat22aubgoog"  "circulationandc00socigoog"
#> [5] "historyreformat09aubgoog"
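
If you build many date queries, the range string can be assembled programmatically. The helper below is hypothetical and not part of the package; it only pastes two values together in the form shown above.

```r
# Hypothetical helper (not provided by internetarchive): build the
# "<from> TO <to>" string expected by the date field.
date_range <- function(from, to) {
  stopifnot(length(from) == 1, length(to) == 1)
  paste(from, "TO", to)
}

# Build the same named character vector used in the example above.
query <- c("publisher" = "american tract society",
           "date" = date_range(1840, 1850))
print(query)

# The query could then be run with: ia_search(query)
```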

Getting item metadata and files

Once you have retrieved a list of items, you can retrieve their metadata and the list of files associated with the items.

To get a single item's metadata, you can pass its identifier to the ia_get_items() function.

hecker <- ia_get_items("TheLifeOfFatherHecker")
#> Getting TheLifeOfFatherHecker

The result is a list whose names are the item identifiers and whose elements hold the metadata for each item. This nested list can be difficult to work with, so ia_metadata() returns a data frame of the metadata, and ia_files() returns a data frame of the files associated with the item.

ia_metadata(hecker)
#> Source: local data frame [25 x 3]
#> 
#>                       id       field
#> 1  TheLifeOfFatherHecker  identifier
#> 2  TheLifeOfFatherHecker   mediatype
#> 3  TheLifeOfFatherHecker collection1
#> 4  TheLifeOfFatherHecker collection2
#> 5  TheLifeOfFatherHecker collection3
#> 6  TheLifeOfFatherHecker     creator
#> 7  TheLifeOfFatherHecker        date
#> 8  TheLifeOfFatherHecker description
#> 9  TheLifeOfFatherHecker    language
#> 10 TheLifeOfFatherHecker  licenseurl
#> ..                   ...         ...
#> Variables not shown: value (chr)
ia_files(hecker)
#> Source: local data frame [14 x 3]
#> 
#>                       id                                   file    type
#> 1  TheLifeOfFatherHecker            /TheLifeOfFatherHecker.djvu    djvu
#> 2  TheLifeOfFatherHecker            /TheLifeOfFatherHecker.epub    epub
#> 3  TheLifeOfFatherHecker             /TheLifeOfFatherHecker.gif     gif
#> 4  TheLifeOfFatherHecker             /TheLifeOfFatherHecker.pdf     pdf
#> 5  TheLifeOfFatherHecker        /TheLifeOfFatherHecker_abbyy.gz      gz
#> 6  TheLifeOfFatherHecker /TheLifeOfFatherHecker_archive.torrent torrent
#> 7  TheLifeOfFatherHecker        /TheLifeOfFatherHecker_djvu.txt     txt
#> 8  TheLifeOfFatherHecker        /TheLifeOfFatherHecker_djvu.xml     xml
#> 9  TheLifeOfFatherHecker       /TheLifeOfFatherHecker_files.xml     xml
#> 10 TheLifeOfFatherHecker         /TheLifeOfFatherHecker_jp2.zip     zip
#> 11 TheLifeOfFatherHecker     /TheLifeOfFatherHecker_meta.sqlite  sqlite
#> 12 TheLifeOfFatherHecker        /TheLifeOfFatherHecker_meta.xml     xml
#> 13 TheLifeOfFatherHecker    /TheLifeOfFatherHecker_scandata.xml     xml
#> 14 TheLifeOfFatherHecker        /TheLifeOfFatherHecker_text.pdf     pdf

These functions can also retrieve the information for multiple items when used in a pipeline. Here we search for all the items about Hecker, retrieve their metadata, and turn it into a data frame. We then filter the data frame to get only the titles.

ia_keyword_search("isaac hecker", num_results = 20) %>% 
  ia_get_items() %>% 
  ia_metadata() %>% 
  filter(field == "title") %>% 
  select(value)
#> 19 total items found. This query requested 20 results.
#> Getting TheLifeOfFatherHecker
#> Getting fatherhecker01sedg
#> Getting fatherhecker00sedggoog
#> Getting lifeoffatherheck01elli
#> Getting lifeoffatherheck00elli
#> Getting abitunpublished00heckgoog
#> Getting ERIC_ED250755
#> Getting TheLightOfTheCrossV2
#> Getting questionsofsoul00heck
#> Getting questionssoul01heckgoog
#> Getting catholicchurchi00heckgoog
#> Getting questionssoul00heckgoog
#> Getting cu31924031386414
#> Getting aspirationsofnat00heck
#> Getting uncatholicismeam00dela
#> Getting a587173700heckuoft
#> Getting cu31924029381013
#> Getting a589111500unknuoft
#> Getting aspirationsofnat00heckuoft
#> Source: local data frame [19 x 1]
#> 
#>                                                                          value
#> 1                                                    The Life Of Father Hecker
#> 2                                                                Father Hecker
#> 3                                                                Father Hecker
#> 4                                                    The life of Father Hecker
#> 5                                                    The life of Father Hecker
#> 6  A Bit of Unpublished Correspondence Between Henry D. Thoreau and Isaac T. H
#> 7  ERIC ED250755: Rhetoric and Public Address: Abstracts of Doctoral Dissertat
#> 8  Volume 2: The Light Of The Cross In The Twentieth Century; the influence of
#> 9                                                        Questions of the soul
#> 10                                                       Questions of the Soul
#> 11 The Catholic Church in the United States: Its Rise, Relations with the Repu
#> 12                                                       Questions of the Soul
#> 13                                                       Questions of the soul
#> 14                                                       Aspirations of nature
#> 15                                                   Un catholicisme américain
#> 16                                                       Aspirations of nature
#> 17 The church and the age; an exposition of the Catholic Church in view of the
#> 18 Die Kirche betrachtet mit Rücksicht auf die gegenwärtigen Streitfragen und 
#> 19                                                       Aspirations of nature

Downloading files

The ia_download() function downloads all the files in a data frame returned from ia_files(). Use this function with caution: filter the data frame first so that you download only the files you want. In the following example, we retrieve a list of all the files associated with items published by the American Tract Society in 1864. We filter the list to keep only text files, then pick the first text file associated with each item. Finally, we download the files to a directory we specify (in this case, a temporary directory).

dir <- tempdir()
ia_search(ats_query) %>% 
  ia_get_items() %>% 
  ia_files() %>% 
  filter(type == "txt") %>% 
  group_by(id) %>% 
  slice(1) %>% 
  ia_download(dir = dir, overwrite = FALSE) %>% 
  glimpse()
#> 3 total items found. This query requested 5 results.
#> Getting vitalgodlinessa00plumgoog
#> Getting huguenotsfrance00martgoog
#> Getting sketcheseloquen00wategoog
#> Downloading /var/folders/k3/yk84g4bd50b1mrltx28c_0280000gn/T//RtmpV6NIT9/huguenotsfrance00martgoog-huguenotsfrance00martgoog_djvu.txt
#> Downloading /var/folders/k3/yk84g4bd50b1mrltx28c_0280000gn/T//RtmpV6NIT9/sketcheseloquen00wategoog-sketcheseloquen00wategoog_djvu.txt
#> Downloading /var/folders/k3/yk84g4bd50b1mrltx28c_0280000gn/T//RtmpV6NIT9/vitalgodlinessa00plumgoog-vitalgodlinessa00plumgoog_djvu.txt
#> Observations: 3
#> Variables:
#> $ id         (chr) "huguenotsfrance00martgoog", "sketcheseloquen00wate...
#> $ file       (chr) "/huguenotsfrance00martgoog_djvu.txt", "/sketchesel...
#> $ type       (chr) "txt", "txt", "txt"
#> $ url        (chr) "https://archive.org/download/huguenotsfrance00mart...
#> $ local_file (chr) "/var/folders/k3/yk84g4bd50b1mrltx28c_0280000gn/T//...
#> $ downloaded (lgl) TRUE, TRUE, TRUE

Notice that ia_download() returns a modified version of the data frame that was passed to it, with added columns such as local_file, which holds the path to each downloaded file.
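
The local_file column makes it straightforward to read the downloaded files back into R. The sketch below does not call the package: it builds a small mock data frame with the id, local_file, and downloaded columns shown above, and writes a temporary file so that the example runs offline. The identifier and file contents are made up.

```r
# Mock of the data frame returned by ia_download(), limited to three of
# the columns shown above. The file is created here only so that the
# example runs without a network connection.
mock_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), mock_path)
downloads <- data.frame(
  id = "example_item",        # hypothetical identifier
  local_file = mock_path,
  downloaded = TRUE,
  stringsAsFactors = FALSE
)

# Keep only the rows that were downloaded, then read each file into a
# list of character vectors named by item identifier.
ok <- downloads[downloads$downloaded, ]
texts <- lapply(ok$local_file, readLines)
names(texts) <- ok$id
print(lengths(texts))
```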

If the overwrite = argument is FALSE, then you can pass the same data frame of files to ia_download() and it will download only the files that it has not already downloaded.

