# archivekit

`archivekit` provides a mechanism for storing a (large) set of immutable documents and data files in an organized way. Transformed versions of each file can be stored alongside the original data in order to reflect a complete processing chain. Metadata is kept with the data as a YAML file.

This library is inspired by OFS, BagIt and Pairtree. It replaces a previous project, docstash.
## Installation

The easiest way of using `archivekit` is via PyPI:

```
$ pip install archivekit
```

Alternatively, check out the repository from GitHub and install it locally:

```
$ git clone https://github.com/pudo/archivekit.git
$ cd archivekit
$ python setup.py develop
```
## Usage

`archivekit` manages `Packages` which contain one or several `Resources` and their associated metadata. Each `Package` is part of a `Collection`.
```python
from archivekit import open_collection, Source

# open a collection of packages
collection = open_collection('file', path='/tmp')

# or via S3:
collection = open_collection('s3', aws_key_id='..', aws_secret='..',
                             bucket_name='test.pudo.org')

# import a file from the local working directory:
collection.ingest('README.md')

# import an http resource:
collection.ingest('http://pudo.org/index.html')
# ingest will also accept file objects and httplib/urllib/requests responses

# iterate through each document and set a metadata value:
for package in collection:
    for source in package.all(Source):
        with source.fh() as fh:
            source.meta['body_length'] = len(fh.read())
        package.save()
```
The code for this library is very compact; go check it out.
If AWS credentials are not supplied for an S3-based collection, the application will attempt to use the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables. `AWS_BUCKET_NAME` is also supported.
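The precedence described above — explicit arguments first, environment variables as a fallback — can be sketched as follows. Note that `resolve_aws_config` is a hypothetical helper written for illustration, not part of the `archivekit` API:

```python
import os

def resolve_aws_config(aws_key_id=None, aws_secret=None, bucket_name=None):
    """Illustrative sketch of the fallback behaviour: explicit keyword
    arguments take precedence; otherwise the corresponding environment
    variables are consulted."""
    return {
        'aws_key_id': aws_key_id or os.environ.get('AWS_ACCESS_KEY_ID'),
        'aws_secret': aws_secret or os.environ.get('AWS_SECRET_ACCESS_KEY'),
        'bucket_name': bucket_name or os.environ.get('AWS_BUCKET_NAME'),
    }
```

With this scheme, a call that passes only `bucket_name` would still pick up credentials from the environment.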
`archivekit` is open source, licensed under a standard MIT license (included in this repository as `LICENSE`).