archivekit


archivekit provides a mechanism for storing a (large) set of immutable documents and data files in an organized way. Transformed versions of each file can be stored alongside the original data in order to reflect a complete processing chain. Metadata is kept with the data as a YAML file.

This library is inspired by OFS, BagIt and Pairtree. It replaces a previous project, docstash.

Installation

The easiest way to install archivekit is from PyPI:

$ pip install archivekit

Alternatively, check out the repository from GitHub and install it locally:

$ git clone https://github.com/pudo/archivekit.git
$ cd archivekit
$ python setup.py develop

Example

archivekit manages Packages, which contain one or more Resources and their associated metadata. Each Package is part of a Collection.

from archivekit import open_collection, Source

# open a collection of packages
collection = open_collection('file', path='/tmp')

# or via S3:
collection = open_collection('s3', aws_key_id='..', aws_secret='..',
                             bucket_name='test.pudo.org')

# import a file from the local working directory:
collection.ingest('README.md')

# import an http resource:
collection.ingest('http://pudo.org/index.html')
# ingest will also accept file objects and httplib/urllib/requests responses

# iterate through each document and set a metadata
# value:
for package in collection:
    for source in package.all(Source):
        with source.fh() as fh:
            source.meta['body_length'] = len(fh.read())
    package.save()
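
As the comment above notes, ingest also accepts open file objects and httplib/urllib/requests responses. A minimal sketch of those two cases, assuming the same collection as above and using the requests library (the URL is just the one from the earlier example):

import requests

# ingest an open file handle:
with open('README.md', 'rb') as fh:
    collection.ingest(fh)

# ingest a requests response object:
res = requests.get('http://pudo.org/index.html')
collection.ingest(res)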

The code for this library is very compact; go check it out.

Configuration

If AWS credentials are not supplied for an S3-based collection, archivekit will attempt to read the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. AWS_BUCKET_NAME is also supported.
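
A minimal sketch of that fallback, assuming the variables are exported before the collection is opened (the values below are placeholders):

import os
from archivekit import open_collection

# placeholder credentials; normally these would be set in the shell
os.environ['AWS_ACCESS_KEY_ID'] = '...'
os.environ['AWS_SECRET_ACCESS_KEY'] = '...'
os.environ['AWS_BUCKET_NAME'] = 'test.pudo.org'

# with the variables set, the AWS keyword arguments can be omitted
collection = open_collection('s3')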

License

archivekit is open source, licensed under a standard MIT license (included in this repository as LICENSE).
