Tool to allow merging of multiple WARC files into a single WARC
- Tested on Ubuntu Linux
- Requires Python 2.7+
- Requires Java to run Jwattools for validating WARC files
- Requires the warc Python library from Internet Archive to work with WARC files and WARC records.
WARCMerge can be run in one of three ways. In each of them, adding the '-q' option runs the program in quiet mode, in which it does not display any messages:
%python <input-directory> <output-directory>
This will merge all WARC files found in "input-directory" and store the resulting output file(s) in "output-directory".
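As an illustration of the directory mode, the following is a minimal sketch of how the input files might be collected. It assumes a recursive search (Example 1 lists files from two different subdirectories) and assumes that only names ending in `.warc` count. The function name `find_warc_files` is hypothetical and is not taken from WARCMerge itself.

```python
import os


def find_warc_files(input_dir):
    """Return a sorted list of paths to files ending in .warc under input_dir.

    Hypothetical helper: the recursive search and the ".warc" suffix test
    are assumptions, not WARCMerge's documented behaviour.
    """
    found = []
    # os.walk visits input_dir and every subdirectory beneath it.
    for dirpath, _dirnames, filenames in os.walk(input_dir):
        for name in filenames:
            if name.lower().endswith(".warc"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

The list is sorted only to make the result deterministic; the order in which WARCMerge itself processes files may differ.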
%python <file1> <file2> <file3> ... <output-directory>
Here, all listed WARC files will be merged and the resulting output file(s) will be stored in "output-directory".
%python -a <source-file> <dest-file>
The command line above appends the source WARC file "source-file" to the end of the destination WARC file "dest-file". The "-a" flag is required so that any change to "dest-file" is made intentionally.
In all cases, the program checks whether the resulting WARC files are valid.
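Because a WARC file is a plain concatenation of WARC records, appending one uncompressed WARC file to another can be pictured as a byte-level append. The sketch below shows that idea using only the standard library. It is not necessarily how WARCMerge performs the operation (it may rewrite records through the warc library instead), and the function name `append_warc` is hypothetical.

```python
import shutil


def append_warc(source_path, dest_path):
    """Append the raw bytes of source_path to the end of dest_path.

    Hypothetical helper: assumes both files are well-formed, uncompressed
    WARC files, so that the result is again a sequence of complete records.
    """
    # "ab" opens the destination for binary appending, so the records
    # already in dest_path are kept and the new bytes go after them.
    with open(source_path, "rb") as src, open(dest_path, "ab") as dest:
        shutil.copyfileobj(src, dest)
```

Note that opening with "ab" creates the destination if it does not exist, and that this sketch performs no validation; the result would still need to be checked by a WARC validator.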
###Example 1: Merging WARC files (found in "input-directory") into new WARC file(s):
%python ./collectionExample/ my-output-dir
Merging the following WARC files:
[Yes] ./collectionExample/world-cup/20140707174317773.warc
[Yes] ./collectionExample/warcs/20140707160258526.warc
[Yes] ./collectionExample/warcs/20140707160041872.warc
[Yes] ./collectionExample/world-cup/20140707183044349.warc
Validating the resulting WARC files:
- [valid] my-output-dir/WARCMerge20140806040712197944.warc
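The output file name in the listing above looks like the prefix "WARCMerge" followed by a 20-digit timestamp (year, month, day, hour, minute, second, microseconds). That reading is an inference from the sample output, not documented behaviour. Under that assumption, a name of the same shape could be produced as follows; the function name `output_name` is hypothetical.

```python
from datetime import datetime


def output_name(now=None):
    """Build a file name shaped like the one in the sample output.

    Hypothetical helper: the timestamp layout is inferred from the example
    name and may not match what WARCMerge actually does.
    """
    if now is None:
        now = datetime.now()
    # %f is the microsecond field, zero-padded to six digits.
    return "WARCMerge" + now.strftime("%Y%m%d%H%M%S%f") + ".warc"
```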
###Example 2: Merging all listed WARC files into new WARC file(s)
%python 585.warc 472.warc ./dir1/113.warc ./warcs/449.warc mydir
Merging the following WARC files:
[Yes] ./warcs/449.warc
[Yes] ./585.warc
[Yes] ./dir1/113.warc
[Yes] ./472.warc
Validating the resulting WARC files:
- [valid] mydir/WARCMerge20140806040546699431.warc
###Example 3: Appending a WARC file to the end of another WARC file
%python -a ./test/src/20258526.warc ./test/dest/20141872.warc
The resulting (./test/dest/20141872.warc) is valid WARC file
###Example 4: When incorrect arguments are given, the following usage message is shown:
%python -n 20160041872.warc new-dir
usage: WARCMerge [[-q] -a <source-file> <dest-file> ]
[[-q] <input-directory> <output-directory> ]
[[-q] <file1> <file2> <file3> ... <output-directory> ]
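The three forms in the usage message can be told apart with a small amount of argument inspection. The sketch below is one possible way to do that, not WARCMerge's actual parser. It assumes that `-q`, when present, comes first (as the usage message shows) and that any other argument beginning with `-` is an error, which means a path starting with `-` would be rejected. The function name `classify_args` and the returned dictionary layout are hypothetical.

```python
def classify_args(argv):
    """Classify argv (without the program name) into a usage form.

    Returns a dict describing the request, or None when the arguments match
    none of the three forms and the usage message should be shown.
    Hypothetical helper, written only to illustrate the usage message.
    """
    args = list(argv)
    # Assumption: the optional -q flag is always the first argument.
    quiet = bool(args) and args[0] == "-q"
    if quiet:
        args = args[1:]

    if args and args[0] == "-a":
        # Append mode takes exactly one source file and one destination file.
        if len(args) == 3:
            return {"mode": "append", "quiet": quiet,
                    "source": args[1], "dest": args[2]}
        return None

    # Merge modes need at least one input plus the output directory, and
    # no unknown options such as "-n".
    if len(args) < 2 or any(a.startswith("-") for a in args):
        return None
    return {"mode": "merge", "quiet": quiet,
            "inputs": args[:-1], "output_dir": args[-1]}
```

With this sketch, the arguments from Example 4 (`-n 20160041872.warc new-dir`) return `None`. Whether a single input is a directory or a file would be decided later, for example with `os.path.isdir`.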
##Relevant Linkage
The following are links to the archived pages in the example above: