Commit

Address @ikreymer's review
- Reference Webrecorder rather than OpenWayback, since WR preceded it,
  and OpenWayback did not implement their proposal.
- Adjust CDXJ timestamp so that milliseconds are optional.
- Remove mention of wacz_version
edsu committed Feb 7, 2022
1 parent 0dcfd82 commit bb18b72
Showing 1 changed file with 10 additions and 12 deletions.
22 changes: 10 additions & 12 deletions 1.2.0/index.html
@@ -80,11 +80,10 @@
 }
 ],
 localBiblio: {
-"OPENWAYBACK-CDXJ": {
-title: "OpenWayback CDXJ File Format 1.0",
-href: "https://iipc.github.io/warc-specifications/specifications/cdx-format/openwayback-cdxj/",
-publisher: "International Internet Preservation Consortium",
-rawDate: "2016-06-06"
+"WEBRECORDER-CDX": {
+title: "Webrecorder CDX Index Format",
+publisher: "Webrecorder",
+rawDate: "2015-03-25"
 }
 }
 };
@@ -343,7 +342,6 @@ <h2>Terminology</h2>
 [[FRICTIONLESS-DATA-PACKAGE]] specification. It MUST contain the following
 keys:

-- `wacz_version`: The version of WACZ being used. (e.g. 1.2.0)
 - `profile`: Set to `data-package`
 - `resources`: a list of file names, paths, sizes and fixity for all files
 contained in the WACZ.
@@ -403,9 +401,9 @@ <h2>Terminology</h2>
 <p class="note">
 CDXJ's name and semantics partly derive from an earlier index format
 developed as part of the Internet Archive's <a>Wayback Machine</a>, where CDX
-may have been an acronym for Crawl (or Capture) inDeX. WACZ's implementation of
-CDXJ also draws on a proposed, but never implemented, WARC index format for the
-OpenWayback project [[OPENWAYBACK-CDXJ]].
+may have been an acronym for Crawl (or Capture) inDeX. The CDXJ format used in
+WACZ was mostly drawn from an earlier implementation in the Webrecorder
+application [[WEBRECORDER-CDX]].
 </p>

 A CDXJ file is a sorted, line oriented plain-text file (optionally GZIP
@@ -439,16 +437,16 @@ <h2>Terminology</h2>

 ### Integer Timestamp

-The Integer Timestamp is an integer representation of the date and time (UTC) when the
-web archive snapshot was created. It is composed of:
+The Integer Timestamp is an integer representation of the date and time (UTC)
+when the web archive snapshot was created. It is composed of:

 - 4 digit year (e.g. 2022)
 - 2 digit month (e.g. 02)
 - 2 digit day (e.g. 05)
 - 2 digit hour in 24 hour format (e.g. 23)
 - 2 digit minute (e.g. 13)
 - 2 digit second (e.g. 59)
-- 3 digit milliseconds (e.g. 032)
+- 3 digit milliseconds MAY be included (e.g. 032)

 This example date would get serialized as the Integer Timestamp
 `20220205231359032`.
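
As a rough illustration of the revised Integer Timestamp layout (milliseconds optional), the following Python sketch serializes and parses such values. The function names `to_integer_timestamp` and `parse_integer_timestamp` are hypothetical and not part of the specification, and the sketch assumes a timestamp is always either 14 digits (no milliseconds) or 17 digits (with milliseconds).

```python
from datetime import datetime, timezone


def to_integer_timestamp(dt: datetime, with_millis: bool = False) -> str:
    """Serialize a timezone-aware datetime as YYYYMMDDhhmmss[SSS] in UTC.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = dt.astimezone(timezone.utc)
    stamp = utc.strftime("%Y%m%d%H%M%S")
    if with_millis:
        # Truncate microseconds to milliseconds and zero-pad to 3 digits.
        stamp += f"{utc.microsecond // 1000:03d}"
    return stamp


def parse_integer_timestamp(value: str) -> datetime:
    """Parse a 14-digit or 17-digit Integer Timestamp into a UTC datetime.

    Assumption: only these two lengths are accepted.
    """
    if not (value.isascii() and value.isdigit()) or len(value) not in (14, 17):
        raise ValueError(f"not a valid Integer Timestamp: {value!r}")
    dt = datetime.strptime(value[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    if len(value) == 17:
        # The trailing 3 digits are milliseconds.
        dt = dt.replace(microsecond=int(value[14:]) * 1000)
    return dt


example = datetime(2022, 2, 5, 23, 13, 59, 32000, tzinfo=timezone.utc)
print(to_integer_timestamp(example, with_millis=True))
print(to_integer_timestamp(example))
```

With the example date from the text, the first call yields the 17-digit form and the second the 14-digit form that this commit newly permits.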