epub-parser

Common EPUB2 data parser for Ridibooks services written in ES6

Features

Detailed parsing for EPUB2
Supports package validation, decompression and style extraction with various parsing options
Extract files within EPUB with various reading options

TODO

Add encryption and decryption function
Add readOptions.spine.truncate and readOptions.spine.truncateMaxLength options
Add readOptions.spine.minify and readOptions.css.minify options
Support for EPUB3
Support for CLI
Support for other OCF spec (manifest.xml, metadata.xml, signatures.xml, encryption.xml, etc)

Install

npm install @ridi/epub-parser

Usage

Basic:

import EpubParser from '@ridi/epub-parser';

const parser = new EpubParser('./foo/bar.epub');
parser.parse().then((book) => {
  const results = parser.read(book.spines);
  ...
});

Various inputs:

import fs from 'fs';
import EpubParser from '@ridi/epub-parser';

// Unzipped path of EPUB file.
new EpubParser('./foo/bar');

// EPUB file buffer.
const buffer = fs.readFileSync('./foo/bar.epub');
new EpubParser(buffer);

Book to Object, Object to Book:

import EpubParser from '@ridi/epub-parser';

const parser = new EpubParser('./foo/bar.epub');
parser.parse().then((book) => {
  const rawBook = book.toRaw();
  // Book must be in scope here; its import is not shown in this example.
  const newBook = new Book(rawBook);
  ...
});

API

parse(parseOptions)

Returns Promise<Book> with:

Book: Instance with metadata, spine list, table of contents, etc.

Or throws an exception.

parseOptions: `Object`

read(target(s), readOptions)

Returns string or Object or string[] or Object[] with:

string (readOptions.spine.extractBody is false)
Object (readOptions.spine.extractAdapter is undefined):
- body: Same result as document.body.innerHTML
- attrs: Attributes in body tag.
Object (readOptions.spine.extractAdapter is defaultExtractAdapter):
- content: extractBody output transformed by adapter.

Or throws an exception.

target(s): `Item`, `Item[]` (see: Item Types)

readOptions: `Object`
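
As a rough sketch of how the two calls fit together, the helper below takes an already constructed parser (for example `new EpubParser('./foo/bar.epub')`) and forwards option objects to `parse` and `read`. The helper name `readSpines` is hypothetical, and the nested `{ spine: { extractBody: true } }` shape is inferred from the `readOptions.spine.extractBody` naming used in this document.

```javascript
// Hypothetical helper: `parser` is assumed to be an EpubParser instance.
function readSpines(parser) {
  // See "Parse Options" for the full list.
  const parseOptions = { validateXml: true, allowNcxFileMissing: true };
  // See "Read Options" for the full list.
  const readOptions = { spine: { extractBody: true } };

  return parser.parse(parseOptions)
    .then((book) => parser.read(book.spines, readOptions))
    .catch((error) => {
      // Both steps can fail; add context before passing the error on.
      throw new Error(`EPUB processing failed: ${error.message}`);
    });
}
```

Taking the parser as an argument keeps the helper easy to exercise with a stub object in tests.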

Model

Book

titles: string[]
creators: Author[]
subjects: string[]
description: string?
publisher: string?
contributors: Author[]
dates: DateTime[]
type: string?
format: string?
identifiers: Identifier[]
source: string?
language: string?
relation: string?
rights: string?
epubVersion: number?
metas: Meta[]
items: Item[]
ncx: NcxItem?
spines: SpineItem[]
fonts: FontItem[]
cover: ImageItem?
images: ImageItem[]
styles: CssItem[]
guide: Guide[]
deadItems: DeadItem[]

Author

name: string?
role: string (Default: Author.Roles.UNDEFINED)

DateTime

value: string?
event: string (Default: DateTime.Events.UNDEFINED)

Identifier

value: string?
scheme: string? (Default: Identifier.Schemes.UNDEFINED)

Guide

title: string?
type: string (Default: Guide.Types.UNDEFINED)
href: string?
item: Item?

Item Types

Item

id: string?
href: string?
mediaType: string?
size: number?
isFileExists: boolean (size !== undefined)
defaultEncoding: string?

NcxItem (extend Item)

navPoints: NavPoint[]

SpineItem (extend Item)

spineIndex: number (Default: -1)
isLinear: boolean (Default: true)
styles: CssItem[]?

CssItem (extend Item)

namespace: string?

InlineCssItem (extend CssItem)

text: string?

ImageItem (extend Item)

isCover: boolean (Default: false)

FontItem (extend Item)

DeadItem (extend Item)

raw: Object

NavPoint

id: string?
label: string?
src: string?
anchor: string?
depth: number (Default: 0)
children: NavPoint[]
spine: SpineItem?
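
Because `children` nests NavPoints recursively, a table of contents is usually easier to consume once flattened. The sketch below walks the tree depth-first and indents each label by `depth`. The function name `flattenNavPoints` and the sample data are hypothetical; only the fields listed above are read.

```javascript
// Depth-first walk over a NavPoint tree, producing one flat entry per node.
function flattenNavPoints(navPoints, out = []) {
  navPoints.forEach((navPoint) => {
    out.push({ label: navPoint.label, src: navPoint.src, depth: navPoint.depth });
    flattenNavPoints(navPoint.children || [], out);
  });
  return out;
}

// Hand-written stand-in for book.ncx.navPoints.
const sampleNavPoints = [
  {
    label: 'Chapter 1',
    src: 'Text/ch1.xhtml',
    depth: 0,
    children: [
      { label: 'Section 1.1', src: 'Text/ch1.xhtml', depth: 1, children: [] },
    ],
  },
  { label: 'Chapter 2', src: 'Text/ch2.xhtml', depth: 0, children: [] },
];

// Render an indented outline, two spaces per depth level.
const toc = flattenNavPoints(sampleNavPoints)
  .map((entry) => `${'  '.repeat(entry.depth)}${entry.label}`)
  .join('\n');
console.log(toc);
```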

Parse Options

validatePackage: `boolean`

If true, validates the package against the IDPF specifications listed below.

The ZIP header should not be corrupted.
The mimetype file must be the first file in the archive.
The mimetype file should not be compressed.
The mimetype file should contain only the string application/epub+zip.
The mimetype file should not use the extra field feature of the ZIP format.

Default: false
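
To make the checks above concrete, here is a minimal sketch that inspects the first ZIP local file header of a buffer. It is not the library's implementation: the names `checkMimetypeEntry` and `buildStoredEntry` are hypothetical, and the byte offsets follow the standard ZIP local file header layout.

```javascript
// Checks the first ZIP local file header against the rules listed above.
// Returns a list of problems (empty when every check passes).
// Limitation: entries written with a data descriptor keep their sizes after
// the data, so this sketch only handles the plain "sizes in header" layout.
function checkMimetypeEntry(zip) {
  // Local file header signature "PK\x03\x04", read as a little-endian uint32.
  if (zip.length < 30 || zip.readUInt32LE(0) !== 0x04034b50) {
    return ['ZIP header is corrupted'];
  }
  const method = zip.readUInt16LE(8); // 0 means stored (no compression)
  const compressedSize = zip.readUInt32LE(18);
  const nameLength = zip.readUInt16LE(26);
  const extraLength = zip.readUInt16LE(28);
  const name = zip.toString('ascii', 30, 30 + nameLength);
  const dataStart = 30 + nameLength + extraLength;
  const data = zip.toString('ascii', dataStart, dataStart + compressedSize);

  const problems = [];
  if (name !== 'mimetype') problems.push('mimetype is not the first file');
  if (extraLength !== 0) problems.push('mimetype uses the extra field');
  if (method !== 0) {
    problems.push('mimetype is compressed');
  } else if (data !== 'application/epub+zip') {
    problems.push('mimetype content is wrong');
  }
  return problems;
}

// Builds a single stored entry for demonstration (CRC left at 0, so this is
// enough for the header checks but not a complete, valid archive).
function buildStoredEntry(name, content) {
  const header = Buffer.alloc(30);
  header.writeUInt32LE(0x04034b50, 0); // signature
  header.writeUInt16LE(10, 4); // version needed to extract
  header.writeUInt32LE(content.length, 18); // compressed size
  header.writeUInt32LE(content.length, 22); // uncompressed size
  header.writeUInt16LE(name.length, 26); // file name length
  return Buffer.concat([header, Buffer.from(name, 'ascii'), Buffer.from(content, 'ascii')]);
}

console.log(checkMimetypeEntry(buildStoredEntry('mimetype', 'application/epub+zip'))); // []
```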

validateXml: `boolean`

If true, stop parsing when XML parsing errors occur.

Default: false

allowNcxFileMissing: `boolean`

If false, stops parsing when the NCX file does not exist.

Default: true

unzipPath: `string?`

If specified, unzips the EPUB to that path.

Applies only when the input is a buffer or the file path of an EPUB file.

Default: undefined

createIntermediateDirectories: `boolean`

If true, creates intermediate directories for unzipPath.

Default: true

removePreviousFile: `boolean`

If true, removes a previous file from unzipPath.

Default: true

ignoreLinear: `boolean`

If true, ignores the spineIndex difference caused by the isLinear property of SpineItem.

// e.g. Left: ignoreLinear is false. Right: ignoreLinear is true.
[{ spineIndex: 0, isLinear: true, ... },       [{ spineIndex: 0, isLinear: true, ... },
{ spineIndex: 1, isLinear: true, ... },        { spineIndex: 1, isLinear: true, ... },
{ spineIndex: -1, isLinear: false, ... },      { spineIndex: 2, isLinear: false, ... },
{ spineIndex: 2, isLinear: true, ... }]        { spineIndex: 3, isLinear: true, ... }]

Default: true
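
The numbering shown in the side-by-side example above can be reproduced with a few lines. This is only a sketch of the observable behavior, not the library's code, and `assignSpineIndexes` is a hypothetical name.

```javascript
// Assigns spineIndex values the way the example above shows:
// - ignoreLinear true: every spine is numbered in order.
// - ignoreLinear false: non-linear spines get -1 and are skipped in the count.
function assignSpineIndexes(spines, ignoreLinear) {
  let next = 0;
  return spines.map((spine) => {
    const counted = ignoreLinear || spine.isLinear;
    const spineIndex = counted ? next : -1;
    if (counted) next += 1;
    return Object.assign({}, spine, { spineIndex });
  });
}

const spines = [
  { isLinear: true },
  { isLinear: true },
  { isLinear: false },
  { isLinear: true },
];

console.log(assignSpineIndexes(spines, false).map((s) => s.spineIndex)); // [ 0, 1, -1, 2 ]
console.log(assignSpineIndexes(spines, true).map((s) => s.spineIndex)); // [ 0, 1, 2, 3 ]
```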

useStyleNamespace: `boolean`

If true, one namespace is assigned per CSS file or inline style, and the styles used by each spine are recorded in SpineItem.styles.

Otherwise, CssItem.namespace and SpineItem.styles are undefined.

In any list (Book.styles, Book.items, SpineItem.styles, ...), InlineCssItem is always positioned after CssItem.

Default: false

styleNamespacePrefix: `string`

Prepend given string to namespace for identification.

Default: 'ridi_style'

Read Options

encoding: `string?`

If specified, returns a string. Otherwise, returns a buffer.

If set to 'default', Item.defaultEncoding is used.

Item.defaultEncoding // undefined (=buffer)
SpineItem.defaultEncoding // 'utf8'
CssItem.defaultEncoding // 'utf8'
InlineCssItem.defaultEncoding // 'utf8'
ImageItem.defaultEncoding // undefined (=buffer)

Default: 'default'
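
The resolution rule can be pictured as a small lookup. The sketch below is an illustration only: `resolveEncoding` and `decode` are hypothetical helpers, and the table simply restates the per-type defaults listed above.

```javascript
// Default encodings per item type, as listed above (undefined means "buffer").
const DEFAULT_ENCODINGS = {
  Item: undefined,
  SpineItem: 'utf8',
  CssItem: 'utf8',
  InlineCssItem: 'utf8',
  ImageItem: undefined,
};

// 'default' defers to the item type; any other value is used as given.
function resolveEncoding(itemType, encoding = 'default') {
  return encoding === 'default' ? DEFAULT_ENCODINGS[itemType] : encoding;
}

// Returns a string when an encoding applies, otherwise the buffer itself.
function decode(buffer, itemType, encoding) {
  const resolved = resolveEncoding(itemType, encoding);
  return resolved === undefined ? buffer : buffer.toString(resolved);
}

const raw = Buffer.from('a');
console.log(decode(raw, 'SpineItem')); // a
console.log(Buffer.isBuffer(decode(raw, 'ImageItem'))); // true
console.log(decode(raw, 'ImageItem', 'base64')); // YQ==
```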

ignoreEntryNotFoundError: `boolean`

If false, throws Errors.ITEM_NOT_FOUND when an entry is not found.

Default: true

basePath: `string?`

If specified, changes the base path of the paths used by spine and CSS files.

HTML: SpineItem

...
  <!-- Before -->
  <div>
    <img src="../Images/cover.jpg">
  </div>
  <!-- After -->
  <div>
    <img src="{basePath}/OEBPS/Images/cover.jpg">
  </div>
...

CSS: CssItem, InlineCssItem

/* Before */
@font-face {
  font-family: NotoSansRegular;
  src: url("../Fonts/NotoSans-Regular.ttf");
}
/* After */
@font-face {
  font-family: NotoSansRegular;
  src: url("{basePath}/OEBPS/Fonts/NotoSans-Regular.ttf");
}

Default: undefined
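
The before/after pairs above amount to resolving each relative reference against the item's own location and then prefixing the result with basePath. A minimal sketch of that idea follows; `rebase` is a hypothetical helper, and it assumes the item's href is relative to the EPUB root (for example `OEBPS/Text/Section0001.xhtml`).

```javascript
// Resolves `relative` against the directory of `itemHref` (both are paths
// inside the EPUB) and prefixes the result with `basePath`.
// Caveats of this sketch: the result is percent-encoded by the URL parser,
// and absolute URLs passed as `relative` are not handled.
function rebase(basePath, itemHref, relative) {
  // A dummy origin lets the WHATWG URL parser do the "../" resolution.
  const resolved = new URL(relative, `http://localhost/${itemHref}`);
  return `${basePath}${resolved.pathname}`;
}

console.log(rebase('/books/1', 'OEBPS/Text/Section0001.xhtml', '../Images/cover.jpg'));
// /books/1/OEBPS/Images/cover.jpg
console.log(rebase('/books/1', 'OEBPS/Styles/style.css', '../Fonts/NotoSans-Regular.ttf'));
// /books/1/OEBPS/Fonts/NotoSans-Regular.ttf
```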

spine.extractBody: `boolean`

If true, extracts the body. Otherwise, returns the full document as a string.

true:

{
  body: '\n  <p>Extract style</p>\n  <img src=\"../Images/api-map.jpg\"/>\n',
  attrs: [
    {
      key: 'style',
      value: 'background-color: #000000;',
    },
    { // Only added if useStyleNamespace is true.
      key: 'class',
      value: '.ridi_style2, .ridi_style3, .ridi_style4, .ridi_style0, .ridi_style1',
    },
  ],
}

false:

'<!doctype><html>\n<head>\n</head>\n<body style="background-color: #000000;">\n  <p>Extract style</p>\n  <img src=\"../Images/api-map.jpg\"/>\n</body>\n</html>'

Default: false

spine.extractAdapter: `function`

If specified, transforms output of extractBody.

Define adapter:

const extractAdapter = (body, attrs) => {
  let string = '';
  attrs.forEach((attr) => {
    string += ` ${attr.key}=\"${attr.value}\"`;
  });
  return {
    content: `<article${string}>${body}</article>`,
  };
};

Result:

{
  content: '<article style=\"background-color: #000000;\" class=\".ridi_style2, .ridi_style3, .ridi_style4, .ridi_style0, .ridi_style1\">\n  <p>Extract style</p>\n  <img src=\"../Images/api-map.jpg\"/>\n</article>',
}

Default: defaultExtractAdapter

css.removeAtrules: `string[]`

Removes the specified at-rules.

Default: ['charset', 'import', 'keyframes', 'media', 'namespace', 'supports']

css.removeTags: `string[]`

Removes selectors that point to the specified tags.

Default: []

css.removeIds: `string[]`

Removes selectors that point to the specified ids.

Default: []

css.removeClasses: `string[]`

Removes selectors that point to the specified classes.

Default: []
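
As an illustration of what selector removal might look like for classes, the sketch below filters a comma-separated selector list. It rests on a loud assumption: a selector "points to" a class when it contains `.className` as a whole token. The real matching rules are not spelled out in this document, and `removeClassSelectors` is a hypothetical name.

```javascript
// Escapes characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Drops every selector in a comma-separated list that references one of the
// given classes. Assumption: ".note" matches "p.note" but not ".note-large".
// Splitting on "," is naive and breaks on selectors such as ":is(a, b)".
function removeClassSelectors(selectorText, removeClasses) {
  const patterns = removeClasses
    .map((name) => new RegExp(`\\.${escapeRegExp(name)}(?![\\w-])`));
  return selectorText
    .split(',')
    .map((selector) => selector.trim())
    .filter((selector) => !patterns.some((pattern) => pattern.test(selector)))
    .join(', ');
}

console.log(removeClassSelectors('.note, p.warning, .note-large', ['note']));
// p.warning, .note-large
```

A production implementation would more likely walk a parsed CSS syntax tree than use regular expressions.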
