epub-parser

Common EPUB2 data parser for Ridibooks services written in ES6

Features

  • Detailed parsing for EPUB2
  • Supports package validation, decompression and style extraction with various parsing options
  • Extract files within EPUB with various reading options

TODO

  • Add encryption and decryption function
  • Add readOptions.spine.truncate and readOptions.spine.truncateMaxLength options
  • Add readOptions.spine.minify and readOptions.css.minify options
  • Support for EPUB3
  • Support for CLI
  • Support for other OCF spec (manifest.xml, metadata.xml, signatures.xml, encryption.xml, etc)

Install

npm install @ridi/epub-parser

Usage

Basic:

import EpubParser from '@ridi/epub-parser';

const parser = new EpubParser('./foo/bar.epub');
parser.parse().then((book) => {
  const results = parser.read(book.spines);
  // ...
});

Various inputs:

import fs from 'fs';
import EpubParser from '@ridi/epub-parser';

// Unzipped path of EPUB file.
new EpubParser('./foo/bar');

// EPUB file buffer.
const buffer = fs.readFileSync('./foo/bar.epub');
new EpubParser(buffer);

Book to Object, Object to Book:

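import EpubParser, { Book } from '@ridi/epub-parser'; // Assumption: Book is exposed as a named export.

const parser = new EpubParser('./foo/bar.epub');
parser.parse().then((book) => {
  const rawBook = book.toRaw(); // Book to plain Object.
  const newBook = new Book(rawBook); // Plain Object back to Book.
  // ...
});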

API

parse(parseOptions)

Returns Promise<Book> with:

  • Book: Instance with metadata, spine list, table of contents, etc.

Or throws an exception.

parseOptions: Object

read(target(s), readOptions)

Returns a string, Object, string[], or Object[], depending on the target(s) and readOptions.

Or throws an exception.

target(s): Item, Item[] (see: Item Types)

readOptions: Object

Model

Author

  • name: string?
  • role: string (Default: Author.Roles.UNDEFINED)

DateTime

  • value: string?
  • event: string (Default: DateTime.Events.UNDEFINED)

Identifier

  • value: string?
  • scheme: string? (Default: Identifier.Schemes.UNDEFINED)

Meta

  • name: string?
  • content: string?

Guide

  • title: string?
  • type: string (Default: Guide.Types.UNDEFINED)
  • href: string?
  • item: Item?

Item Types

Item

  • id: string?
  • href: string?
  • mediaType: string?
  • size: number?
  • isFileExists: boolean (size !== undefined)
  • defaultEncoding: string?

NcxItem (extend Item)

SpineItem (extend Item)

  • spineIndex: number (Default: -1)
  • isLinear: boolean (Default: true)
  • styles: CssItem[]?

CssItem (extend Item)

  • namespace: string?

InlineCssItem (extend CssItem)

  • text: string?

ImageItem (extend Item)

  • isCover: boolean (Default: false)

FontItem (extend Item)

DeadItem (extend Item)

  • raw: Object

NavPoint

  • id: string?
  • label: string?
  • src: string?
  • anchor: string?
  • depth: number (Default: 0)
  • children: NavPoint[]
  • spine: SpineItem?
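
Because each NavPoint carries its own children and depth, a table of contents can be walked recursively. The following is a minimal sketch that works on plain objects shaped like the fields listed above; `flattenNavPoints` and the sample data are hypothetical and not part of the library.

```javascript
// Sketch only: flattens a NavPoint-like tree into a depth-annotated list.
// Assumes each node has `label`, `depth` and `children` as listed above.
function flattenNavPoints(navPoints, result = []) {
  navPoints.forEach((navPoint) => {
    result.push({ label: navPoint.label, depth: navPoint.depth });
    // Visit children right after their parent to keep reading order.
    flattenNavPoints(navPoint.children || [], result);
  });
  return result;
}

// Hypothetical sample data for illustration.
const sampleNavPoints = [
  {
    label: 'Chapter 1',
    depth: 0,
    children: [{ label: 'Section 1.1', depth: 1, children: [] }],
  },
  { label: 'Chapter 2', depth: 0, children: [] },
];

flattenNavPoints(sampleNavPoints).forEach(({ label, depth }) => {
  console.log(`${'  '.repeat(depth)}${label}`);
});
```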

Parse Options


validatePackage: boolean

If true, validates the package against the IDPF specifications listed below.

  • The ZIP header must not be corrupt.
  • The mimetype file must be the first file in the archive.
  • The mimetype file must not be compressed.
  • The mimetype file must contain only the string application/epub+zip.
  • The mimetype file must not use the extra field feature of the ZIP format.

Default: false
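
To make the rules above concrete, here is a rough sketch of how they could be checked against the first local file header of a ZIP archive. It is not the library's implementation: `checkMimetypeEntry` and `buildEntry` are hypothetical names, and the sketch assumes the sizes are stored in the local header (no data descriptor).

```javascript
// Sketch only: checks the first ZIP local file header against the mimetype rules.
const LOCAL_FILE_HEADER_SIGNATURE = 0x04034b50; // "PK\x03\x04", little-endian
const EXPECTED_MIMETYPE = 'application/epub+zip';

function checkMimetypeEntry(buffer) {
  if (buffer.length < 30 || buffer.readUInt32LE(0) !== LOCAL_FILE_HEADER_SIGNATURE) {
    return 'ZIP header is corrupt';
  }
  const compressionMethod = buffer.readUInt16LE(8); // 0 means "stored"
  const compressedSize = buffer.readUInt32LE(18);
  const nameLength = buffer.readUInt16LE(26);
  const extraLength = buffer.readUInt16LE(28);
  const name = buffer.toString('latin1', 30, 30 + nameLength);
  if (name !== 'mimetype') return 'mimetype is not the first file in the archive';
  if (compressionMethod !== 0) return 'mimetype file is compressed';
  if (extraLength !== 0) return 'mimetype file uses the extra field';
  const dataStart = 30 + nameLength;
  const content = buffer.toString('latin1', dataStart, dataStart + compressedSize);
  if (content !== EXPECTED_MIMETYPE) return 'mimetype content is not application/epub+zip';
  return null; // All checks passed.
}

// Hypothetical helper: builds a minimal local file header plus data for the demo.
function buildEntry(name, content, compressionMethod = 0) {
  const header = Buffer.alloc(30);
  header.writeUInt32LE(LOCAL_FILE_HEADER_SIGNATURE, 0);
  header.writeUInt16LE(compressionMethod, 8);
  header.writeUInt32LE(content.length, 18); // compressed size
  header.writeUInt32LE(content.length, 22); // uncompressed size
  header.writeUInt16LE(name.length, 26);
  return Buffer.concat([header, Buffer.from(name, 'latin1'), Buffer.from(content, 'latin1')]);
}

console.log(checkMimetypeEntry(buildEntry('mimetype', EXPECTED_MIMETYPE)));
```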


validateXml: boolean

If true, stop parsing when XML parsing errors occur.

Default: false


allowNcxFileMissing: boolean

If false, stops parsing when the NCX file does not exist.

Default: true


unzipPath: string?

If specified, the EPUB is uncompressed to that path.

Applies only when the input is a buffer or the file path of an EPUB file.

Default: undefined


createIntermediateDirectories: boolean

If true, creates intermediate directories for unzipPath.

Default: true


removePreviousFile: boolean

If true, removes a previous file from unzipPath.

Default: true
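
The three unzip-related options (unzipPath, createIntermediateDirectories, removePreviousFile) describe how the target directory is prepared. A sketch of that preparation using only Node's `fs` module might look as follows. `prepareUnzipPath` is a hypothetical helper, and the exact behavior of the library (for example, what counts as a "previous file") is an assumption here.

```javascript
// Sketch only: prepares a directory the way the unzip options describe.
const fs = require('fs');
const os = require('os');
const path = require('path');

function prepareUnzipPath(unzipPath, options = {}) {
  const { createIntermediateDirectories = true, removePreviousFile = true } = options;
  // Assumption: "previous file" means whatever already exists at unzipPath.
  if (removePreviousFile && fs.existsSync(unzipPath)) {
    fs.rmSync(unzipPath, { recursive: true, force: true });
  }
  if (!fs.existsSync(unzipPath)) {
    // Without `recursive`, mkdirSync throws if a parent directory is missing.
    fs.mkdirSync(unzipPath, { recursive: createIntermediateDirectories });
  }
  return unzipPath;
}

// Demo in a throwaway temporary directory.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'epub-'));
const demoTarget = path.join(demoRoot, 'a', 'b');
prepareUnzipPath(demoTarget);
console.log(fs.existsSync(demoTarget));
```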


ignoreLinear: boolean

If true, spineIndex is assigned sequentially to every spine item, ignoring the differences otherwise caused by the isLinear property of SpineItem.

// e.g. Left: ignoreLinear is false. Right: ignoreLinear is true.
[{ spineIndex: 0, isLinear: true, ... },       [{ spineIndex: 0, isLinear: true, ... },
{ spineIndex: 1, isLinear: true, ... },        { spineIndex: 1, isLinear: true, ... },
{ spineIndex: -1, isLinear: false, ... },      { spineIndex: 2, isLinear: false, ... },
{ spineIndex: 2, isLinear: true, ... }]        { spineIndex: 3, isLinear: true, ... }]

Default: true
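
The indexing rule illustrated by the two columns above can be sketched as a small function. `assignSpineIndexes` is a hypothetical name used only for illustration; it reproduces the example, not necessarily the library's internal logic.

```javascript
// Sketch only: assigns spineIndex the way the ignoreLinear example shows.
function assignSpineIndexes(spineItems, ignoreLinear = true) {
  let nextIndex = 0;
  return spineItems.map((item) => {
    if (!ignoreLinear && !item.isLinear) {
      // Non-linear items are skipped in the numbering and marked with -1.
      return { ...item, spineIndex: -1 };
    }
    const indexed = { ...item, spineIndex: nextIndex };
    nextIndex += 1;
    return indexed;
  });
}

const sampleSpines = [
  { isLinear: true },
  { isLinear: true },
  { isLinear: false },
  { isLinear: true },
];

console.log(assignSpineIndexes(sampleSpines, false).map((item) => item.spineIndex));
console.log(assignSpineIndexes(sampleSpines, true).map((item) => item.spineIndex));
```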


useStyleNamespace: boolean

If true, one namespace is assigned per CSS file or inline style, and the styles used by each spine are listed in SpineItem.styles.

Otherwise, CssItem.namespace and SpineItem.styles are undefined.

In any list (Book.styles, Book.items, SpineItem.styles, ...), InlineCssItem is always positioned after CssItem.

Default: false


styleNamespacePrefix: string

Prepends the given string to each namespace for identification.

Default: 'ridi_style'
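
Judging by class names such as ridi_style0 and ridi_style1 that appear in the read output, a namespace looks like the prefix followed by a number. The sketch below assumes that number is simply the style's position in the list; `assignNamespaces` is a hypothetical helper, not a library function.

```javascript
// Sketch only: builds namespaces as `${prefix}${index}`.
// Assumption: the numeric suffix is the position of the style in the list.
function assignNamespaces(styles, prefix = 'ridi_style') {
  return styles.map((style, index) => ({ ...style, namespace: `${prefix}${index}` }));
}

// Hypothetical style entries for illustration.
const sampleStyles = [{ href: 'Styles/a.css' }, { href: 'Styles/b.css' }];

console.log(assignNamespaces(sampleStyles).map((style) => style.namespace));
console.log(assignNamespaces(sampleStyles, 'book_').map((style) => style.namespace));
```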


Read Options


encoding: string?

If specified, read returns a string. Otherwise, it returns a buffer.

If 'default' is specified, Item.defaultEncoding is used.

Item.defaultEncoding // undefined (=buffer)
SpineItem.defaultEncoding // 'utf8'
CssItem.defaultEncoding // 'utf8'
InlineCssItem.defaultEncoding // 'utf8'
ImageItem.defaultEncoding // undefined (=buffer)

Default: 'default'
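
The encoding rules above amount to a small resolution step before decoding. Here is a minimal sketch, assuming plain objects that expose a `defaultEncoding` field like the items listed above; `decodeItem` is a hypothetical name.

```javascript
// Sketch only: resolves the effective encoding, then decodes or returns the buffer.
function decodeItem(buffer, item, encoding = 'default') {
  // 'default' defers to the item's own defaultEncoding.
  const resolved = encoding === 'default' ? item.defaultEncoding : encoding;
  // No encoding means the raw buffer is returned unchanged.
  return resolved === undefined ? buffer : buffer.toString(resolved);
}

const sampleData = Buffer.from('<p>hi</p>', 'utf8');
const spineLike = { defaultEncoding: 'utf8' };
const imageLike = { defaultEncoding: undefined };

console.log(typeof decodeItem(sampleData, spineLike));
console.log(Buffer.isBuffer(decodeItem(sampleData, imageLike)));
```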


ignoreEntryNotFoundError: boolean

If false, throws Errors.ITEM_NOT_FOUND when the entry for a target cannot be found.

Default: true


basePath: string?

If specified, changes the base path of the paths used by spine and CSS items.

HTML: SpineItem

...
  <!-- Before -->
  <div>
    <img src="../Images/cover.jpg">
  </div>
  <!-- After -->
  <div>
    <img src="{basePath}/OEBPS/Images/cover.jpg">
  </div>
...

CSS: CssItem, InlineCssItem

/* Before */
@font-face {
  font-family: NotoSansRegular;
  src: url("../Fonts/NotoSans-Regular.ttf");
}
/* After */
@font-face {
  font-family: NotoSansRegular;
  src: url("{basePath}/OEBPS/Fonts/NotoSans-Regular.ttf");
}

Default: undefined
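
The before/after pairs above are consistent with resolving each relative URL against the directory of the item that contains it and then prefixing basePath. The following sketch shows that resolution with Node's `path.posix`. `rebaseUrl` is a hypothetical helper, and the item hrefs used below (OEBPS/Text/..., OEBPS/Styles/...) are assumed for illustration.

```javascript
// Sketch only: resolves a relative URL against the containing item, then prefixes basePath.
const path = require('path');

function rebaseUrl(basePath, itemHref, relativeUrl) {
  // dirname() gives the directory of the spine or CSS item inside the EPUB.
  // join() normalizes segments such as "..".
  return path.posix.join(basePath, path.posix.dirname(itemHref), relativeUrl);
}

console.log(rebaseUrl('/books/foo', 'OEBPS/Text/Section0001.xhtml', '../Images/cover.jpg'));
console.log(rebaseUrl('/books/foo', 'OEBPS/Styles/Style0001.css', '../Fonts/NotoSans-Regular.ttf'));
```

A real implementation would also need to leave absolute URLs (for example, http: or data: URLs) untouched; the sketch does not handle them.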


spine.extractBody: boolean

If true, extracts the body and its attributes. Otherwise, read returns the full document as a string.

true:

{
  body: '\n  <p>Extract style</p>\n  <img src=\"../Images/api-map.jpg\"/>\n',
  attrs: [
    {
      key: 'style',
      value: 'background-color: #000000;',
    },
    { // Only added if useStyleNamespace is true.
      key: 'class',
      value: '.ridi_style2, .ridi_style3, .ridi_style4, .ridi_style0, .ridi_style1',
    },
  ],
}

false:

'<!doctype><html>\n<head>\n</head>\n<body style="background-color: #000000;">\n  <p>Extract style</p>\n  <img src=\"../Images/api-map.jpg\"/>\n</body>\n</html>'

Default: false
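
To illustrate the shape of the extracted result, here is a simplified, regex-based sketch that pulls the body content and the body tag's attributes out of an HTML string. It is not how the library parses documents (a real parser is more robust), and `extractBodySketch` is a hypothetical name.

```javascript
// Sketch only: regex-based body extraction that mirrors the { body, attrs } shape.
function extractBodySketch(html) {
  const match = /<body([^>]*)>([\s\S]*?)<\/body>/i.exec(html);
  if (!match) return { body: html, attrs: [] };
  // Collect double-quoted attributes of the <body> tag as { key, value } pairs.
  const attrs = [...match[1].matchAll(/([\w:-]+)\s*=\s*"([^"]*)"/g)].map((attr) => ({
    key: attr[1],
    value: attr[2],
  }));
  return { body: match[2], attrs };
}

const sampleHtml = '<!doctype><html>\n<head>\n</head>\n'
  + '<body style="background-color: #000000;">\n'
  + '  <p>Extract style</p>\n'
  + '  <img src="../Images/api-map.jpg"/>\n'
  + '</body>\n</html>';

console.log(extractBodySketch(sampleHtml));
```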


spine.extractAdapter: function

If specified, transforms the output of extractBody.

Define adapter:

const extractAdapter = (body, attrs) => {
  let string = '';
  attrs.forEach((attr) => {
    string += ` ${attr.key}=\"${attr.value}\"`;
  });
  return {
    content: `<article${string}>${body}</article>`,
  };
};

Result:

{
  content: '<article style=\"background-color: #000000;\" class=\".ridi_style2, .ridi_style3, .ridi_style4, .ridi_style0, .ridi_style1\">\n  <p>Extract style</p>\n  <img src=\"../Images/api-map.jpg\"/>\n</article>',
}

Default: defaultExtractAdapter


css.removeAtrules: string[]

Removes the specified at-rules.

Default: ['charset', 'import', 'keyframes', 'media', 'namespace', 'supports']


css.removeTags: string[]

Removes selectors that point to the specified tags.

Default: []


css.removeIds: string[]

Removes selectors that point to the specified ids.

Default: []


css.removeClasses: string[]

Removes selectors that point to the specified classes.

Default: []
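
The effect of css.removeTags, css.removeIds and css.removeClasses can be pictured as filtering a rule's comma-separated selector list. The sketch below is a simplified, regex-based illustration under that assumption. `filterSelectors` is a hypothetical helper and does not reflect how the library actually processes CSS.

```javascript
// Sketch only: drops selectors that mention any of the given tags, ids or classes.
function filterSelectors(selectorText, options = {}) {
  const { removeTags = [], removeIds = [], removeClasses = [] } = options;
  const escapeRegExp = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const patterns = [
    // A tag name must start the selector or follow a combinator or whitespace.
    ...removeTags.map((tag) => new RegExp(`(^|[\\s>+~])${escapeRegExp(tag)}(?![\\w-])`, 'i')),
    ...removeIds.map((id) => new RegExp(`#${escapeRegExp(id)}(?![\\w-])`)),
    ...removeClasses.map((cls) => new RegExp(`\\.${escapeRegExp(cls)}(?![\\w-])`)),
  ];
  return selectorText
    .split(',')
    .map((selector) => selector.trim())
    .filter((selector) => selector.length > 0 && !patterns.some((p) => p.test(selector)))
    .join(', ');
}

console.log(filterSelectors('body, p.note, #main h1, .keep', {
  removeTags: ['body'],
  removeIds: ['main'],
  removeClasses: ['note'],
}));
```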