changes to support paddle decoder
SeanNaren committed Nov 17, 2017
1 parent 2d7ae35 commit 06af477
Showing 184 changed files with 1,523 additions and 59,249 deletions.
7 changes: 5 additions & 2 deletions .gitmodules
@@ -1,3 +1,6 @@
[submodule "pytorch_ctc/src/third_party/kenlm"]
[submodule "third_party/kenlm"]
path = third_party/kenlm
url = https://github.com/kpu/kenlm.git
url = https://github.com/luotao1/kenlm.git
[submodule "third_party/ThreadPool"]
path = third_party/ThreadPool
url = https://github.com/progschj/ThreadPool.git
95 changes: 4 additions & 91 deletions README.md
@@ -1,7 +1,8 @@
# ctcdecode

ctcdecode is an implementation of CTC (Connectionist Temporal Classification) beam search decoding for PyTorch.
C++ code borrowed liberally from TensorFlow with some improvements to increase flexibility.
It includes swappable scorer support enabling standard beam search, dictionary-based decoding, and KenLM-based decoding.
C++ code borrowed liberally from PaddlePaddle's [DeepSpeech](https://github.com/PaddlePaddle/DeepSpeech).
It includes swappable scorer support enabling standard beam search and KenLM-based decoding.
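
To make the idea of CTC beam search concrete, here is a minimal pure-Python sketch of textbook CTC prefix beam search without any language model scorer. It is not the C++ implementation used by this library; the function names `logsumexp` and `ctc_beam_search` and the toy input are assumptions made for illustration only.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_beam_search(log_probs, beam_width=3, blank=0):
    """log_probs: list of frames, each a list of per-class log-probabilities.

    Returns (prefix, log_prob) pairs sorted from best to worst.
    """
    # prefix -> (log-prob of paths ending in blank, ... ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for c, lp in enumerate(frame):
                if c == blank:
                    # A blank keeps the prefix unchanged.
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (logsumexp(nb_b, p_b + lp, p_nb + lp), nb_nb)
                    continue
                new_prefix = prefix + (c,)
                nb_b, nb_nb = next_beams[new_prefix]
                if prefix and prefix[-1] == c:
                    # A repeated label extends the prefix only after a blank...
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + lp))
                    # ...otherwise it collapses into the existing prefix.
                    sb, snb = next_beams[prefix]
                    next_beams[prefix] = (sb, logsumexp(snb, p_nb + lp))
                else:
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + lp, p_nb + lp))
        ranked = sorted(next_beams.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_width])
    return [(list(prefix), logsumexp(*ps)) for prefix, ps in beams.items()]


# Toy input: two frames, two classes (0 = blank, 1 = "a").
frames = [[math.log(0.6), math.log(0.4)]] * 2
best_prefix, best_score = ctc_beam_search(frames)[0]
```

In this toy input the greedy path (blank, blank) has probability 0.36, but the prefix `[1]` collects the mass of three alignments and wins, which is the behaviour that distinguishes beam search from a per-frame argmax.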

## Installation
The library is largely self-contained and requires only PyTorch and CFFI. Building the C++ library requires gcc or clang. KenLM language modeling support is also optionally included, and enabled by default.
@@ -10,93 +11,5 @@ The library is largely self-contained and requires only PyTorch and CFFI. Buildi
# get the code
git clone --recursive https://github.com/parlance/ctcdecode.git
cd ctcdecode

# install dependencies (PyTorch and CFFI)
pip install -r requirements.txt

python setup.py install
# If you do NOT require kenlm, the `--recursive` flag is not required on git clone
# and `--exclude-kenlm` should be appended to the `python setup.py install` command
```

## API
ctcdecode includes a CTC beam search decoder with multiple scorer implementations. A `scorer` is a function that the decoder calls to condition the probability of a given beam based on its state.

### Scorers
ctcdecode currently provides three scorer implementations.

**Scorer:** a no-op that lets the decoder perform a vanilla beam decode
```python
scorer = Scorer()
```

**DictScorer:** conditions beams based on the provided dictionary trie. Only words in the dictionary will be hypothesized.
```python
scorer = DictScorer(labels, trie_path, blank_index=0, space_index=28)
```

**KenLMScorer:** conditions beams based on the provided KenLM binary language model.
```python
scorer = KenLMScorer(labels, lm_path, trie_path, blank_index=0, space_index=28)
```

where:
- `labels` is a string of output labels given in the same order as the output layer
- `lm_path` is the path to a binary KenLM language model used for decoding
- `trie_path` is the path to a trie containing the lexicon (see `generate_lm_dict`)
- `blank_index` is used to specify which position in the output distribution represents the `blank` class
- `space_index` is used to specify which position in the output distribution represents the word separator class

The `KenLMScorer` may be further configured with weights for the language model contribution to the score (`lm_weight`), as well as word bonuses (to offset decreasing probability as a function of sequence length).

```python
scorer.set_lm_weight(2.0)
scorer.set_word_weight(0.1)
pip install .
```
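
As a rough illustration of how a language model weight and a word bonus can interact, the sketch below combines the terms linearly in log space. The README does not state the scorer's exact formula, so the linear form and the name `combined_score` are assumptions, and the numbers are made up.

```python
def combined_score(ctc_log_prob, lm_log_prob, word_count, lm_weight=2.0, word_weight=0.1):
    """Hypothetical linear combination of acoustic score, LM score and word bonus."""
    return ctc_log_prob + lm_weight * lm_log_prob + word_weight * word_count


# Two made-up hypotheses: (CTC log-prob, LM log-prob, number of words).
hyp_a = (-4.0, -6.0, 2)  # better acoustically, unlikely under the LM
hyp_b = (-4.5, -3.0, 2)  # slightly worse acoustically, likely under the LM

score_a = combined_score(*hyp_a)
score_b = combined_score(*hyp_b)
best = "b" if score_b > score_a else "a"
```

With these values hypothesis A wins on the CTC score alone, while hypothesis B wins once the language model term is weighted in. The word bonus adds a constant per word, which counteracts the tendency of summed log-probabilities to favour shorter transcriptions.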

### Decoder
```python
decoder = CTCBeamDecoder(scorer, labels, top_paths=3, beam_width=20,
blank_index=0, space_index=28)
```

where:
- `scorer` is an instance of a concrete implementation of the `BaseScorer` class
- `labels` is a string of output labels given in the same order as the output layer
- `top_paths` is used to specify how many hypotheses to return
- `beam_width` is the number of beams to evaluate in a given step
- `blank_index` is used to specify which position in the output distribution represents the `blank` class
- `space_index` is used to specify which position in the output distribution represents the word separator class

```python
output, score, out_seq_len, offsets = decoder.decode(probs, seq_len=None)
```

where:
- `probs` is a FloatTensor of log-probabilities with shape `(seq_len, batch_size, num_classes)`
- `seq_len` is an optional IntTensor of integer sequence lengths with shape `(batch_size)`

and returns:
- `output` is an IntTensor of character classes of shape `(top_paths, batch_size, seq_len)`
- `score` is a FloatTensor of log-probabilities representing the likelihood of the transcription with shape `(top_paths, batch_size)`
- `out_seq_len` is an IntTensor containing the length of the output sequence with shape `(top_paths, batch_size)`
- `offsets` is an IntTensor returning the index of the input at which the character occurs. Can be used for generating time alignments
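
The returned class indices still need to be mapped back to characters. The sketch below does this for plain nested lists laid out as `(top_paths, batch_size, seq_len)`; with real tensors one would presumably convert them first (for example with `.tolist()`). The helper name `to_strings` and the sample data are hypothetical.

```python
def to_strings(output, out_seq_len, labels):
    """Convert class indices to strings, trimming each sequence to its valid length."""
    results = []
    for path_idx, batch in enumerate(output):
        row = []
        for utt_idx, classes in enumerate(batch):
            length = out_seq_len[path_idx][utt_idx]
            row.append("".join(labels[c] for c in classes[:length]))
        results.append(row)
    return results


labels = "_abc "                          # index 0 is the blank symbol in this toy alphabet
output = [[[1, 2, 0, 0], [3, 1, 2, 0]]]   # 1 path, 2 utterances, padded to length 4
out_seq_len = [[2, 3]]                    # valid lengths per (path, utterance)

transcripts = to_strings(output, out_seq_len, labels)
```

Entries beyond `out_seq_len` are padding and must be ignored, which is why the slice uses the reported length rather than the full row.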

The `CTCBeamDecoder` may be further configured with a label selection size (`label_size`) and a label selection margin (`label_margin`). These parameters help to reduce computation time.

Label selection size controls how many items in each beam are passed through to the beam scorer: only the items with the top N input scores are considered.
Label selection margin sets the maximum allowed difference between an item's input score and that of the best-scoring label for the item to be passed to the beam scorer. The margin is expressed in log-probability. The default is to do no label selection.

```python
decoder.set_label_selection_parameters(label_size=0, label_margin=6)
```
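
The effect of the two label selection parameters can be sketched in pure Python for a single frame. This is an illustration of the description above, not the library's C++ code; the function name `select_labels` is hypothetical, and treating `label_size=0` and a negative `label_margin` as "disabled" is an assumption.

```python
def select_labels(log_probs, label_size=0, label_margin=-1.0):
    """Return the indices of labels that survive selection, best first."""
    ranked = sorted(range(len(log_probs)), key=lambda i: log_probs[i], reverse=True)
    if label_size > 0:
        # Keep only the top-N labels by input score.
        ranked = ranked[:label_size]
    if label_margin >= 0:
        # Drop labels that trail the best label by more than the margin.
        best = log_probs[ranked[0]]
        ranked = [i for i in ranked if best - log_probs[i] <= label_margin]
    return ranked


frame = [-0.1, -3.0, -8.0, -2.5]  # made-up log-probabilities for four labels

no_selection = select_labels(frame)
top_three = select_labels(frame, label_size=3)
within_margin = select_labels(frame, label_size=3, label_margin=2.5)
```

Here `top_three` drops the weakest label, and the margin then also removes label 1, which trails the best label by 2.9 in log-probability.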

### Utilities
```python
generate_lm_dict(dictionary_path, output_path, labels, kenlm_path=None, blank_index=0, space_index=28)
```

A vocabulary trie is required for the KenLM Scorer. The trie is created from a lexicon specified as a newline separated text file of words in the vocabulary. The DictScorer also requires this function be run to generate a dictionary trie. In this case, a `kenlm_path` is not required.
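
To show why a trie is useful for restricting hypotheses to the lexicon, here is a small in-memory sketch built from nested dictionaries. It does not reproduce the file format written by `generate_lm_dict`; the names `build_trie`, `has_prefix`, `is_word` and the end-of-word marker are hypothetical.

```python
END = "$"  # hypothetical end-of-word marker; assumed not to occur in the lexicon


def build_trie(words):
    """Build a character trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def _walk(trie, text):
    """Follow text through the trie; return the reached node or None."""
    node = trie
    for ch in text:
        node = node.get(ch)
        if node is None:
            return None
    return node


def has_prefix(trie, prefix):
    """True if some word in the lexicon starts with prefix."""
    return _walk(trie, prefix) is not None


def is_word(trie, word):
    """True if word is a complete lexicon entry."""
    node = _walk(trie, word)
    return node is not None and END in node


# A lexicon is a newline-separated list of words, as described above.
lexicon = "cat\ncar\ndog\n"
trie = build_trie(lexicon.split())
```

During decoding, a check like `has_prefix` lets a beam be discarded as soon as its current partial word can no longer be completed to a dictionary entry.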

## Acknowledgements
Thanks to [ebrevdo](https://github.com/ebrevdo) for the original TensorFlow CTC decoder implementation, [timediv](https://github.com/timediv) for his KenLM extension, and [SeanNaren](https://github.com/seannaren) for his assistance.
66 changes: 66 additions & 0 deletions build.py
@@ -0,0 +1,66 @@
#!/usr/bin/env python

import glob
import os
import tarfile

import wget
from torch.utils.ffi import create_extension

# Download/Extract openfst
dl_path = 'third_party/openfst-1.6.3.tar.gz'
if not os.path.isfile(dl_path):
    wget.download('http://www.openfst.org/twiki/pub/FST/FstDownload/openfst-1.6.3.tar.gz',
                  out=dl_path)
tar = tarfile.open(dl_path)
tar.extractall('third_party/')
tar.close()


# Does gcc compile with this header and library?
def compile_test(header, library):
    dummy_path = os.path.join(os.path.dirname(__file__), "dummy")
    command = "bash -c \"g++ -include " + header + " -l" + library + " -x c++ - <<<'int main() {}' -o " + dummy_path \
              + " >/dev/null 2>/dev/null && rm " + dummy_path + " 2>/dev/null\""
    return os.system(command) == 0


compile_args = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=6', '-std=c++11', '-fPIC', '-w']
ext_libs = ['stdc++']

if compile_test('zlib.h', 'z'):
    compile_args.append('-DHAVE_ZLIB')
    ext_libs.append('z')

if compile_test('bzlib.h', 'bz2'):
    compile_args.append('-DHAVE_BZLIB')
    ext_libs.append('bz2')

if compile_test('lzma.h', 'lzma'):
    compile_args.append('-DHAVE_XZLIB')
    ext_libs.append('lzma')

third_party_libs = ["kenlm", "openfst-1.6.3/src/include", "ThreadPool"]
compile_args.extend(['-DINCLUDE_KENLM', '-DKENLM_MAX_ORDER=6'])
lib_sources = glob.glob('third_party/kenlm/util/*.cc') + glob.glob('third_party/kenlm/lm/*.cc') + glob.glob(
    'third_party/kenlm/util/double-conversion/*.cc') + glob.glob('third_party/openfst-1.6.3/src/lib/*.cc')
lib_sources = [fn for fn in lib_sources if not (fn.endswith('main.cc') or fn.endswith('test.cc'))]

third_party_includes = ["third_party/" + lib for lib in third_party_libs]
ctc_sources = glob.glob('ctcdecode/src/*.cpp')
ctc_headers = ['ctcdecode/src/binding.h', ]

ffi = create_extension(
    name='ctcdecode._ext.ctc_decode',
    package=True,
    language='c++',
    headers=ctc_headers,
    sources=ctc_sources + lib_sources,
    include_dirs=third_party_includes,
    with_cuda=False,
    libraries=ext_libs,
    extra_compile_args=compile_args
)

if __name__ == '__main__':
    ffi.build()
177 changes: 43 additions & 134 deletions ctcdecode/__init__.py
@@ -1,139 +1,48 @@
import torch
import ctcdecode as ctc
from torch.utils.ffi import _wrap_function
from ._ext import ctc_decode
# from ._ext._ctc_decode import lib as _lib, ffi as _ffi
#
# __all__ = []
#
#
# def _import_symbols(locals):
#     for symbol in dir(_lib):
#         fn = getattr(_lib, symbol)
#         new_symbol = "_" + symbol
#         locals[new_symbol] = _wrap_function(fn, _ffi)
#         __all__.append(new_symbol)
#
#
# _import_symbols(locals())
import torch


class BaseCTCBeamDecoder(object):
    def __init__(self, labels, top_paths=1, beam_width=10, blank_index=0, space_index=28):
        self._labels = labels
        self._top_paths = top_paths
class CTCBeamDecoder(object):
    def __init__(self, labels, model_path=None, alpha=0, beta=0, cutoff_top_n=40, cutoff_prob=1.0, beam_width=100,
                 num_processes=4, blank_id=0):
        self.cutoff_top_n = cutoff_top_n
        self._beam_width = beam_width
        self._blank_index = blank_index
        self._space_index = space_index
        self._num_classes = len(labels)
        self._decoder = None

        if blank_index < 0 or blank_index >= self._num_classes:
            raise ValueError("blank_index must be within num_classes")

        if top_paths < 1 or top_paths > beam_width:
            raise ValueError("top_paths must be greater than 1 and less than or equal to the beam_width")

    def decode(self, probs, seq_len=None):
        prob_size = probs.size()
        max_seq_len = prob_size[0]
        batch_size = prob_size[1]
        num_classes = prob_size[2]

        if seq_len is not None and batch_size != seq_len.size(0):
            raise ValueError("seq_len shape must be a (batch_size) tensor or None")

        seq_len = torch.IntTensor(batch_size).zero_().add_(max_seq_len) if seq_len is None else seq_len
        output = torch.IntTensor(self._top_paths, batch_size, max_seq_len)
        scores = torch.FloatTensor(self._top_paths, batch_size)
        out_seq_len = torch.IntTensor(self._top_paths, batch_size)
        alignments = torch.IntTensor(self._top_paths, batch_size, max_seq_len)
        char_probs = torch.FloatTensor(self._top_paths, batch_size, max_seq_len)

        result = ctc_decode.ctc_beam_decode(self._decoder, self._decoder_type, probs, seq_len, output, scores, out_seq_len,
                                            alignments, char_probs)

        return output, scores, out_seq_len, alignments, char_probs


class BaseScorer(object):
    def __init__(self):
        self._scorer_type = 0
        self._scorer = None

    def get_scorer_type(self):
        return self._scorer_type

    def get_scorer(self):
        return self._scorer


class Scorer(BaseScorer):
    def __init__(self):
        super(Scorer, self).__init__()
        self._scorer = ctc_decode.get_base_scorer()


class DictScorer(BaseScorer):
    def __init__(self, labels, trie_path, blank_index=0, space_index=28):
        super(DictScorer, self).__init__()
        self._scorer_type = 1
        self._scorer = ctc_decode.get_dict_scorer(labels, len(labels), space_index, blank_index, trie_path.encode())

    def set_min_unigram_weight(self, weight):
        if weight is not None:
            ctc_decode.set_dict_min_unigram_weight(self._scorer, weight)


class KenLMScorer(BaseScorer):
    def __init__(self, labels, lm_path, trie_path, blank_index=0, space_index=28):
        super(KenLMScorer, self).__init__()
        if ctc_decode.kenlm_enabled() != 1:
            raise ImportError("ctcdecode not compiled with KenLM support.")
        self._scorer_type = 2
        self._scorer = ctc_decode.get_kenlm_scorer(labels, len(labels), space_index, blank_index, lm_path.encode(),
                                                   trie_path.encode())

    # This is a way to make sure the destructor is called for the C++ object
    # Frees all the member data items that have allocated memory
    def __del__(self):
        ctc_decode.free_kenlm_scorer(self._scorer)

    def set_lm_weight(self, weight):
        if weight is not None:
            ctc_decode.set_kenlm_scorer_lm_weight(self._scorer, weight)

    def set_word_weight(self, weight):
        if weight is not None:
            ctc_decode.set_kenlm_scorer_wc_weight(self._scorer, weight)

    def set_min_unigram_weight(self, weight):
        if weight is not None:
            ctc_decode.set_kenlm_min_unigram_weight(self._scorer, weight)


class CTCBeamDecoder(BaseCTCBeamDecoder):
    def __init__(self, scorer, labels, top_paths=1, beam_width=10, blank_index=0, space_index=28):
        super(CTCBeamDecoder, self).__init__(labels, top_paths=top_paths, beam_width=beam_width,
                                             blank_index=blank_index, space_index=space_index)
        self._scorer = scorer
        self._decoder_type = self._scorer.get_scorer_type()
        self._decoder = ctc_decode.get_ctc_beam_decoder(self._num_classes, top_paths, beam_width, blank_index,
                                                        self._scorer.get_scorer(), self._decoder_type)

    def set_label_selection_parameters(self, label_size=0, label_margin=-1):
        ctc_decode.set_label_selection_parameters(self._decoder, label_size, label_margin)


def generate_lm_dict(dictionary_path, output_path, labels, kenlm_path=None, blank_index=0, space_index=28):
    if kenlm_path is not None and ctc_decode.kenlm_enabled() != 1:
        raise ImportError("ctcdecode not compiled with KenLM support.")
    result = None
    if kenlm_path is not None:
        result = ctc_decode.generate_lm_dict(labels, len(labels), blank_index, space_index, kenlm_path.encode(),
                                             dictionary_path.encode(), output_path.encode())
    else:
        result = ctc_decode.generate_dict(labels, len(labels), blank_index, space_index,
                                          dictionary_path.encode(), output_path.encode())
    if result != 0:
        raise ValueError("Error encountered generating dictionary")
        self._num_processes = num_processes
        self._labels = ''.join(labels).encode()
        self._blank_id = blank_id
        if model_path:
            self._scorer = ctc_decode.paddle_get_scorer(alpha, beta, model_path.encode(), self._labels,
                                                        len(self._labels))
        self._cutoff_prob = cutoff_prob

    def decode(self, probs):
        # We expect batch x seq x label_size
        probs = probs.cpu().float()
        batch_size, max_seq_len = probs.size(0), probs.size(1)
        output = torch.IntTensor(batch_size, self._beam_width, max_seq_len).cpu().int()
        scores = torch.IntTensor(batch_size, self._beam_width).cpu().int()
        out_seq_len = torch.IntTensor(batch_size, self._beam_width).cpu().int()
        if self._scorer:
            ctc_decode.paddle_beam_decode_lm(probs, self._labels, len(self._labels), self._beam_width,
                                             self._num_processes, self._cutoff_prob, self.cutoff_top_n, self._blank_id,
                                             self._scorer, output, scores, out_seq_len)
        else:
            ctc_decode.paddle_beam_decode(probs, self._labels, len(self._labels), self._beam_width, self._num_processes,
                                          self._cutoff_prob, self.cutoff_top_n, self._blank_id, output, scores,
                                          out_seq_len)

        return output, scores, out_seq_len

    def character_based(self):
        return ctc_decode.is_character_based(self._scorer) if self._scorer else None

    def max_order(self):
        return ctc_decode.get_max_order(self._scorer) if self._scorer else None

    def dict_size(self):
        return ctc_decode.get_dict_size(self._scorer) if self._scorer else None

    def reset_params(self, alpha, beta):
        if self._scorer is not None:
            ctc_decode.reset_params(self._scorer, alpha, beta)