changes to support paddle decoder

mthrok · Nov 17, 2017 · 06af477
1 parent 2d7ae35
commit 06af477

Showing 184 changed files with 1,523 additions and 59,249 deletions.
diff --git a/.gitmodules b/.gitmodules
@@ -1,3 +1,6 @@
-[submodule "pytorch_ctc/src/third_party/kenlm"]
+[submodule "third_party/kenlm"]
 	path = third_party/kenlm
-	url = https://github.com/kpu/kenlm.git
+	url = https://github.com/luotao1/kenlm.git
+[submodule "third_party/ThreadPool"]
+	path = third_party/ThreadPool
+	url = https://github.com/progschj/ThreadPool.git
diff --git a/README.md b/README.md
@@ -1,7 +1,8 @@
 # ctcdecode
+
 ctcdecode is an implementation of CTC (Connectionist Temporal Classification) beam search decoding for PyTorch.
-C++ code borrowed liberally from TensorFlow with some improvements to increase flexibility.
-It includes swappable scorer support enabling standard beam search, dictionary-based decoding, and KenLM-based decoding.
+C++ code borrowed liberally from PaddlePaddle's [DeepSpeech](https://github.com/PaddlePaddle/DeepSpeech).
+It includes swappable scorer support enabling standard beam search and KenLM-based decoding.
 
 ## Installation
 The library is largely self-contained and requires only PyTorch and CFFI. Building the C++ library requires gcc or clang. KenLM language modeling support is also optionally included, and enabled by default.
@@ -10,93 +11,5 @@ The library is largely self-contained and requires only PyTorch and CFFI. Buildi
 # get the code
 git clone --recursive https://github.com/parlance/ctcdecode.git
 cd ctcdecode
-
-# install dependencies (PyTorch and CFFI)
-pip install -r requirements.txt
-
-python setup.py install
-# If you do NOT require kenlm, the `--recursive` flag is not required on git clone
-# and `--exclude-kenlm` should be appended to the `python setup.py install` command
-```
-
-## API
-ctcdecode includes a CTC beam search decoder with multiple scorer implementations. A `scorer` is a function that the decoder calls to condition the probability of a given beam based on its state.
-
-### Scorers
-Three Scorer implementations are currently implemented for ctcdecode.
-
-**Scorer:** is a NO-OP and enables the decoder to do a vanilla beam decode
-```python
-scorer = Scorer()
-```
-
-**DictScorer:** conditions beams based on the provided dictionary trie. Only words in the dictionary will be hypothesized.
-```python
-scorer = DictScorer(labels, trie_path, blank_index=0, space_index=28):
-```
-
-**KenLMScorer:** conditions beams based on the provided KenLM binary language model.
-```python
-scorer = KenLMScorer(labels, lm_path, trie_path, blank_index=0, space_index=28)
-```
-
-where:
-- `labels` is a string of output labels given in the same order as the output layer
-- `lm_path` path to a binary KenLM language model for decoding
-- `trie_path` path to a Trie containing the lexicon (see generate_lm_dict)
-- `blank_index` is used to specify which position in the output distribution represents the `blank` class
-- `space_index` is used to specify which position in the output distribution represents the word separator class
-
-The `KenLMScorer` may be further configured with weights for the language model contribution to the score (`lm_weight`), as well as word bonuses (to offset decreasing probability as a function of sequence length).
-
-```python
-scorer.set_lm_weight(2.0)
-scorer.set_word_weight(0.1)
+pip install .
 ```
-
-### Decoder
-```python
-decoder = CTCBeamDecoder(scorer, labels, top_paths=3, beam_width=20,
-                         blank_index=0, space_index=28)
-```
-
-where:
-- `scorer` is an instance of a concrete implementation of the `BaseScorer` class
-- `labels` is a string of output labels given in the same order as the output layer
-- `top_paths` is used to specify how many hypotheses to return
-- `beam_width` is the number of beams to evaluate in a given step
-- `blank_index` is used to specify which position in the output distribution represents the `blank` class
-- `space_index` is used to specify which position in the output distribution represents the word separator class
-
-```python
-output, score, out_seq_len, offsets = decoder.decode(probs, sizes=None)
-```
-
-where:
-- `probs` is a FloatTensor of log-probabilities with shape `(seq_len, batch_size, num_classes)`
-- `seq_len` is an optional IntTensor of integer sequence lengths with shape `(batch_size)`
-
-and returns:
-- `output` is an IntTensor of character classes of shape `(top_paths, batch_size, seq_len)`
-- `score` is a FloatTensor of log-probabilities representing the likelihood of the transcription with shape `(top_paths, batch_size)`
-- `out_seq_len` is an IntTensor containing the length of the output sequence with shape `(top_paths, batch_size)`
-- `offsets` is an IntTensor returning the index of the input at which the character occurs. Can be used for generating time alignments
-
-The `CTCBeamDecoder` may be further configured with weights for the label size (`label_size`), and label margin (`label_margin`). These parameters helps to reduce the computation time.
-
-Label selection size controls how many items in each beam are passed through to the beam scorer. Only items with top N input scores are considered.
-Label selection margin controls the difference between minimal input score (versus the best scoring label) for an item to be passed to the beam scorer. This margin is expressed in terms of log-probability. Default is to do no label selection.
-
-```python
-decoder.set_label_selection_parameters(label_size=0, label_margin=6)
-```
-
-### Utilities
-```python
-generate_lm_dict(dictionary_path, output_path, labels, kenlm_path=None, blank_index=0, space_index=1)
-```
-
-A vocabulary trie is required for the KenLM Scorer. The trie is created from a lexicon specified as a newline separated text file of words in the vocabulary. The DictScorer also requires this function be run to generate a dictionary trie. In this case, a `kenlm_path` is not required.
-
-## Acknowledgements
-Thanks to [ebrevdo](https://github.com/ebrevdo) for the original TensorFlow CTC decoder implementation, [timediv](https://github.com/timediv) for his KenLM extension, and [SeanNaren](https://github.com/seannaren) for his assistance.
diff --git a/build.py b/build.py
@@ -0,0 +1,66 @@
+#!/usr/bin/env python
+
+import glob
+import os
+import tarfile
+
+import wget
+from torch.utils.ffi import create_extension
+
+# Download/Extract openfst
+dl_path = 'third_party/openfst-1.6.3.tar.gz'
+if not os.path.isfile(dl_path):
+    wget.download('http://www.openfst.org/twiki/pub/FST/FstDownload/openfst-1.6.3.tar.gz',
+                  out=dl_path)
+tar = tarfile.open(dl_path)
+tar.extractall('third_party/')
+tar.close()
+
+
+# Does gcc compile with this header and library?
+def compile_test(header, library):
+    dummy_path = os.path.join(os.path.dirname(__file__), "dummy")
+    command = "bash -c \"g++ -include " + header + " -l" + library + " -x c++ - <<<'int main() {}' -o " + dummy_path \
+              + " >/dev/null 2>/dev/null && rm " + dummy_path + " 2>/dev/null\""
+    return os.system(command) == 0
+
+
+compile_args = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=6', '-std=c++11', '-fPIC', '-w']
+ext_libs = ['stdc++']
+
+if compile_test('zlib.h', 'z'):
+    compile_args.append('-DHAVE_ZLIB')
+    ext_libs.append('z')
+
+if compile_test('bzlib.h', 'bz2'):
+    compile_args.append('-DHAVE_BZLIB')
+    ext_libs.append('bz2')
+
+if compile_test('lzma.h', 'lzma'):
+    compile_args.append('-DHAVE_XZLIB')
+    ext_libs.append('lzma')
+
+third_party_libs = ["kenlm", "openfst-1.6.3/src/include", "ThreadPool"]
+compile_args.extend(['-DINCLUDE_KENLM', '-DKENLM_MAX_ORDER=6'])
+lib_sources = glob.glob('third_party/kenlm/util/*.cc') + glob.glob('third_party/kenlm/lm/*.cc') + glob.glob(
+    'third_party/kenlm/util/double-conversion/*.cc') + glob.glob('third_party/openfst-1.6.3/src/lib/*.cc')
+lib_sources = [fn for fn in lib_sources if not (fn.endswith('main.cc') or fn.endswith('test.cc'))]
+
+third_party_includes = ["third_party/" + lib for lib in third_party_libs]
+ctc_sources = glob.glob('ctcdecode/src/*.cpp')
+ctc_headers = ['ctcdecode/src/binding.h', ]
+
+ffi = create_extension(
+    name='ctcdecode._ext.ctc_decode',
+    package=True,
+    language='c++',
+    headers=ctc_headers,
+    sources=ctc_sources + lib_sources,
+    include_dirs=third_party_includes,
+    with_cuda=False,
+    libraries=ext_libs,
+    extra_compile_args=compile_args
+)
+
+if __name__ == '__main__':
+    ffi.build()
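
Note on `compile_test` in `build.py`: the probe builds a `bash -c` command string with a here-string and calls `os.system`, so it depends on bash and writes a `dummy` binary next to the script. A minimal sketch of the same check using `subprocess` and a temporary directory might look like the following. The name `compile_probe` and its `compiler` parameter are hypothetical and not part of this commit.

```python
import os
import shutil
import subprocess
import tempfile


def compile_probe(header, library, compiler="g++"):
    """Return True if an empty C++ program that force-includes `header`
    compiles and links against `library`. Hypothetical helper, not in the commit."""
    # No compiler on PATH means the probe cannot succeed.
    if shutil.which(compiler) is None:
        return False
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "probe")
        # "-x c++ -" tells the compiler to read C++ source from stdin.
        cmd = [compiler, "-include", header, "-x", "c++", "-",
               "-l" + library, "-o", out_path]
        result = subprocess.run(
            cmd,
            input=b"int main() {}",
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
```

It would be called the same way as the original, for example `compile_probe('zlib.h', 'z')`, and the temporary directory removes the need for the trailing `rm`.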
diff --git a/ctcdecode/__init__.py b/ctcdecode/__init__.py
@@ -1,139 +1,48 @@
-import torch
-import ctcdecode as ctc
-from torch.utils.ffi import _wrap_function
 from ._ext import ctc_decode
-# from ._ext._ctc_decode import lib as _lib, ffi as _ffi
-#
-# __all__ = []
-#
-#
-# def _import_symbols(locals):
-#     for symbol in dir(_lib):
-#         fn = getattr(_lib, symbol)
-#         new_symbol = "_" + symbol
-#         locals[new_symbol] = _wrap_function(fn, _ffi)
-#         __all__.append(new_symbol)
-#
-#
-# _import_symbols(locals())
+import torch
 
 
-class BaseCTCBeamDecoder(object):
-    def __init__(self, labels, top_paths=1, beam_width=10, blank_index=0, space_index=28):
-        self._labels = labels
-        self._top_paths = top_paths
+class CTCBeamDecoder(object):
+    def __init__(self, labels, model_path=None, alpha=0, beta=0, cutoff_top_n=40, cutoff_prob=1.0, beam_width=100,
+                 num_processes=4, blank_id=0):
+        self.cutoff_top_n = cutoff_top_n
         self._beam_width = beam_width
-        self._blank_index = blank_index
-        self._space_index = space_index
-        self._num_classes = len(labels)
-        self._decoder = None
-
-        if blank_index < 0 or blank_index >= self._num_classes:
-            raise ValueError("blank_index must be within num_classes")
-
-        if top_paths < 1 or top_paths > beam_width:
-            raise ValueError("top_paths must be greater than 1 and less than or equal to the beam_width")
-
-    def decode(self, probs, seq_len=None):
-        prob_size = probs.size()
-        max_seq_len = prob_size[0]
-        batch_size = prob_size[1]
-        num_classes = prob_size[2]
-
-        if seq_len is not None and batch_size != seq_len.size(0):
-            raise ValueError("seq_len shape must be a (batch_size) tensor or None")
-
-        seq_len = torch.IntTensor(batch_size).zero_().add_(max_seq_len) if seq_len is None else seq_len
-        output = torch.IntTensor(self._top_paths, batch_size, max_seq_len)
-        scores = torch.FloatTensor(self._top_paths, batch_size)
-        out_seq_len = torch.IntTensor(self._top_paths, batch_size)
-        alignments = torch.IntTensor(self._top_paths, batch_size, max_seq_len)
-        char_probs = torch.FloatTensor(self._top_paths, batch_size, max_seq_len)
-
-        result = ctc_decode.ctc_beam_decode(self._decoder, self._decoder_type, probs, seq_len, output, scores, out_seq_len,
-                                      alignments, char_probs)
-
-        return output, scores, out_seq_len, alignments, char_probs
-
-
-class BaseScorer(object):
-    def __init__(self):
-        self._scorer_type = 0
         self._scorer = None
-
-    def get_scorer_type(self):
-        return self._scorer_type
-
-    def get_scorer(self):
-        return self._scorer
-
-
-class Scorer(BaseScorer):
-    def __init__(self):
-        super(Scorer, self).__init__()
-        self._scorer = ctc_decode.get_base_scorer()
-
-
-class DictScorer(BaseScorer):
-    def __init__(self, labels, trie_path, blank_index=0, space_index=28):
-        super(DictScorer, self).__init__()
-        self._scorer_type = 1
-        self._scorer = ctc_decode.get_dict_scorer(labels, len(labels), space_index, blank_index, trie_path.encode())
-
-    def set_min_unigram_weight(self, weight):
-        if weight is not None:
-            ctc_decode.set_dict_min_unigram_weight(self._scorer, weight)
-
-
-class KenLMScorer(BaseScorer):
-    def __init__(self, labels, lm_path, trie_path, blank_index=0, space_index=28):
-        super(KenLMScorer, self).__init__()
-        if ctc_decode.kenlm_enabled() != 1:
-            raise ImportError("ctcdecode not compiled with KenLM support.")
-        self._scorer_type = 2
-        self._scorer = ctc_decode.get_kenlm_scorer(labels, len(labels), space_index, blank_index, lm_path.encode(),
-                                             trie_path.encode())
-
-    # This is a way to make sure the destructor is called for the C++ object
-    # Frees all the member data items that have allocated memory
-    def __del__(self):
-        ctc_decode.free_kenlm_scorer(self._scorer)
-
-    def set_lm_weight(self, weight):
-        if weight is not None:
-            ctc_decode.set_kenlm_scorer_lm_weight(self._scorer, weight)
-
-    def set_word_weight(self, weight):
-        if weight is not None:
-            ctc_decode.set_kenlm_scorer_wc_weight(self._scorer, weight)
-
-    def set_min_unigram_weight(self, weight):
-        if weight is not None:
-            ctc_decode.set_kenlm_min_unigram_weight(self._scorer, weight)
-
-
-class CTCBeamDecoder(BaseCTCBeamDecoder):
-    def __init__(self, scorer, labels, top_paths=1, beam_width=10, blank_index=0, space_index=28):
-        super(CTCBeamDecoder, self).__init__(labels, top_paths=top_paths, beam_width=beam_width,
-                                             blank_index=blank_index, space_index=space_index)
-        self._scorer = scorer
-        self._decoder_type = self._scorer.get_scorer_type()
-        self._decoder = ctc_decode.get_ctc_beam_decoder(self._num_classes, top_paths, beam_width, blank_index,
-                                                        self._scorer.get_scorer(), self._decoder_type)
-
-    def set_label_selection_parameters(self, label_size=0, label_margin=-1):
-        ctc_decode.set_label_selection_parameters(self._decoder, label_size, label_margin)
-
-
-def generate_lm_dict(dictionary_path, output_path, labels, kenlm_path=None, blank_index=0, space_index=28):
-    if kenlm_path is not None and ctc_decode.kenlm_enabled() != 1:
-        raise ImportError("ctcdecode not compiled with KenLM support.")
-    result = None
-    if kenlm_path is not None:
-        result = ctc_decode.generate_lm_dict(labels, len(labels), blank_index, space_index, kenlm_path.encode(),
-                                             dictionary_path.encode(), output_path.encode())
-    else:
-        result = ctc_decode.generate_dict(labels, len(labels), blank_index, space_index,
-                                          dictionary_path.encode(), output_path.encode())
-    if result != 0:
-        raise ValueError("Error encountered generating dictionary")
+        self._num_processes = num_processes
+        self._labels = ''.join(labels).encode()
+        self._blank_id = blank_id
+        if model_path:
+            self._scorer = ctc_decode.paddle_get_scorer(alpha, beta, model_path.encode(), self._labels,
+                                                        len(self._labels))
+        self._cutoff_prob = cutoff_prob
+
+    def decode(self, probs):
+        # We expect batch x seq x label_size
+        probs = probs.cpu().float()
+        batch_size, max_seq_len = probs.size(0), probs.size(1)
+        output = torch.IntTensor(batch_size, self._beam_width, max_seq_len).cpu().int()
+        scores = torch.IntTensor(batch_size, self._beam_width).cpu().int()
+        out_seq_len = torch.IntTensor(batch_size, self._beam_width).cpu().int()
+        if self._scorer:
+            ctc_decode.paddle_beam_decode_lm(probs, self._labels, len(self._labels), self._beam_width,
+                                             self._num_processes, self._cutoff_prob, self.cutoff_top_n, self._blank_id,
+                                             self._scorer, output, scores, out_seq_len)
+        else:
+            ctc_decode.paddle_beam_decode(probs, self._labels, len(self._labels), self._beam_width, self._num_processes,
+                                          self._cutoff_prob, self.cutoff_top_n, self._blank_id, output, scores,
+                                          out_seq_len)
+
+        return output, scores, out_seq_len
+
+    def character_based(self):
+        return ctc_decode.is_character_based(self._scorer) if self._scorer else None
+
+    def max_order(self):
+        return ctc_decode.get_max_order(self._scorer) if self._scorer else None
+
+    def dict_size(self):
+        return ctc_decode.get_dict_size(self._scorer) if self._scorer else None
+
+    def reset_params(self, alpha, beta):
+        if self._scorer is not None:
+            ctc_decode.reset_params(self._scorer, alpha, beta)
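
Note on the new `decode` return values: `output` is allocated as `(batch_size, beam_width, max_seq_len)` and `out_seq_len` as `(batch_size, beam_width)`, so each beam has to be trimmed to its own length before label ids are mapped back to characters. A minimal sketch of that step follows, assuming beam 0 holds the best hypothesis and that `out_seq_len` gives the number of valid entries per beam; neither assumption is confirmed by this diff. The helper name `top_beam_to_text` is hypothetical, and plain nested lists stand in for the tensors.

```python
def top_beam_to_text(output, out_seq_len, labels):
    """Turn the first beam of each batch item into a string.

    Hypothetical helper: assumes beam index 0 is the best hypothesis and that
    out_seq_len[b][k] is the number of valid label ids in output[b][k].
    """
    texts = []
    for b in range(len(output)):
        length = int(out_seq_len[b][0])
        # Keep only the valid prefix; the rest of the row is padding.
        ids = [int(output[b][0][t]) for t in range(length)]
        texts.append("".join(labels[i] for i in ids))
    return texts


# Stand-in data: 1 batch item, 2 beams, 6 time steps, labels "_ab " (blank at 0).
labels = "_ab "
output = [[[1, 2, 3, 1, 0, 0], [2, 0, 0, 0, 0, 0]]]
out_seq_len = [[4, 1]]
print(top_beam_to_text(output, out_seq_len, labels))
```

Because every index is passed through `int()`, the same function should also accept the `IntTensor` results returned by `CTCBeamDecoder.decode`.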