This tool allows you to use AI models to generate subtitles from only audio, then match the subtitles to an accurate text, like a book. It requires a modern GPU with decent VRAM, CPU, and RAM.
Current State: The transcript will be extremely accurate. The timings will be mostly accurate, but may come late or leave early. The currently used library for generating those offsets is the best I've found so far that works stably, but leaves much to be desired. See the video at the bottom for such an example.
I'm looking forward to being able to run more accurate models to fix this in the future.
Currently supports Unix-based OSes like Ubuntu 20.04 on WSL2.
- Install ffmpeg and make it available on the path
- Use python 3.9.9
- pip install -r requirements.txt
- If you're using a single file for the entire audiobook, you are good to go. If you have individually split audio tracks, they need to be combined. You can use the docker image for m4b-tool. Trust me, you want the improved codecs that are included in the docker image. I tested both and noticed a huge drop in sound quality without them. Lossy formats like mp3 lose quality when they are transcoded, so it's important to use the docker image to retain the best quality.
- Put an m4b and a txt file in a folder
- Run python run.py -d "<full folder path>"
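Before the first run it can save time to confirm the two prerequisites above. This is only a minimal sketch and not part of the project: the function name `check_prerequisites` is made up here, and it treats any 3.9.x interpreter as matching the 3.9.9 mentioned above, which is an assumption.

```python
import shutil
import sys

# Assumption: any 3.9.x interpreter is close enough to the 3.9.9 named above.
EXPECTED_PYTHON = (3, 9)


def check_prerequisites(expected=EXPECTED_PYTHON):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    # shutil.which returns None when the executable is not on the PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on the PATH")
    if tuple(sys.version_info[:2]) != tuple(expected):
        found = ".".join(str(n) for n in sys.version_info[:3])
        problems.append(
            f"expected Python {expected[0]}.{expected[1]}.x, found {found}"
        )
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("warning:", problem)
```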
Primarily I'm using this for syncing audiobooks to their book script. So while you could use this for video files, I'm not doing that just yet.
git clone https://github.com/kanjieater/AudiobookTextSync.git
- Make sure you run any commands that start with ./ from the project root, eg after you clone you can run cd ./AudiobookTextSync
- Setup the folder. Create a folder to hold a single media file (like an audiobook). Name it whatever you name your media file, eg Arslan Senki 7. This is what should go anywhere you see me write <name>
- Get the book script as text from a digital copy. Put the script at: ./<name>/script.txt. Everything in this file will show up in your subtitles, so it's important you trim out excess (table of contents, character bios that aren't in the audiobook, etc)
- The single media file should be in ./<name>/<name>.m4b. If you have the split audiobook as m4b, mp3, or mp4 files, you can run ./merge.sh "<full folder path>", eg ./merge.sh "/mnt/d/Editing/Audiobooks/ｍｅｄｉｕｍ霊媒探偵城塚翡翠". The split files must be in ./<name>/<name>_merge/. This will merge your files into a single file so it can be processed.
- If you have the script.txt and ./<name>/<name>.m4b, you can now run the GPU intense, time intense, and occasionally CPU intense script part: python run.py -d "<full folder path>", eg python run.py -d "/mnt/d/Editing/Audiobooks/かがみの孤城/". This runs each file to get a word level transcript. It then creates a sub format that can be matched to the script.txt. Each word level subtitle is merged into a phrase level, and your result should be a <name>.srt file that can be watched with MPV, showing audio in time with the full book as a subtitle.
- From there, use a texthooker with something like mpv_websocket and enjoy Immersion Reading.
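To picture the "word level merged into phrase level" part of the run step, here is a rough sketch. It is not the code run.py uses: the `Word` tuple, the `max_gap` pause rule, and all function names are assumptions made for illustration. The only fixed piece is the SRT timestamp layout (HH:MM:SS,mmm).

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def merge_words(words: List[Word], max_gap: float = 0.5) -> List[Word]:
    """Join consecutive words into one phrase until a pause longer than max_gap."""
    phrases: List[Word] = []
    for w in words:
        if phrases and w.start - phrases[-1].end <= max_gap:
            last = phrases[-1]
            # Text is joined without a separator, which suits Japanese;
            # a space-delimited language would need " " here.
            phrases[-1] = Word(last.start, w.end, last.text + w.text)
        else:
            phrases.append(w)
    return phrases


def to_srt(phrases: List[Word]) -> str:
    """Render phrases as numbered SRT cues separated by blank lines."""
    blocks = [
        f"{i}\n{srt_time(p.start)} --> {srt_time(p.end)}\n{p.text}\n"
        for i, p in enumerate(phrases, start=1)
    ]
    return "\n".join(blocks)
```

A real merge would more likely cut on the script's own sentence boundaries than on pauses. The pause rule is just the simplest thing that shows the idea.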
./split.sh "/mnt/d/Editing/Audiobooks/かがみの孤城/"
python run.py -d "/mnt/d/Editing/Audiobooks/かがみの孤城/"
You can also run for a single file. Beware if it's over 1GB/19hr you need as much as 8GB of RAM available.
You need your m4b, mp3, or mp4 audiobook file to be inside the folder: "<full folder path>", with a txt file in the same folder. The txt file can be named anything as long as it has a txt extension.
The -d parameter can take multiple audiobooks to process, like: python run.py -d "/mnt/d/sync/Harry Potter 1/" "/mnt/d/sync/Harry Potter 2 The Spooky Sequel/"
/sync/
├── /Harry Potter/
│   ├── Harry Potter.m4b
│   └── Harry Potter.txt
└── /Harry Potter 2 The Spooky Sequel/
    ├── Harry Potter 2 The Spooky Sequel.mp3
    └── script.txt
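The layout above boils down to "one audio file plus one txt file per folder, and -d takes any number of folders". A small sketch of that rule follows, assuming nothing about how run.py itself parses arguments: the `--dirs` long name, `parse_folders`, and `find_inputs` are all hypothetical.

```python
import argparse
from pathlib import Path
from typing import List, Optional, Tuple

# Audio extensions named in the text above.
AUDIO_EXTENSIONS = {".m4b", ".mp3", ".mp4"}


def parse_folders(argv: List[str]) -> List[Path]:
    """Parse a -d flag that takes one or more folder paths."""
    parser = argparse.ArgumentParser()
    # nargs="+" collects every value after -d into a list.
    parser.add_argument("-d", "--dirs", nargs="+", required=True)
    return [Path(d) for d in parser.parse_args(argv).dirs]


def find_inputs(folder: Path) -> Tuple[Optional[Path], Optional[Path]]:
    """Return (audio_file, text_file) found directly in folder; None if missing."""
    files = sorted(p for p in folder.iterdir() if p.is_file())
    audio = next((p for p in files if p.suffix.lower() in AUDIO_EXTENSIONS), None)
    text = next((p for p in files if p.suffix.lower() == ".txt"), None)
    return audio, text


if __name__ == "__main__":
    import sys

    for folder in parse_folders(sys.argv[1:]):
        audio, text = find_inputs(folder)
        status = "ok" if audio and text else "missing audio or txt"
        print(f"{folder}: {status}")
```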
python run.py -d "<full folder path>"
eg python run.py -d "$(wslpath -a "D:\Editing\Audiobooks\かがみの孤城\\")"
or python run.py -d "/mnt/d/sync/Harry Potter 1/" "/mnt/d/sync/Harry Potter The Sequel/"
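For anyone wondering what the wslpath call in the example produces, here is a rough Python equivalent for plain drive-letter paths. It assumes the default /mnt mount point and does not handle UNC paths or custom automount settings, so wslpath itself remains the safer choice. `windows_to_wsl` is a made-up name.

```python
from pathlib import PureWindowsPath


def windows_to_wsl(path: str, mount_root: str = "/mnt") -> str:
    """Convert an absolute drive-letter Windows path to its WSL mount path."""
    win = PureWindowsPath(path)
    # Accept only paths like D:\folder; reject UNC and relative paths.
    if len(win.drive) != 2 or not win.root:
        raise ValueError(f"expected an absolute drive-letter path, got {path!r}")
    drive = win.drive[0].lower()
    parts = win.parts[1:]  # drop the leading "D:\" anchor
    return "/".join([mount_root, drive, *parts])


if __name__ == "__main__":
    print(windows_to_wsl(r"D:\Editing\Audiobooks\かがみの孤城"))
```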
./merge.sh "/mnt/d/Editing/Audiobooks/ｍｅｄｉｕｍ霊媒探偵城塚翡翠"
This assumes you just have mp4s in a folder like /mnt/d/Editing/Audiobooks/ｍｅｄｉｕｍ霊媒探偵城塚翡翠. It will run all of the folders with mp4s and do a check on them after to make sure the chapters line up. Requires the docker command to be available.
python merge.py "/mnt/d/Editing/Audiobooks/"
At this point I would recommend reading from the texthooker instead of a sub. (CTRL+SHIFT+RIGHT in mpv to set offset as the next sub). Then you can see the next line coming in the texthooker, and not be distracted by subtitle jumps.
Update: The timing is much more accurate, but it still makes sense to show what going wrong could look like
bad.mp4
- Generates subs2srs style deck
- Imports the deck into Anki automatically
The Anki support currently takes your m4b file in <full_folder_path> named <name>.m4b, where <name> is the name of the media, and it outputs srs audio and a TSV file that is sent via AnkiConnect to Anki. This is useful for searching across GoldenDict to find sentences that use a word, or to merge automatically with custom scripts (more releases to support this coming hopefully).
- Install the ankiconnect add-on to Anki.
- I recommend using ANKICONNECT as an environment variable. Set export ANKICONNECT=localhost:8765 or export ANKICONNECT="$(hostname).local:8765" in your ~/.zshrc or bashrc & activate it.
- Make sure you are in the project directory: cd ./AudiobookTextSync
- Install: pip install -r ./requirements.txt (only needs to be done once)
- Set ANKI_MEDIA_DIR to your anki profile's media path: /mnt/f/Anki2/KanjiEater/collection.media/
- Run the command below
Command:
./anki.sh "<full_folder_path>"
Example:
./anki.sh "/mnt/d/sync/kokoro/"
If you're using WSL2 there are a few networking quirks.
- Enable WSL2 to talk to your Windows machine. microsoft/WSL#4585 (comment)
- Set your $ANKICONNECT url to your windows machine url, export ANKICONNECT="http://$(hostname).local:8765". microsoft/WSL#5211
- Make sure inside of Anki's addon config "webBindAddress": "0.0.0.0", "webBindPort": "8765". 0.0.0.0 binds to all network interfaces, so WSL2 can connect.
curl --header "Content-Type: application/json" \
--request POST \
--data '{ "action": "guiBrowse", "version": 6, "params": { "query": "flag:3 is:new -is:suspended -tag:重複 tag:重複3" } }' \
http://172.18.224.1:8765
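The same connectivity test can be done from Python with only the standard library. This is a sketch, not something shipped with the project: `ankiconnect_request` and `invoke` are names made up here, and the fallback URL is an assumption. Note that urllib needs the http:// scheme in the URL, so a bare localhost:8765 value in ANKICONNECT would have to be prefixed first.

```python
import json
import os
import urllib.request


def ankiconnect_request(action: str, params: dict, url: str) -> urllib.request.Request:
    """Build (but do not send) an AnkiConnect POST request."""
    body = json.dumps({"action": action, "version": 6, "params": params})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke(action: str, params: dict, url: str):
    """Send the request and return the "result" field, raising on a reported error."""
    with urllib.request.urlopen(ankiconnect_request(action, params, url)) as resp:
        reply = json.load(resp)
    if reply.get("error") is not None:
        raise RuntimeError(reply["error"])
    return reply["result"]


if __name__ == "__main__":
    # Assumption: fall back to the default local address when ANKICONNECT is unset.
    url = os.environ.get("ANKICONNECT", "http://localhost:8765")
    print(invoke("guiBrowse", {"query": "flag:3 is:new -is:suspended"}, url))
```

If this raises a connection error from inside WSL2 while the same call works on Windows, the bind address or firewall settings above are the first things to check.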
You might see various issues while trying this out in the early state. Here are some of the pieces at work in sequence:
- Filter down audio to improve future results - slow & probably not heavy on CPU or GPU. Of the two, heavier on CPU
- split_run & stable-ts: Starts off heavy on CPU & RAM to identify the audio spectrum
- stable-ts: GPU heavy & requires lots of VRAM depending on the model. This is the part with the long progress bar, where it tries to transcribe a text from the audio. Currently the default is tiny. Ironically, tiny does a better job of keeping the phrases short, at the cost of transcription accuracy, which doesn't matter since we are matching a script. Also it runs 32x faster than large.
- Merge vtt's for split subs
- Split the script
- match the script to the generated transcription to get good timestamps
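The last step is why transcription accuracy matters so little: the transcribed text is only used to find where each script line belongs, and then it is thrown away. The sketch below shows that idea with difflib from the standard library. It is an assumption-heavy toy (greedy, one segment per line, a fixed look-ahead `window`), not the matching the project actually does.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


def align(script_lines: List[str], segments: List[Segment], window: int = 5) -> List[Segment]:
    """Give each script line the timing of its most similar upcoming segment."""
    aligned: List[Segment] = []
    cursor = 0  # never look backwards, so timings stay in order
    for line in script_lines:
        if cursor >= len(segments):
            break  # ran out of transcribed audio
        candidates = range(cursor, min(cursor + window, len(segments)))
        best = max(
            candidates,
            key=lambda i: SequenceMatcher(None, line, segments[i][2]).ratio(),
        )
        start, end, _ = segments[best]  # keep the timing, drop the transcript
        aligned.append((start, end, line))
        cursor = best + 1
    return aligned


if __name__ == "__main__":
    transcript = [(0.0, 1.5, "the quik brown fox"), (1.5, 3.0, "jumps over the lasy dog")]
    script = ["The quick brown fox", "jumps over the lazy dog"]
    for start, end, text in align(script, transcript):
        print(f"{start:.1f}-{end:.1f}: {text}")
```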
This program supports txt files. You may need to use an external program like Calibre to convert your epub or kindle formats like azw3 to a txt file.
To convert in Calibre:
- Right click on the book and convert the individual book (or use the batch option beneath it)
- At the top right for output format, select txt
- Click Find & Replace. If your book has 《》 for furigana as some aozora books do (戦場《せんじょう》), then add a regex for it: 《(.+?)》. If they have rt for furigana, use an rt one instead, eg <rt>(.+?)</rt>
- You can add multiple regexes to strip any extra content or furigana as need be.
- Click ok and convert it & you should now be able to find the file wherever Calibre is saving your books
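If you already have a txt file and would rather not go back through Calibre, the same furigana stripping can be sketched in Python. The 《》 pattern is the one from the steps above. The rt pattern assumes plain `<rt>` tags without attributes, and `strip_furigana` is a made-up name.

```python
import re

# Aozora-style readings: 戦場《せんじょう》 -> 戦場
AOZORA_RUBY = re.compile(r"《(.+?)》")
# Assumption: HTML ruby readings appear as plain <rt>...</rt> with no attributes.
RT_RUBY = re.compile(r"<rt>(.+?)</rt>")


def strip_furigana(text: str) -> str:
    """Remove furigana readings, leaving only the base text."""
    return RT_RUBY.sub("", AOZORA_RUBY.sub("", text))


if __name__ == "__main__":
    print(strip_furigana("戦場《せんじょう》に立つ"))
```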
Besides the ones already mentioned & installed, this project uses other open source projects: subs2cia & anki-csv-importer