Allow generating subtitles in the background #6247

lkiesow · 2024-10-20T15:52:54Z

This patch allows generating subtitles in the background, starting the jobs first and attaching the result later in the workflow.

While looking into the code, I also found a number of minor issues. I fixed them as well, since they were hard to separate from this patch:

- `language-fallback` didn't work
- Defaults for generator engine and Whisper.cpp binary fixed
- Defaults for several operations
- Simplified code

How to test this patch

Make sure to install Whisper.cpp. You may need to update the configuration etc/org.opencastproject.speechtotext.impl.engine.WhisperCppEngine.cfg. Likely you just have to configure something like:
- whispercpp.root.path=/home/lars/dev/whisper.cpp/main
- whispercpp.model=/home/lars/dev/whisper.cpp/models/ggml-tiny.bin
Update the fast.yaml workflow:

diff --git a/etc/workflows/fast.yaml b/etc/workflows/fast.yaml
index 7a0494ed46..f526ed0a60 100644
--- a/etc/workflows/fast.yaml
+++ b/etc/workflows/fast.yaml
@@ -25,6 +25,7 @@ configuration_panel_json: |-
     ]
   }]
 operations:
+
   - id: defaults
     description: "Applying default configuration values"
     configurations:
@@ -45,6 +46,14 @@ operations:
       - overwrite: false
       - accept-no-media: false
 
+  - id: speechtotext
+    description: Generates subtitles for video and audio files
+    configurations:
+      - source-flavor: '*/prepared'
+      - target-flavor: captions/source
+      - limit-to-one: true
+      - async: true
+
   - id: encode
     fail-on-error: true
     exception-handler-workflow: "partial-error"
@@ -61,9 +70,7 @@ operations:
     description: "Tag captions for publication"
     configurations:
       - source-flavor: "captions/source"
-      - target-flavor: "captions/preview"
       - target-tags: "+engage-download"
-      - copy: true
 
   - id: image
     if: "${straightToPublishing}"
@@ -110,6 +117,12 @@ operations:
       - target-tags: "engage-download"
       - encoding-profile: "player-slides.http"
 
+  - id: speechtotext-attach
+    description: Generates subtitles for video and audio files
+    configurations:
+      - target-flavor: captions/source
+      - target-tags: engage-download
+
   - id: publish-configure
     exception-handler-workflow: "partial-error"
     description: "Publish to preview publication channel"

Download a video with subtitles like:
https://github.com/user-attachments/assets/932ef91b-580a-4d11-9717-011b7bbdb142

Convert this to a wav file:

ffmpeg -i github-multiline-suggestion.mp4 github-multiline-suggestion.wav

Ingest both files:

curl -i -u admin:opencast http://localhost:8080/ingest/addMediaPackage/fast -F 'flavor=presentation/source' -F BODY=@github-multiline-suggestion.mp4 -F flavor=presentation/prepared -F BODY2=@github-multiline-suggestion.wav -F title="I 🖤 Opencast"

lkiesow · 2024-10-20T15:54:28Z

etc/org.opencastproject.speechtotext.impl.SpeechToTextServiceImpl.cfg

@@ -4,5 +4,5 @@

 # Select STT engine.
 # Available engines: vosk, whisper, whispercpp
-# Default: (enginetype=whisper)
-#SpeechToTextEngine.target=(enginetype=whisper)
+# Default: (enginetype=whispercpp)


Since we provide whisper.cpp in the Opencast repositories while not providing the other tools, I think it makes sense to make it the default.

lkiesow · 2024-10-20T15:55:20Z

etc/org.opencastproject.speechtotext.impl.engine.WhisperCppEngine.cfg

-# Default: whisper
-#whispercpp.root.path=whispercpp
+# Configuration for setting a custom path to the Whisper.cpp command line tool
+# Default: whisper.cpp


If I install Whisper.cpp from any of the Opencast repositories, the binary is actually named whisper.cpp. So, let's make that the actual default.

lkiesow · 2024-10-20T15:57:14Z

.../org/opencastproject/workflow/handler/speechtotext/SpeechToTextWorkflowOperationHandler.java

@@ -192,33 +201,29 @@ public WorkflowOperationResult start(WorkflowInstance workflowInstance, JobConte

    // Use the selection strategy from the workflow config to get the tracks we want to transcribe
    List<Track> tracksToTranscribe = filterTracksByStrategy(tracksWithAudio, trackSelectionStrategy);
+    if (tracksToTranscribe.isEmpty()) {


If there are no tracks, we can skip right here. There is no need to run through everything else and then still skip at the end.

lkiesow · 2024-10-20T15:59:07Z

.../org/opencastproject/workflow/handler/speechtotext/SpeechToTextWorkflowOperationHandler.java


    // Load the 'limit-to-one' configuration from the workflow operation.
    // This configuration sets the limit of generated subtitle files to one
    boolean limitToOne = BooleanUtils.toBoolean(workflowInstance.getCurrentOperation().getConfiguration(LIMIT_TO_ONE));
+    if (limitToOne) {
+      tracksToTranscribe = List.of(tracksToTranscribe.get(0));


This replaces the very complex block below. It behaves the same, since createSubtitle(…) always returned true, meaning processing always stopped after the first track. That means we can just reduce the list of tracks to one right here.
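The simplification described above can be sketched in isolation. This is only an illustration, not the handler's code: the class `LimitToOneSketch` and the helper `applyLimit` are hypothetical names, and plain strings stand in for Opencast `Track` objects.

```java
import java.util.List;

final class LimitToOneSketch {

  // Hypothetical helper: keep only the first element when limitToOne is set.
  // An empty list is returned unchanged, mirroring the early skip for
  // "no tracks to transcribe" that happens before this step.
  public static <T> List<T> applyLimit(List<T> tracks, boolean limitToOne) {
    if (tracks.isEmpty() || !limitToOne) {
      return tracks;
    }
    return List.of(tracks.get(0));
  }

  public static void main(String[] args) {
    List<String> tracks = List.of("presenter", "presentation");
    System.out.println(applyLimit(tracks, true));   // [presenter]
    System.out.println(applyLimit(tracks, false));  // [presenter, presentation]
  }
}
```

Reducing the list up front means the later loop over the tracks needs no stop condition of its own.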

lkiesow · 2024-10-20T15:59:55Z

.../org/opencastproject/workflow/handler/speechtotext/SpeechToTextWorkflowOperationHandler.java

          ConfiguredTagsAndFlavors tagsAndFlavors, AppendSubtitleAs appendSubtitleAs, Boolean translate)
          throws WorkflowOperationException {

    // Start the transcription job, create subtitles file
    URI trackURI = track.getURI();

-    if (!track.hasAudio()) {


We actually checked that before. It's one of the track filters.

lkiesow · 2024-10-20T16:01:02Z

.../org/opencastproject/workflow/handler/speechtotext/SpeechToTextWorkflowOperationHandler.java

-            "Speech-to-Text job for media package '%s' failed, because of wrong workflow configuration. "
-                + "track-selection-strategy of type '%s' does not exist.", mediaPackage, strategyCfg));
-      }
+      return TrackSelectionStrategy.EVERYTHING; // "transcribe everything" is the default/fallback


We can just return if we already know the result. No need to continue. This makes the structure simpler.

lkiesow · 2024-10-20T16:01:58Z

.../org/opencastproject/workflow/handler/speechtotext/SpeechToTextWorkflowOperationHandler.java

-      }
-    }
-    return translateMode;
+    return BooleanUtils.toBoolean(StringUtils.trimToEmpty(operation.getConfiguration(TRANSLATE_MODE)));


This one line basically replaces the whole function :D
We don't have to re-invent parsing booleans.

KatrinIhler · 2024-10-29T09:51:53Z

FYI: Branch cut for OC 17 is on November 6!

github-actions · 2024-10-29T10:01:06Z

This pull request has conflicts ☹
Please resolve those so we can review the pull request.
Thanks.

lkiesow · 2024-10-29T12:42:51Z

Fixed merge conflict

marwyg

The code looks way better than before. I couldn't find anything wrong with it.

I tested the following things (which worked):

async on
- case 1: the transcription finishes before the attach operation is reached
- case 2: the transcription is still running when the workflow reaches the attach operation (in this case, the process waits for the transcription job to finish)
async off
- everything works as before
limit to one
- set to false and async to true -> multiple speech-to-text job IDs are stored in workflow variables, everything works as expected
- set to false and async to false -> works
- set to true with async to true -> works
- set to true with async to false -> works
attach operation
- tags and flavors
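The two async cases tested above follow a "start now, wait only when attaching" pattern. The following is a minimal sketch of that pattern, assuming nothing about Opencast's job API: the class and method names are hypothetical, and `CompletableFuture` stands in for the speech-to-text jobs whose IDs the real operations pass along in workflow variables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class AsyncAttachSketch {

  // Stand-in for the speechtotext operation in async mode:
  // start the job and return immediately with a handle to it.
  public static CompletableFuture<String> startTranscription(String track) {
    return CompletableFuture.supplyAsync(() -> "subtitles for " + track);
  }

  // Stand-in for the speechtotext-attach operation: collect the results.
  // join() returns at once for finished jobs (case 1) and blocks until
  // completion for jobs that are still running (case 2).
  public static List<String> attach(List<CompletableFuture<String>> jobs) {
    List<String> results = new ArrayList<>();
    for (CompletableFuture<String> job : jobs) {
      results.add(job.join());
    }
    return results;
  }

  public static void main(String[] args) {
    List<CompletableFuture<String>> jobs = List.of(
        startTranscription("presenter"),
        startTranscription("presentation"));
    // Other workflow operations (encoding, image extraction, ...) would run here.
    System.out.println(attach(jobs));
  }
}
```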

The following things didn't work:

language fallback
- removes the possibility to let Whisper auto-detect the language
attach operation
- I don't know if this should be labeled as "not working", but the target-tags defined in the attach operation override the tags from the speechtotext operation. I don't know what a good solution would look like here: Maybe we should add the attach operation tags to the speechtotext operation ones? Maybe we leave it that way but add a note to the docs?
metadata: wrong language codes
- The metadata provides three-letter codes, for example "deu", while Whisper expects two-letter codes ("de"), I guess. That's not a bug introduced by this patch. We could create another issue for this, or maybe it can be fixed easily?

lkiesow · 2024-11-05T15:13:20Z

> language fallback
>
> removes the possibility to let whisper auto detect the language

That may be why someone broke the fallback before. But I'll change it so that this value has no default. That way, we can skip it if not present.
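A fallback without a default can be sketched as follows. This is an assumption-laden illustration rather than the handler's code: the class `LanguageFallbackSketch`, the method `resolveLanguage`, and the lookup order (metadata language first, then the configured fallback) are all hypothetical. How an engine reacts to an absent language, for example by auto-detecting it, is engine-specific and not shown.

```java
import java.util.Optional;

final class LanguageFallbackSketch {

  // Hypothetical resolution order: use the language from the metadata if set,
  // otherwise the configured fallback, otherwise nothing at all.
  public static Optional<String> resolveLanguage(String metadataLanguage, String fallbackLanguage) {
    for (String candidate : new String[] {metadataLanguage, fallbackLanguage}) {
      if (candidate != null && !candidate.isBlank()) {
        return Optional.of(candidate.trim());
      }
    }
    // No hard-coded default such as "en": the caller can skip the language option.
    return Optional.empty();
  }

  public static void main(String[] args) {
    System.out.println(resolveLanguage("de", "en"));  // Optional[de]
    System.out.println(resolveLanguage(null, "en"));  // Optional[en]
    System.out.println(resolveLanguage(null, null));  // Optional.empty
  }
}
```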

> I don't know if this should be labeled as "not working", but the target-tags defined in the attach operation override the tags from the speechtotext operation. I don't know what a good solution would look like here: Maybe we should add the attach operation tags to the speechtotext operation ones? Maybe we leave it that way but add a note to the docs?

I think this is a misconception. Target tags apply to media package elements. Since the speechtotext operation does not generate any media package element, there are no tags. In other words, setting target-tags at that operation just has no effect at all. So, this is expected behavior. I'll add a notice to the docs.

> metadata: Wrong language codes. Metadata provides three-letter codes and Whisper expects two-letter codes, I guess

That I didn't touch and won't fix in here. If I remember correctly, the Vosk code does some mapping internally. We could do something similar if we want to. But that's a separate issue.
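For reference, a mapping from three-letter to two-letter language codes can be built from the JDK alone. This is a sketch of one possible approach, not the Vosk mapping mentioned above and not part of this patch; the class and method names are hypothetical. It only covers the three-letter codes the JDK itself reports via `Locale.getISO3Language()` (such as "deu"), so alternative codes like "ger" would pass through unchanged.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

final class LanguageCodeSketch {

  // Reverse lookup table: three-letter code -> two-letter code.
  private static final Map<String, String> ISO3_TO_ISO2 = new HashMap<>();

  static {
    for (String iso2 : Locale.getISOLanguages()) {
      try {
        String iso3 = Locale.forLanguageTag(iso2).getISO3Language();
        if (!iso3.isEmpty()) {
          ISO3_TO_ISO2.put(iso3, iso2);
        }
      } catch (MissingResourceException e) {
        // The JDK knows no three-letter code for this language: skip it.
      }
    }
  }

  // Returns the two-letter code for a known three-letter code.
  // Anything else (including codes that are already two letters long)
  // is returned trimmed and lower-cased.
  public static String toTwoLetter(String code) {
    if (code == null) {
      return null;
    }
    String normalized = code.trim().toLowerCase(Locale.ROOT);
    return ISO3_TO_ISO2.getOrDefault(normalized, normalized);
  }

  public static void main(String[] args) {
    System.out.println(toTwoLetter("deu"));  // de
    System.out.println(toTwoLetter("de"));   // de
  }
}
```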

mtneug · 2024-11-05T15:43:12Z

The language issue was fixed for WhisperEngine.java in #5950, but not for WhisperCppEngine.java.

lkiesow · 2024-11-05T16:52:09Z

Pull request has been updated.

- The problem with the language fallback should be fixed.
- The target-tags documentation got updated to explain what's happening.
- I'm not touching the 2 vs 3 letter language code in here. I didn't touch that in the first place and it seems like we already have different approaches. It may therefore make sense to think about this for a bit and not rush this.

marwyg · 2024-11-06T09:13:33Z

The changes are looking good. I tested the fallback again -> It works as expected.

The PR can be merged if you like.

marwyg · 2024-11-06T09:19:41Z

I created an issue for the language code problem: #6293

opencast deleted a comment from github-actions bot Oct 20, 2024

lkiesow force-pushed the speechtotext-async branch 2 times, most recently from 1138200 to 53c86c3 Compare October 21, 2024 08:28

marwyg self-assigned this Oct 24, 2024

marwyg self-requested a review October 24, 2024 07:09

marwyg removed their assignment Oct 24, 2024

github-actions bot added the has-conflicts label Oct 29, 2024

lkiesow force-pushed the speechtotext-async branch from 53c86c3 to 6da9996 Compare October 29, 2024 12:42

github-actions bot removed the has-conflicts label Oct 29, 2024

marwyg suggested changes Nov 4, 2024

lkiesow added 2 commits November 5, 2024 17:34

Remove default language fallback

12bc128

Since some STT engines can auto-detect languages, we don't always want to have a fallback language. That is why a default value of `en` doesn't make sense. This patch removes the default, truly making this an optional argument.

Describe target tags in async mode

37642f4

This patch clearly documents that setting target tags in the `speechtotext` operation has no effect if the transcription is run asynchronously.

lkiesow requested a review from marwyg November 5, 2024 16:52

marwyg approved these changes Nov 6, 2024

mtneug merged commit ca15856 into opencast:develop Nov 6, 2024
5 checks passed

marwyg mentioned this pull request Nov 6, 2024

Fix incorrect language parameter for WhisperCPP #6293

Open

mtneug pushed a commit to tales-media/fork-opencast-opencast that referenced this pull request Dec 2, 2024

shio: Backport opencast#6247: Allow generating subtitles in the background

2b19e48
