Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow generating subtitles in the background #6247

Merged
merged 3 commits into from
Nov 6, 2024

Conversation

lkiesow
Copy link
Member

@lkiesow lkiesow commented Oct 20, 2024

This patch allows generating subtitles in the backgrould, starting the jobs first and attaching the result later in the workflow.

Unfortunately, looking into the code, I also found a lot of minor issues which I fixed as well, but which were hard to separate from this patch:

  • language-fallback didn't work
  • Defaults for generator enging and Whisper.cpp binary fixed
  • Defaults for several operations
  • Simplified code

How to test this patch

  • Make sure to install Whisper.cpp. You may need to update the configuration etc/org.opencastproject.speechtotext.impl.engine.WhisperCppEngine.cfg. Likely you just have to configure something like:
    • whispercpp.root.path=/home/lars/dev/whisper.cpp/main
    • whispercpp.model=/home/lars/dev/whisper.cpp/models/ggml-tiny.bin
  • Update the fast.yaml workflow:
diff --git a/etc/workflows/fast.yaml b/etc/workflows/fast.yaml
index 7a0494ed46..f526ed0a60 100644
--- a/etc/workflows/fast.yaml
+++ b/etc/workflows/fast.yaml
@@ -25,6 +25,7 @@ configuration_panel_json: |-
     ]
   }]
 operations:
+
   - id: defaults
     description: "Applying default configuration values"
     configurations:
@@ -45,6 +46,14 @@ operations:
       - overwrite: false
       - accept-no-media: false
 
+  - id: speechtotext
+    description: Generates subtitles for video and audio files
+    configurations:
+      - source-flavor: '*/prepared'
+      - target-flavor: captions/source
+      - limit-to-one: true
+      - async: true
+
   - id: encode
     fail-on-error: true
     exception-handler-workflow: "partial-error"
@@ -61,9 +70,7 @@ operations:
     description: "Tag captions for publication"
     configurations:
       - source-flavor: "captions/source"
-      - target-flavor: "captions/preview"
       - target-tags: "+engage-download"
-      - copy: true
 
   - id: image
     if: "${straightToPublishing}"
@@ -110,6 +117,12 @@ operations:
       - target-tags: "engage-download"
       - encoding-profile: "player-slides.http"
 
+  - id: speechtotext-attach
+    description: Generates subtitles for video and audio files
+    configurations:
+      - target-flavor: captions/source
+      - target-tags: engage-download
+
   - id: publish-configure
     exception-handler-workflow: "partial-error"
     description: "Publish to preview publication channel"
  • Download a video with subtitles like:
    https://github.com/user-attachments/assets/932ef91b-580a-4d11-9717-011b7bbdb142
  • Convert this to a wav file:
    ffmpeg -i github-multiline-suggestion.mp4 github-multiline-suggestion.wav
    
  • Ingest both files:
    curl -i -u admin:opencast http://localhost:8080/ingest/addMediaPackage/fast -F 'flavor=presentation/source' -F BODY=@github-multiline-suggestion.mp4 -F flavor=presentation/prepared -F BODY2=@github-multiline-suggestion.wav -F title="I 🖤 Opencast"
    

Your pull request should…

@@ -4,5 +4,5 @@

# Select STT engine.
# Available engines: vosk, whisper, whispercpp
# Default: (enginetype=whisper)
#SpeechToTextEngine.target=(enginetype=whisper)
# Default: (enginetype=whispercpp)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we provide whisper.cpp in the Opencast repositories while not providing the other tools, I think it makes sense to make it the default.

# Default: whisper
#whispercpp.root.path=whispercpp
# Configuration for setting a custom path to the Whisper.cpp command line tool
# Default: whisper.cpp
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I install Whisper.cpp from any of the Opencast repositories, the binary is actually named whisper.cpp. So, let's make that the actual default.

@@ -192,33 +201,29 @@ public WorkflowOperationResult start(WorkflowInstance workflowInstance, JobConte

// Use the selection strategy from the workflow config to get the tracks we want to transcribe
List<Track> tracksToTranscribe = filterTracksByStrategy(tracksWithAudio, trackSelectionStrategy);
if (tracksToTranscribe.isEmpty()) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are no tracks we can skip right here. No need to run through everything else and then still skip at the end.


// Load the 'limit-to-one' configuration from the workflow operation.
// This configuration sets the limit of generated subtitle files to one
boolean limitToOne = BooleanUtils.toBoolean(workflowInstance.getCurrentOperation().getConfiguration(LIMIT_TO_ONE));
if (limitToOne) {
tracksToTranscribe = List.of(tracksToTranscribe.get(0));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replaces the very complex block below. This actually does the same since createSubtitle(…) always returned true, meaning it would always stop after the first track had been processed. That means we can just reduce the number of tracks to one right here.

ConfiguredTagsAndFlavors tagsAndFlavors, AppendSubtitleAs appendSubtitleAs, Boolean translate)
throws WorkflowOperationException {

// Start the transcription job, create subtitles file
URI trackURI = track.getURI();

if (!track.hasAudio()) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually checked that before. It's one of the track filters.

"Speech-to-Text job for media package '%s' failed, because of wrong workflow configuration. "
+ "track-selection-strategy of type '%s' does not exist.", mediaPackage, strategyCfg));
}
return TrackSelectionStrategy.EVERYTHING; // "transcribe everything" is the default/fallback
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just return if we already know the result. No need to continue. This makes the structure simpler.

}
}
return translateMode;
return BooleanUtils.toBoolean(StringUtils.trimToEmpty(operation.getConfiguration(TRANSLATE_MODE)));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one line basically replaces the whole function :D
We don't have to re-invent parsing booleans.

@opencast opencast deleted a comment from github-actions bot Oct 20, 2024
@lkiesow lkiesow force-pushed the speechtotext-async branch 2 times, most recently from 1138200 to 53c86c3 Compare October 21, 2024 08:28
@marwyg marwyg self-assigned this Oct 24, 2024
@marwyg marwyg self-requested a review October 24, 2024 07:09
@marwyg marwyg removed their assignment Oct 24, 2024
@KatrinIhler
Copy link
Member

FYI: Branch cut for OC 17 is on November 6!

Copy link
Contributor

This pull request has conflicts ☹
Please resolve those so we can review the pull request.
Thanks.

This patch allows generating subtitles in the backgrould, starting the
jobs first and attaching the result later in the workflow.

Unfortunately, looking into the code, I also found a lot of minor issues
which I fixed as well, but which were hard to separate from this patch:

- `language-fallback` didn't work
- Defaults for generator enging and Whisper.cpp binary fixed
- Defaults for several operations
- Simplified code
@lkiesow
Copy link
Member Author

lkiesow commented Oct 29, 2024

Fixed merge conflict

Copy link
Member

@marwyg marwyg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks way better as before. I couldn't find anything wrong with it.

I tested the following things (which worked):

  • async on
    • case 1: the transcription is faster then the attach operation
    • case 2: the transcription is slower then the workflow reaching the attach operation (in this case, the process waits for the transcription job to finish)
  • async off
    • everything works as before
  • limit to one
    • set to false and async to true -> multiple speech-to-text-job IDs stored in workflow variables, everything works as expected
    • set to false and async to false -> works
    • set to true with async to true -> works
    • set to true with async to false -> works
  • attach operation:
  • tags and flavors

The following things didn't work:

  • language fallback
    • removes the possibility to let whisper auto detect the language
  • attach operation
    • I don't know if this should be labeled as "not working", but the defined target-tags of the attach-operation are overriding the tags from the speechtotext operation. Don't know what a good solution would look here: Maybe we should add the attach-operation tags to the speechtotext-operation ones? Maybe we let it that way but add a note to the Docs?
  • metadata: Wrong language codes. Metadata provides 3 digits and whisper expects 2 digit codes i guess
    • metadata provides "deu" for example.. (but whisper works with language code "de"). Thats not like a Bug that was introduced. We could create another Issue for this or maybe this can be fixed easily?

@lkiesow
Copy link
Member Author

lkiesow commented Nov 5, 2024

  • language fallback
    • removes the possibility to let whisper auto detect the language

That may be why someone broke the fallback before. But I'll change it so that this value has no default. That way, we can skip it if not present.

I don't know if this should be labeled as "not working", but the defined target-tags of the attach-operation are overriding the tags from the speechtotext operation. Don't know what a good solution would look here: Maybe we should add the attach-operation tags to the speechtotext-operation ones? Maybe we let it that way but add a note to the Docs?

I think this is a misconception. Target tags apply to media package elements. Since the speechtotext operation does not generate any media package element, there are no tags. In other words, setting target-tags at that operation just has no effect at all. So, this is expected behavior. I'll add a notice to the docs.

metadata: Wrong language codes. Metadata provides 3 digits and whisper expects 2 digit codes i guess

That I didn't touch and won't fix in here. If I remember correctly, the Vosk code does some mapping internally. We could do something similary if we want to? But that's a separate issue.

@mtneug
Copy link
Member

mtneug commented Nov 5, 2024

The language issue was fixed for WhisperEngine.java in #5950, but not for WhisperCppEngine.java.

Since some TTS engines can auto-detect languages, we don't always want
to have a fallback language. That is why a default value of `en` doesn't
make sense. This patch removes the default, truly making this an
optional argument.
This patch clearly documents that setting target tags in the
`specchtotext` operation has no effect if the transcription is run
asynchronously.
@lkiesow
Copy link
Member Author

lkiesow commented Nov 5, 2024

Pull request has been updated.

  • The problem with the language fallback should be fixed.
  • The target-tags documentation got updated to explain what's happening
  • I'm not touching the 2 vs 3 letter language code in here. I didn't touch that in the first place and it seems like we already have different approaches. It may therefor make sense to think avout this for a bit and not rush this.

@lkiesow lkiesow requested a review from marwyg November 5, 2024 16:52
@marwyg
Copy link
Member

marwyg commented Nov 6, 2024

The changes are looking good. I tested the fallback again -> It works as expected.

The PR can be merged if you like.

@mtneug mtneug merged commit ca15856 into opencast:develop Nov 6, 2024
5 checks passed
@marwyg
Copy link
Member

marwyg commented Nov 6, 2024

I created an Issue for the language Code problem: #6293

mtneug pushed a commit to tales-media/fork-opencast-opencast that referenced this pull request Dec 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants