Multichannel peak, onset, and beat detection #1766
**Codecov Report**

```
@@            Coverage Diff             @@
##             main    #1766      +/-   ##
==========================================
- Coverage   98.77%   98.74%   -0.03%
==========================================
  Files          34       34
  Lines        4670     4638      -32
==========================================
- Hits         4613     4580      -33
- Misses         57       58       +1
```
(force-pushed from f0d467c to 5acc4eb)
Thinking through the beat tracking implementation a bit, I'll likely make the following changes:
While working on the beat tracker, I'm taking some time to clean up some internal API and comments to be more useful and transparent.
Circling back on a very old issue #44 (closed back in 2019 when we merged PLP) while my head is back in this code. And yes, I recognize that it's somewhat a moot point, as the beat tracker we have is far behind SotA these days. However, I still think there's value in at least documenting where the difficulties are, if not improving things to the point where it remains useful.

**Local scores**

The first step of the beat tracker, after computing the onset strength envelope (OSE) and estimating tempo, is to compute a "local score" for each frame. This is essentially the OSE convolved with a very sharp window spanning ±1 beat (lines 447 to 450 in fb7013b).

This is all easily implemented by a 1-D convolution when the tempo is fixed; see the sketch below. When the tempo is not fixed, we'd need either a separate filter for each frame, or some kind of aggregated filter. The former is inefficient, but could possibly be computed quickly by sparsification. The latter seems problematic to me.

**Back-searching**

The next place where variable tempo could be problematic is in the DP score calculation (lines 458 to 475 in fb7013b), where each frame looks back to a window around where we'd expect the previous beat to land.
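For concreteness, here is a minimal sketch of the static-tempo local score as a same-mode convolution. The Gaussian window matches the one that appears later in this thread; the standalone function and its name are illustrative, not librosa's actual implementation.

```python
import numpy as np
import scipy.signal

def local_score(onset_envelope, frames_per_beat):
    """Sketch of the static-tempo local score (hypothetical helper)."""
    # Sharp Gaussian spanning roughly +/- one beat period
    window = np.exp(
        -0.5
        * (np.arange(-frames_per_beat, frames_per_beat + 1) * 32.0 / frames_per_beat) ** 2
    )
    # A vanilla same-mode 1-D convolution: only valid while the tempo is fixed
    return scipy.signal.convolve(onset_envelope, window, mode="same")
```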
Update on beat tracking before I break on this for a bit. The multichannel extension / refactor is almost done. I'm running into some numba dispatching problems though, as it's unhappy about the following:

```python
@numba.guvectorize(
    [
        "void(float32[:], float32, float32, int32[:], float32[:])",
        "void(float64[:], float64, float32, int32[:], float64[:])",
    ],
    "(t),(),()->(t),(t)",
    nopython=True, cache=True)
def __beat_track_dp(localscore, frames_per_beat, tightness, backlink, cumscore):
    ...
```

This might need to be reworked a little as a jitted 1-D function that gets wrapped up in a numpy vectorize instead of directly using numba vectorize 😓.
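For reference, the rework described above might look something like the following: a jitted 1-D kernel plus an `np.vectorize` wrapper carrying the gufunc signature. This is a sketch under those assumptions, with a placeholder body, not the final implementation.

```python
import numba
import numpy as np

@numba.njit(cache=True)
def _beat_track_dp_1d(localscore, frames_per_beat, tightness):
    # Placeholder body: the real kernel would run the DP recursion here
    backlink = np.full(localscore.shape[0], -1, dtype=np.int32)
    cumscore = localscore.copy()
    return backlink, cumscore

# np.vectorize handles broadcasting over leading (channel) dimensions,
# sidestepping numba's gufunc dispatch over mixed input dtypes
_beat_track_dp = np.vectorize(
    _beat_track_dp_1d, signature="(t),(),()->(t),(t)"
)
```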
This still needs testing, but multichannel beat tracking is now functional. With jit optimizations, we ended up being just marginally faster than the old implementation, so it's possible that there are more gains to be had here. Next step is to expand the tests to cover the multichannel cases.
😆 Rewriting the beat tracker has revealed an off-by-one error that has been trimming one too many beats since ... forever! Here's the problem. The current (main branch) implementation does the following when trimming beats (lines 505 to 516 in 5ca70f5): that is, it finds all beat positions where the score exceeds a threshold, and then slices the index array from the first to the last. The multichannel branch rewrites this as an explicit loop (lines 540 to 553 in 7f07f0b).

The idea is the same, but note that when we use the array slice notation (main branch), it excludes the upper end of the range, even though the score at that index exceeds the threshold. This is verified by printing out the per-beat scores and the trim threshold:

```python
In [1]: import librosa

In [2]: y, sr = librosa.load('data/test1_22050.wav')

In [3]: bpm, beats = librosa.beat.beat_track(y=y, sr=sr)
[7.36539798 8.16666805 8.83107379 8.38428126 4.82421986 1.50669484
 0.78669973 1.10853638 1.30460965 3.12087935 3.75983776 1.46726352] 2.599477926489801
```

The values printed out are numerically identical to those calculated on the main branch. However, the main branch excludes the last valid beat (score 3.75983776) while the multichannel branch includes it. 🤦 This turns out to be the cause of our current test failures, where a display example using beat coords is triggering an error.
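A toy example (with made-up scores) illustrates the slice exclusivity:

```python
import numpy as np

# Hypothetical scores and threshold, not the real values above
score = np.array([1.0, 4.0, 5.0, 4.5, 0.5, 3.5, 0.2])
valid = np.flatnonzero(score > 2.6)    # -> [1, 2, 3, 5]

beats = np.arange(len(score))
print(beats[valid[0]:valid[-1]])       # main-branch slice:  [1 2 3 4]
print(beats[valid[0]:valid[-1] + 1])   # loop-equivalent:    [1 2 3 4 5]
```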
(force-pushed from 7f07f0b to 42ca94c)
I think aside from linting and final cleanups, this is ready for CR. One thing that we should discuss, though, is how we want to handle user-provided tempo in multichannel beat tracking. Previously, tempo could only ever be a scalar, so there was no ambiguity. Scalar tempo input still works as before, and you can also provide a multichannel array via the `bpm` parameter. The hitch here is that the internal tempo estimate carries a trailing time axis, which we slice off before tracking (lines 196 to 203 in 42ca94c). The problem I'm anticipating is that we'd like to be able to do something like:
The problem I'm anticipating is that we'd like to be able to do something like: >>> bpm = librosa.feature.tempo(y=y, sr=sr, ...)
>>> _, beats = librosa.beat.beat_track(y=y, sr=sr, bpm=bpm, ...) but this will fail to vector dispatch properly: ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[5], line 1
----> 1 librosa.beat.beat_track(y=y, sr=sr, sparse=False, bpm=bpm)
File ~/git/librosa/librosa/beat.py:206, in beat_track(y, sr, onset_envelope, hop_length, start_bpm, tightness, trim, bpm, prior, units, sparse)
197 bpm = _tempo(
198 onset_envelope=onset_envelope,
199 sr=sr,
(...)
202 prior=prior,
203 )[..., 0]
205 # Then, run the tracker
--> 206 beats = __beat_tracker(onset_envelope, bpm, float(sr) / hop_length, tightness, trim)
208 if sparse:
209 beats = np.flatnonzero(beats)
File ~/git/librosa/librosa/beat.py:454, in __beat_tracker(onset_envelope, bpm, frame_rate, tightness, trim)
452 tail = __last_beat(cumscore)
453 beats = np.zeros_like(onset_envelope, dtype=bool)
--> 454 __dp_backtrack(backlink, tail, beats)
457 # Discard spurious trailing beats
458 if trim:
File ~/miniconda3/envs/py311/lib/python3.11/site-packages/numba/np/ufunc/gufunc.py:171, in GUFunc.__call__(self, *args, **kwargs)
167 def __call__(self, *args, **kwargs):
168 # If compilation is disabled OR it is NOT a dynamic gufunc
169 # call the underlying gufunc
170 if self._frozen or not self.is_dynamic:
--> 171 return self.ufunc(*args, **kwargs)
172 elif "out" in kwargs:
173 # If "out" argument is supplied
174 args += (kwargs.pop("out"),)
ValueError: output operand requires a reduction along dimension -1, but the reduction is not enabled. The dimension size of 1 does not match the expected output shape. We could detect this case easily enough and fix it internally, eg: if bpm.ndim == onset_envelope.ndim:
if bpm.shape[-1] != 1:
raise ParameterError("blah blah blah")
# Slice out the trailing time axis
bpm = bpm[..., 0] but it seems a little hacky. Curious if other people have suggestions here. |
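One alternative, sketched here under stated assumptions rather than as a settled design, would be to validate via numpy's broadcasting rules instead of special-casing the trailing axis; this anticipates the `np.broadcast_shapes` idea that comes up in review below. The helper name is hypothetical.

```python
import numpy as np
from librosa.util.exceptions import ParameterError

def _check_bpm_shape(bpm, onset_envelope):
    """Hypothetical check: accept any bpm shape that broadcasts against
    the envelope's non-time (channel) dimensions."""
    try:
        np.broadcast_shapes(np.shape(bpm), onset_envelope.shape[:-1])
    except ValueError as exc:
        raise ParameterError(
            f"bpm shape={np.shape(bpm)} is incompatible with "
            f"onset envelope shape={onset_envelope.shape}"
        ) from exc
```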
In pinning down the vectorized tempo API, it's probably worth thinking about the (hypothetical) variable-tempo case as well. Mainly, I'm trying to avoid painting ourselves into a corner here in case we do at some point want to revisit this idea. The hunch lurking in the back of my mind is that the DP-based beat tracker could still be very useful with a sufficiently strong onset estimator, and most modern beat trackers do include some kind of DP, Viterbi, or other sequential decoding for exactly this reason. As such, I think it might be worth maintaining and even extending the beat tracker logic here even if the core method itself is rather antiquated, because it could find use as a utility function in more sophisticated models.
Now that we've rewritten the backlink search as a jitted function, we actually have a bit more flexibility to support a variable search window for each time step. The existing implementation uses a preallocated window that gets copied and truncated at each step, and really is a bit of a kludge to maintain as much vectorization as possible for efficiency purposes. If instead, we unroll this into an explicit loop (which would be fast under numba), we can let each frame have a different tempo and hence a different search window.
This part is still not great, but I do think implementing this quasi-convolution as a sparse near-Toeplitz matrix multiply would be a viable solution here. We can retain the conv1d version for static tempo, of course.

**Multi-channel, multi-tempo 😱**

The API difficulty here creeps in when we now have to consider the different ways in which a tempo array can line up against a multichannel input.
There's an inherent ambiguity with cases (2) and (3) that I'm not wild about. The obvious way to resolve this is to require expanded singleton dimensions in whichever direction makes the use case explicit. However, this goes against the way we currently support tempo arguments (see above comment), so we should think carefully about how exactly we want to handle that.
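To make the ambiguity concrete, here is how an explicit singleton-dimension convention could look for a hypothetical stereo envelope of shape `(2, T)`. Shapes and values are illustrative only.

```python
import numpy as np

T = 1000  # hypothetical number of frames
bpm_static_per_channel = np.array([[120.0], [96.0]])  # (2, 1): one tempo per channel
bpm_dynamic_shared = np.full((1, T), 120.0)           # (1, T): one tempo curve, all channels
bpm_dynamic_per_channel = np.full((2, T), 120.0)      # (2, T): a separate curve per channel
```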
Remaining TODOs:
I've rewritten the DP backsearch as an explicit loop rather than a vectorized argmax. Speed is basically unaffected, and it would now be trivial to support time-varying tempo in this part of the algorithm. I'm still struggling with getting vector dispatch to work consistently when `bpm` is array-valued rather than scalar.
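For the curious, the explicit-loop formulation looks roughly like the following. This is a simplified sketch of an Ellis-style DP with a log-squared transition cost, not the actual librosa kernel; the search-window bounds and names are illustrative.

```python
import numba
import numpy as np

@numba.njit(cache=True)
def dp_backsearch(localscore, frames_per_beat, tightness):
    """Sketch: for each frame, search a window around one beat period back."""
    N = len(localscore)
    backlink = np.full(N, -1, dtype=np.int32)  # -1 marks "no prior beat"
    cumscore = localscore.copy()
    for i in range(N):
        best = -np.inf
        lo = max(0, int(i - 2 * frames_per_beat))
        hi = max(0, int(i - frames_per_beat / 2))
        for j in range(lo, hi):
            # Penalize deviation of the interval (i - j) from the beat period
            score = cumscore[j] - tightness * (np.log(i - j) - np.log(frames_per_beat)) ** 2
            if score > best:
                best = score
                backlink[i] = j
        if backlink[i] >= 0:
            cumscore[i] = localscore[i] + best
    return backlink, cumscore
```

Because the search bounds are recomputed at every frame, swapping the scalar `frames_per_beat` for a per-frame value `frames_per_beat[i]` is essentially all that's needed to support time-varying tempo.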
(force-pushed from 7e15993 to 30c368f)
I've now rewritten this using guvectorize and it seems to essentially work, at least in the sense that calling with fully vectorized tempo inputs properly generalizes independent single calls, e.g. with a stereo input:

```python
In [24]: onsets = librosa.onset.onset_strength(y=y, sr=sr, aggregate=np.median)

In [25]: bpm = librosa.feature.tempo(onset_envelope=onsets, sr=sr)

In [26]: t1, b1 = librosa.beat.beat_track(onset_envelope=onsets, bpm=bpm, sparse=False)

In [27]: t1a, b1a = librosa.beat.beat_track(onset_envelope=onsets[0], bpm=bpm[0], sparse=False)

In [28]: t1b, b1b = librosa.beat.beat_track(onset_envelope=onsets[1], bpm=bpm[1], sparse=False)

In [29]: t1, t1a, t1b
Out[29]:
(array([[ 89.10290948],
        [117.45383523]]),
 array([89.10290948]),
 array([117.45383523]))

In [30]: np.allclose(b1[0], b1a)
Out[30]: True

In [31]: np.allclose(b1[1], b1b)
Out[31]: True
```

A problem does arise if you pass in a tempo vector that is not fully formed, e.g. if we were to flatten the bpm array:

```python
In [32]: t2, b2 = librosa.beat.beat_track(onset_envelope=onsets, bpm=bpm.flatten(), sparse=False)

In [33]: t2
Out[33]: array([ 89.10290948, 117.45383523])

In [34]: np.allclose(b1, b2)
Out[34]: False
```

In this case, it appears to be taking the first tempo (89.10290948) across both channels. The proper fix here, I think, is to use an expand_dims to restore the trailing axis on tempo inputs before dispatching.
I put in the expand_dims hack, and it appears to be functional now:

```python
In [1]: import librosa

In [2]: import numpy as np

In [3]: y, sr = librosa.load(librosa.ex('fishin', hq=True), mono=False)

In [4]: onsets = librosa.onset.onset_strength(y=y, sr=sr, aggregate=np.median)

In [5]: bpm = librosa.feature.tempo(onset_envelope=onsets, sr=sr)

In [6]: t1, b1 = librosa.beat.beat_track(onset_envelope=onsets, bpm=bpm, sparse=False)

In [7]: t2, b2 = librosa.beat.beat_track(onset_envelope=onsets, bpm=bpm.flatten(), sparse=False)

In [8]: np.allclose(b1, b2)
Out[8]: True
```
Seems like this fails with 3-dimensional inputs… will need to investigate further.
Following up: I believe the implementation is correct. The disagreement I was seeing came down to the onset envelope calculation differing when given multichannel input. This eventually traces back to the amplitude_to_db calculation, which is not computed per channel.
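A small demonstration of the (assumed) coupling mechanism: with `top_db` clipping referenced to the array-wide maximum, a loud channel moves the clipping floor for a quiet one, so joint and per-channel results diverge.

```python
import numpy as np
import librosa

rng = np.random.default_rng(0)
# Channel 1 is ~100x louder than channel 0 (synthetic data)
S = rng.random((2, 128, 200)) * np.array([1.0, 100.0])[:, None, None]

db_joint = librosa.amplitude_to_db(S)     # clip floor set by the loud channel
db_solo = librosa.amplitude_to_db(S[0])   # clip floor set by channel 0 alone
print(np.allclose(db_joint[0], db_solo))  # -> False in general
```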
I'm going to take a crack at extending this to support time-varying tempo, because why not?
Quick stab at a time-varying local score function:

```python
import numba
import numpy as np

from librosa.util.exceptions import ParameterError


@numba.guvectorize(
    [
        "void(float32[:], float32[:], float32[:])",
        "void(float64[:], float64[:], float64[:])",
    ],
    "(t),(n)->(t)",
    nopython=True, cache=False)
def bls_numbav2(onset_envelope, frames_per_beat, localscore):
    N = len(onset_envelope)
    if len(frames_per_beat) == 1:
        # Static tempo mode
        window = np.exp(-0.5 * (np.arange(-frames_per_beat[0], frames_per_beat[0] + 1) * 32.0 / frames_per_beat[0]) ** 2)
        K = len(window)
        # This is a vanilla same-mode convolution
        for i in range(len(onset_envelope)):
            localscore[i] = 0.
            # we need i + K // 2 - k < N  ==>  k > i + K // 2 - N
            # and i + K // 2 - k >= 0     ==>  k <= i + K // 2
            for k in range(max(0, i + K // 2 - N + 1), min(i + K // 2 + 1, K)):
                localscore[i] += window[k] * onset_envelope[i + K // 2 - k]
    elif len(frames_per_beat) == len(onset_envelope):
        # Time-varying tempo estimates.
        # This isn't exactly a convolution anymore, since the filter is
        # time-varying, but it's pretty close
        for i in range(len(onset_envelope)):
            window = np.exp(-0.5 * (np.arange(-frames_per_beat[i], frames_per_beat[i] + 1) * 32.0 / frames_per_beat[i]) ** 2)
            K = 2 * int(frames_per_beat[i]) + 1
            localscore[i] = 0.
            for k in range(max(0, i + K // 2 - N + 1), min(i + K // 2 + 1, K)):
                localscore[i] += window[k] * onset_envelope[i + K // 2 - k]
    else:
        raise ParameterError(f"Invalid bpm shape={len(frames_per_beat)} does not match onset envelope shape={len(onset_envelope)}")
```

For static tempo, this is about 200μs slower than the straight-ahead scipy implementation we previously had (70μs on the fishin example with standard params), but about 200μs faster than a numba-wrapped np.convolve (420μs). The latter is because, until recently, numba's convolve implementation did not support modes, so we have to explicitly copy out the same-mode slice if using np.convolve. The above implementation avoids this copy and works in place on the pre-allocated output array. For time-varying tempo, it takes about 1.3ms. We could probably shave that down a bit, but I don't actually think it would be worth the added code complexity to do so.
Dynamic tempo beat tracker is now implemented in this branch. I haven't implemented tests yet, but so far it seems to be behaving exactly as I'd expect. On the brahms example: red and blue show the current tracker (static tempo); orange and green show how it works with dynamic tempo. Dashed lines are tempo estimates (autocorrelation of oenv), solid lines are inter-beat intervals converted to BPM after running the detector. The static-tempo tracker is allowed to fluctuate away from the static tempo, and it does so pretty severely when the tempo shifts. The dynamic tracker follows the estimated tempo, as expected. Neither of these is necessarily great, though I suspect most of the problems in the dynamic case are due to bad local tempo estimation, and not a failure of the tracking per se. A better dynamic tempo estimator would probably improve the tracking results.
Doing some more thorough testing here (static tempo case), I noticed that the DP behavior is now slightly different from what we had in prior stable versions. For identical inputs (onset envelope saved to disk) in this branch vs 0.9.2, I've verified that the computation agrees numerically up to the local score step. The next step is the DP, which in this branch produces a backlink table where the special token value of -1 indicates that no prior link is possible. The 0.9.2 implementation produces a backlink table which is not only different, but clearly incorrect, because we should never have negative values here apart from -1. I'm not sure exactly where these came from, but it does seem like a bug. This never appeared in the final output because the backtracking loop discards any negative values. However, it does seem to corrupt the cumulative score computation, and the resulting detections are slightly different.
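For reference, the reason spurious negative values never surfaced in the output is that backtracking only follows non-negative links; a minimal sketch, assuming -1 as the sentinel (names are illustrative):

```python
import numpy as np

def backtrack(backlink, tail):
    """Walk backwards from the best tail beat until no prior link exists."""
    beats = [tail]
    while backlink[beats[-1]] >= 0:
        beats.append(backlink[beats[-1]])
    return beats[::-1]

# e.g. backtrack(np.array([-1, -1, 0, 1, 2, 3]), tail=5) -> [1, 3, 5]
```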
(force-pushed from f787468 to 03c0fed)
```python
# TODO: this might be better accomplished with a np.broadcast_shapes check
if bpm.shape[-1] not in (1, onset_envelope.shape[-1]):
    raise ParameterError(f"Invalid bpm shape={bpm.shape} does not match onset envelope shape={onset_envelope.shape}")
```
untested?
librosa/util/utils.py (outdated):

```diff
-if x.ndim != 1:
-    raise ParameterError("input array must be one-dimensional")
+if sparse and x.ndim != 1:
+    raise ParameterError("input array must be one-dimensional if sparse=True")
```
same proposal as before
Thanks @lostanlen! I've updated the text, mostly taking your suggestions (with some minor tweaks here and there). Exception messages have also been expanded per your suggestion, though the phrasing differs in the two cases you identified because one has to do with onsets/signals and the other with arbitrary data arrays.
Doc CI failure is due to a temporary outage of the scipy proceedings archive; nothing to be done from our side.
Thank you @bmcfee for incorporating my feedback.

```python
if bpm.shape[-1] not in (1, onset_envelope.shape[-1]):
    raise ParameterError(f"Invalid bpm shape={bpm.shape} does not match onset envelope shape={onset_envelope.shape}")
```
Yeah - I'm not too worried about this. It's a redundant check within a private function, and the user-facing function (`beat_track`) is already covered by tests that would catch it.
Reference Issue
Fixes #1695
Follows up on #1130
Fixes #1773
Fixes #44
What does this implement/fix? Explain your changes.
This PR implements the extensions to various detection methods proposed in #1695 to support (dense, boolean) output and multi-channel input.
- [ ] onset backtracking (maybe out of scope, but could be improved)

Any other comments?
Everything here should be backward-compatible, and at least as efficient as the existing implementations.