-
Notifications
You must be signed in to change notification settings - Fork 205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rtic-monotonic panics #956
Comments
Is this reliable? How long does it take to happen, does it happen immediately or after a while? Does something trigger it? This error happens if the interrupt of a monotonic got blocked for more than half a timer period, indicating that the timer lost track of time. Are you blocking interrupts, for example with |
It took minutes or more to happen. I used ADC and BufferedUart from I needed to get it ready quickly, because of the second issue #957 I decided to move it to embassy executor. |
Still not convinced that this is related, though.
Most stm32 timers are 16 bit, and if running with 1MHz they generate one interrupt every 32ms. If the half overflow cc would be completely inactive, the first error would happen after roughly 64 ms, reproducible. Which does not seem to be the case, if I understand this correctly. |
Either way, @andresv, if you continue to use rtic-monotonics, please monitor this and report back if it happens again. I suspect there is a different reason. Edit: |
It ran hours with rtic-monotonic 2.0.2 and then
I'll check if I can strip it down a little to make src available. |
Yeah, I was afraid that would happen. @korken89 can you reopen? My primary suspect right now is an incompatibility with another library. Rtic does not support critical sections that simply disable interrupts completely, and this is what might be happening here. Timer interrupts currently run at the highest possible priority (to my knowledge), so for something to block an interrupt for more than 64 ms, something must block that interrupt entirely. Either way, there's a good chance that this is not a bug in the monotonic, and the panic is correct to go off because a sanity check is actually failing. And the actual mistake is elsewhere. |
What this panic indicates is that the monotonic realizes it is not in the state it expected to be in, so it is in an unknown state and is most certainly reporting an incorrect time. The only way this can happen is if either the half way or the rollover interrupt was fired twice in a row without the other interrupt firing. Which puts the monotonic in an undefined state. |
Of course I don't rule out a programming error on our side, but so far I didn't find one. |
I also have this panic on my board, that increases in occurrence rate when I increase the timer frequency. It also occurs much more often with a release build than a debug build (lto on, overflow checks off). Suggesting a race condition somewhere. There are no cpsid instructions executed before the panic occurs. I looked into this, and the rtic macro's generate RTIC_ASYNC_MAX_LOGICAL_PRIO as one less than the smallest HW task. I added some code to print the configured priorities. And the timer is configured as the highest number, so lowest priority (240, in my case). One priority level lower than my lowest priority task. Unfortunately when I hard code the timer interrupt to priority 0 it doesn't resolve the issue. I am wondering if it is not an issue with a short-time capture compare interrupt configured by a delay causing the normal half period interrupt to be missed/not generating the IRQ. |
Is that so? It should be configured as the highest possible priority. Are you able to create a minimal reproducible example? It's hard to fix without being able to reproduce it. @korken89 would you mind checking the claim about incorrect priorities? I'm really low on time right now. |
head rtic-expansion.rs #[doc = r" The RTIC application module"] pub mod app
{
#[doc =
r" Always include the device crate which contains the vector table"] use
stm32g4xx_hal :: stm32g4 :: stm32g473 as
you_must_enable_the_rt_feature_for_the_pac_in_your_cargo_toml;
#[doc =
r" Holds the maximum priority level for use by async HAL drivers."]
#[no_mangle] static RTIC_ASYNC_MAX_LOGICAL_PRIO : u8 = 1u8; use super This snippet is from my tim20 monotonic implementation, it is a copy paste of the rtic provided functions. I also tested the stm32-tim3 rtic monotonic and it had the same behavior, now it is to much work to switch it around as I use all the timers for PWM. I have a couple tasks in code and that are configured at some priorities. I think RTIC_ASYNC_MAX_LOGICAL_PRIO is not the correct value to use to determine the highest priority of the system. unsafe {
let mut nvic: cortex_m::peripheral::NVIC = core::mem::transmute(());
extern "C" {
static RTIC_ASYNC_MAX_LOGICAL_PRIO: u8;
}
defmt::info!("rtic max: {}, prio bits: {}",
RTIC_ASYNC_MAX_LOGICAL_PRIO,
NVIC_PRIO_BITS);
let max_prio = RTIC_ASYNC_MAX_LOGICAL_PRIO.max(1).min(1 << NVIC_PRIO_BITS);
defmt::info!("max-prio: {}", max_prio);
let hw_prio = ((1 << NVIC_PRIO_BITS) - max_prio) << (8 - NVIC_PRIO_BITS);
nvic.set_priority(stm32g4xx_hal::stm32::Interrupt::TIM20_CC, hw_prio);
cortex_m::peripheral::NVIC::unmask(stm32g4xx_hal::stm32::Interrupt::TIM20_CC);
defmt::info!("Priority TIM20_CC: {}", cortex_m::peripheral::NVIC::get_priority(stm32g4xx_hal::stm32::interrupt::TIM20_CC));
defmt::info!("Priority USART 3: {}", cortex_m::peripheral::NVIC::get_priority(stm32g4xx_hal::stm32::interrupt::USART3));
defmt::info!("Priority USART 4: {}", cortex_m::peripheral::NVIC::get_priority(stm32g4xx_hal::stm32::interrupt::UART4));
defmt::info!("Priority EXTI4: {}", cortex_m::peripheral::NVIC::get_priority(stm32g4xx_hal::stm32::interrupt::EXTI4));
defmt::info!("Priority DMA1_CH2: {}", cortex_m::peripheral::NVIC::get_priority(stm32g4xx_hal::stm32::interrupt::DMA1_CH2));
defmt::info!("priority: {}", hw_prio);
}
} This is a copy of: pub(crate) unsafe fn set_monotonic_prio(
prio_bits: u8,
interrupt: impl cortex_m::interrupt::InterruptNumber,
) {
extern "C" {
static RTIC_ASYNC_MAX_LOGICAL_PRIO: u8;
}
let max_prio = RTIC_ASYNC_MAX_LOGICAL_PRIO.max(1).min(1 << prio_bits);
let hw_prio = crate::cortex_logical2hw(max_prio, prio_bits);
// We take ownership of the entire IRQ and all settings to it, we only change settings
// for the IRQ we control.
// This will also compile-error in case the NVIC changes in size.
let mut nvic: cortex_m::peripheral::NVIC = core::mem::transmute(());
nvic.set_priority(interrupt, hw_prio);
} output: |
Which would be the correct way to determine that? |
Should we do this? let max_prio = (1 << prio_bits) - 1; There also seems to be a To be honest, when I refactored the RTIC monotonics, I did not modify the Need some input from @korken89 for this one. |
Probably best to wait for input from @korken89 to check if meant RTIC_ASYNC_MAX_LOGICAL_PRIO for this, if RTIC_ASYNC_MAX_LOGICAL_PRIO is incorrectly set or another variable is needed that holds to the maximum configured priority level. In the mean-time I hard code it on my system and continue looking into the panic. Because I still see it even with hard-coded priority 0. I can quickly reproduce it by setting the monotonic rate to 4MHz and have a tasks running with a short delay (150 microseconds) and a couple tasks with longer delays (100 milliseconds). |
Are you sure you are not using long-running critical sections? That would cause that as well. |
No running critical sections. The only critical sections that occur are during defmt prints (that i removed from all tasks) and rtic when a task runs and locks a resource at highest priority (i verified that this is not the case by adding assert to rtic), and in the timer queue/monotonic code now() call/updating timer queues. I also checked the disassembly for all cpsid instructions occurrences. I am pretty confident that there is a race condition somewhere/bug with how the stm32 timer is handled as this issue didn't occur for a long time with debug builds of my project (hence I thought I fixed it with the interrupt priorities). I checked the stm32g4 errata for timer errata but nothing pops up there. I cannot imagine interrupts being blocked for 32ms without the critical sections and maximum priority monotonic interrupt. So the capture compare interrupt has to be missed/not raising interrupt or maybe during handling of the interrupt. |
I hope we find it ... monotonics are one of the most critical subsystems of an RTOS :D You will make me very happy if you find it ;) I don't think there's something algorithmically wrong with our approach, we talked about it very extensively during the rewrite and also with the embassy people. So there must be some incorrect detail/typo or whatever. As I wrote all of this code, this will keep nagging at the back of my brain until we find it :D Can you produce your system details and a minimal reproducible example? I'd love to help you hunt at some point if I find the time. I'd even buy the correct dev board for it. |
We could add two atomic counters to see how often both interrupts happend, so we can see if we actually missed one or if the error is somewhere else. Also, does it always happen with the same interrupt? Is it always the overflow or the half-period interrupt? If it happens, are both interrupt flags set (meaning, they weren't processed fast enough) or is one interrupt triggered twice without the other one in between (meaning, a lost interrupt)? Like, did it not get processed or did it not get fired? |
I'll see if i can cook up a minimal project for the nucleo-g474re board. I used this dev kit to prototype the code half a year ago. I already saw the panic on that board a couple times. Currently I use the stm32g474 mcu running rtic with a couple tasks. One ADC tasks that samples channels and switches external analog muxes and therefore uses a short ~150us delay to switch the muxes and copy adc results, one task that reads from SPI RX memory map for commands changes and updates outputs on the board based on measurements (every 20ms) one task that updates a SPI TX memory map with results (every 100ms). On the other side of the SPI bus I have a MCU connected to Ethernet, this one also loads the stm32g474 firmware over SPI. So now I just toggle a GPIO in the panic handler of the stm32g474 to signal that it panicked and has to reboot to work around this issue for now. Regarding the interrupt status: Code changed to clear flag after panic 1: Monotonic must have missed an half period interrupt 30015 UIF currently unused as it is a different interrupt number on TIM20 one interrupt triggered twice without the other one in between (meaning, a lost interrupt)? Like, did it not get processed or did it not get fired? -> this seems to be the case. Currently running a test with enable_timer() commented out. Runs stable now for a couple minutes at 4MHz. While before it panicked after couple seconds. I think the issue might be related to the half/period irq being raised directly after or before a short delay interrupt. |
Feel free to add that timer to |
The only reason it is enabled on stm32 is because stm32 overflows so rapidly, I think. What happens if you completely remove |
I might have found it.
See here for a similar problem in libopenocm. |
@govert-overgaauw-ampel Please check if #984 fixes your problem. |
Very nice find! I am running it on my board right now and it looks good. I'll deploy it on more boards and let it run for the day and i'll get back to you. The board ran overnight without crashing with the enable_timer commented out. Suggesting that this is definitely the issue. What is the meaning of: r.0 = !0; I am using a different hal and it doesn't allow it: + Self::timer().sr.write(|w| {
+ w.cc3if().clear_bit()
+ }); To fully close this issue we should also check the priority level or it might still cause panics for other users depending on their application. |
It means "Set all bits of the register to |
When making statements like this, mentioning a specific hal would be more helpful than "different" :P |
Ah now I fully understand why it occurs. It writes an accidental 0 to interrupts just triggered. And removing the enable_timer call removes the nondeterministic interrupts. The monotonic implementation I made is out of source tree of RTIC in my own project. Due to our hardware design we needed 21 PWM outputs and a lot of ADCs, and this only allowed for TIM20 to be used for monotonic. It uses 3 capture compare units instead of the UIF due to UIF being a different interrupt on the STM32G4 than CC irq, which would cause it call two interrupt handlers. It uses the https://github.com/stm32-rs/stm32g4xx-hal . So I'll change my snippet to always write all bits to 1 except the one I want to clear. So submitting this snippet wouldn't really map nicely with the other stm32 monotonic implementations. |
Got it :) well, let us know if you encounter further panics or if it is stable now. |
Self::timer().sr.write(|w| {
unsafe { w.bits(!0); }
w.cc3if().clear_bit()
}); P.S. If you apply similar to stm32-rs/stm32-rs@ef41db1#diff-fd266a65971a9cac85b29fa61d899346f8ad8a04f6142eae833878f1ddd30488 patch for G4 timer it will ignore unchanged bits internally. |
Does that initialize all bits to 1 automatically? |
@govert-overgaauw-ampel no more panics? |
Very nice find! Just to clarify |
Does that mean the current implementation of priorities is as intended or not? |
Currently For I gave a look an we do not do a good job of documenting this relationship though, it should be improved. |
All W0C bits, yes. |
Would like to still hear from @govert-overgaauw-ampel whether this fixed his problem, though. @korken89 Let's keep it open until we have confirmation. |
I'll wait with merging #989 until confirmed then. |
I'd love to hear from @andresv as well, but I kind of don't expect that I will :) |
@korken89 I'm fine with releasing |
Alright, then I'll start the release procedure! |
@govert-overgaauw-ampel See if |
@Finomnis I was in the hurry to ship so I just switched to embassy for this particular project. Unfortunately I cannot verify if it now also works for me. |
No problem. Thanks for the answer! Also sorry it took so long to find, that was a very subtle bug... |
I ran two boards the whole night without panicking with timer set to 4MHz. It would panic in seconds before the fix. I can confirm that this fixes the issue. I only tested with out of source tree Tim20 monotonic. The code is the same so the official RTIC monotonics should be fixed as well. + Self::timer().sr.write(|w| unsafe {
+ w.bits(!0).cc3if().clear_bit()
+ }); However I still have the priority hardcoded to max due to RTIC_ASYNC_MAX_LOGICAL_PRIO being 1. I have tasks of priority 2 and 3 and those are configured in the NVIC correctly by RTIC. So not sure what is going on there. It could be wise to create a seperate issue for that and mark this one as fixed as the race condition is definitely the root cause of the panic. |
@Finomnis kudos for figuring this out. |
I managed to get monotonic panic. TIM3 on stm32g030k6 should be 16bit timer.
The text was updated successfully, but these errors were encountered: