Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rtic-monotonic panics #956

Closed
andresv opened this issue Jun 28, 2024 · 46 comments · Fixed by #960 or #984
Closed

rtic-monotonic panics #956

andresv opened this issue Jun 28, 2024 · 46 comments · Fixed by #960 or #984

Comments

@andresv
Copy link
Contributor

andresv commented Jun 28, 2024

I managed to get monotonic panic. TIM3 on stm32g030k6 should be 16bit timer.

embassy-stm32 = { git = "https://github.com/embassy-rs/embassy.git", rev = "000b022ae2e52e9abaabbd10110b4c583fe4344c", features = [
    "defmt",
    "stm32g030k6",
    "unstable-pac",
    "exti",
] }
rtic-monotonics = { version = "2.0.1", features = [
    "stm32g030k6",
    "stm32_tim3",
] }
rtic-sync = "1.3"
rtic = { version = "2.1", features = ["thumbv6-backend"] }
15439 INFO  periodic
└─ ptubasehub::app::input_events::{async_fn#0} @ src/main.rs:176
33074 INFO  periodic
└─ ptubasehub::app::input_events::{async_fn#0} @ src/main.rs:176
32779 ERROR panicked at /Users/andres/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rtic-monotonics-2.0.1/src/stm32.rs:338:1:
Monotonic must have missed an interrupt!
└─ panic_probe::print_defmt::print @ /Users/andres/.cargo/registry/src/index.crates.io-6f17d22bba15001f/panic-probe-0.3.1/src/lib.rs:104
@Finomnis
Copy link
Contributor

Finomnis commented Jul 4, 2024

Is this reliable? How long does it take to happen, does it happen immediately or after a while? Does something trigger it?

This error happens if the interrupt of a monotonic got blocked for more than half a timer period, indicating that the timer lost track of time. Are you blocking interrupts, for example with cortex_m::interrupt::free?

@andresv
Copy link
Contributor Author

andresv commented Jul 4, 2024

It took minutes or more to happen. I used ADC and BufferedUart from embassy-stm32 HAL + blocking onewire.

I needed to get it ready quickly, because of the second issue #957 I decided to move it to embassy executor.

@Finomnis
Copy link
Contributor

Finomnis commented Jul 5, 2024

@korken89

Still not convinced that this is related, though.

It took minutes or more to happen.

Most stm32 timers are 16 bit, and if running with 1MHz they generate one interrupt every 32ms. If the half overflow cc would be completely inactive, the first error would happen after roughly 64 ms, reproducible. Which does not seem to be the case, if I understand this correctly.

@Finomnis
Copy link
Contributor

Finomnis commented Jul 5, 2024

Either way, @andresv, if you continue to use rtic-monotonics, please monitor this and report back if it happens again. I suspect there is a different reason.

Edit:
I was informed this timer might be 32 bits. In that case, this would make sense.

@andresv
Copy link
Contributor Author

andresv commented Jul 7, 2024

STM32G030K6T6 only has 16bit timers:
Screenshot 2024-07-07 at 13 16 31

@andresv
Copy link
Contributor Author

andresv commented Jul 7, 2024

It ran hours with rtic-monotonic 2.0.2 and then

10 ERROR panicked at /Users/andres/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rtic-monotonics-2.0.2/src/stm32.rs:338:1:
Monotonic must have missed an interrupt!
└─ panic_probe::print_defmt::print @ /Users/andres/.cargo/registry/src/index.crates.io-6f17d22bba15001f/panic-probe-0.3.1/src/lib.rs:104

I'll check if I can strip it down a little to make src available.

@Finomnis
Copy link
Contributor

Finomnis commented Jul 7, 2024

Yeah, I was afraid that would happen. @korken89 can you reopen?

My primary suspect right now is an incompatibility with another library. Rtic does not support critical sections that simply disable interrupts completely, and this is what might be happening here.

Timer interrupts currently run at the highest possible priority (to my knowledge), so for something to block an interrupt for more than 64 ms, something must block that interrupt entirely.

Either way, there's a good chance that this is not a bug in the monotonic, and the panic is correct to go off because a sanity check is actually failing. And the actual mistake is elsewhere.

@Finomnis
Copy link
Contributor

Finomnis commented Jul 7, 2024

What this panic indicates is that the monotonic realizes it is not in the state it expected to be in, so it is in an unknown state and is most certainly reporting an incorrect time.

The only way this can happen is if either the half way or the rollover interrupt was fired twice in a row without the other interrupt firing. Which puts the monotonic in an undefined state.

@Finomnis
Copy link
Contributor

Finomnis commented Jul 7, 2024

Of course I don't rule out a programming error on our side, but so far I didn't find one.

@AfoHT AfoHT reopened this Jul 17, 2024
@govert-overgaauw-ampel
Copy link

What this panic indicates is that the monotonic realizes it is not in the state it expected to be in, so it is in an unknown state and is most certainly reporting an incorrect time.

The only way this can happen is if either the half way or the rollover interrupt was fired twice in a row without the other interrupt firing. Which puts the monotonic in an undefined state.

I also have this panic on my board, that increases in occurrence rate when I increase the timer frequency. It also occurs much more often with a release build than a debug build (lto on, overflow checks off). Suggesting a race condition somewhere. There are no cpsid instructions executed before the panic occurs.

I looked into this, and the rtic macro's generate RTIC_ASYNC_MAX_LOGICAL_PRIO as one less than the smallest HW task. I added some code to print the configured priorities. And the timer is configured as the highest number, so lowest priority (240, in my case). One priority level lower than my lowest priority task. Unfortunately when I hard code the timer interrupt to priority 0 it doesn't resolve the issue. I am wondering if it is not an issue with a short-time capture compare interrupt configured by a delay causing the normal half period interrupt to be missed/not generating the IRQ.

@Finomnis
Copy link
Contributor

Is that so? It should be configured as the highest possible priority.
If that's not the case, then this is most certainly a bug.

Are you able to create a minimal reproducible example? It's hard to fix without being able to reproduce it.

@korken89 would you mind checking the claim about incorrect priorities? I'm really low on time right now.

@govert-overgaauw-ampel
Copy link

govert-overgaauw-ampel commented Oct 22, 2024

head rtic-expansion.rs

#[doc = r" The RTIC application module"] pub mod app
{
    #[doc =
    r" Always include the device crate which contains the vector table"] use
    stm32g4xx_hal :: stm32g4 :: stm32g473 as
    you_must_enable_the_rt_feature_for_the_pac_in_your_cargo_toml;
    #[doc =
    r" Holds the maximum priority level for use by async HAL drivers."]
    #[no_mangle] static RTIC_ASYNC_MAX_LOGICAL_PRIO : u8 = 1u8; use super

This snippet is from my tim20 monotonic implementation, it is a copy paste of the rtic provided functions. I also tested the stm32-tim3 rtic monotonic and it had the same behavior, now it is to much work to switch it around as I use all the timers for PWM. I have a couple tasks in code and that are configured at some priorities. I think RTIC_ASYNC_MAX_LOGICAL_PRIO is not the correct value to use to determine the highest priority of the system.

        unsafe {                                                                                                                                                                                                                                                                                                                                      
            let mut nvic: cortex_m::peripheral::NVIC = core::mem::transmute(());                                                                                                                                                                                                                                                                      
            extern "C" {                                                                                                                                                                                                                                                                                                                              
                static RTIC_ASYNC_MAX_LOGICAL_PRIO: u8;                                                                                                                                                                                                                                                                                               
            }                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                      
            defmt::info!("rtic max: {}, prio bits: {}",                                                                                                                                                                                                                                                                                               
                RTIC_ASYNC_MAX_LOGICAL_PRIO,                                                                                                                                                                                                                                                                                                          
                NVIC_PRIO_BITS);                                                                                                                                                                                                                                                                                                                      
                                                                                     
            let max_prio = RTIC_ASYNC_MAX_LOGICAL_PRIO.max(1).min(1 << NVIC_PRIO_BITS);                                                                                    
            defmt::info!("max-prio: {}", max_prio);
            let hw_prio = ((1 << NVIC_PRIO_BITS) - max_prio) << (8 - NVIC_PRIO_BITS);                                                                                      
                                                                                     
            nvic.set_priority(stm32g4xx_hal::stm32::Interrupt::TIM20_CC, hw_prio);
            cortex_m::peripheral::NVIC::unmask(stm32g4xx_hal::stm32::Interrupt::TIM20_CC);                                                                    
                                                                                                                       
            defmt::info!("Priority TIM20_CC: {}", cortex_m::peripheral::NVIC::get_priority(stm32g4xx_hal::stm32::interrupt::TIM20_CC));                                    
            defmt::info!("Priority USART 3: {}", cortex_m::peripheral::NVIC::get_priority(stm32g4xx_hal::stm32::interrupt::USART3));                                       
            defmt::info!("Priority USART 4: {}", cortex_m::peripheral::NVIC::get_priority(stm32g4xx_hal::stm32::interrupt::UART4));                                        
            defmt::info!("Priority EXTI4: {}", cortex_m::peripheral::NVIC::get_priority(stm32g4xx_hal::stm32::interrupt::EXTI4));                                          
            defmt::info!("Priority DMA1_CH2: {}", cortex_m::peripheral::NVIC::get_priority(stm32g4xx_hal::stm32::interrupt::DMA1_CH2));                                    
            defmt::info!("priority: {}", hw_prio);  
        }              
    }      

This is a copy of:

pub(crate) unsafe fn set_monotonic_prio(
    prio_bits: u8,
    interrupt: impl cortex_m::interrupt::InterruptNumber,
) {
    extern "C" {
        static RTIC_ASYNC_MAX_LOGICAL_PRIO: u8;
    }

    let max_prio = RTIC_ASYNC_MAX_LOGICAL_PRIO.max(1).min(1 << prio_bits);

    let hw_prio = crate::cortex_logical2hw(max_prio, prio_bits);

    // We take ownership of the entire IRQ and all settings to it, we only change settings
    // for the IRQ we control.
    // This will also compile-error in case the NVIC changes in size.
    let mut nvic: cortex_m::peripheral::NVIC = core::mem::transmute(());

    nvic.set_priority(interrupt, hw_prio);
}

output:
INFO rtic max: 1, prio bits: 4
INFO max-prio: 1
INFO Priority TIM20_CC: 240
INFO Priority USART 3: 224
INFO Priority USART 4: 208
INFO Priority EXTI4: 224
INFO Priority DMA1_CH2: 224
INFO priority: 240

@Finomnis
Copy link
Contributor

Finomnis commented Oct 22, 2024

Which would be the correct way to determine that?

@Finomnis
Copy link
Contributor

Finomnis commented Oct 22, 2024

Should we do this?

let max_prio = (1 << prio_bits) - 1;

There also seems to be a - 1 missing in the .min(1 << prio_bits).

To be honest, when I refactored the RTIC monotonics, I did not modify the set_monotonic_prio function, that one is copy paste from the previous implementation.

Need some input from @korken89 for this one.

@govert-overgaauw-ampel
Copy link

Probably best to wait for input from @korken89 to check if meant RTIC_ASYNC_MAX_LOGICAL_PRIO for this, if RTIC_ASYNC_MAX_LOGICAL_PRIO is incorrectly set or another variable is needed that holds to the maximum configured priority level.

In the mean-time I hard code it on my system and continue looking into the panic. Because I still see it even with hard-coded priority 0. I can quickly reproduce it by setting the monotonic rate to 4MHz and have a tasks running with a short delay (150 microseconds) and a couple tasks with longer delays (100 milliseconds).

@Finomnis
Copy link
Contributor

Finomnis commented Oct 22, 2024

Are you sure you are not using long-running critical sections? That would cause that as well.

@govert-overgaauw-ampel
Copy link

govert-overgaauw-ampel commented Oct 22, 2024

No running critical sections. The only critical sections that occur are during defmt prints (that i removed from all tasks) and rtic when a task runs and locks a resource at highest priority (i verified that this is not the case by adding assert to rtic), and in the timer queue/monotonic code now() call/updating timer queues. I also checked the disassembly for all cpsid instructions occurrences.

I am pretty confident that there is a race condition somewhere/bug with how the stm32 timer is handled as this issue didn't occur for a long time with debug builds of my project (hence I thought I fixed it with the interrupt priorities). I checked the stm32g4 errata for timer errata but nothing pops up there. I cannot imagine interrupts being blocked for 32ms without the critical sections and maximum priority monotonic interrupt. So the capture compare interrupt has to be missed/not raising interrupt or maybe during handling of the interrupt.

@Finomnis
Copy link
Contributor

I hope we find it ... monotonics are one of the most critical subsystems of an RTOS :D

You will make me very happy if you find it ;) I don't think there's something algorithmically wrong with our approach, we talked about it very extensively during the rewrite and also with the embassy people. So there must be some incorrect detail/typo or whatever. As I wrote all of this code, this will keep nagging at the back of my brain until we find it :D

Can you produce your system details and a minimal reproducible example? I'd love to help you hunt at some point if I find the time. I'd even buy the correct dev board for it.

@Finomnis
Copy link
Contributor

Finomnis commented Oct 22, 2024

We could add two atomic counters to see how often both interrupts happend, so we can see if we actually missed one or if the error is somewhere else. Also, does it always happen with the same interrupt? Is it always the overflow or the half-period interrupt? If it happens, are both interrupt flags set (meaning, they weren't processed fast enough) or is one interrupt triggered twice without the other one in between (meaning, a lost interrupt)? Like, did it not get processed or did it not get fired?

@govert-overgaauw-ampel
Copy link

govert-overgaauw-ampel commented Oct 22, 2024

I'll see if i can cook up a minimal project for the nucleo-g474re board. I used this dev kit to prototype the code half a year ago. I already saw the panic on that board a couple times. Currently I use the stm32g474 mcu running rtic with a couple tasks. One ADC tasks that samples channels and switches external analog muxes and therefore uses a short ~150us delay to switch the muxes and copy adc results, one task that reads from SPI RX memory map for commands changes and updates outputs on the board based on measurements (every 20ms) one task that updates a SPI TX memory map with results (every 100ms).

On the other side of the SPI bus I have a MCU connected to Ethernet, this one also loads the stm32g474 firmware over SPI. So now I just toggle a GPIO in the panic handler of the stm32g474 to signal that it panicked and has to reboot to work around this issue for now.

Regarding the interrupt status:
I have seen both period and half periods being missed:

Code changed to clear flag after panic
10 runs (unused capture compare flags are set CC4IF/CC5IF/CC6IF 0x30010):
CC1IF is cleared before on_monotonic() is called as the interrupt handler in rtic always calls capture compare clear.

1: Monotonic must have missed an half period interrupt 30015
2: Monotonic must have missed an period interrupt 30019
3: Monotonic must have missed an period interrupt 30019
4: Monotonic must have missed an half period interrupt 30015
5: Monotonic must have missed an half period interrupt 30015
6: Monotonic must have missed an half period interrupt 30015
7: Monotonic must have missed an half period interrupt 30015
8: Monotonic must have missed an half period interrupt 30015
9: Monotonic must have missed an period interrupt 30019
10: Monotonic must have missed an half period interrupt 30015

UIF currently unused as it is a different interrupt number on TIM20
CC1IF enable/disable timer irq
CC2IF half period
CC3IF period

one interrupt triggered twice without the other one in between (meaning, a lost interrupt)? Like, did it not get processed or did it not get fired? -> this seems to be the case.

Currently running a test with enable_timer() commented out. Runs stable now for a couple minutes at 4MHz. While before it panicked after couple seconds. I think the issue might be related to the half/period irq being raised directly after or before a short delay interrupt.

@Finomnis
Copy link
Contributor

Finomnis commented Oct 22, 2024

@govert-overgaauw-ampel

This snippet is from my tim20 monotonic implementation

Feel free to add that timer to stm32.rs in rtic-monotonics. I'm not sure who decided which timer is implemented and which isn't; if this one works, we should probably add it.

@Finomnis
Copy link
Contributor

Finomnis commented Oct 22, 2024

enable_timer and disable_timer are simply for energy saving. If those have a race condition with the interrupt firing, just make them both empty. Does that fix your problem?
Of course if you remove enable_timer, you have to enable the interrupt somewhere else (probably in the initialization). It's still very interesting that it no longer breaks then.

stm32 is the only one that implements that one right now. No other backend uses it; all the other ones just keep the on-demand interrupt enabled continuously.

The only reason it is enabled on stm32 is because stm32 overflows so rapidly, I think.

What happens if you completely remove disable_timer?

@Finomnis
Copy link
Contributor

Finomnis commented Oct 22, 2024

@govert-overgaauw-ampel

I might have found it.

$timer.sr().modify(|r| r.set_uif(false)); is the wrong way to clear the timer interrupts. It contains a read-modify-write race condition.

See here for a similar problem in libopenocm.

@Finomnis
Copy link
Contributor

@govert-overgaauw-ampel Please check if #984 fixes your problem.

@govert-overgaauw-ampel
Copy link

govert-overgaauw-ampel commented Oct 23, 2024

Very nice find! I am running it on my board right now and it looks good. I'll deploy it on more boards and let it run for the day and i'll get back to you. The board ran overnight without crashing with the enable_timer commented out. Suggesting that this is definitely the issue.

What is the meaning of:

r.0 = !0;

I am using a different hal and it doesn't allow it:

+            Self::timer().sr.write(|w| {
+                w.cc3if().clear_bit()
+            });

To fully close this issue we should also check the priority level or it might still cause panics for other users depending on their application.

@Finomnis
Copy link
Contributor

Finomnis commented Oct 23, 2024

What is the meaning of:

r.0 = !0;

It means "Set all bits of the register to 1". Then followed by setting a single bit to zero, that we want to clear. This register in stm32 is "write-0-to-clear", so the bug previously was that not all other bits were 1, causing other flags to be cleared accidentally sometimes.

@Finomnis
Copy link
Contributor

@govert-overgaauw-ampel

I am using a different hal

When making statements like this, mentioning a specific hal would be more helpful than "different" :P

@govert-overgaauw-ampel
Copy link

govert-overgaauw-ampel commented Oct 23, 2024

Ah now I fully understand why it occurs. It writes an accidental 0 to interrupts just triggered. And removing the enable_timer call removes the nondeterministic interrupts.

The monotonic implementation I made is out of source tree of RTIC in my own project. Due to our hardware design we needed 21 PWM outputs and a lot of ADCs, and this only allowed for TIM20 to be used for monotonic. It uses 3 capture compare units instead of the UIF due to UIF being a different interrupt on the STM32G4 than CC irq, which would cause it call two interrupt handlers. It uses the https://github.com/stm32-rs/stm32g4xx-hal . So I'll change my snippet to always write all bits to 1 except the one I want to clear. So submitting this snippet wouldn't really map nicely with the other stm32 monotonic implementations.

@Finomnis
Copy link
Contributor

Finomnis commented Oct 23, 2024

Got it :) well, let us know if you encounter further panics or if it is stable now.

@burrbull
Copy link
Contributor

burrbull commented Oct 23, 2024

I am using a different hal and it doesn't allow it:

Self::timer().sr.write(|w| {
    unsafe { w.bits(!0); }
    w.cc3if().clear_bit()
});

P.S. If you apply similar to stm32-rs/stm32-rs@ef41db1#diff-fd266a65971a9cac85b29fa61d899346f8ad8a04f6142eae833878f1ddd30488 patch for G4 timer it will ignore unchanged bits internally.

@Finomnis
Copy link
Contributor

P.S. If you apply similar to stm32-rs/stm32-rs@ef41db1#diff-fd266a65971a9cac85b29fa61d899346f8ad8a04f6142eae833878f1ddd30488 patch for G4 timer it will ignore unchanged bits internally.

Does that initialize all bits to 1 automatically?

@Finomnis
Copy link
Contributor

@govert-overgaauw-ampel no more panics?

@korken89
Copy link
Collaborator

Very nice find!

Just to clarify RTIC_ASYNC_MAX_LOGICAL_PRIO allows drivers to reduce the priority inversion needed, by setting driver priorities too one higher than this.
Monotonics however are set to maximum unless other is stated, this as 16 bit timers generally need very fast servicing. Hope that helps to clarify it!

@Finomnis
Copy link
Contributor

Very nice find!

Just to clarify RTIC_ASYNC_MAX_LOGICAL_PRIO allows drivers to reduce the priority inversion needed, by setting driver priorities too one higher than this. Monotonics however are set to maximum unless other is stated, this as 16 bit timers generally need very fast servicing. Hope that helps to clarify it!

Does that mean the current implementation of priorities is as intended or not?

@korken89
Copy link
Collaborator

Currently RTIC_ASYNC_MAX_LOGICAL_PRIO has the design that you don't want the async parts to override hardware tasks, they are generally the most important thing in a system.
With that as the basis it looks correct to me.

For RTIC_ASYNC_MAX_LOGICAL_PRIO = 1 there must exist a hardware task with priority 1 or 2.
Or am I misunderstanding the question?

I gave a look an we do not do a good job of documenting this relationship though, it should be improved.

@burrbull
Copy link
Contributor

Does that initialize all bits to 1 automatically?

All W0C bits, yes.

@Finomnis
Copy link
Contributor

Finomnis commented Oct 23, 2024

Would like to still hear from @govert-overgaauw-ampel whether this fixed his problem, though.

@korken89 Let's keep it open until we have confirmation.

@korken89 korken89 reopened this Oct 23, 2024
@korken89
Copy link
Collaborator

I'll wait with merging #989 until confirmed then.

@Finomnis
Copy link
Contributor

I'd love to hear from @andresv as well, but I kind of don't expect that I will :)

@Finomnis
Copy link
Contributor

@korken89 I'm fine with releasing 2.0.3, I just want to keep this issue open until we have confirmation that there isn't another bug :)

@korken89
Copy link
Collaborator

Alright, then I'll start the release procedure!

@Finomnis
Copy link
Contributor

@govert-overgaauw-ampel See if rtic-monotonics 2.0.3 runs stable for you now.

@andresv
Copy link
Contributor Author

andresv commented Oct 24, 2024

@Finomnis I was in the hurry to ship so I just switched to embassy for this particular project. Unfortunately I cannot verify if it now also works for me.

@Finomnis
Copy link
Contributor

Finomnis commented Oct 24, 2024

@Finomnis I was in the hurry to ship so I just switched to embassy for this particular project. Unfortunately I cannot verify if it now also works for me.

No problem. Thanks for the answer!

Also sorry it took so long to find, that was a very subtle bug...

@govert-overgaauw-ampel
Copy link

I ran two boards the whole night without panicking with timer set to 4MHz. It would panic in seconds before the fix. I can confirm that this fixes the issue. I only tested with out of source tree Tim20 monotonic. The code is the same so the official RTIC monotonics should be fixed as well.

+            Self::timer().sr.write(|w| unsafe {
+                w.bits(!0).cc3if().clear_bit()
+            });

However I still have the priority hardcoded to max due to RTIC_ASYNC_MAX_LOGICAL_PRIO being 1. I have tasks of priority 2 and 3 and those are configured in the NVIC correctly by RTIC. So not sure what is going on there. It could be wise to create a seperate issue for that and mark this one as fixed as the race condition is definitely the root cause of the panic.

@andresv andresv closed this as completed Oct 24, 2024
@andresv
Copy link
Contributor Author

andresv commented Oct 25, 2024

@Finomnis kudos for figuring this out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants