Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(orders): fix cancel order race condition using time-based cache #2232

Merged
merged 26 commits into from
Oct 4, 2024

Conversation

shamardy
Copy link
Collaborator

@shamardy shamardy commented Sep 28, 2024

This PR addresses the issue where cancelled orders can sometimes remain in the orderbook due to P2P message caching and out-of-order message delivery.

Changes:

  • Added a time-based cache to orderbook to track recently cancelled orders
  • Modified order processing logic to check the cancellation cache before processing create maker order messages.
  • Set cache expiration time to 2 minutes, exceeding the 1-minute gossipsub cache to ensure coverage of delayed messages.
  • Added maker_order_cancelled_p2p_notify to a location where it was previously missing, this needs careful review. Old PR that added the maker order deletion without notifying Fix orders "leak" due to race condition. #1178 and the reason for the old PR order leakage #1173 (comment)

The solution works as follows:

  1. When an order is cancelled, its UUID is added to the cancellation cache with a 2 minutes expiration time.
  2. Incoming create maker order messages are checked against the cache before processing.
  3. If a matching UUID is found in the cache and the pubkey is the maker pubkey, the create maker order message is ignored.
  4. Cache entries expire after 2 minutes to reduce memory usage.

This approach provides a simple fix for the immediate issue while avoiding major changes to the P2P layer or gossipsub protocol. It prevents cancelled orders from reappearing in the orderbook due to delayed or out-of-order message delivery.

Next steps:

  • Monitor for any new issues related to P2P message ordering.
  • Consider more comprehensive solutions for ensuring message ordering if needed in the future.

potentially fixes #1726

@shamardy shamardy changed the title fix(orders): Fix order cancellation race condition using time-based cache fix(orders): fix order cancellation race condition using time-based cache Sep 28, 2024
@shamardy shamardy changed the title fix(orders): fix order cancellation race condition using time-based cache fix(orders): fix cancel order race condition using time-based cache Sep 28, 2024
Copy link
Collaborator

@mariocynicys mariocynicys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

took a quick look and tried the fix on the electrum connections branch (since it exemplifies the issue 1000x for some reason).

I fear though that nodes don't receive messages at the same time and we have some node getting the message AFTER (because some other node received the cancellation late enough that it's still in its gossibpub cache) the cancellation cache timed out and it would consider it a new order creation.

can't we enforce in-order-delivery somehow on p2p layer? or at leas t only causal delivery.

mm2src/common/expirable_map.rs Show resolved Hide resolved
mm2src/mm2_main/tests/mm2_tests/mm2_tests_inner.rs Outdated Show resolved Hide resolved
mm2src/mm2_main/src/lp_ordermatch.rs Outdated Show resolved Hide resolved
mm2src/mm2_main/src/lp_ordermatch.rs Outdated Show resolved Hide resolved
mm2src/common/expirable_map.rs Show resolved Hide resolved
@shamardy
Copy link
Collaborator Author

shamardy commented Oct 1, 2024

Received this message from @cipig

i run 2 nodes with this PR, the last commits are missing. I noticed that those 2 nodes showed this

30 22:00:01, mm2_p2p::behaviours::atomicdex:324] ERROR notify_on_adex_event error send failed because channel is full
30 22:00:01, mm2_p2p::behaviours::atomicdex:324] ERROR notify_on_adex_event error send failed because channel is full
30 22:00:01, mm2_p2p::behaviours::atomicdex:324] ERROR notify_on_adex_event error send failed because channel is full
30 22:00:05, libp2p_gossipsub::behaviour:3415] INFO HEARTBEAT: relays low. Contains: 0 needs: 2

at the same time, while the other nodes didn't
question is if the changes increase the probability of this to happen

Will look into it after fixing the review comments

@shamardy
Copy link
Collaborator Author

shamardy commented Oct 1, 2024

About this #2232 (comment) , only reason for delay is holding the orderbook lock for some time, I can move the new recently cancelled map to OrdermatchContext instead with its own lock to reduce the delay.

@onur-ozkan
Copy link
Member

Received this message from @cipig

i run 2 nodes with this PR, the last commits are missing. I noticed that those 2 nodes showed this

30 22:00:01, mm2_p2p::behaviours::atomicdex:324] ERROR notify_on_adex_event error send failed because channel is full
30 22:00:01, mm2_p2p::behaviours::atomicdex:324] ERROR notify_on_adex_event error send failed because channel is full
30 22:00:01, mm2_p2p::behaviours::atomicdex:324] ERROR notify_on_adex_event error send failed because channel is full
30 22:00:05, libp2p_gossipsub::behaviour:3415] INFO HEARTBEAT: relays low. Contains: 0 needs: 2

at the same time, while the other nodes didn't
question is if the changes increase the probability of this to happen

Will look into it after fixing the review comments

How is this PR related with those logs 🤔

@shamardy
Copy link
Collaborator Author

shamardy commented Oct 1, 2024

About this #2232 (comment) , only reason for delay is holding the orderbook lock for some time, I can move the new recently cancelled map to OrdermatchContext instead with its own lock to reduce the delay.

How is this PR related with those logs 🤔

I can see that process_msg is async, so you are right, it shouldn't affect messages buffer. But I should add this optimization nonetheless.

Edit: the log ERROR notify_on_adex_event error send failed because channel is full is related to the tx part of the channel, so it's not related to process_msg or this PR at all. Sorry for the confusion.

@shamardy
Copy link
Collaborator Author

shamardy commented Oct 1, 2024

I can move the new recently cancelled map to OrdermatchContext instead with its own lock to reduce the delay.

Done here 14f7f34

@shamardy
Copy link
Collaborator Author

shamardy commented Oct 1, 2024

@onur-ozkan @mariocynicys all issues raised in review comments should have been fixed. Please give this another look when you have the time.

Copy link
Member

@onur-ozkan onur-ozkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last notes from my side:

mm2src/mm2_main/src/lp_ordermatch.rs Outdated Show resolved Hide resolved
mm2src/mm2_main/tests/mm2_tests/mm2_tests_inner.rs Outdated Show resolved Hide resolved
mm2src/mm2_main/tests/mm2_tests/mm2_tests_inner.rs Outdated Show resolved Hide resolved
mm2src/mm2_main/src/lp_ordermatch.rs Outdated Show resolved Hide resolved
mm2src/mm2_main/src/lp_ordermatch.rs Outdated Show resolved Hide resolved
mm2src/common/expirable_map.rs Show resolved Hide resolved
@shamardy
Copy link
Collaborator Author

shamardy commented Oct 2, 2024

only reason for delay is holding the orderbook lock for some time, I can move the new recently cancelled map to OrdermatchContext instead with its own lock to reduce the delay.

Sorry about more confusion I am about to create @mariocynicys @onur-ozkan but I will move recently cancelled map back to orderbook struct and handle it in both insert_or_update_order_update_trie and delete_order where it's already locked. I was reluctant to do that at first since recently cancelled will be checked when order is being added by means other than maker order created message but once cancel order message is broadcasted an order shouldn't be added by any means whether through creation message or pubkey sync.

@shamardy
Copy link
Collaborator Author

shamardy commented Oct 2, 2024

Reverted the moving of recently cancelled to it's own lock here e0e5d41 and fixed this #2232 (comment) here cdfd809.

@shamardy
Copy link
Collaborator Author

shamardy commented Oct 2, 2024

Now test_cancel_all_orders fails after the latest additions 😩

@shamardy
Copy link
Collaborator Author

shamardy commented Oct 2, 2024

Now test_cancel_all_orders fails after the latest additions

I understand why, will fix it. The below claim

// We only do this if the order is not in the order_set, because if it is, it means that the order was already
// created and messages did arrive in the correct order
is not right as we can receive it through orderbook call and receive order creation message later.

onur-ozkan
onur-ozkan previously approved these changes Oct 3, 2024
Copy link
Member

@onur-ozkan onur-ozkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final notes from my side, LGTM other than these notes:

mm2src/mm2_main/src/lp_ordermatch.rs Outdated Show resolved Hide resolved
mm2src/mm2_main/src/lp_ordermatch.rs Outdated Show resolved Hide resolved
mm2src/mm2_main/src/lp_ordermatch.rs Show resolved Hide resolved
Copy link
Collaborator

@mariocynicys mariocynicys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last points, LGTM otherwise :)

mm2src/mm2_main/tests/mm2_tests/orderbook_sync_tests.rs Outdated Show resolved Hide resolved
mm2src/mm2_main/tests/mm2_tests/orderbook_sync_tests.rs Outdated Show resolved Hide resolved
mariocynicys
mariocynicys previously approved these changes Oct 4, 2024
Copy link
Collaborator

@mariocynicys mariocynicys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

mm2src/mm2_core/Cargo.toml Outdated Show resolved Hide resolved
@shamardy shamardy merged commit ce18376 into dev Oct 4, 2024
22 of 23 checks passed
@shamardy shamardy deleted the fix-p2p-cancellation branch October 4, 2024 11:21
dimxy added a commit that referenced this pull request Oct 4, 2024
* dev:
  fix(orders): fix cancel order race condition using time-based cache (#2232)
  fix(legacy-swap): taker failed spend maker payment marked as failed (#2199)
  chore(adex-cli): deprecate adex-cli (#2234)
  feat(new-RPC): connection healthcheck implementation for peers  (#2194)
  fix(proxy-signature): add message lifetime overflows (#2233)
  feat(CI): handle remote files in a safer way (#2217)
  chore(doc): update issue address in README (#2227)
  fix(merge): remove duplicated db_root function (#2229)
  feat(wallets): add `get_wallet_names` rpc (#2202)
  chore(tests): don't use `.wait()` and use `block_on` instead (#2220)
  fix(native-rpc): remove escaped response body (#2219)
  fix(clippy): fix coins mod clippy warnings in wasm (#2224)
  feat(core): handling CTRL-C signal with graceful shutdown (#2213)
  docs(README): fix typos (#2212)
  remove the non-sense arguments (#2216)
  fix(db): stop creating the all-zeroes dir on KDF start (#2218)
dimxy added a commit that referenced this pull request Oct 4, 2024
* dev:
  fix(orders): fix cancel order race condition using time-based cache (#2232)
  fix(legacy-swap): taker failed spend maker payment marked as failed (#2199)
  chore(adex-cli): deprecate adex-cli (#2234)
  feat(new-RPC): connection healthcheck implementation for peers  (#2194)
  fix(proxy-signature): add message lifetime overflows (#2233)
  feat(CI): handle remote files in a safer way (#2217)
  chore(doc): update issue address in README (#2227)
  fix(merge): remove duplicated db_root function (#2229)
  feat(wallets): add `get_wallet_names` rpc (#2202)
  chore(tests): don't use `.wait()` and use `block_on` instead (#2220)
  fix(native-rpc): remove escaped response body (#2219)
  fix(clippy): fix coins mod clippy warnings in wasm (#2224)
  feat(core): handling CTRL-C signal with graceful shutdown (#2213)
  docs(README): fix typos (#2212)
  remove the non-sense arguments (#2216)
  fix(db): stop creating the all-zeroes dir on KDF start (#2218)
dimxy added a commit that referenced this pull request Oct 17, 2024
* dev:
  fix(cosmos): fix tx broadcasting error (#2238)
  chore(solana): remove solana implementation (#2239)
  chore(cli): remove leftover subcommands from help message (#2235)
  fix(orders): fix cancel order race condition using time-based cache (#2232)
  fix(legacy-swap): taker failed spend maker payment marked as failed (#2199)
  chore(adex-cli): deprecate adex-cli (#2234)
  feat(new-RPC): connection healthcheck implementation for peers  (#2194)
  fix(proxy-signature): add message lifetime overflows (#2233)
  feat(CI): handle remote files in a safer way (#2217)
  chore(doc): update issue address in README (#2227)
  fix(merge): remove duplicated db_root function (#2229)
  feat(wallets): add `get_wallet_names` rpc (#2202)
  chore(tests): don't use `.wait()` and use `block_on` instead (#2220)
  fix(native-rpc): remove escaped response body (#2219)
  fix(clippy): fix coins mod clippy warnings in wasm (#2224)
  feat(core): handling CTRL-C signal with graceful shutdown (#2213)
  docs(README): fix typos (#2212)
  remove the non-sense arguments (#2216)
  fix(db): stop creating the all-zeroes dir on KDF start (#2218)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants