Fix timing of new replication test #8807

oranagra · 2021-04-18T09:16:06Z

In github actions CI with valgrind, i saw that even the fast replica
(one that wasn't paused), didn't get to complete the replication fast
enough, and ended up getting disconnected by timeout.

Additionally, due to a typo in uname, we didn't get to actually run the
CPU efficiency part of the test.

In CI valgrind run, i saw that even the fast replica (one that wasn't paused), didn't get to complete the replication fast enough, and ended up getting disconnected by timeout. Additionally, due to a typo in uname, we didn't get to actually run the cpu efficiency part of the test.

oranagra · 2021-04-18T09:25:59Z

for reference the failed run:
https://pipelines.actions.githubusercontent.com/2oDd4EuUudJqGKlOAB2KXZpKHTtseqbUa63unZUQGqUSgXNthI/_apis/pipelines/1/runs/4371/signedlogcontent/23?urlExpires=2021-04-18T05%3A31%3A32.7556878Z&urlSigningMethod=HMACV1&urlSignature=i%2BulXXy%2BXM5lgRXaEnZvale6OCI7kReK7KGsfHp5HSM%3D

2021-04-18T01:46:55.2633158Z ### Starting test diskless timeout replicas drop during rdb pipe in tests/integration/replication.tcl
2021-04-18T01:46:55.2633979Z 32555:M 18 Apr 2021 01:45:21.575 - Accepted 127.0.0.1:38337
2021-04-18T01:46:55.2634551Z 32555:M 18 Apr 2021 01:45:21.614 - Accepted 127.0.0.1:34479
2021-04-18T01:46:55.2635098Z 32555:M 18 Apr 2021 01:45:21.621 * Replica 127.0.0.1:21212 asks for synchronization
2021-04-18T01:46:55.2636945Z 32555:M 18 Apr 2021 01:45:21.621 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'f5d03775ec0a5dae4feed97459b8882a75fae056', my replication IDs are '645f3e27d32d58bf9e92faf5ee9f4c6dd1200cba' and '0000000000000000000000000000000000000000')
2021-04-18T01:46:55.2638268Z 32555:M 18 Apr 2021 01:45:21.622 * Delay next BGSAVE for diskless SYNC
2021-04-18T01:46:55.2639038Z 32555:M 18 Apr 2021 01:45:21.664 * Replica 127.0.0.1:21213 asks for synchronization
2021-04-18T01:46:55.2640957Z 32555:M 18 Apr 2021 01:45:21.664 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'fd000cca14f7d191dd4855d57587e928291a0000', my replication IDs are '645f3e27d32d58bf9e92faf5ee9f4c6dd1200cba' and '0000000000000000000000000000000000000000')
2021-04-18T01:46:55.2642421Z 32555:M 18 Apr 2021 01:45:21.664 * Delay next BGSAVE for diskless SYNC
2021-04-18T01:46:55.2643243Z 32555:M 18 Apr 2021 01:45:21.862 - DB 9: 20004 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2643907Z 32555:M 18 Apr 2021 01:45:22.871 * Starting BGSAVE for SYNC with target: replicas sockets
2021-04-18T01:46:55.2644622Z 32555:M 18 Apr 2021 01:45:22.880 * Background RDB transfer started by pid 44177
2021-04-18T01:46:55.2645461Z 44177:C 18 Apr 2021 01:45:22.944 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2646168Z 32555:M 18 Apr 2021 01:45:25.911 # Disconnecting timedout replica (full sync): 127.0.0.1:21212
2021-04-18T01:46:55.2646831Z 32555:M 18 Apr 2021 01:45:25.912 # Connection with replica 127.0.0.1:21212 lost.
2021-04-18T01:46:55.2647643Z 44177:C 18 Apr 2021 01:45:26.207 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2648439Z 32555:M 18 Apr 2021 01:45:26.924 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2649244Z 44177:C 18 Apr 2021 01:45:28.624 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2650065Z 44177:C 18 Apr 2021 01:45:29.880 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2650989Z 32555:M 18 Apr 2021 01:45:31.968 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2651779Z 44177:C 18 Apr 2021 01:45:32.241 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2652411Z 44177:C 18 Apr 2021 01:45:34.578 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2653021Z 44177:C 18 Apr 2021 01:45:35.738 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2653634Z 32555:M 18 Apr 2021 01:45:37.028 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2654225Z 44177:C 18 Apr 2021 01:45:37.264 * RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2654794Z 32555:M 18 Apr 2021 01:45:37.264 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
2021-04-18T01:46:55.2655390Z 32555:M 18 Apr 2021 01:45:37.938 * Background RDB transfer terminated with success
2021-04-18T01:46:55.2656189Z 32555:M 18 Apr 2021 01:45:37.938 * Streamed RDB transfer with replica 127.0.0.1:21213 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
2021-04-18T01:46:55.2656949Z 32555:M 18 Apr 2021 01:45:38.063 - Accepted 127.0.0.1:36847
2021-04-18T01:46:55.2657608Z 32555:M 18 Apr 2021 01:45:38.093 * Replica 127.0.0.1:21212 asks for synchronization
2021-04-18T01:46:55.2658164Z 32555:M 18 Apr 2021 01:45:38.094 * Full resync requested by replica 127.0.0.1:21212
2021-04-18T01:46:55.2658716Z 32555:M 18 Apr 2021 01:45:38.094 * Delay next BGSAVE for diskless SYNC
2021-04-18T01:46:55.2659302Z 32555:M 18 Apr 2021 01:45:39.049 * Starting BGSAVE for SYNC with target: replicas sockets
2021-04-18T01:46:55.2659929Z 32555:M 18 Apr 2021 01:45:39.055 * Background RDB transfer started by pid 44198
2021-04-18T01:46:55.2660662Z 44198:C 18 Apr 2021 01:45:39.124 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2661301Z 32555:M 18 Apr 2021 01:45:40.068 # Disconnecting timedout replica (streaming sync): 127.0.0.1:21213
2021-04-18T01:46:55.2661969Z 32555:M 18 Apr 2021 01:45:40.069 # Connection with replica 127.0.0.1:21213 lost.
2021-04-18T01:46:55.2662865Z 44198:C 18 Apr 2021 01:45:41.355 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2663669Z 32555:M 18 Apr 2021 01:45:42.087 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2664471Z 44198:C 18 Apr 2021 01:45:43.980 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2665283Z 44198:C 18 Apr 2021 01:45:46.741 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2666092Z 32555:M 18 Apr 2021 01:45:47.147 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2666880Z 44198:C 18 Apr 2021 01:45:49.591 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2667707Z 44198:C 18 Apr 2021 01:45:52.192 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2668497Z 32555:M 18 Apr 2021 01:45:52.230 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2669297Z 44198:C 18 Apr 2021 01:45:54.402 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2670113Z 44198:C 18 Apr 2021 01:45:56.510 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2670921Z 32555:M 18 Apr 2021 01:45:57.291 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2671703Z 44198:C 18 Apr 2021 01:45:59.097 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2672533Z 44198:C 18 Apr 2021 01:46:00.547 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2673333Z 32555:M 18 Apr 2021 01:46:02.343 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2674111Z 44198:C 18 Apr 2021 01:46:03.776 - RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2674936Z 44198:C 18 Apr 2021 01:46:04.592 * RDB: 0 MB of memory used by copy-on-write
2021-04-18T01:46:55.2675675Z 32555:M 18 Apr 2021 01:46:04.592 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
2021-04-18T01:46:55.2676669Z 32555:M 18 Apr 2021 01:46:05.674 * Background RDB transfer terminated with success
2021-04-18T01:46:55.2677653Z 32555:M 18 Apr 2021 01:46:05.674 * Streamed RDB transfer with replica 127.0.0.1:21212 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
2021-04-18T01:46:55.2678708Z 32555:M 18 Apr 2021 01:46:06.489 * Synchronization with replica 127.0.0.1:21212 succeeded
2021-04-18T01:46:55.2679483Z 32555:M 18 Apr 2021 01:46:07.387 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2680148Z 32555:M 18 Apr 2021 01:46:12.424 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2680830Z 32555:M 18 Apr 2021 01:46:17.465 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2681495Z 32555:M 18 Apr 2021 01:46:22.503 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2682174Z 32555:M 18 Apr 2021 01:46:27.541 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2682773Z 32555:M 18 Apr 2021 01:46:27.541 # Disconnecting timedout replica (streaming sync): 127.0.0.1:21212
2021-04-18T01:46:55.2683471Z 32555:M 18 Apr 2021 01:46:27.541 # Connection with replica 127.0.0.1:21212 lost.
2021-04-18T01:46:55.2684166Z 32555:M 18 Apr 2021 01:46:32.575 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2684849Z 32555:M 18 Apr 2021 01:46:37.608 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2685513Z 32555:M 18 Apr 2021 01:46:42.643 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2686222Z 32555:M 18 Apr 2021 01:46:47.678 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
2021-04-18T01:46:55.2686884Z 32555:M 18 Apr 2021 01:46:52.713 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.

oranagra · 2021-04-18T09:46:46Z

p.s. i run the new code with valgrind in github actions twice, and it passed.

In github actions CI with valgrind, i saw that even the fast replica (one that wasn't paused), didn't get to complete the replication fast enough, and ended up getting disconnected by timeout. Additionally, due to a typo in uname, we didn't get to actually run the CPU efficiency part of the test.

enjoy-binbin · 2021-11-05T03:01:29Z

Seems to be the same problem? Both timed out together

https://github.com/redis/redis/runs/4111888647?check_suite_focus=true#step:5:3559

oranagra · 2021-11-11T07:08:00Z

and again here (both on MacOS):
https://github.com/redis/redis/runs/4133339823?check_suite_focus=true

*** [err]: diskless timeout replicas drop during rdb pipe in tests/integration/replication.tcl
log message of '"*Diskless rdb transfer, done reading from pipe, 1 replicas still up*"' not found in ./tests/tmp/server.3909.77/stdout after line: 115 till line: 138

and indeed i see that both replicas get disconnected by timeout, so i suppose we may wanna further increase the timeout (from 2 to 3 seconds)?
however, it's very strange that it happens.
in the past it was valgrind, who may be super slow, but this is just MacOS (not freebsd VM or valgrind), and the timeout killing is only happening if we have data to write to the socket, and the socket is not writable, so it makes no sense to me.

enjoy-binbin · 2021-11-13T02:26:24Z

yes also makes no sense to me. I did a another look at the failure logs, and didn't find anything.
increasing the timeout may be able to solve the problem, but it feels not a good choice

I noticed a strange phenomenon

# https://github.com/redis/redis/runs/4111888647?check_suite_focus=true#step:5:3572
26522:M 05 Nov 2021 00:18:21.808 # Disconnecting timedout replica (full sync): 127.0.0.1:24656
26522:M 05 Nov 2021 00:18:25.716 # Connection with replica 127.0.0.1:24656 lost.
26522:M 05 Nov 2021 00:18:28.721 # Disconnecting timedout replica (full sync): 127.0.0.1:24655
26522:M 05 Nov 2021 00:18:28.721 # Connection with replica 127.0.0.1:24655 lost.

somehow the log interval here is four seconds away

26522:M 05 Nov 2021 00:18:21.808 # Disconnecting timedout replica (full sync): 127.0.0.1:24656
26522:M 05 Nov 2021 00:18:25.716 # Connection with replica 127.0.0.1:24656 lost.

oranagra · 2021-11-14T10:08:17Z

This is very odd. this explains why the other replica doesn't get it's data on time (we didn't re-enable the read event on the pipe in rdbPipeWriteHandlerConnRemoved

but i really don't understand how these two prints can be spaced so far apart, the next line after printing Disconnecting timedout replica (full sync) does freeClient which prints the other one, and the only thing that can interrupt this flow is CLIENT_PROTECTED which shouldn't be used here AFAICT.

enjoy-binbin · 2021-12-01T02:50:39Z

and again (MacOS daily), is the same as ther previous one: #8807 (comment)
https://github.com/redis/redis/runs/4374931532?check_suite_focus=true

oranagra · 2021-12-01T06:17:44Z

this time we didn't have a 4 seconds lag between Disconnecting timedout and Connection with replica ... lost

29153:M 01 Dec 2021 00:19:22.826 # Disconnecting timedout replica (full sync): 127.0.0.1:24655
29153:M 01 Dec 2021 00:19:22.826 # Connection with replica 127.0.0.1:24655 lost.
29153:M 01 Dec 2021 00:19:29.637 - DB 9: 20005 keys (0 volatile) in 32768 slots HT.
29153:M 01 Dec 2021 00:19:29.646 # Disconnecting timedout replica (full sync): 127.0.0.1:24656
29153:M 01 Dec 2021 00:19:29.646 # Connection with replica 127.0.0.1:24656 lost.

oranagra added the state:to-be-merged The PR should be merged soon, even if not yet ready, this is used so that it won't be forgotten label Apr 18, 2021

oranagra requested review from yossigo and guybe7 April 18, 2021 09:16

yossigo approved these changes Apr 18, 2021

View reviewed changes

oranagra merged commit a9897b0 into redis:unstable Apr 18, 2021

oranagra deleted the repl_test_timing branch November 9, 2021 14:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix timing of new replication test #8807

Fix timing of new replication test #8807

oranagra commented Apr 18, 2021 •

edited

Loading

oranagra commented Apr 18, 2021

oranagra commented Apr 18, 2021

enjoy-binbin commented Nov 5, 2021

oranagra commented Nov 11, 2021

enjoy-binbin commented Nov 13, 2021

oranagra commented Nov 14, 2021

enjoy-binbin commented Dec 1, 2021

oranagra commented Dec 1, 2021

Fix timing of new replication test #8807

Fix timing of new replication test #8807

Conversation

oranagra commented Apr 18, 2021 • edited Loading

oranagra commented Apr 18, 2021

oranagra commented Apr 18, 2021

enjoy-binbin commented Nov 5, 2021

oranagra commented Nov 11, 2021

enjoy-binbin commented Nov 13, 2021

oranagra commented Nov 14, 2021

enjoy-binbin commented Dec 1, 2021

oranagra commented Dec 1, 2021

oranagra commented Apr 18, 2021 •

edited

Loading