Fixes to fix for FFI & GC interaction in tcp_connection.pony and file.pony #2526
Description
While I was working on a Wallaroo data corruption problem, chats with @dipinhora revealed its cause and how to fix it. The Wallaroo code at https://github.com/WallarooLabs/wallaroo/blob/master/lib/wallaroo/core/sink/tcp_sink/tcp_sink.pony#L105, and elsewhere in that file, is closely related to code in packages/net/tcp_connection.pony and packages/files/file.pony.
The _pending array in packages/net/tcp_connection.pony appears to be write-only: the only calls made on it are .push(), .shift() (ignoring its return value), and .clear(). However, that data structure is necessary to prevent a bad interaction between the Pony FFI and GC systems: without it, data written to the TCP socket can be corrupted immediately before the writev(2) system call executes. As I learned very recently, the race condition does indeed appear if I remove the seemingly write-only _pending data structure.
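To illustrate why a never-read array can still matter, here is a minimal Pony sketch of the keep-alive pattern. It is not the code in tcp_connection.pony. The field and method names (_pending_writev, queue, sent) are hypothetical. It assumes that buffer addresses are handed to writev(2) as plain integers that the garbage collector does not trace, so the only thing keeping each buffer alive until the call finishes is the reference stored in _pending. The writev(2) call itself is left out.

```pony
actor Main
  // Hypothetical field: address/length pairs in the order writev(2) expects.
  // These are plain integers, so (by assumption) the GC does not treat them
  // as references to the buffers.
  let _pending_writev: Array[USize] = Array[USize]

  // Owning references to the same buffers. Never read back; they exist only
  // so the buffers stay reachable while their addresses are in use.
  let _pending: Array[ByteSeq] = Array[ByteSeq]

  new create(env: Env) =>
    queue("hello, ")
    queue("world\n")
    env.out.print("buffers pinned: " + _pending.size().string())
    // A real implementation would call writev(2) via the FFI here, passing
    // _pending_writev.cpointer(), and then release what was written.
    sent(2)
    env.out.print("buffers pinned after send: " + _pending.size().string())

  fun ref queue(data: ByteSeq) =>
    // Record the raw address and length for the system call ...
    _pending_writev.>push(data.cpointer().usize()).>push(data.size())
    // ... and keep the owner, so the address stays valid until sent().
    _pending.push(data)

  fun ref sent(count: USize) =>
    // Drop bookkeeping for `count` fully written buffers. The owner is
    // released only here, after the write has completed.
    var i: USize = 0
    while i < count do
      try
        _pending_writev.shift()?
        _pending_writev.shift()?
        _pending.shift()?
      end
      i = i + 1
    end
```

Deleting the _pending field from this sketch would leave the program compiling and behaving the same in casual testing, which is why such a field is easy to mistake for dead code.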
Wallaroo's commit WallarooLabs/wally@75f1c39 has a work-around for the data corruption problem. Dipin also put this bugfix into the tcp_connection.pony that is on this repo's master branch today.
I recommend:
- Adding a warning to tcp_connection.pony about _pending's purpose.
- Adding the _pending fix to packages/files/file.pony. Its code uses a buffering scheme very similar to tcp_connection.pony's, and therefore it is very likely vulnerable to the same data-corrupting race condition.