Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flush can raise Sys_error when used in parallel #13586

Open
jmid opened this issue Oct 31, 2024 · 0 comments
Open

flush can raise Sys_error when used in parallel #13586

jmid opened this issue Oct 31, 2024 · 0 comments
Labels

Comments

@jmid
Copy link
Contributor

jmid commented Oct 31, 2024

We've encountered that Stdlib/Out_channel.flush may raise a Sys_error exception when used in parallel with a close.
Consider this reproducer program:

let path = Filename.temp_file "stm-" ""

let test () =
  let channel = Atomic.make (Out_channel.open_text path) in

  let wait = Atomic.make true in

  let d1 = Domain.spawn (fun () ->
      while Atomic.get wait do Domain.cpu_relax() done;
      Out_channel.close (Atomic.get channel)) in

  let d2 = Domain.spawn (fun () ->
      while Atomic.get wait do Domain.cpu_relax() done;
      (try
        Out_channel.flush (Atomic.get channel)
       with (Sys_error msg) -> Printf.printf "Out_channel.flush raised Sys_error %S\n%!" msg; assert false)) in

  (* Output a sufficiently long string *)
  Out_channel.output_string (Atomic.get channel) (String.make (65531 (*65536*)) 'a');
  (* Let the races begin! *)
  Atomic.set wait false;
  let () = Domain.join d1 in
  let () = Domain.join d2 in
  (* Please leave the torture chamber nice and clean as you found it *)
  (try Out_channel.close (Atomic.get channel) with Sys_error _ -> ());
  Sys.remove path

let _ =
  for i = 1 to 50_000 do
    if i mod 250 = 0 then Printf.printf "#%!";
    test ()
  done

with this behaviour:

$ ocamlopt -g flushexc.ml
$ ./a.out
##############################################################################################################Out_channel.flush raised Sys_error "Bad file descriptor"
Fatal error: exception File "flushexc.ml", line 21, characters 94-100: Assertion failed

The documentation does not warn of unsynchronized access to Out_channel operations, as they are generally protected by internal channel locks.

The documentation for close mentions that flush should not raise Sys_error when called on a closed channel, but does
strictly speaking not specify the behaviour of flush in parallel with close:

val close : t -> unit
(** Close the given channel, flushing all buffered write operations.  Output
    functions raise a [Sys_error] exception when they are applied to a closed
    output channel, except {!close} and {!flush}, which do nothing when applied
    to an already closed channel.  Note that {!close} may raise [Sys_error] if
    the operating system signals an error when flushing or closing. *)

The problem is that flush on a still-open channel ends up in caml_flush_partial which calls check_pending that may
temporarily unlock the channel to process pending actions. This creates a small window for caml_ml_close_channel to lock and close the
underlying file descriptor, and offset channel->curr by 1 from channel->buff into dummy_buff. As a result, when caml_flush_partial resumes, it will attempt to output 1 character, fail, and raise an exception.

Before jumping into discussions of fixes, we may want to discuss how to proceed.
Here's a few suggestions:

  • Do nothing, as this is misusing the Stdlib and/or invoking unspecified behaviour
  • Update the documentation to warn of this behaviour
  • Patch the code to prevent the exception. This option is delicate as caml_flush_partial
    is simultaniously responsible for raising an exception for its other callers. After a bit
    of experimentation, my best bet at this would be to wrap a handler around flush on the
    OCaml-side in stdlib.ml, which shouldn't be a performance bottleneck.
  • ...

Note: the above example program can easily be modified to race on two closes also resulting
in a Sys_error. The last sentence of the documentation could account for that behaviour
though.

Thanks to @ncik-roberts for figuring out the above explanation.

CC to @damiendoligez who last had his fingers in these parts in #12678

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants