Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time: time.Sleep and time.NewTimer: Fix duration parameter error #17696

Open
johnrs opened this issue Oct 31, 2016 · 89 comments
Open

time: time.Sleep and time.NewTimer: Fix duration parameter error #17696

johnrs opened this issue Oct 31, 2016 · 89 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows
Milestone

Comments

@johnrs
Copy link

johnrs commented Oct 31, 2016

What version of Go are you using (go version)?

go version go1.7.3 windows/amd64

What operating system and processor architecture are you using (go env)?

set GOARCH=amd64
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows

What did you do?

I wrote a test which is basically a loop consisting of two things: Some busy work for about 500us and a call to time.Sleep(1 * time.Millisecond).

Here's a link to the test program: https://play.golang.org/p/VYGvIbTPo3

It can't run on the playground, so you will have to grab it and run it yourself.

What did you expect to see?

  work(~500us) + time.Sleep(1ms): 1.500000ms    ... or slightly more
  work(~500us) + time.After(1ms): 1.500000ms    ... or slightly more

What did you see instead?

  work(~500us) + time.Sleep(1ms): 1.030103ms
  work(~500us) + time.After(1ms): 1.110111ms

Discussion

I believe that there are errors in the doc's for time.Sleep and time.NewTimer. They both make similar statements regarding the duration.

Sleep pauses the current goroutine for at least the duration d.

NewTimer creates a new Timer that will send the current time on its channel after at least duration d.

If these are correct, then the test loops should take ~1.5ms or longer per pass. But they don't.

I believe that at least the duration d was an attempt to convey that overhead might result in a longer duration than requested. But there are two other factors which might result in a shortened duration, sometime MUCH shorter. These factors are the resolution and phase of the time source used for these functions.

In Windows, the resolution is normally 1ms for Go programs. I'll refer to this value as r, short for resolution. I'm not sure what it is on other platforms. If you run the test program it will measure and display it. Please let me know what you find.

I believe that in general the duration can be as short as d-r (rather than d). How short simply depends on what the phase of the time source when you use it. Thus if you ask for a 100ms delay you will end up with something between 99 and 100 (plus overhead). Thus as you use smaller values of d (approaching r) the problem gets worse. By itself, this isn't too bad. But now the other shoe drops.

In the specific case that you ask for a delay of r (or anything between 1 and r), then the delay can be anything between 0 and r. Actually, for a reason I don't quite understand, the lowest delay I've actually seen is about 50us, not 0. Even so, that is 20 times less than the 1ms requested. This was drastic enough to break some of my code. I changed d from 1ms to 2ms. The resulting delay is between 1 and 2ms (plus overhead).

Can this cause a problem? Generally, in a loop only the first pass is affected, because the loop gets in phase with the timing source. So this might or might not be bad for you. If it is, put a sleep 1ms call just before the loop.

But imagine that you are using time.After to set up a timeout, but it occurs too quickly. Perhaps 20 times too quickly. This can cause intermittent, non-deterministic, bogus timeouts, even on a single-threaded program. Ouch! This is what bit me.

Another gotcha: Using d of 10ms now, but Go Windows switches r from 1ms to 15.6ms. Instead of a delay of 9 - 10ms you now get 0 - 15.6ms.

Possible Solutions

The really bad case when d is between 1 and r could be detected causing d to be changed to r+1. Tacky, in my opinion. Alternatively, vet could look for this case.

Most programming languages I've worked in allow this to happen, but they don't document it incorrectly. This seems like the best solution to me. Simply fix the doc's to reflect reality.

Perhaps a paragraph at the top describing what the "resolution of the time source" means, what it is for some typical platforms, and an example of a simple routine to determine what yours is. Maybe even a system call to retrieve it.

Then, change the current time.Sleep description from

Sleep pauses the current goroutine for at least the duration d. A negative or zero duration causes Sleep to return immediately.

to something like

Sleep pauses the current goroutine for at least the duration d - 1. A duration less than or equal to the time source resolution can cause Sleep to return quickly. A negative or zero duration causes Sleep to return immediately.

And similar for time.NewTimer's description.

@rakyll rakyll changed the title time.Sleep and time.NewTimer: Fix duration parameter error time: time.Sleep and time.NewTimer: Fix duration parameter error Oct 31, 2016
@rakyll rakyll added the Documentation Issues describing a change to documentation. label Oct 31, 2016
@rakyll
Copy link
Contributor

rakyll commented Oct 31, 2016

Not sure this is a documentation issue. On darwin and linux, we sleep more than duration (at least, i cannot reproduce the other case). The Windows behavior looks broken to me.

@rakyll rakyll added OS-Windows and removed Documentation Issues describing a change to documentation. labels Oct 31, 2016
@johnrs
Copy link
Author

johnrs commented Oct 31, 2016

Interesting. Were you able to run the test program? I'd be curious to see your results.

If it is just a WIndows problem then perhaps adding 1 to whatever d is, as long as it isn't 0, might be all it takes to solve it.

@rakyll
Copy link
Contributor

rakyll commented Oct 31, 2016

What I see on darwin:

Verifications

  time.Now resolution: 48ns
  time.Sleep(1ms):     1.266082ms
  time.After(1ms):     1.243748ms
  doWork(~500us):      501.122µs    (Adjust as needed)

Tests

  work(~500us) + time.Sleep(1ms): 1.904777ms
  work(~500us) + time.After(1ms): 1.881365ms

@rakyll rakyll added this to the Go1.8Maybe milestone Oct 31, 2016
@rakyll
Copy link
Contributor

rakyll commented Oct 31, 2016

/cc @alexbrainman

@johnrs
Copy link
Author

johnrs commented Oct 31, 2016

Wow! I was thinking that perhaps the Linux time source would be good to 1us. 48ns!

This is what I see for the verifications on my Windows machine.

  time.Now resolution: 1.0001ms
  time.Sleep(1ms):     1.0001ms
  time.After(1ms):     1.0001ms
  doWork(~500us):      480.048µs    (Adjust as needed)

I see what you mean about the resulting durations being a bit (about 25%) too long. That makes no sense at all to me. It makes me wonder if there is a math error somewhere (using 4 instead of 5, or vice versa). Perhaps it is worth creating an issue for this. 25% is too much to ignore.

@rsc
Copy link
Contributor

rsc commented Nov 2, 2016

@johnrs, what output do you get from your test program on Windows? And which version of Windows are you running?

I tried it on our Windows testing systems and I get:


Verifications

  time.Now resolution: 976.5µs
  time.Sleep(1ms):     1.953ms
  time.After(1ms):     1.953ms
  doWork(~500us):      332.01µs    (Adjust as needed)

Tests

  work(~500us) + time.Sleep(1ms): 1.97253ms
  work(~500us) + time.After(1ms): 1.97253ms

A 976.5µs interval is 1024 Hz. In order to sleep 1ms, we wait for two ticks to elapse (one is not enough), giving 1.953ms.

I'm curious what resolution you are seeing.

The documentation is a correct statement of the semantics we intend to guarantee.

The implementation is that on entry to time.Sleep, the runtime calls runtime.nanotime, adds the duration, records that sum, and puts the goroutine into a list of pending timers. At some point in the future, the timer-handling goroutine calls runtime.nanotime, sees that the current time is now at or beyond the time at which the original goroutine should be woken up, and wakes it up.

There are always going to be rounding errors possible depending on the actual precision of the underlying system clock. I'm not sure what there is to do about that. In general we probably don't want to make all sleeps take longer than they should.

FWIW, on Windows runtime.nanotime is using "Interrupt Time", which is at least recorded in 100ns units. It may only be updated at 1024 Hz. Similarly, time.Now uses "System Time", which again is 100ns precision but maybe only 1024 Hz accuracy.

@johnrs
Copy link
Author

johnrs commented Nov 3, 2016

When I run the test on my system this is what I get:

Verifications

  time.Now resolution: 1.0001ms
  time.Sleep(1ms):     1.0001ms
  time.After(1ms):     1.0001ms
  doWork(~500us):      480.048µs    (Adjust as needed)

Tests

  work(~500us) + time.Sleep(1ms): 1.030103ms
  work(~500us) + time.After(1ms): 1.110111ms

I'm running Windows 7.

I believe that most Windows systems are 1.001ms (minimum) and 15.600ms (default). This is what I've seen on various versions and systems over the years. It's also what I see most people refer to when discussing it.

The ClockRes program, part of the SysInternals suite (www.sysinternals.com), shows this "system clock resolution" when I'm running a Go program:

Current timer interval: 1.000 ms

So it seems that our systems differ in 2 ways: 976.5us vs 1.0001ms (no big deal), and two ticks vs 1 tick for the Sleep and After timing. Strange!

Using 1 extra clock tick, as you describe, sounds like a good idea to me. At first I thought this might be a tacky solution, but upon further consideration I like it. It makes is "safe" (no 20 times too short intervals) and that's the most important thing. But that's not what seems to be taking place on my system.

btw - If I run the test under Go 1.6.3 I see the same results, but based on 15.6ms instead of 1ms. I first noticed this problem under Go 1.5.2 and posted a note about it in go-nuts on 2015/12/24, so it's nothing new. I just forgot to follow up on it then.

@alexbrainman
Copy link
Member

I have nothing to add to @rsc explanation. My Windows 7 outputs:

Verifications

  time.Now resolution: 976.5µs
  time.Sleep(1ms):     2.060415ms
  time.After(1ms):     1.953ms
  doWork(~500us):      361.305µs    (Adjust as needed)

Tests

  work(~500us) + time.Sleep(1ms): 2.021355ms
  work(~500us) + time.After(1ms): 1.97253ms

You are correct that Windows 7 have 2 timer resolutions: 1ms or 16ms. When system starts, it defaults to 16ms. But once any program calls timeBeginPeriod(1), the wide system timer resolution goes to 1ms, and stays with 1ms until the program exits.

Initially every Go program called timeBeginPeriod(1) at the start. But then ab4c929 stopped that. And then we discovered some problems with our change, and 0435e88 started calling timeBeginPeriod(1) again.

Alex

@alexbrainman
Copy link
Member

@johnrs you could also debug what is happening on your computer. As far as I remember most code will be in runtime\time.go.

Alex

@johnrs
Copy link
Author

johnrs commented Nov 3, 2016

I did a safe-boot, just to see if any software on my system might be affecting it. I got the same result. I also tried a 32-bit version. Still the same. I sent a copy of the exe's to a few friends locally and asked them to try it. I'll let you know what they find.

What type of debugging did you have in mind?

@johnrs
Copy link
Author

johnrs commented Nov 3, 2016

@rsc & @alexbrainman - I have found something interesting. The problem seems to be related somehow to the Windows system's minimum timer interval (which is 0.5ms on newer systems but 1ms on older systems, from what I read). Simple way to tell. Grab this program and run it. Here's what I see.

ClockRes v2.0 - View the system clock resolution
Copyright (C) 2009 Mark Russinovich
SysInternals - www.sysinternals.com

Maximum timer interval: 15.600 ms
Minimum timer interval: 0.500 ms
Current timer interval: 15.600 ms

Go switches you to 1ms. The timeBeginPeriod call is hard coded to use only 1ms. But if you run something else which switches to 0.5ms (using ntdll.dll - NtSetTimerResolution) at the same time then the nature of the problem I'm seeing changes. It goes away!

If anyone wants to duplicate this, here's a handy tool to make the change and to show what the system is current running.

Windows System Timer Tool
TimerToolV3.zip - 24.08.15 [16:46 UTC] by tonfilm
https://vvvv.org/contribution/windows-system-timer-tool
https://github.com/tebjan/TimerTool

I've upgraded my test progam by adding more test periods, and mde them longer for better precision. Here it is: https://play.golang.org/p/7AwpUwZSDa

I'll follow up with the exact info on what I saw today shortly.

@johnrs
Copy link
Author

johnrs commented Nov 3, 2016

Here are my results using the new version of the test program. The first is the default situation, letting Go switch the system timer to 1ms. The second is with another program running in the background which has set the system to 0.5ms. In the second case the problem goes away!

=====  #1 - System timer at default (set to 1ms by Go)

Maximum timer interval: 15.600 ms
Minimum timer interval: 0.500 ms
Current timer interval: 1.000 ms

--- Test program results

Checks

  time.Now resolution:  1.000ms
  Work(~250us):         246us

Tests
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  time.Sleep:  0.000ms    1.000ms    1.000ms    1.004ms    2.000ms    2.000ms
  ... + work:  0.360ms    1.008ms    1.002ms    1.008ms    2.008ms    2.014ms
  time.After:  0.002ms    1.000ms    1.000ms    1.000ms    2.000ms    2.002ms
  ... + work:  0.296ms    1.004ms    1.006ms    1.004ms    2.004ms    2.010ms

=====  #2 - System timer interval set to 0.5ms (which supersedes Go's 1ms request)

Maximum timer interval: 15.600 ms
Minimum timer interval: 0.500 ms
Current timer interval: 0.500 ms

--- Test program results

Checks

  time.Now resolution:  0.500ms
  Work(~250us):         235us

Tests
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  time.Sleep:  0.000ms    1.009ms    1.000ms    1.000ms    2.004ms    2.000ms
  ... + work:  0.279ms    1.214ms    1.329ms    1.476ms    2.425ms    2.457ms
  time.After:  0.004ms    1.001ms    1.000ms    1.000ms    2.001ms    2.002ms
  ... + work:  0.310ms    1.321ms    1.007ms    1.472ms    2.383ms    2.453ms

=====

Also, I have some feedback from two friends. One does not see the problem. The other does - and his system timer is running at 0.5ms (which is what seems to solve the problem for me). I've sent them the new version of the test program to try.

Update: 2nd friend just sent me his results from the revised test. His system timer is indeed running at 0.5ms, but he sees the problem regardless - except for the 1.5ms and 2ms tests which do show the extra time due to the work. Strange!

@geraldstanje
Copy link

geraldstanje commented Nov 4, 2016

it seems that in windows 10 the following command (windows.h) sets the max timer interval to 1ms: timeBeginPeriod(1) therefore winmm.lib needs to be linked. is that right?

@johnrs
Copy link
Author

johnrs commented Nov 4, 2016

I believe that Go does link to it.

@alexbrainman
Copy link
Member

What type of debugging did you have in mind?

I do not have any particular idea. But when things look odd to me (just like something bothering you here), I try to add fmt.Printf statements around code in questions (you would have to use println, because you will be debugging runtime package), and see what it outputs. Sorry I did not spend enough time looking at this issue, so I cannot give anything particular to investigate.

I believe that Go does link to it.

Here https://github.com/golang/go/blob/master/src/runtime/os_windows.go#L282 runtime calls timeBeginPeriod(1) during startup.

Alex

@johnrs
Copy link
Author

johnrs commented Nov 7, 2016

Here are the results of some runtime testing I did today.

Note: I edited/compressed the results listings to save a little verticle space. In doing so, I left out the time.After() data since it mirrored the time.Sleep data.

Note: You might ask why I included the Windows 0.5ms tests. Go only sets it to 1ms. Because the fastest current program running wins, so if another program is using 0.5ms then that is what all programs will see. Also, it not having one problem, but having another instead is probably important.

=====  Runtime (go1.7.3 windows/amd64) tests using my machine

I tested with Windows timer running at 1ms and 0.5ms.

nanotime() looked good.  It returned the time in 1ms / 0.5ms steps.

cputicks() looked good.  Two adjacent calls were seperated by a count of about 70, 
with a resolution of 2.

tickspersecond() looked good.  It was ~1.855 * 1e9 (which matches my CPU clock).

usleep() seems to have a problem.  When Windows is set to 1ms there is no evidence 
of the work being done, but at 0.5ms there is (except for the 0.5ms test).  I'm not clear 
on exactly how usleep() is supposed to work ("at least" guarantee and n+1 ticks ???), 
but the results shouldn't be inconsistent.

---  time.Now resolution:  1.000ms
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  usleep:      0.002ms    0.000ms    1.000ms    1.006ms    2.000ms    2.000ms
  ... + work:  0.332ms    0.308ms    1.000ms    1.000ms    2.004ms    2.012ms

---  time.Now resolution:  0.500ms
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  usleep:      0.000ms    0.001ms    0.500ms    1.006ms    1.500ms    2.010ms
  ... + work:  0.339ms    0.296ms    0.515ms    1.357ms    1.951ms    2.475ms

=====

For reference, using the regular test, here is what I see.

=====  My machine - Windows 7, 64-bit

---  time.Now resolution:  1.000ms
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  time.Sleep:  0.000ms    1.008ms    1.000ms    1.002ms    2.012ms    2.000ms
  ... + work:  0.278ms    1.000ms    1.006ms    1.002ms    2.040ms    2.046ms

---  time.Now resolution:  0.500ms
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  time.Sleep:  0.000ms    1.007ms    1.000ms    1.003ms    2.000ms    2.003ms
  ... + work:  0.365ms    1.482ms    1.327ms    1.318ms    2.455ms    2.520ms

---  Problems

1) No evidence of n+1 ticks.
2) When Windows is 1ms there is no evidence of the work being done.
3) When Windows is 0.5ms the steps are still 1ms.

=====

Here are some results from a few friends' machines.

=====  Brian's home machine - Windows 10, 64-bit

---  time.Now resolution:  0.500ms
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  time.Sleep:  0.000ms    1.144ms    1.155ms    1.183ms    2.045ms    2.133ms
  ... + work:  0.261ms    1.159ms    1.155ms    1.138ms    2.214ms    2.262ms

1) No evidence of n+1 ticks.
2) When Windows is 0.5ms there is no evidence of the work being done.
3) When Windows is 0.5ms the steps are still 1ms.

=====  Don's home machine - Windows 7, 32-bit

---  time.Now resolution:  1.000ms
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  time.Sleep:  0.000ms    1.038ms    1.004ms    1.046ms    2.050ms    2.010ms
  ... + work:  0.246ms    1.198ms    1.020ms    1.024ms    2.072ms    2.024ms

---  Problems

1) No evidence of n+1 ticks.
2) When Windows is 1ms there is no evidence of the work being done.

=====  Marc's work machine #1 - Windows XP, 32-bit

---  time.Now resolution:  15.625ms
                 0          1        0.5ms       1ms       1.5ms       2ms
              -------    -------    -------    -------    -------    -------
 time.Sleep:  0.000ms    1.938ms    1.969ms    1.938ms    1.969ms    2.938ms
 ... + work:  0.250ms    1.938ms    1.969ms    1.938ms    1.969ms    2.906ms

---  Problems

None.  This matches @rsc and @alexbrainman's good results, including n+1 ticks.

=====  Marc's work machine #2 - Windows 10, 64-bit

--- time.Now resolution:  0.500ms
                 0          1        0.5ms       1ms       1.5ms       2ms
              -------    -------    -------    -------    -------    -------
 time.Sleep:  0.000ms    1.397ms    1.332ms    1.386ms    2.200ms    2.297ms
 ... + work:  0.275ms    1.316ms    1.373ms    1.314ms    2.205ms    2.379ms

---  Problems

1) No evidence of n+1 ticks.
2) When Windows is 0.5ms the steps are still 1ms.

=====  Marc's work machine #3 - Windows 10, 64-bit

---  time.Now resolution:  0.999ms
                 0          1        0.5ms       1ms       1.5ms       2ms
              -------    -------    -------    -------    -------    -------
 time.Sleep:  0.000ms    1.447ms    1.399ms    1.390ms    2.249ms    2.431ms
 ... + work:  0.219ms    1.502ms    1.522ms    1.480ms    2.271ms    2.495ms

---  Problems

1) No evidence of n+1 ticks.

=====

Summary: I don't see a pattern. Unfortunately I wasn't able to get both 1ms and 0.5ms results for all machines, but even so many of the results contradict each other. Of the 3 identified problems, various machines (depending on the Windows timer setting) have 0, 1, 2, or all 3 problems.

Anyone have any ideas of what to do next?

Thanks,

John

@quentinmit quentinmit added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 8, 2016
@alexbrainman
Copy link
Member

When I run the test on my system this is what I get:

Verifications

time.Now resolution: 1.0001ms
time.Sleep(1ms): 1.0001ms
time.After(1ms): 1.0001ms
doWork(~500us): 480.048µs (Adjust as needed)

Tests

work(~500us) + time.Sleep(1ms): 1.030103ms
work(~500us) + time.After(1ms): 1.110111ms

@johnrs I do not know what your problem is. But, I suspect, you expect work(~500us) + time.Sleep(1ms) to be sum of time.Sleep(1ms) and doWork(~500us) (so 1.0001ms + 480.048µs = 1.480148ms, instead of 1.030103ms). But that is not how things work here, I think. Imagine you are waiting for a train at a train station. And the train arrives every 1000µs. If you decide to skip 9 trains and catch the 10th train, it will take you about 10000µs. Imaging you decide to change this exercise to have a smoke after each train arrives (let's say smoking will take you 500µs), and then (after smoking) wait for the next train. After the adjustment, it will still take 10000µs before your 10th train arrives.

Once you start playing with intervals close to your clock resolution, unusual things will happen.

Alex

@johnrs
Copy link
Author

johnrs commented Nov 11, 2016

@alexbrainman

work(~500us) + time.Sleep(1ms): 1.030103ms

So your system has the problem, too.

I started off thinking that this was just a documentation error. But now we know that some machines yield the correct result (something "at least" 1ms + work of ~500us --- which turns out to be 2ms because Go adds a tick to guarantee the "at least").

Even a simple loop just doing a 1ms wait (when the system timer is also 1ms) should take 2ms per pass - like Russ and others see on their machines.

Please read above where Russ states that the doc's are correct. Read where he says: "In order to sleep 1ms, we wait for two ticks to elapse (one is not enough)".

Your example of waiting for a train is flawed. The interval you have to wait for the first train to arrive can be anything from 0 to 1ms. Only after you are in sync with the timer will the additional trains require waiting 1ms each. But as Russ confirmed, to deliver "at least the duration" Go waits for an additional tick.

Using your train example again, this means that you would ignore the next train - regardless of how long (0 - 1ms) it takes to arrive, then you would count the train after it as the "first" train (1 - 2ms).

I see nothing "unusual" about this. It's a bug. Many programming languages don't bother to ensure the minimum wait time, thus exposing you to a potentially serious problem. After all, if you ask for 1ms and end up with something much, much less this can cause faults. Unfortunately, the "extra tick" (which prevents this problem at the cost of a slightly longer wait than requested) doesn't work on some (yours and mine!) Windows systems.

The usleep function in runtime seems to exhibit the same problem. A quick check shows 7 places where usleep is called in runtime expecting between 1 and 1000us to pass. Any of these can fault to an almost immediate return.

This is not a boundary bug. The error is 1 tick, hence is more apparent at or below the boundary. But that doesn't make it unimportant. The 7 cases in runtime prove this. If you had run the revised test I posted you could also see the results at 2ms.

@alexbrainman
Copy link
Member

So your system has the problem, too.

If you are referring to the work(~500us) + time.Sleep(1ms): 1.030103ms, that is not from my "system". I was quoting you from #17696 (comment)

Alex

@rsc rsc modified the milestones: Go1.9, Go1.8Maybe Nov 11, 2016
@johnrs
Copy link
Author

johnrs commented Nov 12, 2016

@rsc
I've been trying to figure out where in the code the extra tick period is added. It seems to be in runtime: time.go, timerproc(), where t.period is tested. The code adds a tick (unless delta is already greater than a tick). But I'm confused about a few things.

  • Who determines what the period is?
  • Who puts the period into the timer?
  • What prevents timerproc from adding a period again on the next loop?

I looked inside a running timer. The period seemed to be 0 all the time (perhaps causing the problem I am seeing). But even if I set it (prior to the "when" time) it didn't seem to change anything. So I must be misunderstanding something.

Side question: I'm curious why wait to add the extra tick at the end of the duration instead of doing it at the beginning? It seems to me that doing it at the beginning is extremely easy and would save a considerable amount of processing. Perhaps I'm missing something?

usleep: From what I gather usleep uses a totally different mechanism. I see 7 places in runtime where it is called for waits of 1ms or less. Is it a problem if the wait is actually much less than requested?

@ianlancetaylor
Copy link
Member

@johnrs My apologies if I misunderstand. As far as I know, the Go runtime does not add any extra tick period. Why would it? When t.period > 0, it sets the time that the timer should fire next to t.period adjusted by the time since the timer fired last.

t.period is set by runtime.NewTicker, which builds a *timer and passes it to the runtime package.

@johnrs
Copy link
Author

johnrs commented Nov 12, 2016

@ianlancetaylor The doc's imply that it does. Russ said that it did. Some systems perform as if it does. I see some code which seems to implement it partially. Why should it? If you do then worst case is your interval is 1 tick too long. If not then worst case is your interval is near 0.

Thanks for the explanation of t.period. That explains why it would continue to keep adding time. I was totally focused on just the +1 tick and I was thinking this was it, but not so. :)

Regardless of whether there should be +1 tick or not, I'm seeing too many inconsistent results. At a bare minimum I think that it's fair to say that the doc's don't agree with about half of what I'm seeing. I can't help but think that this is a problem.

@johnrs
Copy link
Author

johnrs commented Aug 6, 2017

@prasannavl

I think that most would agree that 1ms timing resolution is more desirable than 15ms resolution (except when trying to save power). Which to use has changed a few times in Go. The most current solution coming in v1.9 will hopefully be optimum for the majority of uses. Allowing for extreme cases is something which is still on the table, as per #20937. I plan to open an issue about it after v1.9 is released.

This issue is about the time.Sleep's (and other timers) documentation saying:

Sleep pauses the current goroutine for at least the duration d.

As far as I can tell, Go has never performed this way (including the upcoming (v1.9). The situation is this. Let's say that you are running with the system timer set to 1ms. If you do a time.Sleep(1ms) there are two possible results (ignoring the statistically unlikely case of exactly a 1ms sleep). You will either sleep for less than 1ms or more than 1ms (generally between 1ms and 2ms).

The documentation says it should be "at least" 1ms. But testing shows that it is generally "at most" 1ms. Hence the issue. I've seen time.Sleep(1ms) take anywhere from as little as 10us to 1ms.

A more accurate description of the current behavior would be something like:

Sleep pauses the current gouroutine for least the duration d, minus 1 tick of the system clock.

The fix can be either to change the documentation to match the code, or to change the code to match the documentation, but leaving it as is is simply wrong. Personally, I prefer changing the code to match the documentation. I feel that it is the safer, most useful solution. In other words, I think that time.Sleep(1ms) resulting in a 10us sleep is much worse than a a 2ms sleep.

The QueryPerformanceTimer is interesting for high precision (550ns resolution on my machine) interval measurement but it comes at the cost of higher overhead, so I wouldn't want it to become the default but perhaps an option instead. Also, I don't think that it would help in the case of time.sleep() since it doesn't provide a "wait" feature.

Discussion here dropped off since there were other changes being made to time (wall clock vs monotonic) with the hope that perhaps the issue would be resolved as a side effect. I don't believe that it has, hence I plan to ignite the discussion again after the v1.9 release.

@prasannavl
Copy link

I think that most would agree that 1ms timing resolution is more desirable than 15ms resolution (except when trying to save power).

I'd say that's debatable. I'd like to think most wouldn't agree if they understand the costs. Most general application have really no need for the higher resolution. The fact that, to this date, there is no way baked into the .NET ecosystem (a ecosystem built for Windows) to do this without resorting to P/Invoke and calling the mm APIs stand on that very rationale as well. Only the application that really really need it, do so - and they can still do so with a very simple call.

But other than that, I am on the same page with you on everything else. I also suppose while a lot of my points had contextual relevance, it isn't directly of immediate relevance to this issue. I shall wait for the discussion to ignite again on it, as you said. :)

@johnrs
Copy link
Author

johnrs commented Aug 6, 2017

(except when trying to save power).

I'd say that's debatable.

I did leave out one other consideration: It also allows programs (your and others) to run a slight bit faster due to the reduction in overhead. I don't think it's a major factor, but seems to matter to some. Did you have any other costs in mind?

OK. Looking forward to your participation. :)

@alexbrainman
Copy link
Member

Why do you think time.Sleep(1ns) sleeping between 15ms and 30ms is preferable ...

"I believe that the safer choice is the one with the vastly smaller error range." Search the thread for "safe" for more details from my previous posts.

I don't understand your explanation, sorry.

The goal should be to remove the reliance of the runtime on multimedia timers entirely, ...

I won't argue with that.

Alex

@prasannavl
Copy link

prasannavl commented Aug 6, 2017

It also allows programs (your and others) to run a slight bit faster due to the reduction in overhead. I don't think it's a major factor, but seems to matter to some.

@johnrs

I think you have an incorrect understanding of the windows scheduler? When you increase the resolution, programs don't run faster - they run slower, with more overhead.

Back to some fundamentals - when you increase the res to 1ms, what happens is the kernel interrupts every 1ms instead of 15ms, which means there's a lot more context switches, and more book-keeping to do. So, it increases CPU usage, and increase overhead. What appears faster is that thread wake ups are happening sooner than the 15ms - this only affects programs that are sleeping, and or waiting using one of WaitForSingleObject, Multiple etc..

While I generally see a lot of complaints when directly translating UNIX philosophies on to Windows - it's simple because you're solving it against the platform's way. Windows kernel scheduler is actually far more intelligent than most realize. Unless, there's a lot of contention for threads, and it sees a compute heavy thread, it sets the affinity of the thread to a specific CPU (on multi-cores), and pins it for multiple quantum slice.

So, what this means is, the scheduler completely gets out of the way, and your program gets all the CPU it needs. Now with 1ms, you're basically forcing the kernel to do more book-keeping. So, your programs actually get slower.

You can very easily test this. Just run a simple C (or even a C#) program, and let it run a heavy computation. It will complete faster on 15ms than when running on a higher res. I don't say "Go" program, because I don't yet know the internals of the Go scheduler to understand where it interferes with this, in order to workaround it.

The only place where it could potentially be a little faster (but even here, in my naive tests, the kernel seems to work it's magic), is when you have a lot of "active" threads contending for the CPU constantly, and your program is not in the foreground. (This is a decade old optimization for foreground programs to get better quantum slice priority).

Here's a simple C# program to illustrate the same:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Runtime.InteropServices;

namespace dtest
{
    class Program
    {
        static List<TimeSpan> s_results = new List<TimeSpan>();
        static Stopwatch s_timer = new Stopwatch();
        static void Main(string[] args)
        {
            NtQueryTimerResolution(out int minRes, out int maxRes, out int currentRes);
            Console.WriteLine("Resolutions - current: {0}, {1}, {2}", currentRes, minRes, maxRes);
            NtSetTimerResolution(maxRes, true, out currentRes);
            Console.WriteLine("Setting to highest - current: {0}, {1}, {2}", currentRes, minRes, maxRes);            
            for (int i = 0; i < 20; i++)
                Run();
            NtSetTimerResolution(maxRes, false, out currentRes);                
            Console.WriteLine("Setting back to normal - current: {0}, {1}, {2}", currentRes, minRes, maxRes);
            for (int i = 0; i < 20; i++)
                Run();
            s_results.ForEach(x => Console.WriteLine(x.TotalMilliseconds / 1000 * 1000 + "ns"));
        }

        static void Run()
        {
            s_timer.Reset();
            s_timer.Start();
            var i = 0;
            for (; i < int.MaxValue; i++) { }
            s_timer.Stop();
            s_results.Add(s_timer.Elapsed);
        }

        [DllImport("ntdll.dll", SetLastError = true)]
        static extern int NtSetTimerResolution(int DesiredResolution, bool SetResolution, out int CurrentResolution);

        [DllImport("ntdll.dll", SetLastError = true)]
        static extern int NtQueryTimerResolution(out int MinimumResolution, out int MaximumResolution, out int CurrentResolution);
    }
}

It goes a bit further and sets the timer res to 0.5ms on systems that support it for the sake of the argument. According to my tests:

Resolutions - current: 156251, 156250, 5000
Setting to highest - current: 5003, 156250, 5000
Setting back to normal - current: 156251, 156250, 5000
748.0012
750.8139
750.4076
750.0441
750.243
738.9563
750.5714
737.8166
750.0437
750.4029
751.0132
750.1848
749.9552
749.7858
750.148
749.9881
750.4029
750.0625
749.9534
750.1053
721.1744
722.268
722.3988
724.3536
721.4824
723.0694
723.7977
724.2121
721.3985
722.0977
722.9411
722.5806
721.2121
722.7478
723.8593
725.0409
720.9901
720.9713
722.4745
721.0312

When resolution is 15ms vs 0.5ms - the former is about 30ns faster to run one full loop from 0 to int.MaxValue

Now, the same program when with almost no thread contention (Previously there were a bunch of processes in the background running in like Chrome, ProcessHacker, VSCode etc. - which is why the difference was much greater. Kernel's doing a lot more book-keeping. Note though: they are still in background. Foreground process is always given higher quantum slices.)

Resolutions - current: 156251, 156250, 5000
Setting to highest - current: 5003, 156250, 5000
Setting back to normal - current: 156251, 156250, 5000
739.4905ns
721.7347ns
722.6255ns
730.5602ns
722.8594ns
731.3017ns
723.2909ns
723.2439ns
723.3106ns
723.3431ns
722.9137ns
723.2832ns
722.9886ns
729.5141ns
721.5281ns
721.5093ns
734.1097ns
750.5821ns
732.7545ns
724.7744ns
721.0487ns
721.301ns
723.9683ns
723.4316ns
724.4439ns
726.2302ns
722.6499ns
723.2126ns
725.6815ns
722.0058ns
720.7848ns
726.6467ns
720.137ns
720.4915ns
720.2242ns
720.8336ns
721.6692ns
722.4454ns
720.8203ns
722.158ns

Here the difference is much smaller. Though, lower resolution is still faster with less overhead to deal with. Feel free to run the tests on difference scenarios to understand it's implications. Windows uses a timer resolution of 15ms, because it's a good compromise between stealing programs time, and resolution. It's not very difficult to auto tune this resolution in the kernel, and emulate it like UNIX as much as the interrupt timer resolution allows (typically 0.5-1ms). But it's trade-off that has to be understood explicitly by each program before it does this.

I like to believe that's possibly a reason Microsoft simply chose not to do it, and to provide make 15ms the default resolution and program that knows that it needs more can explicitly go ask for it - which is also why I think it's far better to stick to the consistent model that the platform recommends than letting every random Go programs take extra interrupts without knowing the consequences. The unfortunate thing here, that saddens me however, is that the runtime itself makes these bad decisions.

@johnrs
Copy link
Author

johnrs commented Aug 6, 2017

Hi Alex.

I don't understand your explanation, sorry.

Sorry I'm not being clear. Perhaps it would help if we backed up a step. Do you agree that there is a problem? If so, what solution do you prefer?

@johnrs
Copy link
Author

johnrs commented Aug 6, 2017

When you increase the resolution, programs don't run faster - they run slower, with more overhead.

Yes, that's what I meant. When I referred to another consideration I meant in addition to saving power, when running at 15ms.

... however that saddens me is that the runtime itself makes these bad decisions.

Better in v1.9 than it was, I think. And can be improved as per issue #20937.

@prasannavl
Copy link

prasannavl commented Aug 6, 2017

Yes, that's what I meant. When I referred to another consideration I meant in addition to saving power, when running at 15ms.

Ah! I mistook it the other way, and was confused why you meant it would increase the perf increasing the timers. Guess, I could have saved myself a few minutes not having to write out that naive test. Haha.

Anyways, yes. 1.9 is better on idling. 👍

@johnrs
Copy link
Author

johnrs commented Aug 6, 2017

No sweat. Save the test results. They would support the case for using 15ms even when the a program isn't idling. :)

@alexbrainman
Copy link
Member

Perhaps it would help if we backed up a step. Do you agree that there is a problem?

I don't agree there is a problem here. I think I already explained my view. I think that you expect your program to work in a particular way, but that is not how it works. I agree that Go documentation is not correct for some scenarios you've described. But, I submit, we should leave documentation alone, because these scenarios do not happen often enough. More documentation will just make things even more confusing.

Alex

@johnrs
Copy link
Author

johnrs commented Aug 7, 2017

I do not know what your problem is.

Incorrect behavior. time.Sleep doesn't work as documented:

Sleep pauses the current goroutine for at least the duration d.

A more accurate description of the actual behavior would be something like:

Sleep pauses the current goroutine for at least the duration d, minus 1 tick of the system clock.

Note: I only make this claim for Windows. I do suspect that the problem exists on all platforms, but I do not claim that at this time.

The difference in behavior (documented vs actual) can result in problems and can make a program harder to reason about.

Severity: Depending on the code, it can cause loss of data (high priority), bogus retries (medium priority), or not matter (low priority). The symptoms are typically non-deterministic and intermittent thus making the problem hard to troubleshoot.

It caused a program of mine to fail and that's why I decided to report it. I believe it happens more frequently than people realize because it generally doesn't cause data loss,just retries, or some other fatal effect.

There was some confusion due to problems with early versions of the test program: Testing at the tick boundary, not allowing for the Windows clocks smaller than 1ms (ex: 0.976ms), and the "sleep + work" test results being hard to interpret. So let's refer to a simple, clear test result. From my post on May 26

Hit Enter for a Sleep(900us), Ctrl-C to exit:
566us
413us
67us
458us
952us
408us
901us
470us
272us
394us
[etc]

I agree that Go documentation is not correct for some scenarios you've described.

I can't think of a scenario in which it is correct. Please describe one.

More documentation will just make things even more confusing.

I believe that the goal should be:

"Everything should be as simple as possible, but no simpler."
-- Albert Einstein

I believe that both Rob and Russ have said that the documentation is correct as is. So the remaining option is to change the code to match it.

I think I already explained my view.

In light of the early confusion, and to help me understand your position, could you please restate it in the context of this message?

@alexbrainman
Copy link
Member

Incorrect behavior. time.Sleep doesn't work as documented:

So, you want documentation changed. Thank you for explaining.

I can't think of a scenario in which it is correct. Please describe one.

Perhaps you are correct. Perhaps it is impossible to describe one. But I consider myself a practical person. The problem you are describing does not bother me.

In light of the early confusion, and to help me understand your position, could you please restate it in the context of this message?

I am not convinced we need to change code or documentation.
But don't try to convince me, I don't make decisions here.

Alex

@johnrs
Copy link
Author

johnrs commented Aug 7, 2017

So, you want documentation changed.

No, that is not what I said.

The problem you are describing does not bother me.

I understand. Would it being fixed bother you?

@alexbrainman
Copy link
Member

Would it being fixed bother you?

Fixes are not free.

Recently we made changes related to this (see CL 47832 and 47833). Austin had to spend his time implementing this changes. Austin could have improved something else instead. And now (after CLs 47832 and 47833) we have more code in runtime that all runtime developers have to look at and consider. And the code can break in the future.

Same with "improved" version of documentation. More complicated documentation means that every Gopher (even the ones who are not affected by this issue) must read and understand all these new details.

So the problem must be worth fixing. But personally I am not convinced.

Alex

@johnrs
Copy link
Author

johnrs commented Aug 7, 2017

Fixes are not free.

Very little in life is. :)

Recently we made changes related to this (see CL 47832 and 47833).

Those were related to CL 38403 and issue 20937, not this issue.

More complicated documentation means ...

Then let me propose an alternate change to the documentation:

Sleep pauses the current goroutine for about the duration d.

This is both simpler and more correct. Win-win!

@alexbrainman
Copy link
Member

This is both simpler and more correct. Win-win!

I agree this is very small change. But I still cannot make this decision. I did not design / document that API, so I cannot make that call. You have to convince others. Sorry.

Alex

@prasannavl
Copy link

prasannavl commented Aug 7, 2017

@johnrs, just been thinking about this again - I think there's a problem with your time.Now() resolution measurement. time.Now() measurement by running it in a loop until is changes is incorrect.

You'll get different time.Now() resolution by doing this during different times depending on the how the OS schedules your threads - which also implies your estimation of dowork's time is incorrect. It can or cannot be different each time you run it, depending on when and where the OS thread gets interrupted. I think this very fragile way to test this.

@johnrs
Copy link
Author

johnrs commented Aug 7, 2017

@prasannavl - You are correct, of course. But if care is taken to run the test on a system that is lightly loaded and has multiple CPUs then the chances that the test's thread will get paused are pretty small.

Dowork was an attempt to introduce a delay which didn't depend on the system clock. It was messy and the result (sleep for slightly less than 1 tick + work for half a tick) was hard to explain to people. Some assumed that the result should be the sum of the two. Work should have caused rounding to the next clock tick, for a total of 2 ticks. A result of 1 tick showed that the sleep was less than 1 tick. It can be hard to follow the logic.

Oh, and don't forget to disable turbo (speed-stepping) mode if your BIOS has it on, else it can makes the "work" loop pretty inaccurate. On my PC with only a single core running the CPU clock is about 40% faster than when all 4 cores are running.

I finally just used the QPC timer to measure the intervals. The results were clearer since they didn't need any interpretation.

In later versions I didn't loop at all and just displayed a few samples. That was enough to show if the results were "steady" or not. But I think that the Manual test is the easiest to understand. It simply times a sleep which is started at a random time (relative to the system clock's phase).

If I wanted to determine the tick resolution in a production program I'd probably use a loop (~10 passes, well under the 1ms scheduling interval) and I'd use the minimum value, not the average value. I think it's fair to say that the thread would not be paused at least once during those 10 passes. Also, Ian suggested that locking the goroutine to the thread might help.

@prasannavl
Copy link

prasannavl commented Aug 8, 2017

I understand the premise. But just wrapping my head around to see if there's a better way to test this while dealing with such low intervals.

Work should have caused rounding to the next clock tick, for a total of 2 ticks. A result of 1 tick showed that the sleep was less than 1 tick.

My concern is the definition of what's 1 tick - and how that's measured. QPC can measure intervals with the highest possible accurately, but not how many ticks a particular program has gone through. I think the most accurate (and Microsoft recommended way) to do this will be to run xperf so we can visualize it right down to the thread quantum of the program in PerfView). But this gives just accuracy, not the precision needed, since the overheads might be way too high. The only possible way to measure that with some sense of reasonably reliability would be to write a kernel driver, which is just madness for this particular problem. I'm trying to wrap my head around it, to understand if there's any way to do a better analysis without doing that.

May be xperf and perfview might just be able to provide something indicative here. Not sure.

@johnrs
Copy link
Author

johnrs commented Aug 8, 2017

The intervals for Windows aren't that low, so QPC works fine. It has nearly 2000 times the resolution of a tick on my machine. From what I read this is typical.

My concern is the definition of what's 1 tick - and how that's measured.

My scheme seems to produce repeatable, correct results for the QPC, time.Now, and time.Sleep(1) ticks. Indeed, that the QPC tick measurement works (admittedly it does double or otherwise fail from time to time) showing 550ns says something. :)

QPC can measure intervals with the highest possible accurately, but not how many ticks a particular program has gone through.

It's going to continue running even if your program is paused. Is this what you mean? For testing I don't see this as a problem since we know that running the test multiple times, the smallest result is correct. I'm in the middle of making a few improvements to my test program, including the "loop and use the minimum" idea, now. I will post it on the Play site hopefully in an hour or so. I'll let you know when I do.

Also, if needed, using keyboard input to de-synchronize the time.Sleep can be replaced by some other long running, but randomized runtime, routine.

So I think that the Windows case is well covered. Platforms with faster ticks are another matter, however. I'd like to have a test for high resolution (<= 1us) ticks. Actually, 1us isn't hard to measure and I think that my current scheme should probably work down to about 100ns, but if it gets down to 10ns then yuck!

On Linux, if the time.Now tick and the time.Sleep(1) (1 = minimum sleep time) ticks are the same, as in Windows, then something with even more resolution is needed to make measurements on time.Sleep when used in a non-synchronized mode. Ouch! Do you happen to know if they are the same? Any idea what they are?

Or I could revert to the "work + sleep" test. It's sloppy but I eliminated about half of the inaccuracy in later versions by doing an auto-calibrate. I certainly wouldn't want to use it in any production project, however!

Finally, if the ticks get below 100ns you have to ask if it's even worth worrying about any +/- 1 error. :)

@johnrs
Copy link
Author

johnrs commented Aug 8, 2017

Done. https://play.golang.org/p/nH3LnBNQtj

This is what I see.

Parameter Tests

  QPC Timer tick:              549ns
  time.Now monotonic tick:     1ms
  time.Now wall clock tick:    1.0001ms
  time.Sleep(1) re QPC:        1.000487ms
  time.Sleep(1) re monotonic:  1.000057ms

Manual Test - Hit Enter for a loop of 5 Sleep(900us)'s, Ctrl-C to exit:

    1st       2nd       3rd       4th       5th         Avg (w/o 1st)
  ------    ------    ------    ------    ------        -------------
  1296us    1067us     893us     815us    1045us    ==>     955us
   763us     998us     934us     949us    1194us    ==>    1019us
  1312us     686us    1169us     963us    1026us    ==>     961us
   195us    1089us    1006us     958us     938us    ==>     998us
  1248us     951us     879us    1017us    1095us    ==>     986us  exit status 2

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@ALTree ALTree removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Apr 17, 2018
@gopherbot gopherbot modified the milestones: Go1.11, Unplanned May 23, 2018
hakankutluay added a commit to hakankutluay/echo that referenced this issue Jan 28, 2023
- remove unnecessary control
- change sleep
- change stop execution
- increase timeouts golang/go#17696

Co-authored-by: @erhanakp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows
Projects
None yet
Development

No branches or pull requests