Skip to content

Commit

Permalink
[SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succe…
Browse files Browse the repository at this point in the history
…eds to fix failure on slow machines

## What changes were proposed in this pull request?

I'm seeing several PR builder builds fail after https://github.com/apache/spark/pull/11725/files. Example:

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.4/lastFailedBuild/console

```
testCommunication(org.apache.spark.launcher.LauncherServerSuite)  Time elapsed: 0.023 sec  <<< FAILURE!
java.lang.AssertionError: expected:<app-id> but was:<null>
	at org.apache.spark.launcher.LauncherServerSuite.testCommunication(LauncherServerSuite.java:93)
```

However, other builds pass this same test, including the test when run locally and on the Jenkins PR builder. The failure itself concerns a change to how the test waits on a condition, and the wait can time out; therefore I think this is due to fast/slow machine differences.

This is an attempt at a hot fix; it's a little hard to verify since locally and on the PR builder, it passes anyway. The change itself should be harmless anyway.

Why didn't this happen before, if the new logic was supposed to be equivalent to the old? I think this is the sequence:

- First attempt to acquire semaphore for 10ms actually silently times out
- The changed being waited for happens just after that, a bit too late
- Assertion passes since condition became true just in time
- `release()` fires from the listener
- Next `tryAcquire` however immediately succeeds because the first `tryAcquire` didn't acquire anything, but its subsequent condition is not yet true; this would explain why the second one always fails

Versus the original using `notifyAll()`, there's a small difference: `wait()`-ing after `notifyAll()` just results in another wait; it doesn't make it return immediately. So this was a tiny latent issue that was masked by the semantics. Now the test asserts that the event actually happened (semaphore was acquired). (The timeout is still here to prevent the test from hanging forever, and to detect really slow response.) The timeout is increased to a second to allow plenty of time anyway.

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes apache#11763 from srowen/SPARK-13823.3.
  • Loading branch information
srowen committed Mar 16, 2016
1 parent 496d2a2 commit 9412547
Showing 1 changed file with 3 additions and 3 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -83,17 +83,17 @@ public void infoChanged(SparkAppHandle handle) {

client = new TestClient(s);
client.send(new Hello(handle.getSecret(), "1.4.0"));
semaphore.tryAcquire(10, TimeUnit.MILLISECONDS);
assertTrue(semaphore.tryAcquire(1, TimeUnit.SECONDS));

// Make sure the server matched the client to the handle.
assertNotNull(handle.getConnection());

client.send(new SetAppId("app-id"));
semaphore.tryAcquire(10, TimeUnit.MILLISECONDS);
assertTrue(semaphore.tryAcquire(1, TimeUnit.SECONDS));
assertEquals("app-id", handle.getAppId());

client.send(new SetState(SparkAppHandle.State.RUNNING));
semaphore.tryAcquire(10, TimeUnit.MILLISECONDS);
assertTrue(semaphore.tryAcquire(1, TimeUnit.SECONDS));
assertEquals(SparkAppHandle.State.RUNNING, handle.getState());

handle.stop();
Expand Down

0 comments on commit 9412547

Please sign in to comment.