-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry with double memory [BA-5933] #5180
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent PoC. I'm happy to 👍 if the PoC needs to go out asap, but in general I think the concept of "DoubleMemory" could be replaced with a more generic "MoreMemory".
|
||
jobKey.memoryDoubleMultiplier match { | ||
case 1 => runtimeAttributes | ||
case multiplier: Int => doubleRuntimeMemory(multiplier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ToL: I haven't been following the ticket too closely, so this is mostly thinking for future PRs. A hardcoded factor of 2.0
is fine for now, but I could see someone wanting eventually wanting a multiplier of 1.1
or even 10.0
.
@@ -26,7 +26,7 @@ import scala.util.Try | |||
/** | |||
* For uniquely identifying a job which has been or will be sent to the backend. | |||
*/ | |||
case class BackendJobDescriptorKey(call: CommandCallNode, index: Option[Int], attempt: Int) extends CallKey { | |||
case class BackendJobDescriptorKey(call: CommandCallNode, index: Option[Int], attempt: Int, memoryDoubleMultiplier: Int = 1) extends CallKey { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps call the field memoryMultiplier
? While here, I'm also a fan of Double
multipliers vs. Int
.
@@ -32,6 +32,7 @@ trait JobPaths { | |||
|
|||
def workflowPaths: WorkflowPaths | |||
def returnCodeFilename: String = "rc" | |||
def doubleMemoryRetryRCFilename: String = "double_memory_retry_rc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming we remove the hardcoding of the multiplier, how about val memoryRetryRcFilename = "memory_retry_rc"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Building on that idea, memory_retry_rc_filename
would make it more clear it's a filename we're looking at and not a code.
@@ -99,8 +99,8 @@ trait AsyncBackendJobExecutionActor { this: Actor with ActorLogging with SlowJob | |||
case Finish(FailedNonRetryableExecutionHandle(throwable, returnCode)) => | |||
completionPromise.success(JobFailedNonRetryableResponse(jobDescriptor.key, throwable, returnCode)) | |||
context.stop(self) | |||
case Finish(FailedRetryableExecutionHandle(throwable, returnCode)) => | |||
completionPromise.success(JobFailedRetryableResponse(jobDescriptor.key, throwable, returnCode)) | |||
case Finish(FailedRetryableExecutionHandle(throwable, returnCode, retryWithDoubleMemory)) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
retryWithMoreMemory
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps another producty question for @ruchim. The JIRA ticket title says "more" but the description says "double".
pushFailedCallMetadata(jobKey, returnCode, reason, retryableFailure = true) | ||
|
||
val newJobKey = jobKey.copy(attempt = jobKey.attempt + 1) | ||
val currentMemoryMultiplier = jobKey.memoryDoubleMultiplier | ||
val newMemoryMultiplier = if (retryWithDoubleMemory) currentMemoryMultiplier * 2 else currentMemoryMultiplier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This magic number 2.0
could be another knob config/option.
...scala/cromwell/engine/workflow/lifecycle/execution/job/preparation/JobPreparationActor.scala
Outdated
Show resolved
Hide resolved
include "papi_provider_config.inc.conf" | ||
genomics.compute-service-account = "centaur@broad-dsde-cromwell-dev.iam.gserviceaccount.com" | ||
filesystems.http {} | ||
retry-with-double-memory = ["OutOfMemoryError"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ToL: Future PR..?
retry-memory {
error-messages = ["OutOfMemoryError"]
multiplier = 2.0
}
@@ -447,7 +447,8 @@ class PipelinesApiAsyncBackendJobExecutionActor(override val standardParams: Sta | |||
privateDockerKeyAndEncryptedToken = dockerKeyAndToken, | |||
womOutputRuntimeExtractor = jobDescriptor.workflowDescriptor.outputRuntimeExtractor, | |||
adjustedSizeDisks = adjustedSizeDisks, | |||
virtualPrivateCloudConfiguration = jesAttributes.virtualPrivateCloudConfiguration | |||
virtualPrivateCloudConfiguration = jesAttributes.virtualPrivateCloudConfiguration, | |||
retryWithDoubleMemoryKeys = jesAttributes.retryWithDoubleMemoryKeys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pro Tip: If you leave a trailing comma here and on the lines below, you can pay-it-forward so future devs won't get extra git blame
for just adding a ,
to each list.
https://docs.scala-lang.org/sips/completed/trailing-commas.html
@@ -141,6 +141,11 @@ object ActionCommands { | |||
|fi""".stripMargin | |||
} | |||
|
|||
def checkIfStderrContainsRetryKeys(retryLookupKeys: List[String]): String = { | |||
val lookupKeysAsString = retryLookupKeys.mkString("|") | |||
s"grep -E -q '$lookupKeysAsString' /cromwell_root/stderr ; echo $$? > /cromwell_root/double_memory_retry_rc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ToL: Nice... nothing Papi here except rapid-prototyping.
@kshakir I had thought of making the multiplier 2 as a config option. But since this ticket had evolved to be a PoC, I kept it constant at 2. If we decide to not rush this PR, than I am all in for 'MoreMemory' as compared to 'DoubleMemory'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comments for now with a couple of Ruchi tags for clarification 😉
@@ -32,6 +32,7 @@ trait JobPaths { | |||
|
|||
def workflowPaths: WorkflowPaths | |||
def returnCodeFilename: String = "rc" | |||
def doubleMemoryRetryRCFilename: String = "double_memory_retry_rc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Building on that idea, memory_retry_rc_filename
would make it more clear it's a filename we're looking at and not a code.
@@ -72,6 +73,7 @@ trait JobPaths { | |||
lazy val script = callExecutionRoot.resolve(scriptFilename) | |||
lazy val dockerCid = callExecutionRoot.resolve(dockerCidFilename) | |||
lazy val returnCode = callExecutionRoot.resolve(returnCodeFilename) | |||
lazy val doubleMemoryRetryRC = callExecutionRoot.resolve(doubleMemoryRetryRCFilename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume if Khalid's comments above regarding "more" vs "double" are acted upon then this should change too.
@@ -51,4 +54,42 @@ class ActionBuilderSpec extends FlatSpec with Matchers with TableDrivenPropertyC | |||
} | |||
} | |||
|
|||
def grepForRetryKeysCommand(lookupString: String) = | |||
s"grep -E -q '$lookupString' /cromwell_root/stderr ; echo $$? > /cromwell_root/double_memory_retry_rc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this should be in one place in the production code with visibility such that the test code can see it
@@ -99,8 +99,8 @@ trait AsyncBackendJobExecutionActor { this: Actor with ActorLogging with SlowJob | |||
case Finish(FailedNonRetryableExecutionHandle(throwable, returnCode)) => | |||
completionPromise.success(JobFailedNonRetryableResponse(jobDescriptor.key, throwable, returnCode)) | |||
context.stop(self) | |||
case Finish(FailedRetryableExecutionHandle(throwable, returnCode)) => | |||
completionPromise.success(JobFailedRetryableResponse(jobDescriptor.key, throwable, returnCode)) | |||
case Finish(FailedRetryableExecutionHandle(throwable, returnCode, retryWithDoubleMemory)) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps another producty question for @ruchim. The JIRA ticket title says "more" but the description says "double".
/* | ||
`CheckingForRetryWithDoubleMemory` action in Papiv2 exits with code 0 if the stderr file contains keys | ||
mentioned in `retry-with-double-memory` config. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is generic code, not specific to Papiv2.
mockJobKeyWithMemoryMultiplier4.node returns call | ||
mockJobKeyWithMemoryMultiplier4.index returns None | ||
mockJobKeyWithMemoryMultiplier4.attempt returns 3 | ||
mockJobKeyWithMemoryMultiplier4.memoryDoubleMultiplier returns 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm a "double multiplier" with different values is a bit confusing
@@ -49,13 +50,16 @@ object PipelinesApiConfigurationAttributes { | |||
|
|||
final case class VirtualPrivateCloudConfiguration(name: String, subnetwork: Option[String], auth: GoogleAuthMode) | |||
final case class BatchRequestTimeoutConfiguration(readTimeoutMillis: Option[Int Refined Positive], connectTimeoutMillis: Option[Int Refined Positive]) | |||
final case class MemoryRetryConfiguration(errorKeys: List[String], multiplier: Double) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be propagated as multiplier: Double Refined Positive
, such that the multiplier type is always positive throughout the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I somehow missed this comment. I will take a look. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the multiplier
at this point, which is when Cromwell is started and Papi backend is initialized, is validated that it is a positive double, but not used. It is used in the StandardAsyncExecutionActor when a task has failed and the retryWithMoreMemory
condition is true. I am not revalidating again because if it was not a valid double, Cromwell would have thrown an error when it was started and validated in the Papi backend. Does that make sense? I can remove the multiplier
field from MemoryRetryConfiguration
altogether to make it easy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can also revalidate that the multiplier is a positive double again at StandardAsyncExecutionActor, but if at this point it is not a valid double, we can switch the multiplier value to default (which is 2) or capture the error and fail the task without retrying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After talking and coding f2f, the code has been updated to use Double Refined Positive
type for multiplier all the way through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I missed the reply to the open question above? If you point me at the reply and hit ♻️ and I'll re-review.
backend/src/main/scala/cromwell/backend/async/KnownJobFailureException.scala
Outdated
Show resolved
Hide resolved
@@ -141,6 +143,8 @@ class PipelinesApiAsyncBackendJobExecutionActor(override val standardParams: Sta | |||
case _ => false | |||
} | |||
|
|||
override lazy val memoryRetryFactor: Option[Double Refined Positive] = initializationData.papiConfiguration.papiAttributes.memoryRetryConfiguration.map(_.multiplier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TOL hmm really this should be > 1.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In strict sense, yes. But I am not sure if there is any harm to have a memory factor > 0 but less than < 1. Something to discuss during tech talk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with a memoryRetryFactor being GreaterEqual[]
vs. Positive
.
Also ToL for tech talk: I'm curious if this would be easier to follow, though it's about as verbose when used.
type MemoryRetryFactor = GreaterEqual[W.`1.0`.T]
Or is convention to alias the refined type?
type MemoryRetryFactor = Refined[Double, GreaterEqual[W.`1.0`.T]]
Or some mix of the two?
...nes/v1alpha2/src/main/scala/cromwell/backend/google/pipelines/v1alpha2/GenomicsFactory.scala
Outdated
Show resolved
Hide resolved
f7c3172
to
15dfb22
Compare
cba7284
to
86d3028
Compare
} | ||
} | ||
``` | ||
this tells Cromwell to retry the task with 1.1x memory when it sees either `OutOfMemoryError` or `Killed` in the `stderr` file. If the task has |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to be careful about is Killed might also mean preempted. So it may be better to check explicitly that the call wasn't preempted, and increase memory only when it wasn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is certainly a good point. Since this PR has been developed as a POC, we are trying to keep it simple. But I will note this down for when we start adding more features to it. Thank you!
86d3028
to
fdaf311
Compare
Unfortunately, this does not yet work in practice. I opened https://broadworkbench.atlassian.net/browse/BA-6112 re the cause. |
This PR adds a config option for user defined retries. With
memory-retry
the user can specify an array of strings which when encountered in thestderr
file by Cromwell, allows the task to be retried with a factor also mentioned in the config.JIRA issue link.