Observations regarding instability of native fetch in AWS Lambda x86_64 nodejs20.x #3133
## Description
We observe consistent, intermittent errors when using native `fetch` in an AWS Lambda x86_64 `nodejs20.x` runtime. We were unable to find the particular place in the undici code that is the culprit of this behavior.

The intention of this submission is to report the issues and to record our observations as accurately as possible, to enable people with more detailed knowledge to isolate and possibly fix the issues, if they are not already fixed (`fetch` is still experimental in Node 20).
## Stable precursor
Versions of this service have been in production since 2022-05-19 on AWS Lambda x86_64 nodejs16.x using `node-fetch@2.6.7` without many issues. Notably, in the last 6 months of 2023 exactly 1 communication error calling the remote API was logged. The service handles ~700 invocations per day.
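For illustration, a minimal sketch of that call pattern (not our production code; `getJson`, `uri`, and `authorization` are placeholder names):

```js
// Sketch of the node-fetch@2 call pattern that ran stably on nodejs16.x.
// node-fetch@2 is CommonJS; all names below are placeholders.
const fetch = require('node-fetch')

async function getJson(uri, authorization) {
  const response = await fetch(uri, {
    headers: { accept: 'application/json', authorization }
  })
  if (!response.ok) {
    throw new Error(`unexpected status ${response.status} for ${uri}`)
  }
  return response.json()
}
```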
## Where are the issues observed?
The functionality of the service requires it to make ±25 remote API calls per invocation. The remote API is hosted on a Windows VM through IIS 8, and is accessed through a CloudFront distribution, set up to use `http2and3`.
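For completeness, a hypothetical sketch, in AWS CDK, of a distribution with that setting (everything here is a placeholder, not taken from our actual setup):

```js
// Hypothetical CDK sketch of a CloudFront distribution configured for
// HTTP/2 and HTTP/3; stack names and the origin host are placeholders.
const cdk = require('aws-cdk-lib')
const cloudfront = require('aws-cdk-lib/aws-cloudfront')
const origins = require('aws-cdk-lib/aws-cloudfront-origins')

const app = new cdk.App()
const stack = new cdk.Stack(app, 'ApiStack')
new cloudfront.Distribution(stack, 'ApiDistribution', {
  httpVersion: cloudfront.HttpVersion.HTTP2_AND_3,
  defaultBehavior: {
    origin: new origins.HttpOrigin('iis.example.org') // Windows VM, IIS 8
  }
})
```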
We observe this issue, or these issues, only in AWS Lambda x86_64 nodejs20.x, and have not observed them during development on macOS. (This does not prove that the issues are limited to AWS Lambda x86_64 nodejs20.x, only that it is possible.)
A new version of the code was released on 2024-03-25, but this did not have any impact on the issues. It seems AWS Lambda x86_64 nodejs20.x changed Node version on 2024-04-03. We cannot report the exact Node version (and thus `undici` version) because, sadly, we do not log it. Given that the latest LTS version of Node uses `undici@5.28.4`, this report covers versions `undici@<=5.28.4`. The issues might have been fixed already in later releases, although we do not see any relevant mention of it in the release notes since v5.25.1. Between 2024-02-06 and 2024-04-03 the errors are reported at `node:internal/deps/undici/undici:12344:11`. Between 2024-04-03 and 2024-04-15 the errors are reported at `node:internal/deps/undici/undici:12345:11`.
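To make future reports more precise, the runtime versions can be logged on startup; a minimal sketch (`process.versions.undici` reports the bundled undici version in recent Node releases):

```js
// Sketch: log the Node version and bundled undici version on cold start, so
// errors can be tied to exact releases.
console.log(JSON.stringify({
  node: process.version,           // e.g. 'v20.12.1'
  undici: process.versions.undici  // e.g. '5.28.4'
}))
```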
## ~1% of invocations fail, with 4 different types of failures
The version of this service that shows the issues has been in production since 2024-02-06. It is deployed on AWS Lambda x86_64 nodejs20.x, and replaced `node-fetch` with native `fetch`. Between 2024-02-06 and 2024-04-15 about 49000 Lambda invocations have been made and logged, of which 338 failed (338 / 49000 ≈ 0.7%). When an invocation does not fail, it calls `fetch` ~25 times. This means ~0.03% of `fetch` calls fail (338 / (49000 × 25) ≈ 0.03%), but, as you will see, the impact might be greater, because most failures appear to happen on the first `fetch` call of an invocation of a fresh Lambda instance.
All failures are logged as 1 of 4 “fetch failed” `cause` variants:

- `{ cause: { errno: -110, code: 'ETIMEDOUT', syscall: 'write' } }` (163)
- `UND_ERR_SOCKET` (152)
- `{ cause: { name: 'ConnectTimeoutError', code: 'UND_ERR_CONNECT_TIMEOUT', message: 'Connect Timeout Error' } }` (16)
- `{ cause: { errno: -32, code: 'EPIPE', syscall: 'write' } }` (7)
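For anyone reproducing this, a sketch of how such errors can be bucketed by their `cause` (`errors` is a placeholder for the parsed log entries; this is not our actual log analysis):

```js
// Sketch: count "fetch failed" errors per cause code or name. `errors` is a
// placeholder array of the original error objects, not our real log format.
function bucketByCause(errors) {
  const buckets = new Map()
  for (const err of errors) {
    const cause = err && err.cause
    const key = (cause && (cause.code || cause.name)) || 'unknown'
    buckets.set(key, (buckets.get(key) || 0) + 1)
  }
  return buckets // e.g. Map { 'ETIMEDOUT' => 163, 'UND_ERR_SOCKET' => 152, … }
}
```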
The `UND_ERR_SOCKET` causes are of the general form:

```
{
  cause: {
    name: 'SocketError',
    code: 'UND_ERR_SOCKET',
    socket: {
      localAddress: 'NNN.NNN.NN.N',
      localPort: NNNNN,
      remoteAddress: undefined,
      remotePort: undefined,
      remoteFamily: undefined,
      timeout: undefined,
      bytesWritten: 580,
      bytesRead: 1915
    }
  }
}
```
Sometimes the remote address and remote port are filled in, sometimes not.
Sadly, we cannot pinpoint exactly where the errors occur, since our logs do not store the stack trace of `cause`. Trying to find out through code inspection did not turn up anything; the usage of `processResponse` is a bit opaque.
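A sketch of how the `cause` stack could be captured by a logger such as the `log.error` shown below (a possible improvement, not something we have in place):

```js
// Sketch: extract the stack, code, and message of an error's `cause`, so a
// future occurrence can be pinpointed inside undici. This could be merged
// into the `message` object of the `error` logger shown below.
function causeDetails(err) {
  const cause = err && err.cause
  return {
    causeMessage: cause && cause.message,
    causeCode: cause && cause.code,
    causeStack: cause && cause.stack
  }
}
```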
## `ETIMEDOUT` and `EPIPE`
All samples of the `ETIMEDOUT` and `EPIPE` `cause` we examined show < 15 ms between the `FETCHING …` log and the `ERROR` log, and these calls are not registered in the remote API access log. We see the `FETCHING …` log, but never the `RESPONSE_RECEIVED` log in these cases (see below).
In all samples we examined, the error occurs in the first call to `fetch` in the invocation. We assume these calls never left the Lambda environment, and they might occur on the first call to `fetch` after spawning a Lambda instance.

This makes us believe these cases are bugs in `undici`, or in the OS of the AWS Lambda x86_64 nodejs20.x runtime.
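If that assumption holds, a blunt workaround would be a single retry on these codes; a sketch we have not validated in production:

```js
// Sketch of an unvalidated workaround: retry once when the failure matches
// the ETIMEDOUT / EPIPE pattern described above.
async function fetchWithRetryOnFirstCallFailure(uri, options) {
  try {
    return await fetch(uri, options)
  } catch (err) {
    const code = err && err.cause && err.cause.code
    if (code === 'ETIMEDOUT' || code === 'EPIPE') {
      return fetch(uri, options) // one retry; a second failure propagates
    }
    throw err
  }
}
```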
## `UND_ERR_CONNECT_TIMEOUT` and `UND_ERR_SOCKET`
The `UND_ERR_CONNECT_TIMEOUT` and `UND_ERR_SOCKET` failures do not follow this pattern. Multiple calls to the remote API with `fetch` succeed before the error occurs. We assume these calls do go out to the remote API.
Yet, such errors rarely occurred when we used this code on AWS Lambda x86_64 nodejs16.x with `node-fetch@2.6.7` (1 such error in the last 6 months of 2023).
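For the connect timeouts, undici's `Agent` accepts a `connect.timeout` option; a sketch of raising it globally (the value is a guess, and this assumes the npm `undici` package's global dispatcher is picked up by native `fetch`, which we believe holds for recent Node releases):

```js
// Sketch: raise undici's connect timeout (default 10 s) for all fetch calls.
// Requires `undici` as a dependency; the value is illustrative, not tuned.
const { Agent, setGlobalDispatcher } = require('undici')

setGlobalDispatcher(new Agent({
  connect: { timeout: 30_000 } // ms
}))
```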
## Code
`fetch` is used twice in this code base, but all errors stem from the same use:
```js
…
log.info(module, 'get', flowId, 'FETCHING …', { uri })
const response = await fetch(uri, {
  headers: { accept: 'application/json', authorization }
})
log.info(module, 'get', flowId, 'RESPONSE_RECEIVED', { uri, statusCode: response.status })
…
```
The `log.info` code is:
```js
/**
 * @param {NodeModule} callingModule
 * @param {string} functionName
 * @param {UUID} flowId
 * @param {string} phase
 * @param {object} [details]
 */
function info(callingModule, functionName, flowId, phase, details) {
  strictEqual(typeof flowId, 'string')
  try {
    const now = Date.now()
    const message = {
      lambda: staticLambdaData(),
      flowId,
      now,
      level: 'INFO',
      module: callingModule.filename,
      function: functionName,
      phase,
      details: masked(details, [])
    }
    console.log(JSON.stringify(message))
  } catch (ignore) /* istanbul ignore next */ {
    errorLogging(callingModule, functionName, flowId, phase)
  }
}
```
Note that `now` in the logged JSON is determined synchronously, immediately before the call to `fetch`.
The `try`-`catch` around the call to `fetch` is several levels higher in the call stack:
```js
…
} catch (err) {
  …
  log.error(module, 'handler', responseFlowId, err)
  return {
    statusCode: 500,
    …
  }
}
```
The `log.error` code is:
```js
/**
 * @param {NodeModule} callingModule
 * @param {string} functionName
 * @param {UUID} flowId
 * @param {*} err
 */
function error(callingModule, functionName, flowId, err) {
  strictEqual(typeof flowId, 'string')
  try {
    const now = Date.now()
    const message = {
      lambda: staticLambdaData(),
      flowId,
      now,
      level: 'ERROR',
      module: callingModule.filename,
      function: functionName,
      message: err && err.message,
      stack: err && err.stack,
      details: util.inspect(masked(err, []), { depth: 9, maxArrayLength: 5 })
    }
    console.log(JSON.stringify(message))
  } catch (ignore) /* istanbul ignore next */ {
    errorLogging(callingModule, functionName, flowId, 'ERROR')
  }
}
```
Note that `now` in the logged JSON is determined here, at the time the error is logged.
From the logged `now` in the `FETCHING …` log and the logged `now` in the `ERROR` log we can determine an upper bound on the duration between our `fetch` call and the occurrence of the error.
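A sketch of that computation over two parsed log entries with the same `flowId` (structure as produced by the loggers above):

```js
// Sketch: upper bound, in ms, on the duration between the fetch call and the
// error, from the `now` fields of matching FETCHING … and ERROR log entries.
const { strictEqual } = require('node:assert')

function upperBoundMs(fetchingEntry, errorEntry) {
  strictEqual(fetchingEntry.flowId, errorEntry.flowId)
  return errorEntry.now - fetchingEntry.now
}
```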