
Observations regarding instability of native fetch in AWS Lambda x86_64 nodejs20.x #3133

Open
@jandppw

Description

We consistently observe intermittent errors when using native fetch on AWS Lambda x86_64 nodejs20.x. We were
unable to pinpoint the particular place in the undici code that is the culprit of this behavior.

The intention of this submission is to report the issues and to record our observations as accurately as possible, so that
people with more detailed knowledge can isolate and possibly fix the issues, if they are not already fixed (fetch is still
experimental in Node 20).

Stable precursor

Versions of this service have been in production since 2022-05-19 on AWS Lambda x86_64 nodejs16.x using
node-fetch@2.6.7 without many issues. Notably, in the last 6 months of
2023 exactly 1 communication error calling the remote API was logged. It handles ~700 invocations per day.

Where are the issues observed?

The functionality of the service requires it to do ±25 remote API calls per invocation. The remote API is hosted on a
Windows VM through IIS 8, and is accessed through a CloudFront distribution, set up to use http2and3 (HTTP/2 and HTTP/3).

We observe this issue, or these issues, only on AWS Lambda x86_64 nodejs20.x, and have not observed them during
development on macOS. (This does not guarantee that the issues are limited to AWS Lambda x86_64 nodejs20.x, only that it
is possible that they are.)

A new version of the code was released on 2024-03-25, but this did not have any impact on the issues. It seems AWS
Lambda x86_64 nodejs20.x changed its Node version on 2024-04-03. We cannot report the exact Node version (and thus
undici version) because sadly we do not log it. Given that the latest LTS version of Node uses undici@5.28.4, this
reports on versions undici@<=5.28.4. The issues might have been fixed already in later releases, although we do not see
any relevant mention of this in the release notes since v5.25.1. Between
2024-02-06 and 2024-04-03 the errors are reported at node:internal/deps/undici/undici:12344:11. Between 2024-04-03 and
2024-04-15 the errors are reported at node:internal/deps/undici/undici:12345:11.
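For future reference, a minimal sketch of how the runtime versions could be logged once per Lambda instance, so that exact Node and undici versions are available in reports like this one (the log shape below is illustrative only, not our actual staticLambdaData):

// Runs at module load, i.e. once per fresh Lambda instance.
// process.versions.undici exposes the bundled undici version in recent Node releases.
const runtimeVersions = {
  node: process.version,
  undici: process.versions.undici ?? 'unknown'
}
console.log(JSON.stringify({ level: 'INFO', phase: 'COLD_START', details: runtimeVersions }))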

~1% of invocations fail with 4 different types of failures

The version of this service that shows the issues has been in production since 2024-02-06. It is deployed on AWS Lambda
x86_64 nodejs20.x, and replaced node-fetch with native fetch. Between 2024-02-06 and 2024-04-15 about 49000 Lambda
invocations were made and logged, of which 338 failed. That is about 0.7%. When an invocation does not fail, it
calls fetch ~25 times. This means ~0.03% of fetch calls fail, but, as you will see, the impact might be greater,
because most failures appear to happen on the first fetch call of an invocation of a fresh Lambda instance.
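For clarity, the back-of-the-envelope arithmetic behind those percentages:

// 338 failed invocations out of ~49000 logged invocations
const invocationFailureRate = 338 / 49000          // ≈ 0.0069, i.e. ~0.7%
// a non-failing invocation performs ~25 fetch calls
const fetchFailureRate = 338 / (49000 * 25)        // ≈ 0.00028, i.e. ~0.03%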

All failures are logged as 1 of 4 “fetch failed” cause variants:

  • { cause: { errno: -110, code: 'ETIMEDOUT', syscall: 'write' } } (163)
  • UND_ERR_SOCKET (152)
  • { cause: { name: 'ConnectTimeoutError', code: 'UND_ERR_CONNECT_TIMEOUT', message: 'Connect Timeout Error' } } (16)
  • { cause: { errno: -32, code: 'EPIPE', syscall: 'write' } } (7)

The UND_ERR_SOCKET causes are of the general form

{
  cause: {
    name: 'SocketError',
    code: 'UND_ERR_SOCKET',
    socket: {
      localAddress: 'NNN.NNN.NN.N',
      localPort: NNNNN,
      remoteAddress: undefined,
      remotePort: undefined,
      remoteFamily: undefined,
      timeout: undefined,
      bytesWritten: 580,
      bytesRead: 1915
    }
  }
}

Sometimes the remote address and remote port are filled in, sometimes not.

Sadly, we cannot pinpoint exactly where the errors occur, since our logs do not store the stack trace of cause (a sketch
of how that could be captured follows below). Trying to find out through code inspection did not turn up anything; the
usage of processResponse is a bit opaque.
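A minimal sketch of how the cause could be serialized next to the error in a future version of the logging code (serializeCause is purely illustrative, not part of our code):

// Illustrative only: unwrap err.cause so its stack trace ends up in the log.
function serializeCause (err) {
  const cause = err && err.cause
  if (!cause) return undefined
  return {
    name: cause.name,
    code: cause.code,
    errno: cause.errno,
    syscall: cause.syscall,
    message: cause.message,
    stack: cause.stack
  }
}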

ETIMEDOUT and EPIPE

All samples of the ETIMEDOUT and EPIPE cause we examined show < 15ms between the FETCHING… log and the ERROR
log, and these calls are not registered in the remote API access log. We see the FETCHING… log, but never the
RESPONSE_RECEIVED log in these cases (see below).

In all samples we examined, the error occurs on the first call to fetch in the invocation. We assume these calls never
left the Lambda environment, and that they might occur on the first call to fetch after a fresh Lambda instance is spawned.

This makes us believe these cases are bugs in undici, or in the AWS Lambda x86_64 nodejs20.x OS.
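One way to confirm the "first fetch of a fresh instance" suspicion would be to count fetch calls per Lambda instance. A minimal sketch, with instrumentedFetch and the counter purely illustrative and not part of our code:

// Module scope: the counter survives warm invocations and resets only when a
// fresh Lambda instance is spawned, so a failure logged with callNr 1 would
// confirm the cold-start hypothesis.
let fetchCallCount = 0

async function instrumentedFetch (uri, options) {
  fetchCallCount += 1
  console.log(JSON.stringify({ phase: 'FETCHING', uri, callNr: fetchCallCount }))
  return fetch(uri, options)
}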

UND_ERR_CONNECT_TIMEOUT and UND_ERR_SOCKET

The UND_ERR_CONNECT_TIMEOUT and UND_ERR_SOCKET errors do not follow this pattern. Multiple calls to the remote
API with fetch succeed before the error occurs. We assume these calls do go out to the remote API.

Yet, such errors rarely occurred when we used this code on AWS Lambda x86_64 nodejs16.x with
node-fetch@2.6.7 (1 such error in the last 6 months of 2023).

Code

fetch is used twice in this code base, but all errors stem from the same use:

  
  log.info(module, 'get', flowId, 'FETCHING …', { uri })
  const response = await fetch(uri, {
    headers: { accept: 'application/json', authorization }
  })
  log.info(module, 'get', flowId, 'RESPONSE_RECEIVED', { uri, statusCode: response.status })
  

The log.info code is:

/**
 * @param {NodeModule} callingModule
 * @param {string} functionName
 * @param {UUID} flowId
 * @param {string} phase
 * @param {object} [details]
 */
function info(callingModule, functionName, flowId, phase, details) {
  strictEqual(typeof flowId, 'string')
  try {
    const now = Date.now()
    const message = {
      lambda: staticLambdaData(),
      flowId,
      now,
      level: 'INFO',
      module: callingModule.filename,
      function: functionName,
      phase,
      details: masked(details, [])
    }
    console.log(JSON.stringify(message))
  } catch (ignore) /* istanbul ignore next */ {
    errorLogging(callingModule, functionName, flowId, phase)
  }
}

Note that now in the logged JSON is determined synchronously, immediately before the call to fetch.

The try/catch around the call to fetch sits several levels higher in the call stack:


} catch (err) {
  log.error(module, 'handler', responseFlowId, err)
  return {
    statusCode: 500
  }
}

The log.error code is:

/**
 * @param {NodeModule} callingModule
 * @param {string} functionName
 * @param {UUID} flowId
 * @param {*} err
 */
function error(callingModule, functionName, flowId, err) {
  strictEqual(typeof flowId, 'string')
  try {
    const now = Date.now()
    const message = {
      lambda: staticLambdaData(),
      flowId,
      now,
      level: 'ERROR',
      module: callingModule.filename,
      function: functionName,
      message: err && err.message,
      stack: err && err.stack,
      details: util.inspect(masked(err, []), { depth: 9, maxArrayLength: 5 })
    }
    console.log(JSON.stringify(message))
  } catch (ignore) /* istanbul ignore next */ {
    errorLogging(callingModule, functionName, flowId, 'ERROR')
  }
}

Note that now in the logged JSON is determined here.

From the logged now in the FETCHING… log and the logged now in the ERROR log we can determine an upper bound on
the duration between our fetch call and the occurrence of the error.
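Concretely, assuming the two log records are correlated via flowId (the entry names below are illustrative):

// Both now values are Date.now() milliseconds, so their difference is an upper
// bound on how long the failing fetch call was in flight.
const upperBoundMs = errorLogEntry.now - fetchingLogEntry.now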

Possibly related issues
