Many services in a single namespace lead to assorted problems. #8498
/area API
/area autoscale
What version of Knative?
HEAD
Description of the problem.
I wanted to push on our limits a bit, and so I wrote the very innovative (patent pending 🤣) script below. I plotted the latency between `creationTimestamp` and `status.conditions[Ready].lastTransitionTime` here.
A few observations in this context:
- Revision creation latency creeps up over time.
- (anecdotally, by curling) the cold start latency of ALL ksvcs creeps up over time (from 2-3s to 10-12s, with all of the time spent between "Container create" and "Container start")
- After 1198 services were deployed, I started seeing deployments fail with:

  ```
  RevisionFailed: Revision "foo-1200-sjwyy-1" failed with message: Container failed with: standard_init_linux.go:211: exec user process caused "argument list too long"
  ```

- When the above happened, we stopped being able to cold start new services (they crash-loop with the same message)!
Here's where it gets interesting... On a whim, I tried picking back up in a second namespace, and things work! Not only do they work, but cold start latency for the new services is back down!
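One thing that might be worth checking here (an assumption on my part, not something I've verified in this issue): Kubernetes injects service-link environment variables into every new pod for each Service in the namespace, so the injected environment grows with the service count. That could line up with the slower container starts, the `argument list too long` failure at exec time, and the reset when switching to a fresh namespace. A rough way to eyeball the injected environment in one of the still-running pods (assuming Knative's standard `user-container` container name and the `serving.knative.dev/service` pod label; `foo-1` is just a placeholder service):

```bash
# Rough check: how big is the environment inside one of the running pods?
POD=$(kubectl get pods -l serving.knative.dev/service=foo-1 \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -c user-container -- env | wc -l   # number of injected variables
kubectl exec "$POD" -c user-container -- env | wc -c   # total size of the environment in bytes
```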
Steps to Reproduce the Problem
I needed a GKE cluster with at least 10 nodes (post-master resize) to tolerate the number of services this creates. I was playing with this in the context of mink, but there's no reason that would affect what I'm seeing.
```bash
#!/bin/bash -e
for i in $(seq 1 1500); do
  kn service create foo-$i --image=gcr.io/knative-samples/autoscale-go:0.1
  sleep 10
done
```
I gathered the latencies as CSV with:
```bash
kubectl get ksvc -ojsonpath='{range .items[*]}{.metadata.name},{.metadata.creationTimestamp},{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}' | pbcopy
```
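For reference, a minimal sketch of how those rows can be turned into latencies in seconds, assuming the output is saved to a file (here `latencies.csv`) instead of the clipboard, and assuming GNU `date` (on macOS, use `gdate` from coreutils):

```bash
#!/bin/bash
# Compute Ready latency in seconds per service from the CSV gathered above.
# Input lines look like: foo-1,2020-07-01T10:00:00Z,2020-07-01T10:00:12Z
while IFS=, read -r name created ready; do
  created_s=$(date -d "$created" +%s)   # creationTimestamp as epoch seconds
  ready_s=$(date -d "$ready" +%s)       # Ready lastTransitionTime as epoch seconds
  echo "$name,$((ready_s - created_s))"
done < latencies.csv
```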