Adding rate limit to pod webhook #3340

Open
pureklkl opened this issue Oct 10, 2024 · 2 comments
Labels
enhancement, needs triage

Comments

@pureklkl
Contributor

Component(s)

collector, auto-instrumentation

Is your feature request related to a problem? Please describe.

When a significant number of pods are created at the same time, the load can bring down the OpenTelemetry operator and, in turn, the Kubernetes API server, taking down the entire cluster.

Describe the solution you'd like

Add a rate limit to the pod webhook. Users should be able to enable/disable it and configure the maximum request rate when it is on.
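
As a rough sketch only (not the operator's existing code; the wrapper, handler names, and numbers below are made up for illustration), a token-bucket limiter in front of the pod mutation endpoint could look like this in Go, with the rate and burst coming from user-facing configuration:

package main

import (
	"net/http"

	"golang.org/x/time/rate"
)

// rateLimited wraps a webhook handler with a token-bucket limiter and
// rejects requests once the configured request rate is exceeded.
func rateLimited(next http.Handler, reqPerSec float64, burst int) http.Handler {
	limiter := rate.NewLimiter(rate.Limit(reqPerSec), burst)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			// A non-2xx response is treated as a webhook call failure by the
			// API server, so the webhook's failurePolicy (e.g. Ignore)
			// decides whether pod creation still proceeds.
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// podMutationHandler stands in for the real pod webhook handler.
func podMutationHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
}

func main() {
	mux := http.NewServeMux()
	// 100 req/s matches the rate used in the test below; the burst of 200 is arbitrary.
	mux.Handle("/mutate-v1-pod", rateLimited(podMutationHandler(), 100, 200))
	_ = http.ListenAndServe(":8443", mux) // TLS setup omitted for brevity
}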

Describe alternatives you've considered

No response

Additional context

This is a follow-up to the issue below: users need auto-instrumentation while also protecting the cluster.
open-telemetry/opentelemetry-helm-charts#1115

pureklkl added the enhancement and needs triage labels on Oct 10, 2024
@jaronoff97
Contributor

Hm... I think a rate limit makes sense, but we should also look into allowing people to set a label selector for the pod mutating webhook configuration. That way the operator's webhook would only look at the pods you care about.
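
For illustration, such an objectSelector on the pod webhook could be shaped like the snippet below, written against the admissionregistration/v1 Go types; the webhook name and label key are placeholders, not what the operator or chart actually uses:

package main

import (
	"fmt"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	webhook := admissionregistrationv1.MutatingWebhook{
		Name: "mpod.example.io", // illustrative name, not the operator's real webhook name
		// Only pods carrying this (made-up) label would reach the operator's
		// webhook; all other pod creations skip the webhook call entirely.
		ObjectSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{
				"opentelemetry.io/auto-instrument": "true",
			},
		},
	}
	fmt.Println(webhook.Name, webhook.ObjectSelector.MatchLabels)
}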

@pureklkl
Contributor Author

OTel operator pod webhook performance test

Result summary:

Application-level rate limiting has little impact on the operator's resource consumption, probably because under this test setup the pod instrumentation logic is simple enough that most of the resources are spent at the web-server level (e.g. handling TCP connections). Creating multiple replicas distributes the load across individual operator instances. Compared with the istiod operator, which has a higher resource cost, removing the resource limit and adding autoscaling should be preferred to resolve the perf issue we had (see istio autoscaling, istio resources).
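
For reference, the autoscaling suggestion above could take the shape of an HPA like the sketch below, written with the autoscaling/v2 Go types; the Deployment name and thresholds are assumptions, not values shipped by the chart:

package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(v int32) *int32 { return &v }

func main() {
	// CPU-based HPA that scales the operator Deployment out under webhook load.
	hpa := autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "opentelemetry-operator"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "opentelemetry-operator", // assumed Deployment name
			},
			MinReplicas: int32Ptr(1),
			MaxReplicas: 5,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: "cpu",
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: int32Ptr(80), // illustrative target
					},
				},
			}},
		},
	}
	fmt.Println(hpa.Name)
}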

Setup

  • MacBook Pro 2021.
  • Docker Desktop with 8 CPUs and 10 GB memory.
  • The operator was installed in a kind cluster on top of Docker, with no CPU limit and a 128m memory limit.
  • Mock pod mutation requests were generated by k6 on the local machine and sent to the operator webhook service through an ingress in the cluster that is port-mapped to the local machine.
  • Load test with 200 concurrent requests at a 100ms interval for 60 seconds; see the appendix for the k6 script. (The k8s API server defaults --max-mutating-requests-inflight to 200 and --max-requests-inflight to 400, ref.)
  • CPU and memory usage were read from the k8s metrics server.

Result

  • Rate limit 100 req/second
NAME                                      CPU(cores)   MEMORY(bytes)
opentelemetry-operator-7f575d4f86-zbtfl   595m         51Mi


     ✗ status ok
      ↳  99% — ✓ 100730 / ✗ 56
     ✗ not exceed rate limit
      ↳  6% — ✓ 6081 / ✗ 94705

     checks.........................: 52.98% 106811 out of 201572
     data_received..................: 42 MB  705 kB/s
     data_sent......................: 75 MB  1.3 MB/s
     http_req_blocked...............: avg=360.87µs min=0s       med=0s       max=293.98ms p(90)=1µs      p(95)=1µs     
     http_req_connecting............: avg=25.45µs  min=0s       med=0s       max=22.2ms   p(90)=0s       p(95)=0s      
     http_req_duration..............: avg=17.52ms  min=0s       med=5.05ms   max=829.12ms p(90)=34.48ms  p(95)=66.9ms  
       { expected_response:true }...: avg=17.53ms  min=675µs    med=5.06ms   max=829.12ms p(90)=34.49ms  p(95)=66.94ms 
     http_req_failed................: 0.05%  56 out of 100786
     http_req_receiving.............: avg=151.54µs min=0s       med=16µs     max=220.14ms p(90)=217µs    p(95)=498µs   
     http_req_sending...............: avg=133.94µs min=0s       med=26µs     max=51.9ms   p(90)=163µs    p(95)=365µs   
     http_req_tls_handshaking.......: avg=337.76µs min=0s       med=0s       max=285.86ms p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=17.24ms  min=0s       med=4.82ms   max=823.6ms  p(90)=34.07ms  p(95)=66.28ms 
     http_reqs......................: 100786 1676.93279/s
     iteration_duration.............: avg=119.13ms min=100.81ms med=106.31ms max=945.63ms p(90)=136.04ms p(95)=168.11ms
     iterations.....................: 100786 1676.93279/s
     vus............................: 200    min=200              max=200
     vus_max........................: 200    min=200              max=200


running (1m00.1s), 000/200 VUs, 100786 complete and 0 interrupted iterations
default ✓ [======================================] 200 VUs  1m0s
  • No rate limit
NAME                                      CPU(cores)   MEMORY(bytes)
opentelemetry-operator-5b8bccd65c-pdgc6   661m         57Mi

     ✗ status ok
      ↳  99% — ✓ 100331 / ✗ 44
     ✗ not exceed rate limit
      ↳  99% — ✓ 100331 / ✗ 44

     checks.........................: 99.95% 200662 out of 200750
     data_received..................: 44 MB  737 kB/s
     data_sent......................: 75 MB  1.2 MB/s
     http_req_blocked...............: avg=494.24µs min=0s       med=0s       max=399.13ms p(90)=1µs     p(95)=1µs     
     http_req_connecting............: avg=12.18µs  min=0s       med=0s       max=12.03ms  p(90)=0s      p(95)=0s      
     http_req_duration..............: avg=18.27ms  min=0s       med=4.24ms   max=780.26ms p(90)=31ms    p(95)=67ms    
       { expected_response:true }...: avg=18.28ms  min=684µs    med=4.24ms   max=780.26ms p(90)=31.01ms p(95)=67.01ms 
     http_req_failed................: 0.04%  44 out of 100375
     http_req_receiving.............: avg=116.86µs min=0s       med=14µs     max=355.76ms p(90)=133µs   p(95)=307µs   
     http_req_sending...............: avg=83.26µs  min=0s       med=25µs     max=25.65ms  p(90)=108µs   p(95)=207µs   
     http_req_tls_handshaking.......: avg=482.76µs min=0s       med=0s       max=390.97ms p(90)=0s      p(95)=0s      
     http_req_waiting...............: avg=18.07ms  min=0s       med=4.1ms    max=780.21ms p(90)=30.69ms p(95)=66.57ms 
     http_reqs......................: 100375 1670.086899/s
     iteration_duration.............: avg=119.58ms min=100.84ms med=104.96ms max=880.7ms  p(90)=132.4ms p(95)=168.31ms
     iterations.....................: 100375 1670.086899/s
     vus............................: 200    min=200              max=200
     vus_max........................: 200    min=200              max=200


running (1m00.1s), 000/200 VUs, 100375 complete and 0 interrupted iterations
default ✓ [======================================] 200 VUs  1m0s
  • No rate limit with 3 replicas
replicaCount: 3

NAME                                      CPU(cores)   MEMORY(bytes)
opentelemetry-operator-5b8bccd65c-6g7kt   330m         36Mi
opentelemetry-operator-5b8bccd65c-9r2jd   396m         42Mi
opentelemetry-operator-5b8bccd65c-h5h5g   351m         38Mi


     ✗ status ok
      ↳  99% — ✓ 94139 / ✗ 14
     ✗ not exceed rate limit
      ↳  99% — ✓ 94139 / ✗ 14

     checks.........................: 99.98% 188278 out of 188306
     data_received..................: 42 MB  692 kB/s
     data_sent......................: 70 MB  1.2 MB/s
     http_req_blocked...............: avg=824.8µs  min=0s       med=0s       max=750.95ms p(90)=1µs      p(95)=1µs     
     http_req_connecting............: avg=21.03µs  min=0s       med=0s       max=20.38ms  p(90)=0s       p(95)=0s      
     http_req_duration..............: avg=26.01ms  min=0s       med=7.48ms   max=648.31ms p(90)=60.45ms  p(95)=113.55ms
       { expected_response:true }...: avg=26.01ms  min=726µs    med=7.48ms   max=648.31ms p(90)=60.47ms  p(95)=113.56ms
     http_req_failed................: 0.01%  14 out of 94153
     http_req_receiving.............: avg=150.6µs  min=0s       med=16µs     max=119.66ms p(90)=152µs    p(95)=379µs   
     http_req_sending...............: avg=71.82µs  min=0s       med=26µs     max=21.52ms  p(90)=106µs    p(95)=183µs   
     http_req_tls_handshaking.......: avg=792.52µs min=0s       med=0s       max=719.95ms p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=25.79ms  min=0s       med=7.32ms   max=648.26ms p(90)=59.93ms  p(95)=113.04ms
     http_reqs......................: 94153  1566.39095/s
     iteration_duration.............: avg=127.52ms min=100.85ms med=108.14ms max=907.93ms p(90)=161.64ms p(95)=215.26ms
     iterations.....................: 94153  1566.39095/s
     vus............................: 200    min=200              max=200
     vus_max........................: 200    min=200              max=200


running (1m00.1s), 000/200 VUs, 94153 complete and 0 interrupted iterations
default ✓ [======================================] 200 VUs  1m0s
  • istiod operator test
NAME                      CPU(cores)   MEMORY(bytes)
istiod-79dbb5b667-6k9dl   810m         72Mi
istiod-79dbb5b667-hb76d   752m         89Mi
istiod-79dbb5b667-pjd7b   806m         71Mi
istiod-79dbb5b667-vj9jz   2219m        83Mi

     ✗ status ok
      ↳  99% — ✓ 32730 / ✗ 57
     ✗ not exceed rate limit
      ↳  99% — ✓ 32730 / ✗ 57

     checks.........................: 99.82% 65460 out of 65574
     data_received..................: 240 MB 3.9 MB/s
     data_sent......................: 26 MB  429 kB/s
     http_req_blocked...............: avg=1.26ms   min=0s       med=0s       max=293.71ms p(90)=1µs      p(95)=1µs     
     http_req_connecting............: avg=90.92µs  min=0s       med=0s       max=24.15ms  p(90)=0s       p(95)=0s      
     http_req_duration..............: avg=266.75ms min=0s       med=140.98ms max=7.33s    p(90)=618.21ms p(95)=897.46ms
       { expected_response:true }...: avg=267.22ms min=3.13ms   med=141.36ms max=7.33s    p(90)=618.89ms p(95)=898.49ms
     http_req_failed................: 0.17%  57 out of 32787
     http_req_receiving.............: avg=1.58ms   min=0s       med=89µs     max=933.93ms p(90)=1.5ms    p(95)=4.28ms  
     http_req_sending...............: avg=104.36µs min=0s       med=46µs     max=12.25ms  p(90)=172µs    p(95)=278µs   
     http_req_tls_handshaking.......: avg=1.17ms   min=0s       med=0s       max=270.41ms p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=265.06ms min=0s       med=139.83ms max=7.33s    p(90)=615.41ms p(95)=893.34ms
     http_reqs......................: 32787  539.777664/s
     iteration_duration.............: avg=368.88ms min=103.62ms med=241.61ms max=7.43s    p(90)=726.73ms p(95)=1s      
     iterations.....................: 32787  539.777664/s
     vus............................: 200    min=200            max=200
     vus_max........................: 200    min=200            max=200


running (1m00.7s), 000/200 VUs, 32787 complete and 0 interrupted iterations
default ✓ [======================================] 200 VUs  1m0s

Appendix

import http from 'k6/http';
import { sleep, check } from 'k6';
import { randomString, uuidv4 } from 'https://jslib.k6.io/k6-utils/1.4.0/index.js';

export const options = {
  // --max-mutating-requests-inflight, default to 200
  vus: 200,
  // A string specifying the total duration of the test run.
  duration: '60s',
  insecureSkipTLSVerify: true,
};

export default function() {
  const podName = "testload-pod-" + randomString(8);
  const uid = "5bb029e8-4c05-4eb5-9380-1f475648885d";
  const payloadCreation = JSON.stringify({
    apiVersion: "admission.k8s.io/v1",
    kind: "AdmissionReview",
    request: {
      uid: uuidv4(),
      kind: {
        group: "",
        version: "v1",
        resource: "pods"
      },
      name: podName,
      namespace: "default",
      operation: "CREATE",
      userInfo: {
        username: "system:serviceaccount:kube-system:default",
        uid: uid,
        groups: ["system:serviceaccounts", "system:serviceaccount:kube-system", "system:authenticated"]
      },
      object: {
        metadata: {
          name: podName,
          namespace: "default",
          annotations: {
             "instrumentation.opentelemetry.io/inject-java": "true"
          }
        },
        spec: {
          containers: [
            {
              name: "example-container",
              image: "nginx:latest"
            }
          ]
        }
      }
    }
  });
  const params = {
    headers: {
      "Content-Type": 'application/json'
    }
  };
  const res = http.post('https://localhost:443/mutate-v1-pod', payloadCreation, params);
  //console.log(JSON.stringify(res));
  check(res, {
    'status ok': (r) => r.status === 200 ,
    'not exceed rate limit': (r) => {
      if (r.body == null) {
        return false;
      }
      return !r.body.includes('rate limit excceed');
    }
  });
  sleep(0.1);
}
