
vminsert: change replication implementation #8044

Open
@Haleygo

Description

Is your feature request related to a problem? Please describe

vminsert has the -replicationFactor=N command-line flag to store N copies of every ingested sample on N distinct vmstorage nodes.
Currently, it takes two steps for vminsert to send ingested data to the vmstorage nodes.

[Image: diagram of the current two-step ingestion flow]

step 1: vminsert hashes the metric name and labels to decide which node's buffer the data is appended to;
step 2: buffered data is sent to N distinct vmstorage nodes, starting with the corresponding node. For example, with -replicationFactor=1, data in the storageNode1 buffer is sent to storageNode1; if that fails, vminsert tries storageNode2, and so on.
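Step 1 can be sketched as follows. This is a deliberately simplified illustration, not vminsert's actual code: the real implementation uses its own consistent-hashing scheme rather than FNV modulo, and `nodeIndexFor` is a hypothetical name.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// nodeIndexFor sketches step 1: hash the series identity (metric name plus
// labels) and map it to a storage node index. Illustrative only; real
// vminsert uses its own consistent-hashing scheme, not FNV modulo.
func nodeIndexFor(metricWithLabels string, nodeCount int) int {
	h := fnv.New64a()
	h.Write([]byte(metricWithLabels))
	return int(h.Sum64() % uint64(nodeCount))
}

func main() {
	// The same series always lands in the same node's buffer.
	idx1 := nodeIndexFor(`http_requests_total{job="api",instance="a"}`, 4)
	idx2 := nodeIndexFor(`http_requests_total{job="api",instance="a"}`, 4)
	fmt.Println(idx1 == idx2) // deterministic sharding
}
```

The key property is determinism: a given series always maps to the same primary node, which is what makes step 2's "start with the corresponding node" behavior possible.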

This separates the sharding and replication logic, and the behavior is also affected by the flags -disableRerouting (default true), -disableReroutingOnUnavailable (default false) and -dropSamplesOnOverload (default false).

However, it can cause the following issues (assuming -replicationFactor=2 and a series that is sharded to the storageNode1 buffer):

  1. With the default -disableRerouting=true, -disableReroutingOnUnavailable=false: when storageNode2 is down, its workload is evenly distributed among the other three nodes. However, the data that was replicated to storageNode2 from storageNode1's buffer will all be sent to storageNode3; see #8103 (vmstorage pod restart causing cpu spike on sequentially higher vmstorage pod).
  2. With -disableReroutingOnUnavailable=true: when storageNode1 is down, the series won't be appended to the buffer at all, so storageNode2 won't receive it either, resulting in complete data loss;
  3. With -dropSamplesOnOverload=true: when storageNode1 is degraded due to host issues or a noisy neighbor, the series likewise won't be appended to the buffer, so storageNode2 won't receive it, resulting in complete data loss.
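The data-loss mode in issues 2 and 3 boils down to replication happening only at flush time, from the primary node's buffer. A minimal sketch of that flow (names like `ingest` are illustrative, not vminsert's actual identifiers):

```go
package main

import "fmt"

// Sketch of the current (pre-change) flow with -replicationFactor=2:
// replication happens only when the primary node's buffer is flushed, so if
// the sample never enters that buffer, the replica is never written either.
type node struct {
	name      string
	available bool
	buffer    []string
}

// ingest appends the sample only to the primary node's buffer; replication to
// the next node happens later, at flush time, from that same buffer.
func ingest(sample string, primary *node) bool {
	if !primary.available {
		return false // sample dropped: the replica is lost as well
	}
	primary.buffer = append(primary.buffer, sample)
	return true
}

func main() {
	n1 := &node{name: "storageNode1", available: false}
	ok := ingest("sample-A", n1)
	fmt.Println(ok) // false: complete data loss despite replicationFactor=2
}
```

Because the replica copy is derived from the primary's buffer rather than written independently, a single unavailable or degraded node defeats the replication factor entirely.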

Issue 1 degrades cluster performance.
Issues 2 and 3 break the expected replication guarantee: only one storage node is down, yet with -replicationFactor=2 the data is still lost completely.

Describe the solution you'd like

Perform data replication before appending data to the storageNode buffers.

[Image: diagram of the proposed flow with replication performed before buffering]

In this implementation, if a series is assigned to storageNode1 and storageNode2 (referring back to the issues above):

  1. With the default -disableRerouting=true, -disableReroutingOnUnavailable=false: when storageNode2 is down, the behavior remains unchanged, but it can be improved by setting -disableReroutingOnUnavailable=true if necessary; see the comment on #8103 (vmstorage pod restart causing cpu spike on sequentially higher vmstorage pod).
  2. With -disableReroutingOnUnavailable=true: when storageNode1 is down, the series goes only to the storageNode2 buffer;
  3. With -dropSamplesOnOverload=true: when storageNode1 is degraded, the series goes only to the storageNode2 buffer.
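The proposed flow can be sketched as picking all replica targets up front and buffering independently on each, so an unavailable primary no longer blocks the replica. This is an illustration under the assumption that replicas go to the next nodes in ring order; `replicaIndexes` is a hypothetical name:

```go
package main

import "fmt"

// replicaIndexes sketches the proposed flow: choose replicationFactor
// distinct nodes up front and append the sample to each node's buffer
// independently, before any network send happens. Illustrative only.
func replicaIndexes(primary, replicationFactor, nodeCount int) []int {
	idxs := make([]int, 0, replicationFactor)
	for i := 0; i < replicationFactor && i < nodeCount; i++ {
		idxs = append(idxs, (primary+i)%nodeCount)
	}
	return idxs
}

func main() {
	// With -replicationFactor=2 and 4 nodes, a series sharded to node 0
	// is buffered on nodes 0 and 1; if node 0 is down or overloaded, the
	// copy in node 1's buffer is unaffected.
	fmt.Println(replicaIndexes(0, 2, 4)) // [0 1]
}
```

Because each replica is written to its own buffer independently, losing one node costs at most one copy, which is exactly what -replicationFactor=2 is supposed to guarantee.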

Regarding sending data from the buffer to the actual storage node:

  1. With the default -disableRerouting=true, if data can't be sent from the buffer to the corresponding storage node, it is dropped directly.
  2. If the user specifies -disableRerouting=false and data can't be sent from the buffer to the corresponding storage node, vminsert tries to reroute it to another storage node (an additional check may be needed to avoid writing two replicas to the same node when -replicationFactor>1).
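The additional check mentioned in point 2 could look like the following sketch: when rerouting a failed flush, skip any node that already holds a replica of the same data, so -replicationFactor>1 still yields distinct copies. `pickRerouteTarget` and its parameters are hypothetical names for illustration:

```go
package main

import "fmt"

// pickRerouteTarget sketches the extra check for rerouting with
// -disableRerouting=false: walk the nodes after the failed one and return the
// first node that does not already hold a replica, or -1 if none qualifies.
func pickRerouteTarget(failed int, replicaHolders map[int]bool, nodeCount int) int {
	for i := 1; i < nodeCount; i++ {
		candidate := (failed + i) % nodeCount
		if !replicaHolders[candidate] {
			return candidate
		}
	}
	return -1 // no eligible node; caller may drop or retry later
}

func main() {
	// Nodes 0 and 1 already hold replicas; the flush to node 0 failed, so
	// the reroute must land on node 2, not node 1.
	fmt.Println(pickRerouteTarget(0, map[int]bool{0: true, 1: true}, 4)) // 2
}
```

Without this check, rerouting could send the failed copy to a node that already holds the other replica, silently reducing the effective replication factor.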

Describe alternatives you've considered

No response

Additional information

related to #6801, #7995
