dpif-netdev: Add PMD load based sleeping.
Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on a polling iteration of the PMD.

Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).

Sleep time will be increased on each iteration where the low load
conditions remain, up to the max sleep time which is set
by the user, e.g.:
ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500

The default pmd-maxsleep value is 0, which means that no sleeps
will occur and the default behaviour is unchanged from previous releases.

Also add new stats to pmd-perf-show to get visibility of operation
e.g.
...
   - sleep iterations:       153994  ( 76.8 % of iterations)
   Sleep time (us):         9159399  ( 59 us/iteration avg.)
...

Reviewed-by: Robin Jarry <rjarry@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
kevintraynor authored and igsilya committed Jan 12, 2023
1 parent f4c8841 commit de3bbdc
Showing 7 changed files with 213 additions and 10 deletions.
54 changes: 54 additions & 0 deletions Documentation/topics/dpdk/pmd.rst
@@ -324,5 +324,59 @@ A user can use this option to set a minimum frequency of Rx queue to PMD
reassignment due to PMD Auto Load Balance. For example, this could be set
(in min) such that a reassignment is triggered at most every few hours.

PMD load based sleeping (Experimental)
--------------------------------------

PMD threads constantly poll Rx queues which are assigned to them. In order to
reduce the CPU cycles they use, they can sleep for small periods of time
when there is no load or very-low load on all the Rx queues they poll.

This can be enabled by setting the max requested sleep time (in microseconds)
for a PMD thread::

$ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=500

Non-zero values will be rounded up to the nearest 10 microseconds to avoid
requesting very small sleep times.

With a non-zero max value, a PMD may request to sleep for an incrementing amount
of time, up to the maximum time. If at any point at least half a batch of
packets (i.e. 16) is received from an Rx queue that the PMD is polling, the
requested sleep time will be reset to 0. From that point no sleeps will occur
until the no/low load conditions return.

Sleeping in a PMD thread will mean there is a period of time when the PMD
thread will not process packets. Sleep times requested are not guaranteed
and can differ significantly depending on system configuration. The actual
time not processing packets will be determined by the sleep and processor
wake-up times and should be tested with each system configuration.

Sleep time statistics for 10 secs can be seen with::

$ ovs-appctl dpif-netdev/pmd-stats-clear \
&& sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show

Example output, showing that during the last 10 seconds, 76.8% of iterations
had a sleep of some length. The total amount of sleep time was about 9.16 seconds
and the average sleep time per iteration was 59 microseconds::

- sleep iterations: 153994 ( 76.8 % of iterations)
Sleep time (us): 9159399 ( 59 us/iteration avg.)

Any potential power saving from PMD load based sleeping is dependent on the
system configuration (e.g. enabling processor C-states) and workloads.

.. note::

If there is a sudden spike of packets while the PMD thread is sleeping and
the processor is in a low-power state it may result in some lost packets or
extra latency before the PMD thread returns to processing packets at full
rate.

.. note::

By default the Linux kernel groups timer expirations, and this can add an
overhead of up to 50 microseconds to a requested timer expiration.

.. _ovs-vswitchd(8):
http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
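
For illustration, the backoff policy described in the documentation above can be
summarised in a small standalone C sketch. This is illustrative only: the helper
name and the demo values are invented here and are not part of OVS.

/*
 * Minimal standalone sketch of the load-based sleep policy described above.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define SLEEP_THRESH_PKTS 16   /* Half a batch (NETDEV_MAX_BURST / 2). */
#define SLEEP_INC_US      10   /* Sleep increment per low-load iteration. */

/* Returns the sleep request (in microseconds) for the next iteration. */
static uint64_t
next_sleep_request(uint64_t cur_sleep_us, uint64_t max_sleep_us,
                   int busiest_rxq_pkts)
{
    if (busiest_rxq_pkts >= SLEEP_THRESH_PKTS) {
        return 0;                            /* Load detected: stop sleeping. */
    }
    if (cur_sleep_us + SLEEP_INC_US < max_sleep_us) {
        return cur_sleep_us + SLEEP_INC_US;  /* Back off a little more. */
    }
    return max_sleep_us;                     /* Cap at the configured maximum. */
}

int
main(void)
{
    uint64_t sleep_us = 0;

    /* With a 50 us max and no traffic: 10, 20, 30, 40, 50, 50, 50. */
    for (int i = 0; i < 7; i++) {
        sleep_us = next_sleep_request(sleep_us, 50, 0);
        printf("%" PRIu64 " ", sleep_us);
    }
    /* A burst of 32 packets resets the request to 0. */
    printf("\n%" PRIu64 "\n", next_sleep_request(sleep_us, 50, 32));
    return 0;
}
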
3 changes: 3 additions & 0 deletions NEWS
@@ -30,6 +30,9 @@ Post-v3.0.0
- Userspace datapath:
* Add '-secs' argument to appctl 'dpif-netdev/pmd-rxq-show' to show
the pmd usage of an Rx queue over a configurable time period.
* Add new experimental PMD load based sleeping feature. PMD threads can
request to sleep up to a user configured 'pmd-maxsleep' value under
low load conditions.


v3.0.0 - 15 Aug 2022
24 changes: 19 additions & 5 deletions lib/dpif-netdev-perf.c
@@ -230,18 +230,26 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
uint64_t tot_iter = histogram_samples(&s->pkts);
uint64_t idle_iter = s->pkts.bin[0];
uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
uint64_t sleep_iter = stats[PMD_SLEEP_ITER];
uint64_t tot_sleep_cycles = stats[PMD_CYCLES_SLEEP];

ds_put_format(str,
" Iterations: %12"PRIu64" (%.2f us/it)\n"
" - Used TSC cycles: %12"PRIu64" (%5.1f %% of total cycles)\n"
" - idle iterations: %12"PRIu64" (%5.1f %% of used cycles)\n"
" - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n",
tot_iter, tot_cycles * us_per_cycle / tot_iter,
" - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n"
" - sleep iterations: %12"PRIu64" (%5.1f %% of iterations)\n"
" Sleep time (us): %12.0f (%3.0f us/iteration avg.)\n",
tot_iter,
(tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter,
tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
idle_iter,
100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
busy_iter,
100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
sleep_iter, tot_iter ? 100.0 * sleep_iter / tot_iter : 0,
tot_sleep_cycles * us_per_cycle,
sleep_iter ? (tot_sleep_cycles * us_per_cycle) / sleep_iter : 0);
if (rx_packets > 0) {
ds_put_format(str,
" Rx packets: %12"PRIu64" (%.0f Kpps, %.0f cycles/pkt)\n"
@@ -518,14 +526,15 @@ OVS_REQUIRES(s->stats_mutex)

void
pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
int tx_packets, bool full_metrics)
int tx_packets, uint64_t sleep_cycles,
bool full_metrics)
{
uint64_t now_tsc = cycles_counter_update(s);
struct iter_stats *cum_ms;
uint64_t cycles, cycles_per_pkt = 0;
char *reason = NULL;

cycles = now_tsc - s->start_tsc;
cycles = now_tsc - s->start_tsc - sleep_cycles;
s->current.timestamp = s->iteration_cnt;
s->current.cycles = cycles;
s->current.pkts = rx_packets;
@@ -539,6 +548,11 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
histogram_add_sample(&s->cycles, cycles);
histogram_add_sample(&s->pkts, rx_packets);

if (sleep_cycles) {
pmd_perf_update_counter(s, PMD_SLEEP_ITER, 1);
pmd_perf_update_counter(s, PMD_CYCLES_SLEEP, sleep_cycles);
}

if (!full_metrics) {
return;
}
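
As a quick check on the new counters formatted above, the per-iteration sleep
average in the sample output can be reproduced directly from the two totals. A
trivial standalone example, using the numbers from the commit message (the total
iteration count is a hypothetical value consistent with the 76.8 % figure; this
does not query a running OVS):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint64_t tot_iter     = 200512;     /* Hypothetical total iterations.   */
    uint64_t sleep_iter   = 153994;     /* Iterations that slept.           */
    double   tot_sleep_us = 9159399;    /* Total sleep time, microseconds.  */

    printf("sleep iterations: %5.1f %% of iterations\n",
           tot_iter ? 100.0 * sleep_iter / tot_iter : 0);      /* ~76.8 %   */
    printf("avg sleep: %3.0f us/iteration\n",
           sleep_iter ? tot_sleep_us / sleep_iter : 0);         /* ~59 us    */
    return 0;
}
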
5 changes: 4 additions & 1 deletion lib/dpif-netdev-perf.h
@@ -80,6 +80,8 @@ enum pmd_stat_type {
PMD_CYCLES_ITER_IDLE, /* Cycles spent in idle iterations. */
PMD_CYCLES_ITER_BUSY, /* Cycles spent in busy iterations. */
PMD_CYCLES_UPCALL, /* Cycles spent processing upcalls. */
PMD_SLEEP_ITER, /* Iterations where a sleep has taken place. */
PMD_CYCLES_SLEEP, /* Total cycles slept to save power. */
PMD_N_STATS
};

@@ -408,7 +410,8 @@ void
pmd_perf_start_iteration(struct pmd_perf_stats *s);
void
pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
int tx_packets, bool full_metrics);
int tx_packets, uint64_t sleep_cycles,
bool full_metrics);

/* Formatting the output of commands. */

65 changes: 61 additions & 4 deletions lib/dpif-netdev.c
@@ -171,6 +171,11 @@ static struct odp_support dp_netdev_support = {
/* Time in microseconds to try RCU quiescing. */
#define PMD_RCU_QUIESCE_INTERVAL 10000LL

/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
#define PMD_SLEEP_THRESH (NETDEV_MAX_BURST / 2)
/* Time in uS to increment a pmd thread sleep time. */
#define PMD_SLEEP_INC_US 10

struct dpcls {
struct cmap_node node; /* Within dp_netdev_pmd_thread.classifiers */
odp_port_t in_port;
@@ -279,6 +284,8 @@ struct dp_netdev {
atomic_uint32_t emc_insert_min;
/* Enable collection of PMD performance metrics. */
atomic_bool pmd_perf_metrics;
/* Max load based sleep request. */
atomic_uint64_t pmd_max_sleep;
/* Enable the SMC cache from ovsdb config */
atomic_bool smc_enable_db;

@@ -4821,8 +4828,10 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
uint64_t rebalance_intvl;
uint8_t cur_rebalance_load;
uint32_t rebalance_load, rebalance_improve;
uint64_t pmd_max_sleep, cur_pmd_max_sleep;
bool log_autolb = false;
enum sched_assignment_type pmd_rxq_assign_type;
static bool first_set_config = true;

tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
DEFAULT_TX_FLUSH_INTERVAL);
@@ -4969,6 +4978,19 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
bool autolb_state = smap_get_bool(other_config, "pmd-auto-lb", false);

set_pmd_auto_lb(dp, autolb_state, log_autolb);

pmd_max_sleep = smap_get_ullong(other_config, "pmd-maxsleep", 0);
pmd_max_sleep = ROUND_UP(pmd_max_sleep, 10);
pmd_max_sleep = MIN(PMD_RCU_QUIESCE_INTERVAL, pmd_max_sleep);
atomic_read_relaxed(&dp->pmd_max_sleep, &cur_pmd_max_sleep);
if (first_set_config || pmd_max_sleep != cur_pmd_max_sleep) {
atomic_store_relaxed(&dp->pmd_max_sleep, pmd_max_sleep);
VLOG_INFO("PMD max sleep request is %"PRIu64" usecs.", pmd_max_sleep);
VLOG_INFO("PMD load based sleeps are %s.",
pmd_max_sleep ? "enabled" : "disabled" );
}

first_set_config = false;
return 0;
}

@@ -6929,6 +6951,7 @@ pmd_thread_main(void *f_)
int poll_cnt;
int i;
int process_packets = 0;
uint64_t sleep_time = 0;

poll_list = NULL;

@@ -6989,10 +7012,13 @@
ovs_mutex_lock(&pmd->perf_stats.stats_mutex);
for (;;) {
uint64_t rx_packets = 0, tx_packets = 0;
uint64_t time_slept = 0;
uint64_t max_sleep;

pmd_perf_start_iteration(s);

atomic_read_relaxed(&pmd->dp->smc_enable_db, &pmd->ctx.smc_enable_db);
atomic_read_relaxed(&pmd->dp->pmd_max_sleep, &max_sleep);

for (i = 0; i < poll_cnt; i++) {

@@ -7011,14 +7037,40 @@
dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
poll_list[i].port_no);
rx_packets += process_packets;
if (process_packets >= PMD_SLEEP_THRESH) {
sleep_time = 0;
}
}

if (!rx_packets) {
/* We didn't receive anything in the process loop.
* Check if we need to send something.
* There was no time updates on current iteration. */
pmd_thread_ctx_time_update(pmd);
tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
max_sleep && sleep_time
? true : false);
}

if (max_sleep) {
/* Check if a sleep should happen on this iteration. */
if (sleep_time) {
struct cycle_timer sleep_timer;

cycle_timer_start(&pmd->perf_stats, &sleep_timer);
xnanosleep_no_quiesce(sleep_time * 1000);
time_slept = cycle_timer_stop(&pmd->perf_stats, &sleep_timer);
pmd_thread_ctx_time_update(pmd);
}
if (sleep_time < max_sleep) {
/* Increase sleep time for next iteration. */
sleep_time += PMD_SLEEP_INC_US;
} else {
sleep_time = max_sleep;
}
} else {
/* Reset sleep time as max sleep policy may have been changed. */
sleep_time = 0;
}

/* Do RCU synchronization at fixed interval. This ensures that
@@ -7058,7 +7110,7 @@
break;
}

pmd_perf_end_iteration(s, rx_packets, tx_packets,
pmd_perf_end_iteration(s, rx_packets, tx_packets, time_slept,
pmd_perf_metrics_enabled(pmd));
}
ovs_mutex_unlock(&pmd->perf_stats.stats_mutex);
@@ -9909,7 +9961,7 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
struct polled_queue *poll_list, int poll_cnt)
{
struct dpcls *cls;
uint64_t tot_idle = 0, tot_proc = 0;
uint64_t tot_idle = 0, tot_proc = 0, tot_sleep = 0;
unsigned int pmd_load = 0;

if (pmd->ctx.now > pmd->next_cycle_store) {
@@ -9926,10 +9978,13 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
pmd->prev_stats[PMD_CYCLES_ITER_IDLE];
tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
tot_sleep = pmd->perf_stats.counters.n[PMD_CYCLES_SLEEP] -
pmd->prev_stats[PMD_CYCLES_SLEEP];

if (pmd_alb->is_enabled && !pmd->isolated) {
if (tot_proc) {
pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
pmd_load = ((tot_proc * 100) /
(tot_idle + tot_proc + tot_sleep));
}

atomic_read_relaxed(&pmd_alb->rebalance_load_thresh,
Expand All @@ -9946,6 +10001,8 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE];
pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
pmd->prev_stats[PMD_CYCLES_SLEEP] =
pmd->perf_stats.counters.n[PMD_CYCLES_SLEEP];

/* Get the cycles that were used to process each queue and store. */
for (unsigned i = 0; i < poll_cnt; i++) {
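
The pmd_load change in the hunk above means sleep cycles now count toward the
denominator of the auto load balance calculation, so a mostly sleeping PMD
reports a proportionally lower load. A standalone sketch of the effect, with
invented cycle counts for illustration:

#include <stdio.h>
#include <stdint.h>

static unsigned int
pmd_load_pct(uint64_t busy, uint64_t idle, uint64_t sleep)
{
    uint64_t total = busy + idle + sleep;

    return total ? (unsigned int) (busy * 100 / total) : 0;
}

int
main(void)
{
    /* Same busy/idle split, with and without counting sleep cycles. */
    printf("%u%%\n", pmd_load_pct(200, 300, 0));    /* 40: old behaviour.        */
    printf("%u%%\n", pmd_load_pct(200, 300, 500));  /* 20: sleeps dilute load.   */
    return 0;
}
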
46 changes: 46 additions & 0 deletions tests/pmd.at
@@ -1254,3 +1254,49 @@ ovs-appctl: ovs-vswitchd: server returned an error

OVS_VSWITCHD_STOP
AT_CLEANUP

dnl Check default state
AT_SETUP([PMD - pmd sleep])
OVS_VSWITCHD_START

dnl Check default
OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD max sleep request is 0 usecs."])
OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD load based sleeps are disabled."])

dnl Check low value max sleep
get_log_next_line_num
AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="1"])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10 usecs."])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])

dnl Check high value max sleep
get_log_next_line_num
AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="10000"])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10000 usecs."])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])

dnl Check setting max sleep to zero
get_log_next_line_num
AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="0"])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 0 usecs."])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are disabled."])

dnl Check above high value max sleep
get_log_next_line_num
AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="10001"])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10000 usecs."])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])

dnl Check rounding
get_log_next_line_num
AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="490"])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 490 usecs."])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
dnl Check rounding
get_log_next_line_num
AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="491"])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 500 usecs."])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])

OVS_VSWITCHD_STOP
AT_CLEANUP
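
The expected values in the tests above follow from the pmd-maxsleep
normalisation: round the request up to the nearest 10 us and clamp it to
10000 us. A standalone illustration; the macros below mimic the OVS ones for
this example only:

#include <stdio.h>
#include <stdint.h>

#define ROUND_UP(x, y)  (((x) + (y) - 1) / (y) * (y))
#define MIN(a, b)       ((a) < (b) ? (a) : (b))
#define MAX_SLEEP_US    10000

static uint64_t
normalise_max_sleep(uint64_t requested_us)
{
    return MIN((uint64_t) MAX_SLEEP_US, ROUND_UP(requested_us, 10));
}

int
main(void)
{
    const uint64_t inputs[] = { 0, 1, 490, 491, 10000, 10001 };

    for (size_t i = 0; i < sizeof inputs / sizeof inputs[0]; i++) {
        /* Expected: 0, 10, 490, 500, 10000, 10000. */
        printf("%llu -> %llu\n", (unsigned long long) inputs[i],
               (unsigned long long) normalise_max_sleep(inputs[i]));
    }
    return 0;
}
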
26 changes: 26 additions & 0 deletions vswitchd/vswitch.xml
@@ -788,6 +788,32 @@
The default value is <code>25%</code>.
</p>
</column>
<column name="other_config" key="pmd-maxsleep"
type='{"type": "integer",
"minInteger": 0, "maxInteger": 10000}'>
<p>
Specifies the maximum sleep time that will be requested in
microseconds per iteration for a PMD thread which has received zero
or a small amount of packets from the Rx queues it is polling.
</p>
<p>
The actual sleep time requested is based on the load
of the Rx queues that the PMD polls and may be less than
the maximum value.
</p>
<p>
The default value is <code>0 microseconds</code>, which means
that the PMD will not sleep regardless of the load from the
Rx queues that it polls.
</p>
<p>
To avoid requesting very small sleeps (e.g. less than 10 us) the
value will be rounded up to the nearest 10 us.
</p>
<p>
The maximum value is <code>10000 microseconds</code>.
</p>
</column>
<column name="other_config" key="userspace-tso-enable"
type='{"type": "boolean"}'>
<p>
