SwitchML: Switch-Based Training Acceleration for Machine Learning

SwitchML accelerates the all-reduce communication primitive commonly used by distributed Machine Learning frameworks. It uses a programmable switch dataplane to perform in-network computation, reducing the volume of exchanged data by aggregating vectors (e.g., model updates) from multiple workers in the network. It provides an end-host library that can be integrated with ML frameworks to provide an efficient solution that speeds up training for a number of real-world benchmark models.
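To make the idea concrete, here is a minimal sketch of what in-network aggregation computes, written as a plain Python simulation. It is not the SwitchML API or the P4 program: the function name `switch_aggregate`, the `chunk_size` parameter, and the use of integer lists are assumptions made purely for illustration.

```python
from typing import List


def switch_aggregate(worker_vectors: List[List[int]], chunk_size: int = 4) -> List[int]:
    """Simulate an aggregating switch: sum equal-length vectors chunk by chunk.

    Hypothetical helper for illustration only; not part of the SwitchML library.
    """
    if not worker_vectors:
        raise ValueError("at least one worker vector is required")
    length = len(worker_vectors[0])
    if any(len(v) != length for v in worker_vectors):
        raise ValueError("all worker vectors must have the same length")

    result: List[int] = []
    for start in range(0, length, chunk_size):
        # Each worker sends one packet carrying this chunk of its vector.
        packets = [v[start:start + chunk_size] for v in worker_vectors]
        # The switch adds the packets element-wise ...
        aggregated = [sum(values) for values in zip(*packets)]
        # ... and a single aggregated chunk goes back to every worker.
        result.extend(aggregated)
    return result


updates = [
    [1, 2, 3, 4, 5],
    [10, 20, 30, 40, 50],
    [100, 200, 300, 400, 500],
]
print(switch_aggregate(updates))  # [111, 222, 333, 444, 555]
```

Every worker ends up with the same summed vector, which is the all-reduce result, while sending its own vector only once.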

The switch hardware is programmed with a P4 program for the Tofino Native Architecture (TNA) and managed at runtime through a Python controller using BFRuntime. The end-host library provides simple APIs to perform all-reduce operations using different transport protocols. We currently support UDP through DPDK and RDMA UC. The library has already been integrated with ML frameworks as an NCCL plugin.

Note: This is a preliminary code release, and we are working to complete both the code and the documentation.

Getting started

To run SwitchML you need to:

compile the P4 program and deploy it on the switch (see the P4 code documentation)
run the Python controller (see the controller documentation)
compile and run the end-host program using the end-host library (see the library documentation)

The examples folder provides simple programs that show how to use the APIs.

Repo organization

The SwitchML repository is organized as follows:

docs: project documentation
dev_root:
  ┣ p4: P4 code for TNA
  ┣ controller: controller program
  ┣ client_lib: end-host library
  ┣ examples: set of example programs
  ┣ benchmarks: programs used to test raw performance
  ┣ frameworks_integration: code to integrate with ML frameworks
  ┗ third_party: third party software

Testing

The benchmarks folder contains a benchmark program that we used to measure SwitchML's performance. In our experiments (see the benchmark documentation for details), we observed a speedup of more than 2x over NCCL when using RDMA. Moreover, unlike ring all-reduce, SwitchML's performance remains constant regardless of the number of workers.
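The scaling behavior can be illustrated with a rough, idealized cost model. The sketch below uses the textbook cost of ring all-reduce (2(n-1) sequential steps, each moving 1/n of the vector) and assumes that with in-network aggregation each worker sends its vector to the switch once. The function names are hypothetical, and the model ignores packetization, packet loss, and switch memory limits, so it is not a substitute for the measurements in the benchmark documentation.

```python
from typing import Tuple


def ring_allreduce_cost(num_workers: int, vector_bytes: int) -> Tuple[float, int]:
    """Return (bytes sent per worker, sequential steps) for ring all-reduce."""
    if num_workers < 2:
        raise ValueError("ring all-reduce needs at least two workers")
    # Reduce-scatter and all-gather each take (n - 1) steps.
    steps = 2 * (num_workers - 1)
    # Every step sends one chunk of size vector_bytes / n.
    bytes_sent = steps * vector_bytes / num_workers
    return bytes_sent, steps


def in_network_cost(num_workers: int, vector_bytes: int) -> Tuple[float, int]:
    """Return (bytes sent per worker, sequential steps) under the idealized
    assumption that each worker sends its vector to the switch exactly once."""
    if num_workers < 1:
        raise ValueError("at least one worker is required")
    # Independent of num_workers: one trip up to the switch and back.
    return float(vector_bytes), 1


for n in (2, 8, 32):
    print(n, ring_allreduce_cost(n, 800), in_network_cost(n, 800))
```

In this model the number of sequential steps in the ring grows linearly with the number of workers, while the in-network figures do not depend on the worker count.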

Publication

Scaling Distributed Machine Learning with In-Network Aggregation. A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy, M. Moshref, D. R. K. Ports, P. Richtarik. In Proceedings of NSDI '21, April 2021.

Contributing

This project welcomes contributions and suggestions. To learn more about making a contribution to SwitchML, please see our Contribution page.

The Team

SwitchML is a project driven by the P4.org community and is currently maintained by Amedeo Sapio, Omar Alama, Marco Canini, and Jacob Nelson.

License

SwitchML is released under the Apache License 2.0, as found in the LICENSE file.
