A user-friendly GPU management tool for distributed machine learning workloads


TensorHive

TensorHive is an open source system for monitoring and managing computing resources across multiple hosts. It solves the most common problems and nightmares about accessing and sharing your AI-oriented infrastructure across multiple, often competing users.

It's designed with simplicity, flexibility and configuration-friendliness in mind.

Use cases

Our goal is to provide solutions for painful problems that ML engineers often have to struggle with when working with remote machines in order to run neural network trainings.

You should really consider using TensorHive if either of the profiles below matches you:

  1. You're an admin who is responsible for managing a cluster (or multiple servers) with powerful GPUs installed.
    • 😠 There are more users than resources, so they have to compete for them, and you don't know how to deal with that chaos
    • 🌊 Other popular tools (Grafana, Kubernetes, Slurm) are simply overkill, serve a different purpose, or require a lot of time spent on reading documentation, installation and configuration
    • 🐧 People using your infrastructure expect a single interface (besides the terminal) for everything related to training models: monitoring, a reservation calendar and scheduling of distributed jobs
    • 💥 You can't risk messing up sensitive configuration by installing software on each individual machine, and prefer a centralized solution that can be managed from one place
  2. You're a standalone user who has access to beefy GPUs scattered across multiple machines.
    • 〽️ You want to be able to determine if the batch size is too small or if there's a bottleneck when moving data from memory to GPU - charts with metrics such as gpu_util, mem_util, mem_used are great for this purpose
    • 📅 Visualizing the names of training experiments in a calendar helps you track how you're progressing on the project
    • Launching distributed trainings is essential for you, no matter what the framework is
    • 😵 Managing a list of training commands for all your distributed training experiments drives you nuts
    • 💤 Remembering to manually launch the training before going to sleep is no fun anymore

What TensorHive has to offer

0️⃣ Dead-simple one-machine installation and configuration, no sudo requirements

1️⃣ Users can reserve GPUs for a specific time range in advance via the reservation mechanism

     ➡️ no more frustration caused by rules such as "first come, first served" or "the law of the jungle".

2️⃣ Users can prepare and schedule custom tasks (commands) to be run on selected GPUs and hosts

     ➡️ automate and simplify distributed trainings - "one button to rule them all"

3️⃣ Gather all useful GPU metrics from all configured hosts in one dashboard

     ➡️ no more manually logging in to each individual machine to check whether a GPU is currently in use

For more details, check out the full list of features.

Getting started

Prerequisites

  • All nodes must be accessible via SSH, without password, using SSH Key-Based Authentication (How to set up SSH keys - explained in Quickstart section)
  • Only NVIDIA GPUs are supported (relying on nvidia-smi command)
  • Currently TensorHive assumes that all users who want to register into the system must have identical UNIX usernames on all nodes configured by TensorHive administrator (not relevant for standalone developers)
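
The first two prerequisites can be checked by hand before installing anything. The sketch below is only an illustration, not how TensorHive itself tests connectivity: it assumes the OpenSSH client is installed locally, and the function names, user name and host name are hypothetical. `BatchMode=yes` makes `ssh` fail instead of prompting for a password, and `nvidia-smi -L` lists the GPUs on the remote node.

```python
import subprocess
from typing import List


def build_check_command(user: str, host: str, port: int = 22) -> List[str]:
    """Build an ssh command that fails fast if key-based login is not set up."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never fall back to a password prompt
        "-o", "ConnectTimeout=5",   # give up quickly on unreachable nodes
        "-p", str(port),
        f"{user}@{host}",
        "nvidia-smi", "-L",         # list NVIDIA GPUs on the remote node
    ]


def node_is_ready(user: str, host: str, port: int = 22) -> bool:
    """Return True if passwordless SSH works and nvidia-smi runs on the node."""
    try:
        result = subprocess.run(
            build_check_command(user, host, port),
            capture_output=True,
            text=True,
            timeout=15,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

For example, `node_is_ready("alice", "gpu-node-1")` returns `False` when the node would ask for a password or has no working `nvidia-smi`.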

Installation

via pip

pip install tensorhive

From source

(optional) For development purposes we encourage separation from your current python packages using e.g. virtualenv, Anaconda.

git clone https://github.com/roscisz/TensorHive.git && cd TensorHive
pip install -e .

TensorHive already ships with the newest web app build, but if you modify the source, you can rebuild it with make app (currently on master branch). For more useful commands, see our Makefile. The build was tested with Node v10.15.2 and npm 5.8.0.

Basic usage

Quickstart

The init command will guide you through basic configuration process:

tensorhive init

You can check connectivity with the configured hosts using the test command.

tensorhive test

(optional) If you want to allow your UNIX users to set up their TensorHive accounts on their own and run distributed programs through the Task execution plugin, use the key command to generate the SSH key for TensorHive:

tensorhive key

Now you should be ready to launch a TensorHive instance:

tensorhive

Web application and API Documentation can be accessed via URLs highlighted in green (Ctrl + click to open in browser).

Advanced configuration

You can fully customize TensorHive behaviours via INI configuration files (which will be created automatically after tensorhive init):

~/.config/TensorHive/main_config.ini
~/.config/TensorHive/mailbot_config.ini
~/.config/TensorHive/hosts_config.ini

(see example)
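
Because these are plain INI files, they can also be inspected with Python's standard `configparser`. The sketch below is not TensorHive's own loader. The layout it parses (one section per host with `user` and `port` keys) and the host names are assumptions made for illustration, so check the generated files for the actual keys.

```python
import configparser
from typing import Dict

# Hypothetical hosts file: section names and keys are illustrative only.
SAMPLE_HOSTS_INI = """
[gpu-node-1]
user = alice
port = 22

[gpu-node-2]
user = alice
port = 2222
"""


def load_hosts(ini_text: str) -> Dict[str, Dict[str, object]]:
    """Parse INI text into {host: {"user": ..., "port": ...}}."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    hosts = {}
    for name in parser.sections():
        section = parser[name]
        hosts[name] = {
            "user": section["user"],
            "port": section.getint("port", fallback=22),  # default SSH port
        }
    return hosts
```

To read a real file instead of the sample string, pass the file's contents, e.g. `load_hosts(Path("~/.config/TensorHive/hosts_config.ini").expanduser().read_text())`.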

Infrastructure monitoring dashboard

Accessible infrastructure can be monitored in the Nodes overview tab, where you can add new watches, select metrics, and monitor ongoing GPU processes and their owners. Sample screenshot:

image
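
To get a feel for where metrics such as gpu_util, mem_util and mem_used come from, here is a minimal sketch that parses the CSV output of `nvidia-smi`. It is an illustration under assumptions rather than TensorHive's monitoring code: the query flags are standard `nvidia-smi` options, but the mapping onto the metric names and the sample output are made up for the example.

```python
import csv
import io
from typing import Dict, List

# Standard nvidia-smi query flags; run this command on a node to get real data.
NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,utilization.memory,memory.used",
    "--format=csv,noheader,nounits",
]

# Made-up output in the shape the command above prints (percent, percent, MiB).
SAMPLE_OUTPUT = "0, 87, 45, 10240\n1, 0, 0, 3\n"


def parse_gpu_metrics(csv_text: str) -> List[Dict[str, int]]:
    """Turn nvidia-smi CSV rows into dicts keyed by metric name."""
    metrics = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        # Note: values such as "[N/A]" on unsupported GPUs would raise ValueError.
        index, gpu_util, mem_util, mem_used = (int(field) for field in row)
        metrics.append(
            {"index": index, "gpu_util": gpu_util,
             "mem_util": mem_util, "mem_used": mem_used}
        )
    return metrics
```

A consistently low gpu_util next to a high mem_used is the kind of pattern the charts make easy to spot.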

GPU Reservation calendar

Each column represents all reservation events for a GPU on a given day. In order to make a new reservation simply click and drag with your mouse, select GPU(s), add some meaningful title, optionally adjust time range.

If there are many hosts and GPUs in your infrastructure, you can use our simplified, horizontal calendar to quickly identify empty time slots and filter out already reserved GPUs.

image

From now on, only your processes are eligible to run on the reserved GPU(s). TensorHive periodically checks whether another user has violated the reservation. The violator is warned on all of their PTYs and emailed every once in a while, and the admin is notified as well (all of this depends on the configuration).

| Terminal warning | Email warning |
| --- | --- |
| image | image |

What admin is e-mailed:

image
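
The periodic check described above boils down to comparing GPU process owners against active reservations. The following is a minimal sketch of that idea, not TensorHive's actual implementation: the `Reservation` and `GpuProcess` types and the `find_violations` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Reservation:
    gpu_id: str
    owner: str
    start: datetime
    end: datetime


@dataclass(frozen=True)
class GpuProcess:
    gpu_id: str
    owner: str
    pid: int


def find_violations(
    reservations: Iterable[Reservation],
    processes: Iterable[GpuProcess],
    now: datetime,
) -> List[GpuProcess]:
    """Return processes running on a reserved GPU that belong to someone else."""
    # Map each currently reserved GPU to the user who reserved it.
    active = {r.gpu_id: r.owner for r in reservations if r.start <= now < r.end}
    return [
        p for p in processes
        if p.gpu_id in active and p.owner != active[p.gpu_id]
    ]
```

Each returned process identifies a user to warn; what happens next (terminal message, e-mail, admin notification) is a matter of configuration.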

Task execution

Thanks to the Task execution module, you can define commands for tasks you want to run on any of the configured nodes. You can manage them manually or schedule their spawn and termination times. Commands are run within a screen session, so attaching to them while they are running is a piece of cake.

It provides a simple but flexible (framework-agnostic) command templating mechanism that will help you automate multi-node trainings. Additionally, specialized templates make it convenient to set proper parameters for selected well-known frameworks:

image

In the examples directory, you will find sample scenarios of using the Task execution module for various frameworks and computing environments.
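
To illustrate the idea behind command templating and screen-based execution, here is a minimal sketch. It does not reproduce TensorHive's template syntax or internals: the `{rank}` and `{world_size}` placeholders, the `train.py` script, the session names and both function names are assumptions made for the example. `screen -dmS <name>` starts a detached, named session, which is what makes attaching to it later possible.

```python
import shlex
from typing import Sequence


def render_task(template: str, gpus: Sequence[int], **params: object) -> str:
    """Fill in a command template and pin it to the selected GPUs."""
    command = template.format(**params)
    visible = ",".join(str(gpu) for gpu in gpus)
    return f"CUDA_VISIBLE_DEVICES={visible} {command}"


def wrap_in_screen(session_name: str, command: str) -> str:
    """Wrap a command so it runs inside a detached, named screen session."""
    return (
        f"screen -dmS {shlex.quote(session_name)} "
        f"bash -c {shlex.quote(command)}"
    )


# One command per node of a hypothetical two-node training.
TEMPLATE = "python train.py --rank {rank} --world-size {world_size}"
commands = [
    wrap_in_screen(
        f"task_{rank}",
        render_task(TEMPLATE, gpus=[0, 1], rank=rank, world_size=2),
    )
    for rank in range(2)
]
```

A user could then attach to the first session on its node with `screen -r task_0`.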

TensorHive requires that users who want to use this feature must append TensorHive's public key to their ~/.ssh/authorized_keys on all nodes they want to connect to.
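
Appending the key can be done with any editor. For users who prefer to script it, here is a small sketch that adds a key only if it is not already present. The `authorize_key` function is a hypothetical helper, not part of TensorHive, and the key string in the usage note is a placeholder.

```python
from pathlib import Path


def authorize_key(public_key: str, authorized_keys: Path) -> bool:
    """Append public_key to authorized_keys unless it is already listed.

    Returns True if the file was modified, False if the key was present.
    """
    key = public_key.strip()
    # sshd expects ~/.ssh to be private to the user.
    authorized_keys.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    current = authorized_keys.read_text() if authorized_keys.exists() else ""
    if key in (line.strip() for line in current.splitlines()):
        return False
    # Make sure the new key starts on its own line.
    separator = "" if current == "" or current.endswith("\n") else "\n"
    with authorized_keys.open("a") as handle:
        handle.write(f"{separator}{key}\n")
    authorized_keys.chmod(0o600)
    return True
```

On each node this could be called as `authorize_key("<TensorHive public key>", Path.home() / ".ssh" / "authorized_keys")`, with the placeholder replaced by the actual public key.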

Features

Core

  • 🔎 Monitor metrics on each host
    • ™️ Nvidia GPUs
    • 📟 CPU, RAM
    • 📂 HDD
  • 🛃 Protection of reserved resources
    • ⚠️ Send warning messages to terminal of users who violate the rules
    • 📭 Send e-mail warnings
    • 💣 Kill unwanted processes
  • 🚀 Task execution and scheduling
    • 🗝️ Execute any command in the name of a user
    • ⏰ Schedule spawn and termination
    • 🔁 Synchronize process status
    • Use screen command as backend - user can easily attach to running task
    • 💀 Remote process interruption, termination and kill
    • 💾 Save stdout to disk
    • 📄 Capture stderr
  • ⌚ Track wasted (idle) time during reservation
    • 🔪 Gather and calculate average gpu and mem utilization
    • 📢 Remind users when their reservation starts and ends
    • 📨 Send e-mail if idle for too long
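
As an illustration of the idle-time tracking listed above, the sketch below averages utilization samples and flags a reservation as idle when both averages stay under a threshold. The function names and the 5% default threshold are assumptions made for the example, not TensorHive's actual rules.

```python
from statistics import mean
from typing import Sequence


def average_utilization(samples: Sequence[float]) -> float:
    """Average utilization in percent; no samples counts as 0."""
    return mean(samples) if samples else 0.0


def is_idle(
    gpu_util_samples: Sequence[float],
    mem_util_samples: Sequence[float],
    threshold_percent: float = 5.0,  # hypothetical default
) -> bool:
    """A reservation is idle if average GPU and memory use are both low."""
    return (
        average_utilization(gpu_util_samples) < threshold_percent
        and average_utilization(mem_util_samples) < threshold_percent
    )
```

For example, `is_idle([0, 2, 1], [0, 0, 3])` is `True`, while a reservation with samples around 90% GPU utilization is not idle.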

Web

  • 📉 Configurable charts view
    • Metrics and active processes
    • Detailed hardware specification
  • 📆 Calendar view
    • Allow making reservations for selected GPUs
    • Edit reservations
    • Cancel reservations
    • Attach jobs to reservation
  • 🚼 Task execution
    • Create parametrized tasks and assign to hosts, automatically set CUDA_VISIBLE_DEVICES
    • Buttons for task spawning/scheduling/termination/killing actions
    • Fetch log produced by running task
    • Group actions (spawn, schedule, terminate, kill selected)
  • Detailed hardware specification panel (CPU clock speed, RAM, etc.)
  • 🐧 Admin panel
    • User banning
    • Accept/reject reservation requests
    • Modify rules on-the-fly (without restarting)
    • Show popups to users (something like message of the day - motd)

CLI

  • Implement command-line app that communicates with core via API
  • Migrate all features from web app that don't require GUI (so no charts)

API

  • OpenAPI 2.0 specification with Swagger UI
  • User authentication via JWT

TensorHive is currently being used in production in the following environments:

| Organization | Hardware | No. users |
| --- | --- | --- |
| Gdansk University of Technology | NVIDIA DGX Station (4x Tesla V100) + NVIDIA DGX-1 (8x Tesla V100) | 30+ |
| Lab at GUT | 20 machines with GTX 1060 each | 20+ |
| Gradient PG | A server with two GPUs shared by the Gradient science club at GUT. | 30+ |
| VoiceLab - Conversational Intelligence | 30+ GTX and RTX GPUs | 10+ |

TensorHive architecture (simplified)

This diagram will help you to grasp the rough concept of the system.

TensorHive_diagram _final

Contribution and feedback

We'd ❤️ to collect your observations, issues and pull requests!

Feel free to report any configuration problems, we will help you.

Currently we are working on user groups for differentiated GPU access control, grouping tasks into jobs, and a process-killing reservation violation handler (deadline: July 2020), so stay tuned!

If you are considering becoming a contributor, please look at issues labeled as good-first-issue and help wanted.

Credits

TensorHive has been greatly supported within a joint project between VoiceLab.ai and Gdańsk University of Technology titled: "Exploration and selection of methods for parallelization of neural network training using multiple GPUs".

Project created and maintained by:

Top contributors:

License

Apache License 2.0
