- System / Infra
- Compute & Storage
- Grid computing / Supercomputing
- Cloud services
- Tools
- CPU
- FPGA
- GPU
- TPU
- IPU
- Performance
- Misc
- Contributing
- serveo.net - Serveo is an SSH server for remote port forwarding: when a user connects to Serveo, they get a public URL that anybody can use to reach their localhost server. See the link for other SSH-based alternatives, useful for serving resources across devices, e.g. accessing a GPU or other hardware accelerator on another machine remotely. | How to forward my local port to public using Serveo? | Serveo on GitHub
- Inlets by Alex Ellis | Get started | Video
- KnockKnock by @huggingface | tweet
- Cray Computers | Artificial Intelligence | Accel AI | Cryo-EM | Autonomous Vehicles | Geospatial AI
- GraphCore's IPU
- Lambda Labs
- NGD Systems: Technology [deadlink] | Solutions - High Compute Storage, Scalable Computational Storage [deadlink] | NGD Systems: Ensuring AI Advancement with Intelligent Storage
- Grid Engine: wikipedia | Univa website | Datasheet
- BOINC - High-Throughput Computing with BOINC | Tech Docs | Download BOINC | GitHub
- Cray Computers - Supercomputing as a Service
- vast.ai - GPU Sharing Economy. One simple interface to find the best cloud GPU rentals. Reduce cloud compute costs by 3X to 5X
- paperspace - The first cloud built for the future. Powering next-generation applications and cloud ML/AI pipelines. Paperspace is built to scale with your team, with a pay-as-you-go option for individuals.
- NextJournal - The notebook for reproducible research
- valohai | docs | blogs | GitHub | Videos | Showcase | Slack | @valohaiai - Valohai is a machine learning platform: it runs your experiments in the cloud, tracks your experiment history and streamlines data science workflows. A deep learning management platform offering machine orchestration, version control and pipeline management for deep learning.
- Lambda Cloud GPU Instances - GPU Instances for Deep Learning & Machine Learning
- NavOps - Cloud Migration for HPC | Datasheet
- Verne Global: HPC Cloud | NVIDIA DGX Ready
- Weights and Biases | Learn more about WandB
- Marvin AI: About Marvin AI | Apache Marvin AI: MLOps platform | GitHub | Video
- RealityEngine.ai | Research | Blogs
- Videos
- Notebooks
- Workshop: Unsupervised Learning and Deep Learning Based Forecasting: Anomaly Workbook | Forecasting Workbook
- AutoML Core Concepts and Hands-On Workshop: Regression Notebook | Classification Notebook
- Workshop: Large Scale Deep Learning Recommender
- Reality Engines Demo
- Accelerating AI Training with MLPerf Containers and Models from NVIDIA NGC
- Running AI Models in the Cloud: site | video | Docs | Getting started
- snakemake - The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Slides | PyPi
- plz - Plz (pronounced "please") runs your jobs, storing the code, inputs, outputs and results so that they can be queried programmatically.
- valohai | docs | blogs | GitHub | Videos | Showcase | Slack - Valohai is a machine learning platform: it runs your experiments in the cloud, tracks your experiment history and streamlines data science workflows. A deep learning management platform offering machine orchestration, version control and pipeline management for deep learning.
- Seldon - Model deployment platform for Kubernetes clusters. | docs | github | use-cases | blogs | videos | Seldon's open-source library for machine-learning model inspection and interpretation
- Arize AI | docs | certification | resources | Slack - Model monitoring and observability platform. Community edition offers model performance tracing, data quality checks, explainability, and drift detection -- including embedding drift detection for CV and NLP models.
- kedro | other kedro projects | docs | Kedro-Viz | kedro-examples | Blogs | Video | gitter.im/py-sprints/kedro | pypi - Kedro is a workflow development tool that helps you build data pipelines that are robust, scalable, deployable, reproducible and versioned.
- Lambda Stack - One-line installation of TensorFlow, Keras, Caffe, Caffe2, CUDA, cuDNN, and NVIDIA drivers for Ubuntu 16.04 and 18.04.
- Apache Airflow - Airflow is a platform to programmatically author, schedule and monitor workflows. Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
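Airflow's core abstraction is the task DAG: the scheduler starts a task only once everything upstream of it has finished. The dependency-free sketch below (hypothetical helper names, not Airflow's API) illustrates that idea as a plain topological sort over a task-to-dependencies mapping.

```python
# Sketch of DAG-based task scheduling (the concept behind Airflow's
# scheduler, not its actual implementation or API).
from collections import deque

def schedule(deps):
    """Return a run order in which every task follows its dependencies."""
    pending = {task: set(d) for task, d in deps.items()}
    ready = deque(sorted(t for t, d in pending.items() if not d))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # a finished task unblocks anything that depended on it
        for t, d in pending.items():
            if task in d:
                d.remove(task)
                if not d and t not in order and t not in ready:
                    ready.append(t)
    if len(order) != len(pending):
        raise ValueError("cycle detected - not a DAG")
    return order

# extract -> transform -> load, plus a report that only needs extract
dag = {"extract": [], "transform": ["extract"],
       "load": ["transform"], "report": ["extract"]}
print(schedule(dag))  # "extract" always runs first
```

In real Airflow the same structure is declared with operators and the `>>` dependency syntax, and the scheduler dispatches ready tasks to a pool of workers instead of running them inline.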
- Nextflow - Data-driven computational pipelines. Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.
- StackHPC suites of repositories: AI, ML, DL, Cloud, HPC | StackHPC
- cortex - Machine learning deployment platform: Deploy machine learning models to production
- Uber introduces Fiber, an AI development and distributed training platform for methods including reinforcement learning and population-based learning. | Uber Open-Sources Fiber - A New Library For Distributed Machine Learning
- A curated list of awesome pipeline toolkits inspired by Awesome Sysadmin
- H2O Framework for Machine Learning
- ML Framework: Introducing Ludwig, a Code-Free Deep Learning Toolbox | Ludwig is a toolbox built on top of TensorFlow that allows you to train and test deep learning models without writing code
- ML Pipelines
- Large SVDs Dask + CuPy + Zarr + Genomics
- Determined AI | About Neil Conway | Determined: Open-source Deep Learning Training Platform
- ML Framework by Abhishek Thakur
- See also: Data > Programs and Tools
- Probing the CPU (Linux/MacOS)
- Zero overhead performance capturing: use `/proc/interrupts` and `/proc/softirqs`
- Non-zero overhead, less accurate: use the PMU (capture on- and off-core events)
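Sampling `/proc/interrupts` before and after a workload is the zero-overhead technique above: the kernel maintains these counters anyway, so reading them costs nothing extra. Below is a minimal, hypothetical parser sketch, assuming the standard Linux layout (a header row of CPU columns, then one labelled row per interrupt source); on a real system you would read the file twice and diff the counts.

```python
# Parse the per-CPU interrupt counters from /proc/interrupts.
# Embedded sample text stands in for the real file so the sketch is
# self-contained; lines with fewer columns than CPUs also parse.

SAMPLE = """\
           CPU0       CPU1
  0:         45          0   IO-APIC    2-edge      timer
 30:       1024       2048   PCI-MSI 524288-edge   eth0
NMI:          3          7   Non-maskable interrupts
"""

def parse_interrupts(text):
    """Return {irq_label: [per-CPU counts]}."""
    lines = text.splitlines()
    ncpus = len(lines[0].split())              # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, rest = line.split(":", 1)
        fields = rest.split()
        counts[label.strip()] = [int(f) for f in fields[:ncpus]]
    return counts

table = parse_interrupts(SAMPLE)
print(table["30"])  # per-CPU counts for the eth0 interrupt: [1024, 2048]
```

To use it live, replace `SAMPLE` with `open("/proc/interrupts").read()` (Linux only). Note that some summary rows (e.g. `ERR:`) carry a single total rather than per-CPU columns, so a production parser needs a little more care.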
- Probing the CPU (Windows)
- perfview - general profiling on Windows
- perfview for .net - excellent overview by Sasha Goldshtein
- Neural Magic: GPU-class performance on CPU
- Intel
- Intel® Developer Zone
- Intel® AI Developer Home Page
- Intel® AI Developer Webinar Series | All webinars listing
- The PlaidML Tensor Compiler - webinar
- nGraph - Unlocking next-generation performance with deep learning compilers: webinar | slides | homepage | github
- Intel Debug memory & threading bugs: Webinar slides | Intel Inspector | Inspector Docs | Intel® Parallel Studio XE | Intel® System Studio
- Intel Analysers/Profilers:
- Intel® DevCloud for oneAPI
- Tuning applications for multiple architectures
- Also see Intel in Courses
- TVM is an open deep learning compiler stack for CPUs, GPUs, and specialized accelerators. It aims to close the gap between productivity-focused deep learning frameworks and performance- or efficiency-oriented hardware backends.
Thanks to the great minds on the mechanical sympathy mailing list for their responses to my queries on CPU probing.
- Using FPGAs for Datacenter Acceleration | Windows AI | Intel® Distribution of OpenVINO™ Toolkit: Develop Multiplatform Computer Vision Solutions
- Also see FPGA in Courses
- Know your GPU
- GPU Server 1 of 2 | GPU Server 2 of 2 | Applications of GPU servers - check out the manufacturers
- Embedded Vision Solutions for NVIDIA Jetson Series | Embedded Vision Family Brochure
- Avermedia Box PC & Carrier (works with NVIDIA Jetson): 1 | 2
- Accelerating Wide & Deep Recommender Inference on GPUs
- Create GPU Arrays and Move to DL Frameworks with DLPack
- GPU Accelerated data viz tools
- This tool is handy for monitoring not only RAPIDS but also deep learning workloads
- InstaDeep™ powers AI as a Service with shared NVMe: Excelero NVMesh™ feeds unlimited streams of data to GPU-based systems with local performance for AI and ML end-users
- See NVIDIA's RAPIDS
- A NumPy-compatible array library accelerated by CUDA
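Because CuPy mirrors the NumPy API, GPU acceleration can often be a drop-in swap. A common pattern is to alias whichever library is available to a single namespace and write everything against it; this is a sketch assuming either `cupy` (with a CUDA device) or plain `numpy` is installed.

```python
# Drop-in NumPy/CuPy pattern: the same code runs on CPU or GPU depending
# on which backend is importable.
try:
    import cupy as xp      # GPU arrays, CUDA kernels under the hood
except ImportError:
    import numpy as xp     # identical API on the CPU

a = xp.arange(6, dtype=xp.float32).reshape(2, 3)   # rows [0,1,2], [3,4,5]
b = (a * 2).sum(axis=0)    # per-column sums: [6, 10, 14] on either backend
print(b)
```

The main caveat with the real CuPy is data placement: results live on the GPU until you call `cupy.asnumpy` (or similar) to bring them back to the host.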
- For ML practitioners, from NVIDIA GTC: a catalog of resources for Apache Spark on GPUs using RAPIDS and other NVIDIA libraries (Deployment on GCP | Architectural e-Book | Use cases for Adobe & Verizon | Pipelines & Hyperparameter Tuning)
- Explore GPU Acceleration in the Intel® DevCloud (slides)
- Offload Your Code from CPU to GPU … and Optimize It
- Profile DPC++ and GPU Workload Performance
- How to harness the power of the Cloud TPU
- How-tos
- All tutorials
- Command-line interface
- Cloud TPU tools
- Performance Guide
- TPU Estimator API
- Using bfloat16
- Advanced Guide to Inception V3 on Cloud TPU
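bfloat16, covered by the "Using bfloat16" guide above, is simply the top 16 bits of an IEEE-754 float32: the full 8-bit exponent is kept, so dynamic range matches float32, but only 7 mantissa bits survive (about 3 decimal digits of precision) while memory and bandwidth are halved. A pure-Python illustration of that truncation follows; the round-to-nearest-even emulation is a standard trick and an assumption of this sketch, not code from the guide.

```python
import struct

def to_bfloat16(x):
    """Emulate float32 -> bfloat16 rounding and widen the result back to
    float. bfloat16 = top 16 bits of a float32: 1 sign, 8 exponent,
    7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)          # round to nearest even
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_bfloat16(1.0))         # 1.0 - powers of two are exact
print(to_bfloat16(3.14159265))  # 3.140625 - only ~3 decimal digits kept
```

Because the exponent field is unchanged, values that overflow in float16 (e.g. large activations or gradients) still fit in bfloat16, which is why TPUs can use it without loss scaling.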
- Examples
- Using TPUs docs
- Hello, TPU in Colab notebook
- Useful TPU and Model example
- Measure Performance on TPU, in a notebook
- Financial Time series notebook
- Web traffic prediction
- GAN example, TPU version
- XLA compiler: GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding ~ 1 Trillion parameters
- GraphCore | Videos: Simon Knowles - More complex models and more powerful machines | Graphcore tech concept | A new kind of hardware designed for machine intelligence - GraphCore presentations | Video: Scaling throughput processors for machine intelligence
- What makes the IPU's architecture so efficient
- How to implement large-scale NLP models on the IPU
- Graphcore is making its Poplar software documentation publicly available
- Graphcore's quick guide to the IPU (LinkedIn post)
- Dissecting the Graphcore IPU Architecture via Microbenchmarking
- Learn how to develop and train models for the Graphcore IPU using TensorFlow
- Graphcore C2 Card performance for image-based deep learning application
- Graphcore Whitepaper: DELL DSS8440 GRAPHCORE IPU SERVER
- Product brief: IPU-MACHINE: M2000
- MOOR INSIGHTS - GRAPHCORE SOFTWARE STACK: BUILT TO SCALE
- INTELLIGENT MEMORY FOR INTELLIGENT COMPUTING
- Graphcore Benchmarks
- Graphcore GitHub
- MLPerf - Fair and useful benchmarks for measuring training and inference performance of ML hardware, software, and services.
- MLPerf introduces machine learning inference benchmark suite...
- ONE DEEP LEARNING BENCHMARK TO RULE THEM ALL
- mlbench: Distributed Machine Learning Benchmark - A public and reproducible collection of reference implementations and benchmark suite for distributed machine learning algorithms, frameworks and systems.
- EEMBC MLMark Benchmark - The EEMBC MLMark benchmark is a machine-learning (ML) benchmark designed to measure the performance and accuracy of embedded inference.
- DeepOBS: A Deep Learning Optimizer Benchmark Suite
- PMLB - a large benchmark suite for machine learning evaluation and comparison
- Deep Learning Benchmarking Suite | HPE Deep Learning Cookbook
- Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors
- Performance profiling in TF 2 (TF Dev Summit '20)
Contributions are very welcome, please share back with the wider community (and get credited for it)!
Please have a look at the CONTRIBUTING guidelines, also have a read about our licensing policy.
Back to main page (table of contents)