<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="author" content="Jon Nordby jon@soundsensing.no">
<meta name="dcterms.date" content="2021-03-25">
<title>TinyML Summit 2021: Environmental Sound Classification on microcontrollers</title>
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="https://unpkg.com/reveal.js@^4//dist/reset.css">
<link rel="stylesheet" href="https://unpkg.com/reveal.js@^4//dist/reveal.css">
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
</style>
<link rel="stylesheet" href="https://unpkg.com/reveal.js@^4//dist/theme/white.css" id="theme">
<link rel="stylesheet" href="style.css"/>
</head>
<body>
<div class="reveal">
<div class="slides">
<section class="slide level2">
<section class="titleslide level1" data-background-image="./img/soundsensing-withlogo.jpg" style="background: rgba(255, 255, 255, 0.3); padding-top: 1.7em;">
<h1 style>
Environmental Sound Classification on microcontrollers
</h1>
<p>
Jon Nordby</br> jon@soundsensing.no</br> tinyML Summit 2021</br>
</p>
</section>
</section>
<section>
<section id="introduction" class="title-slide slide level1">
<h1>Introduction</h1>
<aside class="notes">
<p>Environmental Sounds are sounds that we have around us in our environment, especially outdoors.</p>
<p>It can be cars honking, music played from a club, or speech from people nearby.</p>
<p>When environmental sounds are unwanted we call it environmental noise.</p>
</aside>
</section>
<section id="environmental-noise-pollution" class="slide level2">
<h2>Environmental Noise Pollution</h2>
<p><img data-src="./img/noise-map-barcelona-day.png" style="width:50.0%" /></p>
<p>The environmental pollution that affects the most people in Europe</p>
<ul>
<li>13 million suffering from sleep disturbance</li>
<li>900’000 disability-adjusted life years (DALY) lost</li>
</ul>
<aside class="notes">
<p>Environmental Noise pollution is a big, and growing problem. More and more we live in urban environments, with many noise sources around us.</p>
<p>WHO estimates that in Europe alone 13 million suffer from sleep disturbance due to noise. Such noise causes the body to be stressed and in constant alert mode. This increases the risk of cardiovascular disease, obesity etc.</p>
<p>And almost 1 million disability adjusted life years are lost due to noise. This makes noise the environmental pollution that affects the most people in Europe.</p>
</aside>
</section>
<section id="occupational-noise-induced-hearing-loss" class="slide level2">
<h2>Occupational Noise-induced Hearing Loss</h2>
<p><img data-src="./img/Manufacturing-Noise-small.jpeg" style="width:50.0%" /></p>
<p>The most prevalent occupational disease in the world</p>
<ul>
<li>40 million affected by hearing loss from work</li>
<li>4 million disability-adjusted life years (DALY) lost</li>
</ul>
<aside class="notes">
<p>Another serious problem is hearing loss.</p>
<p>It is estimated that 40 million people are affected by hearing loss from work.</p>
<p>Affects workers across many industries, including construction, manufacturing and shipping.</p>
</aside>
</section>
<section id="noise-monitoring-with-machine-learning" class="slide level2">
<h2>Noise Monitoring with Machine Learning</h2>
<p><img data-src="./img/soundsensing-solution.svg.png" style="width:100.0%" /></p>
<aside class="notes">
<p>Soundsensing helps to address these issues by providing better tools for monitoring noise, understanding the underlying causes, and what is needed to make improvements.</p>
<p>We provide easy-to-use IoT sensors that can continuously measure the noise level, as well as classify the dominant noise source over time.</p>
<p>This is presented in our online dashboard, and is also available in an API for integrating with other systems.</p>
</aside>
</section>
<section id="wireless-audio-sensor-networks" class="slide level2">
<h2>Wireless Audio Sensor Networks</h2>
<p><img data-src="img/sensornetworks.png" style="width:85.0%" /></p>
<aside class="notes">
<p>Alternative A would be to record audio in the sensor and transmit it to the cloud. This is conceptually a very simple solution, and one could use a standard neural network in the cloud to do audio classification with few computational constraints on the model.</p>
<p>However this would require a lot of data transfer, which is costly in terms of energy and data traffic in a cellular 4G system.</p>
<p>It also would be very poor for privacy, as potentially sensitive audio such as speech would have to be transported through the network and could potentially be stored on a server.</p>
<p>Alternative B would be to preprocess the data in the sensor, and classify it in the cloud. One would have to reduce the data enough to be privacy friendly and to save considerable data traffic, but not so much that classification performance suffers, which can be a difficult trade-off.</p>
<p>But the best solution both for privacy and data traffic would be the TinyML solution: do all the processing on the sensor, and only transmit data about the detected classes to the server.</p>
<p>However this means the entire model needs to fit within the constraints of the sensor device.</p>
</aside>
</section>
<section id="model-constraints" class="slide level2" data-background-image="./img/chip.jpg">
<h2 data-background-image="./img/chip.jpg">Model Constraints</h2>
<!-- <section class="level2" data-background="./img/chip.jpg">
<h2 style="">Device constraints</h2>
-->
<p>
<p>Example target: STM32L476 microcontroller. With 50% of capacity:</p>
<ul>
<li>64 kB RAM</br></li>
<li>512 kB FLASH memory</br></li>
<li>4.5 M operations/second</br></li>
</ul>
</p>
<!-- </section> -->
<aside class="notes">
<p>If we consider a typical low-power microcontroller such as an ARM Cortex-M4F, and dedicate 50% of its resources to the machine learning, that means roughly 64 kB of RAM, 512 kB of FLASH and 4.5 million operations per second are available for the model.</p>
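<p>A back-of-the-envelope sketch of how a model can be checked against such a budget, assuming int8 weights and activations and the figures from the slide; the example model numbers are purely illustrative:</p>
<pre><code class="python"># Rough feasibility check against the microcontroller budget on the slide.
RAM_BUDGET = 64 * 1024      # bytes available for activations and buffers
FLASH_BUDGET = 512 * 1024   # bytes available for weights
OPS_BUDGET = 4.5e6          # operations per second

def budget_usage(n_params, peak_activation_bytes, macs_per_inference,
                 inferences_per_second, bytes_per_weight=1):
    """Return the fraction of each budget used (values above 1.0 do not fit)."""
    flash = n_params * bytes_per_weight / FLASH_BUDGET
    ram = peak_activation_bytes / RAM_BUDGET
    # count one multiply-accumulate as two operations
    ops = 2 * macs_per_inference * inferences_per_second / OPS_BUDGET
    return {"flash": flash, "ram": ram, "ops": ops}

# Hypothetical int8 CNN: 100k parameters, one inference per second
print(budget_usage(100_000, 40_000, 2_000_000, 1))
</code></pre>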
</aside>
</section>
<section id="small-models-urbansound8k" class="slide level2" data-background-image="">
<h2 data-background-image="">Small models Urbansound8K</h2>
<figure>
<img data-src="img/urbansound8k-existing-models-logmel.png" style="width:100.0%" alt="Green: Feasible region on device. 2021 results not published." /><figcaption aria-hidden="true">Green: Feasible region on device. 2021 results not published.</figcaption>
</figure>
<aside class="notes">
<p>In work that we did in 2019, we found that existing models were 1-3 orders of magnitude too large to fit on the device.</p>
<p>And we showed that one can get to within about 10 percentage points of the unconstrained state-of-the-art models when running on such a device.</p>
<p>We have since made several improvements to close the gap further, but these are not published.</p>
<p>As far as I know, this is still the best published performance on Urbansound8k.</p>
</aside>
</section></section>
<section>
<section id="shrinking-convolutional-neural-networks-for-tinyml-audio" class="title-slide slide level1">
<h1>Shrinking </br> Convolutional Neural Networks</br> for TinyML Audio</h1>
<p>How did we make the model fit on the device?</p>
</section>
<section id="pipeline" class="slide level2">
<h2>Pipeline</h2>
<p><img data-src="img/classification-pipeline.png" style="width:50.0%" /></p>
<p>Typical audio pipeline. Spectrogram conversion, CNN on overlapped windows.</p>
<aside class="notes">
<p>Here is a typical audio classification pipeline. The input audio is at the top. It is chopped into fixed-length windows. Each audio window is converted to a spectrogram representation, usually a Mel-spectrogram. Each spectrogram window is fed to a classifier, typically a Convolutional Neural Network. And if the sound classes of interest are longer than the window length, the predictions for multiple windows are aggregated into a prediction for the whole clip.</p>
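<p>A minimal sketch of the aggregation step, assuming per-window class probabilities and using mean aggregation as one common choice; <code>predict_window</code> stands in for the CNN:</p>
<pre><code class="python">import numpy as np

def classify_clip(window_spectrograms, predict_window):
    """Combine per-window class probabilities into one prediction for the clip."""
    # predict_window(spectrogram) is assumed to return a probability vector
    probabilities = np.stack([predict_window(s) for s in window_spectrograms])
    clip_probabilities = probabilities.mean(axis=0)  # mean over windows
    return int(np.argmax(clip_probabilities))
</code></pre>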
</aside>
</section>
<section id="reduce-input-dimensionality" class="slide level2">
<h2>Reduce input dimensionality</h2>
<p><img data-src="img/input-size.svg" style="width:70.0%" /></p>
<ul>
<li>Lower sample rate</li>
<li>Lower frequency range</li>
<li>Lower frequency resolution</li>
<li>Lower time duration in window</li>
<li>Lower time resolution</li>
</ul>
<p>~10x reduction in compute. And easier to learn!</p>
<aside class="notes">
<p>The first optimization one can do is in preprocessing. The key is to use as small an input to the model as possible. If one reduces the sample rate, the range and resolution of the frequency bands, and the time duration and resolution of the window, one can make large gains.</p>
<p>This also makes it easier to learn from small datasets!</p>
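<p>A minimal sketch of such a reduced input using librosa; the sample rate, mel bands, hop length and file name are illustrative values, not the exact settings used in this work:</p>
<pre><code class="python">import numpy as np
import librosa

# "Small input" example: 16 kHz audio, 32 mel bands, coarse hop, short window
y, sr = librosa.load("clip.wav", sr=16000, mono=True, duration=0.96)
mels = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=256,
                                      n_mels=32, fmin=50, fmax=8000)
logmels = librosa.power_to_db(mels, ref=np.max)
print(logmels.shape)  # (mel bands, time frames) = the CNN input size
</code></pre>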
</aside>
</section>
<section id="reduce-overlap" class="slide level2">
<h2>Reduce overlap</h2>
<p><img data-src="img/framing.png" style="width:80.0%" /></p>
<p>Models in the literature use 95% overlap or more: a 20x penalty in inference time!</p>
<p>Often small performance benefit. Use 0% (1x) or 50% (2x).</p>
<aside class="notes">
<p>Windows are computed with overlap. This gives the model a couple of different views of the same sound, which increases performance. Typical SOTA models use maximum overlap, repeating the computation over 20 times on the same audio section.</p>
<p>The performance benefit can however be quite minor. Try 1x, 2x, 4x first.</p>
<p>Note that this increases detection latency and reduces time resolution. It might not be limiting in some cases, like keyword spotting or event detection.</p>
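<p>A small sketch of how the overlap setting translates into the number of analysis windows, and hence compute; window sizes are illustrative:</p>
<pre><code class="python">import numpy as np

def analysis_windows(spectrogram, window_frames, overlap=0.0):
    """Split a (mels, time) spectrogram into fixed-length windows.
    overlap=0.0 gives non-overlapping windows (1x work), 0.5 gives 2x, etc."""
    hop = max(1, int(round(window_frames * (1.0 - overlap))))
    n_frames = spectrogram.shape[1]
    starts = range(0, n_frames - window_frames + 1, hop)
    return [spectrogram[:, s:s + window_frames] for s in starts]

dummy = np.zeros((32, 128))
print(len(analysis_windows(dummy, 64, overlap=0.0)))  # 2 windows
print(len(analysis_windows(dummy, 64, overlap=0.5)))  # 3 windows
</code></pre>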
</aside>
</section>
<section id="use-a-small-model" class="slide level2">
<h2>Use a small model!</h2>
<!--
Based on SB-CNN (Salamon+Bello, 2016)
-->
<p><img data-src="img/models.svg" style="width:70.0%" /></p>
<aside class="notes">
<p>For many audio tasks one can get really far with a small model. For example 2-4 convolutional layers followed by 2 dense layers does well on a range of tasks.</p>
<p>One can start with a large model and then prune it, but in our experience starting with a small model is easier and works well.</p>
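<p>A minimal Keras sketch of a model in this spirit; the input shape, filter counts and layer sizes are illustrative, not the published architecture:</p>
<pre><code class="python">import tensorflow as tf
from tensorflow.keras import layers

def small_cnn(input_shape=(32, 64, 1), n_classes=10):
    """A few convolutional blocks followed by two dense layers."""
    return tf.keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])

small_cnn().summary()
</code></pre>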
</aside>
</section>
<section id="depthwise-separable-convolution" class="slide level2">
<h2>Depthwise-separable Convolution</h2>
<p><img data-src="img/depthwise-separable-convolution.png" style="width:90.0%" /></p>
<p>MobileNet, “Hello Edge”, AclNet. 3x3 kernel, 64 filters: 7.5x speedup</p>
<aside class="notes">
<p>The convolutions in the network take up most of the CPU budget.</p>
<p>A Conv2d with multiple channels actually does convolution in 3 dimensions. Width, height and channel.</p>
<p>This can be separated into two operations: first a convolution over the spatial dimensions, then a convolution over the channel dimension.</p>
<p>5-10x speedups with very little performance impact</p>
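<p>A rough sketch of the cost ratio per input channel and output position, together with the drop-in Keras layer; the exact speedup depends on the implementation:</p>
<pre><code class="python">import tensorflow as tf

def separable_speedup(kernel=3, out_channels=64):
    """Multiply-accumulates of a standard conv vs a depthwise-separable conv."""
    standard = kernel * kernel * out_channels    # spatial and channel mixing at once
    separable = kernel * kernel + out_channels   # depthwise then pointwise (1x1)
    return standard / separable

print(separable_speedup(3, 64))  # roughly 8x fewer multiply-accumulates

# Drop-in replacement for a Conv2D layer in Keras
layer = tf.keras.layers.SeparableConv2D(64, (3, 3), padding="same", activation="relu")
</code></pre>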
</aside>
</section>
<section id="downsampling-using-max-pooling" class="slide level2">
<h2>Downsampling using max-pooling</h2>
<p><img data-src="img/maxpooling.png" style="width:100.0%" /></p>
<p>Wasteful? Computing convolutions, then throwing away 3/4 of the results!</p>
<aside class="notes">
<p>In a Convolutional Neural Network one downsamples the data in the deeper layers, to operate on progressively higher-level features. This is usually done with max pooling after each convolution, which means picking the highest value within each pooling region. However this is quite wasteful, as it discards a lot of the data computed by the previous layer.</p>
</aside>
</section>
<section id="downsampling-using-strided-convolution" class="slide level2">
<h2>Downsampling using strided convolution</h2>
<p><img data-src="img/strided-convolution.png" style="width:100.0%" /></p>
<p>“Learned” downsampling. Striding 2x2: Approx 4x speedup</p>
<aside class="notes">
<p>An alternative is to drop the max pooling and instead use a stride higher than one in the convolution. This reduces the number of times the convolution is run.</p>
<p>This can sometimes perform better than max-pooling, since the downsampling is included in the learned function!</p>
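<p>A Keras sketch of the two downsampling variants; layer sizes are illustrative:</p>
<pre><code class="python">import tensorflow as tf
from tensorflow.keras import layers

# Max pooling: the convolution is computed at every position, then 3/4 discarded
pooled = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
])

# Strided convolution: the convolution itself runs at 1/4 of the positions
strided = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), strides=(2, 2), padding="same", activation="relu"),
])
</code></pre>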
</aside>
</section>
<section id="quantization" class="slide level2">
<h2>Quantization</h2>
<p><img data-src="img/quantization.png" style="width:80.0%" /></p>
<ul>
<li>Using int8 instead of float32.</li>
<li>4x improvement in weights (FLASH) and activations (RAM)</li>
<li>4.6X improvement in runtime using CMSIS-NN <em>SIMD</em></li>
</ul>
<p>Ref “CMSIS-NN: Efficient Neural Network Kernels for ARM Cortex-M CPUs”</p>
<aside class="notes">
<p>Quantizing down to 8-bit integers can be done almost without loss in classification performance, and gives a 4x improvement in FLASH and RAM usage.</p>
<p>On a Cortex-M4F one can get around a 4x improvement in CPU performance as well.</p>
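<p>A sketch of post-training int8 quantization with the TensorFlow Lite converter, shown as one example toolchain rather than necessarily the one used in this work; <code>model</code> and <code>calibration_batches</code> are assumed inputs:</p>
<pre><code class="python">import numpy as np
import tensorflow as tf

def quantize_int8(model, calibration_batches):
    """Convert a trained Keras model to a fully int8 TFLite model."""
    def representative_dataset():
        for batch in calibration_batches:
            yield [batch.astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()  # bytes of the .tflite flatbuffer
</code></pre>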
</aside>
</section>
<section id="latest-developments" class="slide level2">
<h2>Latest developments</h2>
<ul>
<li>Binary network quantization</li>
<li>Neural Architecture Search</li>
<li>Streaming inference</li>
<li>Learned filterbanks</li>
<li>Hardware acceleration</li>
<li>Learned pooling</li>
</ul>
<p>TinyML very actively researched, rapid improvements</p>
<aside class="notes">
<p>This area is very actively researched. You will find dedicated talks about many of these topics here at the TinyML Summit.</p>
</aside>
</section></section>
<section>
<section id="outro" class="title-slide slide level1">
<h1>Outro</h1>
</section>
<section id="noise-monitoring-example" class="slide level2">
<h2>Noise Monitoring example</h2>
<figure>
<img data-src="./img/noise-monitoring-report.png" style="width:50.0%" alt="Automated documentation of noise footprint wrt regulations" /><figcaption aria-hidden="true">Automated documentation of noise footprint wrt regulations</figcaption>
</figure>
<ul>
<li>Based on Noise Event Detection & Classification</li>
<li>Tested successfully at shooting range</li>
<li>Expanding now to Construction and Industry noise</li>
</ul>
<aside class="notes">
<p>Automatically creating a logbook of noisy training activities.</p>
<p>One of our customers operates a training facility for police special forces, where they fire guns and conduct explosives training. They use our system to have documentation that they follow the regulations, and to verify any noise complaints that come in.</p>
<p>We are now expanding this solution to other applications, such as Construction, Industry and Transportation.</p>
<p>If you are working in these areas and interested in testing it out, let us know.</p>
</aside>
</section>
<section id="condition-monitoring-example" class="slide level2">
<h2>Condition Monitoring example</h2>
<p><img data-src="./img/soundsensing-condition-monitoring.svg.png" style="width:100.0%" /></p>
<p>Condition Monitoring of technical equipment using sound.</br> Developed based on experience from Noise Monitoring.</p>
<aside class="notes">
<p>We have also used the same techniques to develop an Anomaly Detection system using sound, which has been tested out on pumps.</p>
<p>This means that the condition of technical equipment can be continuously monitored, freeing up time for the janitors and providing better detection of issues.</p>
<p>We are currently looking to test this in larger scale and on more types of machinery.</p>
</aside>
</section>
<section id="conclusions" class="slide level2">
<h2>Conclusions</h2>
<ol type="1">
<li>Audio classification of Environmental Noise can be done directly on the sensor</li>
<li>Made possible with a range of efficient CNN techniques</li>
<li>Integrated into Soundsensing IoT sensors</li>
<li>Used for Noise Monitoring & Condition Monitoring</li>
</ol>
</section>
<section id="section" class="slide level2" data-background="./img/soundsensing-withlogo.jpg" style="background: rgba(255, 255, 255, 0.3);">
<h2 data-background="./img/soundsensing-withlogo.jpg" style="background: rgba(255, 255, 255, 0.3);"></h2>
<p>We are open for partners and pilot projects</br> Get in touch!</br> contact@soundsensing.no</br> </br> </br></p>
<h1>
Questions ?
</h1>
<p><em>TinyML Summit 2021: Environmental Sound Classification on microcontrollers</em></p>
<p>
Jon Nordby </br>jon@soundsensing.no
</p>
</section></section>
<section id="bonus" class="title-slide slide level1">
<h1>Bonus</h1>
<p>Bonus slides after this point</p>
</section>
<section>
<section id="thesis-results" class="title-slide slide level1">
<h1>Thesis results</h1>
</section>
<section id="all-the-info" class="slide level2">
<h2>All the info</h2>
<blockquote>
<p>Thesis: Environmental Sound Classification on Microcontrollers using Convolutional Neural Networks</p>
</blockquote>
<figure>
<img data-src="./img/thesis.png" style="width:30.0%" alt="Report & Code: https://github.com/jonnor/ESC-CNN-microcontroller" /><figcaption aria-hidden="true">Report & Code: https://github.com/jonnor/ESC-CNN-microcontroller</figcaption>
</figure>
</section>
<section id="all-models" class="slide level2">
<h2>All models</h2>
<p><img data-src="img/models-list.png" /></p>
<aside class="notes">
<ul>
<li>Baseline is outside requirements</li>
<li>Rest fits the theoretical constraints</li>
<li>Sometimes had to reduce the number of base filters to 22 to fit in RAM</li>
</ul>
</aside>
</section>
<section id="model-comparison" class="slide level2">
<h2>Model comparison</h2>
<p><img data-src="img/models_accuracy.png" style="width:100.0%" /></p>
<aside class="notes">
<ul>
<li>Baseline is down from 79% to 73% relative to SB-CNN and LD-CNN. Expected because of a poorer input representation and much lower overlap.</li>
</ul>
</aside>
</section>
<section id="list-of-results" class="slide level2">
<h2>List of results</h2>
<p><img data-src="img/results.png" style="width:100.0%" /></p>
</section>
<section id="confusion" class="slide level2">
<h2>Confusion</h2>
<p><img data-src="img/confusion_test.png" style="width:70.0%" /></p>
</section>
<section id="grouped-classification" class="slide level2">
<h2>Grouped classification</h2>
<p><img data-src="img/grouped_confusion_test_foreground.png" style="width:60.0%" /></p>
<p>Foreground-only</p>
</section>
<section id="unknown-class" class="slide level2">
<h2>Unknown class</h2>
<p><img data-src="img/unknown-class.png" style="width:100.0%" /></p>
<aside class="notes">
<p>Idea: if the confidence of the model is low, consider the prediction as “unknown”</p>
<ul>
<li>Left: Histogram of correct/incorrect predictions</li>
<li>Right: Precision/recall curves</li>
<li>Precision improves at expense of recall</li>
<li>90%+ precision possible at 40% recall</li>
</ul>
<p>Usefulness:</p>
<ul>
<li>Avoids making decisions on poor grounds</li>
<li>“Unknown” samples are good candidates for labeling and adding to the dataset (Active Learning)</li>
<li>Low recall is not necessarily a problem: data is abundant, 15 samples of 4 seconds per minute per sensor (see the sketch below)</li>
</ul>
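<p>A minimal sketch of the confidence-threshold idea; the threshold value is illustrative and would be tuned on validation data:</p>
<pre><code class="python">import numpy as np

def predict_with_unknown(probabilities, threshold=0.6):
    """Return the most likely class, or "unknown" if confidence is too low."""
    best = int(np.argmax(probabilities))
    if probabilities[best] >= threshold:
        return best
    return "unknown"
</code></pre>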
</aside>
</section></section>
<section>
<section id="thesis-methods" class="title-slide slide level1">
<h1>Thesis Methods</h1>
<p>Standard procedure for Urbansound8k</p>
<ul>
<li>Classification problem</li>
<li>4 second sound clips</li>
<li>10 classes</li>
<li>10-fold cross-validation, predefined</li>
<li>Metric: Accuracy</li>
</ul>
</section>
<section id="training-settings" class="slide level2">
<h2>Training settings</h2>
<p><img data-src="img/training-settings.png" /></p>
</section>
<section id="training" class="slide level2">
<h2>Training</h2>
<ul>
<li>NVidia RTX2060 GPU 6 GB</li>
<li>10 models x 10 folds = 100 training jobs</li>
<li>100 epochs</li>
<li>3 jobs in parallel</li>
<li>36 hours total</li>
</ul>
<aside class="notes">
<ul>
<li>! GPU utilization only 15%</li>
<li>CPU utilization was near 100%</li>
<li>Larger models to utilize GPU better?</li>
<li>Parallel processing limited by RAM of biggest models</li>
<li>GPU-based augmentation might be faster</li>
</ul>
</aside>
</section>
<section id="evaluation" class="slide level2">
<h2>Evaluation</h2>
<p>For each fold of each model</p>
<ol type="1">
<li>Select best model based on validation accuracy</li>
<li>Calculate accuracy on test set</li>
</ol>
<p>For each model</p>
<ul>
<li>Measure CPU time on device</li>
</ul>
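<p>A sketch of the accuracy part of this protocol; <code>fold_checkpoints</code> and <code>test_accuracy</code> are assumed helpers, not part of the published code:</p>
<pre><code class="python"># fold_checkpoints: for each fold, a list of (validation_accuracy, model) pairs
# test_accuracy(model, fold_id): accuracy of a model on that fold's test set
def evaluate_model(fold_checkpoints, test_accuracy):
    scores = []
    for fold_id, checkpoints in enumerate(fold_checkpoints):
        best_val, best_model = max(checkpoints, key=lambda pair: pair[0])
        scores.append(test_accuracy(best_model, fold_id))
    return sum(scores) / len(scores)  # mean test accuracy over the folds
</code></pre>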
</section>
<section id="mel-spectrogram" class="slide level2">
<h2>Mel-spectrogram</h2>
<p><img data-src="img/spectrograms.svg" /></p>
</section>
<section id="more-resources" class="slide level2">
<h2>More resources</h2>
<p>Machine Hearing. ML on Audio</p>
<ul>
<li><a href="https://github.com/jonnor/machinehearing">github.com/jonnor/machinehearing</a></li>
</ul>
<p>Machine Learning for Embedded / IoT</p>
<ul>
<li><a href="https://github.com/jonnor/embeddedml">github.com/jonnor/embeddedml</a></li>
</ul>
<p>Thesis Report & Code</p>
<ul>
<li><a href="https://github.com/jonnor/ESC-CNN-microcontroller">github.com/jonnor/ESC-CNN-microcontroller</a></li>
</ul>
</section></section>
</div>
</div>
<script src="https://unpkg.com/reveal.js@^4//dist/reveal.js"></script>
<!-- reveal.js plugins -->
<script src="https://unpkg.com/reveal.js@^4//plugin/notes/notes.js"></script>
<script src="https://unpkg.com/reveal.js@^4//plugin/search/search.js"></script>
<script src="https://unpkg.com/reveal.js@^4//plugin/zoom/zoom.js"></script>
<script>
// Full list of configuration options available at:
// https://revealjs.com/config/
Reveal.initialize({
// Push each slide change to the browser history
history: true,
// The "normal" size of the presentation, aspect ratio will be preserved
// when the presentation is scaled to fit different resolutions. Can be
// specified using percentage units.
width: 1920,
height: 1080,
// Factor of the display size that should remain empty around the content
margin: 0,
// reveal.js plugins
plugins: [
RevealNotes,
RevealSearch,
RevealZoom
]
});
</script>
</body>
</html>