Restructure webdataset benchmark setup and add new results (activeloopai#767)

* restructure setup and add new results
haiyangdeperci authored Apr 13, 2021
1 parent d9892be commit 705edf3
Showing 1 changed file with 81 additions and 18 deletions.
99 changes: 81 additions & 18 deletions benchmarks/webdataset_hub_benchmarks.ipynb
@@ -56,7 +56,7 @@
"source": [
"## Installing the Dependencies\n",
"\n",
"First of all, we gather all the dependencies as instructed by the [tmbdev/pytorch-imagenet-wds](https://github.com/tmbdev/pytorch-imagenet-wds) repository in order to set up the environment."
"First of all, we gather all the dependencies as instructed by the [tmbdev/pytorch-imagenet-wds](https://github.com/tmbdev/pytorch-imagenet-wds) repository in order to set up the environment. The hub, torch and webdataset versions are specified for reproducibility."
]
},
{
@@ -67,15 +67,15 @@
},
"outputs": [],
"source": [
"!pip install hub\n",
"!pip install webdataset\n",
"!pip install hub==1.2.3\n",
"!pip install torch==1.8.1\n",
"!pip install webdataset==0.1.54\n",
"!pip install torchvision\n",
"!pip install braceexpand\n",
"!pip install numpy\n",
"!pip install scipy\n",
"!pip install tk\n",
"!pip install matplotlib\n",
"!pip install torch\n",
"!pip install torchvision"
"!pip install matplotlib"
]
},
{
@@ -106,7 +106,8 @@
"import webdataset as wds\n",
"import torch\n",
"import torchvision\n",
"import time"
"import time\n",
"import numpy as np"
]
},
{
@@ -209,7 +210,7 @@
"source": [
"## Preparing Hub Dataset\n",
"\n",
"The dataset in hub format just needs to be pulled from S3 bucket to the local instance."
"The dataset in hub format just needs to be pulled from S3 bucket to the local instance. It can also be directly streamed from S3."
]
},
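As a rough illustration of this step (the corresponding code cell is collapsed in this diff), a Hub 1.x dataset can be opened either from a local copy or straight from S3. The paths and the exact hub.Dataset call below are assumptions made for the sketch, not the notebook's own code:

import hub

# Placeholder paths only -- the real URLs are defined elsewhere in the notebook.
local_ds = hub.Dataset("./data/imagenet-hub", mode="r")             # copy pulled from S3 to the local instance
streamed_ds = hub.Dataset("s3://my-bucket/imagenet-hub", mode="r")  # read directly from S3, no download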
{
@@ -239,7 +240,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We define the parameters we want to test the methods with."
"We define the parameters with which we want to test the functions."
]
},
{
@@ -249,7 +250,7 @@
"outputs": [],
"source": [
"WORKERS = [24, 16, 8, 4]\n",
"BATCH_SIZE = 1000\n",
"batch_size = 1000\n",
"\n",
"\n",
"def employ(workers):\n",
@@ -258,7 +259,7 @@
" times = []\n",
" for n in workers:\n",
" times.append(f(*args, n))\n",
" return times\n",
" return np.round(times, 3)\n",
" return wrapper\n",
" return decorator"
]
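Since the middle of this cell is collapsed in the diff, the full employ decorator is roughly the following. This is a sketch reconstructed from the visible fragments, not necessarily the exact notebook code:

import numpy as np

WORKERS = [24, 16, 8, 4]

def employ(workers):
    # Run the decorated timing function once per worker count and round the results.
    def decorator(f):
        def wrapper(*args):
            times = []
            for n in workers:
                times.append(f(*args, n))
            return np.round(times, 3)
        return wrapper
    return decorator

A timing function decorated with @employ(WORKERS) then returns one rounded measurement per worker count when called with its remaining arguments.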
@@ -294,7 +295,7 @@
"source": [
"# Timing the Read Access of Hub converted to PyTorch\n",
"\n",
"Since WebDataset is based on PyTorch and Hub offers PyTorch integration, it would be useful to compare Hub's performance when converted to PyTorch as well."
"Since WebDataset is based on PyTorch and Hub offers PyTorch integration, it would be useful to compare Hub's performance when converted to PyTorch locally."
]
},
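The timing cells themselves are collapsed in this diff. In the spirit of the benchmark, a decorated Hub timing helper might look roughly like the sketch below; the to_pytorch() conversion, the plain read loop, and the argument order are assumptions rather than the notebook's exact code:

import time
import hub
from torch.utils.data import DataLoader

@employ(WORKERS)
def time_hub(url, batch_size, num_workers):
    # Convert the Hub dataset to a PyTorch dataset and read every batch once.
    ds = hub.Dataset(url, mode="r").to_pytorch()
    loader = DataLoader(ds, batch_size=batch_size, num_workers=num_workers)
    start = time.time()
    for _ in loader:
        pass  # consume the batch; only read throughput is measured
    return time.time() - start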
{
@@ -341,7 +342,7 @@
{
"data": {
"text/plain": [
"[232.54446959495544, 252.8528025150299, 235.36578059196472, 198.7422969341278]"
"array([232.544, 252.853, 235.366, 198.742])"
]
},
"execution_count": null,
@@ -350,7 +351,7 @@
}
],
"source": [
"time_webdataset(webdataset_url, BATCH_SIZE)"
"time_webdataset(webdataset_url, batch_size)"
]
},
{
@@ -361,7 +362,7 @@
{
"data": {
"text/plain": [
"[1865.55198264122, 2710.206123113632, 2422.845787525177, 3368.512204885483]"
"array([408.312, 375.634, 417.064, 477.035])"
]
},
"execution_count": null,
@@ -370,15 +371,77 @@
}
],
"source": [
"time_hub(hub_url, BATCH_SIZE)"
"time_hub(hub_url, batch_size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"Hub at least 8x slower than Webdataset."
"To improve Hub's performance, we use the remote version of Hub with a smaller batch size."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"batch_size = 96"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([270.917, 254.519, 251.943, 289.542])"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"time_hub(hub_url, batch_size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also test Hub on streaming data remotely from S3."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1688.301, 2683.032, 4825.543, 7982.483])"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"time_hub(s3_url, batch_size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is that Webdataset is 1.007-2.400x faster than Hub, depending on the configurations. Essentially, their performance is roughly the same, with a minor advantage of Webdataset, however given how much time is saved by avoiding any preprocessing with Hub, it is a more optimal choice for most dataset users."
]
}
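The quoted 1.007-2.400x range can be reproduced from the timings above by dividing the local Hub measurements (batch sizes 1000 and 96) by the Webdataset measurements for each worker count; the S3-streaming run appears to be excluded from this figure. A quick check:

webdataset = [232.544, 252.853, 235.366, 198.742]
hub_bs1000 = [408.312, 375.634, 417.064, 477.035]
hub_bs96 = [270.917, 254.519, 251.943, 289.542]

ratios = [h / w
          for hub_times in (hub_bs1000, hub_bs96)
          for h, w in zip(hub_times, webdataset)]
print(round(min(ratios), 3), round(max(ratios), 3))  # -> 1.007 2.4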
],
