Restructure webdataset benchmark setup and add new results (activeloopai#767)

* restructure setup and add new results
haiyangdeperci authored Apr 13, 2021
1 parent d9892be commit 705edf3
Showing 1 changed file with 81 additions and 18 deletions.
99 changes: 81 additions & 18 deletions benchmarks/webdataset_hub_benchmarks.ipynb
@@ -56,7 +56,7 @@
"source": [
"## Installing the Dependencies\n",
"\n",
"First of all, we gather all the dependencies as instructed by the [tmbdev/pytorch-imagenet-wds](https://github.com/tmbdev/pytorch-imagenet-wds) repository in order to set up the environment."
"First of all, we gather all the dependencies as instructed by the [tmbdev/pytorch-imagenet-wds](https://github.com/tmbdev/pytorch-imagenet-wds) repository in order to set up the environment. The hub, torch and webdataset versions are specified for reproducibility."
]
},
{
@@ -67,15 +67,15 @@
},
"outputs": [],
"source": [
"!pip install hub\n",
"!pip install webdataset\n",
"!pip install hub==1.2.3\n",
"!pip install torch==1.8.1\n",
"!pip install webdataset==0.1.54\n",
"!pip install torchvision\n",
"!pip install braceexpand\n",
"!pip install numpy\n",
"!pip install scipy\n",
"!pip install tk\n",
"!pip install matplotlib\n",
"!pip install torch\n",
"!pip install torchvision"
"!pip install matplotlib"
]
},
{
@@ -106,7 +106,8 @@
"import webdataset as wds\n",
"import torch\n",
"import torchvision\n",
"import time"
"import time\n",
"import numpy as np"
]
},
{
@@ -209,7 +210,7 @@
"source": [
"## Preparing Hub Dataset\n",
"\n",
"The dataset in hub format just needs to be pulled from S3 bucket to the local instance."
"The dataset in hub format just needs to be pulled from S3 bucket to the local instance. It can also be directly streamed from S3."
]
},
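As a rough illustration of this step (the corresponding code cell is collapsed in this diff), a Hub 1.x dataset can be opened either from a local copy or straight from S3. The paths and the exact hub.Dataset call below are assumptions made for the sketch, not the notebook's own code:

import hub

# Placeholder paths only -- the real URLs are defined elsewhere in the notebook.
local_ds = hub.Dataset("./data/imagenet-hub", mode="r")             # copy pulled from S3 to the local instance
streamed_ds = hub.Dataset("s3://my-bucket/imagenet-hub", mode="r")  # read directly from S3, no download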
{
@@ -239,7 +240,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We define the parameters we want to test the methods with."
"We define the parameters with which we want to test the functions."
]
},
{
@@ -249,7 +250,7 @@
"outputs": [],
"source": [
"WORKERS = [24, 16, 8, 4]\n",
"BATCH_SIZE = 1000\n",
"batch_size = 1000\n",
"\n",
"\n",
"def employ(workers):\n",
@@ -258,7 +259,7 @@
" times = []\n",
" for n in workers:\n",
" times.append(f(*args, n))\n",
" return times\n",
" return np.round(times, 3)\n",
" return wrapper\n",
" return decorator"
]
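Since the middle of this cell is collapsed in the diff, the full employ decorator is roughly the following. This is a sketch reconstructed from the visible fragments, not necessarily the exact notebook code:

import numpy as np

WORKERS = [24, 16, 8, 4]

def employ(workers):
    # Run the decorated timing function once per worker count and round the results.
    def decorator(f):
        def wrapper(*args):
            times = []
            for n in workers:
                times.append(f(*args, n))
            return np.round(times, 3)
        return wrapper
    return decorator

A timing function decorated with @employ(WORKERS) then returns one rounded measurement per worker count when called with its remaining arguments.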
@@ -294,7 +295,7 @@
"source": [
"# Timing the Read Access of Hub converted to PyTorch\n",
"\n",
"Since WebDataset is based on PyTorch and Hub offers PyTorch integration, it would be useful to compare Hub's performance when converted to PyTorch as well."
"Since WebDataset is based on PyTorch and Hub offers PyTorch integration, it would be useful to compare Hub's performance when converted to PyTorch locally."
]
},
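The timing cells themselves are collapsed in this diff. In the spirit of the benchmark, a decorated Hub timing helper might look roughly like the sketch below; the to_pytorch() conversion, the plain read loop, and the argument order are assumptions rather than the notebook's exact code:

import time
import hub
from torch.utils.data import DataLoader

@employ(WORKERS)
def time_hub(url, batch_size, num_workers):
    # Convert the Hub dataset to a PyTorch dataset and read every batch once.
    ds = hub.Dataset(url, mode="r").to_pytorch()
    loader = DataLoader(ds, batch_size=batch_size, num_workers=num_workers)
    start = time.time()
    for _ in loader:
        pass  # consume the batch; only read throughput is measured
    return time.time() - start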
{
@@ -341,7 +342,7 @@
{
"data": {
"text/plain": [
"[232.54446959495544, 252.8528025150299, 235.36578059196472, 198.7422969341278]"
"array([232.544, 252.853, 235.366, 198.742])"
]
},
"execution_count": null,
@@ -350,7 +351,7 @@
}
],
"source": [
"time_webdataset(webdataset_url, BATCH_SIZE)"
"time_webdataset(webdataset_url, batch_size)"
]
},
{
@@ -361,7 +362,7 @@
{
"data": {
"text/plain": [
"[1865.55198264122, 2710.206123113632, 2422.845787525177, 3368.512204885483]"
"array([408.312, 375.634, 417.064, 477.035])"
]
},
"execution_count": null,
@@ -370,15 +371,77 @@
}
],
"source": [
"time_hub(hub_url, BATCH_SIZE)"
"time_hub(hub_url, batch_size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"Hub at least 8x slower than Webdataset."
"To improve Hub's performance, we use the remote version of Hub with a smaller batch size."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"batch_size = 96"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([270.917, 254.519, 251.943, 289.542])"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"time_hub(hub_url, batch_size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also test Hub on streaming data remotely from S3."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1688.301, 2683.032, 4825.543, 7982.483])"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"time_hub(s3_url, batch_size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is that Webdataset is 1.007-2.400x faster than Hub, depending on the configurations. Essentially, their performance is roughly the same, with a minor advantage of Webdataset, however given how much time is saved by avoiding any preprocessing with Hub, it is a more optimal choice for most dataset users."
]
}
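The quoted 1.007-2.400x range can be reproduced from the timings above by dividing the local Hub measurements (batch sizes 1000 and 96) by the Webdataset measurements for each worker count; the S3-streaming run appears to be excluded from this figure. A quick check:

webdataset = [232.544, 252.853, 235.366, 198.742]
hub_bs1000 = [408.312, 375.634, 417.064, 477.035]
hub_bs96 = [270.917, 254.519, 251.943, 289.542]

ratios = [h / w
          for hub_times in (hub_bs1000, hub_bs96)
          for h, w in zip(hub_times, webdataset)]
print(round(min(ratios), 3), round(max(ratios), 3))  # -> 1.007 2.4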
],
