This repository consists of:
- vision.datasets : Data loaders for popular vision datasets
- vision.models : Definitions for popular model architectures, such as AlexNet, VGG, and ResNet, along with pre-trained models.
- vision.transforms : Common image transformations such as random crop, rotations etc.
- vision.utils : Utilities such as saving a tensor (3 x H x W) as an image to disk, or creating a grid of images from a mini-batch.
Anaconda:
conda install torchvision -c soumith
pip:
pip install torchvision
From source:
python setup.py install
The following dataset loaders are available:
- MNIST and FashionMNIST
- COCO (Captioning and Detection)
- LSUN Classification
- ImageFolder
- Imagenet-12
- CIFAR10 and CIFAR100
- STL10
- SVHN
- PhotoTour
Datasets have the API:
- __getitem__
- __len__
They all subclass torch.utils.data.Dataset. Hence, they can all be loaded in parallel by several workers (Python multiprocessing) using the standard torch.utils.data.DataLoader.
For example:
torch.utils.data.DataLoader(coco_cap, batch_size=args.batchSize, shuffle=True, num_workers=args.nThreads)
In the constructor, each dataset has a slightly different API as needed, but they all take the keyword args:
- transform - a function that takes in an image and returns a transformed version. Common examples are ToTensor, RandomCrop, etc. These can be composed together with transforms.Compose (see transforms section below).
- target_transform - a function that takes in the target and transforms it. For example, take in the caption string and return a tensor of word indices.
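As a rough illustration of how the dataset API and the two keyword args fit together, here is a minimal pure-Python sketch. ToyCaptionDataset, the toy vocabulary and the sample data are all invented for this example and are not part of torchvision; a real dataset would subclass torch.utils.data.Dataset and return tensors.

```python
class ToyCaptionDataset:
    """Hypothetical in-memory dataset; not part of torchvision."""

    def __init__(self, samples, transform=None, target_transform=None):
        self.samples = samples  # list of (image, target) pairs
        self.transform = transform
        self.target_transform = target_transform

    def __getitem__(self, index):
        img, target = self.samples[index]
        # Apply the optional transforms independently to image and target.
        if self.transform is not None:
            img = self.transform(img)
        if self.target_transform is not None:
            target = self.target_transform(target)
        return img, target

    def __len__(self):
        return len(self.samples)


# Toy vocabulary and data, invented for this example.
vocab = {"a": 0, "plane": 1, "mountain": 2}
samples = [
    ([[0, 255], [128, 64]], "a plane"),
    ([[10, 20], [30, 40]], "a mountain"),
]

dataset = ToyCaptionDataset(
    samples,
    # Scale pixel values from [0, 255] to [0, 1].
    transform=lambda img: [[p / 255.0 for p in row] for row in img],
    # Map the caption string to a list of word indices.
    target_transform=lambda caption: [vocab[w] for w in caption.split()],
)
img, target = dataset[0]
```

Because the object only needs __getitem__ and __len__, anything shaped like this can be indexed and measured the same way as the built-in datasets.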
dset.MNIST(root, train=True, transform=None, target_transform=None, download=False)
dset.FashionMNIST(root, train=True, transform=None, target_transform=None, download=False)
- root : root directory of dataset where processed/training.pt and processed/test.pt exist
- train : True - use training set, False - use test set
- transform : transform to apply to input images
- target_transform : transform to apply to targets (class labels)
- download : whether to download the MNIST data
The COCO datasets require the COCO API to be installed.
dset.CocoCaptions(root="dir where images are", annFile="json annotation file", [transform, target_transform])
Example:
import torchvision.datasets as dset
import torchvision.transforms as transforms
cap = dset.CocoCaptions(root = 'dir where images are',
                        annFile = 'json annotation file',
                        transform=transforms.ToTensor())
print('Number of samples: ', len(cap))
img, target = cap[3] # load 4th sample
print("Image Size: ", img.size())
print(target)
Output:
Number of samples: 82783
Image Size: (3L, 427L, 640L)
[u'A plane emitting smoke stream flying over a mountain.',
u'A plane darts across a bright blue sky behind a mountain covered in snow',
u'A plane leaves a contrail above the snowy mountain top.',
u'A mountain that has a plane flying overheard in the distance.',
u'A mountain view with a plume of smoke in the background']
dset.CocoDetection(root="dir where images are", annFile="json annotation file", [transform, target_transform])
dset.LSUN(db_path, classes='train', [transform, target_transform])
- db_path = root directory for the database files
- classes =
  - 'train' - all categories, training set
  - 'val' - all categories, validation set
  - 'test' - all categories, test set
  - ['bedroom_train', 'church_train', ...] : a list of categories to load
dset.CIFAR10(root, train=True, transform=None, target_transform=None, download=False)
dset.CIFAR100(root, train=True, transform=None, target_transform=None, download=False)
- root : root directory of dataset where there is folder cifar-10-batches-py
- train : True = Training set, False = Test set
- download : True = downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, does not do anything.
dset.STL10(root, split='train', transform=None, target_transform=None, download=False)
- root : root directory of dataset where there is folder stl10_binary
- split : 'train' = Training set, 'test' = Test set, 'unlabeled' = Unlabeled set, 'train+unlabeled' = Training + Unlabeled set (missing label marked as -1)
- download : True = downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, does not do anything.
dset.SVHN(root, split='train', transform=None, target_transform=None, download=False)
- root : root directory of dataset where there is folder SVHN
- split : 'train' = Training set, 'test' = Test set, 'extra' = Extra training set
- download : True = downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, does not do anything.
ImageFolder is a generic data loader where the images are arranged in this way:
root/dog/xxx.png
root/dog/xxy.png
root/dog/xxz.png
root/cat/123.png
root/cat/nsdf3.png
root/cat/asd932_.png
dset.ImageFolder(root="root folder path", [transform, target_transform])
It has the members:
- self.classes - The class names as a list
- self.class_to_idx - Corresponding class indices
- self.imgs - The list of (image path, class-index) tuples
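To make the three members concrete, the following sketch derives them from a one-folder-per-class layout using only the standard library. scan_image_folder is a hypothetical helper written for this example, not the library's implementation; it assumes class names are sorted alphabetically and does not filter by file extension.

```python
import os
import tempfile


def scan_image_folder(root):
    """Hypothetical helper: derive classes, class_to_idx and imgs from root."""
    # One sub-directory per class; sorted so that indices are deterministic.
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    class_to_idx = {name: i for i, name in enumerate(classes)}
    imgs = []
    for name in classes:
        class_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(class_dir)):
            imgs.append((os.path.join(class_dir, fname), class_to_idx[name]))
    return classes, class_to_idx, imgs


# Build a small example layout in a temporary directory (empty files).
root = tempfile.mkdtemp()
for rel in ["dog/xxx.png", "dog/xxy.png", "cat/123.png"]:
    path = os.path.join(root, *rel.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

classes, class_to_idx, imgs = scan_image_folder(root)
```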
Imagenet-12 is simply implemented with an ImageFolder dataset.
The data is preprocessed as described here
PhotoTour: Learning Local Image Descriptors Data, http://phototour.cs.washington.edu/patches/default.htm
import torchvision.datasets as dset
import torchvision.transforms as transforms
dataset = dset.PhotoTour(root = 'dir where images are',
                         name = 'name of the dataset to load',
                         transform=transforms.ToTensor())
print('Loaded PhotoTour: {} with {} images.'
      .format(dataset.name, len(dataset.data)))
The models subpackage contains definitions for the following model architectures:
- AlexNet: AlexNet variant from the "One weird trick" paper.
- VGG: VGG-11, VGG-13, VGG-16, VGG-19 (with and without batch normalization)
- ResNet: ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152
- SqueezeNet: SqueezeNet 1.0, and SqueezeNet 1.1
- DenseNet: DenseNet-121, DenseNet-169, DenseNet-201 and DenseNet-161
- Inception v3 : Inception v3
You can construct a model with random weights by calling its constructor:
import torchvision.models as models
resnet18 = models.resnet18()
alexnet = models.alexnet()
vgg16 = models.vgg16()
squeezenet = models.squeezenet1_0()
densenet = models.densenet161()
inception = models.inception_v3()
We provide pre-trained models for the ResNet variants, SqueezeNet 1.0 and 1.1,
AlexNet, VGG, Inception v3 and DenseNet using the PyTorch model zoo.
These can be constructed by passing pretrained=True:
import torchvision.models as models
resnet18 = models.resnet18(pretrained=True)
alexnet = models.alexnet(pretrained=True)
squeezenet = models.squeezenet1_0(pretrained=True)
vgg16 = models.vgg16(pretrained=True)
densenet = models.densenet161(pretrained=True)
inception = models.inception_v3(pretrained=True)
All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224.
The images have to be loaded into a range of [0, 1] and then normalized using mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225].
An example of such normalization can be found in the imagenet example here
Transforms are common image transforms. They can be chained together using transforms.Compose.
One can compose several transforms together. For example:
transform = transforms.Compose([
    transforms.RandomSizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean = [ 0.485, 0.456, 0.406 ],
                         std = [ 0.229, 0.224, 0.225 ]),
])
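Conceptually, composing transforms just means calling each one in order on the output of the previous one. The sketch below shows that idea with plain callables; ComposeSketch is a hypothetical stand-in written for this example, not the library class.

```python
class ComposeSketch:
    """Hypothetical stand-in that chains callables left to right."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, img):
        # Feed the output of each transform into the next one.
        for t in self.transforms:
            img = t(img)
        return img


add_one = lambda x: x + 1
double = lambda x: x * 2

pipeline = ComposeSketch([add_one, double])
result = pipeline(3)            # (3 + 1) * 2

reversed_pipeline = ComposeSketch([double, add_one])
other = reversed_pipeline(3)    # (3 * 2) + 1
```

The two pipelines give different results, which is why the order matters in practice: for instance, cropping and flipping operate on a PIL.Image and so have to come before ToTensor, while Normalize operates on a tensor and has to come after it.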
Scale rescales the input PIL.Image to the given 'size'.
If 'size' is a 2-element tuple or list in the order of (width, height), the image is rescaled to exactly that size.
If 'size' is a number, it indicates the size of the smaller edge. For example, if height > width, then the image will be rescaled to (size * height / width, size).
- size: size of the smaller edge
- interpolation: Default: PIL.Image.BILINEAR
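The size rule above can be worked through with a small helper. scaled_size is a hypothetical function written for this example; it returns (width, height), and truncating the longer edge to an integer is an assumption of the sketch.

```python
def scaled_size(width, height, size):
    """Hypothetical helper: output (width, height) for a given 'size'."""
    if isinstance(size, (tuple, list)):
        # An explicit (width, height) pair is used as-is.
        return tuple(size)
    if width <= height:
        # Width is the smaller edge: it becomes 'size', height keeps the ratio.
        return size, int(size * height / width)
    # Height is the smaller edge.
    return int(size * width / height), size


portrait = scaled_size(480, 640, 240)    # height > width
landscape = scaled_size(640, 480, 240)   # width > height
exact = scaled_size(100, 200, (50, 60))  # explicit target size
```

For the portrait case the new height is 240 * 640 / 480 = 320, matching the size * height / width expression in the text.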
CenterCrop crops the given PIL.Image at the center to have a region of the given size. size can be a tuple (target_height, target_width) or an integer, in which case the target will be of a square shape (size, size).
RandomCrop crops the given PIL.Image at a random location to have a region of the given size. size can be a tuple (target_height, target_width) or an integer, in which case the target will be of a square shape (size, size).
If padding is non-zero, then the image is first zero-padded on each side with padding pixels.
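The pad-then-crop behaviour can be sketched on a plain nested list standing in for a single-channel image. random_crop is a hypothetical helper for illustration only; the real transform works on a PIL.Image.

```python
import random


def random_crop(img, size, padding=0, rng=random):
    """Hypothetical helper: crop a 2D list at a random location."""
    th, tw = (size, size) if isinstance(size, int) else size
    if padding > 0:
        # Zero-pad every side before choosing the crop window.
        width = len(img[0])
        blank = [0] * (width + 2 * padding)
        img = (
            [list(blank) for _ in range(padding)]
            + [[0] * padding + list(row) + [0] * padding for row in img]
            + [list(blank) for _ in range(padding)]
        )
    h, w = len(img), len(img[0])
    # randint is inclusive on both ends, so the window always fits.
    top = rng.randint(0, h - th)
    left = rng.randint(0, w - tw)
    return [row[left:left + tw] for row in img[top:top + th]]


img = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4 x 4 image
crop = random_crop(img, (2, 3), padding=1, rng=random.Random(0))
square = random_crop(img, 2, rng=random.Random(1))
```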
RandomHorizontalFlip randomly flips the given PIL.Image horizontally with a probability of 0.5.
RandomSizedCrop crops the given PIL.Image to a random size of (0.08 to 1.0) of the original size and a random aspect ratio of 3/4 to 4/3 of the original aspect ratio.
This is popularly used to train the Inception networks.
- size: size of the smaller edge
- interpolation: Default: PIL.Image.BILINEAR
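One way to picture the sampling step is sketched below. sample_crop_size is a hypothetical helper, and several details are assumptions of this example rather than statements about the library: the ratio is treated as the crop's own width/height ratio, up to ten attempts are made, and the fallback is a square crop of the smaller edge. The sampled region would then be resized to the requested size.

```python
import math
import random


def sample_crop_size(width, height, rng, attempts=10):
    """Hypothetical helper: pick a crop (w, h) that fits inside the image."""
    area = width * height
    for _ in range(attempts):
        # Area fraction in [0.08, 1.0] and aspect ratio in [3/4, 4/3].
        target_area = rng.uniform(0.08, 1.0) * area
        aspect_ratio = rng.uniform(3.0 / 4.0, 4.0 / 3.0)
        w = int(round(math.sqrt(target_area * aspect_ratio)))
        h = int(round(math.sqrt(target_area / aspect_ratio)))
        if 0 < w <= width and 0 < h <= height:
            return w, h
    # Fallback (assumption): a square crop of the smaller edge.
    side = min(width, height)
    return side, side


rng = random.Random(0)
sizes = [sample_crop_size(640, 480, rng) for _ in range(200)]
```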
Pad pads the given image on each side with padding number of pixels, and the padding pixels are filled with pixel value fill. If a 5x5 image is padded with padding=1 then it becomes 7x7.
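The 5x5 to 7x7 example can be checked with a nested list standing in for a single-channel image. The pad function here is a hypothetical helper written for this sketch, not the library transform.

```python
def pad(img, padding, fill=0):
    """Hypothetical helper: add 'padding' pixels of 'fill' on every side."""
    width = len(img[0])
    border = [fill] * (width + 2 * padding)
    padded = [list(border) for _ in range(padding)]          # top rows
    for row in img:
        padded.append([fill] * padding + list(row) + [fill] * padding)
    padded.extend(list(border) for _ in range(padding))      # bottom rows
    return padded


img = [[1] * 5 for _ in range(5)]        # 5 x 5 image of ones
padded = pad(img, padding=1, fill=0)     # 7 x 7, with a border of zeros
```

Each side grows by padding pixels, so both dimensions grow by 2 * padding.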
Given mean: (R, G, B) and std: (R, G, B), Normalize will normalize each channel of the torch.*Tensor, i.e. channel = (channel - mean) / std
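The per-channel formula can be worked through on a tiny example. The sketch below uses nested lists in (C x H x W) order instead of a tensor, and normalize is a hypothetical helper written for this example; the mean and std values are the ones quoted for the pre-trained models.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize(channels, mean, std):
    """Hypothetical helper: apply (channel - mean) / std per channel."""
    return [
        [[(p - m) / s for p in row] for row in channel]
        for channel, m, s in zip(channels, mean, std)
    ]


# A 3 x 1 x 2 image whose values are already in [0, 1].
image = [[[0.485, 1.0]], [[0.456, 0.0]], [[0.406, 0.5]]]
normalized = normalize(image, MEAN, STD)
```

A pixel equal to the channel mean maps to 0, and one standard deviation above the mean maps to 1.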
- ToTensor() - Converts a PIL.Image (RGB) or numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0]
- ToPILImage() - Converts a torch.*Tensor of range [0, 1] and shape C x H x W or numpy ndarray of dtype=uint8, range [0, 255] and shape H x W x C to a PIL.Image of range [0, 255]
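The layout and range change performed by ToTensor() can be illustrated without any tensor library. to_chw_float is a hypothetical helper that works on nested lists; it only mirrors the (H x W x C) to (C x H x W) reordering and the division by 255.

```python
def to_chw_float(hwc):
    """Hypothetical helper: (H x W x C) in [0, 255] to (C x H x W) in [0, 1]."""
    height, width, channels = len(hwc), len(hwc[0]), len(hwc[0][0])
    return [
        [[hwc[y][x][c] / 255.0 for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# A 1 x 2 image: one red pixel followed by one green pixel.
hwc = [[[255, 0, 0], [0, 255, 0]]]
chw = to_chw_float(hwc)
```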
Given a Python lambda, Lambda applies it to the input img and returns the result.
For example:
transforms.Lambda(lambda x: x.add(10))
Given a 4D mini-batch Tensor of shape (B x C x H x W), or a list of images all of the same size, make_grid makes a grid of images.
normalize=True will shift the image to the range (0, 1), by subtracting the minimum and dividing by the maximum pixel value.
If range=(min, max) where min and max are numbers, then these numbers are used to normalize the image.
scale_each=True will scale each image in the batch of images separately rather than computing the (min, max) over all images.
pad_value=<float> sets the value for the padded pixels.
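How the normalize, range and scale_each options interact can be sketched on flat lists of pixel values. norm_range and normalize_batch are hypothetical helpers written for this example; clamping values to the given range before rescaling is an assumption of the sketch, and it assumes max > min.

```python
def norm_range(values, value_range=None):
    """Hypothetical helper: shift values to [0, 1] using (min, max)."""
    if value_range is not None:
        lo, hi = value_range
    else:
        lo, hi = min(values), max(values)
    # Clamp to the range first (assumption), then subtract min and rescale.
    clamped = [min(max(v, lo), hi) for v in values]
    return [(v - lo) / (hi - lo) for v in clamped]


def normalize_batch(batch, value_range=None, scale_each=False):
    """Hypothetical helper: normalize a batch jointly or image by image."""
    if scale_each:
        return [norm_range(img, value_range) for img in batch]
    if value_range is None:
        # One (min, max) computed over all images in the batch.
        flat = [v for img in batch for v in img]
        value_range = (min(flat), max(flat))
    return [norm_range(img, value_range) for img in batch]


batch = [[0, 5, 10], [10, 20, 30]]
joint = normalize_batch(batch)                       # shared min and max
each = normalize_batch(batch, scale_each=True)       # per-image min and max
fixed = normalize_batch(batch, value_range=(0, 20))  # user-supplied range
```

With scale_each every image spans the full (0, 1) range on its own, whereas joint normalization preserves relative brightness between images.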
Example usage is given in this notebook
save_image(tensor, filename, nrow=8, padding=2, normalize=False, range=None, scale_each=False, pad_value=0)
Saves a given Tensor into an image file.
If given a mini-batch tensor, will save the tensor as a grid of images.
All options after filename are passed through to make_grid. Refer to its documentation for more details.