How to build your first image classifier using PyTorch
We need something more state-of-the-art, some method which can truly be called deep learning. This tutorial will present just such a deep learning method that can achieve very high accuracy in image classification tasks — the Convolutional Neural Network. In particular, this tutorial will show you both the theory and practical application of Convolutional Neural Networks in PyTorch.
PyTorch is a powerful deep learning framework which is rising in popularity, and it is thoroughly at home in Python which makes rapid prototyping very easy. Fully connected networks with a few layers can only do so much — to get close to state-of-the-art results in image classification it is necessary to go deeper. In other words, lots more layers are required in the network. However, by adding a lot of additional layers, we come across some problems.
First, we can run into the vanishing gradient problem. Another issue for deep fully connected networks is that the number of trainable parameters in the model (i.e., the weights) can grow rapidly. This means that training slows down or becomes practically impossible, and it also exposes the model to overfitting.
Convolutional Neural Networks try to solve this second problem by exploiting correlations between adjacent inputs in images or time series. This means that not every node in the network needs to be connected to every other node in the next layer — and this cuts down the number of weight parameters required to be trained in the model.
How does a Convolutional Neural Network work?
The first thing to understand in a Convolutional Neural Network is the actual convolution part. This is a fancy mathematical word for what is essentially a moving window, or filter, sliding across the image being studied. This moving window applies to a certain neighborhood of nodes as shown below; here, the filter applied is 0.
The weight of the mapping of each input square, as previously mentioned, is 0. The weights of each of these connections, as stated previously, are 0.
This is contrary to fully connected neural networks, where every node is connected to every other in the following layer.
Constant filter parameters — each filter has constant parameters. In other words, as the filter moves around the image, the same weights are applied to each 2 x 2 set of nodes. Each filter, as such, can be trained to perform a certain specific transformation of the input space. Therefore, each filter has a certain set of weights that are applied for each convolution operation — this reduces the number of parameters.
Note: this is not to say that each weight is constant within the filter. In the example above, the weights were [0. It all depends on how each filter is trained. These two properties of Convolutional Neural Networks can drastically reduce the number of parameters which need to be trained compared to fully connected neural networks.
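To make the parameter savings concrete, here is a minimal sketch that counts trainable parameters in a fully connected layer and in a convolutional layer. The layer sizes (100 hidden nodes, 32 filters of 5 x 5) are assumptions chosen for illustration and are not taken from the tutorial.

```python
import torch.nn as nn


def count_parameters(module: nn.Module) -> int:
    """Return the number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Fully connected: each of the 28 * 28 inputs connects to each of 100 nodes.
dense = nn.Linear(28 * 28, 100)

# Convolutional: 32 filters of 5 x 5 weights, shared across the whole image.
conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=5)

dense_params = count_parameters(dense)  # 784 * 100 weights + 100 biases
conv_params = count_parameters(conv)    # 32 * 1 * 5 * 5 weights + 32 biases
print(dense_params, conv_params)
```

Even with 32 filters, the convolutional layer in this sketch trains far fewer weights than the single dense layer, because the filter weights are reused at every position.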
The next step in the Convolutional Neural Network structure is to pass the output of the convolution operation through a non-linear activation function — generally some version of the ReLU activation function. This provides the standard non-linear behavior that neural networks are known for. Before we move onto the next main feature of Convolutional Neural Networks, called pooling, we will examine this idea of feature mapping and channels in the next section.
Feature mapping and multiple channels
As mentioned previously, because the weights of individual filters are held constant as they are applied over the input nodes, they can be trained to select certain features from the input data.
In the case of images, a filter may learn to recognize common geometrical objects such as lines, edges and other shapes which make up objects. This is where the name feature mapping comes from. Because of this, any convolution layer needs multiple filters which are trained to detect different features. Therefore, the previous moving filter diagram needs to be updated to look something like this:
Multiple convolutional filters
Now you can see on the right hand side of the diagram above that there are multiple, stacked outputs from the convolution operation.
This is because there are multiple trained filters which produce their own 2D output for a 2D image. These multiple filters are commonly called channels in deep learning.
Each of these channels will end up being trained to detect certain key features in the image. The output of a convolution layer, for a gray-scale image like those in the MNIST dataset, will therefore actually have 3 dimensions: 2D for each of the channels, then another dimension for the number of different channels. If the input is itself multi-channelled, as in the case of a color RGB image (one channel for each of R, G and B), the output will actually be 4D.
Thankfully, any deep learning library worth its salt, PyTorch included, will be able to handle all this mapping easily for you. Now, the next vitally important part of Convolutional Neural Networks is a concept called pooling.
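As a rough illustration of how PyTorch handles channels, the sketch below pushes a batch of gray-scale images through one convolution layer followed by ReLU and prints the resulting shape. The batch size, filter count and padding are assumptions for the example; note that PyTorch also carries a leading batch dimension on top of the per-image dimensions discussed above.

```python
import torch
import torch.nn as nn

# A batch of 8 gray-scale images: [batch, channels, height, width].
batch = torch.randn(8, 1, 28, 28)

# 32 filters of 5 x 5; padding=2 keeps the 28 x 28 spatial size.
conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=5, stride=1, padding=2)

# Convolution followed by the non-linear ReLU activation.
features = torch.relu(conv(batch))

# One 28 x 28 feature map per filter (channel), for every image in the batch.
print(features.shape)
```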
Pooling
There are two main benefits to pooling in Convolutional Neural Networks: (1) it reduces the number of parameters in your model by a process called down-sampling, and (2) it makes feature detection more robust to object orientation and scale changes. So what is pooling? It is another sliding-window technique, but instead of applying weights, which can be trained, it applies a statistical function of some type over the contents of its window.
The most common type of pooling is called max pooling, and it applies the max function over the contents of the window. Other variants, such as mean pooling, which takes the statistical mean of the contents, are also used in some cases. In this tutorial, we will be concentrating on max pooling. For the first window, the blue one, you can see that the max pooling outputs a 3. This is pretty straightforward.
Strides and down-sampling
In the pooling diagram above, you will notice that the pooling window shifts to the right each time by 2 places.
This is called a stride of 2. In the diagram above, the stride is only shown in the x direction, but, if the goal was to prevent pooling window overlap, the stride would also have to be 2 in the y direction as well.
In other words, the stride is actually specified as [2, 2]. One important thing to notice is that, if during pooling the stride is greater than 1, then the output size will be reduced.
As can be observed above, the 5 x 5 input is reduced to a 3 x 3 output. This is a good thing: it is called down-sampling, and it reduces the number of trainable parameters in the model.
Padding
Another thing to notice in the pooling diagram above is that there is an extra column and row added to the 5 x 5 input, which makes the effective size of the pooling space equal to 6 x 6.
This is to ensure that the 2 x 2 pooling window can operate correctly with a stride of [2, 2], and it is called padding. These nodes are basically dummy nodes: because the values of these dummy nodes are 0, they are basically invisible to the max pooling operation.
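The pooling, stride and padding steps described above can be sketched as follows. This is one possible way to reproduce the 5 x 5 to 3 x 3 reduction, using an explicit zero pad of one extra column and row; the input values are made up for the example and are not the ones in the diagram.

```python
import torch
import torch.nn.functional as F

# A 5 x 5 input with made-up values 0..24, shaped [batch, channels, height, width].
x = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)

# Add one zero column on the right and one zero row at the bottom: 5 x 5 -> 6 x 6.
# The pad tuple is (left, right, top, bottom) for the last two dimensions.
padded = F.pad(x, (0, 1, 0, 1), value=0.0)

# 2 x 2 max pooling with a stride of [2, 2], so the windows do not overlap.
pooled = F.max_pool2d(padded, kernel_size=2, stride=2)

print(padded.shape)   # torch.Size([1, 1, 6, 6])
print(pooled.shape)   # torch.Size([1, 1, 3, 3])
print(pooled[0, 0])
```

Because all the made-up inputs are non-negative, the zero padding never wins the max, which is the sense in which the dummy nodes are invisible here.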
Ok, so now we understand how pooling works in Convolutional Neural Networks, and how it is useful in performing down-sampling, but what else does it do? Why is max pooling used so frequently?
Why is pooling used in convolutional neural networks?
In addition to the function of down-sampling, pooling is used in Convolutional Neural Networks to make the detection of certain features somewhat invariant to scale and orientation changes. Another way of thinking about what pooling does is that it generalizes over lower level, more complex information.
Pooling can assist with this higher level, generalized feature selection, as the diagram below shows:
Stylized representation of pooling
The diagram is a stylized representation of the pooling operation. Therefore, pooling acts as a generalizer of the lower level data, and so, in a way, enables the network to move from high resolution data to lower resolution information.
In other words, pooling coupled with convolutional filters attempts to detect objects within an image. The output of these filters is then sub-sampled by pooling operations. After this, there is another set of convolutions and pooling on the output of the first convolution-pooling operation.
The purpose of this fully connected layer at the output of the network requires some explanation.
The fully connected layer
As previously discussed, a Convolutional Neural Network takes high resolution data and effectively resolves that into representations of objects.
In order to attach this fully connected layer to the network, the dimensions of the output of the Convolutional Neural Network need to be flattened. These channels need to be flattened to a single N x 1 tensor. This can be easily performed in PyTorch, as will be demonstrated below. Now that the basics of Convolutional Neural Networks have been covered, it is time to show how they can be implemented in PyTorch.
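A minimal sketch of the flattening step is shown here. The feature-map size (64 channels of 7 x 7) and the 10 output classes are assumptions for illustration; the batch dimension is kept, so each image becomes one row of length N.

```python
import torch
import torch.nn as nn

# Assumed output of the last convolution-pooling stage:
# 4 images, 64 channels, 7 x 7 feature maps.
feature_maps = torch.randn(4, 64, 7, 7)

# Flatten everything except the batch dimension: N = 64 * 7 * 7 = 3136.
flat = feature_maps.reshape(feature_maps.size(0), -1)

# The flattened vector can now feed a fully connected layer (10 classes assumed).
fc = nn.Linear(64 * 7 * 7, 10)
scores = fc(flat)

print(flat.shape, scores.shape)
```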
PyTorch is such a framework. The Convolutional Neural Network architecture that we are going to build can be seen in the diagram below:
Convolutional neural network that will be built
First up, we can see that the input images will be 28 x 28 pixel greyscale representations of digits. These layers represent the output classifier. Next, there is a specification of some local drive folders to use to store the MNIST dataset (PyTorch will download the dataset into this folder for you automatically) and also a location for the trained model parameters once training is complete.
The transforms are set up as a list containing transforms.ToTensor and transforms.Normalize, wrapped in the transforms.Compose function. This function comes from the torchvision package. It allows the developer to set up various manipulations on the specified dataset.
Numerous transforms can be chained together in a list using the Compose function. In this case, first we specify a transform which converts the input data set to a PyTorch tensor. A PyTorch tensor is a specific data type used in PyTorch for all of the various data and weight operations within the network. In its essence though, it is simply a multi-dimensional matrix. In any case, PyTorch requires the data set to be transformed into a tensor so it can be consumed in the training and testing of the network.
The next argument in the Compose list is a normalization transformation. Neural networks train better when the input data is normalized so that the data ranges from -1 to 1 or 0 to 1.
Note that for each input channel a mean and standard deviation must be supplied. In the MNIST case, the input data is only single channeled, but for something like the CIFAR data set, which has 3 channels (one for each color in the RGB spectrum), you would need to provide a mean and standard deviation for each channel.
These will subsequently be passed to the data loader. First, the root argument specifies the folder where the training and test data files are stored. The train argument is a boolean which tells the data set to pick up either the training file or the test file. Finally, the download argument tells the MNIST data set function to download the data, if required, from an online source. As can be observed, the data loader takes three simple arguments: first the data set you wish to load, second the batch size you desire, and finally whether you wish to randomly shuffle the data.
A data loader can be used as an iterator — so to extract the data we can just use the standard Python iterators such as enumerate.
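Here is a small sketch of iterating over a data loader with enumerate. To keep it self-contained and avoid a download, it uses a TensorDataset of random tensors as a stand-in for MNIST; the sizes and the batch size of 32 are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data set: 100 random "images" with random digit labels.
images = torch.randn(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))
dataset = TensorDataset(images, labels)

# The three arguments discussed above: data set, batch size, shuffle flag.
loader = DataLoader(dataset, batch_size=32, shuffle=True)

batch_shapes = []
for i, (batch_images, batch_labels) in enumerate(loader):
    batch_shapes.append((tuple(batch_images.shape), tuple(batch_labels.shape)))
    print(i, batch_images.shape, batch_labels.shape)
```

With 100 samples and a batch size of 32, the last batch is smaller than the others.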
Schooling Flappy Bird: A Reinforcement Learning Tutorial
Let's not worry about the parameters of the optimizer just yet. Then we start the training loop. We loop over the contents of a loader object, which we'll look at in a minute.
Every iteration it yields two items: the inputs and the labels. They are PyTorch tensors of which the first dimension is the batch size.
The inputs can be directly fed to the model, while labels has a single dimension whose size is equal to the batch size: it represents the class of each image. Next, we use our loss function to compute the loss on the results of the model. While we do those computations PyTorch automatically tracks our operations, and when we call backward on the result it calculates the derivative (gradient) of each of the steps with respect to the inputs.
This gradient is then what the optimizer can use to optimize the weights when we call step. We call the full training loop over all elements in the loader an epoch.
Evaluation
After training for one or more epochs you are probably interested in the performance of your network. First we need to set our model to evaluation mode, which is the same as disabling the training mode. This disables features that are only handy at training time, in order to get the maximum performance out of our network.
Then we have a loop similar to the one in the training case: we loop over the inputs and the labels from the loader, pass the inputs to the model and calculate the loss. In addition, we could inspect the predictions of the model by looking up the positions of the largest outputs. These positions correspond to the output node, and hence the class, that has the highest probability according to our model, so we can interpret each one as the index of the most probable class.
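The training and evaluation loops described above can be sketched roughly as follows. The tiny linear model, the random data, the SGD optimizer and the use of torch.argmax to pick the most probable class are all assumptions made so the example runs on its own; they are not the tutorial's actual model or settings.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, loss_fn, optimizer):
    """Run one full pass (epoch) over the loader and return the mean loss."""
    model.train()
    total_loss = 0.0
    for inputs, labels in loader:
        optimizer.zero_grad()             # clear gradients from the previous step
        outputs = model(inputs)           # forward pass
        loss = loss_fn(outputs, labels)
        loss.backward()                   # compute gradients
        optimizer.step()                  # update the weights
        total_loss += loss.item() * inputs.size(0)
    return total_loss / len(loader.dataset)


@torch.no_grad()
def evaluate(model, loader, loss_fn):
    """Return mean loss and accuracy with the model in evaluation mode."""
    model.eval()
    total_loss, correct = 0.0, 0
    for inputs, labels in loader:
        outputs = model(inputs)
        total_loss += loss_fn(outputs, labels).item() * inputs.size(0)
        predictions = torch.argmax(outputs, dim=1)   # index of most probable class
        correct += (predictions == labels).sum().item()
    n = len(loader.dataset)
    return total_loss / n, correct / n


torch.manual_seed(0)
data = TensorDataset(torch.randn(64, 1, 8, 8), torch.randint(0, 3, (64,)))
loader = DataLoader(data, batch_size=16, shuffle=True)
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

first_loss = train_one_epoch(model, loader, loss_fn, optimizer)
for _ in range(20):
    last_loss = train_one_epoch(model, loader, loss_fn, optimizer)
eval_loss, accuracy = evaluate(model, loader, loss_fn)
print(first_loss, last_loss, eval_loss, accuracy)
```

In a real project the evaluation would of course run on a separate test loader rather than on the training data used here.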
The loader
Of course data is essential to either training or evaluating a classifier. In the previous two segments we looped through the contents of this loader object, which we did not define before.
In order to create it, we must first define a data set. Of course a single data set is not enough: we need both a training and a testing data set.
In addition you may want to have a validation data set as well. The training transformation chains transforms.RandomResizedCrop and transforms.RandomHorizontalFlip, among others, while the testing transformation chains transforms.Resize and transforms.CenterCrop. We define two transformations, one for each data set. The random crop means that although the model will encounter each training image once during every epoch, the exact images it will be seeing vary from epoch to epoch: sometimes it will be seeing most of the image and other times it will see only a small crop. Since most objects still look roughly the same when we horizontally flip the image, we want the model to also learn from the flipped images.
Vertically flipped (upside-down) images usually do not look like the same object anymore, so we only flip horizontally. All this random transforming of the training images helps to prevent our model from overfitting: it cannot learn by heart that a small portion of an image belongs to a certain label, because every epoch it sees a different subset of the image.
Once we have defined the data sets, we can create the loaders. Now we have all the ingredients to really start training our model! Among the optimizer's parameters, the first one, lr, the learning rate, is especially important. This parameter defines how much the weights will be changed in every optimization step. In other words, it defines our step size when we are looking for the optimal set of weights. Let's have a look at a 1D example. Suppose we are looking to find the minimum value in the curves depicted below.
If our learning rate is too large then we might actually walk away from the minimum, as we see on the left. If, on the other hand, our learning rate is too low, we will be moving very slowly and we run the risk of getting stuck in a local optimum. Now you might be inclined to perform a classical hyper-parameter search, by simply trying out a lot of values for the learning rate and seeing how well the model performs in the end.
But training a single model takes at least a few hours on a decent GPU, so training tens or hundreds of them is impractical. A better way to figure out the optimal value of the learning rate is to do a learning rate sweep: we train our model for a number of batches for a range of learning rates.
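The 1D picture can be reproduced with a few lines of plain Python. This sketch runs gradient descent on the assumed curve f(x) = x^2, whose minimum is at 0; the three learning rates are picked purely for illustration.

```python
def gradient_descent(learning_rate: float, start: float = 1.0, steps: int = 20) -> float:
    """Minimize f(x) = x**2 by gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x                 # derivative of x**2
        x = x - learning_rate * gradient
    return x


too_large = gradient_descent(1.1)     # each step overshoots: |x| grows by a factor 1.2
too_small = gradient_descent(0.001)   # each step shrinks x by only 0.2 percent
reasonable = gradient_descent(0.3)    # each step shrinks x by a factor 0.4

print(too_large, too_small, reasonable)
```

With the large rate the iterate ends up further from the minimum than where it started, while the small rate has barely moved after 20 steps.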
The result should look something like this: we see that in the beginning we learn very slowly, but it improves after a while. Your ideal setting is where the improvement is the fastest. After the sweep, do not forget to reset the network to the state before you did the sweep, as the batches with the highest learning rates will most likely have ruined your network's performance.
Learning rate scheduler
Unfortunately doing a sweep once is not enough, as the best learning rate depends on the state of our network. The closer we come to the ideal weights, the lower we should set our learning rate. We can solve this by using a learning rate scheduler. For example, we can use the ReduceLROnPlateau scheduler, which decreases the learning rate when the loss has been stable for a while. All we have to do next is call the scheduler during training.
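A learning rate sweep along these lines could be sketched as below. The helper name lr_sweep, the one-batch-per-learning-rate simplification, the SGD optimizer and the toy model are all assumptions for the example; the key points are recording the loss per learning rate and restoring the weights afterwards.

```python
import copy

import torch
import torch.nn as nn


def lr_sweep(model, loss_fn, batches, learning_rates):
    """Train one batch per learning rate, then restore the original weights."""
    initial_state = copy.deepcopy(model.state_dict())
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rates[0])
    history = []
    for lr, (inputs, labels) in zip(learning_rates, batches):
        for group in optimizer.param_groups:
            group["lr"] = lr              # switch to the next learning rate
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))
    model.load_state_dict(initial_state)  # undo the damage done by the sweep
    return history


torch.manual_seed(0)
model = nn.Linear(10, 2)
batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(6)]
learning_rates = torch.logspace(-5, 0, steps=6).tolist()   # 1e-5 ... 1

weights_before = copy.deepcopy(model.state_dict())
history = lr_sweep(model, nn.CrossEntropyLoss(), batches, learning_rates)
for lr, loss in history:
    print(f"lr={lr:.0e} loss={loss:.4f}")
```

Plotting the recorded losses against the learning rates on a logarithmic axis gives the kind of curve discussed above.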
The result will look something like the figure below: every once in a while the scheduler will decide to reduce the learning rate when it thinks the loss is not improving enough. Now all that you need to start making your own image classifier is a data set!
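As a hedged sketch of the scheduler in action, the snippet below feeds a made-up sequence of losses to ReduceLROnPlateau and records the learning rate after each call. The factor, patience and loss values are assumptions chosen so that a reduction is triggered quickly; in a real run the loss passed to step would come from training or validation.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Reduce the learning rate tenfold once the loss has not improved
# for more than `patience` consecutive calls.
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=2)

fake_losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.8]   # made-up plateau
lrs = []
for loss in fake_losses:
    scheduler.step(loss)                        # report the latest loss
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```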
Where next?
If you're looking for more example code, have a look at this project, which I used to build an image classifier that can recognize skylines of a few large cities. I gave a talk about the project at EuroPython. And of course the PyTorch docs are your friend whenever you are building something like this!
PyTorch: Training your first Convolutional Neural Network (CNN)
Machine learning algorithms can roughly be divided into two parts: Traditional learning algorithms and deep learning algorithms. Traditional learning algorithms usually have much fewer learnable parameters than deep learning algorithms and have much less learning capacity.
Also, traditional learning algorithms are not able to do feature extraction: Artificial intelligence specialists need to figure out a good data representation which is then sent to the learning algorithm. A lot of deep learning techniques have been known for a very long time, but recent advances in hardware rapidly boosted deep learning research and development. Nvidia is responsible for the expansion of the field because its GPUs have enabled fast deep learning experiments.
Learnable Parameters and Hyperparameters
Machine learning algorithms consist of learnable parameters which are tuned in the training process and non-learnable parameters which are set before the training process.
Parameters set prior to learning are called hyperparameters. Grid search is a common method for finding the optimal hyperparameters.
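Grid search can be illustrated in a few lines of plain Python. In this sketch the function validation_loss is a hypothetical stand-in for training a model and measuring its validation loss, and the candidate values in the grid are made up.

```python
from itertools import product


def validation_loss(learning_rate: float, batch_size: int) -> float:
    """Hypothetical stand-in for 'train a model and return its validation loss'."""
    return (learning_rate - 0.01) ** 2 + (batch_size - 64) ** 2 * 1e-6


# Candidate values for each hyperparameter (assumed for illustration).
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
}

# Try every combination and keep the one with the lowest loss.
best_combo = min(product(*grid.values()), key=lambda combo: validation_loss(*combo))
best_params = dict(zip(grid.keys(), best_combo))

print(best_params)
```

The number of combinations grows multiplicatively with every added hyperparameter, which is why grid search becomes expensive when each evaluation means training a full model.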
Supervised, Unsupervised, and Reinforcement Learning Algorithms
One way to classify learning algorithms is drawing a line between supervised and unsupervised algorithms. An optimizer minimizes the loss function. One optimizer that is very popular in deep learning is stochastic gradient descent.
There are a lot of variations that try to improve on the stochastic gradient descent method: Adam, Adadelta, Adagrad, and so on. Unsupervised algorithms try to find structure in the data without explicitly being provided with labels. Below is an image with data points. Each cluster has its own color. Reinforcement learning uses rewards: Sparse, time-delayed labels. An agent takes action, which changes the environment, from which it can get a new observation and reward.
An observation is the stimulus an agent perceives from the environment. It can be what the agent sees, hears, smells, and so on. A reward is given to the agent when it takes an action. It tells the agent how good the action is. By perceiving observations and rewards, an agent learns how to optimally behave in the environment.
Active, Passive, and Inverse Reinforcement Learning
There are a few different approaches to this technique. Finally, inverse reinforcement learning tries to reconstruct a reward function given the history of actions and their rewards in various states.
Generalization, Overfitting, and Underfitting
Any fixed instance of parameters and hyperparameters is called a model. Machine learning experiments usually consist of two parts: training and testing.
During the training process, learnable parameters are tuned using training data. In the test process, learnable parameters are frozen, and the task is to check how well the model makes predictions on previously unseen data. Generalization is the ability of a learning machine to perform accurately on a new, unseen example or task after having experienced a learning dataset. If a model is too simple with respect to the data, it will not be able to fit the training data and it will perform poorly both on the training dataset and the test dataset.
In that case, we say the model is underfitting. Overfitting is the situation where a model is too complex with respect to the data.
It can perfectly fit training data, but it is adapted so much to the training dataset that it performs poorly on test data. Below is an image showing underfitting and overfitting compared with a balanced situation between the overall data and the prediction function.
Scalability
Data is crucial in building machine learning models. But because of their limited capacity, performance is limited as well. Below is a plot showing how deep learning methods scale well compared to traditional machine learning algorithms.
Neural Networks
Neural networks consist of multiple layers. The image below shows a simple neural network with four layers. The first layer is the input layer, and the last layer is the output layer. The two layers between the input and output layers are hidden layers. If a neural network has more than one hidden layer, we call it a deep neural network.
Learning is done using a backpropagation algorithm which combines a loss function and an optimizer. Backpropagation consists of two parts: a forward pass and a backward pass. In the forward pass, input data is put on the input of the neural network and output is obtained.
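A forward pass, a backward pass and one optimizer-style update can be traced by hand on a one-weight model. The numbers below (weight 2, input 3, target 10, step size 0.01) are assumptions chosen so that the arithmetic is easy to check.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # the single learnable parameter
x = torch.tensor(3.0)                       # input
target = torch.tensor(10.0)                 # desired output

# Forward pass: compute the prediction and the loss.
prediction = w * x                          # 2 * 3 = 6
loss = (prediction - target) ** 2           # (6 - 10)^2 = 16

# Backward pass: d(loss)/dw = 2 * (w*x - target) * x = 2 * (-4) * 3 = -24.
loss.backward()

# One gradient descent step, done by hand instead of with an optimizer.
with torch.no_grad():
    w -= 0.01 * w.grad                      # 2 - 0.01 * (-24) = 2.24

print(loss.item(), w.grad.item(), w.item())
```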
Convolutional Neural Network
One neural network variation is the convolutional neural network.
Step 1: Download the code. For more background on using Git see this post.
To create the conda environment with all the dependencies, (a) install Anaconda. For a 5-minute introduction to CNNs, see this post; for a longer introduction, see this post. In other cases, you may want to modify an existing CNN. Finally, perhaps you would like to write your own CNN entirely from scratch, without any pre-defined components.
A grayscale image has 1 color channel, for different shades of gray. The dimensions of a grayscale image are [1, height, width].
Chest CT scan showing pneumothorax.
A color image has 3 color channels: red, green, and blue. Therefore, the dimensions of a color image are [3, height, width].
Samoyed on the beach.
The input layer of a CNN that takes in grayscale images must specify 1 input channel, corresponding to the gray channel of the input grayscale image.
The input layer of a CNN that takes in color images must have 3 input channels, corresponding to the red-green-blue color channels of the input color image.
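The difference between the two input layers can be sketched as follows. The number of output channels, the kernel size and the 32 x 32 image size are assumptions for the example; PyTorch additionally expects a leading batch dimension, so the inputs are shaped [batch, channels, height, width].

```python
import torch
import torch.nn as nn

# Input layer for grayscale images: 1 input channel.
gray_layer = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)

# Input layer for color images: 3 input channels (red, green, blue).
color_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

gray_batch = torch.randn(4, 1, 32, 32)    # 4 grayscale images
color_batch = torch.randn(4, 3, 32, 32)   # 4 color images

gray_out = gray_layer(gray_batch)
color_out = color_layer(color_batch)

# Both produce 16 feature maps; a 3 x 3 kernel without padding trims 32 -> 30.
print(gray_out.shape, color_out.shape)
```

Feeding the color batch into the grayscale layer would raise an error, because the channel count of the input must match in_channels.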
A batch is a subset of examples selected out of the whole data set. The model is trained on one batch at a time.
Anatomy of a 2D CNN layer.