ML: Introduction to PyTorch: 3. Neural Networks


Sequence of layers

This document discusses how to create and train neural networks using the PyTorch library; the basics of the library are covered in the earlier parts of this introduction, and a reference guide for the main types of layers, losses, and optimizers is given in a separate document.

Simple neural networks with a sequential architecture can be created with the nn.Sequential container, which stacks layers one after another. For example, a two-layer fully connected network with two inputs (nX=2 features), one sigmoid output (nY=1, two classes), and five neurons (nH=5) in the hidden layer can be defined as follows:

import torch
import torch.nn as nn

nX, nH, nY = 2, 5, 1

model = nn.Sequential(
          nn.Linear(nX, nH),    # the first layer
          nn.Sigmoid(),         # hidden layer activation
          nn.Linear(nH, nY),    # the second, output layer
          nn.Sigmoid() )        # its activation function

The Linear layer performs a linear transformation of an input tensor of shape (N, nX), where N is the number of examples. At the output of this layer we obtain a tensor of shape (N, nH), which is then passed through a sigmoid (producing nH new features). The second layer produces a tensor of shape (N, nY), which is also passed through a sigmoid. If its value is less than 0.5, the example belongs to the first class; if it is greater than 0.5, to the second.
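For example, a random batch can be passed through the model to check the tensor shapes (a minimal sketch; the values are arbitrary):

x = torch.rand(100, nX)                  # a batch of N=100 examples with nX features
y = model(x)                             # forward pass through all the layers
print(x.shape, y.shape)                  # torch.Size([100, 2]) torch.Size([100, 1])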


Functional architecture

When designing complex neural networks, it is often more convenient to define their architecture in a functional form. To do this, a subclass of nn.Module is created, in which the constructor and the forward method are overridden. In the constructor, the required layers are created and their parameters are initialized. Note the call to the parent class constructor: in the classic form it is passed the name of our class (in Python 3 the shorter super().__init__() also works). In the forward method (forward propagation), the architecture of the network is defined, i.e. the layers are linked together into a computational graph:

class TwoLayersNet(nn.Module):
    def __init__(self, nX, nH, nY):        
        super(TwoLayersNet, self).__init__()     # parent constructor with this class name
        
        self.fc1 = nn.Linear(nX, nH)             # create model parameters
        self.fc2 = nn.Linear(nH, nY)             # in fully connected layers
         
    def forward(self, x):                        # define forward pass
        x = self.fc1(x)                          # output of the first layer
        x = nn.Sigmoid()(x)                      # pass through Sigmoid
        x = self.fc2(x)                          # output of the second layer
        x = nn.Sigmoid()(x)                      # pass through Sigmoid again
        return x
         
model = TwoLayersNet(2, 5, 1)                    # create a network instance

The calls to nn.Linear(in_features, out_features) create two instances of fully connected linear layers, fc1 and fc2; their weights and biases are initialized with random values. The call operator fc1(x) inside the forward method runs the layer's computation and returns the resulting tensor.
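For example, the randomly initialized parameters and the effect of the call operator can be inspected directly (a small sketch):

print( model.fc1.weight.shape )          # torch.Size([5, 2]) - (out_features, in_features)
print( model.fc1.bias.shape )            # torch.Size([5])
x = torch.rand(3, 2)                     # three examples with two features
print( model.fc1(x).shape )              # torch.Size([3, 5]) - output of the first layer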


Synthetic data

Let's create a synthetic dataset in PyTorch with two types of objects (two classes), each characterized by two features. Suppose that on the 2D plane, objects of the first class fill the unit square $[0...1]^2$, except for a circle of radius $\sqrt{0.1}\approx 0.32$ at the center of the square, where the objects of the second class are located:

X = torch.rand (1200,2)                       
Y = (torch.sum((X - 0.5)**2, axis=1) < 0.1).float().view(-1,1)
The float method converts the boolean results of the less-than operator into floating-point numbers (0.0 or 1.0). Then, using view, we turn Y into a single-column matrix (containing 0 or 1 values). Each row corresponds to one training example’s class.
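A quick sanity check of the result (a sketch; the second number fluctuates slightly from run to run):

print( X.shape, Y.shape )                # torch.Size([1200, 2]) torch.Size([1200, 1])
print( Y.mean().item() )                 # ~0.31 - fraction of points inside the circle (area pi*0.1)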

Using the matplotlib library, we can visualize the generated data:

import matplotlib.pyplot as plt                              # plotting library

plt.figure (figsize=(5, 5))                                  # figure size (square)
plt.scatter(X.numpy()[:,0], X.numpy()[:,1], c=Y.numpy()[:,0], 
            s=30, cmap=plt.cm.Paired, edgecolors='k')        
plt.show()                                                   # display the plot    

Network training

To train the network, you need to create a loss function (what should be minimized) and an optimizer (which will minimize this loss). The optimizer receives the model parameters. As the loss, we choose binary cross-entropy (BCELoss), and the optimizer will be SGD (Stochastic Gradient Descent):

model = TwoLayersNet(2, 5, 1)                            # network instance        

loss      = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(),          # model parameters
                            lr=0.5, momentum=0.8)        # optimizer settings

On each iteration, a batch of training samples is formed, then forward propagation y = model(xb) is executed and the loss is computed by comparing the model output y with the correct labels yb. Calling backward() on the loss value (L.backward() below) traverses its computational graph and computes the gradients of the model parameters, which the optimizer then uses in its step() method to update the parameter values:

def fit(model, X, Y, batch_size=100, train=True):
    model.train(train)                               # important for Dropout, BatchNorm
    sumL, sumA, numB = 0, 0, int( len(X)/batch_size )  # loss, accuracy, number of batches

    for i in range(0, numB*batch_size, batch_size):
        xb = X[i: i+batch_size]                      # current batch
        yb = Y[i: i+batch_size]                      # X, Y are torch tensors

        y = model(xb)                                # forward propagation
        L = loss(y, yb)                              # compute loss

        if train:                                    # training mode
            optimizer.zero_grad()                    # reset gradients
            L.backward()                             # compute gradients
            optimizer.step()                         # update parameters

        sumL += L.item()                             # total loss (item() detaches from the graph)
        sumA += (y.round() == yb).float().mean().item()  # classification accuracy

    return sumL/numB,  sumA/numB                     # average loss and accuracy

The fit function makes one pass through all the data; this is called a training epoch. If the parameter train=True is set, the model is trained (its parameters are updated). The value train=False is used when evaluating the model quality without changing it. As quality metrics we use the average loss over all examples and the average accuracy (the proportion of correctly predicted classes). Usually a fairly large number of training epochs is required:

                                                         # model evaluation mode:
print( "before:      loss: %.4f accuracy: %.4f" %  fit(model, X,Y, train=False) )

epochs = 1000                                            # number of epochs
for epoch in range(epochs):                              # an epoch - pass through all examples
    L,A = fit(model, X, Y)                               # one epoch
    
    if epoch % 100 == 0 or epoch == epochs-1:                 
        print(f'epoch: {epoch:5d} loss: {L:.4f} accuracy: {A:.4f}' )   
As a result, we will obtain something like:
    
before:      loss: 0.6340 accuracy: 0.6950
epoch:     0 loss: 0.6327 accuracy: 0.6550
....
epoch:   999 loss: 0.0334 accuracy: 0.9908

To improve training, it is usually recommended to shuffle the data. This can be done in two ways. The first method is executed at the beginning of each epoch, before calling the fit function:

    idx = torch.randperm( len(X) )     # Shuffled index list
    X = X[idx]
    Y = Y[idx]
In this method, new memory is allocated for all the data, which is not always optimal for large training sets. The second method performs a random sampling only for the current batch inside the fit function:
    idx = torch.randint(high = len(X), size = (batch_size,) )
    xb = X[idx]                               
    yb = Y[idx]                               
This method uses less memory, but during a single epoch not all samples may be used for training, and the same sample may appear in a batch more than once. The DataLoader class (see below) also shuffles the data via its shuffle parameter, although by default it draws a random permutation, i.e. samples without replacement.
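For clarity, here is how the first (permutation) variant could be wired into the epoch loop shown above (a sketch using the fit function defined earlier):

for epoch in range(epochs):
    idx = torch.randperm( len(X) )       # new random order of the examples each epoch
    X, Y = X[idx], Y[idx]                # shuffled copies of the data (new memory)
    L, A = fit(model, X, Y)              # one training epoch on the shuffled data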

Thus, compared to Keras in TensorFlow, the training procedure must be written manually. However, in complex cases this allows you to intervene in the process at any stage. For example, you can insert your own optimizer :)

        if train:                                    # in training mode
            L.backward()                             # compute gradients
            with torch.no_grad():                    # do not track the update itself
                for p in model.parameters():
                    p.add_(p.grad, alpha=-0.7)       # p += -0.7*grad
                    p.grad.zero_()                   # reset gradients for the next step


Displaying the network structure

The easiest way to see the network layers is to just print the model:

print(model)                             # textual representation of the model

It is more convenient to use the external modelsummary module. Its summary function is passed an input tensor with arbitrary values but with the shape of a single example (in the printed shapes the batch dimension is shown as -1):

from modelsummary import summary

summary(model, torch.zeros(1, 2), show_input=False)      # analogous to summary in keras
-----------------------------------------------------------------------
             Layer (type)               Output Shape         Param #
=======================================================================
                 Linear-1                    [-1, 5]              15
                 Linear-2                    [-1, 1]               6
=======================================================================
Total params: 21
Trainable params: 21
Non-trainable params: 0

If you set show_input=True, then instead of the output shape, the input shape of each layer will be printed.
For composite models, you can run the summary function with the parameter show_hierarchical=True.
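For example (a sketch that uses only the parameters mentioned above):

summary(model, torch.zeros(1, 2), show_input=True)           # print input shapes of the layers
summary(model, torch.zeros(1, 2), show_hierarchical=True)    # nested view for composite models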

To display the model parameters when the model is defined with nn.Sequential, you can print them layer by layer:
for layer in model:
    print("***", layer)
    for param in layer.parameters():
        print(param.data.numpy())

In the general case, the model parameters can be printed as follows:

for param in model.parameters():
    print(param.numel(), param.size(), param.data.numpy())
Accessing parameters through the data attribute is necessary because they are nodes of the computation graph (you can instead use param.detach(), see the introduction to PyTorch). Another way to display trainable parameters with their names, number of parameters, and shapes:
tot = 0
for k, v in model.state_dict().items():
    pars = v.numel();  tot += pars                   # number of elements in the tensor
    print(f'{k:20s} :{pars:7d}  shape: {tuple(v.shape)} ')
print(f"{'total':20s} :{tot:7d}")
fc1.weight           :     10  shape: (5, 2) 
fc1.bias             :      5  shape: (5,) 
fc2.weight           :      5  shape: (1, 5) 
fc2.bias             :      1  shape: (1,) 
total                :     21

Finally, the torchviz library allows you to visualize the computational graph of the model:

import torchviz
     
torchviz.make_dot(model(X),  
                  params = dict(model.named_parameters()) )
The first argument of the make_dot function is the root of the model's computational graph, which is obtained by feeding the model an input tensor (of the correct shape, with any number of examples).

The AddmmBackward node corresponds to the torch.addmm(input, mat1, mat2, beta=1, alpha=1) function, which computes a matrix product plus a broadcast vector: beta * input + alpha * (mat1 @ mat2).
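This equivalence is easy to verify with the first layer of the model (a small sketch):

x = torch.rand(3, 2)                               # three examples with two features
W, b = model.fc1.weight, model.fc1.bias            # parameters of the first layer
print( torch.allclose(torch.addmm(b, x, W.t()),    # beta*b + alpha*(x @ W.t())
                      x @ W.t() + b) )             # the same result "by hand": True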


Saving and loading

The torch.save method saves any dictionary in binary form, including the state of the model and the optimizer:

import datetime
 
state = {'info':      "This is my network",      # description
         'date':      datetime.datetime.now(),   # date and time
         'model' :    model.state_dict(),        # model parameters
         'optimizer': optimizer.state_dict()}    # optimizer state

torch.save(state, 'state.pt')                    # save the file
This state can then be loaded back after first creating the model and the optimizer (the optimizer's settings may be arbitrary, since they will be overwritten from the file; the model architecture, of course, must match the saved one):
state = torch.load('state.pt')                   # load the file

m = TwoLayersNet(2, 5, 1)                        # network instance
optimizer = torch.optim.SGD(m.parameters(),lr=1) # optimizer (any parameters)     

m.        load_state_dict(state['model'])        # load model parameters
optimizer.load_state_dict(state['optimizer'])    # load optimizer state

print(state['info'], state['date'])              # auxiliary information
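A common pattern is to save such a state periodically during training; a sketch (the checkpoint file name and the saving interval are illustrative):

for epoch in range(epochs):
    L, A = fit(model, X, Y)                          # one training epoch
    if epoch % 100 == 0:                             # every 100 epochs
        torch.save({'model':     model.state_dict(), # current parameters
                    'optimizer': optimizer.state_dict(),
                    'epoch':     epoch},
                   f'state_{epoch}.pt')              # a separate file per checkpoint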

Dataset classes

In PyTorch, it is customary to wrap training data in a class that inherits from Dataset and overrides the __len__ method (the number of samples) and the __getitem__ method, which returns the sample with index idx:

class MyDataset(torch.utils.data.Dataset):       # Inherits from Dataset

    def __init__(self, N = 10, *args, **kwargs):
        super().__init__(*args,**kwargs)
        
        self.x = torch.rand(N,1)                 # random data

    def __len__(self):                           # Number of samples
        return len(self.x)
        
    def __getitem__(self, idx):                  # get the idx-th sample
        return {'input' : self.x[idx],  
                'target': 2*self.x[idx]}
    

data = MyDataset()

for sample in data:
    print(sample)      # {'input': tensor([0.1545]), 'target': tensor([0.3090])} ...
After creating a dataset, it can be passed to a DataLoader object, which forms batches, shuffles the data, can load samples in several worker processes, and so on (see the documentation).
        
train_loader = torch.utils.data.DataLoader(dataset=data, batch_size=12, shuffle=False)

for batch in train_loader:                       # get batches for training
    print(batch['input'], batch['target'])
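Such a loader can replace the manual batch slicing in the fit function. A minimal sketch for the toy dataset above (the model, loss, and optimizer here are illustrative; since the dataset maps x to 2*x, a single linear neuron is enough):

net = nn.Linear(1, 1)                                # toy model with one weight and one bias
mse = nn.MSELoss()                                   # mean squared error loss
opt = torch.optim.SGD(net.parameters(), lr=0.1)

for epoch in range(200):
    for batch in train_loader:                       # batches provided by the DataLoader
        y = net( batch['input'] )                    # forward pass
        L = mse(y, batch['target'])                  # compare with the targets
        opt.zero_grad();  L.backward();  opt.step()  # standard training step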

GPU computing

To perform computations on a GPU (or, in general, on any device), you first create the corresponding device objects:
gpu = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
cpu = torch.device("cpu")
After creating the model, it can be sent to the graphics card and (if memory allows) the same can be done with the training data:
model = TwoLayersNet(2, 5, 1)                       # network instance        
model.to(gpu)                                       # send it to the GPU

X =  X.to(gpu)                                      # send training data to the GPU
Y =  Y.to(gpu)

If there is not enough GPU memory for the training data, it can be sent there in batches.

During training, scalar values such as the loss (L.item()) are transferred from the GPU back to the CPU. Note that for small models and training datasets, using a GPU is not efficient and may even slow down training compared to a CPU.
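A sketch of one epoch in which X and Y stay in CPU memory and only the current batch is moved to the GPU (the model, loss, and optimizer are assumed to be the ones defined earlier):

batch_size, sumL = 100, 0
for i in range(0, len(X), batch_size):
    xb = X[i: i+batch_size].to(gpu)                  # move only the current batch to the GPU
    yb = Y[i: i+batch_size].to(gpu)                  # (X and Y themselves remain on the CPU)

    y = model(xb)                                    # the forward pass runs on the GPU
    L = loss(y, yb)
    optimizer.zero_grad();  L.backward();  optimizer.step()

    sumL += L.item()                                 # item() copies the scalar loss to the CPU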


Custom layers

Layers, like the entire network, inherit from the nn.Module class. Here is an example of implementing a linear layer:

import math
from torch.nn.parameter import Parameter

class My_Linear(nn.Module):
    
    def __init__(self, in_F, out_F):
        super(My_Linear, self).__init__()
        self.weight = Parameter(torch.Tensor(out_F, in_F))    # trainable weight matrix
        self.bias   = Parameter(torch.Tensor(out_F))          # trainable bias vector
            
        self.reset_parameters()                               # fill them with random values

    def reset_parameters(self):
        for p in self.parameters():
            stdv =  1.0 / math.sqrt(p.shape[0])               # scale set by the first dimension
            p.data.uniform_(-stdv, stdv)                      # uniform initialization

    def forward(self, x):
        return  x @ self.weight.t() + self.bias               # y = x W^T + b
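A quick check that the custom layer behaves like nn.Linear (a sketch):

layer = My_Linear(2, 5)                              # 2 input features, 5 outputs
x = torch.rand(3, 2)                                 # three examples
print( layer(x).shape )                              # torch.Size([3, 5])
print( sum(p.numel() for p in layer.parameters()) )  # 15 parameters, as in nn.Linear(2, 5)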

Custom optimizer

Below is the code for a simple SGD optimizer. Note the decorator @torch.no_grad() before the step method. It means that the parameter update step will be performed with gradient tracking disabled.

class SGD(torch.optim.Optimizer):
    
    def __init__(self, params, lr=0.1, momentum=0):
        defaults = dict(lr=lr, momentum=momentum)
        super(SGD, self).__init__(params, defaults)

    @torch.no_grad()
    def step(self):                
        for group in self.param_groups:
            momentum = group['momentum']
            lr       = group['lr']

            for p in group['params']:
                if p.grad is None:
                    continue
                    
                grad = p.grad
                
                if momentum != 0:
                    p_state = self.state[p]
                    if 'momentum_buf' not in p_state:
                        buf = p_state['momentum_buf'] =  torch.clone(grad)
                    else:
                        buf = p_state['momentum_buf']
                        buf.mul_(momentum).add_( grad )                       
                    grad = buf
                    
                p.add_(grad, alpha = -lr)


optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)                   

The parameters of built-in optimizers can also be modified directly during training through their param_groups (the betas parameter in the example below exists only for Adam-like optimizers):

def adjust_optim(optimizer, epoch):
    if epoch == 1000:
        optimizer.param_groups[0]['betas'] = (0.3, optimizer.param_groups[0]['betas'][1])
    if epoch > 1000:  
        optimizer.param_groups[0]['lr'] *= 0.9999
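A sketch of how such a function could be called (assuming an Adam optimizer, since only Adam-like optimizers have the betas parameter):

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(2000):
    L, A = fit(model, X, Y)                          # one training epoch (fit as defined above)
    adjust_optim(optimizer, epoch)                   # adjust the optimizer settings on the fly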