Vectorization in Fully-Connected Neural Networks

Jason Liu
5 min read · Sep 9, 2022


A guide to implementing vectorization in fully-connected neural networks.

A neuron! Image from Medical News Today.

Preface

In machine learning, vectorization is a great technique for speeding up training: it organizes all the parameters into matrices, which lets you get rid of unnecessary for loops. In this post, I am going to show how you can vectorize a fully-connected neural network with NumPy.

Recap

In a previous post, I talked about how you can implement forward and back propagation iteratively. Today, we are basically going to rebuild these two processes, and translate them into vectorized computation.

Re-defining Parameters

A key first step in rebuilding forward and back propagation is to re-introduce the parameters (weights, biases, inputs) that we’ve been using. In my previous post, I defined the parameters as follows:

Uppercase X and Y refer to the inputs and outputs of the entire network, while lowercase x and y are the inputs and outputs of a single layer.

Note that all these parameters are stored as single numbers in lists, so they carry superscripts and subscripts indicating their position. What we are going to do is define a new way to store and access these parameters: NumPy 2D arrays. We start by defining the big X and big Y of the network.

m denotes the number of samples in a batch

Notice that these matrices have a width of m, so we are essentially storing the data of the entire batch in one matrix. This also allows us to do calculations across the entire batch, instead of looping over every single sample, which makes training significantly faster.
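To make this concrete, here is a small made-up batch (the numbers are arbitrary, just for illustration): 3 input features, m = 4 samples, with one sample per column.

import numpy as np

# A made-up batch: 3 input features, m = 4 samples, one sample per column.
X = np.array([[0.2,  1.5, -0.3,  0.7],
              [1.1,  0.4,  0.9, -1.2],
              [0.5, -0.8,  0.1,  2.0]])

print(X.shape)   # (3, 4) = (# features, m)
print(X[:, 0])   # the first sample: [0.2 1.1 0.5]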

Then we are going to define the parameters within layers:

Activation a is essentially g(z), so they share a very similar structure.
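To summarize the shapes, writing n_l for the number of neurons in layer l and n_{l-1} for its number of inputs (these symbols are just shorthand for the dimensions that show up in the code below):

$$
w^{[l]} \in \mathbb{R}^{n_l \times n_{l-1}}, \qquad
b^{[l]} \in \mathbb{R}^{n_l \times 1}, \qquad
z^{[l]},\, a^{[l]} \in \mathbb{R}^{n_l \times m}
$$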

Now that we have defined the new dimensions of the weights w and biases b, let’s update the __init__ method of our DenseLayer accordingly:

import numpy as np

class DenseLayer:
    def __init__(self, num_neurons, num_inputs, activation_func):
        self.num_neurons = num_neurons
        self.num_inputs = num_inputs
        self.activation_func = activation_func
        self.x = None  # layer input, cached during forward propagation
        self.z = None  # pre-activation, cached during forward propagation
        self.w = np.random.randn(num_neurons, num_inputs) * 0.01
        self.b = np.random.randn(num_neurons, 1) * 0.01
        self.dw = np.zeros((num_neurons, num_inputs))
        self.db = np.zeros((num_neurons, 1))

Note that np.random.randn() returns samples from a standard normal distribution, so we multiply by 0.01 to keep the initial weights close to 0. Otherwise, it will take gradient descent a very long time to adjust the large weights.

You may also ask: why can’t we just leave the weights and biases as np.zeros()? Initializing every weight to the same value of 0 creates a symmetry problem: all neurons in a layer compute the same output and receive the same gradient, so they stay identical throughout training. This effectively reduces each layer to a single neuron and significantly limits the complexity of your network.
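To make the shapes concrete, here is a quick sketch of instantiating a layer. The sizes are arbitrary, and passing the activation as the string "relu" is just an assumption for illustration (any value your activation helper understands works).

layer = DenseLayer(num_neurons=4, num_inputs=3, activation_func="relu")

print(layer.w.shape)  # (4, 3) -- one row of weights per neuron
print(layer.b.shape)  # (4, 1) -- one bias per neuron, broadcast across the batch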

Forward Propagation

Now that we have re-defined the parameters, we can adjust our calculations accordingly. Let’s start with forward propagation. Recall that the formulas for forward propagation are:
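Per neuron j of a layer, this looks roughly like the following (the indexing here is just for illustration):

$$
z_j = \sum_k w_{jk}\, x_k + b_j, \qquad a_j = g(z_j)
$$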

With matrices, we can calculate all z and a in just one step.
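Using the matrices defined above (and matching the code below), where b is broadcast across all m columns:

$$
z = w\,x + b, \qquad a = g(z)
$$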

In code, it’s also just one step!

# forward_propagation is a method of DenseLayer
def forward_propagation(self, x):
    assert x.shape[0] == self.num_inputs
    self.x = x  # stores x for back prop
    self.z = np.dot(self.w, x) + self.b  # store z for back prop
    a = activation(self.z, self.activation_func)
    return a
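The activation helper (and its derivative activation_prime, which we will need for back propagation) is not defined in this post. A minimal sketch, assuming activations are selected by a string name such as "sigmoid" or "relu", could look like this:

def activation(z, name):
    # Apply the named activation element-wise to z.
    if name == "sigmoid":
        return 1 / (1 + np.exp(-z))
    if name == "relu":
        return np.maximum(0, z)
    raise ValueError(f"Unknown activation: {name}")

def activation_prime(z, name):
    # Element-wise derivative of the named activation, evaluated at z.
    if name == "sigmoid":
        s = 1 / (1 + np.exp(-z))
        return s * (1 - s)
    if name == "relu":
        return (z > 0).astype(z.dtype)
    raise ValueError(f"Unknown activation: {name}")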

Back Propagation

Now that we have vectorized forward propagation, let’s do the same for back propagation. Back propagation calculates the derivative of the cost J with respect to the parameters w, b, and a, so ∂J/∂w, ∂J/∂b, and ∂J/∂a share the same matrix dimensions as w, b, and a.

In my previous post, we derived the formulas to calculate ∂J/∂w, ∂J/∂b and ∂J/∂a of the l-1 layer. Now we are going to translate them into matrix calculations. Let’s start with ∂J/∂z.

∂J/∂z iteratively
∂J/∂z in matrix form

To distinguish between matrix products and element-wise multiplication, I will use an asterisk (*) to denote element-wise multiplication; everything else is a matrix (dot) product.
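With that convention, the matrix form of ∂J/∂z, which is exactly what the code below computes, is:

$$
\frac{\partial J}{\partial z} = \frac{\partial J}{\partial a} * g'(z)
$$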

With ∂J/∂z, we can more easily calculate other derivatives.

∂J/∂w iteratively
∂J/∂w in matrix form
∂J/∂b iteratively
∂J/∂b in matrix form
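In matrix form (writing x for the layer input cached during forward propagation, and matching the code below), these are:

$$
\frac{\partial J}{\partial w} = \frac{1}{m}\,\frac{\partial J}{\partial z}\, x^{T}, \qquad
\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(\frac{\partial J}{\partial z}\right)_{:,\,i}
$$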

Note that there is one addition to the matrix formulas for ∂J/∂w and ∂J/∂b: the 1/m term. Because ∂J/∂z has dimension (# outputs, m), the matrix products above sum the per-sample gradients across all m samples. Dividing by m gives the average ∂J/∂w and ∂J/∂b over the batch, which keeps the update scale independent of the batch size and prevents the gradients from blowing up as m grows.

And finally, the ∂J/∂a of the l-1 layer.

∂J/∂a_l-1 iteratively
∂J/∂a_l-1 in matrix form
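In matrix form this is a single matrix product, matching the last line of the code below:

$$
\frac{\partial J}{\partial a^{[l-1]}} = \left(w^{[l]}\right)^{T} \frac{\partial J}{\partial z^{[l]}}
$$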

Now that we have all the formulas, we can assemble them into code.

# back_propagation is also a method of DenseLayer
def back_propagation(self, da):
    m_batch = da.shape[1]  # number of samples in the batch
    dz = da * activation_prime(self.z, self.activation_func)
    self.dw = 1 / m_batch * np.dot(dz, self.x.T)
    self.db = 1 / m_batch * np.sum(dz, axis=1, keepdims=True)
    da_prev = np.dot(self.w.T, dz)
    return da_prev

As simple as that! Now you just need to implement gradient descent as usual, and you’ll have a vectorized fully-connected neural network.
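The gradient descent step itself is unchanged by vectorization. A minimal sketch of the update (update_parameters and the layers list are hypothetical names, and learning_rate is a hyperparameter you choose) might look like:

def update_parameters(layers, learning_rate):
    # One vanilla gradient descent step per layer, using the gradients
    # stored in dw and db during back propagation.
    for layer in layers:
        layer.w -= learning_rate * layer.dw
        layer.b -= learning_rate * layer.db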

Epilogue

If you replace the for loops with NumPy array operations, you’ll find that training time is drastically reduced. Not only does NumPy offer much faster computation than Python for loops, it also lets you compute over multiple samples simultaneously. On my PC, an iterative 100,000-step stochastic gradient descent takes roughly 9 hours, while the vectorized version brings the training time down to a little more than a minute.

Vectorization almost never hurts performance, and it works alongside most optimization and regularization algorithms, so it is advisable to vectorize your neural networks whenever you can.
