Artificial Neural Network

The following code is a simple three-layer neural network written in Python and NumPy. This was my first shot at programming a neural network, so there may be some issues with it. I wanted to build it from scratch so I could attempt to understand and conceptualize the math involved.

A neural network is a learning algorithm that attempts to solve complex problems in computer vision, speech recognition, and other fields in artificial intelligence. Artificial Neural Networks were named after their similarity to the human brain, with its billions of neurons and axons receiving inputs from the external environment and creating electrical impulses that travel through the connected network of the brain.

My program will have a single hidden layer. A simple visualization is below. Source code is here

Data

I'm going to be using NumPy here, which is a linear algebra library for Python. I read all my data from a CSV file. A basic, stripped-down version of that function is below; it creates two lists, inputData and outputData, which later become NumPy matrices. The output value sits at index 5 of each row, and the input values run from index 6 to the last column.


#Basic, stripped down data load from a CSV file to two Python lists
import csv
import itertools
import numpy as np

def LoadData(fileName, start, end):
    inputData = []   #plain Python lists
    outputData = []
    with open(fileName, 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for row in itertools.islice(reader, start, end):  #islice selects the rows from start up to end
            output = np.longfloat(row[5])  #the output value sits at index 5 of each row
            #build a long-float list from index 6 through the last index, skipping empty cells
            input = [np.longfloat(x) for x in itertools.islice(row, 6, len(row)) if x != '']
            #add each row to the Python lists
            inputData.append(input)
            outputData.append(output)
    return inputData, outputData
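For example (the file name and row range here are just placeholders), the returned lists could be turned into NumPy matrices like this:

#Hypothetical usage -- the file name and row range are placeholders
inputData, outputData = LoadData("mydata.csv", 0, 1000)
TrainingIn = np.array(inputData)                   #one row per sample
TrainingOut = np.array(outputData).reshape(-1, 1)  #a column vector of outputs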
			

Note that the input data should already be normalized. I'm not going to go into detail on this, but it is a very important piece of building a working neural network.

Links on data normalization:
Data Normalization in Neural Networks
How to Standardize data for Neural Networks
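As a rough sketch of the idea (this is not part of my program), min-max scaling each input column into the range 0 to 1 could look like this:

#Illustrative only: min-max scale each input column to [0, 1]
import numpy as np

def NormalizeColumns(data):
    data = np.array(data, dtype=float)
    colMin = data.min(axis=0)
    colMax = data.max(axis=0)
    #guard against dividing by zero for constant columns
    colRange = np.where(colMax - colMin == 0, 1, colMax - colMin)
    return (data - colMin) / colRange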

Another important facet of supervised learning is the three different types of datasets: training, validation, and test. I'll go over these below.

Stats and graphs and stuffs

Think of the dots above as data. We are trying to learn from the data and create a model so that we can use it in the future for some particular use. The model is the blue line.
For example, suppose the input data is the x coordinate and the output data is the y coordinate. If we want to predict the y coordinate given the x coordinate, we can use the line to predict where the dot will be.
In statistics, overfitting means making your model so complex that it loses predictive value. The overfitting line may look like it makes perfect guesses in the picture above, but in reality, when an overfit model is given new inputs it makes poor predictions. This is because it learns from the stochastic error of the data, the error that varies randomly from one measurement to the next, which causes the model to overreact to minor fluctuations in the data.
This is another way to look at overfitting. The bigger graph at the bottom shows model complexity vs. model prediction error. As the complexity of the model increases, the risk of overfitting goes up. Often you won't be able to tell you are overfitting unless you evaluate the model on new data it wasn't trained on. This is called the validation set. A good way to tell if you are overfitting is to compare the error on the training data with the error on the validation data. If the two errors are roughly the same and improving at the same rate, you are on the right track.

The next image shows how a model's accuracy can be depicted graphically. Note how the green validation line stays near the training set accuracy, while the blue line shows a model that is overfitting.
Overfitting can happen in any statistical learning model, but in the case of neural networks it often occurs when there are too many nodes. Each node attempts to hold specific information about the input data, and when there are too many nodes, they hold information that is too specific and can't be generalized to new data.
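Here is a rough sketch of splitting a dataset into the three sets (the 60/20/20 split and the function name are just for illustration, not what my program does):

#Illustrative only: shuffle the rows and split them into training, validation, and test sets
import numpy as np

def SplitData(inputData, outputData, trainFrac=0.6, valFrac=0.2):
    X, y = np.array(inputData), np.array(outputData)
    indices = np.random.permutation(len(X))
    trainEnd = int(len(X) * trainFrac)
    valEnd = int(len(X) * (trainFrac + valFrac))
    trainIdx, valIdx, testIdx = indices[:trainEnd], indices[trainEnd:valEnd], indices[valEnd:]
    return (X[trainIdx], y[trainIdx]), (X[valIdx], y[valIdx]), (X[testIdx], y[testIdx])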

Training in Python and NumPy

The training method is the meat of the project. This is where backpropagation and several other neural network techniques are implemented. Here is the code with comments explaining the functionality.
 
def Train(self):

    #training iterations are set by the user in the program initialization
    for iter in xrange(self.trainingIterations):

        #layer 0 equals the training inputs in NumPy matrix form
        layer0 = self.TrainingIn

        #neural nets should normally shuffle their training data.  One big bug in this program is that the
        #outputs are in a separate matrix, so randomly shuffling the inputs would pair them with the wrong
        #outputs.  For my training I set shuffle to off.
        if(self.shuffle):
            np.random.shuffle(layer0)

        #np.dot is NumPy matrix multiplication.  It multiplies layer0 (the inputs) by the randomly
        #initialized weights, and the result is fed through the sigmoid function self.nonlin
        layer1 = self.nonlin(np.dot(layer0,self.weights0))

        #you should always include a bias layer.  This means adding a column of 1's to your matrix
        if(self.biasLayer1TF):
            self.biaslayer1 = np.ones(layer1.shape[0])
            layer1 = np.column_stack((layer1,self.biaslayer1))

        #dropout should also always be used.  0 < dropoutPercent < 1
        if(self.dropout):
            layer1 *= np.random.binomial([np.ones((layer1.shape[0],layer1.shape[1]))],1-self.dropoutPercent)[0]

        #layer2 equals layer1*weights1 fed through the sigmoid function self.nonlin
        #layer2 is what the network predicts the output will be
        layer2 = self.nonlin(np.dot(layer1,self.weights1))

        #more on this later
        if(self.miniBtch_StochastGradTrain):
            self.layer1_delta,self.layer2_delta = self.MiniBatchOrStochasticGradientDescent()
            self.layer2_deltaPrev = self.layer2_delta
            self.layer1_deltaPrev = self.layer1_delta
        else:

            #the error = the distance between the guesses and the actual values
            layer2_error = self.TrainingOut - layer2

            #take the derivative (slope) of each node in layer2 and multiply it by the error.  The higher
            #the error, the larger the delta will be.  The learning rate is also applied here, which is how
            #fast you want your network to correct itself.  Finally, momentum is applied at the end, which
            #adds the previous iteration's delta to the current iteration
            self.layer2_delta = layer2_error*self.nonlin(layer2,deriv=True)*self.learningRate+self.momentum*self.layer2_deltaPrev
            self.layer2_deltaPrev = self.layer2_delta

            #.T is the transposed matrix in NumPy
            layer1_error = self.layer2_delta.dot(self.weights1.T)
            self.layer1_delta = layer1_error*self.nonlin(layer1,deriv=True)*self.learningRate+self.momentum*self.layer1_deltaPrev
            self.layer1_deltaPrev = self.layer1_delta

        #update the weights in each layer with the delta layers
        self.weights1 += layer1.T.dot(self.layer2_delta)
        if(self.biasLayer1TF):
            #drop the bias column of layer1_delta so the shapes line up with weights0
            self.weights0 += layer0.T.dot(self.layer1_delta.T[:-1].T)
        else:
            self.weights0 += layer0.T.dot(self.layer1_delta)

        #check the validation error every 20 iterations and stop early if it is low enough
        if (iter % 20) == 0:
            self.CalculateValidationError()

            if(self.minimumValidationError < 0.01):
                break
		
I'm going to walk through the neural net code above with the following data:

Looking at the table we have 4 rows with 3 variables of input data. The output data is 4 rows with 1 result. So we will have a TrainIn matrix of size 4x3 and a TrainOut matrix of 4x1. The above table looks like the following in a NumPy array:
TestIn = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
TestOut = np.array([[0,1,1,0]]).T
Note the ".T" in the TestOut variable. This means transpose the matrix. It is visualized here as a 1x4 matrix, by transposing it we create a 4X1 matrix, conducive to our input set.

Since we are working with such a simple dataset I'm going to skip over validation data and test data. Our test data will simply be our training data.

Now that we have input data to send to the network, I'm going to set the parameters for the network. I do this in main.py. These parameters get sent to the Neural Network class and initialize the model.
I'm going to concentrate on the parameters hiddenDim, learningRate, trainingIterations, momentum, and dropout.

Below is the parameter dictionary I send to the Neural Network class.


Run({"hiddenDim":2, "learningRate":0.05, "trainingIterations":5000, "momentum":0.1,
     "dropout":True, "dropoutPercent":0.65,
     #skip these for this example
     "miniBtch_StochastGradTrain":False, "miniBtch_StochastGradTrain_Split":None,
     "NFoldCrossValidation":False, "N":None,
     "biasLayer1TF":False, "addInputBias":False, "shuffle":False})
			
In the example I send the entire input matrix through the algorithm all at once. This is known as "batch training".
Mini-batch training uses subsets of the input data larger than one row, training and updating the network on each subset.
Stochastic training sends one input row through each training iteration, updating the weights one input at a time. A minimal sketch of the difference is below.
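The array, batch size, and variable names here are only for illustration, not the ones my program uses:

import numpy as np
X = np.random.random((100, 3))   #placeholder input data

#batch: one update per iteration using all 100 rows at once
batch = X

#mini-batch: several updates per iteration, e.g. 25 rows at a time
miniBatches = [X[i:i+25] for i in range(0, len(X), 25)]

#stochastic: one update per row
singleRows = [X[i:i+1] for i in range(len(X))]
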
Initializing weights and deltas
In my NeuralNet class __init__ code I initialize the weights based on the hiddenDim variable:
np.random.seed(1)
self.weights0 = (2*np.random.random((self.TrainingIn.shape[1],self.hiddenDim)) - 1)
self.weights1 = (2*np.random.random((self.hiddenDim+1 if self.biasLayer1TF else self.hiddenDim,1)) - 1)

np.random.random creates a matrix of random numbers between 0 and 1. It is best practice to initialize the weights randomly with mean 0, so I multiply these numbers by 2 and subtract 1.
self.TrainingIn.shape returns the size of the matrix, in our case 4x3. Therefore self.TrainingIn.shape[1] returns the second number of this value, 3.
hiddenDim was set to 2 in our initialization parameters above. So self.weights0 will be a matrix of size 3x2 with random numbers between -1 and 1.
Since bias is turned off for this example, weights1 will be initialized with a matrix size 2(hiddenDim)x1 with random numbers between -1 and 1.

We also need to initialize the "layerx_deltaPrev" matrices for momentum.
np.zeros creates a matrix of 0's.

self.layer2_deltaPrev=np.zeros(shape=(self.TrainingOut.shape[0],self.weights1.shape[1]))
self.layer1_deltaPrev=np.zeros(shape=(self.TrainingIn.shape[0],self.hiddenDim+1 if self.biasLayer1TF else self.hiddenDim))
self.layer2_deltaPrev will be a matrix of size 4X1, and self.layer1_deltaPrev will be a matrix of size 4x2.

Train()

We're ready to enter the Train() method from above. We will iterate in the for loop 5000 times to update the weights of the network based on the input data and expected output.
layer0 = self.TrainingIn. Self explanatory.
Shuffle rearranges the rows of the data into random order.
The next line of code is doing a lot:
layer1 = self.nonlin(np.dot(layer0,self.weights0))
np.dot is matrix multiplication. Multiply the matrix layer0 by self.weights0.

Remember back to linear algebra: these matrices have to have compatible sizes. Since layer0 is size 4x3 and weights0 is size 3x2, we can multiply them together to get a 4x2 matrix. (For matrix multiplication the inner dimensions have to match.)
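A quick way to convince yourself of the resulting shape (the zero matrices are just placeholders):

import numpy as np
print(np.dot(np.zeros((4, 3)), np.zeros((3, 2))).shape)   #prints (4, 2)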

The 4x2 layer1 matrix is illustrated in the picture at the top of the page as the circles in the middle depicting the "hidden layer", although it's a different size than this example. The weights are depicted as the lines between the different layers of the network.
Next the resulting 4x2 matrix gets sent through the function nonlin(). Code:
def nonlin(self, x, deriv=False):
    if(deriv==True):
        return x*(1-x)
    return 1/(1+np.exp(-x))
This is the sigmoid function. 1/(1+e^(-x)).
Important function. Google it. This is our nonlinearity function. Notice it maps any value to a value between 0 and 1, a probabilistic scale. Its derivative returns x*(1-x). We will use this when computing the gradient in a few steps.
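To see why the derivative can be written as x*(1-x): the values passed in with deriv=True have already been run through the sigmoid, and for s = sigmoid(z), calculus gives a derivative of s*(1-s). A quick numerical check (the sample value is arbitrary):

import numpy as np
z = 0.5
s = 1/(1+np.exp(-z))                             #sigmoid(z)
analytic = s*(1-s)                               #what nonlin(s, deriv=True) returns
numeric = (1/(1+np.exp(-(z+1e-6))) - s)/1e-6     #finite-difference slope of sigmoid at z
print(analytic)   #approximately 0.2350
print(numeric)    #approximately 0.2350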

The next line indicates whether to add a bias node to the layer, which is a column of 1's in your matrix. Bias is useful in every neural network; it is used for shifting your activation function to the left or right on the graph. Great link

Dropout
layer1 *= np.random.binomial([np.ones((layer1.shape[0],layer1.shape[1]))],1-self.dropoutPercent)[0] 
Dissecting this line of code:
np.ones((layer1.shape[0],layer1.shape[1])) creates an array of 1's with the same shape as layer1 (4x2). np.random.binomial then turns each of those 1's into a value that is 1 with probability 1-self.dropoutPercent and 0 with probability self.dropoutPercent. So a 4x2 array with a dropout rate of 50% will have roughly four zeroes.

This mask is then multiplied elementwise with layer1, so with dropout turned on a certain percentage of the nodes in layer1 are zeroed out. This is a form of regularization. It prevents the model from overfitting with complex adaptations to the data by "dropping out" random nodes during each iteration. In other words, it keeps the model from becoming too smart for its own good by shutting off random areas of itself at each turn.
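A quick illustration of what the dropout mask looks like (the 4x2 shape and 50% rate mirror the walkthrough; the exact zeros vary on each run):

import numpy as np
mask = np.random.binomial([np.ones((4, 2))], 1 - 0.5)[0]   #each node kept with probability 0.5
print(mask)   #a 4x2 matrix of 0's and 1's, roughly half of them zero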

The next line of code is similar to the above code, but it's computing the final output layer. In other words, the model is guessing what the actual output values will be, which in this case is 0, 1, 1, 0.
layer2 = self.nonlin(np.dot(layer1,self.weights1))
Layer1 is a 4x2 matrix, and self.weights1 is a 2x1 matrix. Multiply these together and get a 4x1 matrix. Then send them through the sigmoid function to map the values between 0 and 1. Notice this is the same size as the output matrix.
Next I will compute the error and deltas and begin the fun stuff, backpropagation.

Backpropagation
layer2_error = self.TrainingOut - layer2
The self.TrainingOut matrix is the actual output values for the dataset, 0,1,1,0. Layer2 is what the neural network guessed. Subtracting layer2 from TrainingOut tells the network how far off its guessed values are from the actual values.
self.layer2_delta = layer2_error*self.nonlin(layer2,deriv=True)*self.learningRate+self.momentum*self.layer2_deltaPrev
The simple version of the delta, or error-weighted derivative, is layer2_error*self.nonlin(layer2,deriv=True). This takes the slope (derivative) at each node in layer2 and multiplies it by that node's error. For example, if a node's error is 0.24 and its slope is 0.2, its delta is 0.24*0.2 = 0.048.


self.learningRate is the rate at which you want your network to update its weights. It can be between 0 and 1, and the default is 1. I think of the learning rate as how quickly you want your network to abandon its preconceived ideas about the data.
I learned how important the learning rate becomes when fine-tuning a pretrained network on new data. Fine-tuning means taking a pre-existing model, such as ResNet or Google's Inception, that already has weights learned from a specific dataset, and training it on your own data. If you train your data's new nodes with a high learning rate, the new nodes won't know anything about the data yet and will proceed to overwrite all of the pretrained model's learned weights. This means that all the work the model did to learn its pretrained weights will be wiped out.

Momentum
Momentum is the second part of the code in the line above:
self.momentum*self.layer2_deltaPrev 

The momentum value is also between 0 and 1, and can be thought of as giving the network's learning a velocity with some memory of past velocities. If you are descending the gradient, searching for the minimum (as in the picture below), you don't want to change direction sharply at every step, because some directions lead to local minima. Momentum smooths out the descent and pushes the network in the direction most of the previous updates pushed it, rather than just the current one. This is done by adding the previous iteration's delta to the current one (the self.momentum*self.layer2_deltaPrev term above) and then saving off the current delta for the next iteration:
self.layer2_deltaPrev=self.layer2_delta
Note: because layer2_error is a 4x1 matrix and layer2 is also a 4x1 matrix, we are taking the Hadamard product, multiplying the corresponding elements of the two matrices to produce a matrix of the same size. The resulting layer2_delta will be a 4x1 matrix.
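A tiny example of the difference between the two kinds of multiplication used here:

import numpy as np
a = np.array([[1.0], [2.0], [3.0], [4.0]])   #4x1
b = np.array([[0.5], [0.5], [0.5], [0.5]])   #4x1
print(a * b)        #Hadamard (elementwise) product, still 4x1
print(a.T.dot(b))   #matrix product of (1x4) and (4x1), a 1x1 result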

Layer1 error and delta
Calculate layer1_error:
layer1_error = self.layer2_delta.dot(self.weights1.T)
weights1 is a 2x1 matrix, and .T transposes it into a 1x2 matrix. Therefore layer1_error is (4x1)*(1x2) = a 4x2 matrix, the same size as layer1.
We again do Hadamard, or "elementwise" multiplication to get the layer1_delta. We do this with the following code:
self.layer1_delta = layer1_error * self.nonlin(layer1,deriv=True)*self.learningRate+self.momentum*self.layer1_deltaPrev
self.layer1_deltaPrev=self.layer1_delta
Momentum and learning rate are applied in the same way they were for layer2_delta, and layer1_deltaPrev is saved off as well.

Weight updates
This is where the network "learns". We update the weights of the network so that its guesses more closely resemble the output data:
self.weights1 += layer1.T.dot(self.layer2_delta)
layer2_delta is a (4x1) shaped matrix.
layer1 is a (4x2) shaped matrix -> transposed, it is a (2x4) shaped matrix.
A 2x4 matrix times a 4x1 matrix gives a 2x1 matrix, which is the shape of weights1. We add these values to the current weights so that weights1 is updated with numbers that will more accurately guess the output the next time the network forward propagates.

The math behind this works because of basic calculus. The derivative, or rate of change, points us along the steepest route toward the correct value of the function; following it is called gradient descent. On top of this, we already know how far off we are by calculating the error. If we are not far off from the correct value, we don't have to move far, which is why we multiply the error by the derivative.
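A stripped-down illustration of gradient descent on a single-variable function (this is only the general idea, not the network code):

#Illustrative only: gradient descent on f(x) = x**2, whose derivative is 2*x
x = 5.0
learningRate = 0.1
for i in range(50):
    gradient = 2 * x               #slope of f at the current x
    x -= learningRate * gradient   #step downhill, proportional to the slope
print(x)   #ends up very close to 0, the minimum of f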

From here, just keep iterating through the for loop until the weights update enough to guess the correct output values.