The following code is a simple three-layer neural network written in Python and NumPy. This was my first shot at programming a neural network, so there may be some issues with it. I wanted to build it from scratch so I could attempt to understand and conceptualize the math involved.
A neural network is a learning algorithm that attempts to solve complex problems in computer vision, speech recognition, and other fields of artificial intelligence. Artificial neural networks were named after their similarity to the human brain, with its millions of neurons and axons receiving inputs from the external environment and creating electrical impulses which travel through the connected network of the brain.
I'm going to be using NumPy here, which is a linear algebra library for Python. I read all my data from a CSV file. A basic function for this is below, which creates two lists, inputData and outputData. The output value is read from index 5 of each row and the input values from index 6 through the last column.
#Basic, stripped down data load from a CSV file to two lists of NumPy long floats
import csv
import itertools
import numpy as np

def LoadData(fileName, start, end):
    inputData = []    #plain python list of input rows
    outputData = []
    with open(fileName, 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        #itertools.islice lets us read only rows start..end
        for row in itertools.islice(reader, start, end):
            output = np.longfloat(row[5])   #the output value sits at index 5 of each row
            #build a list of long floats from index 6 through the last index of the row
            inputRow = [np.longfloat(x) for x in itertools.islice(row, 6, len(row)) if x != '']
            #add each row to the python lists
            inputData.append(inputRow)
            outputData.append(output)
    return inputData, outputData
Note that the input data should already be normalized. I'm not going to go into detail on it here, but it is a very important part of building a working neural network.
Links on data normalization: Data Normalization in Neural Networks, How to Standardize Data for Neural Networks.
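I'm not covering it here, but as a rough, hedged sketch of what z-score standardization of the loaded inputs could look like (inputData comes from LoadData above; everything else is my own illustration):

import numpy as np

#hypothetical example: scale each input column to mean 0 and standard deviation 1
X = np.array(inputData, dtype=float)
stds = X.std(axis=0)
stds[stds == 0] = 1.0                  #avoid dividing by zero for constant columns
X = (X - X.mean(axis=0)) / stds

Any consistent scheme (z-score, min-max, etc.) works; the point is that all inputs end up on comparable scales before training.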
Another important facet of supervised learning is the three different types of datasets: training data, validation data, and test data. I'll go over these below.
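The program skips a real split for the toy example below, but a minimal sketch of carving one dataset into those three sets might look like this (the 70/15/15 ratios and variable names are just my own illustration; inputData and outputData come from LoadData above):

import numpy as np

data   = np.array(inputData, dtype=float)
labels = np.array(outputData, dtype=float)
n = data.shape[0]
idx = np.random.permutation(n)                      #shuffle inputs and outputs together
trainEnd, valEnd = int(0.7*n), int(0.85*n)
TrainIn,  TrainOut = data[idx[:trainEnd]],       labels[idx[:trainEnd]]
ValidIn,  ValidOut = data[idx[trainEnd:valEnd]], labels[idx[trainEnd:valEnd]]
TestIn,   TestOut  = data[idx[valEnd:]],         labels[idx[valEnd:]]

Shuffling an index array like this keeps every input row paired with its output, which sidesteps the shuffle bug noted in the comments of Train() below. Here is the training routine itself: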
def Train(self):
    #training iterations are set by the user in the program initialization.
    for iter in xrange(self.trainingIterations):
        #layer0 equals the training inputs in NumPy matrix form.
        layer0 = self.TrainingIn
        #neural nets should normally shuffle their training data. One big bug in this program is that the
        #outputs live in a separate matrix, so shuffling only the inputs would pair them with the wrong
        #outputs. For my training I set shuffle to off.
        if(self.shuffle):
            np.random.shuffle(layer0)
        #np.dot is NumPy matrix multiplication. It multiplies layer0 (the input) by the randomly
        #initialized weights, and the result is fed through the sigmoid function self.nonlin
        layer1 = self.nonlin(np.dot(layer0,self.weights0))
        #you should always include a bias node. This means adding a column of 1's to your matrix
        if(self.biasLayer1TF):
            self.biaslayer1 = np.ones(layer1.shape[0])
            layer1 = np.column_stack((layer1,self.biaslayer1))
        #dropout is also well worth using. dropoutPercent must be between 0 and 1
        if(self.dropout):
            layer1 *= np.random.binomial([np.ones((layer1.shape[0],layer1.shape[1]))],1-self.dropoutPercent)[0]
        #layer2 equals layer1*weights1 fed through the sigmoid function self.nonlin
        #layer2 is what the network predicts the output will be.
        layer2 = self.nonlin(np.dot(layer1,self.weights1))
        #more on this later
        if(self.miniBtch_StochastGradTrain):
            self.layer1_delta,self.layer2_delta = self.MiniBatchOrStochasticGradientDescent()
            self.layer2_deltaPrev = self.layer2_delta
            self.layer1_deltaPrev = self.layer1_delta
        else:
            #the error is the difference between the guesses and the actual values.
            layer2_error = self.TrainingOut - layer2
            #take the derivative (slope) of each node in layer2 and multiply it by the error: the higher
            #the error, the bigger the delta. The learning rate is applied here as well, which controls
            #how fast you want the network to correct itself. Momentum is applied at the end, which adds
            #a fraction of the previous iteration's deltas to the current iteration.
            self.layer2_delta = layer2_error*self.nonlin(layer2,deriv=True)*self.learningRate+self.momentum*self.layer2_deltaPrev
            self.layer2_deltaPrev = self.layer2_delta
            #.T means the transposed matrix in NumPy.
            layer1_error = self.layer2_delta.dot(self.weights1.T)
            self.layer1_delta = layer1_error*self.nonlin(layer1,deriv=True)*self.learningRate+self.momentum*self.layer1_deltaPrev
            self.layer1_deltaPrev = self.layer1_delta
        #update the weights in each layer with the delta layers
        self.weights1 += layer1.T.dot(self.layer2_delta)
        if(self.biasLayer1TF):
            #drop the bias column from the delta before updating weights0
            self.weights0 += layer0.T.dot(self.layer1_delta.T[:-1].T)
        else:
            self.weights0 += layer0.T.dot(self.layer1_delta)
        #every 20 iterations, check the validation error and stop early if it is low enough
        if (iter % 20) == 0:
            self.CalculateValidationError()
            if(self.minimumValidationError < 0.01):
                break
Looking at the table below, we have 4 rows with 3 input variables each. The output data is 4 rows with 1 result each, so we will have a TrainIn matrix of size 4x3 and a TrainOut matrix of size 4x1.

Input1  Input2  Input3  Output
0       0       1       0
0       1       1       1
1       0       1       1
1       1       1       0

The table looks like the following as NumPy arrays:

TestIn = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
TestOut = np.array([[0,1,1,0]]).T

Note the ".T" in the TestOut variable. This means transpose the matrix. It is written above as a 1x4 matrix; transposing it creates a 4x1 matrix, matching our input set.
Since we are working with such a simple dataset, I'm going to skip over validation data and test data; our test data will simply be our training data. Now that we have input data to send to the network, I'm going to set the parameters for the network. I do this in main.py. These parameters get sent to the Neural Network class and initialize the model. I'm going to concentrate on the parameters hiddenDim, learningRate, trainingIterations, momentum, and dropout. Below is the parameter dictionary I send to the Neural Network class.
Run({"hiddenDim":2,"learningRate":.05,"trainingIterations":5000,"momentum":.1,
"dropout":True,"dropoutPercent":0.65,
"miniBtch_StochastGradTrain":"False","miniBtch_StochastGradTrain_Split":null,"NFoldCrossValidation":False,"N":null#skip these,
"biasLayer1TF":True,"addInputBias":False,"shuffle":False
})
np.random.seed(1)
self.weights0 = (2*np.random.random((self.TrainingIn.shape[1],self.hiddenDim)) - 1)
self.weights1 = (2*np.random.random((self.hiddenDim+1 if self.biasLayer1TF else self.hiddenDim,1)) - 1)
np.random.random creates a matrix of random numbers between 0 and 1. It is best practice to initialize the weights randomly with mean 0, so I multiply these numbers by 2 and subtract 1. self.TrainingIn.shape returns the size of the matrix, in our case 4x3, so self.TrainingIn.shape[1] returns the second number of this value, 3. hiddenDim was set to 2 in our initialization parameters above, so self.weights0 will be a 3x2 matrix of random numbers between -1 and 1. Since bias is turned off for this example, weights1 will be initialized as a 2 (hiddenDim) x 1 matrix of random numbers between -1 and 1.

We also need to initialize the "layerx_deltaPrev" matrices for momentum. np.zeros creates a matrix of 0's.
self.layer2_deltaPrev=np.zeros(shape=(self.TrainingOut.shape[0],self.weights1.shape[1]))
self.layer1_deltaPrev=np.zeros(shape=(self.TrainingIn.shape[0],self.hiddenDim+1 if self.biasLayer1TF else self.hiddenDim))
self.layer2_deltaPrev will be a matrix of size 4x1, and self.layer1_deltaPrev will be a matrix of size 4x2.
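To make those shapes concrete, here is a standalone sketch of the same initialization with bias turned off (the names mirror the class attributes, but this is just an illustration):

import numpy as np

np.random.seed(1)
TrainingIn  = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
TrainingOut = np.array([[0,1,1,0]]).T
hiddenDim = 2
weights0 = 2*np.random.random((TrainingIn.shape[1],hiddenDim)) - 1     #shape (3,2), values in [-1,1)
weights1 = 2*np.random.random((hiddenDim,1)) - 1                       #shape (2,1)
layer2_deltaPrev = np.zeros((TrainingOut.shape[0],weights1.shape[1]))  #shape (4,1)
layer1_deltaPrev = np.zeros((TrainingIn.shape[0],hiddenDim))           #shape (4,2)

Now, the first step of forward propagation: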
layer1 = self.nonlin(np.dot(layer0,self.weights0))
np.dot is matrix multiplication. Multiply the matrix layer0 by self.weights0.
Remember back to linear algebra: these matrices have to be compatible sizes. Since layer0 is 4x3 and weights0 is 3x2, we can multiply them to get a 4x2 matrix. (For matrix multiplication the inner dimensions, the middle two numbers, have to match.)
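For example, a quick standalone check of those dimensions:

import numpy as np

layer0   = np.zeros((4,3))          #4 training rows, 3 input values each
weights0 = np.zeros((3,2))          #3 inputs feeding 2 hidden nodes
hidden   = np.dot(layer0,weights0)  #inner dimensions (3 and 3) match, so this works
#hidden.shape == (4, 2)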
The 4x2 layer1 matrix is illustrated in the picture at the top of the page as the circles in the middle depicting the "hidden layer", although it's a different size than this example. The weights are depicted as the lines between the different layers of the network.
Next the resulting 4x2 matrix gets sent through the function nonlin(). Code:
def nonlin(self,x,deriv=False):
    if(deriv==True):
        #x is assumed to already be a sigmoid output here, so this is sigmoid*(1-sigmoid)
        return x*(1-x)
    return 1/(1+np.exp(-x))
This is the sigmoid function. 1/(1+e^(-x)).
Important function. Google it. This is our nonlinearity function. Notice it maps any value to a value between 0 and 1, a probabilistic scale. Its derivative is sigmoid(x)*(1-sigmoid(x)); since the values we pass in have already gone through the sigmoid, the code simply returns x*(1-x). We will use this when computing the gradient in a few steps.
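A few concrete values, using a standalone copy of nonlin() for illustration:

import numpy as np

def nonlin(x,deriv=False):
    if(deriv==True):
        return x*(1-x)          #x is assumed to already be a sigmoid output
    return 1/(1+np.exp(-x))

a = nonlin(0.0)                 #0.5: the sigmoid maps 0 to the middle of the 0..1 range
b = nonlin(10.0)                #~0.99995: large inputs saturate near 1
slope = nonlin(a,deriv=True)    #0.25: the slope is steepest at 0.5 and flattens near 0 and 1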
The next lines indicate whether to add a bias node to the layer, which is a column of 1's in your matrix. Bias is worth having in every neural network; it shifts the activation function to the left or right on the graph.
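For illustration, adding a bias column to a 4x2 hidden layer with the same np.column_stack call used in Train() looks like this (standalone sketch, not the class code):

import numpy as np

layer1 = np.random.random((4,2))          #hidden layer activations
bias   = np.ones(layer1.shape[0])         #one 1 per training row
layer1 = np.column_stack((layer1,bias))   #now shape (4,3): the 2 hidden nodes plus the bias node

Next comes the dropout step: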
layer1 *= np.random.binomial([np.ones((layer1.shape[0],layer1.shape[1]))],1-self.dropoutPercent)[0]
Dissecting this line of code:

np.ones((layer1.shape[0],layer1.shape[1])) creates an array of 1's the same shape as layer1 (4x2).

np.random.binomial([np.ones((layer1.shape[0],layer1.shape[1]))],1-self.dropoutPercent)[0] turns that into an array whose entries are 1 with probability 1-self.dropoutPercent and 0 with probability self.dropoutPercent. So a 4x2 array with a dropout rate of 50% will have roughly four zeroes.

This mask is then multiplied elementwise with layer1, so a certain percentage of the nodes in layer1 will be zero whenever dropout is on. This is a form of regularization. It prevents the model from overfitting to complex quirks of the data by "dropping out" random nodes during each iteration. In other words, it keeps the model from becoming too smart for its own good by shutting off random areas of itself at each turn.
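Here is a stripped-down, standalone version of the same masking step, purely for illustration:

import numpy as np

dropoutPercent = 0.5
layer1 = np.random.random((4,2))
#each mask entry is 1 with probability 1-dropoutPercent, otherwise 0
mask = np.random.binomial(np.ones(layer1.shape,dtype=int),1-dropoutPercent)
layer1 = layer1*mask                      #roughly half the hidden nodes are zeroed this iteration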
The next line of code is similar to the layer1 calculation, but it computes the final output layer. In other words, the model guesses what the actual output values will be, which in this case are 0, 1, 1, 0.
layer2 = self.nonlin(np.dot(layer1,self.weights1))
Layer1 is a 4x2 matrix, and self.weights1 is a 2x1 matrix. Multiply these together and get a 4x1 matrix. Then send them through the sigmoid function to map the values between 0 and 1. Notice this is the same size as the output matrix. Next I will compute the error and deltas and begin the fun stuff, backpropagation.
layer2_error = self.TrainingOut - layer2
The self.TrainingOut matrix holds the actual output values for the dataset: 0, 1, 1, 0. Layer2 is what the neural network guessed. Subtracting layer2 from TrainingOut tells the network how far off its guesses are from the actual values.
self.layer2_delta = layer2_error*self.nonlin(layer2,deriv=True)*self.learningRate+self.momentum*self.layer2_deltaPrev
The simple version of the delta, or the error-weighted derivative, is layer2_error*self.nonlin(layer2,deriv=True). This takes the slope (derivative) of the sigmoid at each layer2 value and multiplies it by that value's error. For example, if a node's error is 0.24 and its slope is 0.2, its raw delta is 0.24*0.2=0.048. The learning rate then scales that result, and momentum adds in a fraction of the previous iteration's delta:
self.momentum*self.layer2_deltaPrev
The momentum value is also between 0 and 1, and can be thought of as giving the network's learning a velocity with some memory of past velocities. When you are going down the gradient searching for the minimum (as in the picture below), you don't want to change direction sharply at every step, because some directions lead to local minima. Momentum smooths out the descent and pushes the network in the direction most of the updates have been pushing it, rather than only the current one. You do this by adding a fraction of the previous iteration's delta to the current one, then saving the current delta off for the next iteration:
self.layer2_deltaPrev=self.layer2_delta
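Putting made-up numbers to it for a single node (purely illustrative, continuing the 0.24 and 0.2 example above):

error        = 0.24      #how far off this node's guess was
slope        = 0.2       #sigmoid derivative at this node's current output
learningRate = 0.05
momentum     = 0.1
deltaPrev    = 0.048     #this node's delta from the previous iteration

delta = error*slope*learningRate + momentum*deltaPrev
#     = 0.24*0.2*0.05 + 0.1*0.048 = 0.0024 + 0.0048 = 0.0072

Next, the error is pushed back to the hidden layer: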
layer1_error = self.layer2_delta.dot(self.weights1.T)
weights1 is a 2x1 matrix, and .T transposes it to a 1x2 matrix. Therefore layer1_error is a (4x1)*(1x2)=4x2 matrix, the same size as layer1. This gives each hidden node its share of the output error, weighted by how strongly that node was connected to the output.
We again do Hadamard, or "elementwise" multiplication to get the layer1_delta. We do this with the following code:
self.layer1_delta = layer1_error * self.nonlin(layer1,deriv=True)*self.learningRate+self.momentum*self.layer1_deltaPrev
self.layer1_deltaPrev=self.layer1_delta
Momentum and learning rate are applied in the same way they were for layer2_delta, and layer1_deltaPrev is saved off as well.
self.weights1 += layer1.T.dot(self.layer2_delta)
layer2_delta is a 4x1 matrix.
layer1 is a 4x2 matrix; transposed, it is 2x4.
A 2x4 matrix times a 4x1 matrix gives a 2x1 matrix, which is the shape of weights1. We add these values to the randomized weights so that weights1 updates with numbers that will more accurately guess the output the next time the network forward propagates.
The math behind this works because of basic calculus. The derivative, or rate of change, tells us the steepest direction to move to reduce the error; following it is called gradient descent. On top of this, we already know how far off we are because we calculated the error: if we are not far from the correct value, we don't have to move far. This is why we multiply the error by the derivative.
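The same idea in miniature, on a one-variable toy function of my own (not part of the network code): to minimize (w-3)^2 you repeatedly step against its derivative 2*(w-3), and the steps shrink as the error shrinks:

w = 0.0
learningRate = 0.1
for i in xrange(50):
    gradient = 2*(w - 3)           #derivative of (w-3)**2
    w -= learningRate*gradient     #big error -> big step, small error -> small step
#w is now very close to 3, the minimum of the function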
From here, just keep iterating through the for loop until your weights update enough to guess the correct output values.
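To tie everything together, here is a minimal standalone sketch in the spirit of this network, trained on the TestIn/TestOut data from above. It drops the bias, dropout, momentum, and validation checks, fixes the learning rate at effectively 1, and uses 4 hidden nodes instead of 2 so this toy run converges reliably; it is a simplification for illustration, not the class described in this post:

import numpy as np

def nonlin(x,deriv=False):
    if(deriv==True):
        return x*(1-x)
    return 1/(1+np.exp(-x))

np.random.seed(1)
TestIn  = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
TestOut = np.array([[0,1,1,0]]).T

weights0 = 2*np.random.random((3,4)) - 1
weights1 = 2*np.random.random((4,1)) - 1

for i in xrange(60000):
    layer0 = TestIn
    layer1 = nonlin(np.dot(layer0,weights0))            #forward pass, hidden layer
    layer2 = nonlin(np.dot(layer1,weights1))            #forward pass, output guess

    layer2_error = TestOut - layer2                     #how far off the guesses are
    layer2_delta = layer2_error*nonlin(layer2,deriv=True)

    layer1_error = layer2_delta.dot(weights1.T)         #push the error back through weights1
    layer1_delta = layer1_error*nonlin(layer1,deriv=True)

    weights1 += layer1.T.dot(layer2_delta)              #update the weights with the deltas
    weights0 += layer0.T.dot(layer1_delta)

#after training, layer2 holds values close to the targets 0, 1, 1, 0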