During August and September 2019 I experimented with crowd counting, a computer vision regression task. I was trying to improve on the state-of-the-art (SOTA) models but came up short. The following papers were referenced heavily in my experiments:
Below are the enhancements I tried on the above networks. As a baseline I used the first layers of a pretrained VGGNet, similar to the approach in the above papers and other crowd counting papers. Most of my modeling centered on CSRNet's use of dilated convolutions. I tried what seemed like every neural network trick in the book (dilated convolutions, grouped convolutions, attention masks, sparsity, transfer learning, and so on) and came out with a pretentious combination of the above models.
Enhancements:
1st set of experiments:
2nd set of experiments:
3rd set of experiments:
I found code on GitHub to generate the density maps that serve as the crowd counting targets. For each annotated head, it computes the distance to the nearest neighbors and uses that to size a Gaussian filter, which fills in an estimate of the density map. Thanks to GitHub user leeyeehoo for providing this library.
Below is one of the models I used for training. It gave me an error on par with SOTA models, but I wasn't able to do better before overfitting occurred, so I decided to end my analysis here. All models were trained on ShanghaiTech Part A.
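To make the density map idea concrete, here is a minimal NumPy sketch of a geometry-adaptive Gaussian density map. It is not the library's code: the function name `density_map` and the values of `k`, `beta`, and `default_sigma` are assumptions chosen for illustration, and each Gaussian is normalized over the image so the map sums to the head count, which is a simplification.

```python
import numpy as np

def density_map(points, shape, k=3, beta=0.3, default_sigma=4.0):
    """Sum of one normalized Gaussian per annotated head.

    points: list of (row, col) head positions inside the image.
    shape: (height, width) of the output map.
    k, beta, default_sigma: illustrative values (assumptions).
    """
    height, width = shape
    rows = np.arange(height)[:, None]  # column vector of row indices
    cols = np.arange(width)[None, :]   # row vector of column indices
    pts = np.asarray(points, dtype=float).reshape(-1, 2)
    density = np.zeros(shape, dtype=float)
    for i, (r, c) in enumerate(pts):
        others = np.delete(pts, i, axis=0)
        if len(others) > 0:
            # Mean distance to the (up to) k nearest other heads sets the width.
            dists = np.sort(np.hypot(others[:, 0] - r, others[:, 1] - c))
            sigma = beta * dists[:k].mean()
        else:
            sigma = default_sigma
        sigma = max(sigma, 1.0)  # avoid a degenerate kernel for very close heads
        kernel = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2))
        density += kernel / kernel.sum()  # each head contributes a total mass of 1
    return density

heads = [(10, 10), (12, 30), (40, 25)]
dmap = density_map(heads, (64, 64))
print(dmap.shape, dmap.sum())  # the sum matches the head count up to float error
```

Heads in dense regions get narrow Gaussians and isolated heads get wide ones, which is why the nearest-neighbor distance is used to pick the filter width.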
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class GatedCSRNet(nn.Module):
    def __init__(self, load_weights=False):
        super(GatedCSRNet, self).__init__()
        self.seen = 0
        # VGG-style front end; the three max-pools reduce resolution to 1/8
        self.frontend_feat = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512]
        self.frontend = make_layers(self.frontend_feat)
        # Regressor: dilated convolutions ending in 64 per-pixel expert maps
        self.output_layer = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(512, 256, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, groups=64, kernel_size=1))
        # Gate: one value per expert map at every pixel
        self.gate = nn.Sequential(
            nn.Conv2d(512, 256, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, groups=64, kernel_size=1))
        self.soft = nn.Softmax(dim=1)

    def forward(self, input, train=True):
        x = self.frontend(input)
        regressor = self.output_layer(x)
        gate = self.gate(x)
        # trained with and without random noise in the gate
        # see paper on outrageously large NN's
        noise = gate + 1e-2
        gate_soft = F.softplus(noise)
        if train:
            # randn_like keeps the noise on the same device and dtype as the gate
            std = torch.randn_like(gate)
            gate_multiplier = std * gate * gate_soft
        else:
            gate_multiplier = gate_soft
        gate = gate + gate_multiplier
        # sparse gating portion of code: keep the top-k gate values per pixel, set the rest to -inf
        # (with k=64, equal to the number of gate channels, nothing is masked; lower k for sparsity)
        sparse = torch.topk(gate, 64, dim=1, largest=True, sorted=True)
        neg_inf = torch.full_like(gate, float('-inf'))
        gate = neg_inf.scatter(1, sparse.indices, sparse.values)
        # gating portion of network: softmax over the experts, then a weighted sum of expert maps
        gate = self.soft(gate)
        final = gate * regressor
        final = final.sum(1)  # sum over dimension 1 (the experts)
        return final, gate

    # Not called in __init__ as shown; call it explicitly if needed
    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight, std=.01)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, std=.01)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)

def make_layers(cfg, in_channels=3, batch_norm=False, dilation=False):
    if dilation:
        d_rate = 2
    else:
        d_rate = 1
    layers = []
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        else:
            # padding equal to the dilation rate keeps the spatial size for a 3x3 kernel
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=d_rate, dilation=d_rate)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)
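For comparison with the gating in the forward pass above, here is a minimal sketch of noisy top-k gating in the form I understand the "outrageously large neural networks" paper to describe, applied per pixel over the channel dimension. The function name `noisy_top_k_gate` and the separate `noise_logits` input are assumptions for illustration. The model above differs in that it derives its noise scale from the gate values themselves.

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gate(clean_logits, noise_logits, k, training=True):
    """Hypothetical helper: softmax over the k largest noisy logits per pixel.

    clean_logits, noise_logits: tensors of shape (batch, experts, H, W).
    Returns gate weights of the same shape; all but k experts per pixel get 0.
    """
    if training:
        # Softplus keeps the learned noise scale positive.
        noise_std = F.softplus(noise_logits)
        logits = clean_logits + torch.randn_like(clean_logits) * noise_std
    else:
        logits = clean_logits
    top_vals, top_idx = torch.topk(logits, k, dim=1)
    # Everything outside the top k becomes -inf, so softmax assigns it exactly 0.
    masked = torch.full_like(logits, float("-inf")).scatter(1, top_idx, top_vals)
    return F.softmax(masked, dim=1)

torch.manual_seed(0)
gates = noisy_top_k_gate(torch.randn(2, 8, 4, 4), torch.randn(2, 8, 4, 4), k=2)
print(gates.shape, (gates > 0).sum(dim=1).unique())
```

Choosing `k` smaller than the number of experts is what makes the gate sparse. With `k` equal to the number of experts, this reduces to a plain softmax over noisy logits.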
The loss below adds the squared coefficient of variation of the gating network's outputs, taken prior to the softmax activation, to the counting loss. It encourages the gate values to stay close to each other so that no single value gets too high and does all the work.
Without this loss, a few experts in the network take all the attention. During the first few iterations of training, backpropagation pushes a few experts toward the ideal value and leaves the other branches of the network behind. This is a self-reinforcing phenomenon: the favored experts get better and better while the other branches stay the same. This is especially true in sparsely gated networks.
def cv_squared(input, target, gateValsPriorToSoftmax):
    """Counting loss plus the squared coefficient of variation of the gate values.

    The squared coefficient of variation is useful as a loss to encourage
    a positive distribution to be more uniform.
    Epsilons added for numerical stability.
    Args:
        input: predicted density map, a `Tensor`.
        target: ground-truth density map, a `Tensor`.
        gateValsPriorToSoftmax: per-gate values, a `Tensor` with 64 entries.
    Returns:
        a `Scalar`.
    """
    # CSRNet summed over the entire picture to find loss
    main_loss = ((input - target) ** 2).sum()
    epsilon = 1e-10
    float_size = 64 + epsilon  # 64 gates
    mean = torch.sum(gateValsPriorToSoftmax) / float_size
    variance = torch.sum((gateValsPriorToSoftmax - mean) ** 2) / float_size
    importance = variance / (mean ** 2 + epsilon)
    # lambda_ is a module-level hyperparameter defined elsewhere;
    # tried different factors for lambda_ from .05 to 10
    return main_loss + lambda_ * importance
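As a quick numeric illustration of the squared coefficient of variation on its own, the plain-Python sketch below compares a balanced set of gate values with a collapsed one. The helper name `squared_cv` and the four-gate inputs are made up for the example.

```python
def squared_cv(values, eps=1e-10):
    """Squared coefficient of variation: variance / mean^2 (population variance)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return variance / (mean ** 2 + eps)

balanced = squared_cv([1.0, 1.0, 1.0, 1.0])   # every gate carries equal weight
collapsed = squared_cv([4.0, 0.0, 0.0, 0.0])  # one gate does all the work
print(balanced, collapsed)
```

The balanced case gives 0 and the collapsed case gives about 3 (a variance of 3 over a squared mean of 1), so the penalty grows as the load concentrates on fewer gates.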
I used a batch size of 1 because the pictures vary in size; a fully convolutional network makes this possible. The loss above relies on a "batchwise sum of each gate". Because of my small batch size and the sparse gating, I tried accumulating the values for each gate over the epoch, which helped keep the coefficient of variation small. This is the "importance" variable below: I detach the gate values, sum them over dimensions 2 and 3 (height and width), and fold them into a running mean of the importance each iteration. Below is a full training epoch.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Running mean of the per-gate sums; assumed to start at zero each epoch (1 row, 64 gates)
importance = torch.zeros(1, 64, device=device)
for i, (img, target, img_path) in enumerate(train_loader):
    data_time.update(time.time() - end)
    img = img.to(device)
    output, gate = model(img)
    target = target.float().unsqueeze(0).to(device)
    # Sum each gate over height and width, then update the running mean.
    # Note: the gate is detached, so as written the importance term changes
    # the loss value but carries no gradient back to the gate.
    hidden = gate.detach()
    hidden = hidden.sum(3).sum(2)
    importance = importance * i
    importance = importance.add_(hidden)
    importance /= (i + 1)
    loss = cv_squared(output, target, importance)
    losses.update(loss.item(), img.size(0))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
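The importance update in the loop is a running mean. The sketch below checks that on random stand-in gate tensors of varying size. The tensor shapes, the five iterations, and the names `per_image` and `expected` are assumptions for the example, not values from my training runs.

```python
import torch

torch.manual_seed(0)
num_experts = 64
importance = torch.zeros(1, num_experts)
per_image = []

for i in range(5):
    h, w = 8 + i, 10 + i  # image sizes vary, as with a batch size of 1
    # Stand-in for the gate returned by the model: softmax over the experts.
    gate = torch.softmax(torch.randn(1, num_experts, h, w), dim=1)
    hidden = gate.detach().sum(3).sum(2)  # shape (1, num_experts)
    per_image.append(hidden)
    # Same update as the training loop: scale up, add, divide by the new count.
    importance = (importance * i + hidden) / (i + 1)

expected = torch.stack(per_image).mean(dim=0)
print(torch.allclose(importance, expected, atol=1e-4))
```

Because each pixel's gate values sum to 1 after the softmax, larger images contribute larger per-gate sums, so the running mean weights big pictures more heavily than small ones.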
I was disappointed to come up short and have yet another experiment fail me. However, I learned a lot about the PyTorch framework in the process. The framework is more intuitive than pure TensorFlow and significantly more versatile than the Keras API. Though I haven't used TensorFlow's new version, PyTorch was smooth enough to learn that I won't be going back.
Time to peel the TensorFlow sticker off my laptop.
Until next time.