Experiments in Crowd Counting

During August and September 2019 I experimented with computer-vision regression models for crowd counting. I set out to improve on the state-of-the-art (SOTA) models but came up short. The following papers were referenced heavily in my experiments:

  1. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes
  2. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
  3. Multi-Scale Attention Network for Crowd Counting

Modifications of the above networks

Below are the enhancements I tried on the above networks. As a baseline I used the first layers of a pretrained VGGNet, similar to the tactic used in the above papers and other crowd-counting papers. Additionally, most of my modeling centered on CSRNet's use of dilations. I tried what seemed like every neural network trick in the book: convolutional dilations, groupings, attention masks, sparsity, transfer learning, etc., ending up with an ambitious combination of the above models.

1st set of experiments:

  1. 4-6 layers on top of the VGGNet architecture with dilation=2 and padding=2. Layer sizes ranged from 512 down to 64 channels.
  2. Grouped convolutions at the output layer, with group counts ranging from 4 to 64 across experiments.
  3. A gating network branching off the VGGNet, with a grouped-convolution output matching the output of the regression layer (#2).
  4. Sparse gating of the gating network, keeping between 1 and 64 of the top gates (see #5 and the code below).
  5. A loss equal to CSRNet's loss plus the cv_squared loss from the paper 'Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer' (a minimal sketch of that paper's gating follows this list).
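For reference, here is a minimal sketch of that paper's noisy top-k gating, written for a flat feature vector; the weight names and shapes are illustrative assumptions, not my code. The model later in this post adapts the same idea to convolutional feature maps.

In [ ]:
import torch
import torch.nn.functional as F

def noisy_topk_gate(x, w_gate, w_noise, k):
    """x: (batch, d) features; w_gate, w_noise: (d, num_experts) weights."""
    clean = x @ w_gate
    # perturb each gate logit by noise scaled with a learned softplus term
    noisy = clean + torch.randn_like(clean) * F.softplus(x @ w_noise)
    # keep the top-k logits; the rest become -inf so the softmax zeroes them
    topk = torch.topk(noisy, k, dim=1)
    sparse = torch.full_like(noisy, float('-inf'))
    sparse.scatter_(1, topk.indices, topk.values)
    return F.softmax(sparse, dim=1)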

2nd set of experiments:

  1. Built dilated layers on top of the VGGNet with grouped convolutional outputs and a gating network, using the loss function from the paper 'Multi-Scale Attention Network for Crowd Counting'. I trained a head-detection model to supply that loss.

3rd set of experiments:

  1. Pretrained 8 models, each specialized in a single head-count range such as 50-150 people or 900-2,000 people, then trained a final model that loaded these pretrained weights and used a gating network to minimize loss on the consensus of the models (see the sketch below). I believe the bottleneck here was the limited number of training pictures: each sub-model had only around 40 pictures to train on.
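Here is a minimal sketch of how that consensus might be wired up, assuming the eight specialists and the gating network are built and pretrained elsewhere; the class name and the (batch, H, W) specialist output shape are illustrative assumptions, not my original code.

In [ ]:
import torch
import torch.nn as nn

class SpecialistEnsemble(nn.Module):
    def __init__(self, specialists, gate):
        super(SpecialistEnsemble, self).__init__()
        # eight pretrained models, each tuned to one head-count range
        self.specialists = nn.ModuleList(specialists)
        # small network emitting one logit per specialist
        self.gate = gate

    def forward(self, img):
        # stack each specialist's density map: (batch, 8, H, W)
        maps = torch.stack([s(img) for s in self.specialists], dim=1)
        # soft weights over specialists: (batch, 8)
        weights = torch.softmax(self.gate(img), dim=1)
        # gate-weighted consensus density map: (batch, H, W)
        return (weights[:, :, None, None] * maps).sum(dim=1)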

Density generation code

I found code on GitHub to generate the ground-truth density maps for crowd counting. It computes the distance from each annotated head to its nearest neighbors and uses a Gaussian filter to spread each annotation into an estimate of the local density. Thanks to GitHub user leeyeehoo for providing this code.
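Here is a condensed sketch of that geometry-adaptive approach, assuming a NumPy array of (x, y) head annotations; the function name and the 0.3 scaling factor are my paraphrase, not leeyeehoo's code verbatim.

In [ ]:
import numpy as np
import scipy.spatial
from scipy.ndimage import gaussian_filter

def gaussian_density_map(points, shape, beta=0.3):
    """points: (N, 2) array of (x, y) head annotations; shape: (H, W)."""
    density = np.zeros(shape, dtype=np.float32)
    if len(points) == 0:
        return density
    # distance from each head to its nearest neighbors
    # (the first hit returned by the query is the point itself)
    tree = scipy.spatial.KDTree(points)
    k = min(4, len(points))
    distances, _ = tree.query(points, k=k)
    for point, dist in zip(points, distances):
        single = np.zeros(shape, dtype=np.float32)
        x = min(int(point[0]), shape[1] - 1)
        y = min(int(point[1]), shape[0] - 1)
        single[y, x] = 1.0
        if len(points) > 1:
            # kernel width adapts to local crowd density
            sigma = beta * np.mean(dist[1:])
        else:
            sigma = np.mean(shape) / 4.0
        density += gaussian_filter(single, sigma)
    return density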

PyTorch model

Below is one of the models I used for training. It gave me an error on par with SOTA models; however, I wasn't able to do better before overfitting occurred, so I decided to end my analysis here. All models were trained on ShanghaiTech Part A.

In [15]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class GatedCSRNet(nn.Module):
    def __init__(self, load_weights=False):
        super(GatedCSRNet, self).__init__()
        self.seen = 0
        # CSRNet-style frontend: the first ten conv layers of VGG-16
        self.frontend_feat = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512]
        self.frontend = make_layers(self.frontend_feat)

        # dilated backend ending in a 1x1 grouped conv that produces one
        # density-map channel per expert
        self.output_layer = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(512, 256, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, groups=64, kernel_size=1))

        # gating branch: emits one logit per expert channel
        self.gate = nn.Sequential(
            nn.Conv2d(512, 256, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, groups=64, kernel_size=1))

        self.soft = nn.Softmax(dim=1)

        self._initialize_weights()
        if not load_weights:
            # copy pretrained VGG-16 weights into the frontend, as in CSRNet
            vgg = models.vgg16(pretrained=True)
            for ours, theirs in zip(self.frontend.state_dict().values(),
                                    vgg.features.state_dict().values()):
                ours.data.copy_(theirs.data)

    def forward(self, x, train=True):
        x = self.frontend(x)
        regressor = self.output_layer(x)
        gate = self.gate(x)

        # trained with and without random noise in the gate;
        # see the paper on outrageously large NNs
        noise = gate + 1e-2
        gate_soft = F.softplus(noise)
        if train:
            # noisy gating: perturb each gate by softplus-scaled noise
            std = torch.randn(gate.size(), device=gate.device)
            gate = gate + std * gate * gate_soft
        else:
            gate = gate + gate_soft

        # sparse gating: keep the top-k gate logits per pixel and set the
        # rest to -inf so the softmax zeroes them out (k=64 keeps all 64
        # channels; smaller k was used in the sparser experiments)
        sparse = torch.topk(gate, 64, dim=1, largest=True, sorted=True)
        res = torch.full_like(gate, float('-inf'))
        gate = res.scatter_(1, sparse.indices, sparse.values)

        # softmax over expert channels, then a gate-weighted sum of the
        # expert density maps
        gate = self.soft(gate)
        final = (gate * regressor).sum(dim=1)  # sum over the channel dimension
        return final, gate
    
    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight, std=.01)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, std=.01)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)

    
def make_layers(cfg, in_channels=3, batch_norm=False, dilation=False):
    d_rate = 2 if dilation else 1
    layers = []
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=d_rate, dilation=d_rate)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)
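A quick smoke test of the model above (the input size is arbitrary; the three max-pool layers in the frontend downsample by a factor of 8):

In [ ]:
model = GatedCSRNet(load_weights=True)  # True skips the VGG-16 download here
dummy = torch.randn(1, 3, 384, 512)
density, gate = model(dummy, train=False)
print(density.shape)  # torch.Size([1, 48, 64])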

Loss Function

Below we compute the squared coefficient of variation of the gating network's outputs prior to the softmax activation. This loss term encourages the gate values to stay close to one another, so that no single value grows too large and does all the work.
Without it, a few experts in the network take all the attention. During the first few iterations of training, backpropagation pushes a handful of experts toward the ideal values, leaving the other branches of the network behind. This is a self-reinforcing phenomenon: the favored experts get better and better while the other branches stay the same, and it is especially pronounced in sparsely gated networks.

In [ ]:
def cv_squared(input, target, gateValsPriorToSoftmax, lambda_=1.0):
    """CSRNet's summed squared error plus the squared coefficient of
    variation of the gate values.

    The cv term encourages a positive distribution to be more uniform.
    Epsilons are added for numerical stability.

    Args:
        input: predicted density map.
        target: ground-truth density map.
        gateValsPriorToSoftmax: accumulated per-gate values (see below).
        lambda_: weight on the cv term; I tried factors from .05 to 10.
    Returns:
        a scalar loss.
    """
    # CSRNet summed over the entire picture to find the loss
    main_loss = ((input - target) ** 2).sum()

    epsilon = 1e-10
    float_size = gateValsPriorToSoftmax.numel() + epsilon  # 64 gates here
    mean = torch.sum(gateValsPriorToSoftmax) / float_size
    variance = torch.sum((gateValsPriorToSoftmax - mean) ** 2) / float_size
    importance = variance / (mean ** 2 + epsilon)

    return main_loss + lambda_ * importance
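A quick check of the loss with dummy tensors (shapes match the model's output above and the 64-gate importance vector):

In [ ]:
pred = torch.rand(1, 48, 64)
gt = torch.rand(1, 48, 64)
gate_sums = torch.rand(1, 64)  # accumulated per-gate values
print(cv_squared(pred, gt, gate_sums, lambda_=1.0))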

I used a batch size of 1 because of the varying picture sizes; the fully convolutional architecture makes this possible. The loss above relies on a "batchwise sum of each gate". Because of my small batch size and the sparse gating, I instead trained the network with a running mean of the per-gate values accumulated over each epoch, which helped keep the coefficient of variation small. This is my "importance" variable: on each iteration I detach the gate values, sum them over the spatial dimensions, and fold them into the running mean. Below is a full training epoch.

In [ ]:
# running mean of per-gate activations; reset to zeros at the start of
# each epoch (initialization assumed here, outside the original snippet)
importance = torch.zeros(1, 64)
if torch.cuda.is_available():
    importance = importance.cuda()

for i, (img, target, img_path) in enumerate(train_loader):
    # data_time / end / losses are meters from the CSRNet-style training script
    data_time.update(time.time() - end)

    if torch.cuda.is_available():
        img = img.cuda()

    output, gate = model(img)

    target = target.type(torch.FloatTensor).unsqueeze(0)
    if torch.cuda.is_available():
        target = target.cuda()

    # sum each gate's activations over the spatial dimensions, detached
    # so the running mean stays out of the autograd graph
    hidden = gate.detach().sum(3).sum(2)

    # fold this batch into the running mean
    importance = importance * i
    importance = importance.add_(hidden)
    importance /= (i + 1)

    loss = cv_squared(output, target, importance)

    losses.update(loss.item(), img.size(0))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    end = time.time()

Takeaways

I was disappointed to come up short and have yet another experiment fail me; however, I learned all about the PyTorch framework in the process. The framework is more intuitive than pure TensorFlow, and significantly more versatile than the Keras API. Though I haven't used TensorFlow's new version, PyTorch was smooth enough to learn that I won't be going back.
Time to peel the TensorFlow sticker off my laptop.
Until next time.