Introduction

Hello! I do research on low-power neural networks for micro-robots with Kris Pister!

As a grad student, my job is to turn coffee into code. JK. Grad school has been a lot of self-learning, so here's a place for my notes!!!

Fundamental Work

Trained Ternary Quantization

Quantization

Repo

import torch
from torch.autograd import Variable
import torch.nn.functional as F
from .training import _accuracy


def initial_scales(kernel):
    return 1.0, 1.0


def quantize(kernel, w_p, w_n, t):
    """
    Return quantized weights of a layer.
    Only possible values of quantized weights are: {zero, w_p, -w_n}.
    """
    
    '''
    @Brian
    kernel: weight tensor
    w_p: float - positive scaling factor
    w_n: float - negative scaling factor
    t: float - threshold hyperparameter for quantizing (see below)
    '''

    delta = t*kernel.abs().max()
    # @Brian mask of entries above delta
    a = (kernel > delta).float()
    # @Brian similar mask for entries below -delta
    # basically a +/- threshold
    b = (kernel < -delta).float()
    '''
    @Brian scale the positive mask by w_p and the negative mask by -w_n,
    then combine. Example result:
    np.array([
        [0.73, 0, 0.73],
        [-0.82, -0.82, 0],
        [0, 0.73, 0],
    ])
    '''
    return w_p*a + (-w_n*b)
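
# @Brian quick sanity check of quantize() (my own example, not from the repo):
#   kernel = torch.tensor([[0.5, 0.1, 0.6], [-0.7, -0.6, 0.0]])
#   quantize(kernel, w_p=0.73, w_n=0.82, t=0.5)
#   delta = 0.5 * 0.7 = 0.35, so the result is
#   [[ 0.73,  0.00,  0.73],
#    [-0.82, -0.82,  0.00]]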


def get_grads(kernel_grad, kernel, w_p, w_n, t):
    """
    Arguments:
        kernel_grad: gradient with respect to quantized kernel.
        kernel: corresponding full precision kernel.
        w_p, w_n: scaling factors.
        t: hyperparameter for quantization.

    Returns:
        1. gradient for the full precision kernel.
        2. gradient for w_p.
        3. gradient for w_n.
    """
    # @Brian
    # delta = t * max(|kernel|); t is the quantization threshold hyperparameter
    delta = t*kernel.abs().max()
    # masks
    a = (kernel > delta).float()
    b = (kernel < -delta).float()
    # @Brian c is the complement of the masks a and b:
    # anything that quantizes to zero is 1 in c
    c = torch.ones(kernel.size()).cuda() - a - b

    # @Brian, rewritten. Scale the gradient by w_p in the positive region,
    # by w_n in the negative region, and pass it through unchanged in the
    # zero region; combine the three into a single tensor.
    full_precision_grad = w_p*a*kernel_grad + w_n*b*kernel_grad + 1.0*c*kernel_grad
    # @Brian, rewritten. Gradients to update w_p and w_n: sum the gradient
    # over everything each one multiplies with.
    w_p_grad = (a*kernel_grad).sum()
    w_n_grad = (b*kernel_grad).sum()
    return full_precision_grad, w_p_grad, w_n_grad
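
# @Brian worked example for get_grads (my own, not from the repo):
#   kernel = [0.5, 0.1, -0.6], w_p = 0.7, w_n = 0.8, t = 0.5
#   delta = 0.3, so a = [1, 0, 0], b = [0, 0, 1], c = [0, 1, 0];
#   a unit kernel_grad gives full_precision_grad = [0.7, 1.0, 0.8],
#   w_p_grad = 1.0 and w_n_grad = 1.0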

def optimization_step(model, loss, x_batch, y_batch, optimizer_list, t):
    """Make forward pass and update model parameters with gradients."""

    # parameter 't' is a hyperparameter for quantization

    # 'optimizer_list' contains optimizers for
    # 1. full model (all weights including quantized weights),
    # 2. backup of full precision weights,
    # 3. scaling factors for each layer
    optimizer, optimizer_fp, optimizer_sf = optimizer_list

    # non_blocking replaces the old async= keyword (async is reserved in Python 3.7+)
    x_batch, y_batch = Variable(x_batch.cuda()), Variable(y_batch.cuda(non_blocking=True))
    # forward pass using quantized model
    logits = model(x_batch)

    # compute logloss
    loss_value = loss(logits, y_batch)
    batch_loss = loss_value.item()  # was loss_value.data[0] in old PyTorch

    # compute accuracies
    pred = F.softmax(logits, dim=1)
    batch_accuracy, batch_top5_accuracy = _accuracy(y_batch, pred, top_k=(1, 5))

    optimizer.zero_grad()
    optimizer_fp.zero_grad()
    optimizer_sf.zero_grad()
    # compute grads for quantized model
    loss_value.backward()

    # get all quantized kernels
    all_kernels = optimizer.param_groups[1]['params']

    # get their full precision backups
    all_fp_kernels = optimizer_fp.param_groups[0]['params']

    # get two scaling factors for each quantized kernel
    scaling_factors = optimizer_sf.param_groups[0]['params']

    for i in range(len(all_kernels)):

        # get a quantized kernel
        k = all_kernels[i]

        # get corresponding full precision kernel
        k_fp = all_fp_kernels[i]

        # get scaling factors for the quantized kernel
        f = scaling_factors[i]
        w_p, w_n = f.data[0], f.data[1]

        # get modified grads
        k_fp_grad, w_p_grad, w_n_grad = get_grads(k.grad.data, k_fp.data, w_p, w_n, t)

        # grad for full precision kernel
        k_fp.grad = Variable(k_fp_grad)

        # we don't need to update the quantized kernel directly
        k.grad.data.zero_()

        # grad for scaling factors
        f.grad = Variable(torch.FloatTensor([w_p_grad, w_n_grad]).cuda())

    # update all non quantized weights in quantized model
    # (usually, these are the last layer, the first layer, and all batch norm params)
    optimizer.step()

    # update all full precision kernels
    optimizer_fp.step()

    # update all scaling factors
    optimizer_sf.step()

    # update all quantized kernels with updated full precision kernels
    for i in range(len(all_kernels)):

        k = all_kernels[i]
        k_fp = all_fp_kernels[i]
        f = scaling_factors[i]
        w_p, w_n = f.data[0], f.data[1]

        # requantize a quantized kernel using updated full precision weights
        k.data = quantize(k_fp.data, w_p, w_n, t)

    return batch_loss, batch_accuracy, batch_top5_accuracy
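
To see how the three optimizers fit together, here is a minimal setup sketch (my own reconstruction, not from the repo; model, quant_kernels, other_params, the learning rates, and t=0.05 are assumptions):

import torch
import torch.optim as optim

# suppose quant_kernels holds the conv weights we ternarize and other_params
# holds everything else (first/last layers, batch norm, ...)
optimizer = optim.SGD([
    {'params': other_params},
    {'params': quant_kernels},  # param_groups[1], as indexed above
], lr=0.1, momentum=0.9)

# full precision shadow copies of the quantized kernels
fp_kernels = [k.detach().clone().requires_grad_(True) for k in quant_kernels]
optimizer_fp = optim.SGD(fp_kernels, lr=0.1)

# one (w_p, w_n) pair per quantized layer
scaling_factors = [torch.tensor([1.0, 1.0], device='cuda', requires_grad=True)
                   for _ in quant_kernels]
optimizer_sf = optim.SGD(scaling_factors, lr=0.01)

batch_loss, acc, top5 = optimization_step(
    model, torch.nn.CrossEntropyLoss(), x_batch, y_batch,
    [optimizer, optimizer_fp, optimizer_sf], t=0.05)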

Micro Robot Dataset

Google Colab Notebook

!git clone https://github.com/bCom5/pytorch-cifar.git
%cd pytorch-cifar
!pip install thop

After cloning the repo, pull the dataset:

!git clone https://github.com/bCom5/micro-robot-dataset.git
!cd micro-robot-dataset; sh setup.sh
import sys
sys.path.append('micro-robot-dataset')
from loader import *
trainloader, testloader = get_outdoor_iterators()

# we have get_outdoor_iterators, get_indoor_iterators, get_combined_iterators
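
A quick smoke test of a loader (my own snippet; the exact batch shapes depend on the dataset setup):

images, labels = next(iter(trainloader))
print(images.shape, labels.shape)  # e.g. torch.Size([B, 3, H, W]), torch.Size([B])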

Papers

\chapter{SqueezeNet}

\section{Links}
\begin{enumerate}
    \item \href{https://arxiv.org/pdf/1602.07360.pdf}{Paper}
    \item \href{https://youtu.be/ge_RT5wvHvY}{Video Walkthrough}
\end{enumerate}

\section{Design}

\begin{figure}[ht!]
    \includegraphics[width=1
    \textwidth]{photos/squeezenet/design.jpg}
    \centering
\end{figure}

1x1 filters have 9x fewer parameters than 3x3 filters.
Squeeze: decrease the number of channels by applying K 1x1 filters.

\clearpage

\section{Fire Module}

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/squeezenet/fire_module.jpg}
    \centering
\end{figure}

Squeeze, then expand with 1x1 and 3x3 filters. Zero pad so the 3x3 and 1x1 filters produce feature maps of the same size, then concatenate them.

\clearpage

\section{Delayed Downsampling}

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/squeezenet/delay.jpg}
    \centering
\end{figure}

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/squeezenet/model.png}
    \centering
\end{figure}

Downsample with maxpool and a stride of 2 (skips over half the positions). Fire modules shown. ResNet skip connections in the middle model (best accuracy) and the right model. Uses \textbf{avg pool} instead of an FC layer at the end.

\clearpage

\section{Deep Compression}

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/squeezenet/deep_comp.png}
    \centering
    \caption{Accuracy stays very good. Can be shrunk further with \textbf{Deep Compression} with little accuracy loss.}
\end{figure}

\clearpage

\chapter{MobileNet Depthwise Separable Convolution}

\section{Links}
\begin{enumerate}
    \item \href{https://arxiv.org/pdf/1704.04861.pdf}{Paper}
    \item \href{https://youtu.be/T7o3xvJLuHk}{Video Walkthrough}
\end{enumerate}

\clearpage

\section{Convolution Operation and Cost}

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/mobilenet/convolution.png}
    \centering
    \caption{$N$ convolutions of $D_K \times D_K \times M$ ($M$ channels, e.g. 3 for RGB). Output is $D_G \times D_G \times N$.}
\end{figure}

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/mobilenet/conv_cost.png}
    \centering
\end{figure}

\begin{enumerate}
    \item One output element costs $(D_K)^2 \times M$ multiplications.
    \item Computing all output positions for one kernel (the green tensor) costs $(D_G)^2 \times (D_K)^2 \times M$.
    \item Finally, the total cost over all $N$ kernels is $N \times (D_G)^2 \times (D_K)^2 \times M$.
\end{enumerate}
\textbf{Convolutions are expensive.}

\clearpage

\section{Depthwise Convolution (Filtering Step)}

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/mobilenet/depthwise.png}
    \centering
    \caption{Instead of each filter having depth $M$, give it depth 1, and use $M$ such filters. The output of the \textbf{Depthwise Convolution} is $D_G \times D_G \times M$.}
\end{figure}

Multiplication cost is $M \times (D_G)^2 \times (D_K)^2$.

\section{Pointwise Convolution (Combining Step)}

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/mobilenet/pointwise.png}
    \centering
    \caption{The input is the output of the Depthwise Convolution. Use N $1 \times 1 \times M$ filters and the output is $D_G \times D_G \times N$. This is the same as a normal convolution with N Filters of $D_K \times D_K \times M$.}
\end{figure}

Multiplication cost is $N \times (D_G)^2 \times M$.

\textbf{The total cost is:} $M \times (D_G)^2 \times (D_K)^2 + N \times (D_G)^2 \times M = \mathbf{M \times (D_G)^2 \times ((D_K)^2 + N)}$
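
As a sanity check on this cost ratio, here is a minimal depthwise separable convolution sketch in PyTorch (my own illustration, not code from the paper; the sizes are arbitrary):

import torch
import torch.nn as nn

M, N, D_K = 32, 64, 3
dw = nn.Conv2d(M, M, kernel_size=D_K, padding=D_K // 2, groups=M, bias=False)  # filtering step
pw = nn.Conv2d(M, N, kernel_size=1, bias=False)                                # combining step
x = torch.randn(1, M, 56, 56)
y = pw(dw(x))       # same output shape as a full N x (D_K x D_K x M) convolution
print(y.shape)      # torch.Size([1, 64, 56, 56])

# multiply cost relative to a standard convolution: 1/N + 1/(D_K)^2
print(1 / N + 1 / D_K**2)  # ~0.127, i.e. roughly 8x fewer multiplies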

\clearpage

\section{Others}

\subsection{Comparison}

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/mobilenet/mobilenet_comp.png}
    \centering
    \caption{Similar accuracy, significantly fewer multiply-adds and parameters.}
\end{figure}

\subsection{Module}

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/mobilenet/module.png}
    \centering
\end{figure}

\clearpage
\subsection{Activation ReLU6}

\begin{figure}[ht!]
    \includegraphics[width=1
    \textwidth]{photos/mobilenet/ReLU6.png}
    \centering
    \caption{ReLU capped at 6 is efficient for low-precision arithmetic.}
\end{figure}

Also there is a \textbf{Global Avg Pooling} then Fully Connected Layer to a Softmax Classifier.

\clearpage

\chapter{MobileNet V2}

\section{Links}
\begin{enumerate}
    \item \href{https://arxiv.org/abs/1801.04381}{Paper}
    \item \href{https://machinethink.net/blog/mobilenet-v2/}{Notes}
\end{enumerate}

\clearpage

\section{Modules}

\begin{figure}[ht!]
    \includegraphics[width=.4
    \textwidth]{photos/mobilenetv2/mnetv1mod.png}
    \centering
    \caption{MobileNet V1 uses pointwise convolutions.}
\end{figure}

\begin{figure}[ht!]
    \includegraphics[width=.4
    \textwidth]{photos/mobilenetv2/mnetv2mod.png}
    \centering
    \caption{MobileNet V2 uses residual blocks with a new projection layer. This is called an \textbf{inverted residual with linear bottleneck}.}
\end{figure}

\section{Inverted Residuals with Linear Bottlenecks}

MobileNet V2 is based on two ideas:
\begin{enumerate}
    \item \textbf{Low Dimension Tensors} reduce the number of computations/multiplications.
    \item Low Dimension Tensors \textbf{alone} do not work well. They cannot extract a lot of information.
\end{enumerate}
MobileNet V2 addresses this by having the input be a low dimensional tensor, \textbf{expanding} it to a reasonably high dimensional tensor, \textbf{running a depthwise convolution} on it, and \textbf{squeezing} it back into a low dimensional tensor.

\begin{figure}[ht!]
    \includegraphics[width=1
    \textwidth]{photos/mobilenetv2/channel_diag.png}
    \centering
\end{figure}

\begin{figure}[ht!]
    \includegraphics[width=1
    \textwidth]{photos/mobilenetv2/expand_filter_squeeze.png}
    \centering
\end{figure}

The Green is an \textbf{Expand Convolution}. It increases the number of channels. The Blue is a \textbf{Depthwise Convolution}. It keeps the number of channels the same and runs filters on the data. The Orange is a \textbf{Projection Layer/Bottleneck Layer}. Additionally there is a \textbf{residual skip connection} to keep information flowing through the network. \\
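
A minimal PyTorch sketch of this block (my own version for intuition, not the paper's exact layer spec):

import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),        # expand (green)
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),              # depthwise (blue)
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),        # projection (orange)
            nn.BatchNorm2d(channels),                          # no activation: linear bottleneck
        )

    def forward(self, x):
        return x + self.block(x)  # residual skip (stride 1, equal channels)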

\clearpage

\section{Other}

\begin{figure}[ht!]
    \includegraphics[width=1
    \textwidth]{photos/mobilenetv2/stats.png}
    \centering
    \caption{It is more efficient than MobileNet V1.}
\end{figure}

Some blocks keep channel size the same, others expand it until the final fully connected classification layer.

\textbf{Note: SqueezeNet still has smaller memory usage.}

\clearpage

\chapter{MobileNet V3}

\section{Links}
\begin{enumerate}
    \item \href{https://arxiv.org/abs/1905.02244}{Paper}
    \item \href{https://vipermdl.github.io/2019/07/26/the-road-for-mobilenet-change/}{Notes}
    \item \href{https://github.com/kuan-wang/pytorch-mobilenet-v3/blob/master/mobilenetv3.py}{Code Implementation}
\end{enumerate}

\section{H-Swish}

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/MobileNetV3/hswish.png}
    \centering
\end{figure}

Swish is an activation function better than ReLU.
$$swish(x) = x * sigmoid(\beta x)$$
H-Swish (hard) is more efficient for hardware.
$$hswish(x) = x * \frac{ReLU6(x + 3)}{6}$$
Recall $ReLU6(x) = \min(\max(0, x), 6)$: zero below 0, linear up to 6, then capped at 6 onwards.

H-Swish helps mainly in the deeper layers of the network.
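
A quick numeric check of h-swish against the formula (my own sketch; the MicroBotNet code later in these notes defines the same thing as an nn.Module):

import torch
import torch.nn.functional as F

def hswish(x):
    return x * F.relu6(x + 3) / 6

x = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0])
print(hswish(x))  # tensor([-0.0000, -0.3333, 0.0000, 0.6667, 4.0000])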

\clearpage

\section{Squeeze and Excitation}

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/MobileNetV3/squeezeandexcite.png}
    \centering
\end{figure}

From the \href{https://arxiv.org/pdf/1709.01507.pdf}{Squeeze and Excite} paper (a big model, previously SOTA on ImageNet).


\begin{definition}
\textbf{Squeeze}: Global Information Embedding. Instead of each learned filter operating only on a local receptive field, a \textbf{Global Average Pooling} gathers global information.
\end{definition}

\begin{figure}[ht!]
    \includegraphics[width=.6
    \textwidth]{photos/MobileNetV3/Excite.png}
    \centering
\end{figure}

\begin{definition}
\textbf{Excitation}: Adaptive Recalibration. A sigmoid gate on top of a ReLU nonlinearity produces per-channel weights, which then scale the input.
\end{definition}

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/MobileNetV3/Diagram_Excite.png}
    \centering
\end{figure}
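
Functionally, squeeze-and-excite boils down to a few lines (my own sketch; the SEModule class in the MicroBotNet code later in these notes is the real module):

import torch

def se(x, w1, w2):                              # x: (B, C, H, W); w1: (C, C/r); w2: (C/r, C)
    s = x.mean(dim=(2, 3))                      # squeeze: global average pool -> (B, C)
    e = torch.sigmoid(torch.relu(s @ w1) @ w2)  # excite: FC -> ReLU -> FC -> sigmoid
    return x * e[:, :, None, None]              # recalibrate each channel of the input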

\clearpage

\section{Network Improvements}

As in, hand-tuned by humans to be more efficient?

\begin{figure}[ht!]
    \includegraphics[width=.8
    \textwidth]{photos/MobileNetV3/dropexpensive.png}
    \centering
\end{figure}

The layers for tensors \textbf{960}, \textbf{320}, \textbf{1280} are removed. Instead, \textbf{960} (input 7x7x960) is avg-pooled to 1x1x960. That is fed through 1280 1x1x960 convs, giving 1x1x1280 (the \textbf{1280} at the bottom). That is fed through 1000 1x1x1280 convs, giving 1x1x1000 (flattened).

In the original, between tensors \textbf{320} and \textbf{1280}, a 1x1 conv expands to a high dimensional feature space. The output is 7x7x1280, which is then avg-pooled to 1x1x1280.

Instead, in the hand-tuned efficient stage, the avg pool comes first. The result is 1x1x960, from which features are extracted more cheaply by a 1x1 conv to 1280 channels, as sketched below.
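
A minimal sketch of the cheaper last stage, using the sizes from the text (the layer choices beyond those sizes are my assumption):

import torch
import torch.nn as nn

x = torch.randn(1, 960, 7, 7)
pool = nn.AdaptiveAvgPool2d(1)        # avg pool first: 7x7x960 -> 1x1x960
conv1 = nn.Conv2d(960, 1280, 1)       # 1x1 conv now runs on 1x1 instead of 7x7 (49x cheaper)
conv2 = nn.Conv2d(1280, 1000, 1)      # 1x1 conv to the 1000 classes
y = conv2(conv1(pool(x))).flatten(1)  # torch.Size([1, 1000])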

Also optimize the initial layers. The first layer usually uses 32 filters, but many of them are just mirror images of each other. Instead use 16 filters with h-swish; the nonlinearity helps deal with the redundancy.

\clearpage

\chapter{FD-MobileNet}
\section{Links}
\begin{enumerate}
    \item \href{https://arxiv.org/pdf/1802.03750.pdf}{Paper}
    \item \href{https://github.com/clavichord93/FD-MobileNet}{Code Implementation}
    \item \href{https://github.com/clavichord93/FD-MobileNet/blob/master/pyvision/models/ImageNet/MobileNet.py}{MobileNet and FD-MobileNet models}
\end{enumerate}

\clearpage
\section{Notes}

\begin{figure}[ht!]
    \includegraphics[width=.5
    \textwidth]{photos/fd_mobile/comparison.png}
    \centering
    \caption{FD (Fast Downsampling) downsamples early. This means there are fewer operations early on, and more operations after downsampling, when the feature map is smaller.}

\end{figure}

FD-MobileNet x0.25 has only 0.383M params at 43.81\% top-1 accuracy, compared to MobileNet x0.25 with 0.47M params at 54.22\%. MobileNet is clearly more accurate, but we are heavily hardware limited, so I think this is promising.

\begin{figure}[ht!]
    \includegraphics[width=1
    \textwidth]{photos/fd_mobile/code.png}
    \centering
    \caption{The only changes are the number of channels and the stride sizes!}
\end{figure}

\clearpage


Trained Ternary Quantization

Links

  1. Paper

Cifar-10 Binary CNN Processor on Chip

  • mixed-signal binary convolutional neural network
    • 3.8 uJ/classification (forward pass)
    • 86% accuracy
  • BinaryNet {+1 -1}
    • multiplication to XNOR
    • weight stationary
    • data-parallel (all multiplies in parallel (?))
    • input reuse
    • wide vector sum as energy bottleneck
  • 28 nm CMOS
  • 328 kB on chip SRAM
  • 237 frame/s
  • 0.9 mW at 0.6 V, meaning 3.8 uJ per classification

Intro

  • problem: DNN have to do millions to billions of MAC per inference

  • weight stationary
  • computing in memory (CIM) (?)
  • CMOS-inspired, hardware specialization

  • output image pixels are binarized
  • always uses 2x2 filters, 256 channels and filters
    • low fan-out de-multipliers

No Hidden Fully Connected Layers

  • BinaryNet required 1.67 MB
    • 558 kB 6 CNN layers
    • 1.13 MB 3 FC layers
  • Instead only 261.5 kB
    • 256 kB 8 CNN layers
    • 5.5 kB 1 FC layer

Filter Computation

  • since we are dealing with +1, -1 and sign, batch norm is simplified

Top Level Architecture

  • pixel is quantized to 7 bits

Comparison

Summary

  • BinaryNet with XNOR operations
  • network architecture designed to work well with CMOS hardware
    • low weight memory
    • memory cost is amortized - weight stationary, data parallel, input reuse
  • energy efficient SC (switched-capacitor) neuron?

Tools

Google Colab

Cloning a Repo

!git clone https://github.com/bCom5/trained-ternary-quantization.git
%cd trained-ternary-quantization
!pip install thop

Saving a model

# model.cpu();
torch.save(model.state_dict(), 'micro_large_conv_and_fc_ttq_x0_32.pytorch_state')

from google.colab import drive
drive.mount('/content/gdrive')

!ls /content/gdrive/My\ Drive/04-eecs-299-research/03-new-work/01-fundamental-work/trained-ternary-quantization/ttq_microbotnet/ttq_models

!mv micro_large_conv_and_fc_ttq_x0_32.pytorch_state /content/gdrive/My\ Drive/04-eecs-299-research/03-new-work/01-fundamental-work/trained-ternary-quantization/ttq_microbotnet/ttq_models/micro_large_conv_and_fc_ttq_x0_32.pytorch_state

PyTorch

Loading Models

Loading .pth file

# for a .pth file
def load_model(model, file_name):
    def reformat_dict(state):
        reformat_state = {}
        for key in state:
            new_key = key.replace('module.', '')
            reformat_state[new_key] = state[key]
        return reformat_state
    state = torch.load(file_name,  map_location='cpu')['net']
    reformat_state = reformat_dict(state)
    model.load_state_dict(reformat_state)

Saving Models

Saving a .pytorch_state file

torch.save(model.state_dict(), 'micro_large_conv_and_fc_ttq_x0_32.pytorch_state')
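
To load it back (a minimal sketch; construct the model first, then):

state = torch.load('micro_large_conv_and_fc_ttq_x0_32.pytorch_state', map_location='cpu')
model.load_state_dict(state)
model.eval()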

See the Google Colab Tools page

Gist Notes

MicroBotNet Architecture

MobileNet V3

MicroBotNet x1.00

Correction to this image: the first s should be 2.

# 16  (printed input_channels_num)
# * MACs: 6,597,218
# * Params: 2,044,298
# torch.Size([1, 10])

FdMobileNetV3Imp2(
  (features): Sequential(
    (0): Sequential(
      (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): H_swish()
    )
    (1): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(16, 72, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(72, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace=True)
        (3): Conv2d(72, 72, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=72, bias=False)
        (4): BatchNorm2d(72, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): Sequential()
        (6): ReLU(inplace=True)
        (7): Conv2d(72, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (2): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(24, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(96, 96, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=96, bias=False)
        (4): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=96, out_features=24, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=24, out_features=96, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(96, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (3): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(40, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(240, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(240, 240, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=240, bias=False)
        (4): BatchNorm2d(240, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=240, out_features=60, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=60, out_features=240, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(240, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (4): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(40, 120, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(120, 120, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=120, bias=False)
        (4): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=120, out_features=30, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=30, out_features=120, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(120, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (5): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(48, 144, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(144, 144, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=144, bias=False)
        (4): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=144, out_features=36, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=36, out_features=144, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(144, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (6): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(48, 288, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(288, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(288, 288, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=288, bias=False)
        (4): BatchNorm2d(288, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=288, out_features=72, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=72, out_features=288, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(288, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (7): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
        (4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=576, out_features=144, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=144, out_features=576, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (8): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
        (4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=576, out_features=144, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=144, out_features=576, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (9): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
        (4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=576, out_features=144, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=144, out_features=576, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (10): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
        (4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=576, out_features=144, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=144, out_features=576, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (11): Sequential(
      (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): H_swish()
    )
    (12): AdaptiveAvgPool2d(output_size=1)
    (13): Conv2d(576, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (14): H_swish()
  )
  (classifier): Sequential(
    (0): Dropout(p=0.2, inplace=False)
    (1): Linear(in_features=1024, out_features=10, bias=True)
  )
)

MicroBotNet Code

# -*- coding: UTF-8 -*-

# This is MicroBotNet
# FdMobileNetV3Imp2 x0.32

'''
From https://github.com/ShowLo/MobileNetV3/blob/master/mobileNetV3.py
MobileNetV3 From <Searching for MobileNetV3>, arXiv:1905.02244.
Ref: https://github.com/d-li14/mobilenetv3.pytorch/blob/master/mobilenetv3.py
     https://github.com/kuan-wang/pytorch-mobilenet-v3/blob/master/mobilenetv3.py
     
'''

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from collections import OrderedDict
from thop import profile

def _ensure_divisible(number, divisor, min_value=None):
    '''
    Ensure that 'number' is divisible by 'divisor'
    Reference from original tensorflow repo:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    '''
    if min_value is None:
        min_value = divisor
    new_num = max(min_value, int(number + divisor / 2) // divisor * divisor)
    if new_num < 0.9 * number:
        new_num += divisor
    return new_num
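
# e.g. with width_multiplier = 0.32: _ensure_divisible(16 * 0.32, 8)
# -> max(8, int(5.12 + 4) // 8 * 8) = 8, which matches the 8-channel first
# conv in the x0.32 model printout (my own worked example)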

class H_sigmoid(nn.Module):
    '''
    hard sigmoid
    '''
    def __init__(self, inplace=True):
        super(H_sigmoid, self).__init__()
        self.inplace = inplace

    def forward(self, x):
        return F.relu6(x + 3, inplace=self.inplace) / 6

class H_swish(nn.Module):
    '''
    hard swish
    '''
    def __init__(self, inplace=True):
        super(H_swish, self).__init__()
        self.inplace = inplace

    def forward(self, x):
        return x * F.relu6(x + 3, inplace=self.inplace) / 6

class SEModule(nn.Module):
    '''
    SE Module
    Ref: https://github.com/moskomule/senet.pytorch/blob/master/senet/se_module.py
    '''
    def __init__(self, in_channels_num, reduction_ratio=4):
        super(SEModule, self).__init__()

        if in_channels_num % reduction_ratio != 0:
            raise ValueError('in_channels_num must be divisible by reduction_ratio(default = 4)')

        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(in_channels_num, in_channels_num // reduction_ratio, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(in_channels_num // reduction_ratio, in_channels_num, bias=False),
            H_sigmoid()
        )

    def forward(self, x):
        batch_size, channel_num, _, _ = x.size()
        y = self.avg_pool(x).view(batch_size, channel_num)
        y = self.fc(y).view(batch_size, channel_num, 1, 1)
        return x * y

class Bottleneck(nn.Module):
    '''
    The basic unit of MobileNetV3
    '''
    def __init__(self, in_channels_num, exp_size, out_channels_num, kernel_size, stride, use_SE, NL, BN_momentum):
        '''
        use_SE: True or False -- use SE Module or not
        NL: nonlinearity, 'RE' or 'HS'
        '''
        super(Bottleneck, self).__init__()

        assert stride in [1, 2]
        NL = NL.upper()
        assert NL in ['RE', 'HS']

        use_HS = NL == 'HS'
        
        # Whether to use residual structure or not
        self.use_residual = (stride == 1 and in_channels_num == out_channels_num)

        if exp_size == in_channels_num:
            # Without expansion, the first depthwise convolution is omitted
            self.conv = nn.Sequential(
                # Depthwise Convolution
                nn.Conv2d(in_channels=in_channels_num, out_channels=exp_size, kernel_size=kernel_size, stride=stride, padding=(kernel_size-1)//2, groups=in_channels_num, bias=False),
                nn.BatchNorm2d(num_features=exp_size, momentum=BN_momentum),
                # SE Module
                SEModule(exp_size) if use_SE else nn.Sequential(),
                H_swish() if use_HS else nn.ReLU(inplace=True),
                # Linear Pointwise Convolution
                nn.Conv2d(in_channels=exp_size, out_channels=out_channels_num, kernel_size=1, stride=1, padding=0, bias=False),
                #nn.BatchNorm2d(num_features=out_channels_num, momentum=BN_momentum)
                nn.Sequential(OrderedDict([('lastBN', nn.BatchNorm2d(num_features=out_channels_num))])) if self.use_residual else
                    nn.BatchNorm2d(num_features=out_channels_num, momentum=BN_momentum)
            )
        else:
            # With expansion
            self.conv = nn.Sequential(
                # Pointwise Convolution for expansion
                nn.Conv2d(in_channels=in_channels_num, out_channels=exp_size, kernel_size=1, stride=1, padding=0, bias=False),
                nn.BatchNorm2d(num_features=exp_size, momentum=BN_momentum),
                H_swish() if use_HS else nn.ReLU(inplace=True),
                # Depthwise Convolution
                nn.Conv2d(in_channels=exp_size, out_channels=exp_size, kernel_size=kernel_size, stride=stride, padding=(kernel_size-1)//2, groups=exp_size, bias=False),
                nn.BatchNorm2d(num_features=exp_size, momentum=BN_momentum),
                # SE Module
                SEModule(exp_size) if use_SE else nn.Sequential(),
                H_swish() if use_HS else nn.ReLU(inplace=True),
                # Linear Pointwise Convolution
                nn.Conv2d(in_channels=exp_size, out_channels=out_channels_num, kernel_size=1, stride=1, padding=0, bias=False),
                #nn.BatchNorm2d(num_features=out_channels_num, momentum=BN_momentum)
                nn.Sequential(OrderedDict([('lastBN', nn.BatchNorm2d(num_features=out_channels_num))])) if self.use_residual else
                    nn.BatchNorm2d(num_features=out_channels_num, momentum=BN_momentum)
            )

    def forward(self, x):
        if self.use_residual:
            return self.conv(x) + x
        else:
            return self.conv(x)


class FdMobileNetV3Imp2(nn.Module):
    '''
    Fast-Downsampling MobileNetV3 (implementation 2).
    MicroBotNet is this model with width_multiplier=0.32.
    '''
    def __init__(self, mode='large', classes_num=1000, input_size=224, width_multiplier=1.0, dropout=0.2, BN_momentum=0.1, zero_gamma=False):
        '''
        configs: setting of the model
        mode: type of the model, 'large' or 'small'
        '''
        super(FdMobileNetV3Imp2, self).__init__()

        mode = mode.lower()
        assert mode in ['large', 'small']
        s = 2
        # the reference implementation set s = 1 when input_size is 32 or 56
        # (cifar-10, cifar-100 or Tiny-ImageNet); per the correction note
        # above, we keep s = 2 for fast downsampling

        # setting of the model
        if mode == 'large':
            # Configuration of a MobileNetV3-Large Model
            configs = [
                #kernel_size, exp_size, out_channels_num, use_SE, NL, stride
                [3, 16, 16, False, 'RE', 1],
                [3, 64, 24, False, 'RE', s],
                [3, 72, 24, False, 'RE', 1],
                [5, 72, 40, True, 'RE', 2],
                [5, 120, 40, True, 'RE', 1],
                [5, 120, 40, True, 'RE', 1],
                [3, 240, 80, False, 'HS', 2],
                [3, 200, 80, False, 'HS', 1],
                [3, 184, 80, False, 'HS', 1],
                [3, 184, 80, False, 'HS', 1],
                [3, 480, 112, True, 'HS', 1],
                [3, 672, 112, True, 'HS', 1],
                [5, 672, 160, True, 'HS', 2],
                [5, 960, 160, True, 'HS', 1],
                [5, 960, 160, True, 'HS', 1]
            ]
        elif mode == 'small':
            # @SELF edited
            configs = [
                #kernel_size, exp_size, out_channels_num, use_SE, NL, stride
                [3, 72, 24, False, 'RE', 2],
                [5, 96, 40, True, 'HS', 2],
                [5, 240, 40, True, 'HS', 1],
                [5, 120, 48, True, 'HS', 1],
                [5, 144, 48, True, 'HS', 1],
                [5, 288, 96, True, 'HS', 2],
                [5, 576, 96, True, 'HS', 1],
                [5, 576, 96, True, 'HS', 1],
                [5, 576, 96, True, 'HS', 1],
                [5, 576, 96, True, 'HS', 1]
            ]
            # Configuration of a MobileNetV3-Small Model
            '''
            configs = [
                #kernel_size, exp_size, out_channels_num, use_SE, NL, stride
                [3, 16, 16, True, 'RE', s],
                [3, 72, 24, False, 'RE', 2],
                [3, 88, 24, False, 'RE', 1],
                [5, 96, 40, True, 'HS', 2],
                [5, 240, 40, True, 'HS', 1],
                [5, 240, 40, True, 'HS', 1],
                [5, 120, 48, True, 'HS', 1],
                [5, 144, 48, True, 'HS', 1],
                [5, 288, 96, True, 'HS', 2],
                [5, 576, 96, True, 'HS', 1],
                [5, 576, 96, True, 'HS', 1]
            ]
            '''

        first_channels_num = 16

        # last_channels_num = 1280
        # according to https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet_v3.py
        # if small -- 1024, if large -- 1280
        last_channels_num = 1280 if mode == 'large' else 1024

        divisor = 8

        ########################################################################################################################
        # feature extraction part
        # input layer
        input_channels_num = _ensure_divisible(first_channels_num * width_multiplier, divisor)
        print(input_channels_num)
        last_channels_num = _ensure_divisible(last_channels_num * width_multiplier, divisor) if width_multiplier > 1 else last_channels_num
        feature_extraction_layers = []
        first_layer = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=input_channels_num, kernel_size=3, stride=s, padding=1, bias=False),
            nn.BatchNorm2d(num_features=input_channels_num, momentum=BN_momentum),
            H_swish()
        )
        feature_extraction_layers.append(first_layer)
        # Overlay of multiple bottleneck structures
        for kernel_size, exp_size, out_channels_num, use_SE, NL, stride in configs:
            output_channels_num = _ensure_divisible(out_channels_num * width_multiplier, divisor)
            exp_size = _ensure_divisible(exp_size * width_multiplier, divisor)
            feature_extraction_layers.append(Bottleneck(input_channels_num, exp_size, output_channels_num, kernel_size, stride, use_SE, NL, BN_momentum))
            input_channels_num = output_channels_num
        
        # the last stage
        last_stage_channels_num = _ensure_divisible(exp_size * width_multiplier, divisor)
        last_stage_layer1 = nn.Sequential(
                nn.Conv2d(in_channels=input_channels_num, out_channels=last_stage_channels_num, kernel_size=1, stride=1, padding=0, bias=False),
                nn.BatchNorm2d(num_features=last_stage_channels_num, momentum=BN_momentum),
                H_swish()
            )
        feature_extraction_layers.append(last_stage_layer1)
    

        # SE Module
        # remove the last SE Module according to https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet_v3.py
        # feature_extraction_layers.append(SEModule(last_stage_channels_num) if mode == 'small' else nn.Sequential())
        # @SELF changed last_channels_num // 2 to last_channels_num 1024 to 576
        feature_extraction_layers.append(nn.AdaptiveAvgPool2d(1))
        feature_extraction_layers.append(nn.Conv2d(in_channels=last_stage_channels_num, out_channels=last_channels_num, kernel_size=1, stride=1, padding=0, bias=False))
        feature_extraction_layers.append(H_swish())

        self.features = nn.Sequential(*feature_extraction_layers)

        ########################################################################################################################
        # Classification part
        self.classifier = nn.Sequential(
            nn.Dropout(p=dropout),
            # nn.Linear(last_channels_num // 2, last_channels_num),
            # @SELF added second linear layer
            nn.Linear(last_channels_num, classes_num),
        )

        ########################################################################################################################
        # Initialize the weights
        self._initialize_weights(zero_gamma)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

    def _initialize_weights(self, zero_gamma):
        '''
        Initialize the weights
        '''
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, std=0.001)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
        if zero_gamma:
            for m in self.modules():
                if hasattr(m, 'lastBN'):
                    nn.init.constant_(m.lastBN.weight, 0.0)

def test():
    net = FdMobileNetV3Imp2(classes_num=10, input_size=32,
            width_multiplier=1.00, mode='small')
    x = torch.randn(1,3,32,32)
    flops, params = profile(net, inputs=(x, ))
    print('* MACs: {:,.2f}'.format(flops).replace('.00', ''))
    print('* Params: {:,.2f}'.format(params).replace('.00', ''))
    y = net(x)
    print(y.size())
    print()
    print(net)

test()

Weekly Log

Week 7

Sunday, 20-03-01

  • Why does our accuracy die?
  • We think it isn't training correctly.

On MicroBotNet:

from functools import reduce

[
    (p.shape, reduce(lambda x, y: x*y, p.shape)) for n, p in model.features[1:11].named_parameters()
    if 'conv' in n and 'weight' in n and 'lastBN' not in n and 'fc' not in n
]
# is
'''
[(torch.Size([24, 8, 1, 1]), 192),
 (torch.Size([24]), 24),
 (torch.Size([24, 1, 3, 3]), 216),
 (torch.Size([24]), 24),
 (torch.Size([8, 24, 1, 1]), 192),
 (torch.Size([8]), 8),
 (torch.Size([32, 8, 1, 1]), 256),
 (torch.Size([32]), 32),
 (torch.Size([32, 1, 5, 5]), 800),
 (torch.Size([32]), 32),
 (torch.Size([16, 32, 1, 1]), 512),
 (torch.Size([16]), 16),
 (torch.Size([80, 16, 1, 1]), 1280),
 (torch.Size([80]), 80),
 (torch.Size([80, 1, 5, 5]), 2000),
 (torch.Size([80]), 80),
 (torch.Size([16, 80, 1, 1]), 1280),
 (torch.Size([40, 16, 1, 1]), 640),
 (torch.Size([40]), 40),
 (torch.Size([40, 1, 5, 5]), 1000),
 (torch.Size([40]), 40),
 (torch.Size([16, 40, 1, 1]), 640),
 (torch.Size([48, 16, 1, 1]), 768),
 (torch.Size([48]), 48),
 (torch.Size([48, 1, 5, 5]), 1200),
 (torch.Size([48]), 48),
 (torch.Size([16, 48, 1, 1]), 768),
 (torch.Size([96, 16, 1, 1]), 1536),
 (torch.Size([96]), 96),
 (torch.Size([96, 1, 5, 5]), 2400),
 (torch.Size([96]), 96),
 (torch.Size([32, 96, 1, 1]), 3072),
 (torch.Size([32]), 32),
 (torch.Size([184, 32, 1, 1]), 5888),
 (torch.Size([184]), 184),
 (torch.Size([184, 1, 5, 5]), 4600),
 (torch.Size([184]), 184),
 (torch.Size([32, 184, 1, 1]), 5888),
 (torch.Size([184, 32, 1, 1]), 5888),
 (torch.Size([184]), 184),
 (torch.Size([184, 1, 5, 5]), 4600),
 (torch.Size([184]), 184),
 (torch.Size([32, 184, 1, 1]), 5888),
 (torch.Size([184, 32, 1, 1]), 5888),
 (torch.Size([184]), 184),
 (torch.Size([184, 1, 5, 5]), 4600),
 (torch.Size([184]), 184),
 (torch.Size([32, 184, 1, 1]), 5888),
 (torch.Size([184, 32, 1, 1]), 5888),
 (torch.Size([184]), 184),
 (torch.Size([184, 1, 5, 5]), 4600),
 (torch.Size([184]), 184),
 (torch.Size([32, 184, 1, 1]), 5888)]
'''

This does not include the fully connected (SE) layers embedded within our blocks.
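
To count those too, drop the 'fc' exclusion from the filter (my own variant):

from functools import reduce

[
    (p.shape, reduce(lambda x, y: x*y, p.shape))
    for n, p in model.features[1:11].named_parameters()
    if 'conv' in n and 'weight' in n and 'lastBN' not in n
]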

SqueezeNet for comparison

from functools import reduce

[
    (p.shape, reduce(lambda x, y: x*y, p.shape)) for n, p in all_conv_weights
    if not ('classifier' in n or 'features.0.' in n)
]
# is
'''
[(torch.Size([16, 64, 1, 1]), 1024),
 (torch.Size([64, 16, 1, 1]), 1024),
 (torch.Size([64, 16, 3, 3]), 9216),
 (torch.Size([16, 128, 1, 1]), 2048),
 (torch.Size([64, 16, 1, 1]), 1024),
 (torch.Size([64, 16, 3, 3]), 9216),
 (torch.Size([32, 128, 1, 1]), 4096),
 (torch.Size([128, 32, 1, 1]), 4096),
 (torch.Size([128, 32, 3, 3]), 36864),
 (torch.Size([32, 256, 1, 1]), 8192),
 (torch.Size([128, 32, 1, 1]), 4096),
 (torch.Size([128, 32, 3, 3]), 36864),
 (torch.Size([48, 256, 1, 1]), 12288),
 (torch.Size([192, 48, 1, 1]), 9216),
 (torch.Size([192, 48, 3, 3]), 82944),
 (torch.Size([48, 384, 1, 1]), 18432),
 (torch.Size([192, 48, 1, 1]), 9216),
 (torch.Size([192, 48, 3, 3]), 82944),
 (torch.Size([64, 384, 1, 1]), 24576),
 (torch.Size([256, 64, 1, 1]), 16384),
 (torch.Size([256, 64, 3, 3]), 147456),
 (torch.Size([64, 512, 1, 1]), 32768),
 (torch.Size([256, 64, 1, 1]), 16384),
 (torch.Size([256, 64, 3, 3]), 147456)]
'''

Now do MicroBotNet, only convs bigger than 1000:

from functools import reduce

[
    (p.shape, reduce(lambda x, y: x*y, p.shape)) for n, p in model.features[1:11].named_parameters()
    if 'conv' in n and 'weight' in n
    and 'lastBN' not in n and 'fc' not in n
    and reduce(lambda x, y: x*y, p.shape) > 1000
]
# is
'''
[(torch.Size([80, 16, 1, 1]), 1280),
 (torch.Size([80, 1, 5, 5]), 2000),
 (torch.Size([16, 80, 1, 1]), 1280),
 (torch.Size([48, 1, 5, 5]), 1200),
 (torch.Size([96, 16, 1, 1]), 1536),
 (torch.Size([96, 1, 5, 5]), 2400),
 (torch.Size([32, 96, 1, 1]), 3072),
 (torch.Size([184, 32, 1, 1]), 5888),
 (torch.Size([184, 1, 5, 5]), 4600),
 (torch.Size([32, 184, 1, 1]), 5888),
 (torch.Size([184, 32, 1, 1]), 5888),
 (torch.Size([184, 1, 5, 5]), 4600),
 (torch.Size([32, 184, 1, 1]), 5888),
 (torch.Size([184, 32, 1, 1]), 5888),
 (torch.Size([184, 1, 5, 5]), 4600),
 (torch.Size([32, 184, 1, 1]), 5888),
 (torch.Size([184, 32, 1, 1]), 5888),
 (torch.Size([184, 1, 5, 5]), 4600),
 (torch.Size([32, 184, 1, 1]), 5888)] 
'''

  • Got to 73.5% accuracy quantizing only convs above 1,000 params.
  • Got to 73.2% accuracy quantizing convs and biases above 1,000 params.

Monday, 20-03-02

  • 65% of params quantized to zero, one, or negative one; 72.3% accuracy
  • TTQ is 72.8% accuracy.
  • 31 layers quantized. 86 layers not quantized
    • 153,792 param quantized
      • 12,222 quantized to one.
      • 134,124 quantized to zero.
      • 7,446 quantized to neg one.
    • 82,866 param not quantized

Thursday, 20-03-05

  • Make a visualization of the network for Pister

Google Colab notebook with the quantization

Week 8

Monday, 20-03-09

  • Acorns
  • Transfer Learning
  • Explain our Model

# MicroBotNet x0.32 (FdMobileNetV3Imp2)
# * MACs: 932,886
# * Params: 236,658


FdMobileNetV3Imp2(
  (features): Sequential(
    (0): Sequential(
      (0): Conv2d(3, 8, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): H_swish()
    )
    (1): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(8, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace=True)
        (3): Conv2d(24, 24, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=24, bias=False)
        (4): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): Sequential()
        (6): ReLU(inplace=True)
        (7): Conv2d(24, 8, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (2): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(8, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=32, bias=False)
        (4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=32, out_features=8, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=8, out_features=32, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (3): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(16, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(80, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(80, 80, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=80, bias=False)
        (4): BatchNorm2d(80, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=80, out_features=20, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=20, out_features=80, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(80, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (4): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(16, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(40, 40, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=40, bias=False)
        (4): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=40, out_features=10, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=10, out_features=40, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(40, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (5): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(16, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(48, 48, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=48, bias=False)
        (4): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=48, out_features=12, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=12, out_features=48, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(48, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (6): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(16, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(96, 96, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=96, bias=False)
        (4): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=96, out_features=24, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=24, out_features=96, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(96, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (7): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(32, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(184, 184, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=184, bias=False)
        (4): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=184, out_features=46, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=46, out_features=184, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(184, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (8): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(32, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(184, 184, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=184, bias=False)
        (4): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=184, out_features=46, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=46, out_features=184, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(184, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (9): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(32, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(184, 184, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=184, bias=False)
        (4): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=184, out_features=46, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=46, out_features=184, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(184, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (10): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(32, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(184, 184, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=184, bias=False)
        (4): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=184, out_features=46, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=46, out_features=184, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(184, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (11): Sequential(
      (0): Conv2d(32, 56, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (1): BatchNorm2d(56, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): H_swish()
    )
    (12): AdaptiveAvgPool2d(output_size=1)
    (13): Conv2d(56, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (14): H_swish()
  )
  (classifier): Sequential(
    (0): Dropout(p=0.2, inplace=False)
    (1): Linear(in_features=1024, out_features=10, bias=True)
  )
)

To understand a convolution like:

Conv2d(3, 8, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  • Input channels: 3
  • Number of filters (output channels): 8
  • Stride: 2, so a 32×32 CIFAR-10 input comes out 16×16
* MACs: 6,597,218
* Params: 2,044,298
torch.Size([1, 10])
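
To make the MAC counting concrete, here is a minimal sketch (a helper of my own, not from the repo) that counts MACs and parameters for a single Conv2d. A standard convolution costs (input channels / groups) × kernel height × kernel width multiply-accumulates per output element:

import torch.nn as nn

def conv2d_macs_params(conv, in_hw):
    """Count MACs and parameters for one Conv2d given the input spatial size."""
    h_in, w_in = in_hw
    kh, kw = conv.kernel_size
    sh, sw = conv.stride
    ph, pw = conv.padding
    # standard output-size formula for a convolution
    h_out = (h_in + 2 * ph - kh) // sh + 1
    w_out = (w_in + 2 * pw - kw) // sw + 1
    # each output element costs (in_channels / groups) * kh * kw MACs
    macs = (conv.in_channels // conv.groups) * kh * kw * conv.out_channels * h_out * w_out
    params = sum(p.numel() for p in conv.parameters())
    return macs, params, (h_out, w_out)

conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1, bias=False)
print(conv2d_macs_params(conv, (32, 32)))  # (55296, 216, (16, 16))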

FdMobileNetV3Imp2(
  (features): Sequential(
    (0): Sequential(
      (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): H_swish()
    )
    (1): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(16, 72, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(72, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace=True)
        (3): Conv2d(72, 72, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=72, bias=False)
        (4): BatchNorm2d(72, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): Sequential()
        (6): ReLU(inplace=True)
        (7): Conv2d(72, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (2): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(24, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(96, 96, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=96, bias=False)
        (4): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=96, out_features=24, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=24, out_features=96, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(96, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (3): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(40, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(240, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(240, 240, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=240, bias=False)
        (4): BatchNorm2d(240, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=240, out_features=60, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=60, out_features=240, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(240, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (4): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(40, 120, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(120, 120, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=120, bias=False)
        (4): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=120, out_features=30, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=30, out_features=120, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(120, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (5): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(48, 144, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(144, 144, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=144, bias=False)
        (4): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=144, out_features=36, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=36, out_features=144, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(144, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (6): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(48, 288, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(288, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(288, 288, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=288, bias=False)
        (4): BatchNorm2d(288, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=288, out_features=72, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=72, out_features=288, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(288, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (7): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
        (4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=576, out_features=144, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=144, out_features=576, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (8): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
        (4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=576, out_features=144, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=144, out_features=576, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (9): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
        (4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=576, out_features=144, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=144, out_features=576, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (10): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
        (4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (avg_pool): AdaptiveAvgPool2d(output_size=1)
          (fc): Sequential(
            (0): Linear(in_features=576, out_features=144, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=144, out_features=576, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (11): Sequential(
      (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): H_swish()
    )
    (12): AdaptiveAvgPool2d(output_size=1)
    (13): Conv2d(576, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (14): H_swish()
  )
  (classifier): Sequential(
    (0): Dropout(p=0.2, inplace=False)
    (1): Linear(in_features=1024, out_features=10, bias=True)
  )
)

MicroBotNet Rendering

Weekly Meeting

Week 7 Meeting 20-03-05

MicroBotNet

  • Can we do classification under 1 uJ?
  • Goal is a neural network with under 1 million MACs for classification on CIFAR-10 (10 image classes)
  • MicroBotNet achieves 79.35% accuracy with 930,000 MACs


Trained Ternary Quantization

  • Each layer's weights are trained/quantized down to three values: {w_p, 0, -w_n}
  • Less precision for weights means significantly less power-hungry memory access!
  • In my testing, SqueezeNet drops from 91% accuracy to 88% accuracy after quantization


Trained Ternary Quantization on MicroBotNet

  • We quantize any convolution or fully connected layer that:
    • has over 1,000 parameters, and
    • is not the first convolution layer or the final fully connected layer
    (see the selection sketch below)
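
A minimal sketch of this selection rule (the helper name is my own, assuming a standard PyTorch model):

import torch.nn as nn

def layers_to_quantize(model):
    """Pick conv/FC layers to quantize under the rule above."""
    candidates = [m for m in model.modules()
                  if isinstance(m, (nn.Conv2d, nn.Linear))]
    selected = []
    for i, m in enumerate(candidates):
        # keep the first conv and the final classifier in full precision
        if i == 0 or i == len(candidates) - 1:
            continue
        if m.weight.numel() > 1000:
            selected.append(m)
    return selected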

Trained Ternary Quantization

  • 31 layers quantized. 86 layers not quantized

  • With 65% of parameters quantized (one learned positive and one learned negative weight value per layer), accuracy is 72.8%.

One Negative One Quantization

  • With 65% of parameters quantized to {-1, 0, +1} (fixed values, no learned scaling factors), accuracy is 72.3%; a sketch follows below.
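
A minimal sketch of this variant, under my assumption that it reuses the TTQ threshold rule but with fixed ±1 in place of the learned w_p / w_n:

import torch

def quantize_unit_ternary(kernel, t):
    """Quantize to {-1, 0, +1} with threshold delta = t * max|w|."""
    delta = t * kernel.abs().max()
    a = (kernel > delta).float()   # -> +1
    b = (kernel < -delta).float()  # -> -1
    return a - b                   # everything in between -> 0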

Results

  • 153,792 parameters quantized
    • 12,222 quantized to one
    • 134,124 quantized to zero
    • 7,446 quantized to negative one
  • 82,866 parameters not quantized (see the count check below)
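
As a quick arithmetic check (assuming the quantized kernels are gathered in a hypothetical list quantized_kernels):

n_pos = sum((k > 0).sum().item() for k in quantized_kernels)
n_zero = sum((k == 0).sum().item() for k in quantized_kernels)
n_neg = sum((k < 0).sum().item() for k in quantized_kernels)
# 12,222 + 134,124 + 7,446 = 153,792 quantized parameters
assert n_pos + n_zero + n_neg == 153792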

Week 8 Meeting

Thesis

Related Works

Low Power Classification

Deep Learning is a powerful tool that is able to model complex relationships. One example is image classification, where the user provides an image and a model predicts which class it belongs to.

\subsection{Low Power Classification}
Deep Learning training and inference often incur significant computing and power expense, making them impractical for edge devices. 
Prior work has decreased parameter counts and multiply-and-accumulate (MAC) operations.
MACs are an accepted computational-cost metric as they map to both the multiply-and-accumulate computation and its memory access patterns of filter weights, layer input maps, and partial sums for layer output maps.

SqueezeNet \cite{iandola2016squeezenet} introduced Fire modules as a compression method in an effort to reduce the number of parameters while maintaining accuracy. 
% Reducing $3\times 3$ convolutions to $1\times1$ achieves $9\times$ fewer parameters for a given filter; decreasing the number of input channels to larger $3\times 3$ filters with squeeze layers further lowers the number of parameters.
MobileNetV1 \cite{howard2017mobilenets} replaced standard convolution with depth-wise separable convolutions where a depth-wise convolution performs spatial filtering and pointwise convolutions generate features.
Fast Downsampling \cite{qin2018fd} expanded on MobileNet for extremely computationally constrained tasks: 32$\times$ downsampling in the first 12 layers drops the computational cost substantially with a 5\% accuracy loss.
Trained Ternary Quantization \cite{zhu2016trained} reduced weight precision to 2-bit ternary values with learned scaling factors, at zero accuracy loss.
MobileNetV3 \cite{howard2019searching} used neural architecture search optimizing for efficiency to design their model. 
Other improvements include `hard'~activation functions (h-swish and h-sigmoid) \cite{ramachandran2017searching}, inverted residuals and linear bottlenecks \cite{sandler2018mobilenetv2}, and squeeze-and-excite layers~\cite{hu2018squeeze} that extract spatial and channel-wise information.
% In a 45\si{\nano\meter} process, a 32-bit integer multiplication is 3.1\si{\pico\joule}, a 32-bit integer addition is 0.1\si{\pico\joule}, encapsulating most of the energy cost in a MAC \cite{horowitzenergy}. 
Based on benchmarks from a 45\si{\nano\meter} process \cite{horowitzenergy}, shrinking process nodes and decreased bit precision enable a MAC cost approaching 1\si{\pico\joule}.
Targeting 1\si{\micro\joule} per forward-pass, we combine these advancements into a new network with $<$1 million MACs. 
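
A quick sanity check on that budget (assuming roughly 1\si{\pico\joule} per MAC, per the benchmark above):

\[
\frac{1\,\si{\micro\joule}}{1\,\si{\pico\joule}/\mathrm{MAC}} = 10^{6}\ \mathrm{MACs},
\]

which is where the $<$1 million MAC ceiling comes from.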

My Previous

\subsection{Low Power Classification}
Deep Learning training and inference often incur computing and power expense, making them infeasible to run on embedded devices. Prior work has focused on reducing parameter size and multiply-and-accumulate (MAC) operations. MACs are a useful metric as they map to both the multiply-and-accumulate computation and its memory access patterns of filter weights, layer input maps, and partial sums for layer output maps.

SqueezeNet \cite{iandola2016squeezenet} focuses on decreasing the number of parameters while maintaining accuracy. This is done with Fire modules, which replace some $3\times3$ convolutions with $1\times1$ convolutions that have $9\times$ fewer parameters.
MobileNet \cite{howard2017mobilenets} uses depth-wise separable convolutions, which split a standard convolution into two efficient operations: a depth-wise convolution and a pointwise convolution.
Fast Downsampling \cite{qin2018fd} expands on MobileNet for extremely computationally constrained tasks. It performs 32$\times$ downsampling in the first 12 layers of MobileNet and uses an increased number of channels to decrease computational cost to 12 MFLOPs at a 5\% accuracy loss.
Trained Ternary Quantization \cite{zhu2016trained} reduces weight precision to 2-bit ternary values with learned scaling factors, at zero accuracy loss.
MobileNetV3 \cite{howard2019searching} expands on MobileNet and uses neural architecture search optimizing for efficiency to design the model. It includes other work such as the h-swish and h-sigmoid activation functions \cite{ramachandran2017searching}, inverted residuals and linear bottlenecks \cite{sandler2018mobilenetv2} that improve on depth-wise separable convolutions, and squeeze-and-excite layers \cite{hu2018squeeze} that extract spatial and channel-wise information.