# Introduction

Hello! I do research on low-power neural networks for micro-robots with Kris Pister!

As a grad student, my job is to turn coffee into code. Just kidding. Grad school has been a lot of self-learning, so here's a place for my notes!

# Quantization

Notes on code from the trained-ternary-quantization repo:

```python
import torch
from torch.autograd import Variable
import torch.nn.functional as F
from .training import _accuracy


def initial_scales(kernel):
    return 1.0, 1.0


def quantize(kernel, w_p, w_n, t):
    """
    Return quantized weights of a layer.
    Only possible values of quantized weights are: {zero, w_p, -w_n}.

    kernel: tensor
    w_p: float - positive scaling factor
    w_n: float - negative scaling factor
    t: float - threshold hyperparameter for quantizing (see below)
    """
    delta = t * kernel.abs().max()
    # mask of weights above the threshold delta
    a = (kernel > delta).float()
    # similar mask of weights below -delta: a symmetric +- threshold
    b = (kernel < -delta).float()
    # scale each mask and combine: every weight becomes w_p, -w_n, or zero.
    # example result:
    #   [[ 0.73,  0.00, 0.73],
    #    [-0.82, -0.82, 0.00],
    #    [ 0.00,  0.73, 0.00]]
    return w_p * a + (-w_n * b)
```
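To make the thresholding concrete, here is a small NumPy sketch of the same logic (a stand-in for the PyTorch version above; `quantize_np` is a hypothetical helper name, and the numbers are chosen to reproduce the example matrix in the comments):

```python
import numpy as np

def quantize_np(kernel, w_p, w_n, t):
    # threshold is a fraction t of the largest absolute weight
    delta = t * np.abs(kernel).max()
    a = (kernel > delta).astype(float)    # mask of strongly positive weights
    b = (kernel < -delta).astype(float)   # mask of strongly negative weights
    # everything outside both masks quantizes to zero
    return w_p * a - w_n * b

k = np.array([[0.9, 0.1, 0.8],
              [-0.95, -0.7, 0.2]])
# delta = 0.5 * 0.95 = 0.475, so 0.9 and 0.8 map to w_p,
# -0.95 and -0.7 map to -w_n, and 0.1 and 0.2 map to zero
q = quantize_np(k, w_p=0.73, w_n=0.82, t=0.5)
```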

"""
Arguments:
kernel_grad: gradient with respect to quantized kernel.
kernel: corresponding full precision kernel.
w_p, w_n: scaling factors.
t: hyperparameter for quantization.

Returns:
1. gradient for the full precision kernel.
2. gradient for w_p.
3. gradient for w_n.
"""
# @Brian
# kernel |> abs |> max |> times x
# t is a quantization hyperparameter
delta = t*kernel.abs().max()
a = (kernel > delta).float()
b = (kernel < -delta).float()
# @Brian a and b are masked
# create a ones tensor with the same shape - a - b
# anything that should be quantized to zero is now 1
c = torch.ones(kernel.size()).cuda() - a - b
# scaled kernel grad and grads for scaling factors (w_p, w_n)

# @Brian, rewritten. positive scaling factor times mask times gradient. Do gradients for others. A single tensor from three gradients
# @Brian, rewritten. gradients to update w_p and w_n. Sum of everything it multiplies with.

```python
def optimization_step(model, loss, x_batch, y_batch, optimizer_list, t):
    """Make forward pass and update model parameters with gradients."""

    # parameter 't' is a hyperparameter for quantization

    # 'optimizer_list' contains optimizers for
    # 1. full model (all weights including quantized weights),
    # 2. backup of full precision weights,
    # 3. scaling factors for each layer
    optimizer, optimizer_fp, optimizer_sf = optimizer_list

    x_batch = Variable(x_batch.cuda())
    y_batch = Variable(y_batch.cuda(non_blocking=True))
    # forward pass using quantized model
    logits = model(x_batch)

    # compute logloss
    loss_value = loss(logits, y_batch)
    batch_loss = loss_value.item()

    # compute accuracies
    pred = F.softmax(logits, dim=1)
    batch_accuracy, batch_top5_accuracy = _accuracy(y_batch, pred, top_k=(1, 5))

    # compute grads for quantized model
    loss_value.backward()

    # get all quantized kernels
    all_kernels = optimizer.param_groups[1]['params']

    # get their full precision backups
    all_fp_kernels = optimizer_fp.param_groups[0]['params']

    # get two scaling factors for each quantized kernel
    scaling_factors = optimizer_sf.param_groups[0]['params']

    for i in range(len(all_kernels)):

        # get a quantized kernel
        k = all_kernels[i]

        # get corresponding full precision kernel
        k_fp = all_fp_kernels[i]

        # get scaling factors for the quantized kernel
        f = scaling_factors[i]
        w_p, w_n = f.data[0], f.data[1]

        # get modified grads
        k_fp_grad, w_p_grad, w_n_grad = get_grads(k.grad.data, k_fp.data, w_p, w_n, t)

        # grad for full precision kernel
        k_fp.grad = Variable(k_fp_grad)

        # we don't need to update the quantized kernel directly
        k.grad.data.zero_()

        # grad for scaling factors
        f.grad = Variable(torch.FloatTensor([w_p_grad, w_n_grad]).cuda())

    # update all non quantized weights in quantized model
    # (usually, these are the last layer, the first layer, and all batch norm params)
    optimizer.step()

    # update all full precision kernels
    optimizer_fp.step()

    # update all scaling factors
    optimizer_sf.step()

    # update all quantized kernels with updated full precision kernels
    for i in range(len(all_kernels)):

        k = all_kernels[i]
        k_fp = all_fp_kernels[i]
        f = scaling_factors[i]
        w_p, w_n = f.data[0], f.data[1]

        # requantize a quantized kernel using updated full precision weights
        k.data = quantize(k_fp.data, w_p, w_n, t)

    return batch_loss, batch_accuracy, batch_top5_accuracy
```


# Micro Robot Dataset

```python
!git clone https://github.com/bCom5/pytorch-cifar.git
%cd pytorch-cifar
!pip install thop
```


After cloning the training repo, pull the dataset:

```python
!git clone https://github.com/bCom5/micro-robot-dataset.git
!cd micro-robot-dataset; sh setup.sh
import sys
sys.path.append('micro-robot-dataset')
from loader import *

# we have get_outdoor_iterators, get_indoor_iterators, get_combined_iterators
```


# Papers

\chapter{SqueezeNet}

\begin{enumerate}
\item \href{https://arxiv.org/pdf/1602.07360.pdf}{Paper}
\item \href{https://youtu.be/ge_RT5wvHvY}{Video Walkthrough}
\end{enumerate}

\section{Design}

\begin{figure}[ht!]
\includegraphics[width=1
\textwidth]{photos/squeezenet/design.jpg}
\centering
\end{figure}

1x1 filters are 9x smaller than 3x3 filters.
Squeeze: decrease the number of channels by applying $K$ 1x1 filters.

\clearpage

\section{Fire Module}

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/squeezenet/fire_module.jpg}
\centering
\end{figure}

Squeeze, then expand with 1x1 and 3x3 filters. Zero-pad so the 3x3 and 1x1 filter outputs have the same feature-map size, then concatenate them.

\clearpage

\section{Delayed Downsampling}

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/squeezenet/delay.jpg}
\centering
\end{figure}

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/squeezenet/model.png}
\centering
\end{figure}

Downsample with maxpool and a stride of 2 (skips over half the positions). Fire modules shown. ResNet skip connections in the middle variant (best accuracy) and the right variant. Uses \textbf{avg pool} instead of an FC layer at the end.

\clearpage

\section{Deep Compression}

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/squeezenet/deep_comp.png}
\centering
\caption{Accuracy stays very good. The model can be shrunk further with \textbf{Deep Compression} with little accuracy loss.}
\end{figure}

\clearpage

\chapter{MobileNet Depthwise Separable Convolution}

\begin{enumerate}
\item \href{https://arxiv.org/pdf/1704.04861.pdf}{Paper}
\item \href{https://youtu.be/T7o3xvJLuHk}{Video Walkthrough}
\end{enumerate}

\clearpage

\section{Convolution Operation and Cost}

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/mobilenet/convolution.png}
\centering
\caption{N convolutions of $D_K \times D_K \times M$ (channels like 3 for RGB). Output is $D_G \times D_G \times N$}
\end{figure}

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/mobilenet/conv_cost.png}
\centering
\end{figure}

\begin{enumerate}
\item One convolution multiplication costs $(D_K)^2 \times M$.
\item Doing all convolutions for that kernel (green tensor) costs $(D_G)^2 \times (D_K)^2 \times M$.
\item Finally, the total costs with all $N$ Kernels costs $N \times (D_G)^2 \times (D_K)^2 \times M$.
\end{enumerate}
\textbf{Convolutions are expensive.}

\clearpage

\section{Depthwise Convolution (Filtering Step)}

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/mobilenet/depthwise.png}
\centering
\caption{Instead of the Filters being size M, have it be size 1. Then have M Filters. The output of the \textbf{Depthwise Convolution} will be $D_G \times D_G \times M$.}
\end{figure}

Multiplication cost is $M \times (D_G)^2 \times (D_K)^2$.

\section{Pointwise Convolution (Combining Step)}

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/mobilenet/pointwise.png}
\centering
\caption{The input is the output of the Depthwise Convolution. Use N $1 \times 1 \times M$ filters and the output is $D_G \times D_G \times N$. This is the same as a normal convolution with N Filters of $D_K \times D_K \times M$.}
\end{figure}

Multiplication cost is $N \times (D_G)^2 \times M$.

\textbf{The total cost is:} $M \times (D_G)^2 \times (D_K)^2 + N \times (D_G)^2 \times M = \mathbf{M \times (D_G)^2 \times ((D_K)^2 + N)}$
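The savings can be checked numerically. A small sketch (hypothetical helper names) that counts multiplications using the formulas above:

```python
def standard_conv_mults(D_G, D_K, M, N):
    # N filters of size D_K x D_K x M, applied at D_G x D_G output positions
    return N * D_G**2 * D_K**2 * M

def depthwise_separable_mults(D_G, D_K, M, N):
    depthwise = M * D_G**2 * D_K**2   # one D_K x D_K filter per input channel
    pointwise = N * D_G**2 * M        # N 1x1xM combining filters
    return depthwise + pointwise

# e.g. D_G=14, D_K=3, M=N=256: the cost ratio is 1/N + 1/(D_K)^2, about 0.115
ratio = depthwise_separable_mults(14, 3, 256, 256) / standard_conv_mults(14, 3, 256, 256)
```

Dividing the two formulas gives exactly $\frac{1}{N} + \frac{1}{(D_K)^2}$, which is why the 3x3 case is roughly 8-9x cheaper.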

\clearpage

\section{Others}

\subsection{Comparison}

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/mobilenet/mobilenet_comp.png}
\centering
\caption{Similar accuracy, significantly fewer multi-adds and parameters.}
\end{figure}

\subsection{Module}

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/mobilenet/module.png}
\centering
\end{figure}

\clearpage
\subsection{Activation ReLU6}

\begin{figure}[ht!]
\includegraphics[width=1
\textwidth]{photos/mobilenet/ReLU6.png}
\centering
\caption{ReLU capped at 6 is efficient for low-precision arithmetic.}
\end{figure}

Also there is a \textbf{Global Avg Pooling} then Fully Connected Layer to a Softmax Classifier.

\clearpage

\chapter{MobileNet V2}

\begin{enumerate}
\item \href{https://arxiv.org/abs/1801.04381}{Paper}
\item \href{https://machinethink.net/blog/mobilenet-v2/}{Notes}
\end{enumerate}

\clearpage

\section{Modules}

\begin{figure}[ht!]
\includegraphics[width=.4
\textwidth]{photos/mobilenetv2/mnetv1mod.png}
\centering
\caption{MobileNet V1 uses pointwise convolutions.}
\end{figure}

\begin{figure}[ht!]
\includegraphics[width=.4
\textwidth]{photos/mobilenetv2/mnetv2mod.png}
\centering
\caption{MobileNet V2 uses Residual Blocks with a new first Projection Layer. This is called an \textbf{inverted residual with linear bottleneck}.}
\end{figure}

\section{Inverted Residuals with Linear Bottlenecks}

The idea of MobileNet V2 is based on two ideas:
\begin{enumerate}
\item \textbf{Low Dimension Tensors} reduce the number of computations/multiplications.
\item Low Dimension Tensors \textbf{alone} do not work well. They cannot extract a lot of information.
\end{enumerate}
MobileNet V2 addresses this by having the input be a low dimensional tensor, \textbf{expanding} it to a reasonably high dimensional tensor, \textbf{running a depthwise convolution} on it, and \textbf{squeezing} it back into a low dimensional tensor.
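A rough multiply count shows why this works. This sketch (hypothetical helper names; stride 1, ignoring the residual add) follows the expand/depthwise/project structure just described:

```python
def inverted_residual_mults(h, w, c_in, expansion, k=3):
    c_mid = c_in * expansion
    expand    = h * w * c_in * c_mid      # 1x1 conv: c_in -> c_mid channels
    depthwise = h * w * c_mid * k * k     # one kxk filter per channel
    project   = h * w * c_mid * c_in      # 1x1 conv back down: c_mid -> c_in
    return expand + depthwise + project

def standard_conv_mults(h, w, c, k=3):
    return h * w * k * k * c * c          # ordinary kxk conv, c -> c channels

# filtering at width 144 through a 24-channel bottleneck is far cheaper than
# running a standard 3x3 convolution at width 144 directly
cheap = inverted_residual_mults(14, 14, 24, expansion=6)   # works on 144 channels inside
full = standard_conv_mults(14, 14, 144)
```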

\begin{figure}[ht!]
\includegraphics[width=1
\textwidth]{photos/mobilenetv2/channel_diag.png}
\centering
\end{figure}

\begin{figure}[ht!]
\includegraphics[width=1
\textwidth]{photos/mobilenetv2/expand_filter_squeeze.png}
\centering
\end{figure}

The green is an \textbf{Expand Convolution}: it increases the number of channels. The blue is a \textbf{Depthwise Convolution}: it keeps the number of channels the same and runs filters on the data. The orange is a \textbf{Projection/Bottleneck Layer}. Additionally, there is a \textbf{residual skip connection} to keep information flowing through the network. \\

\clearpage

\section{Other}

\begin{figure}[ht!]
\includegraphics[width=1
\textwidth]{photos/mobilenetv2/stats.png}
\centering
\caption{It is more efficient than MobileNet V1.}
\end{figure}

Some blocks keep channel size the same, others expand it until the final fully connected classification layer.

\textbf{Note: SqueezeNet still has smaller memory usage.}

\clearpage

\chapter{MobileNet V3}

\begin{enumerate}
\item \href{https://arxiv.org/abs/1905.02244}{Paper}
\item \href{https://github.com/kuan-wang/pytorch-mobilenet-v3/blob/master/mobilenetv3.py}{Code Implementation}
\end{enumerate}

\section{H-Swish}

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/MobileNetV3/hswish.png}
\centering
\end{figure}

Swish is an activation function better than ReLU.
$$swish(x) = x * sigmoid(\beta x)$$
H-Swish (hard) is more efficient for hardware.
$$hswish(x) = x * \frac{ReLU6(x + 3)}{6}$$
Recall $ReLU6(x) = \min(\max(0, x), 6)$: zero for $x \le 0$, then linear, then capped at 6 for $x \ge 6$.

H-Swish helps mainly in the deeper layers of the neural network.
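A small sketch of both activations (hypothetical helper names, scalar versions for clarity) makes it easy to see how closely the hard version tracks the smooth one while avoiding the exponential:

```python
import math

def relu6(x):
    return min(max(0.0, x), 6.0)

def swish(x, beta=1.0):
    # swish(x) = x * sigmoid(beta * x)
    return x / (1.0 + math.exp(-beta * x))

def h_swish(x):
    # hard swish: x * ReLU6(x + 3) / 6 -- piecewise linear, no exponential
    return x * relu6(x + 3.0) / 6.0

# h_swish equals the identity for x >= 3 and is exactly zero for x <= -3
```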

\clearpage

\section{Squeeze and Excitation}

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/MobileNetV3/squeezeandexcite.png}
\centering
\end{figure}

From the \href{https://arxiv.org/pdf/1709.01507.pdf}{Squeeze and Excite} paper (a big model, previous SOTA on ImageNet).

\begin{definition}
\textbf{Squeeze}: Global Information Embedding. Instead of each learned filter operating only on a local receptive field, a \textbf{Global Average Pooling} gathers global information per channel.
\end{definition}

\begin{figure}[ht!]
\includegraphics[width=.6
\textwidth]{photos/MobileNetV3/Excite.png}
\centering
\end{figure}

\begin{definition}
\textbf{Excitation}: Adaptive Recalibration. A sigmoid gate is applied after a ReLU nonlinearity, and the resulting per-channel values scale the input.
\end{definition}

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/MobileNetV3/Diagram_Excite.png}
\centering
\end{figure}

\clearpage

\section{Network Improvements}

Here the network found by architecture search is modified by hand to be more efficient.

\begin{figure}[ht!]
\includegraphics[width=.8
\textwidth]{photos/MobileNetV3/dropexpensive.png}
\centering
\end{figure}

The layers for tensors \textbf{960}, \textbf{320}, and \textbf{1280} are removed. Instead, the 7x7x960 input is average-pooled to 1x1x960, passed through a conv with 1280 1x1x960 filters to get 1x1x1280 (\textbf{1280} in the bottom), then through a conv with 1000 1x1x1280 filters to get 1x1x1000 (flattened).

In the original, between tensors \textbf{320} and \textbf{1280}, a 1x1 conv expands to a high dimensional feature space. The output is 7x7x1280, which is then average-pooled to 1x1x1280.

In the hand-tuned efficient version, the average pool comes first. The 1x1x960 result then has its features extracted much more cheaply by the 1x1 conv to 1280 channels.
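The saving from pooling first can be counted directly. A sketch (hypothetical helper name) using the shapes given above:

```python
def conv1x1_mults(h, w, c_in, c_out):
    # a 1x1 conv does c_in * c_out multiplies at every spatial position
    return h * w * c_in * c_out

original = conv1x1_mults(7, 7, 960, 1280)   # expand at 7x7, then avg-pool
redesign = conv1x1_mults(1, 1, 960, 1280)   # avg-pool to 1x1 first
# pooling first cuts this layer's multiplies by a factor of 49 (7 * 7)
```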

The initial layers are also optimized. The usual 32 first-layer filters are often just mirror images of each other; using 16 filters with the h-swish nonlinearity keeps accuracy and reduces this redundancy.

\clearpage

\chapter{FD-MobileNet}
\begin{enumerate}
\item \href{https://arxiv.org/pdf/1802.03750.pdf}{Paper}
\item \href{https://github.com/clavichord93/FD-MobileNet}{Code Implementation}
\item \href{https://github.com/clavichord93/FD-MobileNet/blob/master/pyvision/models/ImageNet/MobileNet.py}{MobileNet and FD-MobileNet models}
\end{enumerate}

\clearpage
\section{Notes}

\begin{figure}[ht!]
\includegraphics[width=.5
\textwidth]{photos/fd_mobile/comparison.png}
\centering
\caption{FD (Fast Downsampling) downsamples early. This means there are fewer operations early on; more of the operations happen after downsampling, when the feature map is smaller.}

\end{figure}

FD-MobileNet x0.25 only has 0.383M params at 43.81\% top-1 accuracy compared to MobileNet x0.25 with 0.47M params at 54.22\% accuracy. MobileNet seems way more accurate, but we are really hardware limited, so I think this is promising.

\begin{figure}[ht!]
\includegraphics[width=1
\textwidth]{photos/fd_mobile/code.png}
\centering
\caption{Only changes are number of channels and stride sizes!}
\end{figure}

\clearpage



# CIFAR-10 Binary CNN Processor on Chip

• mixed-signal binary convolutional neural network
• 3.8 uJ/classification (forward pass)
• 86% accuracy
• BinaryNet {+1 -1}
• multiplication to XNOR
• weight stationary
• data-parallel (all multiplies in parallel (?))
• input reuse
• wide vector sum as energy bottleneck
• 28 nm CMOS
• 328 kB on chip SRAM
• 237 frame/s
• 0.9 mW at 0.6 V, meaning 3.8 uJ per classification

## Intro

• problem: DNNs have to do millions to billions of MACs per inference

• weight stationary
• computing in memory (CIM) (?)
• CMOS-inspired, hardware specialization

• output image pixels are binarized
• always uses 2x2 filters, 256 channels and filters
• low fan-out de-multiplexers

### No Hidden Fully Connected Layers

• BinaryNet required 1.67 MB
  • 558 kB 6 CNN layers
  • 1.13 MB 3 FC layers
• Instead only 261.5 kB
  • 256 kB 8 CNN layers
  • 5.5 kB 1 FC layer

### Filter Computation

• since we are dealing with +1, -1 and sign, batch norm is simplified
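The simplification can be seen directly: with ±1 activations, batch norm followed by the sign function collapses into a single threshold comparison (assuming a positive batch norm scale γ). A sketch with hypothetical names:

```python
def bn_then_sign(x, gamma, beta, mu, sigma):
    # full batch norm followed by binarization
    return 1.0 if gamma * (x - mu) / sigma + beta >= 0.0 else -1.0

def threshold_sign(x, gamma, beta, mu, sigma):
    # for gamma > 0, BN + sign reduces to comparing x against one threshold
    tau = mu - beta * sigma / gamma
    return 1.0 if x >= tau else -1.0
```

So no multiplies or divides are needed at inference, only a per-channel comparison.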

### Top Level Architecture

• pixel is quantized to 7 bits

## Summary

• BinaryNet with XNOR operations
• network architecture designed to work well with CMOS hardware
• low weight memory
• memory cost is amortized - weight stationary, data parallel, input reuse
• energy efficient SC neuron?

# Tools

## Cloning a Repo

```python
!git clone https://github.com/bCom5/trained-ternary-quantization.git
%cd trained-ternary-quantization
!pip install thop
```


## Saving a model

```python
# model.cpu();
torch.save(model.state_dict(), 'micro_large_conv_and_fc_ttq_x0_32.pytorch_state')

from google.colab import drive
drive.mount('/content/gdrive')

!ls /content/gdrive/My\ Drive/04-eecs-299-research/03-new-work/01-fundamental-work/trained-ternary-quantization/ttq_microbotnet/ttq_models

!mv micro_large_conv_and_fc_ttq_x0_32.pytorch_state /content/gdrive/My\ Drive/04-eecs-299-research/03-new-work/01-fundamental-work/trained-ternary-quantization/ttq_microbotnet/ttq_models/micro_large_conv_and_fc_ttq_x0_32.pytorch_state
```


# PyTorch

```python
# for a .pth file
def reformat_dict(state):
    reformat_state = {}
    for key in state:
        new_key = key.replace('module.', '')
        reformat_state[new_key] = state[key]
    return reformat_state

state = torch.load(file_name, map_location='cpu')['net']
reformat_state = reformat_dict(state)
```


## Saving Models

```python
torch.save(model.state_dict(), 'micro_large_conv_and_fc_ttq_x0_32.pytorch_state')
```


# MicroBotNet Architecture

## MicroBotNet x1.00

Correction on this image: the first s should be 2.

```
# 16
# * MACs: 6,597,218
# * Params: 2,044,298
# torch.Size([1, 10])
```

```
FdMobileNetV3Imp2(
  (features): Sequential(
    (0): Sequential(
      (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): H_swish()
    )
    (1): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(16, 72, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(72, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace=True)
        (3): Conv2d(72, 72, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=72, bias=False)
        (4): BatchNorm2d(72, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): Sequential()
        (6): ReLU(inplace=True)
        (7): Conv2d(72, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (2): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(24, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(96, 96, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=96, bias=False)
        (4): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (fc): Sequential(
            (0): Linear(in_features=96, out_features=24, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=24, out_features=96, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(96, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (3): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(40, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(240, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(240, 240, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=240, bias=False)
        (4): BatchNorm2d(240, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (fc): Sequential(
            (0): Linear(in_features=240, out_features=60, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=60, out_features=240, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(240, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (4): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(40, 120, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(120, 120, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=120, bias=False)
        (4): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (fc): Sequential(
            (0): Linear(in_features=120, out_features=30, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=30, out_features=120, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(120, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (5): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(48, 144, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(144, 144, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=144, bias=False)
        (4): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (fc): Sequential(
            (0): Linear(in_features=144, out_features=36, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=36, out_features=144, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(144, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (6): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(48, 288, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(288, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(288, 288, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=288, bias=False)
        (4): BatchNorm2d(288, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (fc): Sequential(
            (0): Linear(in_features=288, out_features=72, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=72, out_features=288, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(288, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (7): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
        (4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (fc): Sequential(
            (0): Linear(in_features=576, out_features=144, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=144, out_features=576, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (8): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
        (4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (fc): Sequential(
            (0): Linear(in_features=576, out_features=144, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=144, out_features=576, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (9): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
        (4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (fc): Sequential(
            (0): Linear(in_features=576, out_features=144, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=144, out_features=576, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (10): Bottleneck(
      (conv): Sequential(
        (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): H_swish()
        (3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
        (4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): SEModule(
          (fc): Sequential(
            (0): Linear(in_features=576, out_features=144, bias=False)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=144, out_features=576, bias=False)
            (3): H_sigmoid()
          )
        )
        (6): H_swish()
        (7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (8): Sequential(
          (lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (11): Sequential(
      (0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): H_swish()
    )
    (13): Conv2d(576, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (14): H_swish()
  )
  (classifier): Sequential(
    (0): Dropout(p=0.2, inplace=False)
    (1): Linear(in_features=1024, out_features=10, bias=True)
  )
)
```


# MicroBotNet Code

# -*- coding: UTF-8 -*-

# This is MicroBotNet
# FdMobileNetV3Imp2 x0.32

'''
From https://github.com/ShowLo/MobileNetV3/blob/master/mobileNetV3.py
MobileNetV3 From <Searching for MobileNetV3>, arXiv:1905.02244.
Ref: https://github.com/d-li14/mobilenetv3.pytorch/blob/master/mobilenetv3.py
https://github.com/kuan-wang/pytorch-mobilenet-v3/blob/master/mobilenetv3.py

'''

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from collections import OrderedDict
from thop import profile

def _ensure_divisible(number, divisor, min_value=None):
'''
Ensure that 'number' can be 'divisor' divisible
Reference from original tensorflow repo:
https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
'''
if min_value is None:
min_value = divisor
new_num = max(min_value, int(number + divisor / 2) // divisor * divisor)
if new_num < 0.9 * number:
new_num += divisor
return new_num

class H_sigmoid(nn.Module):
'''
hard sigmoid
'''
def __init__(self, inplace=True):
super(H_sigmoid, self).__init__()
self.inplace = inplace

def forward(self, x):
return F.relu6(x + 3, inplace=self.inplace) / 6

class H_swish(nn.Module):
'''
hard swish
'''
def __init__(self, inplace=True):
super(H_swish, self).__init__()
self.inplace = inplace

def forward(self, x):
return x * F.relu6(x + 3, inplace=self.inplace) / 6

class SEModule(nn.Module):
'''
SE Module
Ref: https://github.com/moskomule/senet.pytorch/blob/master/senet/se_module.py
'''
def __init__(self, in_channels_num, reduction_ratio=4):
super(SEModule, self).__init__()

if in_channels_num % reduction_ratio != 0:
raise ValueError('in_channels_num must be divisible by reduction_ratio(default = 4)')

self.fc = nn.Sequential(
nn.Linear(in_channels_num, in_channels_num // reduction_ratio, bias=False),
nn.ReLU(inplace=True),
nn.Linear(in_channels_num // reduction_ratio, in_channels_num, bias=False),
H_sigmoid()
)

def forward(self, x):
batch_size, channel_num, _, _ = x.size()
y = self.avg_pool(x).view(batch_size, channel_num)
y = self.fc(y).view(batch_size, channel_num, 1, 1)
return x * y

class Bottleneck(nn.Module):
'''
The basic unit of MobileNetV3
'''
def __init__(self, in_channels_num, exp_size, out_channels_num, kernel_size, stride, use_SE, NL, BN_momentum):
'''
use_SE: True or False -- use SE Module or not
NL: nonlinearity, 'RE' or 'HS'
'''
super(Bottleneck, self).__init__()

assert stride in [1, 2]
NL = NL.upper()
assert NL in ['RE', 'HS']

use_HS = NL == 'HS'

# Whether to use residual structure or not
self.use_residual = (stride == 1 and in_channels_num == out_channels_num)

if exp_size == in_channels_num:
# Without expansion, the first depthwise convolution is omitted
self.conv = nn.Sequential(
# Depthwise Convolution
nn.Conv2d(in_channels=in_channels_num, out_channels=exp_size, kernel_size=kernel_size, stride=stride, padding=(kernel_size-1)//2, groups=in_channels_num, bias=False),
nn.BatchNorm2d(num_features=exp_size, momentum=BN_momentum),
# SE Module
SEModule(exp_size) if use_SE else nn.Sequential(),
H_swish() if use_HS else nn.ReLU(inplace=True),
# Linear Pointwise Convolution
nn.Conv2d(in_channels=exp_size, out_channels=out_channels_num, kernel_size=1, stride=1, padding=0, bias=False),
#nn.BatchNorm2d(num_features=out_channels_num, momentum=BN_momentum)
nn.Sequential(OrderedDict([('lastBN', nn.BatchNorm2d(num_features=out_channels_num))])) if self.use_residual else
nn.BatchNorm2d(num_features=out_channels_num, momentum=BN_momentum)
)
else:
# With expansion
self.conv = nn.Sequential(
# Pointwise Convolution for expansion
nn.Conv2d(in_channels=in_channels_num, out_channels=exp_size, kernel_size=1, stride=1, padding=0, bias=False),
nn.BatchNorm2d(num_features=exp_size, momentum=BN_momentum),
H_swish() if use_HS else nn.ReLU(inplace=True),
# Depthwise Convolution
nn.Conv2d(in_channels=exp_size, out_channels=exp_size, kernel_size=kernel_size, stride=stride, padding=(kernel_size-1)//2, groups=exp_size, bias=False),
nn.BatchNorm2d(num_features=exp_size, momentum=BN_momentum),
# SE Module
SEModule(exp_size) if use_SE else nn.Sequential(),
H_swish() if use_HS else nn.ReLU(inplace=True),
# Linear Pointwise Convolution
nn.Conv2d(in_channels=exp_size, out_channels=out_channels_num, kernel_size=1, stride=1, padding=0, bias=False),
#nn.BatchNorm2d(num_features=out_channels_num, momentum=BN_momentum)
nn.Sequential(OrderedDict([('lastBN', nn.BatchNorm2d(num_features=out_channels_num))])) if self.use_residual else
nn.BatchNorm2d(num_features=out_channels_num, momentum=BN_momentum)
)

def forward(self, x):
if self.use_residual:
return self.conv(x) + x
else:
return self.conv(x)

class FdMobileNetV3Imp2(nn.Module):
'''
FdMobileNetV3 implementation 2: a fast-downsampling variant of MobileNetV3 for very low MAC budgets.
'''
def __init__(self, mode='large', classes_num=1000, input_size=224, width_multiplier=1.0, dropout=0.2, BN_momentum=0.1, zero_gamma=False):
'''
configs: setting of the model
mode: type of the model, 'large' or 'small'
'''
super(FdMobileNetV3Imp2, self).__init__()

mode = mode.lower()
assert mode in ['large', 'small']
s = 2
if input_size == 32 or input_size == 56:
# using cifar-10, cifar-100 or Tiny-ImageNet; keep s = 2 for fast downsampling
# (set s = 1 here instead to preserve input resolution)
pass

# setting of the model
if mode == 'large':
# Configuration of a MobileNetV3-Large Model
configs = [
#kernel_size, exp_size, out_channels_num, use_SE, NL, stride
[3, 16, 16, False, 'RE', 1],
[3, 64, 24, False, 'RE', s],
[3, 72, 24, False, 'RE', 1],
[5, 72, 40, True, 'RE', 2],
[5, 120, 40, True, 'RE', 1],
[5, 120, 40, True, 'RE', 1],
[3, 240, 80, False, 'HS', 2],
[3, 200, 80, False, 'HS', 1],
[3, 184, 80, False, 'HS', 1],
[3, 184, 80, False, 'HS', 1],
[3, 480, 112, True, 'HS', 1],
[3, 672, 112, True, 'HS', 1],
[5, 672, 160, True, 'HS', 2],
[5, 960, 160, True, 'HS', 1],
[5, 960, 160, True, 'HS', 1]
]
elif mode == 'small':
# @SELF edited
configs = [
#kernel_size, exp_size, out_channels_num, use_SE, NL, stride
[3, 72, 24, False, 'RE', 2],
[5, 96, 40, True, 'HS', 2],
[5, 240, 40, True, 'HS', 1],
[5, 120, 48, True, 'HS', 1],
[5, 144, 48, True, 'HS', 1],
[5, 288, 96, True, 'HS', 2],
[5, 576, 96, True, 'HS', 1],
[5, 576, 96, True, 'HS', 1],
[5, 576, 96, True, 'HS', 1],
[5, 576, 96, True, 'HS', 1]
]
# Configuration of a MobileNetV3-Small Model
'''
configs = [
#kernel_size, exp_size, out_channels_num, use_SE, NL, stride
[3, 16, 16, True, 'RE', s],
[3, 72, 24, False, 'RE', 2],
[3, 88, 24, False, 'RE', 1],
[5, 96, 40, True, 'HS', 2],
[5, 240, 40, True, 'HS', 1],
[5, 240, 40, True, 'HS', 1],
[5, 120, 48, True, 'HS', 1],
[5, 144, 48, True, 'HS', 1],
[5, 288, 96, True, 'HS', 2],
[5, 576, 96, True, 'HS', 1],
[5, 576, 96, True, 'HS', 1]
]
'''

first_channels_num = 16

# last_channels_num = 1280
# according to https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet_v3.py
# if small -- 1024, if large -- 1280
last_channels_num = 1280 if mode == 'large' else 1024

divisor = 8

########################################################################################################################
# feature extraction part
# input layer
input_channels_num = _ensure_divisible(first_channels_num * width_multiplier, divisor)
print(input_channels_num)  # debug: first conv's output channels after width scaling
last_channels_num = _ensure_divisible(last_channels_num * width_multiplier, divisor) if width_multiplier > 1 else last_channels_num
feature_extraction_layers = []
first_layer = nn.Sequential(
nn.Conv2d(in_channels=3, out_channels=input_channels_num, kernel_size=3, stride=s, padding=1, bias=False),
nn.BatchNorm2d(num_features=input_channels_num, momentum=BN_momentum),
H_swish()
)
feature_extraction_layers.append(first_layer)
# Overlay of multiple bottleneck structures
for kernel_size, exp_size, out_channels_num, use_SE, NL, stride in configs:
output_channels_num = _ensure_divisible(out_channels_num * width_multiplier, divisor)
exp_size = _ensure_divisible(exp_size * width_multiplier, divisor)
feature_extraction_layers.append(Bottleneck(input_channels_num, exp_size, output_channels_num, kernel_size, stride, use_SE, NL, BN_momentum))
input_channels_num = output_channels_num

# the last stage
last_stage_channels_num = _ensure_divisible(exp_size * width_multiplier, divisor)
last_stage_layer1 = nn.Sequential(
nn.Conv2d(in_channels=input_channels_num, out_channels=last_stage_channels_num, kernel_size=1, stride=1, padding=0, bias=False),
nn.BatchNorm2d(num_features=last_stage_channels_num, momentum=BN_momentum),
H_swish()
)
feature_extraction_layers.append(last_stage_layer1)

# SE Module
# remove the last SE Module according to https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet_v3.py
# feature_extraction_layers.append(SEModule(last_stage_channels_num) if mode == 'small' else nn.Sequential())
# @SELF changed last_channels_num // 2 to last_channels_num 1024 to 576
feature_extraction_layers.append(nn.Conv2d(in_channels=last_stage_channels_num, out_channels=last_channels_num, kernel_size=1, stride=1, padding=0, bias=False))
feature_extraction_layers.append(H_swish())

self.features = nn.Sequential(*feature_extraction_layers)

########################################################################################################################
# Classification part
self.classifier = nn.Sequential(
nn.Dropout(p=dropout),
# nn.Linear(last_channels_num // 2, last_channels_num),
# @SELF added second linear layer
nn.Linear(last_channels_num, classes_num),
)

########################################################################################################################
# Initialize the weights
self._initialize_weights(zero_gamma)

def forward(self, x):
x = self.features(x)
x = x.view(x.size(0), -1)
x = self.classifier(x)
return x

def _initialize_weights(self, zero_gamma):
'''
Initialize the weights
'''
for m in self.modules():
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode='fan_out')
if m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.BatchNorm2d):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.Linear):
nn.init.normal_(m.weight, std=0.001)
if m.bias is not None:
nn.init.constant_(m.bias, 0)
if zero_gamma:
for m in self.modules():
if hasattr(m, 'lastBN'):
nn.init.constant_(m.lastBN.weight, 0.0)

def test():
net = FdMobileNetV3Imp2(classes_num=10, input_size=32,
width_multiplier=1.00, mode='small')
x = torch.randn(1,3,32,32)
flops, params = profile(net, inputs=(x, ))
print('* MACs: {:,.2f}'.format(flops).replace('.00', ''))
print('* Params: {:,.2f}'.format(params).replace('.00', ''))
y = net(x)
print(y.size())
print()
print(net)

test()


# Week 7

## Sunday, 20-03-01

• Why does our accuracy die?
• We think it isn't training correctly.

On MicroBotNet:

from functools import reduce

[
(p.shape, reduce(lambda x, y: x*y, p.shape)) for n, p in model.features[1:11].named_parameters()
if 'conv' in n and 'weight' in n and 'lastBN' not in n and 'fc' not in n
]
# is
'''
[(torch.Size([24, 8, 1, 1]), 192),
(torch.Size([24]), 24),
(torch.Size([24, 1, 3, 3]), 216),
(torch.Size([24]), 24),
(torch.Size([8, 24, 1, 1]), 192),
(torch.Size([8]), 8),
(torch.Size([32, 8, 1, 1]), 256),
(torch.Size([32]), 32),
(torch.Size([32, 1, 5, 5]), 800),
(torch.Size([32]), 32),
(torch.Size([16, 32, 1, 1]), 512),
(torch.Size([16]), 16),
(torch.Size([80, 16, 1, 1]), 1280),
(torch.Size([80]), 80),
(torch.Size([80, 1, 5, 5]), 2000),
(torch.Size([80]), 80),
(torch.Size([16, 80, 1, 1]), 1280),
(torch.Size([40, 16, 1, 1]), 640),
(torch.Size([40]), 40),
(torch.Size([40, 1, 5, 5]), 1000),
(torch.Size([40]), 40),
(torch.Size([16, 40, 1, 1]), 640),
(torch.Size([48, 16, 1, 1]), 768),
(torch.Size([48]), 48),
(torch.Size([48, 1, 5, 5]), 1200),
(torch.Size([48]), 48),
(torch.Size([16, 48, 1, 1]), 768),
(torch.Size([96, 16, 1, 1]), 1536),
(torch.Size([96]), 96),
(torch.Size([96, 1, 5, 5]), 2400),
(torch.Size([96]), 96),
(torch.Size([32, 96, 1, 1]), 3072),
(torch.Size([32]), 32),
(torch.Size([184, 32, 1, 1]), 5888),
(torch.Size([184]), 184),
(torch.Size([184, 1, 5, 5]), 4600),
(torch.Size([184]), 184),
(torch.Size([32, 184, 1, 1]), 5888),
(torch.Size([184, 32, 1, 1]), 5888),
(torch.Size([184]), 184),
(torch.Size([184, 1, 5, 5]), 4600),
(torch.Size([184]), 184),
(torch.Size([32, 184, 1, 1]), 5888),
(torch.Size([184, 32, 1, 1]), 5888),
(torch.Size([184]), 184),
(torch.Size([184, 1, 5, 5]), 4600),
(torch.Size([184]), 184),
(torch.Size([32, 184, 1, 1]), 5888),
(torch.Size([184, 32, 1, 1]), 5888),
(torch.Size([184]), 184),
(torch.Size([184, 1, 5, 5]), 4600),
(torch.Size([184]), 184),
(torch.Size([32, 184, 1, 1]), 5888)]
'''


This does not include the fully connected layers embedded within our layers (the SE-module `fc` weights are filtered out above).

SqueezeNet for comparison

from functools import reduce

[
(p.shape, reduce(lambda x, y: x*y, p.shape)) for n, p in all_conv_weights
if not ('classifier' in n or 'features.0.' in n)
]
# is
'''
[(torch.Size([16, 64, 1, 1]), 1024),
(torch.Size([64, 16, 1, 1]), 1024),
(torch.Size([64, 16, 3, 3]), 9216),
(torch.Size([16, 128, 1, 1]), 2048),
(torch.Size([64, 16, 1, 1]), 1024),
(torch.Size([64, 16, 3, 3]), 9216),
(torch.Size([32, 128, 1, 1]), 4096),
(torch.Size([128, 32, 1, 1]), 4096),
(torch.Size([128, 32, 3, 3]), 36864),
(torch.Size([32, 256, 1, 1]), 8192),
(torch.Size([128, 32, 1, 1]), 4096),
(torch.Size([128, 32, 3, 3]), 36864),
(torch.Size([48, 256, 1, 1]), 12288),
(torch.Size([192, 48, 1, 1]), 9216),
(torch.Size([192, 48, 3, 3]), 82944),
(torch.Size([48, 384, 1, 1]), 18432),
(torch.Size([192, 48, 1, 1]), 9216),
(torch.Size([192, 48, 3, 3]), 82944),
(torch.Size([64, 384, 1, 1]), 24576),
(torch.Size([256, 64, 1, 1]), 16384),
(torch.Size([256, 64, 3, 3]), 147456),
(torch.Size([64, 512, 1, 1]), 32768),
(torch.Size([256, 64, 1, 1]), 16384),
(torch.Size([256, 64, 3, 3]), 147456)]
'''

Now do MicroBotNet, only convs bigger than 1000:

from functools import reduce

[
(p.shape, reduce(lambda x, y: x*y, p.shape)) for n, p in model.features[1:11].named_parameters()
if 'conv' in n and 'weight' in n
and 'lastBN' not in n and 'fc' not in n
and reduce(lambda x, y: x*y, p.shape) > 1000
]
# is
'''
[(torch.Size([80, 16, 1, 1]), 1280),
(torch.Size([80, 1, 5, 5]), 2000),
(torch.Size([16, 80, 1, 1]), 1280),
(torch.Size([48, 1, 5, 5]), 1200),
(torch.Size([96, 16, 1, 1]), 1536),
(torch.Size([96, 1, 5, 5]), 2400),
(torch.Size([32, 96, 1, 1]), 3072),
(torch.Size([184, 32, 1, 1]), 5888),
(torch.Size([184, 1, 5, 5]), 4600),
(torch.Size([32, 184, 1, 1]), 5888),
(torch.Size([184, 32, 1, 1]), 5888),
(torch.Size([184, 1, 5, 5]), 4600),
(torch.Size([32, 184, 1, 1]), 5888),
(torch.Size([184, 32, 1, 1]), 5888),
(torch.Size([184, 1, 5, 5]), 4600),
(torch.Size([32, 184, 1, 1]), 5888),
(torch.Size([184, 32, 1, 1]), 5888),
(torch.Size([184, 1, 5, 5]), 4600),
(torch.Size([32, 184, 1, 1]), 5888)]
'''


Got to 73.5% accuracy quantizing only convolutions above 1,000 parameters. Got to 73.2% accuracy quantizing convolutions and biases above 1,000 parameters.

## Monday, 20-03-02

• 65% of weights quantized to zero, one, negative one: 72.3% accuracy.
• TTQ (with learned scaling factors) is 72.8% accuracy.
• 31 layers quantized, 86 layers not quantized.
• 153,792 params quantized:
• 12,222 quantized to one.
• 134,124 quantized to zero.
• 7,446 quantized to negative one.
• 82,866 params not quantized.
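The counts above can be sanity-checked with a small standalone sketch of the ternary quantization from my notes (the tensor and threshold `t` here are toy values, not the real kernels):

```python
import torch

def ternary_quantize(kernel, w_p, w_n, t):
    """Quantize weights to {w_p, 0, -w_n}, thresholding at t * max|w|."""
    delta = t * kernel.abs().max()
    a = (kernel > delta).float()    # mask of weights pushed to +w_p
    b = (kernel < -delta).float()   # mask of weights pushed to -w_n
    return w_p * a - w_n * b

torch.manual_seed(0)
kernel = torch.randn(4, 4)
q = ternary_quantize(kernel, w_p=1.0, w_n=1.0, t=0.5)

# Tally the three bins, like the per-network counts above
pos, zero, neg = int((q > 0).sum()), int((q == 0).sum()), int((q < 0).sum())
print(pos, zero, neg)  # the three counts sum to 16
```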

## Thursday, 20-03-05

• Make a visualization of the network for Pister

Google Colab with the quantization

# Week 8

## Monday, 20-03-09

• Acorns
• Transfer Learning
• Explain our Model
# MicroBotNet x0.32 (FdMobileNetV3Imp2)
# * MACs: 932,886
# * Params: 236,658

FdMobileNetV3Imp2(
(features): Sequential(
(0): Sequential(
(0): Conv2d(3, 8, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(1): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
)
(1): Bottleneck(
(conv): Sequential(
(0): Conv2d(8, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(24, 24, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=24, bias=False)
(4): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): Sequential()
(6): ReLU(inplace=True)
(7): Conv2d(24, 8, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(2): Bottleneck(
(conv): Sequential(
(0): Conv2d(8, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=32, bias=False)
(4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=32, out_features=8, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=8, out_features=32, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(3): Bottleneck(
(conv): Sequential(
(0): Conv2d(16, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(80, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(80, 80, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=80, bias=False)
(4): BatchNorm2d(80, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=80, out_features=20, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=20, out_features=80, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(80, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): Sequential(
(lastBN): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(4): Bottleneck(
(conv): Sequential(
(0): Conv2d(16, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(40, 40, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=40, bias=False)
(4): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=40, out_features=10, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=10, out_features=40, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(40, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): Sequential(
(lastBN): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(5): Bottleneck(
(conv): Sequential(
(0): Conv2d(16, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(48, 48, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=48, bias=False)
(4): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=48, out_features=12, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=12, out_features=48, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(48, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): Sequential(
(lastBN): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(6): Bottleneck(
(conv): Sequential(
(0): Conv2d(16, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(96, 96, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=96, bias=False)
(4): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=96, out_features=24, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=24, out_features=96, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(96, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(7): Bottleneck(
(conv): Sequential(
(0): Conv2d(32, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(184, 184, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=184, bias=False)
(4): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=184, out_features=46, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=46, out_features=184, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(184, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): Sequential(
(lastBN): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(8): Bottleneck(
(conv): Sequential(
(0): Conv2d(32, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(184, 184, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=184, bias=False)
(4): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=184, out_features=46, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=46, out_features=184, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(184, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): Sequential(
(lastBN): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(9): Bottleneck(
(conv): Sequential(
(0): Conv2d(32, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(184, 184, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=184, bias=False)
(4): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=184, out_features=46, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=46, out_features=184, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(184, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): Sequential(
(lastBN): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(10): Bottleneck(
(conv): Sequential(
(0): Conv2d(32, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(184, 184, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=184, bias=False)
(4): BatchNorm2d(184, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=184, out_features=46, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=46, out_features=184, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(184, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): Sequential(
(lastBN): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(11): Sequential(
(0): Conv2d(32, 56, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(56, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
)
(13): Conv2d(56, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(14): H_swish()
)
(classifier): Sequential(
(0): Dropout(p=0.2, inplace=False)
(1): Linear(in_features=1024, out_features=10, bias=True)
)
)


To understand a convolution like:

Conv2d(3, 8, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)

• Input channels: 3
• Number of filters: 8
• Stride: 2, padding: 1
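Checking the numbers by hand for this layer on a 32x32 CIFAR-10 input (counting one multiply-accumulate per kernel weight per output pixel):

```python
# Conv2d(3, 8, kernel_size=3, stride=2, padding=1, bias=False)
in_ch, out_ch, k, stride, pad = 3, 8, 3, 2, 1
h = w = 32  # CIFAR-10 input resolution

out_h = (h + 2 * pad - k) // stride + 1  # 16
out_w = (w + 2 * pad - k) // stride + 1  # 16

params = out_ch * in_ch * k * k  # no bias term
macs = params * out_h * out_w    # each weight used once per output pixel

print(params, macs)  # 216 55296
```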
# MicroBotNet x1.00 (FdMobileNetV3Imp2), output of test()
16
* MACs: 6,597,218
* Params: 2,044,298
torch.Size([1, 10])

FdMobileNetV3Imp2(
(features): Sequential(
(0): Sequential(
(0): Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
)
(1): Bottleneck(
(conv): Sequential(
(0): Conv2d(16, 72, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(72, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(72, 72, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=72, bias=False)
(4): BatchNorm2d(72, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): Sequential()
(6): ReLU(inplace=True)
(7): Conv2d(72, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(2): Bottleneck(
(conv): Sequential(
(0): Conv2d(24, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(96, 96, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=96, bias=False)
(4): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=96, out_features=24, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=24, out_features=96, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(96, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(3): Bottleneck(
(conv): Sequential(
(0): Conv2d(40, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(240, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(240, 240, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=240, bias=False)
(4): BatchNorm2d(240, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=240, out_features=60, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=60, out_features=240, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(240, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): Sequential(
(lastBN): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(4): Bottleneck(
(conv): Sequential(
(0): Conv2d(40, 120, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(120, 120, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=120, bias=False)
(4): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=120, out_features=30, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=30, out_features=120, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(120, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(5): Bottleneck(
(conv): Sequential(
(0): Conv2d(48, 144, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(144, 144, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=144, bias=False)
(4): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=144, out_features=36, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=36, out_features=144, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(144, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): Sequential(
(lastBN): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(6): Bottleneck(
(conv): Sequential(
(0): Conv2d(48, 288, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(288, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(288, 288, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=288, bias=False)
(4): BatchNorm2d(288, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=288, out_features=72, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=72, out_features=288, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(288, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(7): Bottleneck(
(conv): Sequential(
(0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
(4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=576, out_features=144, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=144, out_features=576, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): Sequential(
(lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(8): Bottleneck(
(conv): Sequential(
(0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
(4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=576, out_features=144, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=144, out_features=576, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): Sequential(
(lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(9): Bottleneck(
(conv): Sequential(
(0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
(4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=576, out_features=144, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=144, out_features=576, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): Sequential(
(lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(10): Bottleneck(
(conv): Sequential(
(0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
(3): Conv2d(576, 576, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=576, bias=False)
(4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): SEModule(
(fc): Sequential(
(0): Linear(in_features=576, out_features=144, bias=False)
(1): ReLU(inplace=True)
(2): Linear(in_features=144, out_features=576, bias=False)
(3): H_sigmoid()
)
)
(6): H_swish()
(7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): Sequential(
(lastBN): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(11): Sequential(
(0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): H_swish()
)
(13): Conv2d(576, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(14): H_swish()
)
(classifier): Sequential(
(0): Dropout(p=0.2, inplace=False)
(1): Linear(in_features=1024, out_features=10, bias=True)
)
)


MicroBotNet Rendering

# Week 7 Meeting 20-03-05

## MicroBotNet

• Can we do classification under 1 uJ?
• Goal is a neural network with under 1 million MACs to do classification on CIFAR-10 (10 image classes)
• MicroBotNet achieves 79.35% accuracy with 930,000 MACs

## Trained Ternary Quantization

• Weights trained and quantized to 3 values per convolution: {w_p, 0, -w_n}
• Less precision for weights – significantly less power-hungry memory access!
• My testing shows SqueezeNet drops from 91% accuracy to 88% accuracy

## Trained Ternary Quantization on MicroBotNet

• We quantize any convolution or fully connected layer that:
• Has over 1,000 parameters
• Is not the first convolution layer or the final fully connected layer
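A sketch of that selection rule (the function name and toy model are my own; it assumes `named_parameters()` yields layers in order, as in the snippets above):

```python
import torch.nn as nn

def layers_to_quantize(model, min_params=1000):
    """Conv/linear kernels with more than min_params elements,
    excluding the first conv and the final fully connected layer."""
    weights = [
        (name, p) for name, p in model.named_parameters()
        if 'weight' in name and p.dim() > 1  # conv / linear kernels only
    ]
    first, last = weights[0][0], weights[-1][0]
    return [
        (name, p) for name, p in weights
        if name not in (first, last) and p.numel() > min_params
    ]

# Toy model for illustration; real use would pass MicroBotNet
model = nn.Sequential(
    nn.Conv2d(3, 16, 3),   # first conv: always skipped
    nn.Conv2d(16, 64, 3),  # 9,216 params: quantized
    nn.Flatten(),
    nn.Linear(64, 10),     # final classifier: always skipped
)
print([name for name, _ in layers_to_quantize(model)])  # ['1.weight']
```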

### Trained Ternary Quantization

• 31 layers quantized. 86 layers not quantized

• 65% of weights quantized, using one learned positive weight and one learned negative weight per layer: 72.8% accuracy.

### One Negative One Quantization

• 65% of weights quantized to zero, one, or negative one: 72.3% accuracy.

### Results

• 153,792 params quantized:
• 12,222 quantized to one.
• 134,124 quantized to zero.
• 7,446 quantized to negative one.
• 82,866 params not quantized.

# Related Works

## Low Power Classification

Deep Learning is a powerful tool that is able to model complex relationships. One example is image classification, where a model predicts which of a set of classes an input image belongs to.

\subsection{Low Power Classification}
Deep Learning training and inference often incur significant computing and power expense, making them impractical for edge devices.
MACs are an accepted computational-cost metric as they map to both the multiply-and-accumulate computation and its memory access patterns of filter weights, layer input maps, and partial sums for layer output maps.
Prior work has decreased parameter size and multiply-and-accumulate (MAC) operations.

SqueezeNet \cite{iandola2016squeezenet} introduced Fire modules as a compression method in an effort to reduce the number of parameters while maintaining accuracy.
% Reducing $3\times 3$ convolutions to $1\times1$ achieves $9\times$ fewer parameters for a given filter; decreasing the number of input channels to larger $3\times 3$ filters with squeeze layers further lowers the number of parameters.
MobileNetV1 \cite{howard2017mobilenets} replaced standard convolution with depth-wise separable convolutions where a depth-wise convolution performs spatial filtering and pointwise convolutions generate features.
Fast Downsampling \cite{qin2018fd} expanded on MobileNet for extremely computationally constrained tasks--32$\times$ downsampling in the first 12 layers cuts computational cost substantially with a 5\% accuracy loss.
Trained Ternary Quantization \cite{zhu2016trained} reduced weight precision to 2-bit ternary values with scaling factors with zero accuracy loss.
MobileNetV3 \cite{howard2019searching} used neural architecture search optimizing for efficiency to design their model.
Other improvements include ``hard'' activation functions (h-swish and h-sigmoid) \cite{ramachandran2017searching}, inverted residuals and linear bottlenecks \cite{sandler2018mobilenetv2}, and squeeze-and-excite layers \cite{hu2018squeeze} that extract spatial and channel-wise information.
% In a 45\si{\nano\meter} process, a 32-bit integer multiplication is 3.1\si{\pico\joule}, a 32-bit integer addition is 0.1\si{\pico\joule}, encapsulating most of the energy cost in a MAC \cite{horowitzenergy}.
Based on benchmarks from a 45\si{\nano\meter} process \cite{horowitzenergy}, shrinking process nodes and decreased bit precision enable a MAC cost approaching 1\si{\pico\joule}.
Targeting 1\si{\micro\joule} per forward-pass, we combine these advancements into a new network with $<$1 million MACs.
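As a back-of-the-envelope check of that target (assuming the roughly 1 pJ/MAC figure above):

```python
macs_per_inference = 932_886  # MicroBotNet x0.32 profile from earlier in these notes
energy_per_mac_pj = 1.0       # optimistic MAC cost from the 45 nm benchmarks
energy_uj = macs_per_inference * energy_per_mac_pj * 1e-12 / 1e-6
print(round(energy_uj, 2))  # 0.93 -- just under the 1 uJ budget
```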


## My Previous Draft

\subsection{Low Power Classification}
Deep Learning training and inference often computing and power expense, making it infeasible to run on embedded devices. Prior work has dealt with dropping parameter size and multiply-and-accumulate (MAC) operations. MACs are a more useful metric as they map to both the multiply-and-accumulate computation and its memory access patterns of filter weights, layer input maps, and partial sums for layer output maps.

SqueezeNet \cite{iandola2016squeezenet} focusing on decreasing the number of parameters while maintaining accuracy. This is done with Fire Modules which reduces some $3\times3$ convolutions with $1\times1$ convolutions, which are 9x less parameters.
MobileNet \cite{howard2017mobilenets} use depth-wise separable convolutions which replace standard convolutions into two efficient operations of depth-wise convolutions and pointwise convolutions.
Fast Downsampling \cite{qin2018fd} expands on Mobilenet for extremely computational constrained tasks. They perform 32x downsampling in the first 12 layers of MobileNet and use increased number of channels to decrease computational cost to 12 MFLOPs at 5\% accuracy loss.
Trained Ternary Quantization \cite{zhu2016trained} reduces precision of weights to only 2-bit ternary values with scaling factors with zero accuracy loss.
MobileNet V3 \cite{howard2019searching} expands on Mobilenet and use neural architecture search optimizing for efficiency to design their model. They include other works including activation functions h-swish and h-sigmoid \cite{ramachandran2017searching}, inverted residuals and linear bottlenecks \cite{sandler2018mobilenetv2} that improve on depthwise seperatable convolutions, and squeeze-and-excite \cite{hu2018squeeze} layers that extract spatial and channel-wise information.