\part{Papers}

\chapter{SqueezeNet}

\begin{enumerate}
\item \href{https://arxiv.org/pdf/1602.07360.pdf}{Paper}
\item \href{https://youtu.be/ge_RT5wvHvY}{Video Walkthrough}
\end{enumerate}

\section{Design}

\begin{figure}[ht!]
\centering
\includegraphics[width=1\textwidth]{photos/squeezenet/design.jpg}
\end{figure}

1x1 filters have 9x fewer parameters than 3x3 filters.
\textbf{Squeeze}: decrease the number of channels by applying K 1x1 filters, where K is smaller than the number of input channels.
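To make the 9x claim and the squeeze step concrete, here is a quick parameter count; the channel sizes are illustrative, not taken from the paper's tables.

```python
# Conv layer parameter count: kernel_h x kernel_w x in_channels x out_channels
# (biases ignored). Channel sizes below are illustrative.
def conv_params(k, in_ch, out_ch):
    return k * k * in_ch * out_ch

in_ch = 64
# Same channel counts, 3x3 vs 1x1 kernels:
print(conv_params(3, in_ch, in_ch) // conv_params(1, in_ch, in_ch))  # 9

# A squeeze layer with K = 16 1x1 filters cuts the channels fed to any
# following (expensive) 3x3 filters from 64 down to 16:
print(conv_params(1, in_ch, 16))  # 1024 parameters
```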

\clearpage

\section{Fire Module}

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/squeezenet/fire_module.jpg}
\end{figure}

Squeeze, then expand with both 1x1 and 3x3 filters. Zero-pad the input to the 3x3 filters so the 3x3 and 1x1 branches produce feature maps of the same spatial size, then concatenate them along the channel dimension.
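A parameter count shows why the fire module is cheap compared to a plain 3x3 layer with the same input and output channels; the specific layer sizes here are illustrative, not taken from the paper's architecture table.

```python
# Fire module parameter count: a 1x1 squeeze to s channels, then parallel
# 1x1 and 3x3 expand branches whose outputs are concatenated (e1 + e3 channels).
# The layer sizes below are illustrative.
def fire_params(in_ch, s, e1, e3):
    squeeze = 1 * 1 * in_ch * s    # 1x1 squeeze: in_ch -> s channels
    expand1 = 1 * 1 * s * e1       # 1x1 expand branch
    expand3 = 3 * 3 * s * e3       # 3x3 expand branch (zero-padded)
    return squeeze + expand1 + expand3

plain = 3 * 3 * 96 * 128            # plain 3x3 conv, 96 -> 128 channels
fire = fire_params(96, 16, 64, 64)  # squeeze to 16, expand to 64 + 64 = 128
print(plain, fire)                  # 110592 11776
```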

\clearpage

\section{Delayed Downsampling}

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/squeezenet/delay.jpg}
\end{figure}

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/squeezenet/model.png}
\end{figure}

Downsampling is done with max pooling with a stride of 2 (skipping every other position). The fire modules are shown. The middle and right variants add ResNet-style skip connections (the middle variant has the best accuracy). The network uses \textbf{global avg pool} instead of a fully connected layer at the end.
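As a sanity check on the stride-2 downsampling, the standard output-size formula can be computed directly; the 3x3 window matches SqueezeNet's maxpool, while the input sizes are illustrative.

```python
# Output spatial size of a pooling/conv window: floor((n + 2p - k) / s) + 1.
# SqueezeNet's maxpool uses a 3x3 window with stride 2; inputs are illustrative.
def pool_out(n, k=3, s=2, p=0):
    return (n + 2 * p - k) // s + 1

print(pool_out(55))   # 27 -- a 55x55 map pools down to 27x27
print(pool_out(27))   # 13
```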

\clearpage

\section{Deep Compression}

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/squeezenet/deep_comp.png}
\caption{Accuracy stays very good. The model can be shrunk further with \textbf{Deep Compression} with little accuracy loss.}
\end{figure}

\clearpage

\chapter{MobileNet Depthwise Separable Convolution}

\begin{enumerate}
\item \href{https://arxiv.org/pdf/1704.04861.pdf}{Paper}
\item \href{https://youtu.be/T7o3xvJLuHk}{Video Walkthrough}
\end{enumerate}

\clearpage

\section{Convolution Operation and Cost}

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/mobilenet/convolution.png}
\caption{$N$ kernels of size $D_K \times D_K \times M$ ($M$ input channels, e.g. 3 for RGB). The output is $D_G \times D_G \times N$.}
\end{figure}

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/mobilenet/conv_cost.png}
\end{figure}

\begin{enumerate}
\item Computing one output element costs $(D_K)^2 \times M$ multiplications.
\item Computing the full output feature map for one kernel (green tensor) costs $(D_G)^2 \times (D_K)^2 \times M$.
\item Finally, the total cost with all $N$ kernels is $N \times (D_G)^2 \times (D_K)^2 \times M$.
\end{enumerate}
\textbf{Convolutions are expensive.}
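The total cost formula above is easy to evaluate for a typical layer; the sizes used here are illustrative, not from the paper.

```python
# Multiplication count of a standard convolution, using the symbols above:
# D_K kernel size, M input channels, N kernels, D_G output spatial size.
def standard_conv_cost(d_k, m, n, d_g):
    # N kernels x output positions x multiplications per output element
    return n * d_g**2 * d_k**2 * m

# Illustrative layer: 3x3 kernels, 32 -> 64 channels, 112x112 output.
print(standard_conv_cost(3, 32, 64, 112))  # 231211008 multiplications
```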

\clearpage

\section{Depthwise Convolution (Filtering Step)}

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/mobilenet/depthwise.png}
\caption{Instead of each filter having depth $M$, each filter has depth 1, and there are $M$ filters, one per input channel. The output of the \textbf{Depthwise Convolution} is $D_G \times D_G \times M$.}
\end{figure}

Multiplication cost is $M \times (D_G)^2 \times (D_K)^2$.

\section{Pointwise Convolution (Combining Step)}

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/mobilenet/pointwise.png}
\caption{The input is the output of the Depthwise Convolution. Use $N$ filters of size $1 \times 1 \times M$; the output is $D_G \times D_G \times N$, the same shape as a standard convolution with $N$ filters of size $D_K \times D_K \times M$.}
\end{figure}

Multiplication cost is $N \times (D_G)^2 \times M$.

\textbf{The total cost is:} $M \times (D_G)^2 \times (D_K)^2 + N \times (D_G)^2 \times M = \mathbf{M \times (D_G)^2 \times ((D_K)^2 + N)}$
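Putting the two steps together shows the savings over a standard convolution; the layer sizes are illustrative, but the ratio $\frac{1}{N} + \frac{1}{(D_K)^2}$ is general.

```python
# Depthwise + pointwise cost vs standard convolution cost, using the
# formulas above (layer sizes are illustrative).
def standard_cost(d_k, m, n, d_g):
    return n * d_g**2 * d_k**2 * m

def separable_cost(d_k, m, n, d_g):
    depthwise = m * d_g**2 * d_k**2   # filtering step
    pointwise = n * d_g**2 * m        # combining step
    return depthwise + pointwise      # = M x (D_G)^2 x ((D_K)^2 + N)

d_k, m, n, d_g = 3, 32, 64, 112
ratio = separable_cost(d_k, m, n, d_g) / standard_cost(d_k, m, n, d_g)
print(round(ratio, 3))  # 0.127 = 1/N + 1/(D_K)^2, roughly an 8x reduction
```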

\clearpage

\section{Others}

\subsection{Comparison}

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/mobilenet/mobilenet_comp.png}
\caption{Similar accuracy with significantly fewer multiply-adds and parameters.}
\end{figure}

\subsection{Module}

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/mobilenet/module.png}
\end{figure}

\clearpage
\subsection{Activation ReLU6}

\begin{figure}[ht!]
\centering
\includegraphics[width=1\textwidth]{photos/mobilenet/ReLU6.png}
\caption{ReLU capped at 6; efficient for low-precision arithmetic.}
\end{figure}

There is also \textbf{Global Avg Pooling} followed by a Fully Connected Layer into a Softmax Classifier.

\clearpage

\chapter{MobileNet V2}

\begin{enumerate}
\item \href{https://arxiv.org/abs/1801.04381}{Paper}
\item \href{https://machinethink.net/blog/mobilenet-v2/}{Notes}
\end{enumerate}

\clearpage

\section{Modules}

\begin{figure}[ht!]
\centering
\includegraphics[width=.4\textwidth]{photos/mobilenetv2/mnetv1mod.png}
\caption{The MobileNet V1 module: a depthwise convolution followed by a pointwise convolution.}
\end{figure}

\begin{figure}[ht!]
\centering
\includegraphics[width=.4\textwidth]{photos/mobilenetv2/mnetv2mod.png}
\caption{MobileNet V2 uses Residual Blocks and a new final Projection Layer. This is called an \textbf{inverted residual with linear bottleneck}.}
\end{figure}

\section{Inverted Residuals with Linear Bottlenecks}

The idea of MobileNet V2 is based on two ideas:
\begin{enumerate}
\item \textbf{Low-dimensional tensors} reduce the number of computations/multiplications.
\item Low-dimensional tensors \textbf{alone} do not work well: they cannot extract a lot of information.
\end{enumerate}
MobileNet V2 addresses this by taking a low-dimensional input tensor, \textbf{expanding} it to a reasonably high-dimensional tensor, \textbf{running a depthwise convolution} on it, and \textbf{squeezing} it back into a low-dimensional tensor.
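The expand-filter-squeeze pattern can be traced as a shape walk through one block; the channel counts here are illustrative, with the paper's expansion factor of 6.

```python
# Shape walk through one hypothetical inverted residual block; channel
# counts are illustrative, expansion factor of 6 as in the paper.
in_ch, expansion, out_ch = 24, 6, 24

expanded = in_ch * expansion   # 1x1 expand conv:    24 -> 144 channels
depthwise = expanded           # 3x3 depthwise conv: 144 -> 144 (filtering only)
projected = out_ch             # 1x1 projection (linear bottleneck): 144 -> 24
print(in_ch, expanded, depthwise, projected)  # 24 144 144 24

# The residual skip connection applies when in_ch == out_ch (and stride == 1).
use_residual = (in_ch == out_ch)
```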

\begin{figure}[ht!]
\centering
\includegraphics[width=1\textwidth]{photos/mobilenetv2/channel_diag.png}
\end{figure}

\begin{figure}[ht!]
\centering
\includegraphics[width=1\textwidth]{photos/mobilenetv2/expand_filter_squeeze.png}
\end{figure}

The green is an \textbf{Expand Convolution}: it increases the number of channels. The blue is a \textbf{Depthwise Convolution}: it keeps the number of channels the same and runs filters over the data. The orange is a \textbf{Projection/Bottleneck Layer}. Additionally, there is a \textbf{residual skip connection} to keep information flowing through the network.

\clearpage

\section{Other}

\begin{figure}[ht!]
\centering
\includegraphics[width=1\textwidth]{photos/mobilenetv2/stats.png}
\caption{It is more efficient than MobileNet V1.}
\end{figure}

Some blocks keep channel size the same, others expand it until the final fully connected classification layer.

\textbf{Note: SqueezeNet still has smaller memory usage.}

\clearpage

\chapter{MobileNet V3}

\begin{enumerate}
\item \href{https://arxiv.org/abs/1905.02244}{Paper}
\item \href{https://github.com/kuan-wang/pytorch-mobilenet-v3/blob/master/mobilenetv3.py}{Code Implementation}
\end{enumerate}

\section{H-Swish}

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/MobileNetV3/hswish.png}
\end{figure}

Swish is an activation function that performs better than ReLU.
$$\mathrm{swish}(x) = x \cdot \mathrm{sigmoid}(\beta x)$$
H-Swish (``hard'' swish) is more efficient on hardware.
$$\mathrm{hswish}(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6}$$
Recall $\mathrm{ReLU6}(x) = \min(\max(0, x), 6)$: zero for negative inputs, linear up to 6, then capped at 6 from there on.

H-Swish is only beneficial in the deeper layers of the network.
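The activations above can be written directly from their definitions; these are scalar reference sketches, with $\beta$ defaulting to 1.

```python
import math

# Scalar reference implementations of the activations defined above
# (beta defaults to 1).
def relu6(x):
    return min(max(0.0, x), 6.0)

def swish(x, beta=1.0):
    return x / (1.0 + math.exp(-beta * x))  # x * sigmoid(beta * x)

def hswish(x):
    return x * relu6(x + 3.0) / 6.0         # piecewise-linear approximation of swish

for x in (-4.0, 0.0, 1.0, 4.0):
    print(x, round(swish(x), 3), round(hswish(x), 3))
```

Note that h-swish is exactly 0 below $x = -3$ and exactly $x$ above $x = 3$, which is what makes it cheap in low-precision arithmetic.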

\clearpage

\section{Squeeze and Excitation}

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/MobileNetV3/squeezeandexcite.png}
\end{figure}

From the \href{https://arxiv.org/pdf/1709.01507.pdf}{Squeeze and Excite} paper (a large model, previously SOTA on ImageNet).

\begin{definition}
\textbf{Squeeze}: Global Information Embedding. Instead of each learned filter operating only on a local receptive field, a \textbf{Global Average Pooling} aggregates global information per channel.
\end{definition}

\begin{figure}[ht!]
\centering
\includegraphics[width=.6\textwidth]{photos/MobileNetV3/Excite.png}
\end{figure}

\begin{definition}
\textbf{Excitation}: Adaptive Recalibration. A gating sigmoid is applied after a nonlinear ReLU layer, and the resulting per-channel value scales the input.
\end{definition}

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/MobileNetV3/Diagram_Excite.png}
\end{figure}
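The squeeze and excitation steps above can be sketched in a few lines of numpy; the shapes, reduction ratio, and random weights are all illustrative, not from either paper.

```python
import numpy as np

# Minimal squeeze-and-excitation sketch on a (C, H, W) feature map.
# Shapes, the reduction ratio r, and the random weights are illustrative.
rng = np.random.default_rng(0)
C, H, W, r = 32, 7, 7, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1   # FC layer: C -> C/r
w2 = rng.standard_normal((C, C // r)) * 0.1   # FC layer: C/r -> C

z = x.mean(axis=(1, 2))                  # squeeze: global average pool -> (C,)
s = np.maximum(w1 @ z, 0.0)              # excitation, step 1: ReLU
s = 1.0 / (1.0 + np.exp(-(w2 @ s)))      # excitation, step 2: sigmoid gate in (0, 1)
y = x * s[:, None, None]                 # recalibration: scale each channel
print(y.shape)  # (32, 7, 7)
```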

\clearpage

\section{Network Improvements}

The architecture found by search is then hand-tuned by humans to be more efficient.

\begin{figure}[ht!]
\centering
\includegraphics[width=.8\textwidth]{photos/MobileNetV3/dropexpensive.png}
\end{figure}

The layers for the \textbf{960}, \textbf{320}, and \textbf{1280} tensors are removed. Instead, the 7x7x960 input is avg-pooled to 1x1x960, fed into 1280 1x1x960 convolution filters to give a 1x1x1280 output (the \textbf{1280} at the bottom), then fed into 1000 1x1x1280 filters to give the final 1x1x1000 output (flattened).

In the original, between the \textbf{320} and \textbf{1280} tensors, a 1x1 conv is used to expand into a high-dimensional feature space. The output is 7x7x1280, which is then avg-pooled into 1x1x1280.

Instead, the efficient version does the avg pool first. The result is 1x1x960, from which features are extracted much more cheaply by the 1x1 conv to 1280 channels.
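A back-of-the-envelope multiplication count for the expansion layer under the two orderings, using the tensor sizes from the text above (the removed 960 and 1000-way layers are ignored here):

```python
# Multiplications in the expansion layer for each head ordering,
# using the tensor sizes from the text (other layers ignored).
original = 7 * 7 * 320 * 1280    # 1x1 conv 320 -> 1280 on the 7x7 map, then avg pool
efficient = 1 * 1 * 960 * 1280   # avg pool to 1x1x960 first, then 1x1 conv to 1280
print(original, efficient, round(original / efficient, 1))
```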

The initial layers are also optimized. Usually 32 filters are used, but much of the time many of them are just mirror images of each other. Using 16 filters with the h-swish nonlinearity instead helps deal with this redundancy.

\clearpage

\chapter{FD-MobileNet}
\begin{enumerate}
\item \href{https://arxiv.org/pdf/1802.03750.pdf}{Paper}
\item \href{https://github.com/clavichord93/FD-MobileNet}{Code Implementation}
\item \href{https://github.com/clavichord93/FD-MobileNet/blob/master/pyvision/models/ImageNet/MobileNet.py}{MobileNet and FD-MobileNet models}
\end{enumerate}

\clearpage
\section{Notes}

\begin{figure}[ht!]
\centering
\includegraphics[width=.5\textwidth]{photos/fd_mobile/comparison.png}
\caption{FD (Fast Downsampling) downsamples early. This means there are fewer operations early on, and more operations after downsampling, when the feature map is smaller.}
\end{figure}

FD-MobileNet x0.25 has only 0.383M params at 43.81\% top-1 accuracy, compared to MobileNet x0.25 with 0.47M params at 54.22\%. MobileNet is far more accurate, but we are heavily hardware-limited, so I think this is promising.

\begin{figure}[ht!]
\centering
\includegraphics[width=1\textwidth]{photos/fd_mobile/code.png}
\caption{The only changes are the number of channels and the stride sizes!}
\end{figure}

\clearpage