MobileNet V3

Swish is an activation function that tends to outperform ReLU. $$swish(x) = x * sigmoid(\beta x)$$

H-Swish (hard swish) is more efficient on hardware. $$hswish(x) = x * \frac{ReLU6(x + 3)}{6}$$

Recall \( ReLU6(x) = \min(\max(0, x), 6) \): zero for negative inputs, then linear, then clamped at 6 from \( x = 6 \) onwards.
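The three activations above can be sketched in a few lines of numpy (a minimal illustration, not the production kernels):

```python
import numpy as np

def relu6(x):
    # ReLU6: zero below 0, linear up to 6, clamped at 6 afterwards
    return np.minimum(np.maximum(0.0, x), 6.0)

def swish(x, beta=1.0):
    # swish(x) = x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

def hswish(x):
    # hard swish: piecewise-linear approximation of swish
    return x * relu6(x + 3.0) / 6.0
```

The appeal of h-swish is that it needs only an add, a clamp, and a multiply, with no exponential, while staying close to swish over the useful input range.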

H-Swish is beneficial mainly in the deeper layers, so MobileNetV3 uses it only in the second half of the network, keeping ReLU in the early layers.

Squeeze and Excitation


Definition: Squeeze: global information embedding. Each learned filter operates only on a local receptive field, so a global average pooling is applied to embed global spatial information into a per-channel descriptor.


Definition: Excitation: adaptive recalibration. The channel descriptor is passed through a small gating network (FC, ReLU nonlinearity, FC, sigmoid gate), and the resulting per-channel weights are multiplied back onto the input feature maps.
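Squeeze and excitation together can be sketched as follows (a toy numpy version with random weights; the real block learns `w1`/`w2` and uses a reduction ratio `r` to keep it cheap):

```python
import numpy as np

rng = np.random.default_rng(0)

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a (C, H, W) feature map.

    w1: (C//r, C) reduction weights, w2: (C, C//r) expansion weights.
    """
    # Squeeze: global average pooling -> one scalar descriptor per channel
    z = x.mean(axis=(1, 2))                    # (C,)
    # Excitation: FC -> ReLU -> FC -> sigmoid gate in (0, 1)
    s = np.maximum(0.0, w1 @ z)                # (C//r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ s)))     # (C,)
    # Recalibrate: scale each input channel by its gate
    return x * gate[:, None, None]

# toy usage: 8 channels, reduction ratio r = 4
c, r = 8, 4
x = rng.standard_normal((c, 7, 7))
w1 = rng.standard_normal((c // r, c))
w2 = rng.standard_normal((c, c // r))
y = se_block(x, w1, w2)
```

Since the gate is a sigmoid output, every channel is scaled by a value between 0 and 1, which is what "adaptive recalibration" means in practice.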


Network Improvements

Some of the searched layers were then manually redesigned by the authors to be more efficient.


In the redesigned last stage, the projection to 320 channels and the 7x7 expansion to 1280 are removed. Instead, the 7x7x960 tensor is average-pooled to 1x1x960, fed into a 1x1 conv that expands it to 1x1x1280, and finally into a 1x1 conv that produces the 1x1x1000 class scores (flattened).
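A shape-level sketch of this efficient last stage (numpy with random weights; on a 1x1 spatial map a 1x1 conv is just a matrix-vector product, and the weight names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def hswish(x):
    return x * np.minimum(np.maximum(0.0, x + 3.0), 6.0) / 6.0

# Input to the last stage: a 7x7x960 feature map (channels-first here)
feat = rng.standard_normal((960, 7, 7))

# Step 1: global average pool -> 1x1x960
pooled = feat.mean(axis=(1, 2))                # (960,)

# Step 2: 1x1 conv 960 -> 1280 with h-swish; on a 1x1 map this is a matmul
w_expand = rng.standard_normal((1280, 960))
features = hswish(w_expand @ pooled)           # (1280,)

# Step 3: 1x1 conv 1280 -> 1000 class scores
w_cls = rng.standard_normal((1000, 1280))
logits = w_cls @ features                      # (1000,)
```
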

In the original last stage, a 1x1 conv expands the 7x7x320 tensor into a high-dimensional feature space, giving a 7x7x1280 output, which is then average-pooled to 1x1x1280.

In the efficient last stage, the average pooling is moved first: the 7x7x960 tensor becomes 1x1x960, so the 1x1 conv that expands to 1280 features runs on a single spatial position instead of 49, extracting the features at a fraction of the cost.
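The saving from reordering is easy to verify with back-of-the-envelope arithmetic on the 960 -> 1280 1x1 conv:

```python
# Multiply-accumulate cost of the 960 -> 1280 1x1 conv in each ordering
c_in, c_out, h, w = 960, 1280, 7, 7

# Original: expand at 7x7 resolution, then average-pool
original = h * w * c_in * c_out        # one MAC per output position per weight

# Efficient: average-pool to 1x1 first, then expand
efficient = 1 * 1 * c_in * c_out

print(original // efficient)           # prints 49: one MAC pass instead of 49
```

The 49x factor is exactly the number of spatial positions (7 * 7) the conv no longer has to visit.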

The initial layers are also optimized. Networks typically start with 32 filters, but many of them turn out to be mirror images of one another. MobileNetV3 instead uses 16 filters with the h-swish nonlinearity, which copes with this redundancy at lower cost while keeping accuracy.