the class prediction layer uses a convolutional layer without altering width or height of feature maps. In this way, there can be a one-to-one correspondence between outputs and inputs at the same spatial dimensions (width and height) of feature maps.
To produce valid predictions, there must be � ( � + 1 ) output channels, where for the same spatial position the output channel with index � ( � + 1 ) + � represents the prediction of the class � ( 0 ≤ � ≤ � ) for the anchor box � ( 0 ≤ � < � ).
In the following example, we construct feature maps at two different scales, Y1 and Y2, for the same minibatch, where the height and width of Y2 are half of those of Y1.
As we can see, except for the batch size dimension, the other three dimensions all have different sizes. To concatenate these two prediction outputs for more efficient computation, we will transform these tensors into a more consistent format.
Glasp is a social web highlighter that people can highlight and organize quotes and thoughts from the web, and access other like-minded people’s learning.