Abstract
To address the limitations of YOLOv3 in recognizing symbols in Zhuang patterns, namely slow detection, failure to detect small objects, and inaccurate bounding-box localization, we propose a new model, Earf-YOLO (Efficient Attention Receptive Field You Only Look Once). First, we present an attention module, CBEAM (Convolution Block Efficient Attention Module), which attends to feature maps along the channel and spatial dimensions. Within CBEAM, a local cross-channel interaction strategy without dimensionality reduction is used to improve the performance of the convolutional neural network. Second, we put forward the SRFB (Strength Receptive Field Block) structure. During training, additional branches are generated to enrich the feature space of the convolutional block; during inference, the multi-branch structure is reparameterized and fused into a single main branch to improve the performance of the model. Finally, we adopt several advanced training techniques to further improve detection performance. Experiments on a Zhuang pattern dataset and the COCO dataset show that Earf-YOLO effectively reduces the error between the predicted box and the ground-truth box and decreases computation time. The model reaches an mAP of 82.1 (IoU=0.5) on the Zhuang pattern dataset and 62.14 (IoU=0.5) on the COCO dataset.
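The idea of fusing a multi-branch training structure into a single branch at inference can be illustrated with a toy sketch. This is not the SRFB itself: it assumes a single-channel 1D signal, a 3-tap branch, a 1-tap branch, and an identity branch, with no batch normalization, and all function names (`conv1d_same`, `multi_branch`, `fuse_branches`) are hypothetical. It only shows why such a fusion can leave the output unchanged: convolution is linear, so parallel branches that are summed can be folded into one kernel.

```python
def conv1d_same(signal, kernel):
    """Cross-correlate `signal` with an odd-length `kernel`, zero-padded to keep the length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]


def multi_branch(signal, k3, k1):
    """Training-time form: 3-tap branch + 1-tap branch + identity branch, summed."""
    a = conv1d_same(signal, k3)
    b = conv1d_same(signal, k1)
    return [x + y + z for x, y, z in zip(a, b, signal)]


def fuse_branches(k3, k1):
    """Inference-time form: fold the 1-tap and identity branches into the 3-tap kernel's centre."""
    fused = list(k3)
    fused[1] += k1[0] + 1.0  # 1-tap weight plus the identity branch (weight 1)
    return fused


# Illustrative values only; they are not taken from the paper.
signal = [1.0, 2.0, -1.0, 0.5, 3.0]
k3 = [0.2, -0.5, 0.1]
k1 = [0.7]

train_out = multi_branch(signal, k3, k1)
fused_out = conv1d_same(signal, fuse_branches(k3, k1))
```

Under these assumptions `train_out` and `fused_out` agree up to floating-point rounding, while the fused form runs one convolution instead of three branches. A real detector block would additionally need to handle multiple channels and any normalization layers in each branch.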