Weakly supervised object localization (WSOL) has attracted intense interest in computer vision because it avoids the need for instance-level annotations. As a hot research topic, many existing works have concentrated on convolutional neural network (CNN)-based methods, which are powerful at extracting and representing features. The main challenge in CNN-based WSOL methods is to obtain features that cover the entire target object rather than only its most discriminative parts. To overcome this challenge and to improve the detection performance of feature-extraction-based WSOL methods, this paper presents a CNN-based two-branch model that locates objects under weak supervision. Our method consists of a detection branch and a self-attention branch. During training, the two branches interact with each other, each treating the segmentation mask produced by the other branch as its own pseudo ground-truth labels. Owing to the self-attention mechanism, our model is able to capture information from all object parts. Additionally, we embed multi-scale detection into our two-branch method so that it outputs features at two scales. We evaluated our two-branch network on the CUB-200-2011 and VOC2007 datasets. The pointing localization, intersection over union (IoU) localization, and correct localization precision (CorLoc) results demonstrate performance competitive with state-of-the-art WSOL methods.
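For readers unfamiliar with the IoU and CorLoc metrics named above, the following is a minimal sketch of how they are commonly computed for bounding boxes. It is not the evaluation code used in this work. The function names `box_iou` and `corloc`, the `(x_min, y_min, x_max, y_max)` box format, the single predicted box per image, and the 0.5 IoU threshold are all assumptions chosen for illustration.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(
    preds: Sequence[Box],
    gts: Sequence[Sequence[Box]],
    threshold: float = 0.5,  # assumed threshold, a common convention
) -> float:
    """Fraction of images whose predicted box overlaps any ground-truth
    box of that image with IoU >= threshold (one prediction per image)."""
    if not preds:
        return 0.0
    hits = sum(
        1
        for pred, gt_boxes in zip(preds, gts)
        if any(box_iou(pred, gt) >= threshold for gt in gt_boxes)
    )
    return hits / len(preds)
```

In this sketch an image counts as correctly localized as soon as its predicted box matches any one of its ground-truth boxes, so `corloc` returns a value between 0 and 1 over the evaluated images.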