Semantic information usually contains a description of the environment content, which enables mobile robot to understand the environment and improves its ability to interact with the environment. In high-level human–computer interaction application, the Simultaneous Localization and Mapping (SLAM) system not only needs higher accuracy and robustness, but also has the ability to construct a static semantic map of the environment. However, traditional visual SLAM lacks semantic information. Furthermore, in an actual scene, dynamic objects will reduce the system performance and also generate redundancy when constructing map. these all directly affect the robot’s ability to perceive and understand the surrounding environment. Based on ORB-SLAM3, this article proposes a new Algorithm that uses semantic information and the global dense optical flow as constraints to generate dynamic-static mask and eliminate dynamic objects. then, to further construct a static 3D semantic map under indoor dynamic environments, a fusion of 2D semantic information and 3D point cloud is carried out. the experimental results on different types of dataset sequences show that, compared with original ORB-SLAM3, both Absolute Pose Error (APE) and Relative Pose Error (RPE) have been ameliorated to varying degrees, especially on freiburg3-walking-xyz, the APE reduced by 97.78% from the original average value of 0.523, and RPE reduced by 52.33% from the original average value of 0.0193. Compared with DS-SLAM and DynaSLAM, our system improves real-time performance while ensuring accuracy and robustness. Meanwhile, the expected map with environmental semantic information is built, and the map redundancy caused by dynamic objects is successfully reduced. the test results in real scenes further demonstrate the effect of constructing static semantic maps and prove the effectiveness of our Algorithm.