Indoor localization has played a significant role in enabling a range of emerging applications over the past decade. This paper presents EchoSpot, a novel indoor localization solution based on inaudible acoustic sensing that relies on only one speaker and one microphone, both readily available on household audio devices. We program the speaker to periodically emit frequency-modulated continuous-wave (FMCW) chirps sweeping from 18 kHz to 23 kHz and use the co-located microphone to capture the signals reflected from the human body and the wall. By applying normalized cross-correlation to the transmitted and received signals, we estimate and profile their times of flight (ToFs). We then eliminate the interference caused by device imperfections and static objects in the environment, which allows us to identify the ToF corresponding to the direct reflection from the human body. In addition, we design a new method to estimate the ToF of the wall reflection, which helps us locate the person in two-dimensional space. We implement EchoSpot on three different types of speakers, namely Amazon Echo, Edifier R1280DB, and Logitech z200, and deploy them in real home environments for evaluation. Experimental results show that EchoSpot achieves mean localization errors of 4.1 cm, 9.2 cm, 13.1 cm, 17.9 cm, and 22.2 cm at distances of 1 m, 2 m, 3 m, 4 m, and 5 m, respectively. These results are comparable to those of state-of-the-art approaches, while EchoSpot retains the advantage of requiring only a single speaker and a single microphone.
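To illustrate the ToF estimation step, the following is a minimal sketch of correlating a received signal against an 18 kHz to 23 kHz chirp and converting the peak delay to a distance. It is not the EchoSpot implementation: the 48 kHz sampling rate, the 10 ms chirp duration, the 343 m/s speed of sound, the single simulated echo, and all function names (`make_chirp`, `normalized_xcorr`, `simulate_echo`, `estimate_distance`) are assumptions made for illustration, and the interference elimination and wall-reflection steps are omitted.

```python
import numpy as np

FS = 48_000                   # sampling rate in Hz (assumed)
F0, F1 = 18_000.0, 23_000.0   # chirp band stated in the abstract
CHIRP_DUR = 0.01              # chirp duration in seconds (assumed)
SPEED_OF_SOUND = 343.0        # m/s at roughly 20 degrees C (assumed)


def make_chirp(fs=FS, f0=F0, f1=F1, dur=CHIRP_DUR):
    """Linear chirp sweeping from f0 to f1 over dur seconds."""
    t = np.arange(int(round(fs * dur))) / fs
    slope = (f1 - f0) / dur
    return np.sin(2 * np.pi * (f0 * t + 0.5 * slope * t**2))


def normalized_xcorr(rx, tx):
    """Normalized cross-correlation of rx with template tx at lags 0..len(rx)-len(tx)."""
    n = len(tx)
    raw = np.correlate(rx, tx, mode="valid")
    # Energy of every length-n window of rx, computed with a running sum.
    csum = np.concatenate(([0.0], np.cumsum(rx**2)))
    window_energy = csum[n:] - csum[:-n]
    denom = np.sqrt(window_energy * np.sum(tx**2))
    return raw / np.maximum(denom, 1e-12)


def simulate_echo(tx, distance_m, fs=FS, c=SPEED_OF_SOUND,
                  gain=0.3, noise_std=0.01, seed=0):
    """Hypothetical test signal: one attenuated, delayed copy of tx plus noise."""
    delay = int(round(2 * distance_m / c * fs))  # round-trip delay in samples
    rx = np.zeros(len(tx) + delay + 256)
    rx[delay:delay + len(tx)] += gain * tx
    rx += np.random.default_rng(seed).normal(0.0, noise_std, len(rx))
    return rx


def estimate_distance(rx, tx, fs=FS, c=SPEED_OF_SOUND):
    """Take the strongest correlation peak as the ToF and halve the round trip."""
    ncc = normalized_xcorr(rx, tx)
    lag = int(np.argmax(np.abs(ncc)))
    tof = lag / fs
    return c * tof / 2


if __name__ == "__main__":
    tx = make_chirp()
    rx = simulate_echo(tx, distance_m=2.0)
    print(f"estimated distance: {estimate_distance(rx, tx):.3f} m")
```

In this sketch the distance resolution is bounded by the one-sample delay step, about 3.6 mm at the assumed 48 kHz sampling rate. A real recording would also contain the direct speaker-to-microphone path and reflections from static objects, which would have to be removed before the peak search.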