Accurate recognition of facial expressions and emotional gestures is promising to understand the audience's feedback and engagement on the entertainment content. Existing methods are primarily based on various cameras or wearable sensors, which either raise privacy concerns or demand extra devices. To this aim, we propose a novel ubiquitous sensing system based on the commodity microphone array --- SonicFace, which provides an accessible, unobtrusive, contact-free, and privacy-preserving solution to monitor the user's emotional expressions continuously without playing hearable sound. SonicFace utilizes a pair of speaker and microphone array to recognize various fine-grained facial expressions and emotional hand gestures by emitted ultrasound and received echoes. Based on a set of experimental evaluations, the accuracy of recognizing 6 common facial expressions and 4 emotional gestures can reach around 80%. Besides, the extensive system evaluations with distinct configurations and an extended real-life case study have demonstrated the robustness and generalizability of the proposed SonicFace system.