Abstract

In noisy, complex environments, our ability to understand audio speech benefits greatly from seeing the speaker’s face. This is attributed to the brain’s ability to integrate audio and visual information, a process known as multisensory integration. In addition, selective attention to speech in complex environments plays an enormous role in what we understand, the so-called cocktail-party phenomenon. But how attention and multisensory integration interact remains incompletely understood. While considerable progress has been made on this issue using simple and often illusory (e.g., McGurk) stimuli, relatively little is known about how attention and multisensory integration interact in the case of natural, continuous speech. Here, we addressed this issue by analyzing EEG data recorded from subjects who undertook a multisensory cocktail-party attention task using natural speech. To assess multisensory integration, we modeled the EEG responses to the speech in two ways. The first assumed that audiovisual speech processing is simply a linear combination of audio speech processing and visual speech processing (i.e., an A+V model), while the second allowed for the possibility of audiovisual interactions (i.e., an AV model). Applying these models to the data revealed that EEG responses to attended audiovisual speech were better explained by an AV model than by an A+V model, providing evidence for multisensory integration. In contrast, unattended audiovisual speech responses were best captured by an A+V model, suggesting that multisensory integration is suppressed for unattended speech. Follow-up analyses revealed some limited evidence for early multisensory integration of unattended AV speech, with no integration occurring at later levels of processing. We take these findings as evidence that the integration of natural audio and visual speech occurs at multiple levels of processing in the brain, each of which can be differentially affected by attention.
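
The logic of the A+V versus AV comparison can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not the analysis pipeline used in the study: the abstract does not specify the modeling method, so ridge regression on time-lagged stimulus features stands in for it here, and all data are synthetic. The names `lagged`, `fit_ridge`, and `compare_models`, the kernels `h_a`, `h_v`, `h_int`, and every parameter value are hypothetical. The A+V prediction sums two models trained on separate unisensory responses, while the AV model is trained directly on audiovisual responses; both are scored on held-out audiovisual data.

```python
import numpy as np

def lagged(stim, n_lags):
    """Design matrix whose columns are the stimulus delayed by 0..n_lags-1 samples."""
    n = len(stim)
    X = np.zeros((n, n_lags))
    for k in range(n_lags):
        X[k:, k] = stim[:n - k]
    return X

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression weights (an assumed stand-in for the real model)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def compare_models(integration, n_train=4000, n_test=1000, seed=0):
    """Return (r_A+V, r_AV): prediction accuracy on held-out simulated AV data."""
    rng = np.random.default_rng(seed)
    n_lags = 4
    # Hypothetical response kernels for the audio and visual features.
    h_a = np.array([1.0, 0.5, -0.5, 0.0])
    h_v = np.array([0.0, 0.5, 0.5, -0.5])
    # Assumed form of integration: the audio kernel changes when the face is visible.
    h_int = np.array([-1.0, 0.0, 1.0, 0.5]) if integration else np.zeros(n_lags)

    # Unisensory "sessions": fit the A and V models separately.
    Xa_uni = lagged(rng.standard_normal(n_train), n_lags)
    Xv_uni = lagged(rng.standard_normal(n_train), n_lags)
    w_a = fit_ridge(Xa_uni, Xa_uni @ h_a + rng.standard_normal(n_train))
    w_v = fit_ridge(Xv_uni, Xv_uni @ h_v + rng.standard_normal(n_train))

    # Audiovisual "session": simulated response with unit-variance noise.
    n = n_train + n_test
    Xa = lagged(rng.standard_normal(n), n_lags)
    Xv = lagged(rng.standard_normal(n), n_lags)
    y = Xa @ (h_a + h_int) + Xv @ h_v + rng.standard_normal(n)
    X = np.hstack([Xa, Xv])
    w_av = fit_ridge(X[:n_train], y[:n_train])  # AV model, trained on AV data

    # Score both models on the held-out portion of the AV data.
    test = slice(n_train, n)
    pred_apv = Xa[test] @ w_a + Xv[test] @ w_v  # A+V: sum of unisensory predictions
    pred_av = X[test] @ w_av
    r_apv = np.corrcoef(pred_apv, y[test])[0, 1]
    r_av = np.corrcoef(pred_av, y[test])[0, 1]
    return r_apv, r_av

if __name__ == "__main__":
    for flag in (True, False):
        r_apv, r_av = compare_models(integration=flag)
        print(f"integration={flag}: r(A+V)={r_apv:.3f}, r(AV)={r_av:.3f}")
```

In this toy setting, the AV model outperforms the A+V model only when the simulated audiovisual response deviates from the sum of the unisensory responses, which mirrors the inference drawn in the abstract: an AV advantage is read as evidence of integration, and its absence as evidence that responses are additive.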