A robust speech activity detector is important for many speech telecommunication technologies. An audio-only speech detector gives false positives when the interfering signal is speech or has speech-like characteristics. The video modality is well suited to solving this problem. This report presents the approach to and implementation of a decision-based audiovisual speech detector. Acoustic and visual features of speech are first investigated separately.
First, a common audio-based speech detection method was built. Second, mouth features were extracted from the video data using a method of our own design. The visual features were used to create a conservative visual non-speech detector. Its low false detection rate makes the visual non-speech detector suitable for ruling out
some false speech detections of an audio-only solution. Finally, the combination
of the audio detector and the video detector leads to an audiovisual
speech detector that uses basic mouth features and a common acoustic
speech detection method to outperform an audio-only solution.
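The decision-based combination described above can be sketched as follows. This is a minimal illustration, not the report's implementation: the energy threshold, the mouth-opening variance feature, and all function names are assumptions chosen to show the fusion rule, in which the conservative visual non-speech detector vetoes detections from the audio detector.

```python
import numpy as np

def audio_vad(frame, energy_threshold=0.01):
    """Energy-based audio speech detector (a common baseline method):
    flags a frame as speech when its mean squared amplitude is high."""
    return float(np.mean(frame ** 2)) > energy_threshold

def visual_nonspeech(mouth_openings, var_threshold=1e-4):
    """Conservative visual non-speech detector (assumed feature):
    flags non-speech only when the mouth opening barely varies."""
    return float(np.var(mouth_openings)) < var_threshold

def audiovisual_vad(frame, mouth_openings):
    """Decision fusion: visual non-speech evidence vetoes the audio detector."""
    return audio_vad(frame) and not visual_nonspeech(mouth_openings)

# Speech-like interfering audio while the mouth stays closed and static:
noise_frame = 0.5 * np.sin(2 * np.pi * 200 * np.arange(160) / 8000)
static_mouth = np.zeros(5)          # no mouth movement -> non-speech veto
moving_mouth = np.array([0.0, 0.3, 0.1, 0.4, 0.2])  # active articulation
print(audiovisual_vad(noise_frame, static_mouth))   # False: audio alone would fire
print(audiovisual_vad(noise_frame, moving_mouth))   # True: no visual veto
```

Because the visual detector is deliberately conservative, it only suppresses audio detections when it is confident no speech is being produced, which is how the combined detector reduces false positives without sacrificing recall.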