We propose a voice activity detector (VAD) that uses long-term (200 ms) Mel frequency band statistics, auditory masking, and a pre-trained classifier based on a two-level decision tree ensemble. This combination captures the syllable-level structure of speech and discriminates it from common noises. On the test dataset, the proposed algorithm achieves almost 100 % acceptance of clear voice for English, Chinese, Russian, and Polish speech and 100 % rejection of stationary noises, independent of loudness. The algorithm is intended as a trigger for automatic speech recognition (ASR). It reuses the short-term FFT analysis (STFFT) of the ASR frontend, adding 2 KB of memory and 15 % complexity overhead.
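As an illustration of the long-term band statistics mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a 10 ms frame hop (so 200 ms spans 20 frames), assumes that the statistics are the per-band mean and standard deviation of log energy, and uses a hypothetical class name `LongTermBandStats`; none of these details are given in the abstract. Mel band energies are taken as already computed by the frontend.

```python
import math
from collections import deque

FRAME_HOP_MS = 10   # assumption: frontend frame hop, not stated in the abstract
WINDOW_MS = 200     # long-term window length stated in the abstract
WINDOW_FRAMES = WINDOW_MS // FRAME_HOP_MS


class LongTermBandStats:
    """Hypothetical helper: per-band mean and standard deviation of
    log energy over a sliding window of roughly 200 ms."""

    def __init__(self, num_bands, window_frames=WINDOW_FRAMES):
        self.num_bands = num_bands
        # deque with maxlen drops the oldest frame automatically
        self.history = deque(maxlen=window_frames)

    def update(self, band_energies):
        """Add one frame of Mel band energies; return (means, stds)."""
        if len(band_energies) != self.num_bands:
            raise ValueError("unexpected number of bands")
        # Floor the energy to avoid log(0) on silent frames
        self.history.append([math.log(max(e, 1e-10)) for e in band_energies])
        n = len(self.history)
        means = [sum(f[b] for f in self.history) / n
                 for b in range(self.num_bands)]
        stds = [math.sqrt(sum((f[b] - means[b]) ** 2 for f in self.history) / n)
                for b in range(self.num_bands)]
        return means, stds
```

In the log domain, scaling the input by a constant gain shifts the mean but leaves the standard deviation unchanged, which is one plausible reason why such statistics could separate stationary noise (low deviation) from syllable-modulated speech regardless of loudness.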
Original language: English
Pages (from-to): 352-358
Journal: Lecture Notes in Computer Science
Volume: 9924
State: Published - 2016
Event: International Conference on Text, Speech, and Dialogue 2016 - Brno, Czech Republic
Duration: 12 Apr 2016 - 16 Apr 2016
Conference number: 19
https://www.tsdconference.org/tsd2016/

    Research areas

  • Voice Activity Detector
  • Classification
  • Decision tree ensemble
  • Auditory masking

ID: 7595429