We propose a voice activity detector (VAD) that uses long-term (200 ms) Mel frequency band statistics, auditory masking, and a pre-trained classifier based on a two-level decision tree ensemble. This combination captures the syllable-level structure of speech and discriminates it from common noises. On the test dataset, the proposed algorithm demonstrates almost 100 % acceptance of clear voice for English, Chinese, Russian, and Polish speech and 100 % rejection of stationary noises, independently of loudness. The algorithm is intended for use as a trigger for ASR. It reuses the short-term FFT analysis (STFFT) from the ASR frontend, with an additional 2 KB of memory and a 15 % complexity overhead.
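To illustrate what "long-term 200 ms Mel frequency band statistics" could look like in practice, the following is a minimal sketch and not the authors' implementation. It assumes the ASR frontend already provides log Mel-band energies per frame, that frames arrive every 10 ms, and that the statistics are a per-band mean and variance. The abstract specifies none of these details, and the function name `long_term_band_stats` is hypothetical.

```python
from statistics import fmean, pvariance


def long_term_band_stats(log_mel_frames, hop_ms=10.0, window_ms=200.0):
    """Per-band mean and variance over a sliding long-term window.

    log_mel_frames: list of frames, each a list of log Mel-band energies
                    (assumed to come from an existing short-term FFT frontend).
    hop_ms:         assumed frame hop; 10 ms is an illustrative default.
    window_ms:      long-term window length; 200 ms as stated in the abstract.

    Returns one (means, variances) pair per full window position.
    """
    frames_per_window = int(round(window_ms / hop_ms))
    if frames_per_window < 1:
        raise ValueError("window_ms must cover at least one frame")

    stats = []
    for end in range(frames_per_window, len(log_mel_frames) + 1):
        window = log_mel_frames[end - frames_per_window:end]
        # Transpose so that each tuple holds one band's values across the window.
        bands = list(zip(*window))
        means = [fmean(band) for band in bands]
        variances = [pvariance(band) for band in bands]
        stats.append((means, variances))
    return stats


# Toy input: band 0 steps from 0.0 to 2.0 halfway through, band 1 stays constant.
frames = [[0.0, 1.0]] * 20 + [[2.0, 1.0]] * 20
stats = long_term_band_stats(frames)
```

In this toy input, a band whose energy changes within the window shows a nonzero variance, while a constant band shows zero variance. A classifier could use features of this kind to separate syllable-like modulation from stationary noise, although the features actually used in the paper may differ.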
Original language: English
Pages (from-to): 352-358
Journal: Lecture Notes in Computer Science
Volume: 9924
DOI
Status: Published - 2016
Event: International Conference on Text, Speech, and Dialogue 2016 - Brno, Czech Republic
Duration: 12 Apr 2016 to 16 Apr 2016
Conference number: 19
https://www.tsdconference.org/tsd2016/

ID: 7595429