ABSTRACT

The speech command dataset comes with torchaudio, a package that does for auditory data what torchvision does for images and video. As of this writing, there are two versions of the dataset; the one provided by torchaudio is the first. All recordings have been sampled at the same rate. Their length almost always equals one second; the very few that are slightly longer can safely be truncated. Imagine that, instead of being represented as a sequence of amplitudes over time, a sound wave were represented in a way that carried no information about time at all, only about which frequencies it contains; this is what the Fourier transform provides. Human hearing, however, does not resolve all frequencies equally well: differences between low frequencies are perceived more clearly than equally sized differences between high ones. To accommodate this physiological precondition, one sometimes converts the Fourier coefficients to the so-called Mel scale. Various formulae exist that do this; in one way or another, they always include taking the logarithm. In practice, though, what is usually done is to create overlapping filters that aggregate sets of Fourier coefficients into a new representation, the Mel coefficients.
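
The steps described above (fixing the clip length, taking Fourier coefficients, and aggregating them with overlapping filters) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the torchaudio implementation: it uses plain NumPy, the helper names `fix_length`, `hz_to_mel`, `mel_to_hz`, and `mel_filterbank` are hypothetical, and the sampling rate of 16 kHz, the frame size of 512, and the choice of 40 filters are assumed values that the text does not state. The Hz-to-Mel conversion shown is just one of the several formulae in use.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed value; the text only says all clips share one rate


def fix_length(wave, n_samples=SAMPLE_RATE):
    """Truncate (or zero-pad) a 1-D waveform to exactly n_samples."""
    wave = np.asarray(wave, dtype=np.float64)[:n_samples]
    return np.pad(wave, (0, n_samples - wave.shape[0]))


def hz_to_mel(f):
    # One commonly used formula among several; note the logarithm.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate=SAMPLE_RATE):
    """Overlapping triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    # n_mels + 2 points equally spaced on the Mel scale;
    # filter i rises from edges[i] to edges[i + 1] and falls to edges[i + 2].
    max_mel = float(hz_to_mel(sample_rate / 2))
    edges = mel_to_hz(np.linspace(0.0, max_mel, n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.shape[0]))
    for i in range(n_mels):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


# A synthetic 440 Hz tone that is slightly longer than one second.
t = np.arange(SAMPLE_RATE + 40) / SAMPLE_RATE
wave = fix_length(np.sin(2 * np.pi * 440.0 * t))

n_fft = 512
frame = wave[:n_fft] * np.hanning(n_fft)    # one windowed frame
power = np.abs(np.fft.rfft(frame)) ** 2     # squared magnitudes of the Fourier coefficients
bank = mel_filterbank(n_mels=40, n_fft=n_fft)
mel_coeffs = bank @ power                   # one Mel coefficient per filter
```

Because neighbouring triangles share their edges, each Fourier coefficient generally contributes to two adjacent Mel coefficients, which is what "overlapping" means here. The filters are narrow at low frequencies and wide at high ones, mirroring the logarithmic spacing of the Mel scale.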