ABSTRACT

Traditional audio synthesis techniques are based mainly on Fourier theory. Virtually any sound can be reconstructed from an analytical representation of its spectrum, but the main disadvantage of such methods is a lack of "naturalness". Naturalness in sound synthesis is very difficult to model by means of mathematical formulas or linear processes, because the phenomena involved are primarily non-linear. A linguistic rather than a mathematical description of such processes is simpler and more effective. Soft-computing methodologies (neural networks, fuzzy logic, genetic algorithms, smart logic, etc.) have been shown to efficiently solve very complex non-linear problems, such as pattern recognition and the automatic control of very complex systems.
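
As a point of reference for the Fourier-based approach described above, the following is a minimal sketch of additive synthesis, in which a waveform is rebuilt as a sum of sinusoidal partials taken from a spectral description. The function name and the partial values are illustrative placeholders, not parameters drawn from any system discussed here.

```python
import numpy as np

def additive_synthesis(partials, duration_s=1.0, sample_rate=44100):
    """Reconstruct a waveform as a sum of sinusoidal partials.

    `partials` is a list of (frequency_hz, amplitude, phase_rad) tuples,
    i.e. an analytical description of the desired spectrum.
    """
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    signal = np.zeros_like(t)
    for freq, amp, phase in partials:
        signal += amp * np.sin(2.0 * np.pi * freq * t + phase)
    return signal

# A crude harmonic tone built from three placeholder partials.
tone = additive_synthesis([(440.0, 1.0, 0.0), (880.0, 0.5, 0.0), (1320.0, 0.25, 0.0)])
```

A fixed partial list like this reproduces the prescribed spectrum faithfully, which is precisely why such output tends to sound artificial: the non-linear micro-variations that convey naturalness are not captured by a static analytical description.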

Since the 1990s, a great deal of research has applied neural networks to audio pattern-recognition problems. Speech recognition is probably the most targeted, because it is among the most complex to solve. On the other hand, there has been relatively little research concerning pattern generation. Sejnowski and Rosenberg [1] were the first researchers to successfully apply a neural network (NETtalk) to speech synthesis, training a three-layer back-propagation network (BPN) to perform text-to-phoneme conversion and to generate the corresponding phonetic parameters. Since Sejnowski and Rosenberg's work, other speech-synthesis research based on neural networks has been reported. Scordilis and Gowdy [2] implemented two parallel neural networks that derive pitch frequency and its variability from text. Those efforts demonstrate that a neural network can learn to efficiently control even prosody during speech synthesis, provided an extensive corpus is used to train the networks.

Some projects have also involved audio synthesis. A pioneering effort was the "Neural-Network Audio Synthesizer" developed by Thorson, Warthman, and Holler [3] using an Intel 80170NX ETANN (electrically trainable analog neural network), the first implementation of a complete neural network on a single chip. Thorson, Warthman, and Holler designed and implemented in hardware a synthesizer capable of generating a remarkable range of audio effects by configuring ETANN neurons in weighted loops with programmable synaptic weights and feedback paths. Space-age and science-fiction sounds were produced, but also natural sounds such as heartbeats, drums, gongs, porpoises, birds, engines, and music from instruments such as violas and flutes. It was also demonstrated in practice that a neural network can be trained to generate waveforms as a function generator does. Unlike a function generator, however, a trained neural network is not a table-lookup system but one capable of generalizing from a relatively limited set of examples. The result is naturalness in the response.

Very little research has involved using fuzzy logic to model speech and audio synthesis. Some work has relied on the neuro-fuzzy combination, primarily for musical process control [4] or for cognitive exploration [5]. Peter Elsea presented an article in 1995 showing how fuzzy logic can be applied to typical music and composition-analysis problems [6]. Aguilar and Salinas [7] successfully used fuzzy logic to tune a synthesized musical instrument. Their work demonstrated that fuzzy logic can provide a good solution to audio-synthesis system control: a human expert can tune a set of rules that the fuzzy-logic controller then emulates. We have had experimental success using fuzzy-logic modeling to solve audio problems such as audio-event end-point detection (EPD) [8] and voice-activity detection (VAD). We now propose a mixed-method approach to developing a smart audio synthesizer in which a fuzzy-logic engine is tuned to drive a neural network trained to generate audio frames.
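
To make the proposed combination concrete, the sketch below shows one plausible arrangement in which a small fuzzy-logic stage maps measured input features to a crisp control value that conditions a frame-generating network. The rule set, the feature names (energy, pitch), and the network layout are hypothetical illustrations assumed for this example; the weights shown are random stand-ins for a trained model.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function supported on [a, c], peaking at b."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))

def fuzzy_brightness(energy, pitch):
    """Toy fuzzy controller: two illustrative rules map normalized
    inputs (0..1) to a crisp 'brightness' control value."""
    # Rule 1: IF energy is high AND pitch is high THEN brightness is high.
    r1 = min(tri(energy, 0.4, 1.0, 1.6), tri(pitch, 0.4, 1.0, 1.6))
    # Rule 2: IF energy is low OR pitch is low THEN brightness is low.
    r2 = max(tri(energy, -0.6, 0.0, 0.6), tri(pitch, -0.6, 0.0, 0.6))
    # Sugeno-style weighted-average defuzzification over the two rules.
    return (r1 * 1.0 + r2 * 0.2) / (r1 + r2 + 1e-9)

def frame_network(control, frame_len=256, hidden=32, seed=0):
    """Stand-in for the trained frame generator: a small feed-forward net
    whose input is the fuzzy control value and whose output is one audio
    frame. Random weights are used here purely for illustration."""
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((hidden, 1))
    b1 = rng.standard_normal(hidden)
    w2 = rng.standard_normal((frame_len, hidden))
    h = np.tanh(w1 @ np.array([control]) + b1)
    return np.tanh(w2 @ h)

# One frame: the fuzzy engine converts measured features into a control
# value, and the network renders the corresponding audio frame.
frame = frame_network(fuzzy_brightness(energy=0.8, pitch=0.6))
```

The division of labour mirrors the argument above: the linguistic rule base is easy for a human expert to tune, while the trained network supplies the non-linear detail that is hard to express in formulas.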