ALBERT

All Library Books, journals and Electronic Records Telegrafenberg


Filter
Collection
  • Articles  (615)
  • Data
Publisher
  • Springer  (615)
Journal
  • EURASIP Journal on Audio, Speech, and Music Processing  (198)
Topic
  • Electrical Engineering, Measurement and Control Technology  (615)
  • 1
    Publication Date: 2015-08-08
    Description: Spoken term detection (STD) aims at retrieving data from a speech repository given a textual representation of the search term. It is currently receiving much interest due to the large volume of available multimedia information. STD differs from automatic speech recognition (ASR) in that ASR is interested in all the terms/words that appear in the speech data, whereas STD focuses on a selected list of search terms that must be detected within the speech data. This paper presents the systems submitted to the ALBAYZIN 2014 STD evaluation, held as a part of the ALBAYZIN 2014 evaluation campaign within the context of the IberSPEECH 2014 conference. This is the first STD evaluation that deals with the Spanish language. The evaluation consists of retrieving the speech files that contain the search terms, indicating their start and end times within the appropriate speech file, along with a score value that reflects the confidence given to the detection of the search term. The evaluation is conducted on a Spanish spontaneous speech database, which comprises a set of talks from workshops and amounts to about 7 h of speech. We present the database, the evaluation metrics, the systems submitted to the evaluation, the results, and a detailed discussion. Four different research groups took part in the evaluation. Evaluation results show reasonable performance for moderate out-of-vocabulary term rates. This paper compares the systems submitted to the evaluation and presents a detailed analysis based on several search term properties (term length, in-vocabulary/out-of-vocabulary terms, single-word/multi-word terms, and in-language/foreign terms).
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 2
    Publication Date: 2015-08-14
    Description: Support vector machines (SVMs) have played an important role in state-of-the-art language recognition systems. The recently developed extreme learning machine (ELM) tends to have better scalability and to achieve similar or much better generalization performance at a much faster learning speed than the traditional SVM. Inspired by these attractive properties of the ELM, we propose in this paper a novel method called regularized minimum class variance extreme learning machine (RMCVELM) for language recognition. The RMCVELM aims at simultaneously minimizing the empirical risk, the structural risk, and the intra-class variance of the training data in the decision space. The proposed method, which is computationally inexpensive compared to SVM, suggests a new classifier for language recognition and is evaluated on the 2009 National Institute of Standards and Technology (NIST) language recognition evaluation (LRE). Experimental results show that the proposed RMCVELM obtains much better performance than SVM. In addition, the RMCVELM can also be applied to the popular i-vector space and achieves results comparable to the existing scoring methods.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
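The closed-form training that gives the ELM its speed advantage over SVMs can be sketched as follows. This is a minimal, hypothetical sketch of a basic ridge-regularized ELM in Python with NumPy, not the RMCVELM variant proposed in the paper; the toy data and all parameter values are illustrative assumptions.

```python
import numpy as np

def elm_train(X, T, n_hidden=64, C=1.0, seed=0):
    # Hidden weights are random and fixed; only the output weights are
    # solved in closed form, which is what makes ELM training fast.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)                      # random feature map
    # ridge-regularized least squares for the output weights
    beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# toy two-class problem with one-hot targets; predict via argmax
X = np.vstack([np.random.default_rng(1).normal(-2, 0.5, (50, 2)),
               np.random.default_rng(2).normal(+2, 0.5, (50, 2))])
T = np.repeat(np.eye(2), 50, axis=0)
W, b, beta = elm_train(X, T)
acc = np.mean(elm_predict(X, W, b, beta).argmax(1) == T.argmax(1))
```

On such well-separated toy clusters the closed-form readout classifies essentially all points correctly.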
  • 3
    Publication Date: 2015-08-15
    Description: We investigate the automatic recognition of emotions in the singing voice and study the worth and role of a variety of relevant acoustic parameters. The data set contains phrases and vocalises sung by eight renowned professional opera singers in ten different emotions and a neutral state. The states are mapped to ternary arousal and valence labels. We propose a small set of relevant acoustic features based on our previous findings on the same data and compare it with a large-scale state-of-the-art feature set for paralinguistics recognition, the baseline feature set of the Interspeech 2013 Computational Paralinguistics ChallengE (ComParE). A feature importance analysis with respect to classification accuracy and correlation of features with the targets is provided in the paper. Results show that the classification performance with both feature sets is similar for arousal, while the ComParE set is superior for valence. Intra-singer feature ranking criteria further improve the classification accuracy significantly in a leave-one-singer-out cross-validation.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 4
    Publication Date: 2015-09-15
    Description: In recent years, deep learning has permeated not only the computer vision and speech recognition research fields but also fields such as acoustic event detection (AED). One of the aims of AED is to detect and classify non-speech acoustic events occurring in conversation scenes, including those produced by both humans and the objects that surround us. In AED, deep learning has enabled modeling of detail-rich features, and among these, high-resolution spectrograms have shown a significant advantage over existing predefined features (e.g., Mel-filter bank) that compress and reduce detail. In this paper, we further assess the importance of feature extraction for deep learning-based acoustic event detection. AED based on spectrogram-input deep neural networks exploits the fact that sounds have “global” spectral patterns, but sounds also have “local” properties such as being more transient or smoother in the time-frequency domain. These can be exposed by adjusting the time-frequency resolution used to compute the spectrogram, or by using a model that exploits locality, leading us to explore two different feature extraction strategies in the context of deep learning: (1) using multiple resolution spectrograms simultaneously and analyzing the overall and event-wise influence to combine the results, and (2) introducing the use of convolutional neural networks (CNN), a state-of-the-art 2D feature extraction model that exploits local structures, with log power spectrogram input for AED. An experimental evaluation shows that the approaches we describe outperform our state-of-the-art deep learning baseline, with a noticeable gain in the CNN case, and provide insights regarding CNN-based spectrogram characterization for AED.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
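The first feature extraction strategy above, analyzing the same signal at several time-frequency resolutions, can be sketched as follows; the window lengths and hop size are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def log_spectrogram(x, win, hop):
    # frame the signal, apply a Hann window, and take log power spectra
    frames = np.stack([x[i:i + win] * np.hanning(win)
                       for i in range(0, len(x) - win + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)                 # 1 s test tone

# short window: fine time resolution; long window: fine frequency resolution
specs = {win: log_spectrogram(x, win, hop=160) for win in (256, 1024)}
shapes = {win: s.shape for win, s in specs.items()}
```

A deep network can then consume the resolutions jointly (e.g., stacked as channels) or separately with a later fusion of scores, as the abstract describes.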
  • 5
    Publication Date: 2015-09-26
    Description: The identity of musical instruments is reflected in the acoustic attributes of musical notes played with them. Recently, it has been argued that these characteristics of musical identity (or timbre) can be best captured through an analysis that encompasses both time and frequency domains, with a focus on the modulations or changes in the signal in the spectrotemporal space. This representation mimics the spectrotemporal receptive field (STRF) analysis believed to underlie processing in the central mammalian auditory system, particularly at the level of primary auditory cortex. How well this STRF representation captures the timbral identity of musical instruments in continuous solo recordings remains unclear. The current work investigates the applicability of the STRF feature space for instrument recognition in solo musical phrases and explores the best approaches to leveraging knowledge from isolated musical notes for instrument recognition in solo recordings. The study presents an approach for parsing solo performances into their individual note constituents and adapting back-end classifiers using support vector machines to achieve a generalization of instrument recognition to off-the-shelf, commercially available solo music.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 6
    Publication Date: 2015-11-26
    Description: The need for a large amount of parallel data is a major hurdle to the practical use of voice conversion (VC). This paper presents a novel framework of exemplar-based VC that only requires a small number of parallel exemplars. In our previous work, a VC technique using non-negative matrix factorization (NMF) for noisy environments was proposed. This method requires parallel exemplars (which consist of source exemplars and target exemplars that have the same texts uttered by the source and target speakers) for dictionary construction. In the framework of conventional Gaussian mixture model (GMM)-based VC, some approaches that do not need parallel exemplars have been proposed. However, in the framework of exemplar-based VC for noisy environments, no such method has been proposed. In this paper, an adaptation matrix in an NMF framework is introduced to adapt the source dictionary to the target dictionary. This adaptation matrix is estimated using only a small parallel speech corpus. We refer to this method as affine NMF, and its effectiveness has been confirmed by comparison with a conventional NMF-based method and a GMM-based method in noisy environments.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
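The exemplar-based conversion step that the adaptation matrix builds on can be sketched as follows. This is a toy illustration of plain NMF-based exemplar mapping, not the proposed affine NMF: non-negative activations are estimated against the source dictionary and reused with the paired target dictionary. The random dictionaries stand in for real parallel exemplars.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 20, 8                       # feature dim, number of exemplars
A_src = rng.random((D, K))         # source exemplar dictionary
A_tgt = rng.random((D, K))         # parallel target exemplar dictionary
x = A_src @ rng.random(K)          # a source frame that A_src can explain

# multiplicative updates for min ||x - A_src h||, h >= 0
h = np.full(K, 0.5)
for _ in range(1000):
    h *= (A_src.T @ x) / (A_src.T @ (A_src @ h) + 1e-12)

y = A_tgt @ h                      # converted frame: same activations, target dictionary
err = np.linalg.norm(x - A_src @ h) / np.linalg.norm(x)
```

The affine NMF idea in the abstract replaces the fixed parallel target dictionary with a source dictionary transformed by an adaptation matrix estimated from a small parallel corpus.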
  • 7
    Publication Date: 2016-07-09
    Description: A new voice activity detection algorithm based on long-term pitch divergence is presented. The long-term pitch divergence not only decomposes speech signals with a bionic decomposition but also makes full use ...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 8
    Publication Date: 2013-09-18
    Description: Query-by-example spoken term detection (QbE STD) aims at retrieving data from a speech data repository given an acoustic query containing the term of interest as input. It has been receiving much interest due to the high volume of information stored in audio or audiovisual format. QbE STD differs from automatic speech recognition (ASR) and keyword spotting (KWS)/spoken term detection (STD) in that ASR is interested in all the terms/words that appear in the speech signal, whereas KWS/STD relies on a textual transcription of the search term to retrieve the speech data. This paper presents the systems submitted to the ALBAYZIN 2012 QbE STD evaluation, held as a part of the ALBAYZIN 2012 evaluation campaign within the context of the IberSPEECH 2012 conference. The evaluation consists of retrieving the speech files that contain the input queries, indicating their start and end timestamps within the appropriate speech file. The evaluation is conducted on a Spanish spontaneous speech database containing a set of talks from MAVIR workshops, which amounts to about 7 h of speech in total. We present the database, the evaluation metrics, and the systems submitted, along with all results and some discussion. Four different research groups took part in the evaluation. Evaluation results show the difficulty of this task, and the limited performance indicates there is still a lot of room for improvement. The best result is achieved by a dynamic time warping-based search over Gaussian posteriorgrams/posterior phoneme probabilities. This paper also compares the systems, aiming to establish the best technique for this difficult task and to identify promising directions for this relatively novel task.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
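The best-performing approach named above, a dynamic time warping search over posteriorgrams, can be sketched as follows; the posteriorgrams here are synthetic stand-ins for the frame-level output of a phoneme classifier, and the cosine-style local distance is one common choice.

```python
import numpy as np

def dtw(query, utt):
    # local distance: 1 - cosine similarity between posterior vectors
    qn = query / np.linalg.norm(query, axis=1, keepdims=True)
    un = utt / np.linalg.norm(utt, axis=1, keepdims=True)
    dist = 1.0 - qn @ un.T
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m] / (n + m)     # length-normalized alignment cost

rng = np.random.default_rng(0)
query = rng.random((10, 5))
match = np.repeat(query, 2, axis=0)        # same content, stretched in time
other = rng.random((20, 5))
cost_match, cost_other = dtw(query, match), dtw(query, other)
```

A low normalized cost signals that the query likely occurs in the utterance; a full QbE STD system would additionally search over start positions and report timestamps.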
  • 9
    Publication Date: 2013-06-07
    Description: A novel speech bandwidth extension method based on audio watermarking is presented in this paper. The time-domain and frequency-domain envelope parameters are extracted from the high-frequency components of the speech signal, and these parameters are then embedded in the corresponding narrowband speech bit stream by a modified least significant bit watermarking method that exploits perceptual properties. At the decoder, the wideband speech is reproduced by reconstructing the high-frequency components from the parameters extracted from the bit stream of the narrowband speech. The proposed method can reduce the poor auditory effects caused by large local distortions. The simulation results show that the synthesized wideband speech has low spectral distortion and that its speech perception quality is greatly improved.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
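The basic least-significant-bit mechanics underlying the modified watermarking method can be sketched as follows; this shows only plain LSB embedding and extraction on 16-bit samples, without the perceptual weighting the paper adds.

```python
import numpy as np

def embed_lsb(samples, bits):
    # overwrite bit 0 of the first len(bits) samples with the payload
    out = samples.copy()
    out[:len(bits)] = (out[:len(bits)] & ~1) | bits
    return out

def extract_lsb(samples, n):
    return samples[:n] & 1

rng = np.random.default_rng(0)
speech = rng.integers(-2**15, 2**15, 200, dtype=np.int16)
payload = rng.integers(0, 2, 32, dtype=np.int16)
marked = embed_lsb(speech, payload)
recovered = extract_lsb(marked, 32)
# each sample changes by at most one quantization step
max_change = np.max(np.abs(marked.astype(int) - speech.astype(int)))
```

Because only the lowest bit changes, the distortion is at most one quantization step per sample, which is what makes LSB hiding perceptually unobtrusive.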
  • 10
    Publication Date: 2015-05-13
    Description: Deep neural network (DNN)-based approaches have been shown to be effective in many automatic speech recognition systems. However, few works have focused on DNNs for distant-talking speaker recognition. In this study, a bottleneck feature derived from a DNN and a cepstral domain denoising autoencoder (DAE)-based dereverberation are presented for distant-talking speaker identification, and a combination of these two approaches is proposed. For the DNN-based bottleneck feature, we noted that DNNs can transform the reverberant speech feature to a new feature space with greater discriminative classification ability for distant-talking speaker recognition. Conversely, cepstral domain DAE-based dereverberation tries to suppress the reverberation by mapping the cepstrum of reverberant speech to that of clean speech with the expectation of improving the performance of distant-talking speaker recognition. Since the DNN-based discriminant bottleneck feature and DAE-based dereverberation have a strong complementary nature, the combination of these two methods is expected to be very effective for distant-talking speaker identification. A speaker identification experiment was performed on a distant-talking speech set, with reverberant environments differing from the training environments. In suppressing late reverberation, our method outperformed some state-of-the-art dereverberation approaches such as the multichannel least mean squares (MCLMS). Compared with the MCLMS, we obtained a reduction in relative error rates of 21.4% for the bottleneck feature and 47.0% for the autoencoder feature. Moreover, the combination of likelihoods of the DNN-based bottleneck feature and DAE-based dereverberation further improved the performance.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 11
    Publication Date: 2015-05-08
    Description: Estimating the directions of arrival (DOAs) of multiple simultaneous mobile sound sources is an important step for various audio signal processing applications. In this contribution, we present an approach that improves upon our previous work and is now able to estimate the DOAs of multiple mobile speech sources, while being light in resources, both hardware-wise (only using three microphones) and software-wise. This approach takes advantage of the fact that simultaneous speech sources do not completely overlap each other. To evaluate the performance of this approach, a multi-DOA estimation evaluation system was developed based on a corpus collected from different acoustic scenarios, named Acoustic Interactions for Robot Audition (AIRA).
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 12
    Publication Date: 2016-04-13
    Description: Automatic speech recognition is becoming more ubiquitous as recognition performance improves, capable devices increase in number, and areas of new application open up. Neural network acoustic models that can u...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 13
    Publication Date: 2015-12-31
    Description: Using a proper distribution function for the speech signal or for its representations is of crucial importance in statistical speech processing algorithms. Although the most commonly used probability density function (pdf) for speech signals is Gaussian, recent studies have shown the superiority of super-Gaussian pdfs. A large research effort has focused on the univariate case of the speech signal distribution; in this paper, however, we study the multivariate distributions of the speech signal and its representations, using the conventional distribution functions, e.g., multivariate Gaussian and multivariate Laplace, and copula-based multivariate distributions as candidates. The copula-based technique is a powerful method for modeling non-Gaussian multivariate distributions with non-linear inter-dimensional dependency. The level of similarity between the candidate pdfs and the real speech pdf in different domains is evaluated using the energy goodness-of-fit test. In our evaluations, the best-fitted distributions for speech signal vectors with different lengths in various domains are determined. A similar experiment is performed for different classes of English phonemes (fricatives, nasals, stops, vowels, and semivowels/glides). The evaluation results demonstrate that the multivariate distribution of speech signals in different domains is mostly super-Gaussian, except for Mel-frequency cepstral coefficients. Also, the results confirm that the distribution of the different phoneme classes is better statistically modeled by a mixture of Gaussian and Laplace pdfs. The copula-based distributions provide better statistical modeling for vectors representing the discrete Fourier transform (DFT) amplitude of speech vectors with a length shorter than 500 ms.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 14
    Publication Date: 2016-03-06
    Description: Time-frequency (T-F) masking is an effective method for stereo speech source separation. However, reliable estimation of the T-F mask from sound mixtures is a challenging task, especially when room reverberati...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 15
    Publication Date: 2019
    Description: We propose a new method for music detection from broadcasting contents using the convolutional neural networks with a Mel-scale kernel. In this detection task, music segments should be annotated from the broad...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 16
    Publication Date: 2019
    Description: A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing focus and has achieved impressive performance. A hybrid end-to-end architec...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 17
    Publication Date: 2019
    Description: Search on speech (SoS) is a challenging area due to the huge amount of information stored in audio and video repositories. Spoken term detection (STD) is an SoS-related task aiming to retrieve data from a spee...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 18
    Publication Date: 2019
    Description: Speech emotion recognition methods combining articulatory information with acoustic features have been previously shown to improve recognition performance. Collection of articulatory data on a large scale may ...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 19
    Publication Date: 2015-10-21
    Description: No description available
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 20
    Publication Date: 2015-10-22
    Description: In this paper, a semi-fragile and blind digital speech watermarking technique for online speaker recognition systems is proposed, based on the discrete wavelet packet transform (DWPT) and quantization index modulation (QIM), which enables embedding of the watermark within an angle of the wavelet’s sub-bands. To minimize the degradation effects of the watermark, these sub-bands were selected from frequency ranges where little speaker-specific information was available (500–3500 Hz and 6000–7000 Hz). Experimental results on the TIMIT, MIT, and MOBIO speech databases show that the degradation results for speaker verification and identification are 0.39% and 0.97%, respectively, which is negligible. In addition, the proposed watermarking technique can provide the appropriate fragility required for different signal processing operations.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
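The quantization index modulation step mentioned above can be sketched as follows; this toy version embeds bits into plain coefficients rather than the angles of DWPT sub-bands used in the paper, and the step size is an illustrative assumption. Each coefficient is quantized to one of two interleaved lattices, chosen by the payload bit, and the nearer lattice at the decoder reveals the bit.

```python
import numpy as np

DELTA = 0.1  # quantization step (illustrative)

def qim_embed(coeffs, bits):
    # quantize each coefficient to the lattice offset by bit * DELTA/2
    offset = bits * (DELTA / 2)
    return np.round((coeffs - offset) / DELTA) * DELTA + offset

def qim_extract(coeffs):
    # even multiples of DELTA/2 encode 0, odd multiples encode 1
    return (np.abs(coeffs / (DELTA / 2)) % 2).round().astype(int) % 2

rng = np.random.default_rng(0)
coeffs = rng.normal(0, 1, 64)
bits = rng.integers(0, 2, 64)
marked = qim_embed(coeffs, bits)
recovered = qim_extract(marked)
```

The semi-fragile behavior in the abstract comes from the step size: small perturbations stay within a lattice cell and the bits survive, while stronger processing pushes coefficients across cells and breaks the watermark.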
  • 21
    Publication Date: 2015-10-22
    Description: The presence of physical task stress induces changes in the speech production system which in turn produces changes in speaking behavior. This results in measurable acoustic correlates including changes to formant center frequencies, breath pause placement, and fundamental frequency. Many of these changes are due to the subject’s internal competition between speaking and breathing during the performance of the physical task, which has a corresponding impact on muscle control and airflow within the glottal excitation structure as well as vocal tract articulatory structure. This study considers the effect of physical task stress on voice quality. Three signal processing-based values which include (i) the normalized amplitude quotient (NAQ), (ii) the harmonic richness factor (HRF), and (iii) the fundamental frequency are used to measure voice quality. The effects of physical stress on voice quality depend on the speaker as well as the specific task. While some speakers do not exhibit changes in voice quality, a subset exhibits changes in NAQ and HRF measures of similar magnitude to those observed in studies of soft, loud, and pressed speech. For those speakers demonstrating voice quality changes, the observed changes tend toward breathy or soft voicing as observed in other studies. The effect of physical stress on the fundamental frequency is correlated with the effect of physical stress on the HRF (r = −0.34) and the NAQ (r = −0.53). Also, the inter-speaker variation in baseline NAQ is significantly higher than the variation in NAQ induced by physical task stress. The results illustrate systematic changes in speech production under physical task stress, which in theory will impact subsequent speech technology such as speech recognition, speaker recognition, and voice diarization systems.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 22
    Publication Date: 2015-07-17
    Description: Acoustic data transmission (ADT) forms a branch of the audio data hiding techniques with its capability of communicating data in short-range aerial space between a loudspeaker and a microphone. In this paper, we propose an acoustic data transmission system extending our previous studies and give an in-depth analysis of its performance. The proposed technique utilizes the phases of modulated complex lapped transform (MCLT) coefficients of the audio signal. To achieve a good trade-off between the audio quality and the data transmission performance, an enhanced segmental SNR adjustment (SSA) algorithm is proposed. Moreover, we also propose a scheme to use multiple microphones for the ADT technique. This multi-microphone ADT technique further enhances the transmission performance while ensuring compatibility with the single-microphone system. From a series of experimental results, it has been found that the transmission performance improves as the MCLT frame length increases, at the cost of audio quality degradation. In addition, a good trade-off between the audio quality and data transmission performance is achieved by means of the SSA algorithm. The experimental results also reveal that the proposed multi-microphone method is useful in enhancing the transmission performance.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 23
    Publication Date: 2016-08-10
    Description: Substantial amounts of resources are usually required to robustly develop a language model for an open vocabulary speech recognition system as out-of-vocabulary (OOV) words can hurt recognition accuracy. In th...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 24
    Publication Date: 2015-05-22
    Description: This paper presents an objective speech quality model, ViSQOL, the Virtual Speech Quality Objective Listener. It is a signal-based, full-reference, intrusive metric that models human speech quality perception using a spectro-temporal measure of similarity between a reference and a test speech signal. The metric has been designed in particular to be robust to quality issues associated with Voice over IP (VoIP) transmission. This paper describes the algorithm and compares its quality predictions with the ITU-T standard metrics PESQ and POLQA for common problems in VoIP: clock drift, associated time warping, and playout delays. The results indicate that ViSQOL and POLQA significantly outperform PESQ, with ViSQOL competing well with POLQA. An extensive benchmarking against PESQ, POLQA, and simpler distance metrics using three speech corpora (NOIZEUS, E4, and the ITU-T P.Sup. 23 database) is also presented. These experiments benchmark the performance for a wide range of quality impairments, including VoIP degradations, a variety of background noise types, speech enhancement methods, and SNR levels. The results and subsequent analysis show that both ViSQOL and POLQA have some performance weaknesses and under-predict perceived quality in certain VoIP conditions. Both have wider applicability and robustness to conditions than PESQ or more trivial distance metrics. ViSQOL is shown to offer a useful alternative to POLQA in predicting speech quality in VoIP scenarios.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 25
    Publication Date: 2015-06-28
    Description: Over recent years, the i-vector-based framework has been proven to provide state-of-the-art performance in speaker verification. Each utterance is projected onto a total factor space and is represented by a low-dimensional feature vector. Channel compensation techniques are carried out in this low-dimensional feature space. Most of the compensation techniques take the sets of extracted i-vectors as input. By constructing the between-class covariance and the within-class covariance, we attempt to minimize the within-class variance mainly caused by channel effects and to maximize the variance between speakers. In real-world applications, enrollment and test data from each user (or speaker) are always scarce. Although it is widely thought that session variability is mostly caused by channel effects, phonetic variability, as a factor that causes session variability, is still a matter to be considered. We propose in this paper a new i-vector extraction algorithm from the total factor matrix, which we term component reduction analysis (CRA). This new algorithm contributes to better modelling of session variability in the total factor space. We report results on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation (SRE) dataset. As measured both by the equal error rate and the minimum value of the NIST detection cost function, a 10–15% relative improvement is achieved compared to the baseline traditional i-vector-based system.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
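The covariance construction described above is, in its basic form, the standard LDA-style channel compensation for i-vectors; a sketch on synthetic i-vectors follows. This is not the proposed CRA algorithm, and the dimensions and data are illustrative: between- and within-speaker scatter matrices are built from labeled i-vectors, and the projection maximizes their ratio via a generalized eigenproblem.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_spk, per_spk = 6, 5, 20
means = rng.normal(0, 3, (n_spk, dim))          # speaker locations
ivecs = np.vstack([m + rng.normal(0, 1, (per_spk, dim)) for m in means])
labels = np.repeat(np.arange(n_spk), per_spk)

mu = ivecs.mean(0)
Sw = np.zeros((dim, dim))                        # within-class (channel) scatter
Sb = np.zeros((dim, dim))                        # between-class (speaker) scatter
for s in range(n_spk):
    X = ivecs[labels == s]
    Sw += (X - X.mean(0)).T @ (X - X.mean(0))
    Sb += per_spk * np.outer(X.mean(0) - mu, X.mean(0) - mu)

# directions maximizing between-speaker over within-speaker variance
evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(evals.real)[::-1]
A = evecs.real[:, order[:n_spk - 1]]             # compensation projection
projected = ivecs @ A
```

Scoring (e.g., cosine similarity) is then performed in the projected space, where channel-induced within-speaker variance has been suppressed.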
  • 26
    Publication Date: 2015-06-26
    Description: Singer identification is a difficult topic in music information retrieval because background instrumental music accompanies the singing voice, which reduces the performance of a system. One of the main disadvantages of existing systems is that vocals and instrumentals are separated manually and only the vocals are used to build the training model. The research presented in this paper automatically recognizes a singer, without separating instrumental and singing sounds, using audio features such as timbre coefficients, pitch class, mel-frequency cepstral coefficients (MFCC), linear predictive coding (LPC) coefficients, and loudness of an audio signal from Indian video songs (IVS). Initially, various IVS of distinct playback singers (PS) are collected. After that, 53 audio features (a 12-dimensional timbre audio feature vector, 12 pitch classes, 13 MFCC coefficients, 13 LPC coefficients, and a 3-dimensional loudness feature vector of an audio signal) are extracted from each segment. The dimension of the extracted audio features is reduced using the principal component analysis (PCA) method. A playback singer model (PSM) is trained using multiclass classification algorithms such as back propagation, AdaBoost.M2, the k-nearest neighbor (KNN) algorithm, the naïve Bayes classifier (NBC), and the Gaussian mixture model (GMM). The proposed approach is tested on various combinations of datasets and different combinations of audio feature vectors with songs of various Indian male and female PSs.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
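The PCA reduction step mentioned above can be sketched in pure Python: center the feature matrix, form its covariance, and extract the leading principal axis by power iteration (a simplification; real toolkits keep several components, and the data here is illustrative):

```python
# Minimal PCA sketch for reducing an audio feature matrix: power iteration
# on the covariance matrix yields the first principal axis.
import random

def pca_top_component(rows, iters=200, seed=0):
    """Return (mean, first principal axis) for a list of feature rows."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - mu[k] for k in range(d)] for r in rows]
    # biased covariance estimate; fine for finding the dominant direction
    cov = [[sum(x[i] * x[j] for x in centered) / n for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mu, v

def project(row, mu, axis):
    """Coordinate of one feature row along the principal axis."""
    return sum((row[k] - mu[k]) * axis[k] for k in range(len(row)))
```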
  • 27
    Publication Date: 2015-06-27
    Description: Manual transcription of audio databases for the development of automatic speech recognition (ASR) systems is a costly and time-consuming process. In the context of deriving acoustic models adapted to a specific application, or in low-resource scenarios, it is therefore essential to explore alternatives capable of improving speech recognition results. In this paper, we investigate the relevance of foreign data characteristics, in particular domain and language, when using this data as an auxiliary data source for training ASR acoustic models based on deep neural networks (DNNs). The acoustic models are evaluated on a challenging bilingual database within the scope of the MediaParl project. Experimental results suggest that in-language (but out-of-domain) data is more beneficial than in-domain (but out-of-language) data when employed in either supervised or semi-supervised training of DNNs. The best performing ASR system, an HMM/GMM acoustic model that exploits a DNN as a discriminatively trained feature extractor, outperforms the best performing HMM/DNN hybrid by about 5% relative (in terms of WER). The accumulated relative gain with respect to the MFCC-HMM/GMM baseline is about 30% in WER.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
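The "5 % relative" and "30 %" figures above follow the standard relative-WER-reduction convention; a small hypothetical helper (the function name is ours) makes the arithmetic explicit:

```python
# How relative WER gains are computed: the reduction of the system's word
# error rate expressed as a fraction of the baseline's.

def relative_wer_reduction(wer_baseline, wer_system):
    """Relative WER reduction in percent (positive = system is better)."""
    return 100.0 * (wer_baseline - wer_system) / wer_baseline
```

For example, moving from a 20.0% baseline WER to 14.0% is a 30% relative reduction, even though the absolute drop is only 6 percentage points.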
  • 28
    Publication Date: 2015-06-10
    Description: Optimal automatic speech recognition (ASR) takes place when the recognition system is tested under circumstances identical to those in which it was trained. In the real world, however, there are many sources of mismatch between the training and testing environments, notably the noise sources present in real environments. Speech enhancement techniques have been developed to make ASR systems robust against such noise. In this work, a method based on histogram equalization (HEQ) is proposed to compensate for nonlinear distortions in the speech representation. The approach uses stereo simultaneous recordings of clean speech and its corresponding noisy speech to compute a stereo Gaussian mixture model (GMM). The stereo GMM is used to compute the cumulative density function (CDF) for both clean and noisy speech using a sigmoid function, instead of the order statistics used in other HEQ-based methods. In the implementation, we show two ways to apply HEQ: hard-decision HEQ and soft-decision HEQ, the latter based on minimum mean square error (MMSE) clean speech estimation. The experimental work shows that soft HEQ and hard HEQ achieve better recognition results than other HEQ approaches such as tabular HEQ, quantile HEQ, and polynomial-fit HEQ, and that soft HEQ achieves notably better recognition results than hard HEQ. The results also show that HEQ improves the efficiency of other speech enhancement techniques such as stereo piece-wise linear compensation for environment (SPLICE) and vector Taylor series (VTS), and that using HEQ in multi-style training (MST) significantly improves ASR system performance.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
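The core HEQ mapping above sends a noisy value through the noisy-speech CDF and then through the inverse clean-speech CDF. A sketch for single Gaussians, approximating the Gaussian CDF with the logistic sigmoid sigma(1.702 z) (our simplification; the paper's sigmoid form and GMM machinery may differ):

```python
# HEQ sketch: x_clean = F_clean^{-1}(F_noisy(x)), with each CDF approximated
# by a logistic sigmoid. Stats are (mean, std) pairs; values illustrative.
import math

def sigmoid_cdf(x, mean, std):
    z = (x - mean) / std
    return 1.0 / (1.0 + math.exp(-1.702 * z))

def inverse_sigmoid_cdf(p, mean, std):
    z = -math.log(1.0 / p - 1.0) / 1.702
    return mean + std * z

def heq_map(x, noisy_stats, clean_stats):
    """Map a noisy value into the clean-speech feature domain."""
    return inverse_sigmoid_cdf(sigmoid_cdf(x, *noisy_stats), *clean_stats)
```

The mapping is monotonic, so it preserves the ordering of feature values while equalizing their distribution.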
  • 29
    Publication Date: 2015-01-20
    Description: Vocal tremor has been simulated using a high-dimensional discrete vocal fold model. Specifically, respiratory, phonatory, and articulatory tremors have been modeled as instabilities in six parameters of the model. Reported results are consistent with previous knowledge in that respiratory tremor mainly causes amplitude modulation of the voice signal while laryngeal tremor causes both amplitude and frequency modulation. In turn, articulatory tremor is commonly assumed to produce only amplitude modulations but the simulation results indicate that it also produces a high-frequency modulation of the output signal. Furthermore, articulatory tremor affects the frequency response of the vocal tract and it might thus be detected by analyzing the spectral envelope of the acoustic signal.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
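The amplitude versus frequency modulation distinction above can be illustrated with a modulated sinusoid; carrier frequency, tremor rate, and modulation depths below are illustrative values, not the vocal fold model's parameters:

```python
# Amplitude modulation (respiratory-like tremor) versus frequency modulation
# (laryngeal-like tremor) of a voice-like carrier.
import math

def tremor_sample(t, f0=120.0, tremor_rate=5.0, am_depth=0.3, fm_depth=0.0):
    """One sample of an AM/FM-modulated sinusoid at time t (seconds)."""
    am = 1.0 + am_depth * math.sin(2.0 * math.pi * tremor_rate * t)
    phase = (2.0 * math.pi * f0 * t
             + fm_depth * math.sin(2.0 * math.pi * tremor_rate * t))
    return am * math.sin(phase)
```

With am_depth > 0 the signal envelope oscillates at the tremor rate; with fm_depth > 0 the instantaneous frequency does.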
  • 30
    Publication Date: 2015-01-23
    Description: In this paper, an initial feature vector based on the combination of the wavelet packet decomposition (WPD) and the Mel frequency cepstral coefficients (MFCCs) is proposed. For optimizing the initial feature vector, a genetic algorithm (GA)-based approach is proposed and compared with the well-known principal component analysis (PCA) approach. An artificial neural network (ANN) with different learning algorithms is used as the classifier. Experiments are carried out to evaluate and compare the classification accuracies obtained with the different learning algorithms and the different feature vectors (the initial and the optimized ones). Finally, a hybrid of the ANN with the 'trainscg' training algorithm and the genetic algorithm is proposed for vocal fold pathology diagnosis, and the performance of the proposed method is compared with recent works. The experimental results show better performance (higher classification accuracy) for the proposed method in comparison with the others.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
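A GA-based feature optimization of the kind described above can be sketched with bitmask chromosomes that select feature subsets. The fitness function here is a stand-in that rewards a known "useful" subset; in the paper it would be classifier accuracy on held-out data:

```python
# Minimal genetic-algorithm feature-selection loop: binary chromosomes,
# truncation selection, one-point crossover, bit-flip mutation.
import random

def ga_select(n_features, fitness, pop_size=20, generations=40, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]        # elitist truncation
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:              # mutation
                child[rng.randrange(n_features)] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```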
  • 31
    Publication Date: 2015-02-12
    Description: The spatio-temporal-prediction (STP) method for multichannel speech enhancement has recently been proposed. This approach makes it theoretically possible to attenuate the residual noise without distorting speech. In addition, the STP method depends only on the second-order statistics and can be implemented using a simple linear filtering framework. Unfortunately, some numerical problems can arise when estimating the filter matrix in transients. In such a case, the speech correlation matrix is usually rank deficient, so that no solution exists. In this paper, we propose to implement the spatio-temporal-prediction method using a signal subspace approach. This allows for nullifying the noise subspace and processing only the noisy signal in the signal-plus-noise subspace. As a result, we are able to not only regularize the solution in transients but also to achieve higher attenuation of the residual noise. The experimental results also show that the signal subspace approach distorts speech less than the conventional method.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 32
    Publication Date: 2015-02-12
    Description: Deep neural networks (DNNs) have gained remarkable success in speech recognition, partially attributed to the flexibility of DNN models in learning complex patterns of speech signals. This flexibility, however, may lead to serious over-fitting and hence severe performance degradation in adverse acoustic conditions such as those with high ambient noise. We propose a noisy training approach to tackle this problem: by injecting moderate noises into the training data intentionally and randomly, more generalizable DNN models can be learned. This ‘noise injection’ technique, although already known to the neural computation community, has not been studied with DNNs, which involve a highly complex objective function. The experiments presented in this paper confirm that the noisy training approach works well for the DNN model and can provide substantial performance improvement for DNN-based speech recognition.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
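The noisy-training recipe above reduces, at its core, to corrupting each clean training waveform with noise at a moderate, randomly drawn SNR. The Gaussian noise type and SNR range below are illustrative choices, not the paper's settings:

```python
# Noise injection for training data: add Gaussian noise scaled to hit a
# target SNR drawn uniformly from a moderate range.
import math
import random

def inject_noise(signal, snr_db_range=(10.0, 20.0), rng=None):
    rng = rng or random.Random(0)
    snr_db = rng.uniform(*snr_db_range)
    sig_power = sum(x * x for x in signal) / len(signal)
    noise_std = math.sqrt(sig_power / 10.0 ** (snr_db / 10.0))
    return [x + rng.gauss(0.0, noise_std) for x in signal]
```

Drawing the SNR per utterance, rather than fixing it, is what makes the injected noise "random" as well as "intentional".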
  • 33
    Publication Date: 2015-02-13
    Description: Music identification via audio fingerprinting has been an active research field in recent years. In the real-world environment, music queries are often deformed by various interferences which typically include signal distortions and time-frequency misalignments caused by time stretching, pitch shifting, etc. Therefore, robustness plays a crucial role in music identification technique. In this paper, we propose to use scale invariant feature transform (SIFT) local descriptors computed from a spectrogram image as sub-fingerprints for music identification. Experiments show that these sub-fingerprints exhibit strong robustness against serious time stretching and pitch shifting simultaneously. In addition, a locality sensitive hashing (LSH)-based nearest sub-fingerprint retrieval method and a matching determination mechanism are applied for robust sub-fingerprint matching, which makes the identification efficient and precise. Finally, as an auxiliary function, we demonstrate that by comparing the time-frequency locations of corresponding SIFT keypoints, the factor of time stretching and pitch shifting that music queries might have experienced can be accurately estimated.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
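The LSH-based nearest sub-fingerprint retrieval mentioned above can be illustrated with the classic random-hyperplane hash: each hyperplane contributes one sign bit, so near-identical descriptors map to the same bucket. Bit width and seeding are illustrative choices, not the paper's configuration:

```python
# Toy random-hyperplane LSH for descriptor vectors (e.g., SIFT-like
# sub-fingerprints): the hash is the concatenation of projection signs.
import random

def make_hasher(dim, n_bits, seed=42):
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(n_bits)]
    def hash_vec(v):
        bits = 0
        for plane in planes:
            s = sum(p * x for p, x in zip(plane, v))
            bits = (bits << 1) | (1 if s >= 0.0 else 0)
        return bits
    return hash_vec
```

Candidate matches are then fetched from the query's bucket and verified, which is what keeps identification both efficient and precise.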
  • 34
    Publication Date: 2015-01-21
    Description: No description available
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 35
    Publication Date: 2015-01-30
    Description: Owing to the suprasegmental behavior of emotional speech, turn-level features have demonstrated greater success than frame-level features in recognition-related tasks. Conventionally, such features are obtained via a brute-force collection of statistics over frames, losing important local information in the process, which affects performance. To overcome these limitations, a novel feature extraction approach using latent topic models (LTMs) is presented in this study. Speech is assumed to comprise a mixture of emotion-specific topics, which capture emotionally salient information from the co-occurrences of frame-level acoustic features and yield better descriptors. Specifically, a supervised replicated softmax model (sRSM), based on restricted Boltzmann machines and distributed representations, is proposed to learn naturally discriminative topics. The proposed features are evaluated for the recognition of categorical or continuous emotional attributes via within- and cross-corpus experiments conducted over acted and spontaneous expressions. In the within-corpus scenario, sRSM outperforms competing LTMs, while obtaining a significant improvement of 16.75% over popular statistics-based turn-level features for valence-based classification, which is considered a difficult task using only speech. Further analyses with respect to turn duration show that the improvement is even more significant, 35%, on longer turns (>6 s), which is highly desirable for current turn-based practices. In the cross-corpus scenario, two novel adaptation-based approaches, instance selection and weight regularization, are proposed to reduce the inherent bias due to varying annotation procedures and cultural perceptions across databases. Experimental results indicate a natural, yet less severe, deterioration in performance of only 2.6% and 2.7%, highlighting the generalization ability of the proposed features.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 36
    Publication Date: 2015-07-17
    Description: The Farrow-structure-based steerable broadband beamformer (FSBB) is particularly useful in applications where the sound source of interest may move around a wide angular range. However, in contrast with the conventional filter-and-sum beamformer, the passband steerability of the FSBB is achieved at the cost of high structural complexity, i.e., a highly increased number of tap weights. Moreover, it has been shown that the FSBB is sensitive to microphone mismatches, so robust FSBB design is of interest for practical applications. To deal with these problems, this paper studies the robust design of the FSBB with sparse tap weights via convex optimization, exploiting a priori knowledge of microphone mismatches. It is shown that although worst-case performance (WCP) optimization has been successfully applied to the design of robust filter-and-sum beamformers with bounded microphone mismatches, it may be inapplicable to robust FSBB design due to its over-conservative nature. When limited knowledge of the mean and variance of microphone mismatches is available, a robust FSBB design approach based on worst-case mean performance optimization with a passband response variance (PRV) constraint is devised. Unlike the WCP optimization design, this approach performs well and is capable of passband stability control of the array response. Finally, robust FSBB design with sparse tap weights is studied. It is shown that there is redundancy in the tap weights of the FSBB, i.e., robust FSBB design with sparse tap weights is viable and thus leads to a low-complexity FSBB.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 37
    Publication Date: 2013-02-21
    Description: The rapid spread of digital data usage in many real-life applications has urged new and effective ways to ensure their security. Efficient secrecy can be achieved, at least in part, by implementing steganography techniques, and novel and versatile audio steganographic methods have been proposed. The goal of steganographic systems is to obtain a secure and robust way to conceal a high rate of secret data. We focus in this paper on digital audio steganography, which has emerged as a prominent source of data hiding across novel telecommunication technologies such as covered voice-over-IP and audio conferencing. The multitude of steganographic criteria has led to a great diversity in these system design techniques. In this paper, we review current digital audio steganographic techniques and evaluate their performance based on robustness, security, and hiding capacity indicators. Another contribution of this paper is the provision of a robustness-based classification of steganographic models depending on their occurrence in the embedding process. A survey of major trends in audio steganography applications is also discussed.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 38
    Publication Date: 2012-12-11
    Description: Conventional parametric stereo (PS) audio coding employs inter-channel phase difference and overall phase difference as phase parameters. In this article, it is shown that those parameters cannot correctly represent the phase relationship between the stereo channels when the inter-channel correlation (ICC) is less than one, which is common in practical situations. To solve this problem, we introduce new phase parameters, channel phase differences (CPDs), defined as the phase differences between the mono downmix and the stereo channels. Since CPDs have a descriptive relationship with ICC as well as with inter-channel intensity difference, they are more relevant for representing the phase difference between the channels in practical situations. We also propose methods of synthesizing CPDs at the decoder. Through computer simulations and subjective listening tests, it is confirmed that the proposed methods produce significantly lower phase errors than conventional PS and can noticeably improve sound quality for stereo inputs with low ICCs.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 39
    Publication Date: 2012-12-11
    Description: In this paper, the authors propose an optimally designed fixed Beamformer (BF) for Stereophonic Acoustic Echo Cancellation (SAEC) in real hands-free communication applications. Several contributions combining beamforming and echo cancellation have appeared in the literature, but, to the authors' knowledge, the idea of using optimal fixed BFs in a real-time SAEC system, both for echo reduction and for stereophonic audio rendering, is first addressed in this contribution. Employing such BFs addresses both issues, as several simulated and real tests confirm. In particular, the stereo-recording quality attainable through the proposed approach has been preliminarily evaluated by means of subjective listening tests, and the overall system robustness against microphone array imperfections and noise has been experimentally evaluated. This allowed the authors to implement a real hands-free communication system in which the proposed beamforming technique proved its superiority over the usual two-microphone one in terms of echo reduction while guaranteeing a comparable spaciousness effect. Moreover, the proposed framework requires only a small computational cost increase over the baseline approach, since only a few extra filtering operations with short filters need to be executed. Furthermore, according to the performed simulations, the BF-based SAEC configuration does not appear to require the signal decorrelation module, resulting in an overall computational saving.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 40
    Publication Date: 2012-12-11
    Description: The rapid spread of digital data usage in many real-life applications has urged new and effective ways to ensure their security. Efficient secrecy can be achieved, at least in part, by implementing steganography techniques, and novel and versatile audio steganographic methods have been proposed. The goal of steganographic systems is to obtain a secure and robust way to conceal a high rate of secret data. We focus in this paper on digital audio steganography, which has emerged as a prominent source of data hiding across novel telecommunication technologies such as covered voice-over-IP and audio conferencing. The multitude of steganographic criteria has led to a great diversity in these system design techniques. In this paper, we review current digital audio steganographic techniques and evaluate their performance based on robustness, security, and hiding capacity indicators. Another contribution of this paper is the provision of a robustness-based classification of steganographic models depending on their occurrence in the embedding process. A survey of major trends in audio steganography applications is also discussed.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 41
    Publication Date: 2012-12-11
    Description: Mood is an important aspect of music and knowledge of mood can be used as a basic feature in music recommender and retrieval systems. A listening experiment was carried out establishing ratings for various moods and a number of attributes, e.g., valence and arousal. The analysis of these data covers the issues of the number of basic dimensions in music mood, their relation to valence and arousal, the distribution of moods in the valence-arousal plane, distinctiveness of the labels, and appropriate (number of) labels for full coverage of the plane. It is also shown that subject-averaged valence and arousal ratings can be predicted from music features by a linear model.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
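The closing claim above, that subject-averaged valence and arousal ratings are predictable from music features by a linear model, reduces in the single-feature case to ordinary least squares (the multi-feature model in the study would use several regressors; data here is illustrative):

```python
# Ordinary least squares for one predictor: fit rating = slope*feature + b.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx
```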
  • 42
    Publication Date: 2012-12-11
    Description: A vast number of audio features have been proposed in the literature to characterize the content of audio signals. In order to overcome specific problems related to the existing features (such as lack of discriminative power), as well as to reduce the need for manual feature selection, in this article, we propose an evolutionary feature synthesis technique with a built-in feature selection scheme. The proposed synthesis process searches for optimal linear/nonlinear operators and feature weights from a pre-defined multi-dimensional search space to generate a highly discriminative set of new (artificial) features. The evolutionary search process is based on a stochastic optimization approach in which a multi-dimensional particle swarm optimization algorithm, along with fractional global best formation and heterogeneous particle behavior techniques, is applied. Unlike many existing feature generation approaches, the dimensionality of the synthesized feature vector is also searched and optimized within a set range in order to better meet the varying requirements set by many practical applications and classifiers. The new features generated by the proposed synthesis approach are compared with typical low-level audio features in several classification and retrieval tasks. The results demonstrate a clear improvement of up to 15–20% in average retrieval performance. Moreover, the proposed synthesis technique surpasses the synthesis performance of evolutionary artificial neural networks, exhibiting a considerable capability to accurately distinguish among different audio classes.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 43
    Publication Date: 2012-12-11
    Description: Humans exhibit a remarkable ability to reliably classify sound sources in the environment even in presence of high levels of noise. In contrast, most engineering systems suffer a drastic drop in performance when speech signals are corrupted with channel or background distortions. Our brains are equipped with elaborate machinery for speech analysis and feature extraction, understanding of which would presumably improve the performance of automatic speech processing systems under adverse conditions. The work presented here explores a biologically-motivated multi-resolution speaker information representation obtained by performing an intricate yet computationally-efficient analysis of the information-rich spectro-temporal attributes of the speech signal. We evaluate the proposed features in a speaker verification task performed on NIST SRE 2010 data. The biomimetic approach yields significant robustness in presence of non-stationary noise and reverberation, offering a new framework for deriving reliable features for speaker recognition and speech processing.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 44
    Publication Date: 2012-12-11
    Description: Dance movements are a complex class of human behavior which convey forms of non-verbal and subjective communication that are performed as cultural vocabularies in all human cultures. The singularity of dance forms imposes fascinating challenges to computer animation and robotics, which in turn presents outstanding opportunities to deepen our understanding about the phenomenon of dance by means of developing models, analyses and syntheses of motion patterns. In this article, we formalize a model for the analysis and representation of popular dance styles of repetitive gestures by specifying the parameters and validation procedures necessary to describe the spatiotemporal elements of the dance movement in relation to its music temporal structure (musical meter). Our representation model is able to precisely describe the structure of dance gestures according to the structure of musical meter, at different temporal resolutions, and is flexible enough to convey the variability of the spatiotemporal relation between music structure and movement in space. It results in a compact and discrete mid-level representation of the dance that can be further applied to algorithms for the generation of movements in different humanoid dancing characters. The validation of our representation model relies upon two hypotheses: (i) the impact of metric resolution and (ii) the impact of variability towards fully and naturally representing a particular dance style of repetitive gestures. We numerically and subjectively assess these hypotheses by analyzing solo dance sequences of Afro-Brazilian samba and American Charleston, captured with a MoCap (Motion Capture) system. From these analyses, we build a set of dance representations modeled with different parameters, and re-synthesize motion sequence variations of the represented dance styles. 
For specifically assessing the metric hypothesis, we compare the captured dance sequences with repetitive sequences of a fixed dance motion pattern, synthesized at different metric resolutions for both dance styles. In order to evaluate the hypothesis of variability, we compare the same repetitive sequences with others synthesized with variability, by generating and concatenating stochastic variations of the represented dance pattern. The observed results validate the proposition that different dance styles of repetitive gestures might require a minimum and sufficient metric resolution to be fully represented by the proposed representation model. Yet, these also suggest that additional information may be required to synthesize variability in the dance sequences while assuring the naturalness of the performance. Nevertheless, we found evidence that supports the use of the proposed dance representation for flexibly modeling and synthesizing dance sequences from different popular dance styles, with potential developments for the generation of expressive and natural movement profiles onto humanoid dancing characters.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 45
    Publication Date: 2012-12-11
    Description: In this article, we present the evaluation results for the task of speaker diarization of broadcast news, which was part of the Albayzin 2010 evaluation campaign of language and speech technologies. The evaluation data consists of a subset of the Catalan broadcast news database recorded from the 3/24 TV channel. The description of five submitted systems from five different research labs is given, marking the common as well as the distinctive system features. The diarization performance is analyzed in the context of the diarization error rate, the number of detected speakers and also the acoustic background conditions. An effort is also made to put the achieved results in relation to the particular system design features.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
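The headline metric of the evaluation above, the diarization error rate (DER), can be sketched at frame level. This simplified version assumes the optimal reference-to-hypothesis speaker mapping has already been applied (real scoring tools compute that mapping and apply forgiveness collars):

```python
# Frame-level DER: missed speech + false-alarm speech + speaker-confusion
# time, divided by total reference speech time.

def diarization_error_rate(ref, hyp):
    """ref/hyp: per-frame speaker labels, with None meaning non-speech."""
    miss = false_alarm = confusion = speech = 0
    for r, h in zip(ref, hyp):
        if r is not None:
            speech += 1
            if h is None:
                miss += 1
            elif h != r:
                confusion += 1
        elif h is not None:
            false_alarm += 1
    return (miss + false_alarm + confusion) / speech
```

Because false alarms are counted against reference speech time, the DER can exceed 100% for very poor hypotheses.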
  • 46
    Publication Date: 2012-12-11
    Description: In this paper, we propose a speaker-dependent model interpolation method for statistical emotional speech synthesis. The basic idea is to combine the neutral model set of the target speaker with an emotional model set selected from a pool of speakers. For model selection and interpolation weight determination, we propose a novel monophone-based Mahalanobis distance, which is a proper distance measure between two hidden Markov model sets. We design a Latin-square evaluation to reduce systematic bias in the subjective listening tests. The proposed interpolation method achieves sound performance in emotional expressiveness, naturalness, and target speaker similarity. Moreover, this performance is achieved without collecting emotional speech from the target speaker, saving the cost of data collection and labeling.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
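The monophone-based distance above compares two model sets phone by phone. A simplified sketch for diagonal-covariance Gaussian means, averaged over shared monophones (the paper's exact per-phone weighting and state handling may differ):

```python
# Mahalanobis-style distance between two model sets: per monophone, compare
# mean vectors under a shared diagonal covariance, then average.
import math

def mahalanobis_diag(mu1, mu2, var):
    """Mahalanobis distance between two means under diagonal covariance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(mu1, mu2, var)))

def model_set_distance(set1, set2, variances):
    """set1/set2/variances: dicts mapping monophone -> vector."""
    phones = set1.keys() & set2.keys()
    return sum(mahalanobis_diag(set1[p], set2[p], variances[p])
               for p in phones) / len(phones)
```

The emotional model set from the pool that minimizes this distance to the target speaker's neutral set would be the one selected for interpolation.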
  • 47
    Publication Date: 2012-12-11
    Description: A new method to secure speech communication using the discrete wavelet transform (DWT) and the fast Fourier transform is presented in this article. In the first phase of the hiding technique, we separate the high-frequency components of the speech from the low-frequency components using the DWT. In a second phase, we exploit the low-pass spectral properties of the speech spectrum to hide another secret speech signal in the low-amplitude high-frequency regions of the cover speech signal. The proposed method allows hiding a large amount of secret information while rendering steganalysis more complex. Experimental results prove the efficiency of the proposed hiding technique, since the stego signals are perceptually indistinguishable from the equivalent cover signal while the secret speech message can be recovered with only slight degradation in quality.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
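The frequency split that the hiding scheme above relies on is one level of the DWT. A sketch using the Haar wavelet, the simplest choice (the article's actual wavelet is not specified here):

```python
# One level of the Haar DWT: split an even-length signal into low-frequency
# approximation and high-frequency detail coefficients, and invert exactly.
import math

def haar_dwt(x):
    """Return (approximation, detail) coefficients of an even-length signal."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Perfect reconstruction from one level of Haar coefficients."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out
```

A hiding scheme would embed payload bits in the low-amplitude detail coefficients and reconstruct the stego signal with the inverse transform.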
  • 48
    Publication Date: 2013-02-02
    Description: A comprehensive system for facial animation of generic 3D head models driven by speech is presented in this article. In the training stage, audio-visual information is extracted from audio-visual training data, and then used to compute the parameters of a single joint audio-visual hidden Markov model (AV-HMM). In contrast to most of the methods in the literature, the proposed approach does not require segmentation/classification processing stages of the audio-visual data, avoiding the error propagation related to these procedures. The trained AV-HMM provides a compact representation of the audio-visual data, without the need of phoneme (word) segmentation, which makes it adaptable to different languages. Visual features are estimated from the speech signal based on the inversion of the AV-HMM. The estimated visual speech features are used to animate a simple face model. The animation of a more complex head model is then obtained by automatically mapping the deformation of the simple model to it, using a small number of control points for the interpolation. The proposed algorithm allows the animation of 3D head models of arbitrary complexity through a simple setup procedure. The resulting animation is evaluated in terms of intelligibility of visual speech through perceptual tests, showing a promising performance. The computational complexity of the proposed system is analyzed, showing the feasibility of its real-time implementation.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 49
    Publication Date: 2015-10-08
    Description: In this paper, a single-channel speech enhancement method based on Bayesian decision and spectral amplitude estimation is proposed, comprising a speech detection module and a spectral amplitude estimation module that are strongly coupled. First, under the decisions of speech presence and speech absence, the optimal speech amplitude estimators are obtained by minimizing a combined Bayesian risk function. Second, using the obtained spectral amplitude estimators, the optimal speech detector is derived by further minimizing the combined Bayesian risk function. Finally, according to the results of the speech detector, the optimal decision rule is made and the optimal spectral amplitude estimator is chosen for enhancing the noisy speech. Furthermore, by considering both detection and estimation errors, we propose a combined cost function which incorporates two general weighted distortion measures for the spectral amplitudes under speech presence and speech absence, respectively. The cost parameters in the cost function are employed to balance the speech distortion and residual noise caused by missed detection and false alarm, respectively. In addition, we propose two adaptive calculation methods for the perceptual weighting order p and the spectral amplitude order β used in the proposed cost function. The objective and subjective test results indicate that the proposed method achieves a more significant segmental signal-to-noise ratio (SNR) improvement, a lower log-spectral distortion, and better speech quality than the reference methods.
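The coupled detection/estimation idea above can be caricatured per spectral bin as follows; the Wiener-style gain, the decision threshold, and the attenuation value `g_absent` are simplified stand-ins for the paper's optimal Bayesian estimators and detector, not its actual derivation.

```python
# Sketch: a detector decides "speech present" vs. "speech absent" for
# one spectral amplitude, and the corresponding estimator (gain) is
# applied. Threshold and gains are illustrative assumptions.

def enhance_bin(noisy_amp, noise_amp, threshold=2.0, g_absent=0.1):
    """Apply a presence/absence-dependent gain to one spectral amplitude."""
    snr = (noisy_amp / noise_amp) ** 2 - 1.0   # a-posteriori SNR minus 1
    if snr > threshold:                        # detector says: speech present
        gain = snr / (snr + 1.0)               # Wiener-style gain
    else:                                      # detector says: speech absent
        gain = g_absent                        # strong attenuation
    return gain * noisy_amp
```

The `threshold` and `g_absent` parameters play the role of the cost parameters in the abstract: they trade off speech distortion (missed detections) against residual noise (false alarms).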
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 50
    Publication Date: 2016-01-22
    Description: In woodwind instruments such as a flute, producing a higher-pitched tone than a standard tone by increasing the blowing pressure is called overblowing, and this allows several distinct fingerings for the same notes. This article presents a method that attempts to learn acoustic features that are more appropriate than conventional features such as mel-frequency cepstral coefficients (MFCCs) in detecting the fingering from a flute sound using unsupervised feature learning. To do so, we first extract a spectrogram from the audio and convert it to a mel scale. Then, we concatenate four consecutive mel-spectrogram frames to include short temporal information and use it as a front end for the sparse filtering algorithm. The learned feature is then max-pooled, resulting in a final feature vector for the classifier that has extra robustness. We demonstrate the advantages of the proposed method in a twofold manner: we first visualize and analyze the differences in the learned features between the tones generated by standard and overblown fingerings. We then perform a quantitative evaluation through classification tasks on six selected pitches with up to five different fingerings that include a variety of octave-related and non-octave-related fingerings. The results confirm that the learned features using the proposed method significantly outperform the conventional MFCCs and the residual noise spectrum in every experimental condition for the classification tasks.
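A minimal sketch of the front end described above, i.e., concatenating four consecutive mel-spectrogram frames and max-pooling feature activations over time. The toy matrix sizes are assumptions, and the sparse-filtering step that would sit between stacking and pooling is omitted for brevity.

```python
# Sketch of the front end: stack 4 consecutive mel-spectrogram frames
# into one context vector, then max-pool each dimension over time.
# (Toy sizes; sparse filtering between the two steps is omitted.)

def stack_frames(frames, context=4):
    """Concatenate `context` consecutive frames into single vectors."""
    return [sum((frames[i + j] for j in range(context)), [])
            for i in range(len(frames) - context + 1)]

def max_pool(features):
    """Max-pool each feature dimension across all time steps."""
    return [max(col) for col in zip(*features)]

# Toy "mel-spectrogram": 6 frames x 3 mel bands
mel = [[0.1, 0.2, 0.0],
       [0.3, 0.1, 0.4],
       [0.2, 0.5, 0.1],
       [0.0, 0.3, 0.2],
       [0.4, 0.0, 0.1],
       [0.1, 0.1, 0.3]]

stacked = stack_frames(mel)   # 3 context vectors of length 12
pooled = max_pool(stacked)    # final 12-dim feature vector
```

The max-pooling over time is what gives the final feature vector the extra robustness mentioned in the abstract: small temporal shifts of the tone onset change which frame contributes a maximum, but not the pooled value itself.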
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 51
    Publication Date: 2016-01-23
    Description: The goal of voice conversion is to modify a source speaker’s speech to sound as if spoken by a target speaker. Common conversion methods are based on Gaussian mixture modeling (GMM). They aim to statistically model the spectral structure of the source and target signals and require relatively large training sets (typically dozens of sentences) to avoid over-fitting. Moreover, they often lead to muffled synthesized output signals, due to excessive smoothing of the spectral envelopes. Mobile applications are characterized by low resources in terms of training data, memory footprint, and computational complexity. As technology advances, computational and memory requirements become less limiting; however, the amount of available training data still presents a great challenge, as a typical mobile user is willing to record himself saying just a few sentences. In this paper, we propose the grid-based (GB) conversion method for such low-resource environments, which is successfully trained using very few sentences (5–10). The GB approach is based on sequential Bayesian tracking, by which the conversion process is expressed as a sequential estimation problem of tracking the target spectrum based on the observed source spectrum. The converted Mel-frequency cepstrum coefficient (MFCC) vectors are sequentially evaluated using a weighted sum of the target training vectors used as grid points. The training process involves simple computations of Euclidean distances between the training vectors and is easily performed even with very small training sets. We use global variance (GV) enhancement to improve the perceived quality of the synthesized signals obtained by the proposed and the GMM-based methods. Using just 10 training sentences, our enhanced GB method leads to converted sentences having GV values closer to those of the target and, at the same time, lower spectral distances, compared to the enhanced version of the GMM-based conversion method. Furthermore, subjective evaluations show that signals produced by the enhanced GB method are perceived as more similar to the target speaker than the enhanced GMM signals, at the expense of a small degradation in perceived quality.
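A minimal, single-frame sketch of the grid-point weighting described above: the converted vector is a weighted sum of target training vectors, with weights derived from the source vector's Euclidean distances to the paired source training vectors. The exponential weighting, `beta`, and the function name `gb_convert` are illustrative assumptions, and the sequential Bayesian tracking across frames is not modeled.

```python
# Sketch: convert a source feature vector as a weighted sum of target
# training vectors ("grid points"). This collapses the paper's
# sequential tracking to one independent frame; weights are assumed
# exponential in Euclidean distance.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def gb_convert(src, src_train, tgt_train, beta=1.0):
    """Weighted sum of target grid points; weights from source distances."""
    weights = [math.exp(-beta * euclidean(src, s)) for s in src_train]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(tgt_train[0])
    return [sum(w * t[d] for w, t in zip(weights, tgt_train))
            for d in range(dim)]

# Tiny paired training set of 2-dim feature vectors (toy values)
src_train = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
tgt_train = [[0.5, 0.5], [1.5, 1.5], [2.5, 0.5]]
converted = gb_convert([1.0, 1.0], src_train, tgt_train)
```

Because training amounts to storing the paired vectors and computing distances at conversion time, the method needs no iterative model fitting, which is why very small training sets suffice.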
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 52
    Publication Date: 2016-02-24
    Description: The goal of voice conversion is to modify a source speaker’s speech to sound as if spoken by a target speaker. Common conversion methods are based on Gaussian mixture modeling (GMM). They aim to statistically ...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 53
    Publication Date: 2016-03-02
    Description: Today, a large amount of audio data is available on the web in the form of audiobooks, podcasts, video lectures, video blogs, news bulletins, etc. In addition, we can effortlessly record and store audio data s...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 54
    Publication Date: 2016-02-24
    Description: Query-by-example spoken term detection (QbE STD) aims at retrieving data from a speech repository given an acoustic query containing the term of interest as input. Nowadays, it is receiving much interest due t...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 55
    Publication Date: 2016-02-24
    Description: Indian classical music, including its two varieties, Carnatic and Hindustani music, has a rich music tradition and enjoys a wide audience from various parts of the world. The Carnatic music which is more popul...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 56
    Publication Date: 2016-02-24
    Description: Unit selection based text-to-speech synthesis (TTS) has been the dominant TTS approach of the last decade. Despite its success, unit selection approach has its disadvantages. One of the most significant disadv...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 57
    Publication Date: 2016-02-24
    Description: In woodwind instruments such as a flute, producing a higher-pitched tone than a standard tone by increasing the blowing pressure is called overblowing, and this allows several distinct fingerings for the same ...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 58
    Publication Date: 2016-02-03
    Description: Unit selection based text-to-speech synthesis (TTS) has been the dominant TTS approach of the last decade. Despite its success, the unit selection approach has its disadvantages. One of the most significant is the sudden discontinuities in speech that distract listeners (Speech Commun 51:1039–1064, 2009). Another is that significant expertise and large amounts of data are needed for building a high-quality synthesis system, which is costly and time-consuming. The statistical speech synthesis (SSS) approach is a promising alternative synthesis technique. Not only are the spurious errors observed in the unit selection system mostly absent in SSS, but building voice models is also far less expensive and faster compared to the unit selection system. However, the resulting speech is typically not as natural-sounding as speech synthesized with a high-quality unit selection system. There are hybrid methods that attempt to take advantage of both SSS and unit selection systems. However, existing hybrid methods still require the development of a high-quality unit selection system. Here, we propose a novel hybrid statistical/unit selection system for Turkish that aims at improving the quality of the baseline SSS system by improving prosodic parameters such as intonation and stress. Commonly occurring suffixes in Turkish are stored in the unit selection database and used in the proposed system. As opposed to existing hybrid systems, the proposed system was developed without building a complete unit selection synthesis system. Therefore, the proposed method can be used without collecting large amounts of data or requiring the substantial expertise or time-consuming tuning typically needed to build unit selection systems. Listeners preferred the hybrid system over the baseline system in AB preference tests.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 59
    Publication Date: 2015-12-02
    Description: Audio segmentation is important as a pre-processing task for improving the performance of many speech technology tasks and, therefore, has clear research interest. This paper describes the database, the metric, the systems, and the results of the Albayzín-2014 audio segmentation campaign. In contrast to previous evaluations, where the task was the segmentation of non-overlapping classes, the Albayzín-2014 evaluation proposes delimiting the presence of speech, music, and/or noise that can occur simultaneously. The database used in the evaluation was created by fusing different media and noises in order to increase the difficulty of the task. Seven segmentation systems from four different research groups were evaluated and combined. Their experimental results were analyzed and compared with the aim of providing a benchmark and highlighting promising directions in this field.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 60
    Publication Date: 2015-12-05
    Description: Using a recently proposed informed spatial filter, it is possible to effectively and robustly reduce reverberation from speech signals captured in noisy environments using multiple microphones. Late reverberation can be modeled by a diffuse sound field with a time-varying power spectral density (PSD). To attain reverberation reduction using this spatial filter, an accurate estimate of the diffuse sound PSD is required. In this work, a method is proposed to estimate the diffuse sound PSD from a set of reference signals by blocking the direct signal components. By considering multiple plane waves in the signal model to describe the direct sound, the method is suitable in the presence of multiple simultaneously active speakers. The proposed diffuse sound PSD estimator is analyzed and compared to existing estimators. In addition, the performance of the spatial filter computed with the diffuse sound PSD estimate is analyzed using simulated and measured room impulse responses in noisy environments with stationary noise and non-stationary babble noise.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 61
    Publication Date: 2019
    Description: Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken communication. Machine learning models such as neural networks have already been proposed for audio si...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 62
    Publication Date: 2019
    Description: Singing voice analysis has been a topic of research to assist several applications in the domain of music information retrieval system. One such major area is singer identification (SID). There has been enormo...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 63
    Publication Date: 2018
    Description: This paper deals with a project of Automatic Bird Species Recognition Based on Bird Vocalization. Eighteen bird species of 6 different families were analyzed. At first, human factor cepstral coefficients repre...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 64
    Publication Date: 2019
    Description: This paper proposes two novel linguistic features extracted from text input for prosody generation in a Mandarin text-to-speech system. The first feature is the punctuation confidence (PC), which measures the ...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 65
    Publication Date: 2019
    Description: The overall recognition rate will reduce due to the increase of emotional confusion in multiple speech emotion recognition. To solve the problem, we propose a speech emotion recognition method based on the dec...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 66
    Publication Date: 2019
    Description: Current automatic speech recognition (ASR) systems achieve over 90–95% accuracy, depending on the methodology applied and datasets used. However, the level of accuracy decreases significantly when the same ASR...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 67
    Publication Date: 2019
    Description: Filter banks on spectrums play an important role in many audio applications. Traditionally, the filters are linearly distributed on perceptual frequency scale such as Mel scale. To make the output smoother, th...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 68
    Publication Date: 2019
    Description: Voice conversion (VC) is a technique of exclusively converting speaker-specific information in the source speech while preserving the associated phonemic information. Non-negative matrix factorization (NMF)-ba...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 69
    Publication Date: 2019
    Description: Voice-enabled interaction systems in domestic environments have attracted significant interest recently, being the focus of smart home research projects and commercial voice assistant home devices. Within the ...
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 70
    Publication Date: 2013-12-07
    Description: This paper investigates real-time N-dimensional wideband sound source localization in outdoor (far-field) and low-degree reverberation cases, using a simple N-microphone arrangement. Outdoor sound source localization in different climates needs highly sensitive and high-performance microphones, which are very expensive. Reduction of the microphone count is our goal. Time delay estimation (TDE)-based methods are common for N-dimensional wideband sound source localization in outdoor cases using at least N + 1 microphones. These methods need numerical analysis to solve closed-form non-linear equations leading to large computational overheads and a good initial guess to avoid local minima. Combined TDE and intensity level difference or interaural level difference (ILD) methods can reduce microphone counts to two for indoor two-dimensional cases. However, ILD-based methods need only one dominant source for accurate localization. Also, using a linear array, two mirror points are produced simultaneously (half-plane localization). We apply this method to outdoor cases and propose a novel approach for N-dimensional entire-space outdoor far-field and low reverberation localization of a dominant wideband sound source using TDE, ILD, and head-related transfer function (HRTF) simultaneously and only N microphones. Our proposed TDE-ILD-HRTF method tries to solve the mentioned problems using source counting, noise reduction using spectral subtraction, and HRTF. A special reflector is designed to avoid mirror points and source counting used to make sure that only one dominant source is active in the localization area. The simple microphone arrangement used leads to linearization of the non-linear closed-form equations as well as no need for initial guess. 
Experimental results indicate that our implemented method achieves less than 0.2° error in angle of arrival and less than 10% error in three-dimensional location finding, as well as less than 150 ms processing time for localizing a typical wideband sound source such as a flying object (helicopter).
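The TDE part of the method can be illustrated by the standard far-field relation between the inter-microphone delay and the angle of arrival; the microphone spacing and speed-of-sound values below are assumed examples (the ILD and HRTF components are not modeled here).

```python
# Far-field relation between the time-difference of arrival (TDOA) at
# two microphones and the source's angle of arrival. Spacing and
# speed-of-sound values are assumed examples.
import math

SPEED_OF_SOUND = 343.0          # m/s at room temperature

def angle_of_arrival(tdoa_s, mic_spacing_m):
    """Angle of arrival (radians, 0 = broadside) from a delay estimate."""
    x = SPEED_OF_SOUND * tdoa_s / mic_spacing_m
    x = max(-1.0, min(1.0, x))  # clamp numerical overshoot before asin
    return math.asin(x)

# A broadside source gives zero delay; an end-fire source gives the
# maximum delay, spacing / c.
broadside = angle_of_arrival(0.0, 0.2)
endfire = angle_of_arrival(0.2 / SPEED_OF_SOUND, 0.2)
```

The front-back (mirror-point) ambiguity of this relation is exactly what the abstract's reflector and HRTF cues are introduced to resolve.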
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 71
    Publication Date: 2014-01-14
    Description: We propose an integrative method of recognizing gestures, such as pointing, that accompany speech. Speech generated simultaneously with gestures can assist in the recognition of gestures, and since this occurs in a complementary manner, gestures can also assist in the recognition of speech. Our integrative recognition method uses a probability distribution which expresses the distribution of the time interval between the starting times of gestures and of the corresponding utterances. We evaluate the rate of improvement of the proposed integrative recognition method with a task involving the solution of a geometry problem.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 72
    Publication Date: 2014-05-04
    Description: In this paper, a two-stage scheme is proposed to deal with the difficult problem of acoustic echo cancellation (AEC) in a single-channel scenario in the presence of noise. In order to overcome the major challenge of obtaining a separate reference signal in the adaptive filter-based AEC problem, the delayed version of the echo- and noise-suppressed signal is proposed as the reference. A modified objective function is thereby derived for a gradient-based adaptive filter algorithm, and proof of its convergence to the optimum Wiener-Hopf solution is established. The output of the AEC block is fed to an acoustic noise cancellation (ANC) block, where a spectral subtraction-based algorithm with adaptive spectral floor estimation is employed. In order to obtain fast but smooth convergence with maximum possible echo and noise suppression, a set of updating constraints is proposed based on various speech characteristics (e.g., energy and correlation) of the reference and current frames, considering whether they are voiced, unvoiced, or pauses. Extensive experimentation is carried out on several echo- and noise-corrupted natural utterances taken from the TIMIT database, and it is found that the proposed scheme can significantly reduce the effect of both echo and noise in terms of objective and subjective quality measures.
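The spectral subtraction stage of the ANC block can be sketched on magnitude spectra as follows; the over-subtraction factor `alpha` and the fixed floor fraction `beta` are illustrative assumptions, whereas the method above estimates the spectral floor adaptively.

```python
# Sketch: magnitude-domain spectral subtraction with a spectral floor,
# applied to one frame. `alpha` (over-subtraction) and `beta` (floor
# fraction) are illustrative constants.

def spectral_subtract(noisy_mag, noise_mag, alpha=2.0, beta=0.02):
    """Subtract a scaled noise estimate; clamp each bin to a floor."""
    enhanced = []
    for y, n in zip(noisy_mag, noise_mag):
        s = y - alpha * n          # over-subtraction of the noise estimate
        floor = beta * y           # spectral floor masks musical noise
        enhanced.append(s if s > floor else floor)
    return enhanced

noisy = [1.0, 0.8, 0.5, 0.3]       # per-bin magnitudes of a noisy frame
noise = [0.2, 0.3, 0.3, 0.25]      # estimated noise magnitudes
clean_est = spectral_subtract(noisy, noise)
```

Raising `beta` leaves more residual noise but suppresses musical-noise artifacts, which is the trade-off the adaptive floor estimation in the paper is designed to manage per frame.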
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 73
    Publication Date: 2014-04-29
    Description: Neural network language models (NNLMs) have proven to be quite powerful for sequence modeling, including feed-forward NNLMs (FNNLMs), recurrent NNLMs (RNNLMs), etc. One main issue for NNLMs is the heavy computational burden of the output layer, where the output needs to be probabilistically normalized and the normalizing factors require considerable computation. How to rapidly rescore the N-best list or lattice with an NNLM attracts much attention for large-scale applications. In this paper, the statistical characteristics of normalizing factors are investigated on the N-best list. Based on these statistical observations, we propose to approximate the normalizing factor for each hypothesis as a constant proportional to the number of words in the hypothesis. Then, the unnormalized NNLM is investigated and combined with a back-off N-gram for fast rescoring, which can be computed very quickly without normalization in the output layer, reducing the complexity significantly. We apply our proposed method to a well-tuned context-dependent deep neural network hidden Markov model (CD-DNN-HMM) speech recognition system on the English Switchboard phone-call speech-to-text task, where both an FNNLM and an RNNLM are trained to demonstrate our method. Experimental results show that the unnormalized probability of the NNLM is quite complementary to that of the back-off N-gram, and combining the unnormalized NNLM and back-off N-gram can further reduce the word error rate at little computational cost.
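The proposed approximation can be sketched in a few lines: treat each hypothesis's total log normalizing factor as a constant per word, then interpolate the resulting approximate NNLM score with the back-off N-gram score. All numeric values below (`LOG_Z_PER_WORD`, `LAMBDA`, the scores) are made up for illustration.

```python
# Sketch: rescore an N-best hypothesis by combining an *unnormalized*
# NNLM log score with a back-off N-gram log probability, approximating
# the NNLM's log normalizer as a constant per word. Values are made up.

LOG_Z_PER_WORD = 5.0   # assumed constant log-normalizer per word
LAMBDA = 0.5           # interpolation weight between NNLM and N-gram

def rescore(unnorm_nnlm_logp, ngram_logp, num_words):
    """Combine unnormalized NNLM and N-gram scores for one hypothesis."""
    approx_nnlm_logp = unnorm_nnlm_logp - LOG_Z_PER_WORD * num_words
    return LAMBDA * approx_nnlm_logp + (1.0 - LAMBDA) * ngram_logp

# Two competing hypotheses of equal length
h1 = rescore(unnorm_nnlm_logp=-12.0, ngram_logp=-30.0, num_words=5)
h2 = rescore(unnorm_nnlm_logp=-20.0, ngram_logp=-28.0, num_words=5)
best = "h1" if h1 > h2 else "h2"
```

Because the subtracted normalizer depends only on hypothesis length, ranking hypotheses of equal length never requires evaluating the softmax denominator, which is where the claimed speed-up comes from.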
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 74
    Publication Date: 2014-04-28
    Description: Speech enhancement has an increasing demand in mobile communications and faces a great challenge in a real ambient noisy environment. This paper develops an effective spatial-frequency domain speech enhancement method with a single acoustic vector sensor (AVS) in conjunction with minimum variance distortionless response (MVDR) spatial filtering and Wiener post-filtering (WPF) techniques. In remote speech applications, the MVDR spatial filtering is effective in suppressing the strong spatial interferences and the Wiener post-filtering is considered as a popular and powerful estimator to further suppress the residual noise if the power spectral density (PSD) of target speech can be estimated properly. With the favorable directional response of the AVS together with the trigonometric relations of the steering vectors, the closed-form estimation of the signal PSDs is derived and the frequency response of the optimal Wiener post-filter is determined accordingly. Extensive computer simulations and a real experiment in an anechoic chamber condition have been carried out to evaluate the performance of the proposed algorithm. Simulation results show that the proposed method offers good ability to suppress the spatial interference while maintaining comparable log spectral deviation and perceptual evaluation of speech quality performance compared with the conventional methods with several objective measures. Moreover, a single AVS solution is particularly attractive for hands-free speech applications due to its compact size.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 75
    Publication Date: 2014-03-05
    Description: This paper presents an optical music recognition (OMR) system to process the handwritten musical scores of Kunqu Opera written in Gong-Che Notation (GCN). First, it introduces the background of Kunqu Opera and GCN. Kunqu Opera is one of the oldest forms of musical activity, spanning the sixteenth to eighteenth centuries, and GCN has been the most popular notation for recording musical works in China since the seventh century. Many Kunqu Operas that use GCN are available as original manuscripts or photocopies, and transforming these versions into a machine-readable format is a pressing need. The OMR system comprises six stages: image pre-processing, segmentation, feature extraction, symbol recognition, musical semantics, and musical instrument digital interface (MIDI) representation. This paper focuses on the symbol recognition stage and obtains the musical information with Bayesian, genetic algorithm, and K-nearest neighbor classifiers. The experimental results indicate that symbol recognition for Kunqu Opera's handwritten musical scores is effective. This work will help to preserve and popularize Chinese cultural heritage and to store Kunqu Opera scores in a machine-readable format, thereby ensuring the possibility of spreading and performing original Kunqu Opera musical scores.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 76
    Publication Date: 2014-04-29
    Description: When several acoustic sources are simultaneously active in a meeting room scenario, and both the position of the sources and the identity of the time-overlapped sound classes have been estimated, the problem of assigning each source position to one of the sound classes still remains. This problem is found in the real-time system implemented in our smart-room, where it is assumed that up to two acoustic events may overlap in time and the source positions are relatively well separated in space. The position assignment system proposed in this work is based on fusion of model-based log-likelihood ratios obtained after carrying out several different partial source separations in parallel. To perform the separation, frequency-invariant null-steering beamformers, which can work with a small number of microphones, are used. The experimental results using all the six microphone arrays deployed in the room show a high assignment rate in our particular scenario.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 77
    Publication Date: 2014-02-02
    Description: We propose a novel approach that integrates exemplar-based template matching with statistical modeling to improve continuous speech recognition. We choose the template unit to be context-dependent phone segments (triphone context) and use multiple Gaussian mixture model (GMM) indices to represent each frame of the speech templates. We investigate two different local distances, the log likelihood ratio (LLR) and the Kullback-Leibler (KL) divergence, for dynamic time warping (DTW)-based template matching. In order to reduce computation and storage complexities, we also propose two methods for template selection: minimum distance template selection (MDTS) and maximum likelihood template selection (MLTS). We further propose to fine-tune the MLTS template representatives using a GMM merging algorithm so that the GMMs can better represent the frames of the selected template representatives. Experimental results on the TIMIT phone recognition task and a large vocabulary continuous speech recognition (LVCSR) task of telehealth captioning demonstrate that the proposed approach of integrating template matching with statistical modeling significantly improves recognition accuracy over the hidden Markov model (HMM) baselines for both the TIMIT and telehealth tasks. The template selection methods also provide significant accuracy gains over the HMM baseline while greatly reducing the computation and storage complexities. When all templates or MDTS were used, the LLR local distance gave better performance than the KL local distance. For MLTS and template compression, the KL local distance gave better performance than the LLR local distance, and template compression further improved the recognition accuracy on top of MLTS while incurring less computational cost.
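A minimal sketch of the DTW alignment underlying the template matching described above; a plain squared local distance stands in for the LLR and KL local distances used in the paper, and the scalar sequences stand in for GMM-index frame representations.

```python
# Sketch: dynamic time warping between a test sequence and a template.
# A squared scalar distance is an illustrative stand-in for the paper's
# LLR / KL-divergence local distances on frame representations.

def dtw(seq_a, seq_b, dist=lambda x, y: (x - y) ** 2):
    """Return the minimal cumulative alignment cost between two sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

template = [1.0, 2.0, 3.0, 2.0]
test_seq = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0]
score = dtw(test_seq, template)   # 0.0: the test warps exactly onto the template
```

In the paper's setting, each template candidate is scored this way and the local distance compares GMM-based frame representations rather than raw scalars.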
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 78
    Publication Date: 2014-02-02
    Description: We present in this paper a voice conversion (VC) method for a person with an articulation disorder resulting from athetoid cerebral palsy. The movement of such speakers is limited by their athetoid symptoms, and their consonants are often unstable or unclear, which makes it difficult for them to communicate. In this paper, exemplar-based spectral conversion using nonnegative matrix factorization (NMF) is applied to a voice with an articulation disorder. To preserve the speaker's individuality, we used an individuality-preserving dictionary that is constructed from the source speaker's vowels and the target speaker's consonants. Using this dictionary, we can create a natural and clear voice preserving the speaker's individuality. Experimental results indicate that the performance of NMF-based VC is considerably better than conventional GMM-based VC.
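The exemplar-based conversion idea can be sketched as follows: estimate nonnegative activations of a source exemplar dictionary for an input spectrum via standard NMF multiplicative updates, then apply the same activations to a paired target dictionary. The toy dictionaries and input below are illustrative assumptions; real systems use spectral exemplars as dictionary columns.

```python
# Sketch of exemplar-based NMF conversion: solve v ≈ W_src · h with
# h >= 0, then reconstruct with the paired target dictionary W_tgt.
# Toy numbers throughout; columns are paired exemplars.

def nmf_activations(v, W, iters=200):
    """Estimate h >= 0 with v ≈ W h via Euclidean multiplicative updates."""
    k = len(W[0])                       # number of exemplars
    h = [1.0] * k
    eps = 1e-9
    for _ in range(iters):
        wh = [sum(W[i][j] * h[j] for j in range(k)) for i in range(len(v))]
        for j in range(k):
            num = sum(W[i][j] * v[i] for i in range(len(v)))       # (Wᵀv)_j
            den = sum(W[i][j] * wh[i] for i in range(len(v))) + eps  # (WᵀWh)_j
            h[j] *= num / den
    return h

W_src = [[1.0, 0.0],        # source exemplar dictionary (3 bins x 2 exemplars)
         [0.0, 1.0],
         [0.5, 0.5]]
W_tgt = [[0.8, 0.1],        # paired target exemplar dictionary
         [0.2, 0.9],
         [0.4, 0.6]]

v = [2.0, 1.0, 1.5]         # input source spectrum (= 2·col1 + 1·col2)
h = nmf_activations(v, W_src)
converted = [sum(W_tgt[i][j] * h[j] for j in range(len(h)))
             for i in range(len(W_tgt))]
```

Keeping the activations and swapping only the dictionary is what lets the individuality-preserving dictionary in the abstract mix the source speaker's vowels with the target speaker's consonants.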
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 79
    Publication Date: 2013-12-06
    Description: Affective computing, especially from speech, is one of the key steps toward building more natural and effective human-machine interaction. In recent years, several emotional speech corpora in different languages have been collected; however, Turkish is not among the languages that have been investigated in the context of emotion recognition. For this purpose, a new Turkish emotional speech database, which includes 5,100 utterances extracted from 55 Turkish movies, was constructed. Each utterance in the database is labeled with emotion categories (happy, surprised, sad, angry, fearful, neutral, and others) and three-dimensional emotional space (valence, activation, and dominance). We performed classification of four basic emotion classes (neutral, sad, happy, and angry) and estimation of emotion primitives using acoustic features. The importance of acoustic features in estimating the emotion primitive values and in classifying emotions into categories was also investigated. An unweighted average recall of 45.5% was obtained for the classification. For emotion dimension estimation, we obtained promising results for activation and dominance dimensions. For valence, however, the correlation between the averaged ratings of the evaluators and the estimates was low. The cross-corpus training and testing also showed good results for activation and dominance dimensions.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 80
    Publication Date: 2013-12-12
    Description: The framework of a voice conversion system is expected to emphasize both the static and dynamic characteristics of the speech signal. Conventional approaches such as Mel frequency cepstrum coefficients and linear predictive coefficients focus on spectral features limited to lower frequency bands. This paper presents a novel wavelet packet filter bank approach to identify non-uniformly distributed dynamic characteristics of the speaker. The contribution of this paper is threefold. First, in the feature extraction stage, the dyadic wavelet packet tree structure is optimized to involve less computation while preserving the speaker-specific features. Second, in the feature representation step, magnitude and phase attributes are treated separately to account for the fact that raw time-frequency traits are highly correlated yet carry important speech information. Finally, an RBF mapping function is established to transform the speaker-specific features from the source to the target speakers. The results obtained by the proposed filter bank-based voice conversion system are compared to the baseline multiscale voice morphing results using subjective and objective measures. Evaluation results reveal that the proposed method outperforms the baseline by incorporating the speaker-specific dynamic characteristics and phase information of the speech signal.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 81
    Publication Date: 2014-09-17
    Description: Model-based speech enhancement algorithms that employ trained models, such as codebooks, hidden Markov models, Gaussian mixture models, etc., containing representations of speech such as linear predictive coefficients, mel-frequency cepstrum coefficients, etc., have been found to be successful in enhancing noisy speech corrupted by nonstationary noise. However, these models are typically trained on speech data from multiple speakers under controlled acoustic conditions. In this paper, we introduce the notion of context-dependent models that are trained on speech data with one or more aspects of context, such as speaker, acoustic environment, speaking style, etc. In scenarios where the modeled and observed contexts match, context-dependent models can be expected to result in better performance, whereas context-independent models are preferred otherwise. In this paper, we present a Bayesian framework that automatically provides the benefits of both models under varying contexts. As several aspects of the context remain constant over an extended period during usage, a memory-based approach that exploits information from past data is employed. We use a codebook-based speech enhancement technique that employs trained models of speech and noise linear predictive coefficients as an example model-based approach. Using speaker, acoustic environment, and speaking style as aspects of context, we demonstrate the robustness of the proposed framework for different context scenarios, input signal-to-noise ratios, and number of contexts modeled.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 82
    Publication Date: 2014-10-23
    Description: The full modulation spectrum is a high-dimensional representation of one-dimensional audio signals. Most previous research in automatic speech recognition converted this very rich representation into the equivalent of a sequence of short-time power spectra, mainly to simplify the computation of the posterior probability that a frame of an unknown speech signal is related to a specific state. In this paper we use the raw output of a modulation spectrum analyser in combination with sparse coding as a means for obtaining state posterior probabilities. The modulation spectrum analyser uses 15 gammatone filters. The Hilbert envelope of the output of these filters is then processed by nine modulation frequency filters, with bandwidths up to 16 Hz. Experiments using the AURORA-2 task show that the novel approach is promising. We found that the representation of medium-term dynamics in the modulation spectrum analyser must be improved. We also found that we should move towards sparse classification, by modifying the cost function in sparse coding such that the class(es) represented by the exemplars weigh in, in addition to the accuracy with which unknown observations are reconstructed. This creates two challenges: (1) developing a method for dictionary learning that takes the class occupancy of exemplars into account and (2) developing a method for learning a mapping from exemplar activations to state posterior probabilities that keeps the generalization to unseen conditions that is one of the strongest advantages of sparse coding.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 83
    Publication Date: 2014-10-31
    Description: Although the field of automatic speaker or speech recognition has been extensively studied over the past decades, the lack of robustness has remained a major challenge. The missing data technique (MDT) is a promising approach. However, its performance depends on the correlation across frequency bands. This paper presents a new reconstruction method for feature enhancement based on this property. The degree of concentration across frequency bands is measured with principal component analysis (PCA). Through theoretical analysis and experimental results, it is found that the correlation of the feature vector extracted from a sub-band (SB) is much stronger than that of the vector extracted from the full-band (FB). Thus, rather than dealing with the spectral features as a whole, this paper splits the full-band into sub-bands and then individually reconstructs the spectral features extracted from each SB based on MDT. The reconstructed features from all sub-bands are then recombined to yield the conventional mel-frequency cepstral coefficients (MFCC) for recognition experiments. The 2-sub-band reconstruction approach is evaluated in a speaker recognition system. The results show that the proposed approach outperforms full-band reconstruction in terms of recognition performance in all noise conditions. Finally, we discuss the optimal selection of frequency division schemes for the recognition task. When the FB is divided into many more sub-bands, some of the correlations across frequency channels are lost. Consequently, efficient division schemes need to be investigated to further improve recognition performance.
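The PCA-based concentration measure mentioned here can be illustrated as the fraction of total variance captured by the leading eigenvalues of the feature covariance: a more correlated band yields a higher fraction. A minimal sketch, with `variance_concentration` as a hypothetical helper name:

```python
import numpy as np

def variance_concentration(X, k=1):
    """Fraction of total variance captured by the top-k principal
    components of feature matrix X (rows are observations)."""
    Xc = X - X.mean(axis=0)                      # center the features
    cov = np.cov(Xc, rowvar=False)               # feature covariance
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
    return float(eigvals[:k].sum() / eigvals.sum())
```

Comparing this value for sub-band feature vectors against full-band vectors mirrors the kind of analysis the abstract describes.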
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 84
    Publication Date: 2014-08-29
    Description: This paper proposes a new speech enhancement (SE) algorithm that applies constraints to the Wiener gain function and is capable of working at 10 dB and lower signal-to-noise ratios (SNRs). The wavelet-thresholded multitaper spectrum was taken as the clean spectrum for the constraints. The proposed algorithm was evaluated under eight types of noise and seven SNR levels in the NOIZEUS database and was predicted by the composite measures and the SNRLOSS measure to improve subjective quality and speech intelligibility in various noisy environments. Comparisons with two other algorithms (KLT and wavelet thresholding (WT)) demonstrate that in terms of signal distortion, overall quality, and the SNRLOSS measure, our proposed constrained SE algorithm outperforms the KLT and WT schemes for most conditions considered.
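The idea of a constrained Wiener gain can be sketched as below. This is a simplified stand-in, not the paper's method: it uses a plain a-priori-SNR Wiener gain with a fixed lower bound, whereas the paper derives its constraints from a wavelet-thresholded multitaper spectrum; `constrained_wiener_gain` and `gain_floor` are illustrative names.

```python
def constrained_wiener_gain(noisy_psd, noise_psd, gain_floor=0.1):
    """Per-bin Wiener gain with a lower-bound constraint.
    noisy_psd / noise_psd: power spectra of the noisy signal and noise."""
    gains = []
    for py, pn in zip(noisy_psd, noise_psd):
        snr_prio = max(py - pn, 0.0) / pn  # crude a priori SNR estimate
        g = snr_prio / (1.0 + snr_prio)    # classical Wiener gain
        gains.append(max(g, gain_floor))   # constraint: never below the floor
    return gains
```

The floor limits speech distortion at low SNR at the cost of leaving some residual noise, which is the trade-off such constraints negotiate.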
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 85
    Publication Date: 2014-08-30
    Description: This paper studies a novel audio segmentation-by-classification approach based on factor analysis. The proposed technique compensates the within-class variability by using class-dependent factor loading matrices and obtains the scores by computing the log-likelihood ratio for the class model to a non-class model over fixed-length windows. Afterwards, these scores are smoothed to yield longer contiguous segments of the same class by means of different back-end systems. Unlike previous solutions, our proposal does not make use of specific acoustic features and does not need a hierarchical structure. The proposed method is applied to segment and classify audios coming from TV shows into five different acoustic classes: speech, music, speech with music, speech with noise, and others. The technique is compared to a hierarchical system with specific acoustic features achieving a significant error reduction.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 86
    Publication Date: 2014-10-21
    Description: Feature-based vocoders, e.g., STRAIGHT, offer a way to manipulate the perceived characteristics of the speech signal in speech transformation and synthesis. For the harmonic model, which provides excellent perceived quality, features for the amplitude parameters already exist (e.g., Line Spectral Frequencies (LSF), Mel-Frequency Cepstral Coefficients (MFCC)). However, because of the wrapping of the phase parameters, phase features are more difficult to design. To randomize the phase of the harmonic model during synthesis, a voicing feature is commonly used, which distinguishes voiced and unvoiced segments. However, voice production allows smooth transitions between voiced and unvoiced states, which makes the voicing segmentation sometimes tricky to estimate. In this article, two phase features are suggested to represent the phase of the harmonic model in a uniform way, without a voicing decision. The synthesis quality of the resulting vocoder has been evaluated, using subjective listening tests, in the context of resynthesis, pitch scaling, and Hidden Markov Model (HMM)-based synthesis. The experiments show that the suggested signal model is comparable to STRAIGHT or even better in some scenarios. They also reveal some limitations of the harmonic framework itself in the case of high fundamental frequencies.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 87
    Publication Date: 2014-10-21
    Description: Building a voice-operated system for learning disabled users is a difficult task that requires a considerable amount of time and effort. Due to the wide spectrum of disabilities and their different related phonopathies, most approaches available are targeted to a specific pathology. This may improve their accuracy for some users, but makes them unsuitable for others. In this paper, we present a cross-lingual approach to adapt a general-purpose modular speech recognizer for learning disabled people. The main advantage of this approach is that it allows rapid and cost-effective development by taking the already built speech recognition engine and its modules, and utilizing existing resources for standard speech in different languages for the recognition of the users' atypical voices. Although the recognizers built with the proposed technique obtain lower accuracy rates than those trained for specific pathologies, they can be used by a wide population and developed more rapidly, which makes it possible to design various types of speech-based applications accessible to learning disabled users.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 88
    Publication Date: 2014-11-27
    Description: In this paper, we propose a wideband (WB) to super-wideband audio bandwidth extension (BWE) method based on temporal smoothing cepstral coefficients (TSCC). A temporal relationship of audio signals is included in feature extraction in the bandwidth extension frontend to make the temporal evolution of the extended spectra smoother. In the bandwidth extension scheme, a Gammatone auditory filter bank is used to decompose the audio signal, and the energy of each frequency band is long-term smoothed using minima controlled recursive averaging (MCRA) in order to suppress transient components. The resulting 'steady-state' spectrum is processed by frequency weighting, and the temporal smoothing cepstral coefficients are obtained by means of the power-law loudness function and cepstral normalization. The extracted temporal smoothing cepstral coefficients are fed into a Gaussian mixture model (GMM)-based Bayesian estimator to estimate the high-frequency (HF) spectral envelope, while the fine structure is restored by spectral translation. Evaluation results show that the temporal smoothing cepstral coefficients exploit the temporal relationship of audio signals and provide higher mutual information between the low- and high-frequency parameters, without increasing the dimension of input vectors in the frontend of bandwidth extension systems. In addition, the proposed bandwidth extension method is applied to the G.729.1 wideband codec and outperforms the Mel frequency cepstral coefficient (MFCC)-based method in terms of log spectral distortion (LSD), cosh measure, and differential log spectral distortion. Further, the proposed method improves the smoothness of the reconstructed spectrum over time and also performs well in the subjective listening tests.
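The long-term smoothing step can be illustrated as follows. Full MCRA additionally tracks spectral minima to control the smoothing factor; the sketch below shows only the first-order recursive averaging at its core, with `smooth_band_energy` and `alpha` as illustrative names, not the paper's implementation.

```python
def smooth_band_energy(energies, alpha=0.9):
    """First-order recursive smoothing of a band-energy sequence:
    s[n] = alpha * s[n-1] + (1 - alpha) * e[n].
    A simplified stand-in for the MCRA long-term smoothing step."""
    smoothed = []
    s = energies[0]  # initialize the smoother at the first observation
    for e in energies:
        s = alpha * s + (1.0 - alpha) * e
        smoothed.append(s)
    return smoothed
```

With alpha close to 1, short transients barely move the smoothed energy, which is exactly the suppression of transient components the abstract aims for.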
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 89
    Publication Date: 2014-09-09
    Description: This paper studies a novel audio segmentation-by-classification approach based on factor analysis. The proposed technique compensates the within-class variability by using class-dependent factor loading matrices and obtains the scores by computing the log-likelihood ratio for the class model to a non-class model over fixed-length windows. Afterwards, these scores are smoothed to yield longer contiguous segments of the same class by means of different back-end systems. Unlike previous solutions, our proposal does not make use of specific acoustic features and does not need a hierarchical structure. The proposed method is applied to segment and classify audios coming from TV shows into five different acoustic classes: speech, music, speech with music, speech with noise, and others. The technique is compared to a hierarchical system with specific acoustic features achieving a significant error reduction.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 90
    Publication Date: 2015-03-26
    Description: Automatic diagnosis and monitoring of Alzheimer’s disease can have a significant impact on society as well as the well-being of patients. The part of the brain cortex that processes language abilities is one of the earliest parts to be affected by the disease. Therefore, detection of Alzheimer’s disease using speech-based features is gaining increasing attention. Here, we investigated an extensive set of features based on speech prosody as well as linguistic features derived from transcriptions of Turkish conversations with subjects with and without Alzheimer’s disease. Unlike most standardized tests that focus on memory recall or structured conversations, spontaneous unstructured conversations are conducted with the subjects in informal settings. Age-, education-, and gender-controlled experiments are performed to eliminate the effects of those three variables. Experimental results show that the proposed features extracted from the speech signal can be used to discriminate between the control group and the patients with Alzheimer’s disease. Prosodic features performed significantly better than the linguistic features. Classification accuracy over 80% was obtained with three of the prosodic features, but experiments with feature fusion did not further improve the classification performance.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 91
    Publication Date: 2014-10-15
    Description: The task of automatic retrieval and extraction of lyrics from the web is of great importance to different Music Information Retrieval applications. However, despite its importance, very little research has been carried out for this task. For this reason, in this paper, we present the Ethnic Lyrics Fetcher (ELF), a tool for Music Information Researchers, which has a novel lyrics detection and extraction mechanism. We performed two experiments to evaluate ELF’s lyrics extraction mechanism and its performance as a lyrics fetcher tool. Our experimental results show that ELF performs better than the current state of the art, using website-specific lyrics fetchers or manually retrieving the lyrics from the web.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 92
    Publication Date: 2014-10-11
    Description: In this paper, we propose a semi-blind, imperceptible, and robust digital audio watermarking algorithm. The proposed algorithm is based on cascading two well-known transforms: the discrete wavelet transform and the singular value decomposition. The two transforms provide different, but complementary, levels of robustness against watermarking attacks. The uniqueness of the proposed algorithm is twofold: the distributed formation of the wavelet coefficient matrix and the selection of the off-diagonal positions of the singular value matrix for embedding watermark bits. Imperceptibility, robustness, and high data payload of the proposed algorithm are demonstrated using different musical clips.
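The off-diagonal embedding idea can be sketched in the SVD domain alone (the paper cascades a wavelet transform first, which is omitted here). The embedding positions, strength `alpha`, and the use of the original U and Vt as semi-blind side information are illustrative choices, not the paper's exact scheme.

```python
import numpy as np

def embed_bits_offdiag(block, bits, alpha=0.01):
    """Embed watermark bits into off-diagonal entries of the
    singular-value matrix of a coefficient block (illustrative)."""
    U, s, Vt = np.linalg.svd(block)
    S = np.zeros_like(block)
    np.fill_diagonal(S, s)
    for k, bit in enumerate(bits):   # place bits just above the diagonal
        S[k, k + 1] = alpha if bit else 0.0
    # U and Vt act as the semi-blind side information for extraction.
    return U @ S @ Vt, (U, Vt)

def extract_bits_offdiag(watermarked, side_info, n_bits, alpha=0.01):
    """Map the block back to the SVD domain and threshold the
    off-diagonal positions to recover the bits."""
    U, Vt = side_info
    S = U.T @ watermarked @ Vt.T
    return [1 if S[k, k + 1] > alpha / 2 else 0 for k in range(n_bits)]
```

Keeping `alpha` small trades robustness for imperceptibility, which is the central tension such watermarking schemes balance.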
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 93
    Publication Date: 2014-06-20
    Description: Composers may not provide instructions for playing their works, especially for instrument solos; therefore, different musicians may give very different interpretations of the same work. Such differences usually lead to time, amplitude, or frequency variations of the musical notes in a phrase from the signal point of view. This paper proposes a frame-based recursive regularization method for time-dependent analysis of each note present in solo violin recordings. The system of equations evolves as a new frame is added and an old frame is dropped to track the varying characteristics of violin playing. This method is compared with a time-dependent non-negative matrix factorization method. The complete recordings of both BWV 1005 No. 3 played by Kuijken and 24 Caprices op. 1 no. 24 in A minor played by Paganini are used for the transcription experiment, where the proposed method performs strongly. The analysis results of a short passage extracted from BWV 1005 No. 3 performed by three famous violinists reveal numerous differences in the styles and performances of these violinists.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 94
    Publication Date: 2014-07-15
    Description: In this paper, unsupervised learning is used to separate percussive and harmonic sounds from monaural non-vocal polyphonic signals. Our algorithm is based on a modified non-negative matrix factorization (NMF) procedure in which no labeled data are required to distinguish between percussive and harmonic bases, because information from percussive and harmonic sounds is integrated into the decomposition process. NMF is performed in this process by assuming that harmonic sounds exhibit spectral sparseness (narrowband sounds) and temporal smoothness (steady sounds), whereas percussive sounds exhibit spectral smoothness (broadband sounds) and temporal sparseness (transient sounds). The evaluation is performed using several real-world excerpts from different musical genres. Comparing the developed approach to three current state-of-the-art separation systems produces promising results.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 95
    Publication Date: 2014-07-17
    Description: In this paper, we advocate the use of the uncompressed form of i-vector and depend on subspace modeling using probabilistic linear discriminant analysis (PLDA) in handling the speaker and session (or channel) variability. An i-vector is a low-dimensional vector containing both speaker and channel information acquired from a speech segment. When PLDA is used on an i-vector, dimension reduction is performed twice: first in the i-vector extraction process and second in the PLDA model. Keeping the full dimensionality of the i-vector in the i-supervector space for PLDA modeling and scoring would avoid unnecessary loss of information. We refer to the uncompressed i-vector as the i-supervector. The drawback in using the i-supervector with PLDA is the inversion of large matrices in the estimation of the full posterior distribution, which we show can be solved rather efficiently by partitioning large matrices into smaller blocks. We also introduce the Gaussianized rank-norm, as an alternative to whitening, for feature normalization prior to PLDA modeling. We found that the i-supervector performs better during normalization. A better performance is obtained by combining the i-supervector and i-vector at the score level. Furthermore, we also analyze the computational complexity of the i-supervector system, compared with that of the i-vector, at four different stages of loading matrix estimation, posterior extraction, PLDA modeling, and PLDA scoring.
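Partitioning a large matrix into blocks lets its inverse be assembled from two smaller inversions via the Schur complement, which is the general trick behind making the large posterior-covariance inversions tractable. A generic sketch (not the paper's specific posterior equations; `block_inverse` is a hypothetical name):

```python
import numpy as np

def block_inverse(M, k):
    """Invert M = [[A, B], [C, D]] via the Schur complement of its
    top-left k x k block A, replacing one large inversion by two
    smaller ones."""
    A, B = M[:k, :k], M[:k, k:]
    C, D = M[k:, :k], M[k:, k:]
    A_inv = np.linalg.inv(A)
    S = D - C @ A_inv @ B              # Schur complement of A
    S_inv = np.linalg.inv(S)
    top_left = A_inv + A_inv @ B @ S_inv @ C @ A_inv
    top_right = -A_inv @ B @ S_inv
    bottom_left = -S_inv @ C @ A_inv
    return np.block([[top_left, top_right], [bottom_left, S_inv]])
```

When A or S has exploitable structure (e.g., diagonal or block-diagonal), the saving over a direct full inversion can be substantial.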
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 96
    Publication Date: 2014-02-07
    Description: Speech production and speech perception studies were conducted to compare (de)voicing in the Romance languages European Portuguese (EP) and Italian. For the speech production part, velar stops in two positions and four vowel contexts were recorded. The voicing status for 10 consecutive landmarks during stop closure was computed. Results showed that during the complete stop, closure voicing was always maintained for Italian, and that for EP, there was strong devoicing for all vowel contexts and positions. Both language and vowel context had a significant effect on voicing during stop closure. The duration values and voicing patterns from the production study were then used as input factors to a follow-up perceptual experiment to test the effects of vowel duration, stop duration and voicing maintenance on voicing perception by EP and Italian listeners. Perceptual stimuli (VCV) were generated using biomechanical modelling so that they would include physically realistic transitions between phonemes. The consonants were velar stops, with no burst or noise included in the signal. A strong language dependency of the three factors on listeners' voicing distinction was found, with high sensitivity for all three cues for EP listeners and low sensitivity for Italian listeners. For EP stimuli with high voicing maintenance during stop closure, this cue was very strong and overrode the other two acoustic cues. However, for stimuli with low voicing maintenance (i.e. highly devoiced stimuli), the acoustic cues vowel duration and stop duration take over. Even in the absence of both voicing maintenance during stop closure and a burst, the acoustic cues vowel duration and stop duration guaranteed stable voicing distinction in EP. Italian listeners were insensitive to all three acoustic cues examined in this study, with stable voiced responses throughout all of the varying fully crossed factors. 
None of the examined acoustic cues appeared to be used by Italian listeners to obtain a robust voicing distinction, thus pointing to the use of other acoustic cues or combination of other cues to guarantee stable voicing distinction in this language.
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 97
    Publication Date: 2014-03-25
    Description: Three-dimensional (3D) audio technologies are booming with the success of 3D video technology. The surge in audio channels makes the data volume prohibitive for transmission bandwidth and storage media, so signal compression for 3D audio systems becomes an important task. This paper investigates the conventional mid/side (M/S) coding method and discusses the signal correlation property of three-dimensional multichannel systems. Then, based on the channel triple, a three-channel dependent M/S coding (3D-M/S) method is proposed to reduce interchannel redundancy, and the corresponding transform matrices are presented. Furthermore, a framework is proposed to enable 3D-M/S to compress any number of audio channels. Finally, the masking threshold of the perceptual audio core codec is modified, which guarantees that the final coding noise meets the perceptual threshold constraint of the original channel signals. Objective and subjective tests with panning signals indicate an increase in coding efficiency compared to independent channel coding and a moderate complexity increase compared to a PCA method.
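For reference, the conventional two-channel M/S transform that the paper generalizes looks as follows; the three-channel (3D-M/S) matrices are the paper's own contribution and are not reproduced here.

```python
def ms_encode(left, right):
    """Classic mid/side transform: mid carries the common content,
    side the interchannel difference (the redundancy to be reduced)."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Exact inverse of ms_encode."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

For strongly correlated channel pairs, the side signal is near zero and cheap to code, which is the same redundancy-reduction principle 3D-M/S applies to channel triples.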
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 98
    Publication Date: 2014-03-12
    Description: An approach is proposed for creating location-specific audio textures for virtual location-exploration services. The presented approach creates audio textures by processing a small amount of audio recorded at a given location, providing a cost-effective way to produce a versatile audio signal that characterizes the location. The resulting texture is non-repetitive and conserves the location-specific characteristics of the audio scene, without the need of collecting a large amount of audio from each location. The method consists of two stages: analysis and synthesis. In the analysis stage, the source audio recording is segmented into homogeneous segments. In the synthesis stage, the audio texture is created by randomly drawing segments from the source audio so that the consecutive segments will have timbral similarity near the segment boundaries. Results obtained in listening experiments show that there is no statistically significant difference in the audio quality or location-specificity of audio when the created audio textures are compared to excerpts of the original recordings. Therefore, the proposed audio textures could be utilized in virtual location-exploration services. Examples of source signals and audio textures created from them are available at www.cs.tut.fi/~heittolt/audiotexture.
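The synthesis stage can be sketched as random segment drawing with a boundary-similarity check. The candidate-pool size, similarity function, and best-of-pool acceptance rule below are illustrative choices, not the paper's; `synthesize_texture` is a hypothetical name.

```python
import random

def synthesize_texture(segments, n_out, sim, seed=0):
    """Build a texture by repeatedly drawing a few random candidate
    segments and appending the one whose boundary is most similar
    (per sim(prev, cand) -> float) to the end of the previous segment."""
    rng = random.Random(seed)
    out = [rng.choice(segments)]
    while len(out) < n_out:
        best = max((rng.choice(segments) for _ in range(5)),
                   key=lambda cand: sim(out[-1], cand))
        out.append(best)
    return out
```

In a real system each "segment" would be an audio chunk and `sim` a timbral distance between boundary frames (e.g., negated MFCC distance); here segments are stand-in scalars.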
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 99
    Publication Date: 2014-04-08
    Description: Decision tree-clustered context-dependent hidden semi-Markov models (HSMMs) are typically used in statistical parametric speech synthesis to represent probability densities of acoustic features given contextual factors. This paper addresses three major limitations of this decision tree-based structure: (i) the decision tree structure lacks adequate context generalization; (ii) it is unable to express complex context dependencies; (iii) parameters generated from this structure represent sudden transitions between adjacent states. In order to alleviate the above limitations, many earlier papers applied multiple decision trees with an additive assumption over those trees. Similarly, the current study uses multiple decision trees as well, but instead of the additive assumption, it proposes to train the smoothest distribution by maximizing an entropy measure. Increasing the smoothness of the distribution improves the context generalization. The proposed model, named hidden maximum entropy model (HMEM), estimates a distribution that maximizes entropy subject to multiple moment-based constraints. Due to the simultaneous use of multiple decision trees and the maximum entropy measure, the three aforementioned issues are considerably alleviated. Relying on HMEM, a novel speech synthesis system has been developed with maximum likelihood (ML) parameter re-estimation as well as maximum output probability parameter generation. Additionally, an effective and fast algorithm that builds multiple decision trees in parallel is devised. Two sets of experiments have been conducted to evaluate the performance of the proposed system. In the first set of experiments, HMEM with some heuristic context clusters is implemented. This system outperformed the decision tree structure in small training databases (i.e., 50, 100, and 200 sentences).
In the second set of experiments, the HMEM performance with four parallel decision trees is investigated using both subjective and objective tests. All evaluation results of the second experiment confirm significant improvement of the proposed system over the conventional HSMM.
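    The maximum entropy principle the abstract relies on can be illustrated with a minimal Python sketch (not the paper's HMEM implementation; the discrete support, the single mean constraint, and the bisection solver are all illustrative assumptions). The maximum-entropy distribution subject to moment constraints has exponential-family form, here p(x) ∝ exp(λx), and the sketch fits λ so that the mean constraint is met:

    ```python
    import math

    # Illustrative sketch: maximum-entropy distribution over a discrete
    # support subject to one moment-based constraint, E[x] = target_mean.
    # The maxent solution is p(x) proportional to exp(lam * x); we fit the
    # Lagrange multiplier lam by bisection on the monotone mean function.

    support = [0, 1, 2, 3, 4]
    target_mean = 1.5  # assumed moment constraint

    def mean_for(lam):
        """Mean of the exponential-family distribution with parameter lam."""
        weights = [math.exp(lam * x) for x in support]
        z = sum(weights)
        return sum(x * w for x, w in zip(support, weights)) / z

    # mean_for is monotonically increasing in lam, so bisection converges.
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2

    # Normalized maximum-entropy probabilities satisfying the constraint.
    probs = [math.exp(lam * x) for x in support]
    z = sum(probs)
    probs = [p / z for p in probs]
    ```

    With several moment constraints, as in the paper, the same exponential form carries one multiplier per constraint and a multivariate solver replaces the bisection.
    
    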
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer
  • 100
    Publication Date: 2014-04-06
    Description: Eigenphone-based speaker adaptation outperforms conventional maximum likelihood linear regression (MLLR) and eigenvoice methods when there is sufficient adaptation data. However, it suffers from severe over-fitting when only a few seconds of adaptation data are provided. In this paper, various regularization methods are investigated to obtain a more robust speaker-dependent eigenphone matrix estimation. Element-wise l1 norm regularization (known as lasso) encourages the eigenphone matrix to be sparse, which reduces the number of effective free parameters and improves generalization. Squared l2 norm regularization promotes an element-wise shrinkage of the estimated matrix towards zero, thus alleviating over-fitting. Column-wise unsquared l2 norm regularization (known as group lasso) acts like the lasso at the column level, encouraging column sparsity in the eigenphone matrix, i.e., preferring an eigenphone matrix with many zero columns as the solution. Each column corresponds to an eigenphone, which is a basis vector of the phone variation subspace. Thus, group lasso tries to prevent the dimensionality of the subspace from growing beyond what is necessary. For nonzero columns, group lasso acts like a squared l2 norm regularization with an adaptive weighting factor at the column level. Two combinations of these methods are also investigated, namely elastic net (applying l1 and squared l2 norms simultaneously) and sparse group lasso (applying l1 and column-wise unsquared l2 norms simultaneously). Furthermore, a simplified method for estimating the eigenphone matrix in the case of diagonal covariance matrices is derived, and a unified framework for solving various regularized matrix estimation problems is presented. Experimental results show that these methods improve the adaptation performance substantially, especially when the amount of adaptation data is limited. The best results are obtained with the sparse group lasso method, which combines the advantages of both the lasso and group lasso methods. Using speaker-adaptive training, performance can be further improved.
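    The penalty terms compared in the abstract can be sketched in a few lines of Python (the toy matrix M and the weights lam1, lam2, lam_g are hypothetical, not taken from the paper's experimental setup):

    ```python
    import math

    # Toy "eigenphone matrix": rows index phones, columns index eigenphones.
    M = [[0.5, 0.0, -1.0],
         [0.0, 0.0,  2.0]]

    def lasso(mat):
        """Element-wise l1 norm: sum of absolute values of all entries."""
        return sum(abs(v) for row in mat for v in row)

    def sq_l2(mat):
        """Squared l2 (Frobenius) norm: sum of squared entries."""
        return sum(v * v for row in mat for v in row)

    def group_lasso(mat):
        """Column-wise unsquared l2 norms: sum over columns of each
        column's Euclidean length; zeroes out whole columns when minimized."""
        ncols = len(mat[0])
        return sum(math.sqrt(sum(row[j] ** 2 for row in mat))
                   for j in range(ncols))

    # Hypothetical regularization weights for the two combined penalties.
    lam1, lam2, lam_g = 0.1, 0.01, 0.05
    elastic_net = lam1 * lasso(M) + lam2 * sq_l2(M)
    sparse_group_lasso = lam1 * lasso(M) + lam_g * group_lasso(M)
    ```

    Minimizing the data likelihood plus one of these penalties yields, respectively, element sparsity (lasso), shrinkage (squared l2), or column sparsity (group lasso); the two combinations trade these effects off.
    
    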
    Print ISSN: 1687-4714
    Topics: Electrical Engineering, Measurement and Control Technology
    Published by Springer