Dobrišek S, Štruc V, Križaj J and Mihelič F (2015), "Face recognition in the wild with the probabilistic Gabor-Fisher classifier", In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on. Ljubljana, Slovenia, May, 2015. Vol. 02, pp. 1-6. IEEE.
Abstract: The paper addresses the problem of face recognition in the wild. It introduces a novel approach to unconstrained face recognition that exploits Gabor magnitude features and a simplified version of the probabilistic linear discriminant analysis (PLDA). The novel approach, named Probabilistic Gabor-Fisher Classifier (PGFC), first extracts a vector of Gabor magnitude features from the given input image using a battery of Gabor filters, then reduces the dimensionality of the extracted feature vector by projecting it into a low-dimensional subspace and finally produces a representation suitable for identity inference by applying PLDA to the projected feature vector. The proposed approach extends the popular Gabor-Fisher Classifier (GFC) to a probabilistic setting and thus improves on the generalization capabilities of the GFC method. The PGFC technique is assessed in face verification experiments on the Point and Shoot Face Recognition Challenge (PaSC) database, which features real-world videos of subjects performing everyday tasks. Experimental results on this challenging database show the feasibility of the proposed approach, which improves on the best results reported in the literature for this database at the time of writing.
BibTeX:
@conference{Dobrisek2015,
  author = {Simon Dobri\v{s}ek and Vitomir \v{S}truc and Janez Kri\v{z}aj and France Miheli\v{c}},
  title = {Face recognition in the wild with the probabilistic Gabor-Fisher classifier},
  booktitle = {Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on},
  publisher = {IEEE},
  year = {2015},
  volume = {02},
  pages = {1-6},
  url = {http://dx.doi.org/10.1109/FG.2015.7284835},
  doi = {10.1109/FG.2015.7284835}
}
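The PGFC front end described above lends itself to a compact illustration. The following sketch (not the authors' code) computes Gabor magnitude features with OpenCV and projects them into a low-dimensional subspace with PCA; the filter-bank parameters, downsampling step and subspace dimension are illustrative assumptions, and the PLDA stage that performs the actual identity inference is omitted.
Example (Python):
import cv2
import numpy as np
from sklearn.decomposition import PCA

def gabor_magnitude_vector(gray, scales=(4, 8, 16), orientations=8, step=4):
    """Stack downsampled Gabor magnitude responses into one feature vector."""
    feats = []
    for lam in scales:                        # carrier wavelength per scale
        for k in range(orientations):
            theta = k * np.pi / orientations
            # real and imaginary filter parts via two phase offsets
            k_re = cv2.getGaborKernel((31, 31), 4.0, theta, lam, 0.5, psi=0)
            k_im = cv2.getGaborKernel((31, 31), 4.0, theta, lam, 0.5, psi=np.pi / 2)
            re = cv2.filter2D(gray, cv2.CV_32F, k_re)
            im = cv2.filter2D(gray, cv2.CV_32F, k_im)
            mag = np.sqrt(re ** 2 + im ** 2)  # Gabor magnitude response
            feats.append(mag[::step, ::step].ravel())
    return np.concatenate(feats)

# Toy usage: project the high-dimensional Gabor vectors into a subspace in
# which a PLDA model (not shown) would perform the identity inference.
faces = np.random.rand(20, 64, 64).astype(np.float32)  # stand-in face crops
X = np.stack([gabor_magnitude_vector(f) for f in faces])
low_dim = PCA(n_components=16).fit_transform(X)         # input to PLDA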
Justin T, Štruc V, Dobrišek S, Vesnicer B, Ipšić I and Mihelič F (2015), "Speaker de-identification using diphone recognition and speech synthesis", In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on. Ljubljana, Slovenia, May, 2015. Vol. 04, pp. 1-7. IEEE.
Abstract: The paper addresses the problem of speaker (or voice) de-identification by presenting a novel approach for concealing the identity of speakers in their speech. The proposed technique first recognizes the input speech with a diphone recognition system and then transforms the obtained phonetic transcription into the speech of another speaker with a speech synthesis system. Due to the fact that a Diphone RecOgnition step and a sPeech SYnthesis step are used during the de-identification, we refer to the developed technique as DROPSY. With this approach the acoustical models of the recognition and synthesis modules are completely independent of each other, which ensures the highest level of input speaker de-identification. The proposed DROPSY-based de-identification approach is language dependent, text independent and capable of running in real-time due to the relatively simple computing methods used. When designing speaker de-identification technology two requirements are typically imposed on the de-identification techniques: i) it should not be possible to establish the identity of the speakers based on the de-identified speech, and ii) the processed speech should still sound natural and be intelligible. This paper, therefore, implements the proposed DROPSY-based approach with two different speech synthesis techniques (i.e., with the HMM-based and the diphone TD-PSOLA-based technique). The obtained de-identified speech is evaluated for intelligibility and assessed in speaker verification experiments with a state-of-the-art (i-vector/PLDA) speaker recognition system. The comparison of both speech synthesis modules integrated in the proposed method reveals that both can efficiently de-identify the input speakers while still producing intelligible speech.
BibTeX:
@conference{Justin2015,
  author = {Tadej Justin and Vitomir \v{S}truc and Simon Dobri\v{s}ek and Bo\v{s}tjan Vesnicer and Ivo Ip\v{s}i\'{c} and France Miheli\v{c}},
  title = {Speaker de-identification using diphone recognition and speech synthesis},
  booktitle = {Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on},
  publisher = {IEEE},
  year = {2015},
  volume = {04},
  pages = {1-7},
  url = {http://dx.doi.org/10.1109/FG.2015.7285021},
  doi = {10.1109/FG.2015.7285021}
}
Štruc V, Križaj J and Dobrišek S (2015), "Modest face recognition", In Biometrics and Forensics (IWBF), 2015 International Workshop on. Gjøvik, Norway, March, 2015, pp. 1-6. IEEE.
Abstract: The facial imagery usually at the disposal for forensics investigations is commonly of a poor quality due to the unconstrained settings in which it was acquired. The captured faces are typically non-frontal, partially occluded and of a low resolution, which makes the recognition task extremely difficult. In this paper we try to address this problem by presenting a novel framework for face recognition that combines diverse feature sets (Gabor features, local binary patterns, local phase quantization features and pixel intensities), probabilistic linear discriminant analysis (PLDA) and data fusion based on linear logistic regression. With the proposed framework a matching score for the given pair of probe and target images is produced by applying PLDA on each of the four feature sets independently - producing a (partial) matching score for each of the PLDA-based feature vectors - and then combining the partial matching results at the score level to generate a single matching score for recognition. We make two main contributions in the paper: i) we introduce a novel framework for face recognition that relies on probabilistic MOdels of Diverse fEature SeTs (MODEST) to facilitate the recognition process and ii) benchmark it against the existing state-of-the-art. We demonstrate the feasibility of our MODEST framework on the FRGCv2 and PaSC databases and present comparative results with the state-of-the-art recognition techniques, which demonstrate the efficacy of our framework.
BibTeX:
@conference{Struc2015,
  author = {Vitomir \v{S}truc and Janez Kri\v{z}aj and Simon Dobri\v{s}ek},
  title = {Modest face recognition},
  booktitle = {Biometrics and Forensics (IWBF), 2015 International Workshop on},
  publisher = {IEEE},
  year = {2015},
  pages = {1-6},
  url = {http://dx.doi.org/10.1109/IWBF.2015.7110235},
  doi = {10.1109/IWBF.2015.7110235}
}
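The score-level fusion step of the MODEST framework can be illustrated in a few lines. The sketch below (synthetic scores, not the authors' code) trains a linear logistic regression fuser on the four partial PLDA matching scores and outputs a single fused matching score per trial.
Example (Python):
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Four partial matching scores (Gabor, LBP, LPQ, intensities) per trial;
# the score distributions here are synthetic stand-ins.
genuine = rng.normal(1.0, 0.5, size=(200, 4))
impostor = rng.normal(-1.0, 0.5, size=(200, 4))
scores = np.vstack([genuine, impostor])
labels = np.concatenate([np.ones(200), np.zeros(200)])

fuser = LogisticRegression().fit(scores, labels)
# The fused matching score is the linear combination (log-odds) of the
# partial scores; thresholding it yields the verification decision.
fused = fuser.decision_function(scores)
print(fused[:3])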
Justin T, Mihelič F and Dobrišek S (2014), "Intelligibility assessment of the de-identified speech obtained using phoneme recognition and speech synthesis systems", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 8655 LNAI, pp. 529-536. Springer Verlag.
Abstract: The paper presents and evaluates a speaker de-identification technique using speech recognition and two speech synthesis techniques. The phoneme recognition system is built using HMM-based acoustical models of context-dependent diphone speech units, and two different speech synthesis systems (diphone TD-PSOLA-based and HMM-based) are employed for re-synthesizing the recognized sequences of speech units. Since the acoustical models of the two speech synthesis systems are assumed to be completely independent of the input speaker's voice, the highest level of input speaker de-identification is ensured. The proposed de-identification system is considered to be language dependent, but is, however, vocabulary and speaker independent since it is based mainly on acoustical modelling of the selected diphone speech units. Due to the relatively simple computing methods, the whole de-identification procedure runs in real-time. The speech outputs are compared and assessed by testing the intelligibility of the re-synthesized speech from different points of view. The assessment results reveal interesting variability in the evaluators' transcriptions depending on the input speaker, the synthesis method applied and the evaluators' capabilities. However, in spite of the relatively high phoneme recognition error rate (approx. 19%), the re-synthesized speech is in many cases still fully intelligible.
BibTeX:
@article{Justin2014,
  author = {Tadej Justin and France Miheli\v{c} and Simon Dobri\v{s}ek},
  title = {Intelligibility assessment of the de-identified speech obtained using phoneme recognition and speech synthesis systems},
  journal = {Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)},
  publisher = {Springer Verlag},
  year = {2014},
  volume = {8655 LNAI},
  pages = {529-536},
  url = {http://link.springer.com/chapter/10.1007%2F978-3-319-10816-2_64},
  doi = {10.1007/978-3-319-10816-2_64}
}
Križaj J, Štruc V, Dobrišek S, Marčetić D and Ribarić S (2014), "SIFT vs. FREAK: Assessing the usefulness of two keypoint descriptors for 3D face verification", In Proceedings of the 37th International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO 2014. Opatija, Croatia, May, 2014, pp. 1336-1341. IEEE Computer Society.
Abstract: Many techniques in the area of 3D face recognition rely on local descriptors to characterize the surface-shape information around points of interest (or keypoints) in the 3D images. Despite the considerable advancements made in the area of keypoint descriptors over recent years, the literature on 3D face recognition for the most part still focuses on established descriptors, such as SIFT and SURF, and largely neglects more recent descriptors, such as the FREAK descriptor. In this paper we try to bridge this gap and assess the usefulness of the FREAK descriptor for the task of 3D face recognition. Of particular interest to us is a direct comparison of the FREAK and SIFT descriptors within a simple verification framework. To evaluate our framework with the two descriptors, we conduct 3D face recognition experiments on the challenging FRGCv2 and UMB-DB databases and show that the FREAK descriptor ensures a very competitive verification performance when compared to the SIFT descriptor, but at a fraction of the computational cost. Our results indicate that the FREAK descriptor is a viable alternative to the SIFT descriptor for the problem of 3D face verification and, due to its binary nature, is particularly useful for real-time recognition systems and verification techniques for low-resource devices such as mobile phones, tablets and the like.
BibTeX:
@conference{Krizaj2014,
  author = {Janez Kri\v{z}aj and Vitomir \v{S}truc and Simon Dobri\v{s}ek and Darijan Mar\v{c}eti\'{c} and Slobodan Ribari\'{c}},
  title = {SIFT vs. FREAK: Assessing the usefulness of two keypoint descriptors for 3D face verification},
  booktitle = {Proceedings of the 37th International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO 2014},
  publisher = {IEEE Computer Society},
  year = {2014},
  pages = {1336-1341},
  url = {http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6859775},
  doi = {10.1109/MIPRO.2014.6859775}
}
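A minimal OpenCV comparison in the spirit of the paper is sketched below (requires opencv-contrib-python for FREAK). The image paths are placeholders, the keypoints are shared between the two descriptors for a fair comparison, and the mean match distance is only a stand-in for the paper's verification framework.
Example (Python):
import cv2
import numpy as np

probe = cv2.imread("probe_range_image.png", cv2.IMREAD_GRAYSCALE)
target = cv2.imread("target_range_image.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
freak = cv2.xfeatures2d.FREAK_create()          # binary descriptor

kp_p, kp_t = sift.detect(probe, None), sift.detect(target, None)
_, sift_p = sift.compute(probe, kp_p)            # float descriptors, L2 metric
_, sift_t = sift.compute(target, kp_t)
_, freak_p = freak.compute(probe, kp_p)          # FREAK may drop border keypoints
_, freak_t = freak.compute(target, kp_t)

# SIFT is matched under an L2 metric, FREAK under a much cheaper Hamming metric.
m_sift = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(sift_p, sift_t)
m_freak = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(freak_p, freak_t)
print(np.mean([m.distance for m in m_sift]),
      np.mean([m.distance for m in m_freak]))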
Vesnicer B, Žganec-Gros J, Dobrišek S and Štruc V (2014), "Incorporating Duration Information into I-Vector-Based Speaker-Recognition Systems", In The Speaker and Language Recognition Workshop, Odyssey 2014. Joensuu, Finland, June, 2014, pp. 241-248.
Abstract: Most of the existing literature on i-vector-based speaker recognition focuses on recognition problems, where i-vectors are extracted from speech recordings of sufficient length. The majority of modeling/recognition techniques therefore simply ignore the fact that the i-vectors are most likely estimated unreliably when short recordings are used for their computation. Only recently have a number of solutions been proposed in the literature to address the problem of duration variability, all treating the i-vector as a random variable whose posterior distribution can be parameterized by the posterior mean and the posterior covariance. In this setting the covariance matrix serves as a measure of uncertainty that is related to the length of the available recording. In contrast to these solutions, we address the problem of duration variability through weighted statistics. We demonstrate in the paper how established feature transformation techniques regularly used in the area of speaker recognition, such as PCA or WCCN, can be modified to take duration into account. We evaluate our weighting scheme in the scope of the i-vector challenge organized as part of the Odyssey 2014 Speaker and Language Recognition Workshop and achieve a minimal DCF of 0.280, which at the time of writing puts our approach in third place among all the participating institutions.
BibTeX:
@conference{Vesnicer2014,
  author = {Bo\v{s}tjan Vesnicer and Jerneja \v{Z}ganec-Gros and Simon Dobri\v{s}ek and Vitomir \v{S}truc},
  title = {Incorporating Duration Information into I-Vector-Based Speaker-Recognition Systems},
  booktitle = {The Speaker and Language Recognition Workshop, Odyssey 2014},
  year = {2014},
  pages = {241-248},
  url = {http://cs.uef.fi/odyssey2014/program/pdfs/41.pdf}
}
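The core idea of the weighting scheme, estimating transformation statistics with per-recording duration weights, can be sketched in a few lines of numpy. The sketch below uses toy data and an assumed weighting rule; it computes a duration-weighted PCA, and WCCN would be modified analogously by weighting the within-class scatter.
Example (Python):
import numpy as np

rng = np.random.default_rng(0)
ivectors = rng.normal(size=(1000, 400))          # toy i-vectors
durations = rng.uniform(2.0, 120.0, size=1000)   # seconds of speech per i-vector

w = durations / durations.sum()                  # normalized duration weights
mean = w @ ivectors                               # weighted mean
centered = ivectors - mean
cov = (centered * w[:, None]).T @ centered        # weighted covariance
eigval, eigvec = np.linalg.eigh(cov)              # ascending eigenvalues
proj = eigvec[:, ::-1][:, :200]                   # top-200 weighted principal axes
projected = centered @ proj                       # duration-aware PCA projection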
Gajšek R, Mihelič F and Dobrišek S (2013), "Speaker state recognition using an HMM-based feature extraction method", Computer Speech & Language. Vol. 27(1), pp. 135-150.
Abstract: In this article we present an efficient approach to modeling the acoustic features for the tasks of recognizing various paralinguistic phenomena. Instead of the standard scheme of adapting the Universal Background Model (UBM), represented by the Gaussian Mixture Model (GMM), normally used to model the frame-level acoustic features, we propose to represent the UBM by building a monophone-based Hidden Markov Model (HMM). We present two approaches: transforming the monophone-based segmented HMM-UBM to a GMM-UBM and proceeding with the standard adaptation scheme, or performing the adaptation directly on the HMM-UBM. Both approaches give superior results to the standard adaptation scheme (GMM-UBM) in both the emotion recognition task and the alcohol detection task. Furthermore, with the proposed method we were able to achieve better results than the current state-of-the-art systems in both tasks.
BibTeX:
@article{Gajsek2013,
  author = {Rok Gaj\v{s}ek and France Miheli\v{c} and Simon Dobri\v{s}ek},
  title = {Speaker state recognition using an HMM-based feature extraction method},
  journal = {Computer Speech \& Language},
  year = {2013},
  volume = {27},
  number = {1},
  pages = {135-150},
  note = {Special issue on Paralinguistics in Naturalistic Speech and Language },
  url = {http://www.sciencedirect.com/science/article/pii/S0885230812000095},
  doi = {10.1016/j.csl.2012.01.007}
}
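For reference, the standard GMM-UBM relevance-MAP mean adaptation that the proposed HMM-based scheme builds on can be sketched as follows; the toy data, model sizes and relevance factor are assumptions, and the monophone HMM-UBM of the paper is replaced here by a plain GMM-UBM.
Example (Python):
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(rng.normal(size=(5000, 13)))             # stand-in for background data

utt = rng.normal(0.3, 1.0, size=(300, 13))       # one utterance's frame features
post = ubm.predict_proba(utt)                    # frame-level responsibilities
n_k = post.sum(axis=0)                           # soft counts per component
f_k = post.T @ utt                               # first-order statistics

r = 16.0                                          # relevance factor (assumed)
alpha = (n_k / (n_k + r))[:, None]
x_bar = f_k / np.maximum(n_k[:, None], 1e-10)     # per-component data means
adapted_means = alpha * x_bar + (1.0 - alpha) * ubm.means_
supervector = adapted_means.ravel()               # input to a classifier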
Günther M, Costa-Pazo A, Ding C, Boutellaa E, Chiachia G, Zhang H, De Assis Angeloni M, Štruc V, Khoury E, Vazquez-Fernandez E, Tao D, Bengherabi M, Cox D, Kiranyaz S, De Freitas Pereira T, Žganec-Gros J, Argones-Rua E, Pinto N, Gabbouj M, Simoes F, Dobrišek S, Gonzalez-Jimenez D, Rocha A, Neto M, Pavešić N, Falcao A, Violato R and Marcel S (2013), "The 2013 face recognition evaluation in mobile environment", In Proceedings of the 2013 International Conference on Biometrics - ICB 2013. Madrid, Spain, June, 2013, pp. 1-7. IEEE Computer Society.
Abstract: Automatic face recognition in unconstrained environments is a challenging task. To test current trends in face recognition algorithms, we organized an evaluation of face recognition in a mobile environment. This paper presents the results of 8 different participants using two verification metrics. Most submitted algorithms rely on one or more of three types of features: local binary patterns, Gabor wavelet responses including Gabor phases, and color information. The best results are obtained from UNILJ-ALP, which fused several image representations and feature types, and UC-HU, which learns optimal features with a convolutional neural network. Additionally, we assess the usability of the algorithms in mobile devices with limited resources.
BibTeX:
@conference{Gunther2013,
  author = {G\"{u}nther, M. and Costa-Pazo, A. and Ding, C. and Boutellaa, E. and Chiachia, G. and Zhang, H. and De Assis Angeloni, M. and Vitomir \v{S}truc and Khoury, E. and Vazquez-Fernandez, E. and Tao, D. and Bengherabi, M. and Cox, D. and Kiranyaz, S. and De Freitas Pereira, T. and Jerneja \v{Z}ganec-Gros and Argones-Rua, E. and Pinto, N. and Gabbouj, M. and Simoes, F. and Simon Dobri\v{s}ek and Gonzalez-Jimenez, D. and Rocha, A. and Neto, M.U. and Nikola Pave\v{s}i\'{c} and Falcao, A. and Violato, R. and Marcel, S.},
  title = {The 2013 face recognition evaluation in mobile environment},
  booktitle = {Proceedings of the 2013 International Conference on Biometrics - ICB 2013},
  publisher = {IEEE Computer Society},
  year = {2013},
  pages = {1-7},
  url = {http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6613024},
  doi = {10.1109/ICB.2013.6613024}
}
Križaj J, Štruc V and Dobrišek S (2013), "Combining 3D face representations using region covariance descriptors and statistical models", In Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013. Shanghai, China, April, 2013, pp. 1-7.
Abstract: The paper introduces a novel framework for 3D face recognition that capitalizes on region covariance descriptors and Gaussian mixture models. The framework presents an elegant and coherent way of combining multiple facial representations, while simultaneously examining all computed representations at various levels of locality. The framework first computes a number of region covariance matrices/descriptors from different sized regions of several image representations and then adopts the unscented transform to derive low-dimensional feature vectors from the computed descriptors. By doing so, it enables computations in the Euclidean space, and makes Gaussian mixture modeling feasible. In the last step a support vector machine classification scheme is used to make a decision regarding the identity of the modeled input 3D face image. The proposed framework exhibits several desirable characteristics, such as an inherent mechanism for data fusion/integration (through the region covariance matrices), the ability to examine the facial images at different levels of locality, and the ability to integrate domain-specific prior knowledge into the modeling procedure. We assess the feasibility of the proposed framework on the Face Recognition Grand Challenge version 2 (FRGCv2) database with highly encouraging results.
BibTeX:
@conference{Krizaj2013,
  author = {Janez Kri\v{z}aj and Vitomir \v{S}truc and Simon Dobri\v{s}ek},
  title = {Combining 3D face representations using region covariance descriptors and statistical models},
  booktitle = {Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, FG 2013},
  year = {2013},
  pages = {1-7},
  url = {http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6553816},
  doi = {10.1109/FG.2013.6553816}
}
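The basic building block of the framework, the region covariance descriptor, is easy to illustrate. The sketch below computes the covariance of simple per-pixel features over an image region; the feature choice is an illustrative assumption, and the paper then applies the unscented transform to such matrices to obtain Euclidean feature vectors for Gaussian mixture modeling.
Example (Python):
import numpy as np

def region_covariance(patch):
    """Covariance descriptor of a 2D image region."""
    h, w = patch.shape
    gy, gx = np.gradient(patch.astype(float))    # vertical/horizontal gradients
    ys, xs = np.mgrid[0:h, 0:w]
    # per-pixel feature vectors: [x, y, intensity, |Ix|, |Iy|]
    F = np.stack([xs.ravel(), ys.ravel(), patch.ravel(),
                  np.abs(gx).ravel(), np.abs(gy).ravel()])
    return np.cov(F)                             # 5x5 symmetric descriptor

patch = np.random.rand(16, 16)                   # stand-in for a face sub-region
C = region_covariance(patch)
# The paper derives low-dimensional vectors from such matrices with the
# unscented transform, making Gaussian mixture modeling feasible.
print(C.shape)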
Dobrišek S, Gajšek R, Mihelič F, Pavešić N and Štruc V (2012), "Towards efficient multi-modal emotion recognition", International Journal of Advanced Robotic Systems. Vol. 10(53), pp. 1-10.
Abstract: The paper presents a multi-modal emotion recognition system exploiting audio and video (i.e., facial expression) information. The system first processes both sources of information individually to produce corresponding matching scores and then combines the computed matching scores to obtain a classification decision. For the video part of the system, a novel approach to emotion recognition, relying on image-set matching, is developed. The proposed approach avoids the need for detecting and tracking specific facial landmarks throughout the given video sequence, which represents a common source of error in video-based emotion recognition systems, and, therefore, adds robustness to the video processing chain. The audio part of the system, on the other hand, relies on utterance-specific Gaussian Mixture Models (GMMs) adapted from a Universal Background Model (UBM) via the maximum a posteriori probability (MAP) estimation. It improves upon the standard UBM-MAP procedure by exploiting gender information when building the utterance-specific GMMs, thus ensuring enhanced emotion recognition performance. Both the uni-modal parts and the combined system are assessed on the challenging multi-modal eNTERFACE'05 corpus with highly encouraging results. The developed system represents a feasible solution to emotion recognition that can easily be integrated into various systems, such as humanoid robots, smart surveillance systems and the like.
BibTeX:
@article{Dobrisek2013,
  author = {Simon Dobri\v{s}ek and Rok Gaj\v{s}ek and France Miheli\v{c} and Nikola Pave\v{s}i\'{c} and Vitomir \v{S}truc},
  title = {Towards efficient multi-modal emotion recognition},
  journal = {International Journal of Advanced Robotic Systems},
  year = {2012},
  volume = {10},
  number = {53},
  pages = {1-10},
  url = {http://www.intechopen.com/books/international_journal_of_advanced_robotic_systems/towards-efficient-multi-modal-emotion-recognition},
  doi = {10.5772/54002}
}
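A generic score-level combination of per-modality matching scores, in the spirit of the system's fusion stage, can be sketched as follows; the toy scores, min-max normalization and equal weights are assumptions.
Example (Python):
import numpy as np

def minmax_norm(s):
    # map scores to a common [0, 1] range before combining modalities
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

emotions = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
audio_scores = np.array([2.1, -0.3, 0.8, 3.5, 0.1, 1.2])   # toy audio scores
video_scores = np.array([0.4, 0.1, 0.2, 0.9, 0.05, 0.3])   # toy video scores

fused = minmax_norm(audio_scores) + minmax_norm(video_scores)  # sum rule
print("decision:", emotions[int(np.argmax(fused))])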
Gajšek R, Dobrišek S and Mihelič F (2012), "Analysis and assessment of state relevance in HMM-based feature extraction method", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 7499 LNAI, pp. 559-565.
Abstract: In the article we evaluate the importance of different HMM states in an HMM-based feature extraction method used to model paralinguistic information. Specifically, we evaluate the distribution of the paralinguistic information across different states of the HMM in two different classification tasks: emotion recognition and alcoholization detection. In the task of recognizing emotions we found that the majority of emotion-related information is incorporated in the first and third state of a 3-state HMM. Surprisingly, in the alcoholization detection task we observed a roughly equal distribution of task-specific information across all three states, with results consistently improving as more states are utilized.
BibTeX:
@article{Gajsek2012,
  author = {Rok Gaj\v{s}ek and Simon Dobri\v{s}ek and France Miheli\v{c}},
  title = {Analysis and assessment of state relevance in HMM-based feature extraction method},
  journal = {Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)},
  year = {2012},
  volume = {7499 LNAI},
  pages = {559-565},
  url = {http://link.springer.com/chapter/10.1007%2F978-3-642-32790-2_68},
  doi = {10.1007/978-3-642-32790-2_68}
}
Golob Z, Žganec-Gros J, Žganec M, Vesnicer B and Dobrišek S (2012), "FST-Based Pronunciation Lexicon Compression for Speech Engines", International Journal of Advanced Robotic Systems. Vol. 9(211), pp. 1-9.
Abstract: Finite-state transducers are frequently used for pronunciation lexicon representations in speech engines, in which memory and processing resources are scarce. This paper proposes two possibilities for further reducing the memory footprint of finite-state transducers representing pronunciation lexicons. First, different alignments of grapheme and allophone transcriptions are studied and a reduction in the number of states of up to 30% is reported. Second, a combination of grapheme-to-allophone rules with a finite-state transducer is proposed, which yields a 65% smaller finite-state transducer than conventional approaches.
BibTeX:
@article{Golob2012,
  author = {\v{Z}iga Golob and Jerneja \v{Z}ganec-Gros and Mario \v{Z}ganec and Bo\v{s}tjan Vesnicer and Simon Dobri\v{s}ek},
  title = {FST-Based Pronunciation Lexicon Compression for Speech Engines},
  journal = {International Journal of Advanced Robotic Systems},
  year = {2012},
  volume = {9},
  number = {211},
  pages = {1-9},
  url = {http://www.intechopen.com/books/international_journal_of_advanced_robotic_systems/fst-based-pronunciation-lexicon-compression-for-speech-engines},
  doi = {10.5772/52795}
}
Križaj J, Štruc V and Dobrišek S (2012), "Towards robust 3D face verification using Gaussian mixture models", International Journal of Advanced Robotic Systems. Vol. 9(162), pp. 1-11.
Abstract: This paper focuses on the use of Gaussian mixture models (GMMs) for 3D face verification. A special interest is taken in practical aspects of 3D face verification systems, where all steps of the verification procedure need to be automated and no meta-data, such as preannotated eye/nose/mouth positions, is available to the system. In such settings the performance of the verification system correlates heavily with the performance of the employed alignment (i.e., geometric normalization) procedure. We show that popular holistic as well as local recognition techniques, such as principal component analysis (PCA) or scale-invariant feature transform (SIFT)-based methods, considerably deteriorate in their performance when an "imperfect" geometric normalization procedure is used to align the 3D face scans and that in these situations GMMs should be preferred. Moreover, several possibilities to improve the performance and robustness of the classical GMM framework are presented and evaluated: i) explicit inclusion of spatial information during the GMM construction procedure, ii) implicit inclusion of spatial information during the GMM construction procedure and iii) on-line evaluation and possible rejection of local feature vectors based on their likelihood. We successfully demonstrate the feasibility of the proposed modifications on the Face Recognition Grand Challenge data set.
BibTeX:
@article{Krizaj2012a,
  author = {Janez Kri\v{z}aj and Vitomir \v{S}truc and Simon Dobri\v{s}ek},
  title = {Towards robust 3D face verification using Gaussian mixture models},
  journal = {International Journal of Advanced Robotic Systems},
  year = {2012},
  volume = {9},
  number = {162},
  pages = {1-11},
  url = {http://www.intechopen.com/books/international_journal_of_advanced_robotic_systems/towards-robust-3d-face-verification-using-gaussian-mixture-models},
  doi = {10.5772/52200}
}
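The GMM-based verification back end can be illustrated with scikit-learn; the toy data and model sizes are assumptions, and the client model is trained independently here rather than adapted from a background model. The match score is the average log-likelihood ratio of the probe's local feature vectors under the client and background models.
Example (Python):
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
background = rng.normal(size=(4000, 30))          # pooled background features
client_enroll = rng.normal(0.5, 1.0, size=(600, 30))

ubm = GaussianMixture(n_components=16, covariance_type="diag",
                      random_state=0).fit(background)
client = GaussianMixture(n_components=16, covariance_type="diag",
                         random_state=0).fit(client_enroll)

probe = rng.normal(0.5, 1.0, size=(200, 30))      # local features of a probe scan
llr = client.score_samples(probe) - ubm.score_samples(probe)
score = llr.mean()                                 # accept if above a threshold
print(score)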
Križaj J, Štruc V and Dobrišek S (2012), "Robust 3D face recognition", Elektrotehniski Vestnik/Electrotechnical Review. Vol. 79(1-2), pp. 1-6.
Abstract: Face recognition in uncontrolled environments is hindered by variations in illumination, pose, expression and occlusions of faces. Many practical face-recognition systems are affected by these variations. One way to increase the robustness to illumination and pose variations is to use 3D facial images. In this paper 3D face-recognition systems are presented. Their structure and operation are described. The robustness of such systems to variations in uncontrolled environments is emphasized. We present some preliminary results of a system developed in our laboratory.
BibTeX:
@article{Krizaj2012b,
  author = {Janez Kri\v{z}aj and Vitomir \v{S}truc and Simon Dobri\v{s}ek},
  title = {Robust 3D face recognition},
  journal = {Elektrotehniski Vestnik/Electrotechnical Review},
  year = {2012},
  volume = {79},
  number = {1-2},
  pages = {1-6},
  url = {http://ev.fe.uni-lj.si/online-eng.php?vol=79}
}
Dobrišek S and Mihelič F (2011), "Time- and acoustic-mediated alignment algorithms for speech recognition evaluation", In Proceedings of the 12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011. Florence, Italy, August, 2011, pp. 1517-1520.
Abstract: The paper investigates the time- and acoustic-mediated alignment algorithms that can be used for better speech recognition evaluation. The edit-cost function, which weights the cost of speech unit matches, substitutions, deletions and insertions, is defined as a function of timed symbols or even as a function of speech signal segments. The algorithms are compared using several classical statistical measures of different types that are derived from speech recognition confusion matrices and are normally used to measure the agreement between different classifications of the same set of objects. These measures provide a reasonable indication that the investigated algorithms provide more relevant speech recognition error statistics than the algorithms that are commonly used for this purpose.
BibTeX:
@conference{Dobrisek2011,
  author = {Simon Dobri\v{s}ek and France Miheli\v{c}},
  title = {Time- and acoustic-mediated alignment algorithms for speech recognition evaluation},
  booktitle = {Proceedings of the 12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011},
  year = {2011},
  pages = {1517-1520},
  url = {http://www.isca-speech.org/archive/interspeech_2011/i11_1517.html}
}
Gajšek R, Dobrišek S and Mihelič F (2011), "University of Ljubljana system for Interspeech 2011 Speaker State Challenge", In Proceedings of the 12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011. Florence, Italy, August, 2011, pp. 3297-3300.
Abstract: The paper presents our efforts in the Interspeech 2011 Speaker State Challenge. Both systems, for the Intoxication and the Sleepiness Sub-Challenge, are based on a Universal Background Model (UBM) in the form of a Hidden Markov Model (HMM), and the Maximum A Posteriori (MAP) adaptation. With the combination of our HMM-UBM-MAP derived supervectors and selected statistical functionals from the baseline feature set, we were able to surpass the baseline system in both sub-challenges. By employing majority voting fusion of the best systems we were able to further improve the performance. In the Intoxication Sub-Challenge our best result on the test set is 67.46%, and in the Sleepiness Sub-Challenge 71.28%.
BibTeX:
@conference{Gajsek2011,
  author = {Rok Gaj\v{s}ek and Simon Dobri\v{s}ek and France Miheli\v{c}},
  title = {University of Ljubljana system for Interspeech 2011 Speaker State Challenge},
  booktitle = {Proceedings of the 12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011},
  year = {2011},
  pages = {3297-3300},
  url = {http://www.isca-speech.org/archive/interspeech_2011/i11_3297.html}
}
Dobrišek S, Žibert J and Mihelič F (2010), "Towards the optimal minimization of a pronunciation dictionary model", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 6231 LNAI, pp. 267-274.
Abstract: This paper presents the results of our efforts to obtain the minimum possible finite-state representation of a pronunciation dictionary. Finite-state transducers are widely used to encode word pronunciations and our experiments revealed that the conventional redundancy-reduction algorithms developed within this framework yield suboptimal solutions. We found that the incremental construction and redundancy reduction of acyclic finite-state transducers create considerably smaller models (up to 60% smaller) than the conventional, non-incremental (batch) algorithms implemented in the OpenFST toolkit.
BibTeX:
@article{Dobrisek2010,
  author = {Simon Dobri\v{s}ek and Janez \v{Z}ibert and France Miheli\v{c}},
  title = {Towards the optimal minimization of a pronunciation dictionary model},
  journal = {Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)},
  year = {2010},
  volume = {6231 LNAI},
  pages = {267-274},
  url = {http://link.springer.com/chapter/10.1007/978-3-642-15760-8_34},
  doi = {10.1007/978-3-642-15760-8_34}
}
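The incremental construction idea can be illustrated on the simpler case of an acyclic acceptor. The sketch below is an assumption-laden stand-in for the paper's transducer setting: it builds a minimal acyclic automaton from a sorted word list in the style of Daciuk et al., registering each frozen suffix state so that equivalent states are shared as soon as they are created.
Example (Python):
class State:
    __slots__ = ("final", "edges")
    def __init__(self):
        self.final, self.edges = False, {}
    def sig(self):
        # equivalence signature: finality plus outgoing edges to shared states
        return (self.final,
                tuple((c, id(s)) for c, s in sorted(self.edges.items())))

def _minimize(path, down_to, register):
    # merge states beyond `down_to` with registered equivalents, deepest first
    for parent, label in reversed(path[down_to:]):
        child = parent.edges[label]
        twin = register.setdefault(child.sig(), child)
        parent.edges[label] = twin

def build_minimal_dfa(sorted_words):
    root, register, path, prev = State(), {}, [], ""
    for word in sorted_words:
        cp = next((i for i, (a, b) in enumerate(zip(prev, word)) if a != b),
                  min(len(prev), len(word)))       # common-prefix length
        _minimize(path, cp, register)              # suffix of prev is frozen
        path = path[:cp]
        node = path[-1][0].edges[path[-1][1]] if path else root
        for ch in word[cp:]:                       # append the new suffix
            new = State()
            node.edges[ch] = new
            path.append((node, ch))
            node = new
        node.final = True
        prev = word
    _minimize(path, 0, register)                   # minimize the last word
    return root

dfa = build_minimal_dfa(sorted(["tap", "taps", "top", "tops"]))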
Justin T, Gajšek R, Štruc V and Dobrišek S (2010), "Comparison of different classification methods for emotion recognition", In Proceedings of the 33rd International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO 2010, pp. 700-703.
Abstract: The paper presents a comparison of different classification techniques for the task of classifying a speaker's emotional state into one of two classes: aroused and normal. The comparison was conducted using the WEKA (Waikato Environment for Knowledge Analysis) open source software, which consists of a collection of machine learning algorithms for data mining. The aim of this paper is to investigate the efficiency of different classification methods for recognizing the emotional state of a speaker with features obtained by a constrained version of Maximum Likelihood Linear Regression (CMLLR). For our experiments we adopted the multi-modal AvID database of emotions, which comprises 1708 samples of utterances each lasting at least 15 seconds. The database was randomly divided into a training set and a testing set in a ratio of 5:1. Since there are many more samples in the database belonging to the neutral class than to the aroused class, the latter was over-sampled to ensure that both classes contained equal numbers of samples in the training set. The built-in WEKA classifiers were divided into five groups based on their theoretical foundation, i.e., the group of classifiers related to Bayes' theorem, the group of distance-based classifiers, the group of discriminant classifiers, the group of neural networks, and finally the group of decision tree classifiers. From each group we present the results of the best evaluated algorithms with respect to the unweighted average recall.
BibTeX:
@conference{Justin2010,
  author = {Tadej Justin and Rok Gaj\v{s}ek and Vitomir \v{S}truc and Simon Dobri\v{s}ek},
  title = {Comparison of different classification methods for emotion recognition},
  booktitle = {Proceedings of the 33rd International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO 2010},
  year = {2010},
  pages = {700-703},
  url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5533498}
}
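The evaluation measure used above, the unweighted average recall (UAR), is simply macro-averaged recall. Below is a small scikit-learn stand-in for the WEKA setup; the 5:1 split follows the abstract, while the toy data and the choice of a Bayes-type classifier are assumptions.
Example (Python):
import numpy as np
from sklearn.metrics import recall_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
# imbalanced binary labels, standing in for aroused vs. normal utterances
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0.8).astype(int)

split = int(len(X) * 5 / 6)                  # 5:1 train/test ratio
clf = GaussianNB().fit(X[:split], y[:split])
# UAR = recall averaged over classes, ignoring class priors
uar = recall_score(y[split:], clf.predict(X[split:]), average="macro")
print(f"UAR: {uar:.3f}")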
Štruc V, Dobrišek S and Pavešić N (2010), "Confidence weighted subspace projection techniques for robust face recognition in the presence of partial occlusions", In Proceedings of the International Conference on Pattern Recognition, ICPR 2010. Istanbul, Turkey, pp. 1334-1338.
Abstract: Subspace projection techniques are known to be susceptible to the presence of partial occlusions in the image data. To overcome this susceptibility, we present in this paper a confidence weighting scheme that assigns weights to pixels according to a measure, which quantifies the confidence that the pixel in question represents an outlier. With this procedure the impact of the occluded pixels on the subspace representation is reduced and robustness to partial occlusions is obtained. Next, the confidence weighting concept is improved by a local procedure for the estimation of the subspace representation. Both the global weighting approach and the local estimation procedure are assessed in face recognition experiments on the AR database, where encouraging results are obtained with partially occluded facial images.
BibTeX:
@conference{Struc2010,
  author = {Vitomir \v{S}truc and Simon Dobri\v{s}ek and Nikola Pave\v{s}i\'{c}},
  title = {Confidence weighted subspace projection techniques for robust face recognition in the presence of partial occlusions},
  booktitle = {Proceedings of the International Conference on Pattern Recognition, ICPR 2010},
  year = {2010},
  pages = {1334-1338},
  url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5597733},
  doi = {10.1109/ICPR.2010.331}
}
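The confidence-weighting idea, down-weighting pixels that are likely outliers when computing the subspace representation, amounts to a weighted least-squares projection. A numpy sketch follows; the toy data, eigenface basis size, simulated occlusion and weighting rule are illustrative assumptions, not the paper's exact choices.
Example (Python):
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 256))            # stand-in face vectors (16x16)
mean = train.mean(axis=0)
U, _, _ = np.linalg.svd((train - mean).T, full_matrices=False)
U = U[:, :20]                                   # orthonormal basis (20 eigenfaces)

x = train[0].copy()
x[:64] = 5.0                                    # simulated partial occlusion
resid = np.abs(x - mean)
w = 1.0 / (1.0 + resid ** 2)                    # toy confidence weights

# weighted least squares: solve (U^T W U) a = U^T W (x - mean)
WU = U * w[:, None]
a = np.linalg.solve(U.T @ WU, WU.T @ (x - mean))
reconstruction = mean + U @ a                   # occlusion-robust representation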
Dobrišek S, Vesnicer B and Mihelič F (2009), "A sequential minimization algorithm for finite-state pronunciation lexicon models", In Proceedings of the 10th Annual Conference of the International Speech Communication Association, INTERSPEECH 2009. Brighton, United Kingdom, September, 2009, pp. 720-723.
Abstract: The paper first presents a large-vocabulary automatic speech-recognition system that is being developed for the Slovenian language. The concept of a single-pass token-passing algorithm for the fast speech decoding that can be used with the designed multi-level system structure is discussed. From the algorithmic point of view, the main component of the system is a finite-state pronunciation lexicon model. This component has a crucial impact on the overall performance of the system and we developed a sequential minimization algorithm that very efficiently reduces the size and algorithmic complexity of the lexicon model. Our finite-state lexicon model is represented as a state-emitting finite-state transducer. The presented experiments show that the sequential minimization algorithm easily outperforms (by up to 60%) the conventional algorithms that were developed for the static global optimization of transition-emitting finite-state transducers. These algorithms are delivered as part of the AT&T FSM library and the OpenFST library.
BibTeX:
@conference{Dobrisek2009a,
  author = {Simon Dobri\v{s}ek and Bo\v{s}tjan Vesnicer and France Miheli\v{c}},
  title = {A sequential minimization algorithm for finite-state pronunciation lexicon models},
  booktitle = {Proceedings of the 10th Annual Conference of the International Speech Communication Association, INTERSPEECH 2009},
  year = {2009},
  pages = {720-723},
  url = {http://www.isca-speech.org/archive/interspeech_2009/i09_0720.html}
}
Dobrišek S, Žibert J, Pavešić N and Mihelič F (2009), "An Edit-Distance Model for the Approximate Matching of Timed Strings", Pattern Analysis and Machine Intelligence, IEEE Transactions on, April, 2009. Vol. 31(4), pp. 736-741.
Abstract: An edit-distance model that can be used for the approximate matching of contiguous and noncontiguous timed strings is presented. The model extends the concept of the weighted string-edit distance by introducing timed edit operations and by making the edit costs time dependent. Special attention is paid to the timed null symbols that are associated with the timed insertions and deletions. The usefulness of the presented model is demonstrated on the classification of phone-recognition errors using the TIMIT speech database.
BibTeX:
@article{Dobrisek2009b,
  author = {Simon Dobri\v{s}ek and Janez \v{Z}ibert and Nikola Pave\v{s}i\'{c} and France Miheli\v{c}},
  title = {An Edit-Distance Model for the Approximate Matching of Timed Strings},
  journal = {Pattern Analysis and Machine Intelligence, IEEE Transactions on},
  year = {2009},
  volume = {31},
  number = {4},
  pages = {736-741},
  url = {http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4586388},
  doi = {10.1109/TPAMI.2008.197}
}
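The flavor of the model, edit operations whose costs depend on the symbols' time stamps, can be sketched with a standard dynamic program. The cost functions below are illustrative assumptions, not the paper's exact definitions: insertions and deletions pay for the duration of the timed null symbol they introduce, and substitutions pay for symbol mismatch plus temporal misalignment.
Example (Python):
def timed_edit_distance(ref, hyp, lam=0.1):
    """ref, hyp: lists of (symbol, start, end); lam weights the time terms."""
    def ins_del(t):                      # cost of inserting/deleting a symbol
        return 1.0 + lam * (t[2] - t[1])
    def sub(a, b):                       # mismatch plus temporal misalignment
        c = 0.0 if a[0] == b[0] else 1.0
        return c + lam * (abs(a[1] - b[1]) + abs(a[2] - b[2]))

    n, m = len(ref), len(hyp)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + ins_del(ref[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_del(hyp[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + ins_del(ref[i - 1]),      # deletion
                          D[i][j - 1] + ins_del(hyp[j - 1]),      # insertion
                          D[i - 1][j - 1] + sub(ref[i - 1], hyp[j - 1]))
    return D[n][m]

ref = [("s", 0.00, 0.12), ("i", 0.12, 0.20), ("t", 0.20, 0.31)]
hyp = [("s", 0.01, 0.14), ("e", 0.14, 0.22), ("t", 0.22, 0.30)]
print(timed_edit_distance(ref, hyp))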
Gajšek R, Štruc V, Dobrišek S, Žibert J, Mihelič F and Pavešić N (2009), "Combining audio and video for detection of spontaneous emotions", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 5707 LNCS, pp. 114-121.
Abstract: The paper presents our initial attempts in building an audio-video emotion recognition system. Both the audio and video sub-systems are discussed, and a description of the database of spontaneous emotions is given. The task of labelling the recordings from the database according to different emotions is discussed and the measured agreement between multiple annotators is presented. Instead of focusing on the prosody in audio emotion recognition, we evaluate the possibility of using linear transformations (CMLLR) as features. The classification results from the audio and video sub-systems are combined using sum rule fusion and the increase in recognition performance when using both modalities is presented.
BibTeX:
@article{Gajsek2009a,
  author = {Rok Gaj\v{s}ek and Vitomir \v{S}truc and Simon Dobri\v{s}ek and Janez \v{Z}ibert and France Miheli\v{c} and Nikola Pave\v{s}i\'{c}},
  title = {Combining audio and video for detection of spontaneous emotions},
  journal = {Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)},
  year = {2009},
  volume = {5707 LNCS},
  pages = {114-121},
  url = {http://link.springer.com/chapter/10.1007%2F978-3-642-04391-8_15},
  doi = {10.1007/978-3-642-04391-8_15}
}
Gajšek R, Štruc V, Dobrišek S and Mihelič F (2009), "Emotion recognition using linear transformations in combination with video", In Proceedings of the 10th Annual Conference of the International Speech Communication Association, INTERSPEECH 2009. Brighton, United Kingdom, September, 2009, pp. 1967-1970.
Abstract: The paper discusses the use of linear transformations of Hidden Markov Models, normally employed for speaker and environment adaptation, as a way of extracting the emotional components from speech. A constrained version of the Maximum Likelihood Linear Regression (CMLLR) transformation is used as a feature for classification of a normal or aroused emotional state. We present a procedure for incrementally building a set of speaker-independent acoustic models that are used to estimate the CMLLR transformations for emotion classification. An audio-video database of spontaneous emotions (AvID) is briefly presented, since it forms the basis for the evaluation of the proposed method. Emotion classification using the video part of the database is also described and the added value of combining the visual information with the audio features is shown.
BibTeX:
@conference{Gajsek2009b,
  author = {Rok Gaj\v{s}ek and Vitomir \v{S}truc and Simon Dobri\v{s}ek and France Miheli\v{c}},
  title = {Emotion recognition using linear transformations in combination with video},
  booktitle = {Proceedings of the 10th Annual Conference of the International Speech Communication Association, INTERSPEECH 2009},
  year = {2009},
  pages = {1967-1970},
  url = {http://www.isca-speech.org/archive/interspeech_2009/i09_1967.html}
}
Dobrišek S (2008), "Criteria for the evaluation of automated speech-recognition scoring algorithms", Elektrotehniski Vestnik/Electrotechnical Review. Vol. 75(4), pp. 229-234.
Abstract: Variations of the basic string-alignment algorithm are commonly used for the detection and classification of speech-recognition errors. In this procedure, reference and system-output hypothesis speech transcriptions are first aligned using the string-alignment algorithm that is based on primitive edit operations. The edit operations needed to transform one transcription into the other are then tallied as speech-recognition errors. The algorithms normally detect approximately the same total number of errors; however, they can produce different error classifications. This paper investigates these differences and several criterion functions that can be used for the comparison and evaluation of the algorithms for the detection and classification of speech-recognition errors. The proposed criterion functions were used for the experimental evaluation of the standard algorithms that are implemented as part of the CUED HTK and the NIST SCTK, and were used for the detection and classification of phone-recognition errors from the TIMIT speech database.
BibTeX:
@article{Dobrisek2008,
  author = {Simon Dobri\v{s}ek},
  title = {Criteria for the evaluation of automated speech-recognition scoring algorithms},
  journal = {Elektrotehniski Vestnik/Electrotechnical Review},
  year = {2008},
  volume = {75},
  number = {4},
  pages = {229-234},
  url = {http://ev.fe.uni-lj.si/4-2008.htm}
}
Žganec-Gros J, Mihelič A, Žganec M, Mihelič F, Dobrišek S, Žibert J, Vintar S, Korošec T, Erjavec T and Romih M (2005), "Initial considerations in building a speech-to-speech translation system for the Slovenian-English language pair", In Proceedings of the 10th Annual Conference of the European Association for Machine Translation, EAMT 2005, pp. 288-293.
Abstract: The paper presents the design concept of the VoiceTRAN Communicator that integrates speech recognition, machine translation and text-to-speech synthesis using the DARPA Galaxy architecture. The aim of the project is to build a robust multimodal speech-to-speech translation communicator able to translate simple domain-specific sentences in the Slovenian-English language pair. The project represents a joint collaboration between several Slovenian research organizations that are active in human language technologies. We provide an overview of the task, describe the system architecture and individual servers. Furthermore, we describe the language resources that will be used and developed within the project. We conclude the paper with plans for the evaluation of the VoiceTRAN Communicator.
BibTeX:
@conference{ZganecGros2005,
  author = {Jerneja \v{Z}ganec-Gros and Ale\v{s} Miheli\v{c} and Mario \v{Z}ganec and France Miheli\v{c} and Simon Dobri\v{s}ek and Janez \v{Z}ibert and \v{S}pela Vintar and T. Koro\v{s}ec and Toma\v{z} Erjavec and Miro Romih},
  title = {Initial considerations in building a speech-to-speech translation system for the Slovenian-English language pair},
  booktitle = {Proceedings of the 10th Annual Conference of the European Association for Machine Translation, EAMT 2005},
  year = {2005},
  pages = {288-293},
  url = {http://www.scopus.com/inward/record.url?eid=2-s2.0-84857611306&partnerID=40&md5=250a7b9f55c967b41be8529c65ccf129}
}
Dobrišek S, Gros J, Vesnicer B, Pavešić N and Mihelič F (2003), "Evolution of the information-retrieval system for blind and visually-impaired people", International Journal of Speech Technology. Vol. 6(3), pp. 301-309.
Abstract: Blind and visually-impaired people face many problems in interacting with information retrieval systems. State-of-the-art spoken language technology offers potential to overcome many of them. In the mid-nineties our research group decided to develop an information retrieval system suitable for Slovene-speaking blind and visually-impaired people. A voice-driven text-to-speech dialogue system was developed for reading Slovenian texts obtained from the Electronic Information System of the Association of Slovenian Blind and Visually Impaired Persons Societies. The evolution of the system is presented. The early version of the system was designed to deal explicitly with the Electronic Information System where the available text corpora are stored in a plain text file format without any, or with just some, basic non-standard tagging. Further improvements to the system became possible with the decision to transfer the available corpora to the new web portal, exclusively dedicated to blind and visually-impaired users. The text files were reformatted into common HTML/XML pages, which comply with the basic recommendations set by the Web Access Initiative. In the latest version of the system all the modules of the early version are being integrated into the user interface, which has some basic web-browsing functionalities and a text-to-speech screen-reader function controlled by the mouse as well.
BibTeX:
@article{Dobrisek2003,
  author = {Simon Dobri\v{s}ek and Jerneja Gros and Bo\v{s}tjan Vesnicer and Nikola Pave\v{s}i\'{c} and France Miheli\v{c}},
  title = {Evolution of the information-retrieval system for blind and visually-impaired people},
  journal = {International Journal of Speech Technology},
  year = {2003},
  volume = {6},
  number = {3},
  pages = {301-309},
  url = {http://link.springer.com/article/10.1023%2FA%3A1023474405658},
  doi = {10.1023/A:1023474405658}
}
Mihelič F, Gros J, Dobrišek S, Žibert J and Pavešić N (2003), "Spoken language resources at LUKS of the University of Ljubljana", International Journal of Speech Technology. Vol. 6(3), pp. 221-232.
Abstract: An overview is given of the Slovene-language spoken resources acquired at the Laboratory of Artificial Perception, Systems and Cybernetics (LUKS) at the Faculty of Electrical Engineering, University of Ljubljana over the past ten years. All the resources are accompanied by relevant text transcriptions, lexicons and various segmentation labels.
BibTeX:
@article{Mihelic2003,
  author = {France Miheli\v{c} and Jerneja Gros and Simon Dobri\v{s}ek and Janez \v{Z}ibert and Nikola Pave\v{s}i\'{c}},
  title = {Spoken language resources at LUKS of the University of Ljubljana},
  journal = {International Journal of Speech Technology},
  year = {2003},
  volume = {6},
  number = {3},
  pages = {221-232},
  url = {http://link.springer.com/article/10.1023%2FA%3A1023462002932},
  doi = {10.1023/A:1023462002932}
}
Pavešić N, Gros J, Dobrišek S and Mihelič F (2003), "Homer II - Man-machine interface to internet for blind and visually impaired people", Computer Communications. Vol. 26(5), pp. 438-443.
Abstract: HOMER II is a voice-driven text-to-speech system developed for blind or visually impaired persons for reading Slovenian texts. Users can obtain texts from the Internet site of the Association of Slovenian Blind and Visually Impaired Persons Societies from their Electronic Information System where they can find daily newspapers, some novels and other information. The system consists of four main modules. The first module enables Internet communication, retrieves text to a local disc and converts it to a standard form. The input interface manages the keyboard entry and/or speaker independent speech recognition. The output interface performs speech synthesis of a given text and in addition prints the same text magnified to the screen. The user dialog is responsible for the user friendly communication and controls other tasks of the system. Homer II was ported from Linux to the MS Windows 9x/ME/NT/2000 operating systems. For the best performance it uses multi-threading and other advantages of the 32-bit environment. Further versions of the HOMER system with even more advanced dialogue modules and some basic World Wide Web browsing functionality will represent an important tool in the distance learning and teaching process for the impaired persons using academic networks.
BibTeX:
@article{Pavesic2003,
  author = {Nikola Pave\v{s}i\'{c} and Jerneja Gros and Simon Dobri\v{s}ek and France Miheli\v{c}},
  title = {Homer II - Man-machine interface to internet for blind and visually impaired people},
  journal = {Computer Communications},
  year = {2003},
  volume = {26},
  number = {5},
  pages = {438-443},
  url = {http://www.sciencedirect.com/science/article/pii/S0140366402001640},
  doi = {10.1016/S0140-3664(02)00164-0}
}
Vesnicer B, Žibert J, Dobrišek S, Pavešić N and Mihelič F (2003), "A Voice-Driven Web Browser for Blind People", In Proceedings of the Eighth European Conference on Speech Communication and Technology, EUROSPEECH-2003. Geneva, Switzerland, September, 2003, pp. 1301-1304.
Abstract: A small self-voicing Web browser designed for blind users is presented. The Web browser was built from the GTK Web browser Dillo, which is a free software project under the terms of the GNU General Public License. Additional functionality has been introduced to this original browser in the form of different modules. The browser operates in two different modes, a browsing mode and a dialogue mode. In browsing mode the user navigates through the structure of Web pages using the mouse and/or keyboard. When in dialogue mode, the dialogue module offers different actions and the user chooses between them using either the keyboard or spoken commands, which are recognized by the speech-recognition module. The content of the page is presented to the user by the screen-reader module, which uses the text-to-speech module for its output. The browser is capable of displaying all common Web pages that do not contain frames, Java or Flash animations. However, the best performance is achieved when pages comply with the recommendations set by the WAI. The browser has been developed in the Linux operating system and later ported to the Windows 9x/ME/NT/2000/XP platform. Currently it is being tested by members of the Slovenian blind people's society. Any suggestions or wishes from them will be considered for inclusion in future versions of the browser.
BibTeX:
@conference{Vesnicer2003,
  author = {Bo\v{s}tjan Vesnicer and Janez \v{Z}ibert and Simon Dobri\v{s}ek and Nikola Pave\v{s}i\'{c} and France Miheli\v{c}},
  title = {A Voice-Driven Web Browser for Blind People},
  booktitle = {Proceedings of the Eighth European Conference on Speech Communication and Technology, EUROSPEECH-2003},
  year = {2003},
  pages = {1301-1304},
  url = {http://www.isca-speech.org/archive/eurospeech_2003/e03_1301.html}
}
Dobrišek S, Gros J, Vesnicer B, Mihelič F and Pavešić N (2002), "A Voice-Driven Web Browser for Blind People", Lecture Notes in Computer Science. Vol. 2448, pp. 453-459.
Abstract: A specialised small Web browser with a voice-driven dialogue manager and a text-to-speech screen reader is presented. The Web browser was built from the GTK Web browser Dillo, which is a free software project under the terms of the GNU General Public License. The new built-in screen reader is now triggered by pointing the mouse and uses the text-to-speech module for its output. A dialogue module together with a spoken-command input was also introduced into the browser. It can be used for navigation through the structure of common Web pages. The developed browser is primarily intended to be used with the new Web portal, exclusively dedicated to blind and visually impaired users. All the Web pages at the portal or at sites that are linked from this portal are expected to be arranged as common HTML/XML pages, which comply with the basic recommendations set by the Web Access Initiative.
BibTeX:
@article{Dobrisek2002,
  author = {Simon Dobri\v{s}ek and Jerneja Gros and Bo\v{s}tjan Vesnicer and France Miheli\v{c} and Nikola Pave\v{s}i\'{c}},
  title = {A Voice-Driven Web Browser for Blind People},
  journal = {Lecture Notes in Computer Science},
  year = {2002},
  volume = {2448},
  pages = {453-459},
  url = {http://link.springer.com/chapter/10.1007%2F3-540-46154-X_65},
  doi = {10.1007/3-540-46154-X_65}
}
Gros J, Mihelič F, Dobrišek S, Erjavec T and Žganec M (2000), "Rules for Automatic Grapheme-to-Allophone Transcription in Slovene", Lecture Notes in Computer Science, In Text, Speech and Dialogue. Vol. 1902, pp. 171-176. Springer Berlin Heidelberg.
Abstract: The domain of spoken language technologies ranges from speech input and output systems to complex understanding and generation systems, including multi-modal systems of widely differing complexity (such as automatic dictation machines) and multilingual systems (for example, automatic dialogue and translation systems). The definition of standards and evaluation methodologies for such systems involves the specification and development of highly specific spoken language corpus and lexicon resources, and measurement and evaluation tools [5]. This paper presents the MobiLuz spoken resources of the Slovene language, which will be made freely available for research purposes in speech technology and linguistics.
BibTeX:
@article{Gros2000,
  author = {Jerneja Gros and France Miheli\v{c} and Simon Dobri\v{s}ek and Toma\v{z} Erjavec and Mario \v{Z}ganec},
  editor = {Sojka, Petr and Kope\v{c}ek, Ivan and Pala, Karel},
  title = {Rules for Automatic Grapheme-to-Allophone Transcription in Slovene},
  booktitle = {Text, Speech and Dialogue},
  journal = {Lecture Notes in Computer Science},
  publisher = {Springer Berlin Heidelberg},
  year = {2000},
  volume = {1902},
  pages = {171-176},
  url = {http://link.springer.com/chapter/10.1007%2F3-540-45323-7_29},
  doi = {10.1007/3-540-45323-7_29}
}
Dobrišek S, Gros J, Mihelič F and Pavešić N (1999), "HOMER: a voice-driven text-to-speech system for the blind", In Proceedings of the IEEE International Symposium on Industrial Electronics, ISIE'99. Vol. 1, pp. 205-208. IEEE, Piscataway, NJ, United States.
Abstract: HOMER is a voice-driven text-to-speech system developed for blind or visually impaired persons for reading Slovenian texts. Users can obtain texts from the special corpora organised on the computer network server at the information centre of the association of the Slovenian blind and visually impaired persons. The system consists of three main modules. The text-to-speech module enables speech synthesis from an arbitrary Slovenian text input, the speech recognition module performs speaker-independent isolated-word recognition and the dialogue module controls the different tasks of the HOMER system and obtains texts from the source text corpora. Presently, the system runs under Linux and requires a Pentium/133 PC with a minimum of 32 MB of RAM and a standard 16-bit sound card.
BibTeX:
@conference{Dobrisek1999a,
  author = {Simon Dobri\v{s}ek and Jerneja Gros and France Miheli\v{c} and Nikola Pave\v{s}i\'{c}},
  title = {HOMER: a voice-driven text-to-speech system for the blind},
  booktitle = {Proceedings of the IEEE International Symposium on Industrial Electronics, ISIE'99},
  publisher = {IEEE, Piscataway, NJ, United States},
  year = {1999},
  volume = {1},
  pages = {205-208},
  url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=801785&tag=1},
  doi = {10.1109/ISIE.1999.801785}
}
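The control flow of HOMER's dialogue module can be illustrated with a minimal sketch in which a recognised isolated word selects a system task; the English command words and the two stubs are hypothetical stand-ins for the Slovene vocabulary and the real modules.

def fetch_text():
    """Stand-in for retrieving a text from the network text corpora."""
    return "Example article text."

def synthesise(text):
    """Stand-in for the Slovenian text-to-speech module."""
    print("[TTS]", text)

def dialogue_loop(recognised_words):
    """Dispatch isolated-word commands to the system's tasks."""
    for word in recognised_words:          # output of the word recogniser
        if word == "read":
            synthesise(fetch_text())
        elif word == "stop":
            synthesise("Goodbye.")
            break
        else:
            synthesise("Unknown command: " + word)

dialogue_loop(["read", "stop"])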
Dobrišek S, Mihelič F and Pavešić N (1999), "Speech Segmentation Aspects of Phone Transition Acoustical Modelling", Lecture Notes in Computer Science., In Text, Speech and Dialogue. Vol. 1692, pp. 248-251. Springer Berlin Heidelberg.
Abstract: The paper presents our experiences with phone transition acoustical models. The phone transition models were compared to the traditional context-dependent phone models. We paid special attention to the speech signal segmentation analysis to provide a better insight into certain segmentation effects observed when using the different acoustical models. Experiments with the HMM-based models were performed using the HTK toolkit, which was extended to provide proper state parameter tying for the phone transition models. All the model parameters were estimated on the GOPOLIS speech database. The annotation confusions concerning two-phone speech units are also discussed.
BibTeX:
@article{Dobrisek1999b,
  author = {Simon Dobri\v{s}ek and France Miheli\v{c} and Nikola Pave\v{s}i\'{c}},
  editor = {Matousek, Václav and Mautner, Pavel and Ocelíková, Jana and Sojka, Petr},
  title = {Speech Segmentation Aspects of Phone Transition Acoustical Modelling},
  booktitle = {Text, Speech and Dialogue},
  journal = {Lecture Notes in Computer Science},
  publisher = {Springer Berlin Heidelberg},
  year = {1999},
  volume = {1692},
  pages = {248-251},
  url = {http://link.springer.com/chapter/10.1007%2F3-540-48239-3_45},
  doi = {10.1007/3-540-48239-3_45}
}
Dobrišek S, Mihelič F and Pavešić N (1999), "Acoustical Modelling of Phone Transitions: Biphones and Diphones - What are the Differences?", In Proceedings of the Sixth European Conference on Speech Communication and Technology, EUROSPEECH'99. Budapest, Hungary, September, 1999, pp. 1307-1310.
Abstract: The paper presents our experiences with phone transition acoustical models. The phone transition models were compared to the traditional context-dependent phone models. We paid special attention to the speech signal segmentation analysis to provide a better insight into certain segmentation effects observed when using the different acoustical models. Experiments with the HMM-based models were performed using the HTK toolkit, which was extended to allow proper state parameter tying for the phone transition models. All the model parameters were estimated on the GOPOLIS speech database. The annotation confusions concerning two-phone speech units are discussed. Finally, the overall word recognition score is presented; the better score was achieved with the diphone models, even when comparing them to the triphone models.
BibTeX:
@conference{Dobrisek1999c,
  author = {Simon Dobri\v{s}ek and France Miheli\v{c} and Nikola Pave\v{s}i\'{c}},
  title = {Acoustical Modelling of Phone Transitions: Biphones and Diphones - What are the Differences?},
  booktitle = {Proceedings of the Sixth European Conference on Speech Communication and Technology, EUROSPEECH'99},
  year = {1999},
  pages = {1307-1310},
  url = {http://www.isca-speech.org/archive/eurospeech_1999/e99_1307.html}
}
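The distinction between the phone-transition units studied in this and the preceding entry and conventional context-dependent phones can be made concrete with a short sketch; the phone string is a toy example, and the actual HMM training with tied states was done in HTK.

def diphones(phones):
    """Units spanning each transition between neighbouring phones."""
    return [a + "+" + b for a, b in zip(phones, phones[1:])]

def triphones(phones):
    """Context-dependent phones: centre phone with left/right context."""
    padded = ["sil"] + phones + ["sil"]
    return [l + "-" + c + "+" + r
            for l, c, r in zip(padded, padded[1:], padded[2:])]

phones = ["g", "o", "p", "o"]
print(diphones(phones))    # ['g+o', 'o+p', 'p+o']
print(triphones(phones))   # ['sil-g+o', 'g-o+p', 'o-p+o', 'p-o+sil']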
Ipšić I, Mihelič F, Dobrišek S, Gros J and Pavešić N (1999), "A Slovenian Spoken Dialog System for Air Flight Inquiries", In Proceedings of the Sixth European Conference on Speech Communication and Technology, EUROSPEECH'99. Budapest, Hungary, September, 1999, pp. 2659-2662.
Abstract: In this paper we give an overview of a Slovenian spoken dialog system, developed within a joint project on multilingual speech recognition and understanding. The aim of the project is the development of an information retrieval system capable of holding a dialog with a user. The paper presents the work on the development of the Slovenian spoken dialog system. Such a system has to be able to handle spontaneous speech and to provide the user with correct information. The information system being developed for Slovenian speech is used for air flight information retrieval; the system has to answer questions about air flight connections and their times and dates. In the paper we present the developed modules of the Slovenian system and show some results with respect to word accuracy, semantic accuracy and dialog success rate.
BibTeX:
@conference{Ipsic1999,
  author = {Ivo Ip\v{s}i\'{c} and France Miheli\v{c} and Simon Dobri\v{s}ek and Jerneja Gros and Nikola Pave\v{s}i\'{c}},
  title = {A Slovenian Spoken Dialog System for Air Flight Inquiries},
  booktitle = {Proceedings of the Sixth European Conference on Speech Communication and Technology, EUROSPEECH'99},
  year = {1999},
  pages = {2659-2662},
  url = {http://www.isca-speech.org/archive/eurospeech_1999/e99_2659.html}
}
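The understanding step of such a dialog system can be illustrated by a minimal keyword-spotting sketch that fills a semantic frame from the recognised word string; the English vocabulary and the two slots are invented stand-ins for the Slovene ones.

DESTINATIONS = {"london", "paris", "vienna"}
DAYS = {"monday", "wednesday", "friday"}

def understand(words):
    """Fill a flight-inquiry frame by spotting known keywords."""
    frame = {"destination": None, "day": None}
    for w in words:
        if w in DESTINATIONS:
            frame["destination"] = w
        elif w in DAYS:
            frame["day"] = w
    return frame

print(understand("when does the flight to london leave on friday".split()))
# {'destination': 'london', 'day': 'friday'}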
Žibert J, Gros J, Dobrišek S and Mihelič F (1999), "Language Model Representations for the GOPOLIS Database", Lecture Notes in Computer Science., In Text, Speech and Dialogue. Vol. 1692, pp. 380-383. Springer Berlin Heidelberg.
Abstract: The formation of a domain-oriented sentence corpus from sentence pattern rules is described. The same rules were transformed into word networks to serve as a language model within an HTK-based speech recognition system. The performance of the word-network language model was compared to that of a bigram model.
BibTeX:
@article{Zibert1999,
  author = {Janez \v{Z}ibert and Jerneja Gros and Simon Dobri\v{s}ek and France Miheli\v{c}},
  editor = {Matousek, Václav and Mautner, Pavel and Ocelíková, Jana and Sojka, Petr},
  title = {Language Model Representations for the GOPOLIS Database},
  booktitle = {Text, Speech and Dialogue},
  journal = {Lecture Notes in Computer Science},
  publisher = {Springer Berlin Heidelberg},
  year = {1999},
  volume = {1692},
  pages = {380-383},
  url = {http://dx.doi.org/10.1007/3-540-48239-3_73},
  doi = {10.1007/3-540-48239-3_73}
}
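A word-network language model of the kind described above can be sketched as a small graph whose paths are exactly the sentences admitted by the pattern rules; the toy network below is an invented stand-in for the GOPOLIS domain grammar.

# Network: node -> list of (word, next node); "END" terminates a path.
NETWORK = {
    "S": [("when", "Q"), ("what", "Q")],
    "Q": [("does", "V")],
    "V": [("the", "N")],
    "N": [("flight", "T")],
    "T": [("depart", "END"), ("arrive", "END")],
}

def sentences(node="S", prefix=()):
    """Enumerate every word string accepted by the network."""
    if node == "END":
        yield " ".join(prefix)
        return
    for word, nxt in NETWORK[node]:
        yield from sentences(nxt, prefix + (word,))

for s in sentences():
    print(s)   # four sentences, e.g. "when does the flight depart"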
Dobrišek S, Mihelič F and Pavešić N (1997), "A Multiresolutionally Oriented Approach for Determination of Cepstral Features In Speech Recognition", In Proceedings of the Fifth European Conference on Speech Communication and Technology, EUROSPEECH'97. Rhodes, Greece, September, 1997, pp. 1367-1370.
Abstract: This paper presents an effort to provide a more efficient speech signal representation intended for incorporation into an automatic speech recognition system. Modified cepstral coefficients, derived from a multiresolution auditory spectrum, are proposed. The multiresolution spectrum was obtained using sliding single-point discrete Fourier transformations. It is shown that the obtained spectrum values are similar to the results of a non-uniform filtering operation. The presented cepstral features are evaluated by introducing them into a simple phone recognition system.
BibTeX:
@conference{Dobrisek1997,
  author = {Simon Dobri\v{s}ek and France Miheli\v{c} and Nikola Pave\v{s}i\'{c}},
  title = {A Multiresolutionally Oriented Approach for Determination of Cepstral Features In Speech Recognition},
  booktitle = {Proceedings of the Fifth European Conference on Speech Communication and Technology, EUROSPEECH'97},
  year = {1997},
  pages = {1367-1370},
  url = {http://www.isca-speech.org/archive/eurospeech_1997/e97_1367.html}
}
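The feature extraction described above can be approximated by the following sketch, in which each frequency is analysed with its own window length via a single-point DFT and the log spectrum is turned into cepstral coefficients by a DCT; the window-length schedule and filter spacing are invented placeholders, not the paper's exact design.

import numpy as np

def single_point_dft(frame, freq, fs):
    """Magnitude of the DFT of one windowed frame at a single frequency."""
    n = np.arange(len(frame))
    return np.abs(np.sum(frame * np.exp(-2j * np.pi * freq * n / fs)))

def multires_spectrum(signal, start, freqs, fs):
    """Each frequency gets its own window: long at low, short at high."""
    values = []
    for f in freqs:
        win = int(fs * 8 / f)           # placeholder rule: 8 cycles per window
        frame = signal[start:start + win] * np.hamming(win)
        values.append(single_point_dft(frame, f, fs))
    return np.array(values)

def cepstrum(spectrum, n_coeffs=13):
    """Modified cepstral coefficients: DCT of the log spectrum."""
    log_s = np.log(spectrum + 1e-10)
    k = np.arange(len(log_s))
    return np.array([np.sum(log_s * np.cos(np.pi * q * (k + 0.5) / len(log_s)))
                     for q in range(n_coeffs)])

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)          # toy one-second signal
freqs = np.geomspace(100, 6000, 24)      # log-spaced analysis frequencies
print(cepstrum(multires_spectrum(x, 1000, freqs, fs))[:5])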
Gros J, Ipšić I, Dobrišek S, Mihelič F and Pavešić N (1996), "Segmentation and Labelling of Slovenian Diphone Inventories", In Proceedings of the 16th International Conference on Computational Linguistics, COLING 1996. Copenhagen, Denmark. Vol. 1, pp. 298-303.
Abstract: Preparation, recording, segmentation and pitch labelling of Slovenian diphone inventories are described. A special user-friendly interface package was developed in order to facilitate these operations. As the acquisition of a labelled diphone inventory, or the adaptation of a speech synthesis system to synthesise further voices, is manually intensive, an automatic procedure is required. A speech recogniser based on Hidden Markov Models in forced segmentation mode is used to outline phone boundaries within spoken logatoms. A statistical evaluation of manual and automatic segmentation discrepancies is performed so as to estimate the reliability of automatically derived labels. Finally, diphone boundaries are determined and pitch markers are assigned to voiced sections of the speech signal.
BibTeX:
@conference{Gros1996,
  author = {Jerneja Gros and Ivo Ip\v{s}i\'{c} and Simon Dobri\v{s}ek and France Miheli\v{c} and Nikola Pave\v{s}i\'{c}},
  title = {Segmentation and Labelling of Slovenian Diphone Inventories},
  booktitle = {Proceedings of the 16th International Conference on Computational Linguistics, COLING 1996},
  year = {1996},
  volume = {1},
  pages = {298-303},
  url = {http://aclweb.org/anthology/C96-1051}
}
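The statistical evaluation of manual versus automatic segmentation mentioned above amounts to simple discrepancy statistics over corresponding boundaries; the boundary times below are invented illustrations.

import statistics

manual    = [120, 340, 565, 800, 1020]   # hand-labelled boundaries (ms)
automatic = [118, 352, 560, 812, 1015]   # forced-alignment boundaries (ms)

diffs = [a - m for m, a in zip(manual, automatic)]
print("mean error:", statistics.mean(diffs), "ms")
print("std dev:", round(statistics.stdev(diffs), 1), "ms")

TOL = 10   # tolerance in ms
within = sum(abs(d) <= TOL for d in diffs) / len(diffs)
print("boundaries within", TOL, "ms:", format(within, ".0%"))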
Dobrišek S, Mihelič F and Pavešić N (1995), "Multi-Variate Mixture Probability Density Modelling Of VQ Codebook Using Gradient Descent Algorithm", In Proceedings of the Fourth European Conference on Speech Communication and Technology, EUROSPEECH'95. Madrid, Spain, September, 1995, pp. 1431-1434.
Abstract: A vector quantisation codebook can be modelled as a set of probability density functions. The problem of estimating the parameters of mixture probability density models can be solved using a log-likelihood-based re-estimation procedure. On the other hand, this problem can also be viewed as a conventional optimisation problem, so gradient descent techniques may be used to obtain the values of the model parameters. The main advantage of these techniques over the re-estimation procedure is their higher robustness with respect to the initial estimates of the model parameters. In the paper, we describe a descent algorithm along with a criterion function that we propose. We obtained some promising results by applying this algorithm to one- and two-variate pseudo-Gaussian mixture probability density functions, and further to signal vectors of a continuous speech database.
BibTeX:
@conference{Dobrisek1995,
  author = {Simon Dobri\v{s}ek and France Miheli\v{c} and Nikola Pave\v{s}i\'{c}},
  title = {Multi-Variate Mixture Probability Density Modelling Of VQ Codebook Using Gradient Descent Algorithm},
  booktitle = {Proceedings of the Fourth European Conference on Speech Communication and Technology, EUROSPEECH'95},
  year = {1995},
  pages = {1431-1434},
  url = {http://www.isca-speech.org/archive/eurospeech_1995/e95_1431.html}
}
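The gradient-based alternative to re-estimation described above can be sketched for a univariate two-component Gaussian mixture; the learning rate, initialisation, and the restriction to fixed weights and unit variances are simplifications, not the paper's full criterion.

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

mu = np.array([-1.0, 1.0])    # component means to be estimated
w = np.array([0.5, 0.5])      # fixed equal weights; unit variances for brevity

def responsibilities(x, mu, w):
    """Posterior probability of each component for each sample."""
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    return dens / dens.sum(axis=1, keepdims=True)

lr = 0.5
for step in range(200):
    r = responsibilities(data, mu, w)
    grad = (r * (data[:, None] - mu)).mean(axis=0)   # d(mean log-lik)/d(mu)
    mu += lr * grad

print(mu)   # approaches the true means (-2, 3)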
Mihelič F, Ipšić I, Dobrišek S and Pavešić N (1992), "Feature representations and classification procedures for Slovene phoneme recognition", Pattern Recognition Letters. Vol. 13(12), pp. 879-891.
Abstract: In this paper a comparison of the performance of different feature representations of the speech signal and a comparison of classification procedures for Slovene phoneme recognition are presented. Recognition results are obtained on a database of continuous Slovene speech consisting of short Slovene sentences spoken by female speakers. MEL-cepstrum and LPC-cepstrum features combined with the normalized frame loudness were found to be the most suitable feature representations for Slovene speech. It was found that determining the MEL-cepstrum using linear spacing of the bandpass filters gave significantly better results for speaker-dependent recognition. Comparison of the classification procedures favours Bayes classification assuming a normal distribution of the feature vectors (BNF) over classification based on quadratic discriminant functions (DF) for minimum mean-square error and the subspace method (SM), which does not confirm the results obtained in some previous studies for German and Finnish speech. Additionally, classification procedures based on hidden Markov models (HMM) and the Kohonen Self-Organizing Map (KSOM) were tested on a smaller amount of speech data (one speaker only); their classification results are comparable with those obtained using BNF.
BibTeX:
@article{Mihelic1992,
  author = {France Miheli\v{c} and Ivo Ip\v{s}i\'{c} and Simon Dobri\v{s}ek and Nikola Pave\v{s}i\'{c}},
  title = {Feature representations and classification procedures for Slovene phoneme recognition},
  journal = {Pattern Recognition Letters},
  year = {1992},
  volume = {13},
  number = {12},
  pages = {879-891},
  url = {http://www.sciencedirect.com/science/article/pii/016786559290087G},
  doi = {10.1016/0167-8655(92)90087-G}
}
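The BNF classifier favoured above reduces to fitting one normal density per phoneme class and picking the class with the highest log-density; the two-dimensional toy features below stand in for MEL-cepstrum vectors.

import numpy as np

rng = np.random.default_rng(1)
train = {                                   # per-class training vectors
    "a": rng.normal([0, 0], 1.0, (200, 2)),
    "i": rng.normal([4, 4], 1.0, (200, 2)),
}

# Fit class-conditional normal densities: sample mean and covariance.
models = {p: (X.mean(axis=0), np.cov(X, rowvar=False)) for p, X in train.items()}

def log_density(x, mean, cov):
    """Log of the multivariate normal density at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.inv(cov) @ d + logdet + len(x) * np.log(2 * np.pi))

def classify(x):
    """Bayes decision with equal priors: highest class-conditional density."""
    return max(models, key=lambda p: log_density(x, *models[p]))

print(classify(np.array([0.5, -0.2])))   # -> 'a'
print(classify(np.array([3.6, 4.1])))    # -> 'i'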