2016 |
Walter Scheirer; Patrick Flynn; Changxing Ding; Guodong Guo; Vitomir Štruc; Mohamad Al Jazaery; Simon Dobrišek; Klemen Grm; Dacheng Tao; Yu Zhu; Joel Brogan; Sandipan Banerjee; Aparna Bharati; Brandon Richard Webster Report on the BTAS 2016 Video Person Recognition Evaluation Conference Proceedings of the IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS), IEEE, 2016. @conference{BTAS2016, title = {Report on the BTAS 2016 Video Person Recognition Evaluation}, author = {Walter Scheirer and Patrick Flynn and Changxing Ding and Guodong Guo and Vitomir \v{S}truc and Mohamad Al Jazaery and Simon Dobri\v{s}ek and Klemen Grm and Dacheng Tao and Yu Zhu and Joel Brogan and Sandipan Banerjee and Aparna Bharati and Brandon Richard Webster}, year = {2016}, date = {2016-10-05}, booktitle = {Proceedings of the IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS)}, publisher = {IEEE}, abstract = {This report presents results from the Video Person Recognition Evaluation held in conjunction with the 8th IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS). Two experiments required algorithms to recognize people in videos from the Point-and-Shoot Face Recognition Challenge Problem (PaSC). The first consisted of videos from a tripod-mounted high-quality video camera. The second contained videos acquired from 5 different handheld video cameras. There were 1,401 videos in each experiment of 265 subjects. The subjects, the scenes, and the actions carried out by the people are the same in both experiments. An additional experiment required algorithms to recognize people in videos from the Video Database of Moving Faces and People (VDMFP). There were 958 videos in this experiment of 297 subjects. Four groups from around the world participated in the evaluation. The top verification rate for PaSC from this evaluation is 0.98 at a false accept rate of 0.01 \textemdash a remarkable advancement in performance from the competition held at FG 2015.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } |
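The operating point quoted in this abstract, a verification rate at a fixed false accept rate, is read off the genuine and impostor score distributions. A minimal sketch with synthetic scores (not data from the evaluation):

```python
import numpy as np

# Synthetic score distributions; the evaluation's real scores are not
# reproduced here, this only illustrates how the operating point is computed.
rng = np.random.default_rng(5)
impostor = rng.normal(0.0, 1.0, 10000)   # scores for non-matching pairs
genuine = rng.normal(3.0, 1.0, 10000)    # scores for matching pairs

# Threshold chosen so 1% of impostor scores are (falsely) accepted; the
# verification rate is then the fraction of genuine scores above it.
thr = float(np.quantile(impostor, 0.99))
far = float(np.mean(impostor >= thr))    # ~0.01 by construction
vr = float(np.mean(genuine >= thr))      # verification rate at FAR = 0.01
```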
Janez Križaj; Simon Dobrišek; France Mihelič; Vitomir Štruc Facial Landmark Localization from 3D Images Inproceedings In: Proceedings of the Electrotechnical and Computer Science Conference (ERK), Portorož, Slovenia, 2016. @inproceedings{ERK2016Janez, title = {Facial Landmark Localization from 3D Images}, author = {Janez Kri\v{z}aj and Simon Dobri\v{s}ek and France Miheli\v{c} and Vitomir \v{S}truc}, year = {2016}, date = {2016-09-20}, booktitle = {Proceedings of the Electrotechnical and Computer Science Conference (ERK)}, address = {Portoro\v{z}, Slovenia}, abstract = {A novel method for automatic facial landmark localization is presented. The method builds on the supervised descent framework, which was shown to successfully localize landmarks in the presence of large expression variations and mild occlusions, but struggles when localizing landmarks on faces with large pose variations. We propose an extension of the supervised descent framework which trains multiple descent maps and results in increased robustness to pose variations. The performance of the proposed method is demonstrated on the Bosphorus database for the problem of facial landmark localization from 3D data. Our experimental results show that the proposed method exhibits increased robustness to pose variations, while retaining high performance in the case of expression and occlusion variations.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
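The supervised descent framework this abstract builds on learns linear maps from features to landmark updates. A toy sketch of fitting one such descent map by least squares on synthetic data; the paper's contribution, training multiple pose-specific maps, would repeat this step per pose cluster (all names and dimensions here are illustrative):

```python
import numpy as np

# Synthetic training set: features phi observed at the current landmark
# estimate and the landmark updates dx that would move it to ground truth.
rng = np.random.default_rng(0)
n_samples, n_feat, n_landmarks = 200, 16, 10
R_true = rng.normal(size=(n_landmarks, n_feat))   # underlying linear relation
phi = rng.normal(size=(n_samples, n_feat))
dx = phi @ R_true.T

# One supervised-descent training step: least-squares fit of the descent map.
R_est, *_ = np.linalg.lstsq(phi, dx, rcond=None)

# On this noiseless toy data the learned map reproduces the updates exactly.
err = float(np.abs(phi @ R_est - dx).max())
```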
Simon Dobrišek; David Čefarin; Vitomir Štruc; France Mihelič Assessment of the Google Speech Application Programming Interface for Automatic Slovenian Speech Recognition Inproceedings In: Jezikovne Tehnologije in Digitalna Humanistika, 2016. @inproceedings{SJDT, title = {Assessment of the Google Speech Application Programming Interface for Automatic Slovenian Speech Recognition}, author = {Simon Dobri\v{s}ek and David \v{C}efarin and Vitomir \v{S}truc and France Miheli\v{c}}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/jtdh16-ulfe-luks-sd-final-pdfa.pdf}, year = {2016}, date = {2016-09-20}, booktitle = {Jezikovne Tehnologije in Digitalna Humanistika}, abstract = {Automatic speech recognizers are slowly maturing into technologies that enable humans to communicate more naturally and effectively with a variety of smart devices and information-communication systems. Large global companies such as Google, Microsoft, Apple, IBM and Baidu compete in developing the most reliable speech recognizers, supporting as many of the main world languages as possible. Due to the relatively small number of speakers, the support for the Slovenian spoken language is lagging behind, and among the major global companies only Google has recently supported our spoken language. The paper presents the results of our independent assessment of the Google speech application programming interface for automatic Slovenian speech recognition. For the experiments, we used speech databases that are otherwise used for the development and assessment of Slovenian speech recognizers.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
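Assessments like this one are typically reported in terms of word error rate. A self-contained sketch of the standard WER computation (word-level Levenshtein distance divided by reference length); this is the common metric, not code from the paper:

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions)
    between reference and hypothesis, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match/substitution
    return d[len(r)][len(h)] / len(r)
```

For example, `word_error_rate("a b c d", "a x c")` is 0.5: one substitution plus one deletion over four reference words.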
Klemen Grm; Simon Dobrišek; Vitomir Štruc Deep pair-wise similarity learning for face recognition Conference 4th International Workshop on Biometrics and Forensics (IWBF), IEEE 2016. @conference{grm2016deep, title = {Deep pair-wise similarity learning for face recognition}, author = { Klemen Grm and Simon Dobri\v{s}ek and Vitomir \v{S}truc}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/IWBF_2016.pdf}, year = {2016}, date = {2016-01-01}, booktitle = {4th International Workshop on Biometrics and Forensics (IWBF)}, pages = {1--6}, organization = {IEEE}, abstract = {Recent advances in deep learning made it possible to build deep hierarchical models capable of delivering state-of-the-art performance in various vision tasks, such as object recognition, detection or tracking. For recognition tasks the most common approach when using deep models is to learn object representations (or features) directly from raw image-input and then feed the learned features to a suitable classifier. Deep models used in this pipeline are typically heavily parameterized and require enormous amounts of training data to deliver competitive recognition performance. Despite the use of data augmentation techniques, many application domains, predefined experimental protocols or specifics of the recognition problem limit the amount of available training data and make training an effective deep hierarchical model a difficult task. In this paper, we present a novel, deep pair-wise similarity learning (DPSL) strategy for deep models, developed specifically to overcome the problem of insufficient training data, and demonstrate its usage on the task of face recognition. Unlike existing (deep) learning strategies, DPSL operates on image-pairs and tries to learn pair-wise image similarities that can be used for recognition purposes directly instead of feature representations that need to be fed to appropriate classification techniques, as with traditional deep learning pipelines. 
Since our DPSL strategy assumes an image pair as the input to the learning procedure, the amount of training data available to train deep models is quadratic in the number of available training images, which is of paramount importance for models with a large number of parameters. We demonstrate the efficacy of the proposed learning strategy by developing a deep model for pose-invariant face recognition, called Pose-Invariant Similarity Index (PISI), and presenting comparative experimental results on the FERET and IJB-A datasets.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } |
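The data-efficiency argument in the abstract, quadratically many training pairs from N images, is easy to make concrete. A minimal sketch with made-up identity labels:

```python
import itertools
import numpy as np

# Five training images with identity labels; DPSL would train on pairs of
# images rather than single images (labels here are illustrative).
labels = np.array([0, 0, 1, 1, 2])

# All unordered pairs: N*(N-1)/2 of them, quadratic in N.
pairs = list(itertools.combinations(range(len(labels)), 2))
# Pair-level label: 1 if both images show the same identity, else 0.
pair_labels = [int(labels[i] == labels[j]) for i, j in pairs]

n_pairs = len(pairs)            # 10 pairs from only 5 images
n_matching = sum(pair_labels)   # 2 same-identity pairs in this toy set
```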
2015 |
Vitomir Štruc; Janez Križaj; Simon Dobrišek Modest face recognition Conference Proceedings of the International Workshop on Biometrics and Forensics (IWBF), IEEE, 2015. @conference{struc2015modest, title = {Modest face recognition}, author = { Vitomir \v{S}truc and Janez Kri\v{z}aj and Simon Dobri\v{s}ek}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/IWBF2015.pdf}, year = {2015}, date = {2015-01-01}, booktitle = {Proceedings of the International Workshop on Biometrics and Forensics (IWBF)}, pages = {1--6}, publisher = {IEEE}, abstract = {The facial imagery usually at the disposal of forensic investigations is commonly of a poor quality due to the unconstrained settings in which it was acquired. The captured faces are typically non-frontal, partially occluded and of a low resolution, which makes the recognition task extremely difficult. In this paper we try to address this problem by presenting a novel framework for face recognition that combines diverse feature sets (Gabor features, local binary patterns, local phase quantization features and pixel intensities), probabilistic linear discriminant analysis (PLDA) and data fusion based on linear logistic regression. With the proposed framework a matching score for the given pair of probe and target images is produced by applying PLDA on each of the four feature sets independently - producing a (partial) matching score for each of the PLDA-based feature vectors - and then combining the partial matching results at the score level to generate a single matching score for recognition. We make two main contributions in the paper: i) we introduce a novel framework for face recognition that relies on probabilistic MOdels of Diverse fEature SeTs (MODEST) to facilitate the recognition process and ii) benchmark it against the existing state-of-the-art.
We demonstrate the feasibility of our MODEST framework on the FRGCv2 and PaSC databases and present comparative results with the state-of-the-art recognition techniques, which demonstrate the efficacy of our framework.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } |
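The final fusion step described above, combining the per-feature-set PLDA scores with linear logistic regression, can be sketched on synthetic scores; plain gradient ascent stands in for whatever solver the authors used, and all data below is made up:

```python
import numpy as np

# Synthetic partial matching scores: four per trial, each weakly separating
# genuine (y = 1) from impostor (y = 0) comparisons.
rng = np.random.default_rng(1)
n = 500
y = rng.integers(0, 2, n)
S = y[:, None] + rng.normal(scale=1.5, size=(n, 4))

# Linear logistic regression fitted by gradient ascent on the log-likelihood.
w, b = np.zeros(4), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(S @ w + b)))
    w += 0.5 * (S.T @ (y - p)) / n
    b += 0.5 * float(np.mean(y - p))

fused = S @ w + b                               # single calibrated score
acc = float(np.mean((fused > 0) == (y == 1)))   # accuracy of thresholding at 0
```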
Ross Beveridge; Hao Zhang; Bruce A Draper; Patrick J Flynn; Zhenhua Feng; Patrik Huber; Josef Kittler; Zhiwu Huang; Shaoxin Li; Yan Li; Vitomir Štruc; Janez Križaj; others Report on the FG 2015 video person recognition evaluation Conference 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (IEEE FG), 1 , IEEE 2015. @conference{beveridge2015report, title = {Report on the FG 2015 video person recognition evaluation}, author = {Ross Beveridge and Hao Zhang and Bruce A Draper and Patrick J Flynn and Zhenhua Feng and Patrik Huber and Josef Kittler and Zhiwu Huang and Shaoxin Li and Yan Li and Vitomir \v{S}truc and Janez Kri\v{z}aj and others}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/fg2015videoEvalPreprint.pdf}, year = {2015}, date = {2015-01-01}, booktitle = {11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (IEEE FG)}, volume = {1}, pages = {1--8}, organization = {IEEE}, abstract = {This report presents results from the Video Person Recognition Evaluation held in conjunction with the 11th IEEE International Conference on Automatic Face and Gesture Recognition. Two experiments required algorithms to recognize people in videos from the Point-and-Shoot Face Recognition Challenge Problem (PaSC). The first consisted of videos from a tripod mounted high quality video camera. The second contained videos acquired from 5 different handheld video cameras. There were 1401 videos in each experiment of 265 subjects. The subjects, the scenes, and the actions carried out by the people are the same in both experiments. Five groups from around the world participated in the evaluation. The video handheld experiment was included in the International Joint Conference on Biometrics (IJCB) 2014 Handheld Video Face and Person Recognition Competition. The top verification rate from this evaluation is double that of the top performer in the IJCB competition. 
Analysis shows that the factor most affecting algorithm performance is the combination of location and action: where the video was acquired and what the person was doing.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } |
Tadej Justin; Vitomir Štruc; Simon Dobrišek; Boštjan Vesnicer; Ivo Ipšić; France Mihelič Speaker de-identification using diphone recognition and speech synthesis Conference 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (IEEE FG): DeID 2015, 4 , IEEE 2015. @conference{justin2015speaker, title = {Speaker de-identification using diphone recognition and speech synthesis}, author = { Tadej Justin and Vitomir \v{S}truc and Simon Dobri\v{s}ek and Bo\v{s}tjan Vesnicer and Ivo Ip\v{s}i\'{c} and France Miheli\v{c}}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/Deid2015.pdf}, year = {2015}, date = {2015-01-01}, booktitle = {11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (IEEE FG): DeID 2015}, volume = {4}, pages = {1--7}, organization = {IEEE}, abstract = {The paper addresses the problem of speaker (or voice) de-identification by presenting a novel approach for concealing the identity of speakers in their speech. The proposed technique first recognizes the input speech with a diphone recognition system and then transforms the obtained phonetic transcription into the speech of another speaker with a speech synthesis system. Due to the fact that a Diphone RecOgnition step and a sPeech SYnthesis step are used during the de-identification, we refer to the developed technique as DROPSY. With this approach the acoustical models of the recognition and synthesis modules are completely independent from each other, which ensures the highest level of input speaker de-identification. The proposed DROPSY-based de-identification approach is language dependent, text independent and capable of running in real-time due to the relatively simple computing methods used.
When designing speaker de-identification technology two requirements are typically imposed on the de-identification techniques: i) it should not be possible to establish the identity of the speakers based on the de-identified speech, and ii) the processed speech should still sound natural and be intelligible. This paper, therefore, implements the proposed DROPSY-based approach with two different speech synthesis techniques (i.e., with the HMM-based and the diphone TD-PSOLA-based technique). The obtained de-identified speech is evaluated for intelligibility and evaluated in speaker verification experiments with a state-of-the-art (i-vector/PLDA) speaker recognition system. The comparison of both speech synthesis modules integrated in the proposed method reveals that both can efficiently de-identify the input speakers while still producing intelligible speech.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } |
Simon Dobrišek; Vitomir Štruc; Janez Križaj; France Mihelič Face recognition in the wild with the Probabilistic Gabor-Fisher Classifier Conference 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (IEEE FG): BWild 2015, 2 , IEEE 2015. @conference{dobrivsek2015face, title = {Face recognition in the wild with the Probabilistic Gabor-Fisher Classifier}, author = { Simon Dobri\v{s}ek and Vitomir \v{S}truc and Janez Kri\v{z}aj and France Miheli\v{c}}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/Bwild2015.pdf}, year = {2015}, date = {2015-01-01}, booktitle = {11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (IEEE FG): BWild 2015}, volume = {2}, pages = {1--6}, organization = {IEEE}, abstract = {The paper addresses the problem of face recognition in the wild. It introduces a novel approach to unconstrained face recognition that exploits Gabor magnitude features and a simplified version of the probabilistic linear discriminant analysis (PLDA). The novel approach, named Probabilistic Gabor-Fisher Classifier (PGFC), first extracts a vector of Gabor magnitude features from the given input image using a battery of Gabor filters, then reduces the dimensionality of the extracted feature vector by projecting it into a low-dimensional subspace and finally produces a representation suitable for identity inference by applying PLDA to the projected feature vector. The proposed approach extends the popular Gabor-Fisher Classifier (GFC) to a probabilistic setting and thus improves on the generalization capabilities of the GFC method. The PGFC technique is assessed in face verification experiments on the Point and Shoot Face Recognition Challenge (PaSC) database, which features real-world videos of subjects performing everyday tasks. 
Experimental results on this challenging database show the feasibility of the proposed approach, which improves on the best results on this database reported in the literature by the time of writing.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } |
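The first stage of the PGFC pipeline, Gabor magnitude features from a filter bank, can be sketched with a single filter. The parameters and the FFT-based circular convolution below are illustrative choices, not the paper's exact configuration:

```python
import numpy as np

def gabor_kernel(size=31, wavelength=8.0, theta=0.0, sigma=4.0):
    """Complex Gabor kernel: a Gaussian envelope times a complex sinusoid.
    A real filter bank would vary wavelength and orientation theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return envelope * np.exp(1j * 2.0 * np.pi * xr / wavelength)

def gabor_magnitude(image, kernel):
    """Filter the image and keep only the magnitude of the complex response
    (circular FFT convolution, uncentered -- adequate for a sketch)."""
    H = np.fft.fft2(kernel, s=image.shape)
    return np.abs(np.fft.ifft2(np.fft.fft2(image) * H))

rng = np.random.default_rng(2)
face = rng.random((64, 64))                    # stand-in for an aligned face crop
feat = gabor_magnitude(face, gabor_kernel())   # one magnitude feature map
```

In the full method such maps from a battery of filters are concatenated, projected to a low-dimensional subspace, and modeled with PLDA.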
2014 |
Janez Križaj; Vitomir Štruc; Simon Dobrišek; Darijan Marčetić; Slobodan Ribarić SIFT vs. FREAK: Assessing the usefulness of two keypoint descriptors for 3D face verification Inproceedings In: 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) , pp. 1336–1341, Mipro Opatija, Croatia, 2014. @inproceedings{krivzaj2014sift, title = {SIFT vs. FREAK: Assessing the usefulness of two keypoint descriptors for 3D face verification}, author = { Janez Kri\v{z}aj and Vitomir \v{S}truc and Simon Dobri\v{s}ek and Darijan Mar\v{c}eti\'{c} and Slobodan Ribari\'{c}}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/MIPRO2014a.pdf}, year = {2014}, date = {2014-01-01}, booktitle = {37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) }, pages = {1336--1341}, address = {Opatija, Croatia}, organization = {Mipro}, abstract = {Many techniques in the area of 3D face recognition rely on local descriptors to characterize the surface-shape information around points of interest (or keypoints) in the 3D images. Despite the fact that a lot of advancements have been made in the area of keypoint descriptors over the last years, the literature on 3D-face recognition for the most part still focuses on established descriptors, such as SIFT and SURF, and largely neglects more recent descriptors, such as the FREAK descriptor. In this paper we try to bridge this gap and assess the usefulness of the FREAK descriptor for the task of 3D face recognition. Of particular interest to us is a direct comparison of the FREAK and SIFT descriptors within a simple verification framework.
To evaluate our framework with the two descriptors, we conduct 3D face recognition experiments on the challenging FRGCv2 and UMBDB databases and show that the FREAK descriptor ensures a very competitive verification performance when compared to the SIFT descriptor, but at a fraction of the computational cost. Our results indicate that the FREAK descriptor is a viable alternative to the SIFT descriptor for the problem of 3D face verification and due to its binary nature is particularly useful for real-time recognition systems and verification techniques for low-resource devices such as mobile phones, tablets and the like.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
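The computational advantage attributed to FREAK above comes from its binary nature: matching reduces to a Hamming distance (XOR and popcount) instead of the Euclidean distance used for SIFT's 128-dimensional float vectors. A minimal illustration with made-up descriptors:

```python
import numpy as np

# Two made-up 512-bit binary descriptors (FREAK produces 512 bits).
d1 = np.unpackbits(np.frombuffer(b"\x0f\x0f" * 32, dtype=np.uint8))
d2 = np.unpackbits(np.frombuffer(b"\x00\x0f" * 32, dtype=np.uint8))
hamming = int(np.count_nonzero(d1 != d2))   # binary matching cost in bits

# SIFT-style matching on 128-dimensional float descriptors, for contrast.
s1 = np.zeros(128, dtype=np.float32)
s2 = np.ones(128, dtype=np.float32)
euclidean = float(np.linalg.norm(s1 - s2))  # sqrt(128)
```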
Boštjan Vesnicer; Jerneja Žganec-Gros; Simon Dobrišek; Vitomir Štruc Incorporating Duration Information into I-Vector-Based Speaker-Recognition Systems Conference Proceedings of Odyssey: The Speaker and Language Recognition Workshop, 2014. @conference{vesnicer2014incorporating, title = {Incorporating Duration Information into I-Vector-Based Speaker-Recognition Systems}, author = { Bo\v{s}tjan Vesnicer and Jerneja \v{Z}ganec-Gros and Simon Dobri\v{s}ek and Vitomir \v{S}truc}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/Odyssey.pdf}, year = {2014}, date = {2014-01-01}, booktitle = {Proceedings of Odyssey: The Speaker and Language Recognition Workshop}, pages = {241--248}, abstract = {Most of the existing literature on i-vector-based speaker recognition focuses on recognition problems, where i-vectors are extracted from speech recordings of sufficient length. The majority of modeling/recognition techniques therefore simply ignores the fact that the i-vectors are most likely estimated unreliably when short recordings are used for their computation. Only recently were a number of solutions proposed in the literature to address the problem of duration variability, all treating the i-vector as a random variable whose posterior distribution can be parameterized by the posterior mean and the posterior covariance. In this setting the covariance matrix serves as a measure of uncertainty that is related to the length of the available recording. In contrast to these solutions, we address the problem of duration variability through weighted statistics. We demonstrate in the paper how established feature transformation techniques regularly used in the area of speaker recognition, such as PCA or WCCN, can be modified to take duration into account.
We evaluate our weighting scheme in the scope of the i-vector challenge organized as part of the Odyssey, Speaker and Language Recognition Workshop 2014 and achieve a minimal DCF of 0.280, which at the time of writing puts our approach in third place among all the participating institutions.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } Most of the existing literature on i-vector-based speaker recognition focuses on recognition problems, where i-vectors are extracted from speech recordings of sufficient length. The majority of modeling/recognition techniques therefore simply ignores the fact that the i-vectors are most likely estimated unreliably when short recordings are used for their computation. Only recently were a number of solutions proposed in the literature to address the problem of duration variability, all treating the i-vector as a random variable whose posterior distribution can be parameterized by the posterior mean and the posterior covariance. In this setting the covariance matrix serves as a measure of uncertainty that is related to the length of the available recording. In contrast to these solutions, we address the problem of duration variability through weighted statistics. We demonstrate in the paper how established feature transformation techniques regularly used in the area of speaker recognition, such as PCA or WCCN, can be modified to take duration into account. We evaluate our weighting scheme in the scope of the i-vector challenge organized as part of the Odyssey, Speaker and Language Recognition Workshop 2014 and achieve a minimal DCF of 0.280, which at the time of writing puts our approach in third place among all the participating institutions. |
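The weighted-statistics idea above can be illustrated for PCA: replace the equal-weight sample mean and covariance with duration-weighted ones, so short (unreliable) recordings contribute less. A minimal sketch in Python/NumPy; the function name and the proportional weighting are illustrative assumptions, not the paper's actual scheme:

```python
import numpy as np

def weighted_pca(ivectors, durations, n_components):
    # Duration-weighted mean and covariance: longer recordings yield more
    # reliably estimated i-vectors, so they receive larger weights.
    w = np.asarray(durations, dtype=float)
    w = w / w.sum()                                   # normalize weights
    mu = (w[:, None] * ivectors).sum(axis=0)          # weighted mean
    X = ivectors - mu
    cov = (w[:, None] * X).T @ X                      # weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]  # top components first
    return mu, eigvecs[:, order]

# Toy data: 10 i-vectors of dimension 5 with random durations in seconds.
rng = np.random.default_rng(0)
iv = rng.normal(size=(10, 5))
dur = rng.uniform(1.0, 30.0, size=10)
mu, W = weighted_pca(iv, dur, 3)
```

The same substitution (weighted first- and second-order statistics) carries over to WCCN, where the within-class covariance is accumulated with per-recording weights.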
2013 |
Janez Križaj; Simon Dobrišek; Vitomir Štruc; Nikola Pavešić Robust 3D face recognition using adapted statistical models Inproceedings In: Proceedings of the Electrotechnical and Computer Science Conference (ERK'13), 2013. @inproceedings{krizajrobust, title = {Robust 3D face recognition using adapted statistical models}, author = {Janez Kri\v{z}aj and Simon Dobri\v{s}ek and Vitomir \v{S}truc and Nikola Pave\v{s}i\'{c}}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/ERK2013b.pdf}, year = {2013}, date = {2013-09-20}, booktitle = {Proceedings of the Electrotechnical and Computer Science Conference (ERK'13)}, abstract = {The paper presents a novel framework for 3D face recognition that exploits region covariance matrices (RCMs), Gaussian mixture models (GMMs) and support vector machine (SVM) classifiers. The proposed framework first combines several 3D face representations at the feature level using RCM descriptors and then derives low-dimensional feature vectors from the computed descriptors with the unscented transform. By doing so, it enables computations in Euclidean space, and makes Gaussian mixture modeling feasible. Finally, a support vector classifier is used for identity inference. As demonstrated by our experimental results on the FRGCv2 and UMB databases, the proposed framework is highly robust and exhibits desirable characteristics such as an inherent mechanism for data fusion (through the RCMs), the ability to examine local as well as global structures of the face with the same descriptor, the ability to integrate domain-specific prior knowledge into the modeling procedure and consequently to handle missing or unreliable data. }, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } The paper presents a novel framework for 3D face recognition that exploits region covariance matrices (RCMs), Gaussian mixture models (GMMs) and support vector machine (SVM) classifiers. 
The proposed framework first combines several 3D face representations at the feature level using RCM descriptors and then derives low-dimensional feature vectors from the computed descriptors with the unscented transform. By doing so, it enables computations in Euclidean space, and makes Gaussian mixture modeling feasible. Finally, a support vector classifier is used for identity inference. As demonstrated by our experimental results on the FRGCv2 and UMB databases, the proposed framework is highly robust and exhibits desirable characteristics such as an inherent mechanism for data fusion (through the RCMs), the ability to examine local as well as global structures of the face with the same descriptor, the ability to integrate domain-specific prior knowledge into the modeling procedure and consequently to handle missing or unreliable data. |
Vitomir Štruc; Jerneja Žganec-Gros; Nikola Pavešić; Simon Dobrišek Zlivanje informacij za zanesljivo in robustno razpoznavanje obrazov Journal Article In: Electrotechnical Review, 80 (3), pp. 1-12, 2013. @article{EV_Struc_2013, title = {Zlivanje informacij za zanesljivo in robustno razpoznavanje obrazov}, author = {Vitomir \v{S}truc and Jerneja \v{Z}ganec-Gros and Nikola Pave\v{s}i\'{c} and Simon Dobri\v{s}ek}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/StrucEV2013.pdf}, year = {2013}, date = {2013-09-01}, journal = {Electrotechnical Review}, volume = {80}, number = {3}, pages = {1-12}, abstract = {The existing face recognition technology has reached a performance level where it is possible to deploy it in various applications provided they are capable of ensuring controlled conditions for the image acquisition procedure. However, the technology still struggles with its recognition performance when deployed in uncontrolled and unconstrained conditions. In this paper, we present a novel approach to face recognition designed specifically for these challenging conditions. The proposed approach exploits information fusion to achieve robustness. In the first step, the approach crops the facial region from each input image in three different ways. It then maps each of the three crops into one of four color representations and finally extracts several feature types from each of the twelve facial representations. The described procedure results in a total of thirty facial representations that are combined at the matching score level using a fusion approach based on linear logistic regression (LLR) to arrive at a robust decision regarding the identity of the subject depicted in the input face image. The presented approach was submitted as a representative of the University of Ljubljana and Alpineon d.o.o. 
to the 2013 face-recognition competition that was held in conjunction with the IAPR International Conference on Biometrics and achieved the best overall recognition results among all competition participants. Here, we describe the basic characteristics of the approach, elaborate on the results of the competition and, most importantly, present some interesting findings made during our development work that are also of relevance to the research community working in the field of face recognition. }, keywords = {}, pubstate = {published}, tppubtype = {article} } The existing face recognition technology has reached a performance level where it is possible to deploy it in various applications provided they are capable of ensuring controlled conditions for the image acquisition procedure. However, the technology still struggles with its recognition performance when deployed in uncontrolled and unconstrained conditions. In this paper, we present a novel approach to face recognition designed specifically for these challenging conditions. The proposed approach exploits information fusion to achieve robustness. In the first step, the approach crops the facial region from each input image in three different ways. It then maps each of the three crops into one of four color representations and finally extracts several feature types from each of the twelve facial representations. The described procedure results in a total of thirty facial representations that are combined at the matching score level using a fusion approach based on linear logistic regression (LLR) to arrive at a robust decision regarding the identity of the subject depicted in the input face image. The presented approach was submitted as a representative of the University of Ljubljana and Alpineon d.o.o. to the 2013 face-recognition competition that was held in conjunction with the IAPR International Conference on Biometrics and achieved the best overall recognition results among all competition participants. 
Here, we describe the basic characteristics of the approach, elaborate on the results of the competition and, most importantly, present some interesting findings made during our development work that are also of relevance to the research community working in the field of face recognition. |
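Score-level fusion with linear logistic regression, as used in the entry above, amounts to learning one weight per matcher plus a bias on a calibration set. A small self-contained sketch (two matchers instead of thirty, plain gradient descent instead of a dedicated calibration toolkit; all names and the toy data are illustrative):

```python
import numpy as np

def train_llr_fusion(scores, labels, iters=2000, lr=0.1):
    # Learn weights a and bias b so that sigmoid(scores @ a + b)
    # approximates P(genuine | matcher scores); log-loss gradient descent.
    n, m = scores.shape
    a, b = np.zeros(m), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(scores @ a + b)))
        g = p - labels                      # gradient of the log-loss
        a -= lr * (scores.T @ g) / n
        b -= lr * g.mean()
    return a, b

# Toy calibration set: two matchers, genuine pairs score higher on average.
rng = np.random.default_rng(1)
genuine = rng.normal(1.0, 0.5, size=(50, 2))
impostor = rng.normal(-1.0, 0.5, size=(50, 2))
S = np.vstack([genuine, impostor])
y = np.concatenate([np.ones(50), np.zeros(50)])
a, b = train_llr_fusion(S, y)
fused = S @ a + b                           # fused log-odds scores
```

The fused output is a calibrated log-likelihood-ratio-like score, so a single threshold can be applied regardless of how many matchers were combined.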
Vitomir Štruc; Jerneja Žganec Gros; Simon Dobrišek; Nikola Pavešić Exploiting representation plurality for robust and efficient face recognition Inproceedings In: Proceedings of the 22nd International Electrotechnical and Computer Science Conference (ERK'13), pp. 121–124, Portorož, Slovenia, 2013. @inproceedings{ERK2013_Struc, title = {Exploiting representation plurality for robust and efficient face recognition}, author = {Vitomir \v{S}truc and Jerneja \v{Z}ganec Gros and Simon Dobri\v{s}ek and Nikola Pave\v{s}i\'{c}}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/ERK2013a.pdf}, year = {2013}, date = {2013-09-01}, booktitle = {Proceedings of the 22nd International Electrotechnical and Computer Science Conference (ERK'13)}, volume = {vol. B}, pages = {121--124}, address = {Portoro\v{z}, Slovenia}, abstract = {The paper introduces a novel approach to face recognition that exploits plurality of representation to achieve robust face recognition. The proposed approach was submitted as a representative of the University of Ljubljana and Alpineon d.o.o. to the 2013 face recognition competition that was held in conjunction with the IAPR International Conference on Biometrics and achieved the best overall recognition results among all competition participants. Here, we describe the basic characteristics of the submitted approach, elaborate on the results of the competition and, most importantly, present some general findings made during our development work that are of relevance to the broader (face recognition) research community.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } The paper introduces a novel approach to face recognition that exploits plurality of representation to achieve robust face recognition. The proposed approach was submitted as a representative of the University of Ljubljana and Alpineon d.o.o. 
to the 2013 face recognition competition that was held in conjunction with the IAPR International Conference on Biometrics and achieved the best overall recognition results among all competition participants. Here, we describe the basic characteristics of the submitted approach, elaborate on the results of the competition and, most importantly, present some general findings made during our development work that are of relevance to the broader (face recognition) research community. |
Janez Križaj; Vitomir Štruc; Simon Dobrišek Combining 3D face representations using region covariance descriptors and statistical models Conference Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition and Workshops (IEEE FG), Workshop on 3D Face Biometrics, IEEE, Shanghai, China, 2013. @conference{FG2013, title = {Combining 3D face representations using region covariance descriptors and statistical models}, author = {Janez Kri\v{z}aj and Vitomir \v{S}truc and Simon Dobri\v{s}ek}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/FG2013.pdf}, year = {2013}, date = {2013-05-01}, booktitle = {Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition and Workshops (IEEE FG), Workshop on 3D Face Biometrics}, publisher = {IEEE}, address = {Shanghai, China}, abstract = {The paper introduces a novel framework for 3D face recognition that capitalizes on region covariance descriptors and Gaussian mixture models. The framework presents an elegant and coherent way of combining multiple facial representations, while simultaneously examining all computed representations at various levels of locality. The framework first computes a number of region covariance matrices/descriptors from different sized regions of several image representations and then adopts the unscented transform to derive low-dimensional feature vectors from the computed descriptors. By doing so, it enables computations in the Euclidean space, and makes Gaussian mixture modeling feasible. In the last step a support vector machine classification scheme is used to make a decision regarding the identity of the modeled input 3D face image. 
The proposed framework exhibits several desirable characteristics, such as an inherent mechanism for data fusion/integration (through the region covariance matrices), the ability to examine the facial images at different levels of locality, and the ability to integrate domain-specific prior knowledge into the modeling procedure. We assess the feasibility of the proposed framework on the Face Recognition Grand Challenge version 2 (FRGCv2) database with highly encouraging results.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } The paper introduces a novel framework for 3D face recognition that capitalizes on region covariance descriptors and Gaussian mixture models. The framework presents an elegant and coherent way of combining multiple facial representations, while simultaneously examining all computed representations at various levels of locality. The framework first computes a number of region covariance matrices/descriptors from different sized regions of several image representations and then adopts the unscented transform to derive low-dimensional feature vectors from the computed descriptors. By doing so, it enables computations in the Euclidean space, and makes Gaussian mixture modeling feasible. In the last step a support vector machine classification scheme is used to make a decision regarding the identity of the modeled input 3D face image. The proposed framework exhibits several desirable characteristics, such as an inherent mechanism for data fusion/integration (through the region covariance matrices), the ability to examine the facial images at different levels of locality, and the ability to integrate domain-specific prior knowledge into the modeling procedure. We assess the feasibility of the proposed framework on the Face Recognition Grand Challenge version 2 (FRGCv2) database with highly encouraging results. |
Simon Dobrišek; Rok Gajšek; France Mihelič; Nikola Pavešić; Vitomir Štruc Towards efficient multi-modal emotion recognition Journal Article In: International Journal of Advanced Robotic Systems, 10 (53), 2013. @article{dobrivsek2013towards, title = {Towards efficient multi-modal emotion recognition}, author = { Simon Dobri\v{s}ek and Rok Gaj\v{s}ek and France Miheli\v{c} and Nikola Pave\v{s}i\'{c} and Vitomir \v{S}truc}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/multimodel-emotion.pdf}, doi = {10.5772/54002}, year = {2013}, date = {2013-01-01}, journal = {International Journal of Advanced Robotic Systems}, volume = {10}, number = {53}, abstract = {The paper presents a multi-modal emotion recognition system exploiting audio and video (i.e., facial expression) information. The system first processes both sources of information individually to produce corresponding matching scores and then combines the computed matching scores to obtain a classification decision. For the video part of the system, a novel approach to emotion recognition, relying on image-set matching, is developed. The proposed approach avoids the need for detecting and tracking specific facial landmarks throughout the given video sequence, which represents a common source of error in video-based emotion recognition systems, and, therefore, adds robustness to the video processing chain. The audio part of the system, on the other hand, relies on utterance-specific Gaussian Mixture Models (GMMs) adapted from a Universal Background Model (UBM) via the maximum a posteriori probability (MAP) estimation. It improves upon the standard UBM-MAP procedure by exploiting gender information when building the utterance-specific GMMs, thus ensuring enhanced emotion recognition performance. Both the uni-modal parts as well as the combined system are assessed on the challenging multi-modal eNTERFACE'05 corpus with highly encouraging results. 
The developed system represents a feasible solution to emotion recognition that can easily be integrated into various systems, such as humanoid robots, smart surveillance systems and the like.}, keywords = {}, pubstate = {published}, tppubtype = {article} } The paper presents a multi-modal emotion recognition system exploiting audio and video (i.e., facial expression) information. The system first processes both sources of information individually to produce corresponding matching scores and then combines the computed matching scores to obtain a classification decision. For the video part of the system, a novel approach to emotion recognition, relying on image-set matching, is developed. The proposed approach avoids the need for detecting and tracking specific facial landmarks throughout the given video sequence, which represents a common source of error in video-based emotion recognition systems, and, therefore, adds robustness to the video processing chain. The audio part of the system, on the other hand, relies on utterance-specific Gaussian Mixture Models (GMMs) adapted from a Universal Background Model (UBM) via the maximum a posteriori probability (MAP) estimation. It improves upon the standard UBM-MAP procedure by exploiting gender information when building the utterance-specific GMMs, thus ensuring enhanced emotion recognition performance. Both the uni-modal parts as well as the combined system are assessed on the challenging multi-modal eNTERFACE'05 corpus with highly encouraging results. The developed system represents a feasible solution to emotion recognition that can easily be integrated into various systems, such as humanoid robots, smart surveillance systems and the like. |
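The UBM-MAP step the audio sub-system builds on can be sketched as means-only relevance MAP adaptation, the classic GMM-UBM recipe: mix each component's data mean with the UBM mean in proportion to how many frames the component explains. Variable names and the relevance factor below are illustrative, not taken from the paper:

```python
import numpy as np

def map_adapt_means(ubm_means, features, responsibilities, r=16.0):
    # Means-only relevance-MAP adaptation of a UBM.
    # responsibilities: (n_frames, K) posterior of each frame per component.
    n_k = responsibilities.sum(axis=0)                 # soft frame counts
    f_k = responsibilities.T @ features                # first-order stats (K, d)
    e_k = f_k / np.maximum(n_k, 1e-10)[:, None]        # per-component data means
    alpha = (n_k / (n_k + r))[:, None]                 # adaptation factors
    return alpha * e_k + (1.0 - alpha) * ubm_means

# Toy check: 1-D frames clustered near +2 and -2, soft responsibilities.
rng = np.random.default_rng(3)
feats = np.vstack([rng.normal(2.0, 0.1, size=(60, 1)),
                   rng.normal(-2.0, 0.1, size=(60, 1))])
resp = np.vstack([np.tile([0.9, 0.1], (60, 1)),
                  np.tile([0.1, 0.9], (60, 1))])
ubm_means = np.zeros((2, 1))                           # UBM means at the origin
adapted = map_adapt_means(ubm_means, feats, resp)
```

Components with few assigned frames stay close to the UBM mean (small alpha), which is what makes the adapted models robust for short utterances.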
Vildana Sulič Kenk; Janez Križaj; Vitomir Štruc; Simon Dobrišek Smart surveillance technologies in border control Journal Article In: European Journal of Law and Technology, 4 (2), 2013. @article{kenk2013smart, title = {Smart surveillance technologies in border control}, author = { Vildana Suli\v{c} Kenk and Janez Kri\v{z}aj and Vitomir \v{S}truc and Simon Dobri\v{s}ek}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/Kenk.pdf}, year = {2013}, date = {2013-01-01}, journal = {European Journal of Law and Technology}, volume = {4}, number = {2}, abstract = {The paper addresses the technical and legal aspects of the existing and forthcoming intelligent ('smart') surveillance technologies that are (or are considered to be) employed in the border control application area. Such technologies provide computerized decision-making support to border control authorities, and are intended to increase the reliability and efficiency of border control measures. However, the question that arises is how effective these technologies are, as well as at what price, economically, socially, and in terms of citizens' rights. The paper provides a brief overview of smart surveillance technologies in border control applications, especially those used for controlling cross-border traffic, discusses possible proportionality issues and privacy risks raised by the increasingly widespread use of such technologies, as well as good/best practices developed in this area. In a broader context, the paper presents the result of the research carried out as part of the SMART (Scalable Measures for Automated Recognition Technologies) project.}, keywords = {}, pubstate = {published}, tppubtype = {article} } The paper addresses the technical and legal aspects of the existing and forthcoming intelligent ('smart') surveillance technologies that are (or are considered to be) employed in the border control application area. 
Such technologies provide computerized decision-making support to border control authorities, and are intended to increase the reliability and efficiency of border control measures. However, the question that arises is how effective these technologies are, as well as at what price, economically, socially, and in terms of citizens' rights. The paper provides a brief overview of smart surveillance technologies in border control applications, especially those used for controlling cross-border traffic, discusses possible proportionality issues and privacy risks raised by the increasingly widespread use of such technologies, as well as good/best practices developed in this area. In a broader context, the paper presents the result of the research carried out as part of the SMART (Scalable Measures for Automated Recognition Technologies) project. |
2012 |
Janez Križaj; Vitomir Štruc; Simon Dobrišek Robust 3D Face Recognition Journal Article In: Electrotechnical Review, 79 (1-2), pp. 1-6, 2012. @article{Krizaj-EV-2012, title = {Robust 3D Face Recognition}, author = {Janez Kri\v{z}aj and Vitomir \v{S}truc and Simon Dobri\v{s}ek}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/KrizajEV.pdf}, year = {2012}, date = {2012-06-01}, journal = {Electrotechnical Review}, volume = {79}, number = {1-2}, pages = {1-6}, abstract = {Face recognition in uncontrolled environments is hindered by variations in illumination, pose, expression and occlusions of faces. Many practical face-recognition systems are affected by these variations. One way to increase the robustness to illumination and pose variations is to use 3D facial images. In this paper, 3D face-recognition systems are presented. Their structure and operation are described. The robustness of such systems to variations in uncontrolled environments is emphasized. We present some preliminary results of a system developed in our laboratory.}, keywords = {}, pubstate = {published}, tppubtype = {article} } Face recognition in uncontrolled environments is hindered by variations in illumination, pose, expression and occlusions of faces. Many practical face-recognition systems are affected by these variations. One way to increase the robustness to illumination and pose variations is to use 3D facial images. In this paper, 3D face-recognition systems are presented. Their structure and operation are described. The robustness of such systems to variations in uncontrolled environments is emphasized. We present some preliminary results of a system developed in our laboratory. |
Janez Križaj; Vitomir Štruc; Simon Dobrišek Towards robust 3D face verification using Gaussian mixture models Journal Article In: International Journal of Advanced Robotic Systems, 9 , 2012. @article{krizaj2012towards, title = {Towards robust 3D face verification using Gaussian mixture models}, author = { Janez Kri\v{z}aj and Vitomir \v{S}truc and Simon Dobri\v{s}ek}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/IntechJanez-1.pdf}, doi = {10.5772/52200}, year = {2012}, date = {2012-01-01}, journal = {International Journal of Advanced Robotic Systems}, volume = {9}, publisher = {InTech}, abstract = {This paper focuses on the use of Gaussian Mixture models (GMM) for 3D face verification. A special interest is taken in practical aspects of 3D face verification systems, where all steps of the verification procedure need to be automated and no meta-data, such as pre-annotated eye/nose/mouth positions, is available to the system. In such settings the performance of the verification system correlates heavily with the performance of the employed alignment (i.e., geometric normalization) procedure. We show that popular holistic as well as local recognition techniques, such as principal component analysis (PCA), or Scale-invariant feature transform (SIFT)-based methods considerably deteriorate in their performance when an “imperfect” geometric normalization procedure is used to align the 3D face scans and that in these situations GMMs should be preferred. Moreover, several possibilities to improve the performance and robustness of the classical GMM framework are presented and evaluated: i) explicit inclusion of spatial information, during the GMM construction procedure, ii) implicit inclusion of spatial information during the GMM construction procedure and iii) on-line evaluation and possible rejection of local feature vectors based on their likelihood. 
We successfully demonstrate the feasibility of the proposed modifications on the Face Recognition Grand Challenge data set.}, keywords = {}, pubstate = {published}, tppubtype = {article} } This paper focuses on the use of Gaussian Mixture models (GMM) for 3D face verification. A special interest is taken in practical aspects of 3D face verification systems, where all steps of the verification procedure need to be automated and no meta-data, such as pre-annotated eye/nose/mouth positions, is available to the system. In such settings the performance of the verification system correlates heavily with the performance of the employed alignment (i.e., geometric normalization) procedure. We show that popular holistic as well as local recognition techniques, such as principal component analysis (PCA), or Scale-invariant feature transform (SIFT)-based methods considerably deteriorate in their performance when an “imperfect” geometric normalization procedure is used to align the 3D face scans and that in these situations GMMs should be preferred. Moreover, several possibilities to improve the performance and robustness of the classical GMM framework are presented and evaluated: i) explicit inclusion of spatial information, during the GMM construction procedure, ii) implicit inclusion of spatial information during the GMM construction procedure and iii) on-line evaluation and possible rejection of local feature vectors based on their likelihood. We successfully demonstrate the feasibility of the proposed modifications on the Face Recognition Grand Challenge data set. |
2010 |
Vitomir Štruc; Simon Dobrišek; Nikola Pavešić Confidence Weighted Subspace Projection Techniques for Robust Face Recognition in the Presence of Partial Occlusions Conference Proceedings of the International Conference on Pattern Recognition (ICPR'10), Istanbul, Turkey, 2010. @conference{ICPR_Struc_2010, title = {Confidence Weighted Subspace Projection Techniques for Robust Face Recognition in the Presence of Partial Occlusions}, author = {Vitomir \v{S}truc and Simon Dobri\v{s}ek and Nikola Pave\v{s}i\'{c}}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/ICPR2010_CW.pdf}, year = {2010}, date = {2010-01-01}, booktitle = {Proceedings of the International Conference on Pattern Recognition (ICPR'10)}, pages = {1334-1338}, address = {Istanbul, Turkey}, keywords = {}, pubstate = {published}, tppubtype = {conference} } |
2009 |
Rok Gajšek; Vitomir Štruc; Simon Dobrišek; France Mihelič Emotion recognition using linear transformations in combination with video Conference Speech and intelligence: proceedings of Interspeech 2009, Brighton, UK, 2009. @conference{InterSp2009, title = {Emotion recognition using linear transformations in combination with video}, author = {Rok Gaj\v{s}ek and Vitomir \v{S}truc and Simon Dobri\v{s}ek and France Miheli\v{c}}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/InSP.pdf}, year = {2009}, date = {2009-09-01}, booktitle = {Speech and intelligence: proceedings of Interspeech 2009}, pages = {1967-1970}, address = {Brighton, UK}, abstract = {The paper discusses the usage of linear transformations of Hidden Markov Models, normally employed for speaker and environment adaptation, as a way of extracting the emotional components from the speech. A constrained version of Maximum Likelihood Linear Regression (CMLLR) transformation is used as a feature for classification of normal or aroused emotional state. We present a procedure of incrementally building a set of speaker-independent acoustic models that are used to estimate the CMLLR transformations for emotion classification. An audio-video database of spontaneous emotions (AvID) is briefly presented since it forms the basis for the evaluation of the proposed method. Emotion classification using the video part of the database is also described and the added value of combining the visual information with the audio features is shown.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } The paper discusses the usage of linear transformations of Hidden Markov Models, normally employed for speaker and environment adaptation, as a way of extracting the emotional components from the speech. A constrained version of Maximum Likelihood Linear Regression (CMLLR) transformation is used as a feature for classification of normal or aroused emotional state. 
We present a procedure of incrementally building a set of speaker-independent acoustic models that are used to estimate the CMLLR transformations for emotion classification. An audio-video database of spontaneous emotions (AvID) is briefly presented since it forms the basis for the evaluation of the proposed method. Emotion classification using the video part of the database is also described and the added value of combining the visual information with the audio features is shown. |
Rok Gajšek; Vitomir Štruc; Simon Dobrišek; Janez Žibert; France Mihelič; Nikola Pavešić Combining audio and video for detection of spontaneous emotions Conference Biometric ID management and multimodal communication, 5707 , Lecture Notes in Computer Science Springer-Verlag, Berlin, Heidelberg, 2009. @conference{BioID_Multi2009b, title = {Combining audio and video for detection of spontaneous emotions}, author = {Rok Gaj\v{s}ek and Vitomir \v{S}truc and Simon Dobri\v{s}ek and Janez \v{Z}ibert and France Miheli\v{c} and Nikola Pave\v{s}i\'{c}}, url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/BioID_R.pdf}, year = {2009}, date = {2009-01-01}, booktitle = {Biometric ID management and multimodal communication}, volume = {5707}, pages = {114-121}, publisher = {Springer-Verlag}, address = {Berlin, Heidelberg}, series = {Lecture Notes in Computer Science}, abstract = {The paper presents our initial attempts at building an audio-video emotion recognition system. Both the audio and video sub-systems are discussed, and a description of the database of spontaneous emotions is given. The task of labelling the recordings from the database according to different emotions is discussed and the measured agreement between multiple annotators is presented. Instead of focusing on the prosody in audio emotion recognition, we evaluate the possibility of using linear transformations (CMLLR) as features. The classification results from audio and video sub-systems are combined using sum rule fusion and the increase in recognition results, when using both modalities, is presented.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } The paper presents our initial attempts at building an audio-video emotion recognition system. Both the audio and video sub-systems are discussed, and a description of the database of spontaneous emotions is given. 
The task of labelling the recordings from the database according to different emotions is discussed and the measured agreement between multiple annotators is presented. Instead of focusing on the prosody in audio emotion recognition, we evaluate the possibility of using linear transformations (CMLLR) as features. The classification results from audio and video sub-systems are combined using sum rule fusion and the increase in recognition results, when using both modalities, is presented. |
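Sum-rule fusion, as used in the entry above, is a one-liner once both sub-systems emit comparable per-class scores (e.g., posteriors): the fused score of a class is the (weighted) sum of the modality scores. A minimal sketch with made-up scores; class names and the equal weighting are illustrative:

```python
def sum_rule_fusion(audio_scores, video_scores, w_audio=0.5):
    # Weighted sum of per-class scores from two modalities; scores are
    # assumed normalized to a common range (e.g., class posteriors).
    return {c: w_audio * audio_scores[c] + (1.0 - w_audio) * video_scores[c]
            for c in audio_scores}

# Toy per-class posteriors from the two sub-systems.
audio = {"neutral": 0.6, "aroused": 0.4}
video = {"neutral": 0.3, "aroused": 0.7}
fused = sum_rule_fusion(audio, video)
decision = max(fused, key=fused.get)   # pick the highest fused score
```

Here the video evidence outweighs the audio evidence, so the fused decision flips to "aroused" even though the audio sub-system alone preferred "neutral".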
Objave
2016 |
Report on the BTAS 2016 Video Person Recognition Evaluation Conference Proceedings of the IEEE International Conference on Biometrics: Theory, Applications ans Systems (BTAS), IEEE, 2016. |
Facial Landmark Localization from 3D Images Inproceedings In: Proceedings of the Electrotechnical and Computer Science Conference (ERK), Portorož, Slovenia, 2016. |
Assessment of the Google Speech Application Programming Interface for Automatic Slovenian Speech Recognition Inproceedings In: Jezikovne Tehnologije in Digitalna Humanistika, 2016. |
Deep pair-wise similarity learning for face recognition Conference 4th International Workshop on Biometrics and Forensics (IWBF), IEEE, 2016. |
2015 |
Modest face recognition Conference Proceedings of the International Workshop on Biometrics and Forensics (IWBF), IEEE, 2015. |
Report on the FG 2015 video person recognition evaluation Conference 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (IEEE FG), 1, IEEE, 2015. |
Speaker de-identification using diphone recognition and speech synthesis Conference 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (IEEE FG): DeID 2015, 4, IEEE, 2015. |
Face recognition in the wild with the Probabilistic Gabor-Fisher Classifier Conference 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (IEEE FG): BWild 2015, 2, IEEE, 2015. |
2014 |
SIFT vs. FREAK: Assessing the usefulness of two keypoint descriptors for 3D face verification Inproceedings In: 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1336–1341, Mipro, Opatija, Croatia, 2014. |
Incorporating Duration Information into I-Vector-Based Speaker-Recognition Systems Conference Proceedings of Odyssey: The Speaker and Language Recognition Workshop, 2014. |
2013 |
Robust 3D face recognition using adapted statistical models Inproceedings In: Proceedings of the Electrotechnical and Computer Science Conference (ERK'13), 2013. |
Zlivanje informacij za zanesljivo in robustno razpoznavanje obrazov [Information fusion for reliable and robust face recognition] Journal Article In: Electrotechnical Review, 80 (3), pp. 1-12, 2013. |
Exploiting representation plurality for robust and efficient face recognition Inproceedings In: Proceedings of the 22nd International Electrotechnical and Computer Science Conference (ERK'13), pp. 121–124, Portorož, Slovenia, 2013. |
Combining 3D face representations using region covariance descriptors and statistical models Conference Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition and Workshops (IEEE FG), Workshop on 3D Face Biometrics, IEEE, Shanghai, China, 2013. |
Towards efficient multi-modal emotion recognition Journal Article In: International Journal of Advanced Robotic Systems, 10 (53), 2013. |
Smart surveillance technologies in border control Journal Article In: European Journal of Law and Technology, 4 (2), 2013. |
2012 |
Robust 3D Face Recognition Journal Article In: Electrotechnical Review, 79 (1-2), pp. 1-6, 2012. |
Towards robust 3D face verification using Gaussian mixture models Journal Article In: International Journal of Advanced Robotic Systems, 9, 2012. |
2010 |
Proceedings of the International Conference on Pattern Recognition (ICPR'10), Istanbul, Turkey, 2010. |
2009 |
Emotion recognition using linear transformations in combination with video Conference Speech and intelligence: proceedings of Interspeech 2009, Brighton, UK, 2009. |
Combining audio and video for detection of spontaneous emotions Conference Biometric ID management and multimodal communication, 5707, Lecture Notes in Computer Science Springer-Verlag, Berlin, Heidelberg, 2009. |