Speaker de-identification using diphone recognition and speech synthesis

Tadej Justin, Vitomir Štruc, Simon Dobrišek, Boštjan Vesnicer, Ivo Ipšić, France Mihelič (2015): Speaker de-identification using diphone recognition and speech synthesis. 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (IEEE FG): DeID 2015, 4, IEEE 2015.

Abstract

The paper addresses the problem of speaker (or voice) de-identification by presenting a novel approach for concealing the identity of speakers in their speech. The proposed technique first recognizes the input speech with a diphone recognition system and then transforms the obtained phonetic transcription into the speech of another speaker with a speech synthesis system. Because a Diphone RecOgnition step and a sPeech SYnthesis step are used during de-identification, we refer to the developed technique as DROPSY. With this approach, the acoustic models of the recognition and synthesis modules are completely independent of each other, which ensures the highest level of input speaker de-identification. The proposed DROPSY-based de-identification approach is language dependent, text independent, and capable of running in real time due to the relatively simple computing methods used. When designing speaker de-identification technology, two requirements are typically imposed on the de-identification techniques: i) it should not be possible to establish the identity of the speakers based on the de-identified speech, and ii) the processed speech should still sound natural and be intelligible. This paper, therefore, implements the proposed DROPSY-based approach with two different speech synthesis techniques (i.e., the HMM-based and the diphone TD-PSOLA-based technique). The obtained de-identified speech is evaluated for intelligibility and in speaker verification experiments with a state-of-the-art (i-vector/PLDA) speaker recognition system. The comparison of both speech synthesis modules integrated in the proposed method reveals that both can efficiently de-identify the input speakers while still producing intelligible speech.

BibTeX

@conference{justin2015speaker,
title = {Speaker de-identification using diphone recognition and speech synthesis},
author = {Tadej Justin and Vitomir \v{S}truc and Simon Dobri\v{s}ek and Bo\v{s}tjan Vesnicer and Ivo Ip\v{s}i\'{c} and France Miheli\v{c}},
url = {http://luks.fe.uni-lj.si/nluks/wp-content/uploads/2016/09/Deid2015.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (IEEE FG): DeID 2015},
volume = {4},
pages = {1--7},
organization = {IEEE},
abstract = {The paper addresses the problem of speaker (or voice) de-identification by presenting a novel approach for concealing the identity of speakers in their speech. The proposed technique first recognizes the input speech with a diphone recognition system and then transforms the obtained phonetic transcription into the speech of another speaker with a speech synthesis system. Because a Diphone RecOgnition step and a sPeech SYnthesis step are used during de-identification, we refer to the developed technique as DROPSY. With this approach, the acoustic models of the recognition and synthesis modules are completely independent of each other, which ensures the highest level of input speaker de-identification. The proposed DROPSY-based de-identification approach is language dependent, text independent, and capable of running in real time due to the relatively simple computing methods used. When designing speaker de-identification technology, two requirements are typically imposed on the de-identification techniques: i) it should not be possible to establish the identity of the speakers based on the de-identified speech, and ii) the processed speech should still sound natural and be intelligible. This paper, therefore, implements the proposed DROPSY-based approach with two different speech synthesis techniques (i.e., the HMM-based and the diphone TD-PSOLA-based technique). The obtained de-identified speech is evaluated for intelligibility and in speaker verification experiments with a state-of-the-art (i-vector/PLDA) speaker recognition system. The comparison of both speech synthesis modules integrated in the proposed method reveals that both can efficiently de-identify the input speakers while still producing intelligible speech.},
keywords = {DEID, FG, speech deidentification, speech recognition, speech synthesis, speech technologies},
pubstate = {published},
tppubtype = {conference}
}