Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis

Cabañas-Molero, Pablo Antonio; Lucena, Manuel; Fuertes, José Manuel; Vera-Candeas, Pedro; Ruiz-Reyes, Nicolás

dc.contributor.author	Cabañas-Molero, Pablo Antonio
dc.contributor.author	Lucena, Manuel
dc.contributor.author	Fuertes, José Manuel
dc.contributor.author	Vera-Candeas, Pedro
dc.contributor.author	Ruiz-Reyes, Nicolás
dc.date.accessioned	2024-02-07T00:37:51Z
dc.date.available	2024-02-07T00:37:51Z
dc.date.issued	2018-04-11
dc.description.abstract	Speaker diarization is traditionally defined as the problem of determining “who speaks when” given an audio or video stream. This is an important task in many meeting-room applications, including automatic transcription of conversations, camera steering, and content summarization. When the room is equipped with microphone arrays and cameras, speakers can be distinguished by their location and the problem can be addressed through localization techniques. This article proposes a multimodal speaker diarization system for meeting environments based on a modified SRP-PHAT function evaluated on space volumes rather than discrete points. In our system, this function is used in combination with a circular array, enabling audio-based localization through the selection of local maxima. Voicing detection is used to detect speech frames, whereas video analysis is introduced to aid the decision when users move or speak simultaneously. The approach is evaluated on the well-known AMI dataset, with approximately 100 hours of realistic meeting recordings, and shows an average diarization error rate of 21–25%.	es_ES
dc.description.sponsorship	This work was supported by the Andalusian Economy and Knowledge Council under project 2010-TIC6762, and the Spanish Ministry of Economy and Competitiveness under project TEC2015-67387-C4-2-R.	es_ES
dc.identifier.citation	Cabañas-Molero, P., Lucena, M., Fuertes, J.M. et al. Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis. Multimed Tools Appl 77, 27685–27707 (2018). https://doi.org/10.1007/s11042-018-5944-2	es_ES
dc.identifier.issn	1380-7501	es_ES
dc.identifier.other	10.1007/s11042-018-5944-2	es_ES
dc.identifier.uri	https://hdl.handle.net/10953/2188
dc.language.iso	eng	es_ES
dc.publisher	Springer	es_ES
dc.relation.ispartof	Multimedia Tools and Applications 2018; 77, 27685–27707	es_ES
dc.rights	CC0 1.0 Universal	*
dc.rights.accessRights	info:eu-repo/semantics/openAccess	es_ES
dc.rights.uri	http://creativecommons.org/publicdomain/zero/1.0/	*
dc.subject	Speaker diarization	es_ES
dc.subject	Meeting rooms	es_ES
dc.subject	SRP-PHAT	es_ES
dc.subject	Multimodal processing	es_ES
dc.subject.udc	621.39	es_ES
dc.title	Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis	es_ES
dc.type	info:eu-repo/semantics/article	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES