This skill is even more difficult for an artificial intelligence (AI) to master, but Google thinks it now has a system adept enough for real-world applications, one that can operate in real time.
Identifying the voice of a speaker it has already heard isn't so hard for an AI; we're able to train assistants such as Alexa and Siri to recognize our voices. What has proven difficult is getting an AI to identify a voice it hasn't been trained on the moment that voice starts speaking.
Google AI Research Scientist Chong Wang published a blog post detailing how his team was able to create an AI better at speaker diarization — that’s the process of splitting an audio clip featuring more than one speaker into segments based on the person talking at any given moment — than previous attempts.
Wang’s explanation is highly technical, but the crux of it is this: while most speaker diarization systems rely on clustering, a machine learning technique that groups similar data points together, the Google team’s system uses recurrent neural networks, a type of machine learning model that processes data points in sequence, letting each decision draw on what came before.
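To make that distinction concrete, here is a minimal sketch (in PyTorch, not Google's actual code) of the sequential idea: a recurrent network reads per-segment voice embeddings in order and labels each segment with a speaker, so every prediction can use the context of everything heard so far. The layer sizes, names, and the fixed four-speaker cap are illustrative assumptions; the published system is far more sophisticated and is not limited to a fixed number of speakers.

```python
# Illustrative sketch of RNN-based speaker labeling (not Google's system).
import torch
import torch.nn as nn

EMBEDDING_DIM = 256   # size of each per-segment voice embedding (assumed)
HIDDEN_DIM = 128      # size of the recurrent state (assumed)
MAX_SPEAKERS = 4      # toy cap; the real system handles an unbounded number

class SequentialDiarizer(nn.Module):
    def __init__(self):
        super().__init__()
        # The GRU carries a running summary of the conversation so far.
        self.rnn = nn.GRU(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True)
        # A linear head scores which speaker produced each segment.
        self.classifier = nn.Linear(HIDDEN_DIM, MAX_SPEAKERS)

    def forward(self, segment_embeddings):
        # segment_embeddings: (batch, num_segments, EMBEDDING_DIM)
        hidden_states, _ = self.rnn(segment_embeddings)
        # Returns per-segment speaker scores: (batch, num_segments, MAX_SPEAKERS)
        return self.classifier(hidden_states)

if __name__ == "__main__":
    model = SequentialDiarizer()
    # Fake audio: one clip split into 10 segments, each already embedded.
    clip = torch.randn(1, 10, EMBEDDING_DIM)
    speaker_scores = model(clip)
    print(speaker_scores.argmax(dim=-1))  # predicted speaker per segment
```

A clustering approach, by contrast, would group all the segment embeddings after the fact, without the running, segment-by-segment context the recurrent model maintains.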
Using this method, the Google team was able to create an AI capable of speaker diarization with an error rate of just 7.6 percent. The team is now focused on improving the system further, and it has also posted its algorithms on GitHub, meaning anyone can download the files for their own research.
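For a sense of what an error rate means here, below is a simplified, hypothetical scoring function: it treats a clip as a list of equal-length segments, maps the system's arbitrary speaker labels onto the reference labels as favorably as possible, and counts mismatches. The metric Google reports is time-weighted and accounts for overlap and more, so this is a sketch of the idea rather than the actual evaluation.

```python
from itertools import permutations

def simple_diarization_error(reference, hypothesis):
    """Fraction of segments whose predicted speaker disagrees with the
    reference, after choosing the most favorable mapping between the
    system's arbitrary labels and the reference labels."""
    assert len(reference) == len(hypothesis)
    ref_speakers = sorted(set(reference))
    hyp_speakers = sorted(set(hypothesis))
    best_errors = len(reference)
    # The system's labels ("x", "y", ...) are arbitrary IDs, so try every
    # assignment of them onto reference speakers and keep the best score.
    # (Assumes the system does not use more labels than the reference.)
    for assignment in permutations(ref_speakers, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, assignment))
        errors = sum(1 for ref, hyp in zip(reference, hypothesis)
                     if mapping[hyp] != ref)
        best_errors = min(best_errors, errors)
    return best_errors / len(reference)

if __name__ == "__main__":
    reference  = ["A", "A", "B", "B", "A", "C"]   # who actually spoke
    hypothesis = ["x", "x", "y", "y", "x", "y"]   # system's arbitrary labels
    print(f"Error rate: {simple_diarization_error(reference, hypothesis):.2f}")
```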
Eventually, we could end up with an AI capable of near-flawless real-time speaker diarization, which could improve how we caption live events, transcribe doctor-patient conversations, and more.