The thought that our gadgets are spying on us isn’t a pleasant one, which is why a group of Columbia University researchers have created what they call “neural voice camouflage.”
This technology won’t necessarily stop a human listener from understanding someone if they’re snooping (you can give recordings a listen and view the source code at the link above). Rather, this is a system designed to stop devices equipped with microphones from transmitting automatically transcribed recordings. It’s quiet – just above a whisper – but can generate sound specifically modeled to obscure speech in real time so that conversations can’t be transcribed by software and acted upon or the text sent back to some remote server for processing.
The real-time capability is where the project seemingly has made a breakthrough: while speech-masking algorithms aren’t new, they’ve typically needed to hear an entire recording to obscure it. That makes them essentially useless for protecting real-world conversations in real time.
“A key technical challenge to achieving this was to make it all work fast enough,” said Carl Vondrick, assistant professor of computer science at the US university and one of the researchers on the project.
According to Vondrick, the algorithm his team developed can stop a microphone-equipped AI model from interpreting speech 80 percent of the time, all without having to hear a whole recording, or knowing anything about the gadget doing the listening.
Mia Chiquier, a PhD candidate and the lead author of the study, describes the speech-masking algorithm as a “predictive attack” because it uses the previous two seconds of audio to forecast what’s likely to be said next, and then generates sound that will disrupt what it predicts, or similar sounding words.
“This attack will learn to ‘hedge the bet’ by finding a single, minimal pattern that robustly obstructs all upcoming possibilities,” says the project’s paper, which will be presented at the International Conference on Learning Representations next week.
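To make the “predictive” part concrete, here is a minimal Python sketch of the streaming loop described above. It is not the researchers’ code: the function names, the sample rate, and the chunking are assumptions for illustration, and `placeholder_predictor` merely emits quiet random noise where the real system would run a trained neural network. What it does show is the constraint that matters: the masking sound for each chunk of audio is computed only from the previous two seconds, never from the chunk it is meant to cover.

```python
from collections import deque
import random

SAMPLE_RATE = 16_000      # assumed sample rate in Hz (not from the article)
CONTEXT_SECONDS = 2.0     # the "previous two seconds" the article mentions


def placeholder_predictor(context, n_out, rng):
    """Hypothetical stand-in for the trained model.

    Returns n_out samples of quiet noise, capped at 10 percent of the
    recent loudness (RMS) of the context. A real predictive attack
    would output a learned pattern instead of random noise.
    """
    if not context:
        return [0.0] * n_out
    rms = (sum(s * s for s in context) / len(context)) ** 0.5
    level = 0.1 * rms
    return [rng.uniform(-level, level) for _ in range(n_out)]


def predictive_masks(chunks, sample_rate=SAMPLE_RATE, rng=None):
    """Yield one masking signal per incoming chunk of audio samples.

    Each mask is built from past audio only, so in a live setting it
    could be ready to play while the matching chunk is being spoken.
    """
    rng = rng or random.Random(0)
    # Rolling buffer that keeps just the most recent two seconds.
    context = deque(maxlen=int(CONTEXT_SECONDS * sample_rate))
    for chunk in chunks:
        # Only the chunk's length is used here, not its contents.
        yield placeholder_predictor(list(context), len(chunk), rng)
        # The chunk becomes "the past" only after its mask is emitted.
        context.extend(chunk)


if __name__ == "__main__":
    audio = [[0.5] * 8, [0.25] * 8, [1.0] * 8]   # toy chunks of samples
    for mask in predictive_masks(audio, sample_rate=4):
        print(max(abs(m) for m in mask))
```

The first chunk gets a silent mask because nothing has been heard yet. The bounded `deque` stands in for the fixed two-second window, which keeps the per-step work constant however long the conversation runs.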
Of course, a gadget could record the audio of some chatter and send that off for, say, a human to review; this project is focused on the live, real-time thwarting of automatic speech recognition and transcription.
Testing its method, the team found that it worked in real-world conditions in a variety of rooms with different geometries. Chiquier said the model works on a majority of English vocabulary, and that the team is working to extend it to other languages.
The research paper noted the project was driven entirely by the ethical considerations of ubiquitous speech-recognition technology, which University of Pennsylvania professor and machine learning researcher Jianbo Shi said needs to be reframed in a way similar to Chiquier and Vondrick’s study.
“Their work makes many of us think in the following direction: ask not what ethical AI can do for us, but what we can do for ethical AI … as a community, we need to consciously think about the human and societal impact of the AI technology we develop from the earliest research design phase,” Shi said. ®