Researchers at Carnegie Mellon University have created a machine learning model that detects the direction in which a voice is projected, rather than merely where it came from.
Current smart speakers and voice-activated devices rely on activation keywords to start listening and then respond to commands. While largely effective, this approach creates problems when multiple devices use the same keyword, or when someone says the keyword in normal conversation.
The CMU researchers set out to solve this with machine learning, tackling the problem of addressability: helping devices know whether a command was directed at them specifically.
The research aimed to recreate elements of human-to-human communication, specifically how people address a particular person in a crowded room. If computers can learn to recognize directed speech, controlling and interacting with devices could become as natural as talking to another person.
“In this research, we explored the use of speech as a directional communication channel,” write researchers Karan Ahuja, Andy Kong, Mayank Goel and Chris Harrison. “In addition to receiving and processing spoken content, we propose that devices also infer the Direction of Voice (DoV). Note this is different from Direction of Arrival (DoA) algorithms, which calculate from where a voice originated. In contrast, DoV calculates the direction along which a voice was projected.
“Such DoV estimation innately enables voice commands with addressability, in a similar way to gaze, but without the need for cameras. This allows users to easily and naturally interact with diverse ecosystems of voice-enabled devices, whereas today’s voice interactions suffer from multi-device confusion.”
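To make the DoV-versus-DoA distinction concrete, here is a minimal sketch of one acoustic intuition such systems can exploit: speech projected away from a microphone loses more high-frequency energy than speech projected toward it. Everything in this example is a hypothetical illustration, not the CMU team's actual features or model; the band edges, threshold, and function names are assumptions, and the real research trains a machine learning classifier rather than using a hand-set cutoff.

```python
# Illustrative sketch only: a toy "facing vs. not facing" check built on one
# acoustic intuition behind DoV work -- speech projected away from a mic loses
# more high-frequency energy than speech projected toward it. The band edges,
# threshold, and feature choice are assumptions for illustration, not the
# CMU authors' actual features or model.
import numpy as np

def band_energy(signal: np.ndarray, rate: int, lo: float, hi: float) -> float:
    """Total spectral energy of `signal` between `lo` and `hi` Hz."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate)
    mask = (freqs >= lo) & (freqs < hi)
    return float(spectrum[mask].sum())

def facing_score(signal: np.ndarray, rate: int = 16_000) -> float:
    """Ratio of high-band to low-band energy; higher suggests the speaker
    is facing the microphone (a hypothetical feature, not the paper's)."""
    low = band_energy(signal, rate, 100.0, 1_000.0)
    high = band_energy(signal, rate, 4_000.0, 8_000.0)
    return high / (low + 1e-12)  # guard against division by zero on silence

def is_addressed(signal: np.ndarray, rate: int = 16_000,
                 threshold: float = 0.05) -> bool:
    """Classify a command as directed at this device. In practice the
    decision boundary would be learned from labeled recordings."""
    return facing_score(signal, rate) >= threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rate = 16_000
    t = np.linspace(0, 1.0, rate, endpoint=False)
    # Synthetic stand-ins: "facing" speech keeps its high-frequency content,
    # while "averted" speech has it strongly attenuated.
    facing = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 5_000 * t)
    averted = np.sin(2 * np.pi * 300 * t) + 0.05 * np.sin(2 * np.pi * 5_000 * t)
    print("facing  ->", is_addressed(facing + 0.01 * rng.standard_normal(rate), rate))
    print("averted ->", is_addressed(averted + 0.01 * rng.standard_normal(rate), rate))
```

A single band-energy ratio is far too crude for real rooms, where reflections and speaker distance also shape the spectrum, which is why the researchers turn to a trained model rather than a fixed rule.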
This research is an important development and could have a profound impact on how humans interact with everything from smart speakers to more advanced AIs.