For generations, we have dreamed of machines with artificial intelligence with which we can have real conversations but, despite amazing technological advances, such devices seem some way off. Now researchers at Cambridge are changing the picture, by remodelling the essence of spoken dialogue systems.
"We want to develop systems with which you can have a proper conversation"
—Professor Steve Young
Following the death of Steve Jobs, one of many videos that started to circulate widely on the internet showed the Apple co-founder at a watershed moment, launching the very first Macintosh in 1984. After demonstrating the machine's facility for word processing, design and even animation, the climax came when Macintosh literally announced itself to the world, talking to an amazed audience with synthetic speech before handing back to Jobs and announcing that it was going to "sit back and listen". A beaming Jobs received a five-minute ovation.
How far we seem to have travelled. Modern smartphones are pocket computers that listen to us using speech recognition software, and owners of an Apple iPhone 4S can ask their device about the weather, or tell it to text a friend. Unlike the early Macintosh, this is no slick gimmick using pre-programmed speech on a floppy disk. Machines can listen to us, interpret our words, and respond.
Yet in a sense we have also come less distance than we hoped. An historian of science might argue that the self-aware illusion of intelligent speech that Jobs created back in 1984 met with euphoria because of a vision that is more science fiction than fact. Computing pioneers in the mid-to-late 20th century imagined conversations with far more sophisticated artificial intelligence in the future. They dreamed less of the iPhone 4S, more of HAL from 2001: A Space Odyssey.
This type of interface remains a distant prospect. Siri, the speech recognition software used in the iPhone, is a system we talk to, but not one with which we converse. Achieving that remains a complex mathematical challenge, and each breakthrough tends to throw up new problems. In this demanding field, researchers at Cambridge have traditionally been leaders. Today, the University's Dialogue Systems Group, in the Department of Engineering, are making more advances than most.
"Siri is a sort of personal assistant," Professor Steve Young, who leads the group, said. "If you ask it a question, it comes back with an answer, but after that you more or less have to start again. We want to develop systems with which you can have a proper conversation."
Such devices are likely to become more necessary over time. The amount of information on the internet is rapidly growing and, before long, it will take more than question-answer interfaces to cut through it. We need systems that are attuned to our needs - in short, we need computers that discuss things.
Young's group, along with an international team of collaborators, are developing one such spoken dialogue system (or SDS), in a European Union (EU) project called PARLANCE. As with some of their earlier work, this is a project which involves statistically modelling a system that talks to humans and learns as it goes. Fundamentally, the idea is not dissimilar to teaching a child new vocabulary, and the shifting set of ideas the words may represent.
Made marketable, PARLANCE would be far more three-dimensional than current systems. Where an existing SDS can, for instance, help house-hunters find properties for sale in a given town, PARLANCE would be able to process a request for a three-bedroomed house, with two bathrooms, near a good school and within walking distance of the local supermarket. Users would be able to ask it for one of these attributes, then add more to refine their results.
Creating this, however, requires a reconception of how such systems work. A 'cognitive' SDS like PARLANCE has to be able to model uncertainty, or cope with the fact that humans rarely mean exactly what they say. No current SDS is able to handle this, because its modelling is too simple. In existing systems, speech is converted into data, then given to a 'dialogue manager', which tests the data's assorted attributes against an internal database of pre-programmed information, looking for what it thinks is an appropriate response.
"All the systems out there do this on the basis of pre-written programs," Young explained. "Essentially, the developer programs the system with a flow chart of possible conversation routes. This is very labour-intensive, and also very fragile. The user can very easily end up in the wrong bit of the flow chart altogether."
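To make the fragility Young describes concrete, here is a minimal sketch of a hand-written flow chart in Python. Everything in it is an illustrative assumption: the state names, the keywords and the `next_state` helper are hypothetical and are not taken from any real SDS.

```python
# Hypothetical hand-written dialogue flow: each state maps recognised
# keywords to the next state. All names here are illustrative only.
FLOW = {
    "start": {"restaurant": "ask_cuisine", "house": "ask_bedrooms"},
    "ask_cuisine": {"chinese": "search", "italian": "search"},
    "ask_bedrooms": {"two": "search", "three": "search"},
}


def next_state(state: str, utterance: str) -> str:
    """Follow the first matching keyword, or fall into a dead-end state."""
    words = utterance.lower().split()
    for keyword, target in FLOW.get(state, {}).items():
        if keyword in words:
            return target
    # Anything the developer did not anticipate ends up here.
    return "not_understood"
```

With this table, `next_state("start", "I want a restaurant")` moves to `"ask_cuisine"`, but a perfectly reasonable reply such as `next_state("ask_cuisine", "somewhere cheap")` falls straight into `"not_understood"`, because nobody wrote a route for it.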
PARLANCE is different because, unlike a typical SDS, it refines its responses with experience. Critically, it takes into account not just the last thing its user said, but its overall assumptions about their intentions, their earlier questions, and its experiences from previous conversations. This combined knowledge is merged into a 'belief state' - the system's overall, shifting grasp of what is going on.
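One simple way to picture a belief state is as a probability distribution over what the user might want, updated a little after every turn. The Python sketch below uses a basic Bayesian-style update and is an assumption for illustration only: the article does not describe PARLANCE's actual mathematics, and the `update_belief` function, the `floor` parameter and the confidence scores are all hypothetical.

```python
def update_belief(belief, observation, floor=0.05):
    """Multiply each prior by this turn's evidence, then renormalise.

    belief: dict mapping a possible user goal to its current probability.
    observation: dict mapping goals to a (hypothetical) recogniser
        confidence for this turn.
    floor: score given to goals the recogniser did not mention, so that
        no hypothesis is ever ruled out completely.
    """
    unnormalised = {
        goal: prior * max(observation.get(goal, 0.0), floor)
        for goal, prior in belief.items()
    }
    total = sum(unnormalised.values())
    return {goal: score / total for goal, score in unnormalised.items()}


belief = {"chinese": 1 / 3, "italian": 1 / 3, "indian": 1 / 3}
# Turn 1: a noisy recogniser slightly prefers "indian" over "italian".
belief = update_belief(belief, {"indian": 0.5, "italian": 0.4})
# Turn 2: the user repeats the request and "italian" is heard clearly.
belief = update_belief(belief, {"italian": 0.8, "indian": 0.1})
best_guess = max(belief, key=belief.get)
```

After the second turn the system's best guess is `"italian"`, even though the first turn pointed the other way. The earlier evidence is weighed rather than discarded, which is the sense in which the system's grasp of the conversation is overall and shifting.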
Underlying this is an approach called reinforcement learning. The system's decision processes are continually refined depending on whether it receives positive or negative feedback from users. A positive score, for a correct response that gives the user exactly what they need, or a negative score, for useless information, allows it to refine its future behaviour.
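The feedback loop can be sketched in a few lines of Python. This is a deliberately simplified, one-step version of reinforcement learning (closer to a bandit than to a full dialogue policy), offered as an assumption rather than as the group's method. The `FeedbackLearner` class, the state and action names, and the reward values of +1 and -1 are all hypothetical.

```python
import random
from collections import defaultdict


class FeedbackLearner:
    """Tabular action-value learner driven by user feedback scores."""

    def __init__(self, actions, learning_rate=0.1, explore=0.1, seed=0):
        self.actions = list(actions)
        self.learning_rate = learning_rate
        self.explore = explore
        self.values = defaultdict(float)  # (state, action) -> estimated score
        self.rng = random.Random(seed)

    def choose(self, state):
        """Mostly pick the best-scoring action, occasionally try another."""
        if self.rng.random() < self.explore:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[(state, a)])

    def learn(self, state, action, reward):
        """Move the estimate for (state, action) towards the observed reward."""
        key = (state, action)
        self.values[key] += self.learning_rate * (reward - self.values[key])


learner = FeedbackLearner(["confirm", "offer_result"])
for _ in range(50):
    learner.learn("goal_unclear", "confirm", +1.0)       # user was helped
    learner.learn("goal_unclear", "offer_result", -1.0)  # useless answer
```

After these simulated conversations, the learner's estimate for confirming the request when the goal is unclear is close to +1, and its estimate for blurting out a result is close to -1, so `choose("goal_unclear")` will usually return `"confirm"`. A real system would also have to handle rewards that arrive only at the end of a long conversation.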
In 2008, Young's team launched CamInfo, an SDS that people could telephone to ask about local restaurants, and which was developed in an EU project named CLASSiC. A 2009 demo on YouTube shows the system responding to a caller asking for a Chinese restaurant in a town's main square. There is no Chinese restaurant locally, but when the caller then says "What about an Italian restaurant?", the system retains details from earlier in the conversation, and finds an Italian restaurant in the same place.
Now PARLANCE aims to progress this by helping users with multiple goals. As with the house-hunting example, it will try to cut through swathes of information online and cope with multiple types of requests in one conversation, rather than a single enquiry about a restaurant.
Young and his team are also developing various new features. These include 'hyperlocal search', which allows the system to focus the conversation on the amenities in the local neighbourhood. The system is also being developed to use and respond to back-channel signals. These are the murmurs and grunts of agreement or disagreement such as "uh-huh" and "hmmm" that humans use unconsciously in natural dialogue to orchestrate the turn-taking and flow of information.
Do we really need this stuff, however? After all, it often seems that nothing can replace real human interaction. Young agrees, but points out that the increasing investment in speech technology by major corporations such as Apple, Google and Microsoft clearly shows that we are heading towards a world of speech interaction with our computers.
Nor is this simply a story about the unstoppable rise of the machines. In fact, it may become one about the empowerment of the Luddites. "Speech is one of the most inclusive media we have," Young observed. "Potentially, speech-controlled systems will enable us to bridge the generation gap in computing. We need to get away from crude systems that require users to constantly learn to push different combinations of buttons, which presents real barriers to some sections of the population such as the elderly. Speech will make complex systems accessible to virtually everybody."
Fluent dialogue systems that can cope with the most subtle aspects of human expression remain some way off - and perhaps we will never be able to chat with computers like our science fiction alter egos. Yet projects such as CLASSiC and PARLANCE are not only taking us incrementally closer to the goal of truly cognitive systems, but also changing the playing field by altering the basis on which it will be done. The SDS in your phone requires the pre-existing calculation of a programmer, but future systems will adapt on the basis of the conversations they have with you. Socrates once said that the only true wisdom is in knowing you know nothing. Perhaps the same is now becoming true for machines.