When Will HAL Understand What We Are Saying? Computer Speech Recognition and Understanding
This chapter from "HAL's Legacy: 2001's Computer as Dream and Reality" addresses the accomplishments--and challenges--of automatic speech recognition. What kind of paradigm shift in computing will give HAL the ability to understand human context, and therefore truly speak?
Originally published 1996 in HAL's Legacy: 2001's Computer as Dream and Reality. Published on KurzweilAI.net August 6, 2001.
Let's talk about how to wreck a nice beach.
Well, actually, if I were presenting this chapter verbally, you would have little difficulty understanding the preceding sentence as Let's talk about how to recognize speech. Of course, I wouldn't have enunciated the g in recognize, but then we routinely leave out and otherwise slur at least a quarter of the sounds that are "supposed" to be there, a phenomenon speech scientists call coarticulation.
On the other hand, had this been an article on a rowdy headbangers' beach convention (a topic we assume HAL knew little about), the interpretation at the beginning of the chapter would have been reasonable.
On yet another hand, if you were a researcher in speech recognition and heard me read the first sentence of this chapter, you would immediately pick up the beach-wrecking interpretation, because this sentence is a famous example of acoustic ambiguity and is frequently cited by speech researchers.
The point is that we understand speech in context. Spoken language is filled with ambiguities. Only our understanding of the situation, subject matter, and person (or entity) speaking--as well as our familiarity with the speaker--lets us infer what words are actually spoken.
Perhaps the most basic ambiguity in spoken language is the phenomenon of homonyms, words that sound absolutely identical but are actually different words with different meanings. When Frank asks, "Listen, HAL, there's never been any instance at all of a computer error occurring in a 9000 series, has there?", HAL has little difficulty interpreting the last word as there and not their. Context is the only source of knowledge that can resolve such ambiguities. HAL understands that the word their is an adjective and would have to be followed by the noun it modifies. Because it is the last word in the sentence, there is the only reasonable interpretation. Today's speech-recognition systems would also have little difficulty with this word and would resolve it the same way HAL does.
A more difficult task in interpreting Frank's statement is the word all. Is all a place--such as IBM headquarters--where a computer error may take place, as in "there's never been any instance of a computer error at IBM ..."? HAL resolves this ambiguity the same way viewers of the movie do. We know that all is not the name of a place or organization where an error may take place. This leaves us with at all as an expression of emphasis reinforcing the meaning of never as the only likely interpretation.
In fact, we try to understand what is being said before the words are even spoken, through a process called hypothesis and test. Next time you order coffee in a restaurant and a waiter asks how you want it, try saying "I'd like some dream and sugar please." It would take a rather attentive person to hear that you are talking about sweet dreams and not white coffee.
When we listen to other people talking--and people frequently do not really listen, a fault that HAL does not seem to share with the rest of us--we constantly anticipate what they are going to say next. Consider Dave's reply to HAL's questions about the crew psychology report:
Dave: Well, I don't know. That's rather a difficult question to ...
When Dave finally says the word answer, HAL tests his hypothesis by matching the word he heard against the word he had hypothesized Dave would say. In watching the movie, we all do the same thing. Any reasonable match would tend to confirm our expectation. The test involves an acoustic matching process, but the hypothesis has nothing to do with sound at all--nor even with language--but rather relates to knowledge on a multiplicity of levels.

As many of the chapters in this book point out, knowledge goes far beyond mere facts and data. For information to become knowledge, it must incorporate the relationships between ideas. And for knowledge to be useful, the links describing how concepts interact must be easily accessed, updated, and manipulated. Human intelligence is remarkable in its ability to perform all these tasks. Ironically, it is also remarkably weak at reliably storing the information on which knowledge is based. The natural strengths of today's computers are roughly the opposite. They have, therefore, become powerful allies of the human intellect because of their ability to reliably store and rapidly retrieve vast quantities of information. Conversely, they have been slow to master true knowledge. Modeling the knowledge needed to understand the highly ambiguous and variable phenomenon of human speech has been a primary key to making progress in the field of automatic speech recognition (ASR).

Lesson 1: Knowledge Is a Many-Layered Thing

Thus lesson number one for constructing a computer system that can understand human speech is to build in knowledge at many levels: the structure of speech sounds, the way speech is produced by our vocal apparatus, the patterns of speech sounds that comprise dialects and languages, the complex (and not fully understood) rules of word usage, and--the greatest difficulty--general knowledge of the subject matter being spoken about.
Each level of analysis provides useful constraints that can limit our search for the right answer. For example, the basic building blocks of speech, called phonemes, cannot appear in just any order. Indeed, many sequences are impossible to articulate (try saying ptkee). More important, only certain phoneme sequences correspond to a word or word fragment in the language. Although the set of phonemes used is similar (though not identical) from one language to another, contextual factors differ dramatically. English, for example, has over ten thousand possible syllables, whereas Japanese has only a hundred and twenty.
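To make the idea of constraint concrete, here is a minimal sketch (in Python, with a made-up three-word lexicon) of how a higher level of knowledge--a pronunciation dictionary--prunes phoneme sequences that the acoustic level alone might propose:

```python
# Hypothetical pronunciation lexicon: only these phoneme strings are words.
LEXICON = {("h", "ae", "l"): "HAL",
           ("h", "ih", "l"): "hill",
           ("h", "ao", "l"): "hall"}

# Candidate phoneme strings the acoustic level might propose for one utterance.
candidates = [("h", "ae", "l"), ("p", "t", "k"), ("h", "ae", "p")]

# The lexical level keeps only sequences that actually form words.
print([LEXICON[seq] for seq in candidates if seq in LEXICON])   # ['HAL']
```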
On a higher level, the syntax and semantics of the language put constraints on possible word orders. Resolving homonym ambiguities can require multiple levels of knowledge. One type of technology frequently used in speech recognition and understanding systems is a sentence parser, which builds sentence diagrams like those we learned in elementary school. One of the first such systems, developed in 1963 by Susumu Kuno of Harvard (around the time Kubrick and Clarke began work on 2001), revealed the depth of ambiguity in English. Kuno asked his computerized parser what the sentence "Time flies like an arrow" means. In what has become a famous response, the computer replied that it was not quite sure. It might mean
1. That time passes as quickly as an arrow passes.
2. Or maybe it is a command telling us to time the flies the same way that an arrow times flies; that is, Time flies like an arrow would.
3. Or it could be a command telling us to time only those flies that are similar to arrows; that is, Time flies that are like an arrow.
4. Or perhaps it means that a type of flies known as time flies have a fondness for arrows: Time-flies like (i.e., appreciate) an arrow.
It became clear from this and other syntactical ambiguities that understanding language, spoken or written, requires both knowledge of the relationships between words and of the concepts underlying words. It is impossible to understand the sentence about time (or even to understand that the sentence is indeed talking about time and not flies) without mastery of the knowledge structures that represent what we know about time, flies, arrows, and how these concepts relate to one another.
A system armed with this type of information would know that flies are not similar to arrows and would thus knock out the third interpretation. Often there is more than one way to resolve language ambiguities. The third interpretation could be syntactically resolved by noting that like in the sense of similar to ordinarily requires number agreement between the two objects compared. Such a system would also note that, as there are no such things as time flies, the fourth interpretation too is wrong. The system would also need such tidbits of knowledge as the fact that flies have never shown a fondness for arrows, and that arrows cannot and do not time anything--much less flies--to select the first interpretation as the only plausible one. The ambiguity of language, however, is far greater than this example suggests. In a language--parsing project at the MIT Speech Lab, Ken Church found a sentence with over two million syntactically correct interpretations.
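A toy grammar makes the ambiguity easy to reproduce. The following sketch (using the NLTK toolkit; the grammar itself is illustrative, not Kuno's) enumerates the parses of the sentence and finds the same four readings:

```python
import nltk

# A toy grammar (not Kuno's actual parser) in which "time" and "flies" can
# each be a noun or a verb, and "like" can be a verb or a preposition.
grammar = nltk.CFG.fromstring("""
S   -> NP VP | VP
NP  -> Det N | N | N N | NP PP
VP  -> V | V NP | V PP | V NP PP
PP  -> P NP
N   -> 'time' | 'flies' | 'arrow'
V   -> 'time' | 'flies' | 'like'
P   -> 'like'
Det -> 'an'
""")

parses = list(nltk.ChartParser(grammar).parse("time flies like an arrow".split()))
print(len(parses), "parses")     # four parses, matching the readings listed above
for tree in parses:
    print(tree)
```

Only world knowledge--not the grammar--can rule out three of the four trees.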
Often the tidbits of knowledge we need have to do with the specific situation of speakers and listeners. If I walk into my business associate's office and say "rook to king one," I am likely to get a response along the lines of "excuse me?" Even if my words were understood, their meaning would still be unclear; my associate would probably interpret them as a sarcastic remark implying that I think he regards himself as a king. In the context of a chess game, however, not only is the meaning clear, but the words are easy to recognize. Indeed, our contemporary speech-recognition systems do a very good job when the domain of discourse is restricted to something as narrow as a chess game. So, HAL, too, has little trouble understanding when Frank says "rook to king one" during one of their chess matches.

Lesson 2: The Unpredictability of Human Speech

A second lesson for building our computer system is that it must be capable of understanding the variability of human speech. We can, of course, build in pictures of human speech called spectrograms, which plot the intensity of different frequencies (or pitches, in human perceptual terms) as they change over time. What is interesting--and, for those of us developing speech-recognition machines, daunting--is that spectrograms of two people saying the same word can look dramatically different. Even the same person pronouncing the same word at different times can produce quite different spectrograms.
Look at the two spectrogram pictures of Dave and Frank saying the word HAL. It would be difficult to know that they are saying the same word from the pictures alone. Yet the spectrograms present all the salient information in the speech signals.
Yet, there must be something about these different sound pictures that is the same; otherwise we humans and HAL, as a human-level machine, would be unable to identify them as two examples of the same spoken word. Thus, one key to building automatic speech recognition (ASR) machines is the search for these invariant features. We note, for example, that vowel sounds (e.g., the a sound in HAL, which may be denoted as æ or /a/) involve certain resonant frequencies called formants that are sustained over some tens of milliseconds. We tend to find these formants in a certain mathematical relationship whenever /a/ is spoken. The same is true of the other sustained vowels. (Although the relationship is not a simple one, we observe that the relationship of the frequency of the second formant to the first formant for a particular vowel falls within a certain range, with some overlap between the ranges for different vowels.) Speech recognition systems frequently include a search function for finding these relationships, sometimes called features.
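As a rough illustration of this kind of feature search, the sketch below synthesizes a crude vowel-like frame, picks out its two strongest spectral peaks as stand-ins for the first two formants, and checks them against approximate formant ranges. The synthesis, the peak picking, and the ranges are all simplifications for illustration only:

```python
import numpy as np

SR = 16000  # sampling rate, Hz

def synth_vowel_frame(f1, f2, dur=0.05):
    """Crude stand-in for one vowel frame: two damped resonances (formants)."""
    t = np.arange(int(SR * dur)) / SR
    return (np.exp(-60 * t) * np.sin(2 * np.pi * f1 * t)
            + 0.7 * np.exp(-80 * t) * np.sin(2 * np.pi * f2 * t))

def first_two_peaks(frame, fmax=3000.0):
    """Return the two strongest spectral peaks below fmax, in Hz."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=4096))
    freqs = np.fft.rfftfreq(4096, d=1 / SR)
    spec, freqs = spec[freqs < fmax], freqs[freqs < fmax]
    peaks = [i for i in range(1, len(spec) - 1)
             if spec[i] > spec[i - 1] and spec[i] > spec[i + 1]]   # local maxima
    peaks.sort(key=lambda i: spec[i], reverse=True)
    return tuple(sorted(freqs[i] for i in peaks[:2]))

# Very rough, illustrative F1/F2 ranges for a few vowels (not calibrated data).
VOWEL_RANGES = {"/ae/": ((550, 900), (1500, 2100)),
                "/i/":  ((200, 400), (2000, 3000)),
                "/u/":  ((250, 450), (600, 1100))}

def classify(f1, f2):
    for vowel, ((lo1, hi1), (lo2, hi2)) in VOWEL_RANGES.items():
        if lo1 <= f1 <= hi1 and lo2 <= f2 <= hi2:
            return vowel
    return "unknown"

f1, f2 = first_two_peaks(synth_vowel_frame(660, 1720))   # roughly the vowel in "HAL"
print(round(f1), round(f2), classify(f1, f2))             # ~660 ~1720 /ae/
```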
By studying spectrograms, we also note that certain changes do not convey any information; that is, there are types of changes we should filter out and ignore. An obvious one is loudness. When Dave shouts HAL's name in the pod in space, HAL realizes it is still his name. HAL infers some meaning from Dave's volume, but it is relatively unimportant for identifying the words being spoken. We apply, therefore, a process called normalization, in which we make all words the same loudness so as to eliminate this noninformative source of variability.
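Loudness normalization is the simplest of these steps; a minimal sketch might scale every utterance to a common root-mean-square level (the target level and the random "utterances" here are arbitrary):

```python
import numpy as np

def normalize_rms(signal, target_rms=0.1, eps=1e-12):
    """Scale a waveform to a fixed RMS level, discarding overall loudness."""
    rms = np.sqrt(np.mean(signal ** 2))
    return signal * (target_rms / (rms + eps))

whispered = 0.01 * np.random.randn(16000)   # stand-ins for a quiet and a shouted "HAL"
shouted   = 0.80 * np.random.randn(16000)
print(np.std(normalize_rms(whispered)), np.std(normalize_rms(shouted)))  # nearly equal
```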
A more complex example is the phenomenon of nonlinear time compression. When we speak, we change our speed according to context and other factors. If we speak one word more quickly, we do not increase the rate evenly throughout the entire word. The duration of certain portions of the word, such as plosive consonants (e.g., /p/, /b/, /t/), remains fairly constant, while other portions, such as vowels, undergo most of the change. In matching a spoken word to a stored example (a template), we need to align corresponding acoustic events or the match will never succeed. A mathematical technique called dynamic programming solves this temporal alignment problem.

Lesson 3: Speech Is Like a Song

A third lesson is also apparent from studying speech spectrograms. It is that the perceptual cues needed to identify speech sounds and assemble words are found in the frequency domain, not in the original time-varying signal. To make sense of it, we need to convert the original waveform into its frequency components. The human vocal tract is similar to a musical instrument (indeed, it is a musical instrument). The vocal cords vibrate, creating a characteristic pitched sound; the length and tautness of the cords determine pitch in the same way that the length and tautness of a violin or piano string does. We can control the tautness of our vocal cords--as we do when singing--and alter the overtones produced by our vibrating cords by moving our tongue, teeth, and lips, which change the shape of the vocal tract. The vocal tract is a chamber that acts like a pipe in a pipe organ, the harmonic resonances of which emphasize certain overtones and diminish others. Finally, we control a small piece of tissue called the alveolar flap (or soft palate), which opens and closes the nasal cavity. When the alveolar flap is open, the nasal cavity adds an additional resonant chamber; it's a lot like opening another organ pipe. (Viewers of My Fair Lady will recall that the anatomy of speech is also an important topic for specialists in phonetics.)
In addition to the pitched sound produced by the vocal cords, we can produce a noiselike sound by the rush of air through the speech cavity. This sound does not have specific overtones but is a complex spectrum of many frequencies mixed together. Like the musical tones produced by the vocal cords, the spectra of these noise sounds are shaped by the changing resonances of the moving vocal tract.
This vocal apparatus allows us to create the varied sounds that comprise human speech. Although many animals communicate with others of their species through sound, we humans are unique in our ability to shape that sound into language. We produce vowel sounds (e.g., /a/, /i/) by shaping the overtones from the vibrating vocal cords into distinct frequency bands, the formants. Sibilant sounds (/s/, /z/) result from the rush of air through particular configurations of tongue and teeth. Plosive consonants (/p/, /b/, /t/) are transitory sounds created by the percussive movement of lips, tongue, and mouth cavity. Nasal sounds (/n/, /m/) are created by invoking the resonances of the nasal cavity. The distribution of sounds varies from one language to another.
Each of the several dozen basic sounds, the phonemes, requires an intricate movement involving precise coordination of the vocal cords, alveolar flap, tongue, lips, and teeth. We typically speak about three words per second. So with an average of six phonemes per word, we make about eighteen intricate phonetic gestures per second, a task comparable in complexity to a performance by a concert pianist. We do this without thinking about it, of course. Our thoughts remain on the conceptual (that is, the highest) level of the language and knowledge hierarchy. In our first two years of life, however, we thought a lot about how to make speech sounds--and how to string them together meaningfully. This process is an example of our sequential (i.e., logical, rational) conscious mind training our parallel preconscious pattern-processing mental faculties.
The mechanisms described above for creating speech sounds--vocal cord vibrations, the noise of rushing air, articulatory gestures of the mouth, teeth and tongue, the shaping of the vocal and nasal cavities--produce different rates of vibration. Physicists measure these rates of vibration as frequencies; we perceive them as pitches. Though we normally think of speech as a single time-varying sound, it is actually a composite of many different sounds, each with its own frequency. Using this insight, most ASR researchers starting in the late 1960s began by breaking up the speech waveform into a number of frequency bands. A typical commercial or research ASR system will produce between a few and several dozen frequency bands. The front end of the human auditory system does exactly the same thing: each nerve ending in the cochlea (inner ear) responds to different frequencies and emits a pulsed digital signal when activated by an appropriate pitch. The cochlea differentiates several thousand overlapping bands of frequency, which gives the human auditory system its extremely high degree of sensitivity to frequency. Experiments have shown that increasing the number of overlapping frequency bands of an ASR system (thus making it more like the human auditory system) increases the ability of that system to recognize human speech.

Lesson 4: Learn While You Listen

A fourth lesson emphasizes the importance of learning. At each stage of processing, a system must adapt to the individual characteristics of the talker. Learning to do this has to take place at several levels: those of the frequency and time relationships characterizing each phoneme, the dialect (pronunciation) patterns of each word, and the syntactic patterns of possible phrases and sentences. At the highest cognitive level, a person or machine understanding speech learns a great deal about what a particular talker tends to talk about and how that talker phrases his or her thoughts.
HAL learns a great deal about his human crewmates by listening to the sound of their voices, what they talk about, and how they put sentences together. He also watches what their mouths do when they articulate certain phrases (chapter 11). HAL gathers so much knowledge about them that he can understand them even when some of the information is obscured--for example, when he has to rely solely on his visual observation of Dave and Frank's lips.

Lesson 5: Hungry for MIPS and Megabytes

The fifth lesson is that speech recognition is a process hungry for MIPS (millions of instructions per second) and megabytes (millions of bytes of storage), which is to say that we can obtain more accurate performance by using faster computers with larger memories. Certain algorithms or methods are only available in computers that operate at high levels of performance. Brute force--that is, huge memory--is necessary but clearly not sufficient without solving the difficult algorithmic and knowledge-capture issues mentioned above.
We now know that 1997, when HAL reportedly became intelligent, is too soon. We won't have the quantity of computing, in terms of speed and memory, needed to build a HAL. And we won't be there in 2001 either.
Let's keep these lessons in mind as we examine the roots and future prospects of building machines that can duplicate HAL's ability to understand speech.

The Importance of Speech Recognition

Before examining the sweep of progress in this field, it is worthwhile to underscore the importance of the auditory sense, particularly our ability to understand spoken language, and why this is a critical faculty for HAL. Most of HAL's interaction with the crew is verbal. It is primarily through his recognition and understanding of speech that he communicates. HAL's visual perceptual skills, which are far more difficult to create, are relatively less important for carrying out his mission, even though the pivotal scene--in which HAL understands Dave and Frank's conspiratorial conversation without being able to hear them--relies on HAL's visual sense. Of course, his apparently self-taught lipreading is based on his speech-recognition ability and would have been impossible if HAL had not been able to understand spoken language.
To demonstrate the importance of the auditory sense, try watching the television news with the sound turned off. Then try it again with the sound on, but without looking at the picture. Next, try a similar experiment with a videotape of the movie 2001. You will probably find it easier to follow the stories with your ears alone than with your eyes alone, even though our eyes transmit much more information to our brains than our ears do--about fifty billion bits per second from both eyes versus approximately a million bits per second from two ears. The result is surprising. There is a saying that a picture is worth a thousand words; yet the above exercise illustrates the superior power of spoken language to convey our thoughts. Part of that power lies in the close link between verbal language and conscious thinking. Until recently, a popular theory held that thinking was subvocalized speech. (J. B. Watson, the founder of behaviorism, attached great attention to the small movements of the tongue and larynx made while we think.) Although we now recognize that thoughts incorporate both language and visual images, the crucial importance of the auditory sense in the acquisition of knowledge--which we need in order to recognize speech in the first place--is widely accepted.
Yet many people consider blindness a more serious handicap than deafness. A careful consideration of the issues shows this to be a misconception. With modern mobility techniques, blind persons with appropriate training have little difficulty going from place to place. The blind employees of my first company (Kurzweil Computer Products, Inc., which developed the Kurzweil Reading Machine for the Blind) traveled around the world routinely. Reading machines can provide access to the world of print, and visually impaired people experience few barriers to communicating with others in groups or individual encounters. For the deaf, however, the barrier to understanding what other people are saying is fundamental.
We learn to understand and produce spoken language during our first year of life, years before we can understand or create written language. HAL apparently spent years learning human speech by listening to his teacher, whom he identifies as Mr. Langley, at the HAL lab in Urbana, Illinois. Studies with humans have shown that groups of people can solve problems with dramatically greater speed if they can communicate verbally rather than being restricted to other methods. HAL and his human colleagues amply demonstrate this finding. Thus, intelligent machines that understand verbal language make possible an optimal modality of communication. In recent years, a major goal of artificial intelligence research has been making our interactions with computers more natural and intuitive. HAL's primarily verbal communication with crew members is a clear example of an intuitive user interface.

Keeping in mind our five lessons about creating speech-recognition systems, it is interesting to examine historical attempts to endow machines with the ability to understand human speech. The effort goes back to Alexander Graham Bell, and the roots of the story go even farther back, to Bell's grandfather Alexander Bell, a widely known lecturer and speech teacher. His son, Alexander Melville Bell, created a phonetic system for teaching the deaf to speak called visible speech. At the age of twenty-four, Alexander Graham Bell began teaching his father's system of visible speech to instructors of the deaf in Boston. He fell in love with and subsequently married one of his students, Mabel Hubbard. She had been deaf since the age of four as a result of scarlet fever. The marriage served to deepen his commitment to applying his inventiveness to overcoming the handicaps of deafness.
He built a device he called a phonautograph to make visual patterns from sound. Attaching a thin stylus to an eardrum he obtained from a medical school, he traced the patterns produced by speaking through the eardrum on a smoked glass screen. His wife, however, was unable to understand speech by looking at these patterns. The device could convert speech sounds into pictures, but the pictures were highly variable and showed no similarity in patterns, even when the same person spoke the same word.
In 1874, Bell demonstrated that the different frequency harmonics from an electrical signal could be separated. His harmonic telegraph could send multiple telegraphic messages over the same wire by using different frequency tones. The next year, the twenty-eight-year-old Bell had a profound insight. He hypothesized that although the information needed to understand speech sounds could not be seen by simply displaying the speech signal directly, it could be recognized if you first broke the signal into different frequency bands. Bell's intuitive discovery of our third lesson also turns out to be a key to finding the invariant features needed for the second lesson.
Bell felt sure he had all the pieces needed to implement this insight and give his wife the ability to understand human speech. He had already developed a moving drum and solenoid (a metal core wrapped with wire) that could transform a human voice into a time-varying current of electricity. All he needed to do, he thought, was to break up this electrical signal into different frequency bands, as he had done previously with the harmonic telegraph, then render each of these harmonics visually--by using multiple phonautographs. In June of 1875, while attempting to prepare this experiment, he accidentally connected the wire from the input solenoid back to another similar device. Now most processes are not reversible. Try unsmashing a teacup or speaking into a reading machine for the blind, which converts print into speech; it will not convert the speech back into print. But, unexpectedly, Bell's erstwhile microphone began to speak! Thus was the telephone discovered, or we should say, invented.
The device ultimately broke down the communication barrier of distance for the human race. Ironically, Bell's great invention also deepened the isolation of the deaf. The two methods of communication available to the deaf--sign language and lipreading--are not possible over the telephone.
He continued to experiment with a frequency-based phonautograph, but without a computer to analyze the rapidly time-varying harmonic bands, the information remained a bewildering array to a sighted deaf person. We now know that we can visually examine frequency-based pictures of speech (i.e., spectrograms) and understand the communication from the visual information alone; but the process is extremely difficult and slow. An MIT graduate course, Speech Spectrogram Reading, teaches precisely this skill. The purpose of the course is to give students insight into the spectral cues of salient speech events. For many years, the course's professor, Dr. Victor Zue, was the only person who could understand speech from spectrograms with any proficiency; several people have reportedly now mastered this skill. Computers, on the other hand, can readily handle spectral information, and we can build a crude but usable speech-recognition system using this type of acoustic information alone. So Bell was on the right track--about a century too soon.
Ironically, another pioneer, Charles Babbage, had attempted to create that other prerequisite to automatic speech recognition--the programmable computer--about forty years earlier. Babbage built his computer, the analytical engine, entirely of mechanical parts; yet it was a true computer, with a stored program, a central processing unit, and memory store. Despite Babbage's exhaustive efforts, nineteenth-century machining technology could not build the machine. Like Bell, Babbage was about a century ahead of his time, and the analytical engine never ran.
Not until the 1940s, when fueled by the exigencies of war, were the first computers actually built: the Z-3 by Konrad Zuse in Nazi Germany, the Mark I by U.S. Navy Commander Howard Aiken, and the Robinson and Colossus computers by Alan Turing and his English colleagues. Turing's Bletchley group broke the German Enigma code and are credited with enabling the Royal Air Force to win the Battle of Britain and so withstand the Nazi war machine.
For Bell, whose invention of the telephone created the telecommunications revolution, the original goal of easing the isolation of the deaf remained elusive. His insights into separating the speech signal into different frequency components and rendering those components as visible traces were not successfully implemented until Potter, Kopp, and Green designed the spectrogram and Dreyfus-Graf developed the steno-sonograph in the late 1940s. These devices generated interest in the possibility of automatically recognizing speech because they made the invariant features of speech visible for all to see.
The first serious speech recognizer was developed in 1952 by Davis, Biddulph, and Balashek of Bell Labs. Using a simple frequency splitter, it generated plots of the first two formants, which it identified by matching them against prestored patterns in an analog memory. With training, it was reported, the machine achieved 97 percent accuracy on the spoken forms of ten digits.
By the 1950s, researchers began to follow lesson 5 and to use computers for ASR, which allowed for linear time normalization, a concept introduced by Denes and Mathews in 1960. The 1960s saw several successful experiments with discrete word recognition in real time using digital computers; words were spoken in isolation with brief silent pauses between them. Some notable success was also achieved with relatively large vocabularies, although with constrained syntaxes. In 1969, two such systems--the Vicens system, which accepted a five-hundred-word vocabulary, and the Medress system with its one-hundred-word vocabulary--were described in Ph.D. dissertations.
That same year, John Pierce wrote a celebrated, caustic letter objecting to the repetitious implementation of small-vocabulary discrete word devices. He argued for attacking more ambitious goals by harnessing different levels of knowledge, including knowledge of speech, language, and task. He argued against real-time devices, anticipating (correctly) that processing speeds would improve dramatically in the near future. Partly in response to the concerns articulated by Pierce, the U.S. Defense Advanced Research Projects Agency began serious funding of ASR research with the ARPA SUR (Speech Understanding Research) project, which began in 1971. As Allen Newell of Carnegie Mellon University observes in his 1975 paper, there were three ARPA SUR dogmas. First, all sources of knowledge, from acoustics to semantics, should be part of any research system. Second, context and a priori knowledge of the language should supplement analysis of the sound itself. Third, the objective of ASR is, properly, speech understanding, not simply correct identification of words in a spoken message. Systems, therefore, should be evaluated in terms of their ability to respond correctly to spoken messages about such pragmatic problems as travel budget management. (For example, researchers might ask a system "What is the plane fare to Ottawa?") Not surprisingly, this third dogma was the most controversial and remains so today; different markets have been identified for speech-recognition and speech-understanding systems.
The stated goal of ARPA SUR was a recognition system with 90 percent sentence accuracy for continuous-speech sentences, using thousand-word vocabularies, and not necessarily in real time. Of the four principal ARPA SUR projects, the only one to meet the stated goal was Carnegie Mellon University's Harpy system, which achieved a 5 percent error rate on a 1,011-word vocabulary on continuous speech. One of the ways the CMU team achieved the goal was clever: they made the task easier by restricting word order; that is, by limiting spoken words to certain sequences in the sentence.
The five-year ARPA SUR project was thoroughly analyzed and debated for at least a decade after its completion. Its legacy was to establish firmly the five lessons I have described. By then it was clear that the best way to reduce the error rate was to build in as much knowledge as possible about speech (lesson 1): how speech sounds are structured, how they are strung together, what determines sequences, the syntactic structure of the language (English, in this case), and the semantics and pragmatics of the subject matter and task--which for ARPA SUR were far simpler than what HAL had to understand.
Great strides were made in normalizing the speech signal to filter out variability (lesson 2). F. Itakura, a Japanese scientist, and H. Sakoe and S. Chiba introduced dynamic programming to compute optimal nonlinear time alignments, a technique that quickly became the standard. Jim Baker and IBM's Fred Jelinek introduced a statistical method called Markov Modeling; it provided a powerful mathematical tool for finding the invariant information in the speech signal.
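To make the alignment idea concrete, here is a minimal dynamic time warping sketch in Python; the "templates" are random stand-ins for real feature sequences, so the numbers are purely illustrative:

```python
import numpy as np

def dtw_distance(a, b):
    """Cost of the best nonlinear time alignment of two feature sequences
    (classic dynamic-programming / dynamic time warping recursion)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])      # frame-to-frame distance
            D[i, j] = cost + min(D[i - 1, j],               # stretch sequence a
                                 D[i, j - 1],               # stretch sequence b
                                 D[i - 1, j - 1])           # advance both
    return D[n, m]

# Template matching: the stored word whose template aligns most cheaply wins.
rng = np.random.default_rng(0)
templates = {"HAL": rng.normal(size=(40, 12)), "hill": rng.normal(size=(35, 12))}
utterance = templates["HAL"][::2] + 0.1 * rng.normal(size=(20, 12))   # "spoken" faster
print(min(templates, key=lambda w: dtw_distance(utterance, templates[w])))   # HAL
```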
Lesson 3, about breaking the speech signal into its frequency components, had already been established prior to the ARPA SUR projects, some of which developed systems that could adapt to aspects of the speaker's voice (lesson 4). Lesson 5 was anticipated by allowing ARPA SUR researchers to use as much computer memory as they could afford to buy and as much computer time as they had the patience to wait for. An underlying, and accurate, expectation was that Moore's law (see chapter 3 and below) would ultimately provide whatever computing platform the algorithms required.

The 1970s

The 1970s were notable for other significant research efforts. In addition to introducing dynamic programming, Itakura developed an influential analysis of spectral-distance measures, a way to compute how similar two different sounds are. His system demonstrated an impressive 97.3 percent accuracy on two hundred Japanese words spoken over the telephone. Bell Labs also achieved significant success (a 97.1 percent accuracy) with speaker-independent systems--that is, systems that understand voices they have not heard before. IBM concentrated on the Markov modeling statistical technique and demonstrated systems that could recognize a large vocabulary.
By the end of the 1970s, numerous commercial speech-recognition products were available. They ranged from Heuristics' $259 H-2000 Speech Link to $100,000 speaker-independent systems from Verbex and Nippon. Other companies, including Threshold, Scott, Centigram, and Interstate, offered systems with sixteen-channel filter banks at prices between $2,000 and $15,000. Such products could recognize small vocabularies spoken with pauses between words.

The 1980s

The 1980s saw the commercial field of ASR split into two fairly distinct market segments. One group--which included Verbex, Voice Processing Corporation, and several others--pursued reliable speaker-independent recognition of small vocabularies for telephone transaction processing. The other group, which included IBM and two new companies--Jim and Janet Baker's Dragon Systems, and my Kurzweil Applied Intelligence--pursued large-vocabulary ASR for creating written documents by voice.
Important work on large-vocabulary continuous speech (i.e., speech with no pauses between words) was also conducted at Carnegie Mellon University by Kai-Fu Lee, who subsequently left the university to head Apple's speech-recognition efforts.
By 1991, revenues for the ASR industry were in the low eight figures and were increasing substantially every year. A buyer could choose any one (but not two) of the following characteristics: large vocabulary, speaker independence, or continuous speech. HAL, of course, could do all three.

The State of the Art

So where are we today? We now, finally, have inexpensive personal computers that can support high-performance ASR software. Buyers can now choose any two (but not all three) capabilities from the menu listed above. For example, my company's Kurzweil VOICE for Windows can recognize a sixty-thousand-word vocabulary spoken discretely (i.e., with brief pauses between each word). Another experimental system can handle a thousand-word, command-and-control vocabulary with continuous speech (i.e., no pauses). Both systems provide speaker independence; that is, they can recognize words spoken in your voice even if they've never heard it before. Systems in this product category are also made by Dragon Systems and IBM.

Playing HAL

To demonstrate today's state of the art in computer speech recognition, we fed some of the sound track of 2001 into the Kurzweil VOICE for Windows version 2.0 (KV/Win 2.0). KV/Win 2.0 is capable of understanding the speech of a person it has not heard speak before and can recognize a vocabulary of up to sixty thousand words (forty thousand in its initial vocabulary, with the ability to add another twenty thousand). The primary limitation of today's technology is that it can only handle discrete speech--that is, words or brief phrases (such as thank you) spoken with brief pauses in between. I played the following dialog to KV/Win 2.0 with a view to learning whether it could understand Dave as HAL does in the movie:
HAL: Good evening, Dave.
Dave: How you doing, HAL?
HAL: Everything is running smoothly; and you?
Dave: Oh, not too bad.
HAL: Have you been doing some more work?
Dave: Just a few sketches.
HAL: May I see them?
Dave: Sure.
HAL: That's a very nice rendering, Dave. I think you've improved a great deal. Can you hold it a bit closer?
Dave: Sure.
HAL: That's Dr. Hunter, isn't it?
Dave: Hm hmm.
HAL: By the way, do you mind if I ask you a personal question?
Dave: No, not at all.
I trained the system on the phrases "Oh, not too bad" and "No, not at all," but did not train it on Dave's voice. When I did the experiment, KV/Win 2.0 had never heard Dave's voice, and it had to pick out each word or phrase from among forty thousand possibilities. I had the system listen to Dave saying the following discrete words and phrases from the above dialog:
Dave: Oh, not too bad.
Dave: Sure.
Dave: Sure.
Dave: No, not at all.
KV/Win 2.0 was able to successfully recognize the above utterances even though it had not been previously exposed to Dave's voice. For good measure, I also had KV/Win 2.0 listen to Dave in the critical scene in which HAL is betraying him. In this scene, Dave says the word HAL five times in a row in an increasingly plaintive voice. KV/Win 2.0 successfully recognized the five utterances, despite their obvious differences in tone and enunciation. Looking at the spectrogram, we can see that these five utterances, although they are similar in some respects, are really quite different from one another and demonstrate clearly the variability of human speech. So, except for KV/Win's restriction to discrete speech, with regard to speech recognition we've already created HAL!
Of course, the limitation to discrete speech is no minor exception. When will our computers be capable of recognizing fully continuous speech? Recently, ARPA has funded a new round of research aimed at "holy grail" systems that combine all three capabilities--handling continuous speech with very large vocabularies and speaker independence. Like the earlier ARPA SUR projects, there are no restrictions on memory or real-time performance. Restricting the task to understanding "business English," ARPA contractors--including Phillips, Bolt, Beranek and Newman, Dragon Systems, Inc., and others--have reported word accuracies around 97 percent or higher. Moore's law will take care of achieving real-time performance on affordable machines, so that we should see such systems available commercially by, perhaps, early 1998.
Expanding the domain of recognition--not to mention understanding--to the humanlike flexibility HAL displays will take a far greater mastery of the many levels of knowledge represented in spoken language. I would expect that by the year 2001--remembering that in the movie HAL became intelligent much earlier--we will have systems able to recognize speech well enough to produce a written transcription of the movie from the sound track. Even then, the error rate will be far higher than HAL's (who, of course, claims he has never made a mistake).
In 1997 we appreciate that speech recognition does not exist in a vacuum but has to be integrated with other levels and sources of knowledge. Kurzweil Applied Intelligence, Inc., for example, has integrated its large-vocabulary speech-recognition capability with an expert system that has extensive knowledge about the preparation of medical reports; the Kurzweil VoiceMED can guide doctors through the reporting process and help them comply with the latest regulations. If you find yourself in a hospital emergency room, there is a 10-percent chance your attending physician will dictate his or her report to one of our speech-recognition systems. We recently began adding the ability to understand natural-language commands spoken in continuous speech. If, for example, you say, "go to the second paragraph on the next page; select the second sentence; capitalize every word in this sentence; underline it ..." the system is likely to follow this series of commands. If you say "Open the pod bay doors," it will probably respond "Command not understood."

How to Build a Speech Recognizer
Software today is not an isolated field, but one that encompasses and codifies every other field of endeavor. Librarians, musicians, magazine publishers, doctors, graphic artists, architects, researchers of every kind--all are digitizing their knowledge bases, methods, and expressions of their work. Those of us working on speech understanding are experiencing the same rapid change, as hundreds of scientists and engineers build increasingly elaborate data bases and structures to describe our knowledge of speech sounds, phonetics, linguistics, syntax, semantics, and pragmatics--in accordance with lesson 1.
A speech-recognition system operates in phases, with each new phase using increasingly sophisticated knowledge about the next higher level of language. At the front end, the system converts the time-varying air pressure we call sound into an electrical signal, as Bell did a hundred years ago with his crude microphones. Then, a device called an analog-to-digital converter changes the signal into a series of numbers. The numbers may be modified to normalize for loudness levels and possibly to eliminate background noise and distortion. The signal, which is now a digital stream of numbers, is usually converted into multiple streams, each of which represents a different frequency band. These multiple streams are then compressed, using a variety of mathematical techniques that reduce the amount of information and emphasize those features of the speech signal important for recognizing speech.
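A minimal sketch of such a front end, using a simple equal-width filter bank (real systems use perceptually spaced bands and more careful compression), might look like this:

```python
import numpy as np

def filterbank_frames(signal, sr=16000, frame_len=400, hop=160, n_bands=24):
    """Minimal front end: cut the digitized waveform into short frames and
    reduce each frame to log energies in a handful of frequency bands."""
    out = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame)) ** 2
        # group FFT bins into n_bands roughly equal-width bands; real systems
        # use perceptually spaced (e.g., mel) bands
        bands = np.array_split(power, n_bands)
        out.append(np.log(np.array([b.sum() for b in bands]) + 1e-10))
    return np.array(out)             # shape: (number of frames, n_bands)

one_second = np.random.randn(16000)  # stand-in for one second of digitized speech
print(filterbank_frames(one_second).shape)
```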
For example, we want to know that a certain segment of sound contains a broad noiselike band of frequencies that might represent the sound of rushing air, as in the sound /h/ in HAL; another segment contains two or three resonant frequencies in a certain ratio that might represent the vowel sound /a/ in HAL. One way to accomplish this labeling is to store examples of such sounds and attempt to match incoming time slices against these templates. Usually, the attempt to categorize slices of sound uses a much finer classification system than the approximately forty phonemes of English. We typically use a set of 256 or even 1,024 possible classifications in a process called vector quantization.
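A toy version of vector quantization follows: learn a codebook from a batch of feature frames with k-means, then represent each new time slice by the index of its nearest codeword. The frame data here is random, standing in for real filter-bank features:

```python
import numpy as np

def nearest(frames, codebook):
    """Index of the closest codeword for every frame (squared-distance trick)."""
    d2 = ((frames ** 2).sum(axis=1)[:, None]
          - 2.0 * frames @ codebook.T
          + (codebook ** 2).sum(axis=1)[None, :])
    return d2.argmin(axis=1)

def train_codebook(frames, n_codewords=256, iters=20, seed=0):
    """Toy vector quantization: learn a codebook with k-means."""
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), n_codewords, replace=False)].copy()
    for _ in range(iters):
        labels = nearest(frames, codebook)
        for k in range(n_codewords):
            if np.any(labels == k):                  # move codeword to the mean
                codebook[k] = frames[labels == k].mean(axis=0)
    return codebook

feature_frames = np.random.randn(5000, 24)    # hypothetical filter-bank frames
codebook = train_codebook(feature_frames)
print(nearest(feature_frames[:10], codebook))  # one small integer per time slice
```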
Once we have classified these time slices of sound, we can use one of several competing approaches to recognizing words. One of them develops statistical models for words or portions of words by analyzing massive amounts of prerecorded speech data. Markov modeling and neural nets are examples of this approach. Another approach tries to detect the underlying string of phonemes (or possibly other types of subword units) and then match them to the words spoken.
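As a sketch of the statistical-model approach, the fragment below scores a sequence of quantized codewords against a hypothetical three-state, left-to-right hidden Markov model of one word; in a real recognizer every vocabulary word (or subword unit) has such a model, with probabilities trained on large amounts of recorded speech:

```python
import numpy as np

def forward_prob(obs, pi, A, B):
    """Forward algorithm for a discrete hidden Markov model: the probability
    that this word model generated the observed codeword sequence.
    pi[i]   - probability of starting in state i
    A[i, j] - probability of moving from state i to state j
    B[i, k] - probability that state i emits codeword k"""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# A hypothetical three-state, left-to-right model over a four-symbol codebook.
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.7, 0.1, 0.1],
              [0.1, 0.1, 0.7, 0.1]])
print(forward_prob([0, 0, 1, 1, 2], pi, A, B))
# In recognition, the word whose model assigns the highest probability to the
# observed codewords is the winner.
```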
At Kurzweil Applied Intelligence (KAI), rather than select one optimal approach, we implemented seven or eight different modules, or "experts," then programmed another software module, the "expert manager," which knows the strengths and weaknesses of the different software experts. In this decision-by-committee approach, the expert manager is the chief executive officer and makes the final decisions.
In the KAI systems, some of the expert modules are based, not on the sound of the words but on rules and the statistical behavior of word sequences. This is a variation of the hypothesis-and-test paradigm in which the system expects to hear certain words, according to what the speaker has already said. Each of the modules in the system has a great deal of built-in knowledge. The acoustic experts contain knowledge on the sound structure of words or such subword units as phonemes. The language experts know how words are strung together. The expert manager can judge which experts are more reliable in particular situations.
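In spirit (the module names and figures below are invented for illustration, not KAI's actual implementation), the expert manager's decision can be as simple as a reliability-weighted vote:

```python
def committee_decision(expert_scores, reliability):
    """Combine per-word scores from several expert modules, weighting each
    expert by how reliable the manager judges it to be in this situation."""
    words = next(iter(expert_scores.values())).keys()
    combined = {w: sum(reliability[name] * scores[w]
                       for name, scores in expert_scores.items())
                for w in words}
    return max(combined, key=combined.get)

# Invented module names and scores, purely for illustration.
expert_scores = {
    "acoustic_expert": {"HAL": 0.70, "hall": 0.65, "hill": 0.20},
    "language_expert": {"HAL": 0.90, "hall": 0.30, "hill": 0.25},
    "speaker_adapted": {"HAL": 0.80, "hall": 0.40, "hill": 0.30},
}
reliability = {"acoustic_expert": 0.5, "language_expert": 0.3, "speaker_adapted": 0.2}
print(committee_decision(expert_scores, reliability))   # HAL
```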
The system as a whole begins with generic knowledge of speech and language in general, then adapts these knowledge structures, based on what it observes in a particular speaker. In the film, Dave and Frank frequently invoke HAL's name. Even today's speech-recognition systems would quickly learn to recognize the word HAL and would not mistake it for hill or hall, at least not after being corrected once or twice.
In continuous speech, a speech-recognition system needs to deal with the additional ambiguity of when words start and end. Its attempts to match the classified time slices and recognized subword units against actual word hypotheses could result in a combinatorial explosion. A vocabulary of, say, sixty thousand words, could produce 3.6 billion possible two-word sequences, 216 trillion three-word sequences, and so on. Obviously, as we cannot examine even a tiny fraction of these possibilities, search constraints based on the system's knowledge of language are crucial. The other major ingredient needed to achieve the holy grail (i.e., a system that can understand fully continuous speech with high accuracy with relatively unrestricted vocabulary and domain and with no previous exposure to the speaker) is a more-powerful computer. We already have systems that can combine continuous speech, very large vocabularies, and speaker independence--with the only limitation being restriction of the domain to business English. But these systems require RAM memories of over 100 megabytes and run much slower than real time on powerful workstations. Even though computational power is critical to developing speech recognition and understanding, no one in the field is worried about obtaining it in the near future. We know we will not have to wait long to achieve the requisite computational power because of Moore's law.
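Before turning to Moore's law, a toy beam search makes the earlier point about search constraints concrete: hypotheses are extended one word at a time, scored by (made-up) acoustic and bigram language-model probabilities, and all but the best few are pruned at every step, so the combinatorial explosion never materializes:

```python
import math

def beam_search(step_scores, bigram, beam_width=3):
    """Extend word-sequence hypotheses step by step, scoring each extension
    with an acoustic score plus a bigram language-model score, and keep only
    the best few hypotheses at every step."""
    beams = [([], 0.0)]                              # (word sequence, log score)
    for acoustic in step_scores:                     # dict of word -> acoustic log prob
        candidates = []
        for words, score in beams:
            prev = words[-1] if words else "<s>"
            for w, ac in acoustic.items():
                lm = math.log(bigram.get((prev, w), 1e-6))   # unseen pairs get a tiny prob
                candidates.append((words + [w], score + ac + lm))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

# Made-up scores for a five-word utterance.
bigram = {("<s>", "open"): 0.2, ("open", "the"): 0.5, ("the", "pod"): 0.1,
          ("pod", "bay"): 0.6, ("bay", "doors"): 0.7}
step_scores = [{"open": -0.2, "hope": -0.3}, {"the": -0.1, "a": -0.9},
               {"pod": -0.4, "pot": -0.5}, {"bay": -0.3, "day": -0.6},
               {"doors": -0.2, "floors": -0.7}]
print(" ".join(beam_search(step_scores, bigram)))    # open the pod bay doors
```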
Moore's law states that computing speeds and densities double every eighteen months; it is the driving force behind a revolution so vast that the entire computer revolution to date represents only a minor ripple of its ultimate implications. It was first articulated in the mid-1960s by Dr. Gordon Moore. Moore's law is actually a corollary of a broader law I like to call Kurzweil's law, which concerns the exponentially quickening pace of technology going back to the dawn of human history. A thousand years ago, not much happened in a century, technologically speaking. In the nineteenth century, quite a bit happened. Now major technological transformations occur in a few years' time. Moore's law, a clear quantification of this exponential phenomenon, indicates that the pace will continue to accelerate.
Remarkably, this law has held true since the beginning of this century. It began with the mechanical card-based computing technology used in the 1890 census, moved to the relay-based computers of the 1940s, to the vacuum tube-based computers of the 1950s, to the transistor-based machines of the 1960s, and to all the generations of integrated circuits we've seen over the past three decades. If you chart the abilities of every calculator and computer developed in the past hundred years logarithmically, you get an essentially straight line. Computer memory, for example, is about sixteen thousand times more powerful today for the same unit cost than it was in about 1976 and is a hundred and fifty million times more powerful for the same unit cost than it was in 1948.
Moore's law will continue to operate unabated for many decades to come; we have not even begun to explore the third dimension in chip design. Today's chips are flat, whereas our brain is organized in three dimensions. We live in a three-dimensional world; why not use the third dimension? (Present-day chips are made up of a dozen or more layers of material that construct a single layer of transistors and other integrated components. A few chips do utilize more than one layer of components but make only limited use of the third dimension.) Improvements in semiconductor materials, including superconducting circuits that don't generate heat, will enable us to develop chips--that is, cubes--with thousands of layers of circuitry that, combined with far smaller component geometries, will improve computing power by a factor of many millions. There are more than enough new computing technologies under development to assure us of a continuation of Moore's law for a very long time. So, although some people argue that we are reaching the limits of Moore's law, I disagree. (See David Kuck's detailed analysis in chapter 3.)
Moore's law provides us with the infrastructure--in terms of memory, computation, and communication technology--to embody all our knowledge and methodologies and harness them on inexpensive platforms. It already enables us to live in a world where all our knowledge, all our creations, all our insights, all our ideas, and all our cultural expressions--pictures, movies, art, sound, music, books and the secret of life itself--are being digitized, captured, and understood in sequences of ones and zeroes. As we gather and codify more and more knowledge about the hierarchy of spoken language from speech sounds to subject matter, Moore's law will provide computing platforms able to embody that knowledge. At the front end, it will let us analyze a greater number of frequency bands, ultimately approaching the exquisite sensitivity of the human auditory sense to frequency. At the back end, it will allow us to take advantage of vast linguistic data bases.
Like many computer-science problems, recognizing human speech suffers from a number of potential combinatorial explosions. As we increase vocabulary size in a continuous-speech system, for example, the number and length of possible word combinations increases geometrically. So making linear progress in performance requires us to make exponential progress in our computing platforms. But that is exactly what we are doing.

Some Predictions

Based on Moore's law, and the continued efforts of over a thousand researchers in speech recognition and related areas, I expect commercial-grade continuous-speech dictation systems for restricted domains, such as medicine or law, to appear in 1997 or 1998. And, soon after, we will be talking to our computers in continuous speech and natural language to control personal-computer applications. By around the turn of the century, unrestricted-domain, continuous-speech dictation will be the standard. An especially exciting application of this technology will be listening machines for the deaf, analogous to reading machines for the blind. They will convert speech into a display of text in real time, thus achieving Alexander Graham Bell's original vision a century and a quarter later.
Translating telephones that convert speech from one language to another (by first recognizing speech in the original language, translating the text into the target language, then synthesizing speech in the target language) will be demonstrated by the end of this century and will become common during the first decade of the twenty-first century. Conversation with computers that are increasingly unseen and embedded in our environment will become routine ways to accomplish a broad variety of tasks.
In a classic paper published in 1950, Alan Turing foretold that by early in the next century society would take for granted the pervasive intervention of intelligent machines. This remarkable prediction--given the state of hardware technology at that time--attests to his implicit appreciation of Moore's law. For reasons that should be clear from our discussion, creating a machine with HAL's ability to understand spoken language requires a level of intelligence and mastery of knowledge that spans the full range of human cognition. When we test our own ability to understand spoken words out of context (i.e., spoken in a random, nonsense order), we find that the accuracy of speech recognition diminishes dramatically, compared to our understanding of words spoken in a meaningful order. Once, as an experiment, I walked into a colleague's office and said "Pod 3BA." My colleague's response was "What?" When I asked him to repeat what I had said, he couldn't. HAL, of course, has little difficulty understanding this phrase when Dave asks him to prepare Pod 3BA; it makes sense in the context of that conversation, and we human viewers of the movie easily understood it too.
Understanding spoken language uses the full range of our intelligence and knowledge. Many observers (including some authors of chapters in this book) predict that machines will never achieve certain human capabilities--including the deep understanding of language HAL appears to possess. If by the word never, they mean not in the next couple of decades, then such predictions might be reasonable. If the word carries its usual meaning, such predictions are shortsighted in my view, reminiscent of predictions that "man" would never fly or that machines would never beat the human world chess champion.
With regard to Moore's law, the doubling of semiconductor density means that we can put twice as many processors (or, alternatively, a processor with twice the computing power) on a chip (or comparable device) every eighteen months. Combined with the doubling of speed from shorter signaling distances, such increases may actually quadruple the power of computation every eighteen months (that is, double it every nine months). This is particularly true for algorithms that can benefit from parallel processing. Most researchers anticipate the next one or two turns of Moore's screw; others look ahead to the next four or five turns. But Moore's law is inexorable. Taking into account both density and speed, we are presently increasing the power of computation (for the same unit cost) by a factor of sixteen thousand every ten years, or 250 million every twenty years.
Consider, then, what it would take to build a machine with the capacity of the human brain. We can approach this issue in many ways; one is simply to continue our dogged codification of knowledge and skill on yet-faster machines. Undoubtedly this process will continue. The following scenario, however, takes a somewhat different approach to building a machine with human-level intelligence and knowledge--that is, building HAL. Note that I've simplified the following analysis in the interest of space; it would take a much longer article to respond to all of the anticipated objections. The human brain uses a radically different computational paradigm than the computers we're used to. A typical computer does one thing at a time, but does it very quickly. The human brain is very slow, but every part of its net of computation works simultaneously. We have about a hundred billion neurons, each of which has an average of a thousand connections to other neurons. Because all these connections can perform their computations at the same time, the brain can perform about a hundred trillion simultaneous computations. So, although human neurons are very slow--in fact, about a million times slower than electronic circuits--this massive parallelism more than makes up for their slowness. Although each interneuronal connection is capable of performing only about two hundred computations each second, a hundred trillion computations being performed at the same time add up to about twenty million billion calculations per second, give or take a couple of orders of magnitude.
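As a quick check of that estimate, using the round numbers in the text:

```python
neurons = 100e9               # about a hundred billion neurons
connections_per_neuron = 1e3  # roughly a thousand connections each
calcs_per_second = 200        # each connection: about two hundred computations per second

simultaneous_connections = neurons * connections_per_neuron   # ~1e14 connections
print(simultaneous_connections * calcs_per_second)            # ~2e16, i.e., twenty million billion
```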
Calculations like these are a little different from conventional computer instructions. At present, we can simulate on the order of two billion such neural-connection calculations per second on dedicated machines. That's about ten million times slower than the human brain. A factor of ten million is a large gap, and it is one reason why present computers are dramatically more brittle and restricted than human intelligence. Some observers, looking at this difference, conclude that human intelligence is so much more supple and wide-ranging than computer intelligence that the gap can never be bridged.
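The factor of ten million is simply the ratio of the two rates under the estimates above:

$$
\frac{2 \times 10^{16}\ \text{calculations per second (brain)}}{2 \times 10^{9}\ \text{calculations per second (dedicated machine)}} = 10^{7}.
$$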
Yet a factor of ten million, particularly for the kind of massively parallel processing the human brain employs, will be bridged by Moore's law in about two decades. Of course, matching the raw computing speed and memory capacity of the human brain--even if implemented in massively parallel architectures--will not automatically result in human-level intelligence. The architecture and organization of these resources are even more important than their capacity. There is, however, a source of knowledge we can tap to greatly accelerate our efforts to design machine intelligence. That source is the human brain itself. Probing the brain's circuits will let us, essentially, copy a proven design--that is, reverse-engineer one that took its original designer several billion years to develop. (And it's not even copyrighted, at least not yet.)
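To see why two decades is the right order of magnitude (again, my arithmetic, using the chapter's assumptions), express the gap as a number of doublings:

$$
10^{7} \approx 2^{23.3}, \qquad 23.3\ \text{doublings} \times 9\ \text{to}\ 18\ \tfrac{\text{months}}{\text{doubling}} \approx 17\ \text{to}\ 35\ \text{years}.
$$

On the aggressive nine-month doubling discussed earlier, the gap closes in well under two decades; even on the conservative eighteen-month figure, it closes within a single human generation.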
Reverse engineering the brain may seem like a daunting effort, but ten years ago so did the Human Genome Project. Nonetheless, the entire human genetic code will soon have been scanned, recorded, and analyzed, accelerating our understanding of the human biogenetic system. A similar effort to scan and record (and perhaps to understand) the neural organization of the human brain could provide the templates of intelligence. As we approach the computational ability needed to simulate the human brain--we're not there today, but we will be early in the next century--I believe researchers will initiate such an effort.
There are already precursors of such a project. For example, a few years ago Carver Mead's company, Synaptics, created an artificial retina chip that is, essentially, a silicon copy of the neural organization of the human retina and its visual-processing layer. The Synaptics chip even uses digitally controlled analog processing, as the human brain does.
How are we going to conduct such a scan? Again, although a full discussion of the issue is beyond the scope of this chapter, we can mention several approaches. A "destructive" scan could be made of a recently deceased frozen brain; or we could use high-speed, high-resolution magnetic resonance imaging (MRI) or other noninvasive scanning technology on the living brain. MRI scanners can already image individual somas (i.e., neuron cell bodies) without disturbing living tissue. The more-powerful MRIs being developed will be capable of scanning individual nerve fibers only ten microns in diameter. Eventually, we will be able to automatically scan the presynaptic vesicles (i.e., the synaptic strengths) believed to be the site of human learning.
This ability suggests two scenarios. The first is that we could scan portions of a brain to ascertain the architecture of interneuronal connections in different regions. The exact position of each nerve fiber is not as important as the overall pattern. Using this information, we could design simulated neural nets that operate in a similar fashion. This process will be rather like peeling an onion, as each layer of human intelligence is revealed. That is essentially the procedure Synaptics followed: they copied an essential analog algorithm, called center-surround filtering, that is also found in the early layers of mammalian visual processing.
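To make the center-surround idea concrete, here is a minimal sketch in Python (my illustration under simple assumptions, not the Synaptics circuit or any code from the chapter): a small kernel whose center excites and whose surround inhibits, so that a uniform region produces no response while spots and edges do.

```python
import numpy as np

# A 3x3 center-surround kernel (difference-of-Gaussians style): the center
# pixel excites, the eight surrounding pixels inhibit.  The weights sum to
# zero, so a uniform region produces no response.  Purely illustrative.
KERNEL = np.array([
    [-1.0, -1.0, -1.0],
    [-1.0,  8.0, -1.0],
    [-1.0, -1.0, -1.0],
]) / 8.0


def center_surround(image: np.ndarray) -> np.ndarray:
    """Convolve a 2-D grayscale image with the center-surround kernel.

    Edges wrap around (via np.roll), which is acceptable for a toy example.
    """
    out = np.zeros_like(image, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            shifted = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
            out += KERNEL[dy + 1, dx + 1] * shifted
    return out


if __name__ == "__main__":
    # A dark field with a small bright spot: the filter is silent over the
    # uniform background and responds around the edges of the spot.
    img = np.zeros((8, 8))
    img[3:5, 3:5] = 1.0
    print(center_surround(img).round(2))
```

A retina chip performs this computation in parallel analog hardware across the whole image at once; the explicit loop here is just the simplest way to express the same convolution in software.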
The second scenario--more difficult, but still ultimately feasible--would be to noninvasively scan someone's brain to map the locations, interconnections, and contents of the somas, axons, dendrites, presynaptic vesicles, and other neural components. The entire organization of the brain--including the contents of its memory--could then be re-created on a neural computer of sufficiently high capacity.
Today we can peer inside someone's brain with MRI scanners whose resolution increases with each new generation. However, a number of technical challenges to complete brain-mapping remain, including achieving suitable resolution and bandwidth, freedom from vibration, and safety. For a variety of reasons, it will be easier to scan the brain of someone recently deceased than that of a living person. Yet noninvasively scanning a living brain will ultimately become feasible as the resolution and speed of MRI and other scanning technologies improve. Here too the driving force behind future rapid improvements is Moore's law, because quickly building high-resolution three-dimensional images from the raw data an MRI scanner produces requires massive computational ability.
Perhaps you think this discussion is veering off into the realm of science fiction. Yet, a hundred years ago, only a handful of writers attempting to predict the technological developments of this past century foresaw any of the major forces that have shaped our era: computers, Moore's law, radio, television, atomic energy, lasers, bioengineering, or most electronics--to mention a few. The century to come will undoubtedly bring many technologies we would have similar difficulty envisioning, or even comprehending today. The important point here is, however, that the projection I am making now does not contemplate any revolutionary breakthrough; it is a modest extrapolation of current trends based on technologies and capabilities that we have today. We can't yet build a brain like HAL's, but we can describe right now how we could do it. It will take longer than the time needed to build a computer with the raw computing speed of the human brain, which I believe we will do by around 2020. By sometime in the first half of the next century, I predict, we will have mapped the neural circuitry of the brain.
Now the ability to download your mind to your personal computer will raise some interesting issues. I'll only mention a few. First, there's the philosophical issue. When people are scanned and then re-created in a neural computer, who will the people in the machine be? The answer will depend on whom you ask. The "machine people" will strenuously claim to be the original persons; they lived certain lives, went through a scanner here, and woke up in the machine there. They'll say, "Hey, this technology really works. You should give it a try." On the other hand, the people who were scanned will claim that the people in the machine are impostors, different people who just appear to share their memories, histories, and personalities.
A related issue is whether or not a re-created mind--or any intelligent machine, for that matter--is conscious. This question too goes beyond the scope of the chapter, but I will venture a brief comment. There is, in fact, no objective test of another entity's subjective experience; an entity can argue convincingly that it feels joy and pain (perhaps it even "feels your pain"), but that is not proof of subjective experience. HAL himself makes such a claim when he responds to the BBC interviewer's question.
HAL: I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do.
Of course, HAL's telling us he's conscious doesn't settle the issue, as Dan Dennett's engaging discussion of these questions demonstrates (see chapter 16).
Then there's the ethical issue. Will it be immoral, or even illegal, to cause pain and suffering to your computer program? Again, I refer the reader to Dennett's chapter. Few of us worry much about these issues now, for our most advanced programs today are comparable to the minds of insects. However, when they attain the complexity and subtlety of the human mind--as they will in a few decades--and when they are in fact derived from human minds or portions of human minds, this will become a pressing issue.
Before Copernicus, our speciecentricity was embodied in the idea that the universe literally circled around us as a testament to our unique status. We no longer see our uniqueness as a matter of celestial relationships but of intelligence. Many people see evolution as a billion-year drama leading inexorably to its grandest creation: human intelligence. Like the Church fathers, we are threatened by the specter of machine intelligence that competes with its creator.
We cannot separate the full range of human knowledge and intelligence from the ability to understand human language, spoken or otherwise. Turing recognized this when, in his famous Turing test, he made communication through language the means of ascertaining whether a human-level intelligence is a machine or a person. HAL understands human spoken language about as well as a person does; at least, that's the impression we get from the movie. We do not stand on the threshold of that level of machine proficiency today. Still, machines are quickly gaining the ability to understand what we say, as long as we stay within certain limited but useful domains. Until HAL comes along, we will be talking to our computers to dictate written documents, obtain information from databases, command a diverse array of tasks, and interact with an environment that increasingly intertwines human and machine intelligence.
Copyright (C) 1997. Reproduced with permission from MIT Press.