A few weeks ago, Colby Cosh—a friend of a friend of sorts who ordinarily writes reasonable things for a chap who still thinks the Edmonton Oilers are a real sports team—penned an article in his Maclean’s blog about Watson, IBM’s Jeopardy!-playing machine (“I’ll take ‘Cheap Publicity Stunts’ for $1000, Alex”, 16 January 2011), that I found to be dreadfully uninformed. The thrust of his argument is that Watson is a corporate “gimmick”—a fancy plea for media coverage by the faceless villains at IBM, with nothing of scientific interest going on underneath. Keep in mind that by the standards of this article, nothing in the “perpetually disappointing history of AI” will ever be interesting until we’ve graduated from tightly delimited objectives to Big Problems like the Turing Test:
Every article about Watson, IBM’s Jeopardy!-playing device, should really lead off with the sentence “It’s the year 2011, for God’s sake.” In the wondrous science-fiction future we occupy, even human brains have instant broadband access to a staggeringly comprehensive library of general knowledge. But the horrible natural-language skills of a computer, even one with an essentially unlimited store of facts, still compromise its function to the point of near-parity in a trivia competition against unassisted humans.
This isn’t far off from saying that particle physics will be perpetually disappointing until we’ve observed the Higgs boson, or that manned spaceflight is merely an expensive publicity stunt that will never be scientifically interesting until we’ve colonized the Moon: it leans heavily on popular culture as the ultimate barometer of scientific achievement, and it requires both ignorance of methodology and apathy towards specifics.
Colby and I had a five-minute skirmish about the article on Twitter, which as a format for debate is unwieldy as piss. I promised a proper response as soon as I cleared some other priorities off my plate. Those other priorities are still, to my annoyance, on my plate; but having finally paid good money to register my copy of MarsEdit, I’m thirsting for a scrap.
This topic will do as well as any. Reluctant as I am to swing the pretentious hammer of “I know what I’m talking about,” this really is (as the idiom goes) a chance for Faramir, Captain of Gondor, to show his quality. Computational linguistics happens to be my onetime research area, popular misunderstanding of science happens to be one of my favourite bugbears, and Kasparov’s anticomputer strategies against Deep Blue happened to make a cameo appearance in the meandering slop of my master’s dissertation. None of this matters a great deal, mind you. One should never be dismissive of journalists from a position of relative expertise; they’re the ones people actually read, and it’s vital to engage with what they say.
(It is a little game we play: they put it on the bill, I tear up the bill.)
When simplifications attack
What concerns me is not so much Colby’s perspective as a non-expert (invaluable), his resort to the familiar hand-waving sophistries of Dreyfus and Searle (expected), or even whether I should call him Colby when I don’t really know the fellow and haven’t gotten around to amending my unwritten style guide to arbitrate matters of semi-personal address (pedantic). The bigger problem, one that is endemic in journalism about science, is his exclusive reliance on popular simplifications by corporate PR, other journalists, and cherry-picked philosophers for pictures of what AI research is all about.
Surely it wouldn’t have hurt to consult a real computing scientist; there are plenty of those to choose from the public sector with no vested interest in the fortunes of IBM. The only thing this would have jeopardized is a premeditated thesis founded on dismissive assertions about the entire field of research. Why talk to someone credible when they’re unlikely to agree with you?
Here, there are several bad assertions in play—all of which are traceable to the selective consultation of sources.
Let’s consider this one paragraph alone—the crux of Colby’s entire argument that nothing terribly fascinating is going on inside the box:
Jeopardy!, after all, doesn’t demand that much in the way of language interpretation. Watson has to, at most, interpret text questions of no more than 25 or 30 words—questions which, by design, have only a single answer. It handles puns and figures of speech impressively, for a computer. But it doesn’t do so in anything like the way humans do. IBM’s ads would have you believe the opposite, but it bears emphasizing that Watson is not “getting” the jokes and wordplay of the Jeopardy! writers. It’s using Bayesian math on the fly to pick out key nouns and phrases and pass them to a lookup table. If it sees “1564” and “Pisa”, it’s going to say “Galileo”.
Now let’s put some numbers beside the assertions:
1. Jeopardy! is a trivia game, and all there is to trivia is looking up keywords. We know computers can do that.
2. When Watson handles wordplay, it doesn’t do it like humans do. It isn’t really thinking; it doesn’t really understand the puns. Furthermore, this somehow matters.
3. IBM would like us to believe that Watson really gets the jokes. If Watson doesn’t really get the jokes, the project is a hollow exercise in corporate self-promotion.
The first assertion vastly understates the complexity of what Jeopardy! demands. The nature of the game—a time-constrained, multi-agent affair—radically alters the straightforward problem of answering a question (or in this case, questioning an answer). Even simple pattern-matching is far from trivial when every millisecond counts.
Let’s run with Colby’s caricature for a moment. With a database of facts as gargantuan as the one Watson requires, looking up “1564” in conjunction with “Pisa” is a surprisingly time-consuming task, never mind the inference to Galileo’s date of birth. This isn’t something tractable via faster processors or larger memory banks: there are theoretical lower bounds on the efficiency of searching and sorting algorithms in proportion to the dataset’s size. Exhaustive traversals that perform perfectly on small scales are out of the question here. The algorithms have to take shortcuts and make approximate guesses. Semantic associations must be efficiently structured in the software’s abstract maps as well as the physical database in order to best distribute searches in parallel. When you consider these factors, drawing semantic inferences from the natural-language clues becomes a heuristic necessity if the approximate search queries are to be any good.
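To make the caricature concrete, here is a minimal Python sketch of what "looking up 1564 and Pisa" might involve at its very simplest: an inverted index that maps each keyword to the passages containing it, with candidates found by intersecting the matches. Everything in it is an assumption made for illustration (the three-line corpus, the whitespace tokenizer, the names `build_index` and `lookup`); none of it is drawn from Watson or DeepQA.

```python
from collections import defaultdict

# Hypothetical toy corpus; a real system would hold millions of passages.
CORPUS = {
    "galileo": "Galileo Galilei was born in Pisa in 1564",
    "shakespeare": "William Shakespeare was born in Stratford in 1564",
    "fibonacci": "Leonardo Fibonacci of Pisa wrote Liber Abaci in 1202",
}


def build_index(corpus):
    """Map each lowercased token to the set of passage ids containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def lookup(index, keywords):
    """Return ids of passages containing every keyword (set intersection)."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return set()
    # Intersect the smallest sets first so intermediate results stay small.
    postings.sort(key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result


INDEX = build_index(CORPUS)
print(lookup(INDEX, ["1564", "Pisa"]))
```

Either keyword alone is ambiguous in this toy corpus (1564 also matches Shakespeare, Pisa also matches Fibonacci); only the conjunction narrows the field to one candidate. Even so, the sketch needs an index built ahead of time, and a real clue rarely hands over its keywords this cleanly.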
Crucially, the time constraint on a response is not a static value, but a dynamic one that depends on the performance of the other competitors. This is why a match against the most successful Jeopardy! players in history is an essential proof of concept. Every contestant who appears on the television show has to pass a solo audition first, and any of them could tell you—particularly if they meet with little success—that in a competitive setting, the game becomes a different kettle of fish.
This is to say nothing of the other decisions Watson has to make in order to be competitive in a live test. It has to assess the risk of answering a question, considering not only its confidence in its own correctness but the standing scores of both itself and the other players. It has to set wagers for Daily Doubles and Final Jeopardy, which requires an assessment of confidence based on the category title alone; in the case of a Daily Double, this will also have to consider the money still up for grabs on the board. One of the reasons Ken Jennings had such an astonishing run on the show was that he was able to make excellent strategic wagers on the fly.
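A toy version of these decisions can be written down in a few lines of Python. This is a sketch under loudly stated assumptions, not Watson's actual strategy: the function names, the break-even threshold of 0.5, and the `risk_shift` of 0.15 are all invented here for illustration. The one piece that is standard game arithmetic is the Final Jeopardy "cover" bet, the smallest wager that keeps a correct leader ahead of a runner-up who doubles their score.

```python
def expected_gain(confidence, clue_value):
    """Expected score change from answering: gain the clue's value with
    probability `confidence`, lose it otherwise."""
    return confidence * clue_value - (1.0 - confidence) * clue_value


def should_buzz(confidence, my_score, best_opponent_score, risk_shift=0.15):
    """Hypothetical buzz policy: start from the break-even confidence of 0.5,
    demand more when leading (protect the lead) and accept less when
    trailing (a gamble is needed to catch up)."""
    threshold = 0.5
    if my_score > best_opponent_score:
        threshold += risk_shift
    elif my_score < best_opponent_score:
        threshold -= risk_shift
    return confidence >= threshold


def cover_wager(my_score, second_place_score):
    """Smallest Final Jeopardy wager that still wins if the leader answers
    correctly and the runner-up doubles their score."""
    if my_score <= second_place_score:
        raise ValueError("cover_wager assumes the caller is in the lead")
    return max(0, 2 * second_place_score - my_score + 1)


print(expected_gain(0.75, 400))
print(should_buzz(0.6, my_score=10000, best_opponent_score=5000))
print(cover_wager(20000, 12000))
```

Even this cartoon shows that the same confidence estimate can lead to opposite decisions depending on the scoreboard, which is why playing the game well is a different problem from merely knowing the answers.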
Contrary to what Colby suggests, if the structured decomposition of the process of taking a Jeopardy! clue all the way from answer to question is able to match and surpass the blazing speed of human intuition at its best, that would be a tremendous accomplishment indeed. Without the capacity to parse natural language in terms of meaningful semantic chunks—a task well beyond mere symbol manipulation—Watson wouldn’t have a prayer of displaying a fraction of the competence that it has already shown.
Trapped in the Chinese Room
The second assertion is a real howler, and one that has become downright boring to swat aside over the course of the past thirty years. That’s right, folks: say hello to John Searle and the Chinese Room. The Chinese Room objection to AI is this: a computer translating between English and Chinese is like an English speaker who knows no Chinese, but who sits in a room looking up symbol tables and matching the syntactic elements correctly. Even if the translation looks perfect to the outsider, argued Searle, you couldn’t say that the symbol-manipulating translator (i.e. the computer) understands Chinese.
In a general sense, the Chinese Room stands for a whole class of arguments that boil down to saying, it doesn’t matter how well the computer performs—it’s not really thinking because on the inside, it’s not processing information in the same way humans do. Colby makes an argument about Watson identical to the Chinese Room when he says that the system doesn’t “get” the jokes and puns in Jeopardy!’s more puzzling clues. Apparently, it doesn’t matter if Watson solves the clues correctly: it still isn’t behaving like a human inside the box, so the whole shebang is all just smoke and mirrors.
The logic of the Chinese Room is spurious in many respects, and I won’t go through all of the embedded fallacies here. For those of you new to the debate, here are two of the more serious ones. The first is that the analogy is false. The appeal of the argument comes from how it personifies a particular component of the system to highlight its dissimilarity to real human understanding. This fallacy endures unchecked because its proponents are free to move the goalposts however they like: no matter how robust the system is, the critics can isolate a piece of the syntactic machinery, put a human face on it, and complain about the absence of high-level, humanlike semantics. The second fallacy lies in the deceptive assertion that the syntactic internals of a computer are completely unlike the internals of the human mind. In truth, we still know next to nothing about how the latter works. Our understanding of how we get from the low-level operations of neuroscience to the high-level processes of cognitive psychology is at least as discontinuous as our best notions of how semantic structures might emerge from the symbolic structures of computer systems.
I alluded to this in my initial salvo on Twitter:
To which Colby offered this astonishing reply:
@Nicholas_Tam It’s got nada to do with the Chinese room. The Turing test is the one most everyone agrees on & there’s NO progress toward it.
Completely apart from the fact that one of Colby’s objections was precisely the Chinese Room, there’s a logical contradiction here along with a factual error. (Not bad, all in all, for 140 characters or less.) The contradiction arises from the failure to distinguish between external behaviours and internal thought processes. Let’s suppose, for a moment, that the goal for whichever AI system we’re talking about is to pass the Turing Test—that is, to be misidentified as the human being in a double-blind question-and-answer test where the questioner knows that one respondent is human and the other is a machine. If you read the original paper in Mind where Alan Turing introduced his “imitation game”, Turing’s whole point was to black-box the internals and take them out of the picture. The premise of the Turing Test is that if you can’t tell the difference between man and machine in terms of external behaviour, then functionally there may as well be no difference at all; this suffices as intelligence.
The Chinese Room argument, on the other hand, is a direct attack on the validity of the Turing Test. It seeks to establish that thoughts don’t supervene on actions: that is to say, identical external behaviours do not imply identical internal machinations.
Turing’s and Searle’s positions are more or less incommensurable. You can’t have it both ways. You can’t hold up the Turing Test (which is entirely about exterior performance) as the standard of achievement while griping, as Searle does, that even in a successful performance that passes for humanlike, symbol manipulation doesn’t really count. Contrariwise, Turing ventured that if a machine’s behaviour is indistinguishable from a human’s, it’s pointless to squabble over whether it qualifies as intelligent; from the available evidence, we might as well treat it as such.
If you accept the Chinese Room argument—and you really shouldn’t—the only function of bringing up the Turing Test at all is to set up a straw man. It has not escaped me that this may have been the intent.
Acting inside the box
Unfortunately for this transparent rhetorical tactic, the Turing Test is not the accepted benchmark for artificial intelligence research, nor is it even a commonly desired objective. AI is not one monolithic project that either has or hasn’t been achieved.
The goals of AI research have historically diversified along two separate axes (a schema for thinking about AI that most students of intelligent systems pick up from Russell and Norvig). The first key distinction is between acting (what a system does on the outside) and thinking (how a system gets there on the inside). The second distinction is between performing like humans and performing rationally or optimally (which may be entirely unlike humans, but may provide solutions to well-defined problems that outstrip the capacities of human agents).
This yields four quadrants that loosely circumscribe your garden-variety intelligent agents: systems that aim to think like humans, act like humans, think rationally, or act rationally. (Think of these categories more as design goals than as discrete kinds of agents, which in practice lie all over the map.) The first quadrant, systems that think like humans, is the area of interest for much of cognitive science. This is the type of system that the Chinese Room argument contends will in principle never succeed; Hubert Dreyfus’s objection, the thesis that human thought is fundamentally unformalizable, applies specifically to this category as well. The second quadrant, systems that act like humans, is the one where the Turing Test applies.
It must be said that the Turing Test is relevant here with specific reference to the indistinguishability of external behaviours—not to the requirement of aptitude in natural languages, as Colby seems to believe. Turing’s original imitation game was framed purely in terms of language, which remains an overwhelming challenge to this day, but it has since been expanded to other problem domains. (Jeopardy! is one of them.) To pluck out one example, natural language is hardly suitable as a test for computer vision, the branch of AI concerning how computers can perceive objects in photographs or positions in 3D space from the raw data of images. It would be preposterous to say that a robust system in computer vision fails as AI, or to marginalize its significance as a scientific accomplishment, simply because it can’t pass for a human on the telephone.
Natural language is a particular problem domain—indeed, an umbrella category for all sorts of subproblems that are fascinating in their own right. It is not the essence of the Turing Test, nor is there any consensus that linguistic aptitude is the essence of intelligence.
It’s convenient for our discussion, however, that Jeopardy! involves natural language to the extent that it does. It should attract comparisons to Turing’s imitation game, and it has. Yet it bears mentioning that whether a system is really thinking is a completely incidental consideration for the vast majority of practical work in AI, just as it was for Turing. Nobody says, “Let’s build a system that possesses general intelligence.” What they actually say is this: “Let’s identify a chunky, intuitive problem that demands high-level thought and see if we can’t build a system to break it down and tackle it.”
Watson’s aim is clear: perform well enough in Jeopardy! to defeat the best human players. Any consequences for our beliefs about the nature of human intelligence are byproducts and not the ultimate goal. That said, it is perfectly valid to speak of a Jeopardy! Turing Test. Watson would clearly fail the test not if it fell short of champion-level play, but if it ventured solutions to clues that don’t even make sense as guesses. (Consider the early test at about 1:50 into this video. The clue, from the category on I Love Lucy: “It was Ricky’s signature tune and later the name of his club.” Watson: “What is song?”)
But if indistinguishability from human-level performance is what we are looking for, Watson is already doing fairly well. There is a very important difference between defeating humans in Jeopardy! and passing for a human player, although the goals are intertwined. There is an even wider gulf between passing for a human Jeopardy! player and passing for a human being in toto. Everybody knows the latter goal is as far off as colonizing Mars, and nowhere in the promotional materials does IBM suggest otherwise.
Colby has a problem with this:
So why, one might ask, are we still throwing computer power at such tightly delimited tasks, ones that lie many layers of complexity below what a human accomplishes in having a simple phone conversation?
And one might also ask, why study nuclear physics when we seem to be no closer to harnessing fusion power than we were fifty years ago? First of all, in both cases, we are substantially closer in terms of how we understand the problem, even if our estimates for when the endpoint will show up on the horizon haven’t necessarily shortened. The achievements that scientists think of as the most significant may not be fixtures in popular culture, but that doesn’t mean they were pointless. Far more importantly: computing science, like nuclear physics, is inherently interesting. Designing AI systems for delimited problem spaces is an activity that leads us to all sorts of discoveries about the nature and structure of those problems, and of the minutiae of problem-solving processes in general. We learn all sorts of things about comparative strategies for structuring, representing, and manipulating information—and how they measure up to the relatively black-boxed processes of human minds.
So to answer Colby’s question:
@Nicholas_Tam So we can’t test AI by scrutiny of interior process OR the curtained-black-box Turing test? What’s left, religious revelation?
We “test” AI in the context of its performance with respect to well-defined goals. Those goals could certainly involve a Turing Test, be it for answering natural-language questions or some other specified task. Whether an artificial system has a human-like mind of its own, along with everything that implies—consciousness, self-awareness, semantic understanding—is a problem we leave to the philosophers; and no, it’s not empirically testable. But neither is the problem of whether other humans have minds.
The inverted pyramid scheme
Now let us turn to the third assertion: that IBM is making outlandish promotional claims that oversell Watson in the name of fuelling a publicity blitz.
What does it mean to say that something is a “gimmick”? We mean to accuse it of being all dressing and no salad. We mean to expose its failure to accomplish what we are told it does on the surface. We mean to insist that we will not be duped into believing that something humdrum is, in truth, extraordinary.
The trouble for Colby’s argument is that Watson is extraordinary—just not in the way that he thinks IBM has misled him to expect. “AI researchers have arguably the highest conceivable standards to meet when it comes to thinking about thinking,” remarked one commenter at Maclean’s, “and it’s hard to fault them for failing to live up to the naive expectations of science fiction.” Colby replied: “By ‘the naive expectations of science fiction’ I presume you mean ‘the naive expectations deliberately created by IBM promotional materials and employees’.”
I received a similar response:
@Nicholas_Tam Maybe you should look at the IBM ads. Your claims for Watson are a LOT more modest than theirs.
At the time of our repartee, I was admittedly only familiar with IBM’s own materials in passing; most of what I knew about Watson was from sources that discussed it in greater detail. I found it odd that Colby’s point of engagement was exclusively with the advertising and not the technology itself, but this was understandable: he was making a statement about hype, after all, and it’s very common nowadays that the implications of scientific accomplishments are exaggerated in the public sphere. (Refer to Jorge Cham’s excellent illustration of the science news cycle, which concerns university research but applies equally well to corporate and governmental laboratories.)
By and large, this is a product of two sets of behaviour—one on the part of journalistic reporting, the other on the part of the research organizations. Let’s begin with the journalists.
The dominant template for journalistic narrative is the inverted pyramid: begin with the most important information, and continue to points that are less and less essential on the assumption that the reader could stop at any time. (Before the age of desktop publishing, this also made it easy for newspaper editors to literally snip away the last paragraph or two when assembling the columns on the page.) The trouble is the gulf between what journalists deem most relevant to non-expert readers and what scientists consider to be important contributions to their field.
The end result is sensationalism—and too many articles about science wind up looking like Martin Robbins’ parody. They begin with far-reaching implications that may or may not be related to the research at hand, and work their way down to the specifics that matter most. This is a narrative framework that is seriously divorced from the reality of research, which operates on the level of local challenges and goals. (This post by Greg Lusk on the inverted pyramid and the conflicting priorities of journalists and scientists is highly relevant here.)
Because long-term, big-picture implications like the performance gap between artificial and human intelligence (in Watson’s case) become the centrepiece of the story, they become the focus of media attention and debate, often with no consideration of the specifics of what has been accomplished. And this is why we see casual expressions of dissent like Colby Cosh’s criticisms of Watson: wildly off the mark, selectively researched from Wikipedia with an a priori verdict already in mind, and laced with a sprinkle of pseudo-expert mumbling about Bayesian combinatorics that are far more involved than the author makes them out to be. Criticisms like these respond to the news stories, not to the science.
Of greed and gimmickry
Colby is convinced, however, that his projected misunderstandings of what Watson claims to achieve are fundamentally IBM’s fault. And it’s no use pretending that IBM isn’t a self-interested organization: like NASA in their recent fiasco over arsenic-based lifeforms (a discredited paper, but one that was widely misreported when people still thought it looked shipshape), if people take their promotional materials and statements to the press the wrong way, they have no incentive to correct anyone so long as their project is still in a positive light. Watson is a proof of concept for IBM’s enterprise hardware and the DeepQA question-answering system, both of which the company intends to license and sell.
Not all of the problems with science journalism are the fault of journalists: research laboratories, public as well as private, are often complacent about inaccuracies in secondary reporting because the attention (and the concomitant prospects for funding) is too attractive to throw away.
Let’s be very clear about one thing, however: IBM’s profit motive as an organization does not negate the intellectual interests of its researchers. As fashionable as it is these days to appeal to the trope of corporations that are only responsible to their shareholders and therefore can’t be interested in anything but the bottom line, the truth is that corporate laboratories in private industry are invaluable centres of research. Projects like Watson attract contributions from university scientists not because they all want to see IBM succeed, and not even necessarily because the pay is so much better (though it is), but because they provide access to hardware that enables large-scale work. Computing scientists in industry are taken every bit as seriously as their compatriots in the university world, and the two regularly cooperate on grand initiatives.
But what does that say about the marketing? Complacency aside, is IBM actively making Watson sound like a much bigger deal than it is?
I have now combed through IBM’s promotional videos, articles, and FAQs, and I would like to retract my earlier concession that their claims may have gone too far. IBM’s statements about Watson are fair reflections of what AI can realistically achieve and what a successful performance by Watson will demonstrate. About the most outlandish thing they say—the one that treads the furthest into the minefield of the philosophy of AI—is that Watson performs well in Jeopardy! because it understands natural language. And strictly speaking, it does. The clues in Jeopardy! are undeniably in natural language, and differ from formal or heavily restricted sentences by a significant degree of complexity. About the only restriction on the clues is length. Discard the puns and puzzles and you still have challenging problems like binding indefinite pronouns to objects (or classes of objects) that fit.
Whether Watson’s “understanding” of natural language is analogous to that of humans doesn’t figure into the discussion here. Nobody is saying that Watson actually has a conscious mind; AI researchers don’t think in those airy-fairy ontological terms when they are designing systems for specific tasks. They participate in the debates over the philosophy of artificial minds, yes, and they’re usually on the optimistic side, but everyone is aware of the separation between that conversation and the immediate challenge of defeating humans on a robust, open-domain answer-questioning game show.
We are not even remotely in Dreyfus territory. Still, I can understand why layperson readers might think we are when they read the story in The Globe and Mail and come across a juicy quotation like this:
“We can use computers to find documents with keywords, but the computers don’t know what those documents say,” Dr. Ferrucci says. “What if they did?”
People whose notions about AI come entirely from Battlestar Galactica could easily misread Ferrucci’s statement as referring to sentience or consciousness. But anybody who knows a thing or two about AI can read this and correctly interpret it to refer to semantic-level knowledge representation—concepts on a larger scale than string matching or keyword search. It’s entirely agnostic on the problem of whether artificial minds can exist. I’m not deliberately reading this as a modest apologist: this is actually what Ferrucci is obviously saying.
If you get all your science from Hollywood and you think cloning has to do with developed bodies and selves rather than the raw data in your genes, it’s not the responsibility of geneticists to clarify their work for you every time they speak. Similarly, you can’t expect scientists and engineers in AI to explicitly backpedal from the philosophical question of conscious machines every time they talk about their work.
Or can you? What we desperately need is a greater public understanding of what scientists do, and what they mean when they use everyday words to talk about their fields. Readers dive into news stories about science with popular preconceptions that are often wrong, but nobody takes up the responsibility of correcting them until the discourse goes seriously awry. We’ve seen this before with how the hysteria over genetically modified foods or embryonic stem cell research obfuscated the real issues deserving of policy attention. There are even some dark corners of the world where creationists are wreaking havoc on schools because they still think evolution by natural selection is some kind of affront to their god.
Sooner or later, this will happen with AI: we’ll explore the possibility of delegating something big and very public to an autonomous system, and legitimate policy concerns will drown in a sea of hysteria about machines taking over the world. If scientifically knowledgeable people do not shoulder the burden of sober clarification, that role will become occupied by contrarian journalists who don’t really know what they’re talking about, but still take pleasure in posturing as the voice of reason in the room.
If you are going to take the position of someone who sees through the publicity and understands the underlying science, you have to understand the underlying science. No matter how bombastic IBM’s promotional claims are, or how submissively the media repeats the press releases with a dash of unchecked sensationalism on top, Watson is more than a “gimmick” if it’s computationally interesting—and by any informed and reasonable standard, it is. Watson is a nontrivial system, and Jeopardy! is a nontrivial pursuit.