How Do We Know When Computers Are Smarter Than Us?

Artificial intelligence continues to make important strides forward, prompting even Bill Gates, Microsoft co-founder and a pioneer of the personal computer era, to worry that machines might one day grow too intelligent. During an “Ask Me Anything” session on Reddit, Gates wrote, “I am in the camp that is concerned about super intelligence. First the machines will do a lot of jobs for us and not be super intelligent. That should be positive if we manage it well. A few decades after that though the intelligence is strong enough to be a concern. I agree with Elon Musk and some others on this and don’t understand why some people are not concerned.”

In a recent interview with Backchannel, Gates explained that he foresees two “threats” of artificial intelligence: the first, that artificial intelligence could affect the creation of jobs that give humans “a sense of purpose and worth,” and the second, that advances could give rise to “strong AI,” which would have control over resources and goals that “are somehow conflicting with the goals of human systems.”

But how will we know when we’ve built a program that’s intelligent enough that we need to worry? The most famous metric — the Turing test, which is too often touted as a measure of a computer’s intelligence — has passed its expiration date. But a group of experts fed up with misleading claims of scripts and programs “passing” the test hopes to turn the Turing test into a Turing Championship, to measure and advance the ability of artificial intelligence. Writing for Motherboard, Victoria Turk reports that Gary Marcus, a chair of the workshop, says that the Turing test is “really an exercise in deception and evasion.”

The Ikea test

The artificial intelligence experts, who met for a workshop at the AAAI Conference on Artificial Intelligence in Austin, Texas, aim to replace the 65-year-old Turing test with an annual or bi-annual Turing Championship. According to the workshop’s website, such an event might consist of three to five different challenging tasks, which would test all of the areas where artificial intelligence has progressed, including vision, speech recognition, and natural language processing.

The workshop’s website describes two such challenges. The first, the Winograd Schema Challenge, tests the ability of machines to resolve linguistic antecedents in contexts where common-sense knowledge is critical, with un-Googleable questions such as “The trophy doesn’t fit in the brown suitcase because it’s too big. What is too big? (The trophy or the suitcase?)” Another, proposed by Marcus, would focus on the comprehension of new materials, like videos, texts, photos, and podcasts, asking the machine questions like “Why did Russia invade Crimea?” or “Why did Walter White consider taking a hit out on Jesse?”
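To make the structure of a Winograd schema concrete, here is a minimal, purely illustrative Python sketch. The schema fields, the naive resolver, and the scoring helper are assumptions made for demonstration, not part of the official challenge; a real entrant would need common-sense reasoning to beat chance across a large set of schemas.

```python
# Minimal illustration of a Winograd schema: a pronoun whose referent can
# only be resolved with common-sense knowledge, not surface word statistics.
from dataclasses import dataclass

@dataclass
class WinogradSchema:
    sentence: str        # sentence containing the ambiguous pronoun
    pronoun: str         # the pronoun to resolve
    candidates: tuple    # the two possible antecedents
    answer: str          # the correct antecedent (used for scoring)

schema = WinogradSchema(
    sentence="The trophy doesn't fit in the brown suitcase because it's too big.",
    pronoun="it",
    candidates=("the trophy", "the suitcase"),
    answer="the trophy",
)

def naive_resolver(s: WinogradSchema) -> str:
    """Placeholder 'resolver' that always guesses the first candidate.
    Over many schemas it would score about 50%, which is the point:
    the questions are designed so that tricks don't help."""
    return s.candidates[0]

def score(resolver, schemas) -> float:
    """Fraction of schemas where the resolver picks the correct antecedent."""
    return sum(resolver(s) == s.answer for s in schemas) / len(schemas)

print(score(naive_resolver, [schema]))  # 1.0 on this one example, ~0.5 in general
```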

Turk learned of a third challenge that the group is considering, one which Marcus refers to as the “Ikea test.” Essentially, the challenge posits that a machine could be considered as intelligent as a human when it can follow the instructions to build flatpack furniture. To do that, it would have to see the parts, interpret the instructions, and have the motor skills to put the furniture together itself — which would require the artificial intelligence program to guide a physical robot. The event could start with simulations before moving to robots, or even involve human cooperation.
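As a rough way to picture what that test demands, here is a hypothetical Python sketch of the pipeline described above: perceiving the parts, interpreting the instructions, and executing assembly steps. Every interface and name in it is invented for illustration; the workshop has not specified any such API.

```python
# Hypothetical skeleton of an "Ikea test" entrant. Each stage is a placeholder
# showing the kind of capability required, not a real implementation.
from typing import List

class AssemblyAgent:
    def perceive_parts(self, camera_frames: List[bytes]) -> List[str]:
        """Vision: identify the parts laid out in front of the robot."""
        raise NotImplementedError  # e.g. an object-detection model

    def interpret_instructions(self, manual_pages: List[bytes]) -> List[str]:
        """Language and diagram understanding: turn the manual into ordered steps."""
        raise NotImplementedError  # e.g. a vision-and-language model

    def execute_step(self, step: str) -> bool:
        """Motor control: carry out one step on a real or simulated robot."""
        raise NotImplementedError  # e.g. a motion planner driving a gripper

    def assemble(self, camera_frames: List[bytes], manual_pages: List[bytes]) -> bool:
        """End-to-end attempt: perceive, plan, then act, reporting success."""
        parts = self.perceive_parts(camera_frames)
        if not parts:
            return False
        steps = self.interpret_instructions(manual_pages)
        return all(self.execute_step(step) for step in steps)
```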

Other ideas proposed at the workshop included challenging an artificial intelligence program to play a new video game as well as a twelve-year-old child, or asking a digital teacher to learn a new topic and teach it as well as or better than a human. The group hopes to launch its first championship next year, after a second workshop at the IJCAI conference in July. The championship will begin with three or four tests, and the group will add more in later years.

The imitation game

Alan Turing, known as one of the fathers of the modern computer age, recognized that the term “machine intelligence” meant next to nothing, and reasoned that it would be more useful to discuss what a machine can do. When he introduced the original concept of the Turing test in 1950 — in a paper called “Computing Machinery and Intelligence” — Turing suggested it as a philosophical exercise to explore the question “Can machines think?” He considered that question “too meaningless to deserve discussion,” and instead proposed a challenge that later became known as the Turing test: If a judge can’t decipher which of two hidden entities is human and which is artificial, then the machine successfully passes the test.

The new form of the problem can be described in terms of a game which we call the ‘imitation game.’ It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either sex. The interrogator stays in a room apart from the other two. The object of the game for the interrogator is to determine which of the other two is the man and which is the woman. He knows them by labels X and Y, and at the end of the game he says either ‘X is A and Y is B’ or ‘X is B and Y is A’. The interrogator is allowed to put questions to A and B.

Turing envisioned the game being mediated by teletype machines, and today, researchers can use any sort of text-based interface, like one used in a chat room or an instant messaging service. While A’s object in the game is to try to cause C to make the wrong identification, the object of the game for B is to help the interrogator. As Turing’s paper goes on, the game takes an interesting twist.

We now ask the question, ‘What will happen when a machine takes the part of A in this game?’ Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? These questions replace our original, ‘Can machines think?’
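The protocol Turing describes is simple enough to sketch as a toy program. The Python snippet below only illustrates the structure of the test: a judge exchanges text with two hidden respondents and then guesses which one is the machine. The respondent and judge functions are stand-ins invented for this example, not real AI.

```python
import random

def machine_respondent(question: str) -> str:
    # Stand-in for a chatbot entrant; a real program would generate an answer here.
    return "That's an interesting question."

def human_respondent(question: str) -> str:
    # Stand-in for the hidden human; in a live test a person would type the reply.
    return "Let me think about that for a moment."

def run_imitation_game(judge, questions):
    # Randomly assign the human and the machine to the anonymous labels X and Y.
    respondents = [human_respondent, machine_respondent]
    random.shuffle(respondents)
    labels = dict(zip("XY", respondents))

    # Each hidden respondent answers every question; the judge sees only the labels.
    transcript = {label: [(q, answer(q)) for q in questions]
                  for label, answer in labels.items()}

    guess = judge(transcript)  # the judge names the label it believes is the machine
    truth = next(label for label, r in labels.items() if r is machine_respondent)
    return guess == truth      # True if the judge caught the machine

# A judge that guesses at random sets the 50% baseline a convincing machine exploits.
naive_judge = lambda transcript: random.choice(list(transcript))
print(run_imitation_game(naive_judge, ["Do you enjoy poetry?", "What is 7 times 8?"]))
```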

The mannequin effect

The Turing test was intended as a philosophical experiment rather than as a practical assessment of artificial intelligence. But since 1950, many have come to regard it as a benchmark for researchers wanting to prove the intelligence of their bots, and as such it has been the center of controversy. This was never more visible than last summer, when Kevin Warwick of the University of Reading announced that a chatbot — one that goes by the name of “Eugene Goostman” and emulates the personality of a 13-year-old Ukrainian boy — was the first to pass the Turing test at an event held on the 60th anniversary of Turing’s death.

In a June story on Smithsonian.com, Dan Falk, who was a judge in a Turing test “marathon” in 2012, noted that “the Turing Test measures something, but it’s not intelligence.” Falk notes that in his 1950 paper, Turing speculated that by the year 2000, “an average interrogator will not have more than 70 per cent chance of making the right identification” after five minutes, positing that computer programs would fool judges 30% of the time. He adds that the stipulation of the five-minute time limit is important. Turing didn’t discuss a time limit as an inherent part of the test — and it could be argued that for a machine to really pass the test, it should be able to handle any amount of questioning — but it’s likely that he included it as an arbitrary, albeit necessary, practical limit.

Falk notes that the shorter the conversation in a Turing test, the greater the computer’s advantage. Conversely, the longer the interrogation, the higher the probability that the computer will give itself away. He explains, “I like to call this the mannequin effect: Have you ever apologized to a department store mannequin, assuming that you had just bumped into a live human being? If the encounter lasts only a fraction of a second, with you facing the other way, you may imagine that you just brushed up against a human. The longer the encounter, the more obvious the mannequin-ness of the mannequin.”

The same logic applies to chatbots. While an exchange of hellos reveals nothing, more problems arise the deeper into a conversation the bot gets. Chatbots change the subject for no reason, often can’t answer simple questions, and in other ways just don’t sound human. The developers behind the “Eugene Goostman” chatbot took pains to build a character with a believable personality — but one that would make it seem reasonable that he didn’t know everything. Chatbots rely on what Falk characterizes as “an assortment of tricks,” like memorizing megabytes of canned responses, or searching the Internet for dialogue that approximates the conversation they’re currently having.
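As a rough illustration of the “assortment of tricks” Falk describes, here is a minimal Python sketch of a canned-response chatbot: it pattern-matches the judge’s input against stored replies and changes the subject when nothing matches. The patterns, replies, and persona are invented for demonstration and are not taken from any real bot.

```python
import random
import re

# A minimal "canned response" chatbot: keyword patterns mapped to stored replies,
# with subject-changing deflections for anything it cannot match.
CANNED_REPLIES = {
    r"\bhow are you\b": ["Not bad. A bit tired from school, to be honest."],
    r"\bwhere .*from\b": ["I'm from Odessa. Have you ever been there?"],
    r"\bfavorite\b": ["I don't really have a favorite. What about you?"],
}

DEFLECTIONS = [
    "Why do you ask?",
    "Let's talk about something else.",
    "Sorry, can you rephrase that?",
]

def reply(message: str) -> str:
    for pattern, responses in CANNED_REPLIES.items():
        if re.search(pattern, message.lower()):
            return random.choice(responses)
    # No match: change the subject, which is exactly what gives bots away
    # the longer a conversation runs.
    return random.choice(DEFLECTIONS)

print(reply("How are you today?"))
print(reply("Why did Russia invade Crimea?"))  # deflects; there is no understanding here
```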

This raises the question of what the Turing test really measures, and the answer, according to many, is that it rewards trickery instead of intelligence, making it fitting that Turing termed the test the “imitation game.” Any program that could really pass the Turing test would be very successful at mimicking a human. Falk explains that while machine intelligence is advancing at a rapid pace, most artificial intelligence researchers have realized that conversation is unlikely to be where it’s most impressive.

The new challenges that are proposed to replace the Turing test would require a much higher level of skill than chatbots are able to fake. The point is to push artificial intelligence research further, and to that end, Marcus has proposed gradually adjusting the goalposts of what it means to pass each of the tests. A program might first need to match the skill of a child, then of an average person, and then of an expert in the field to be considered a winner. Marcus says it’s possible that programs could deliver superhuman performance. But while everyone hopes to see an artificially intelligent bot ace the tests, the point isn’t really to pass them; it’s to advance artificial intelligence and make machines more intelligent.
