[FoR&AI] Steps Toward Super Intelligence I, How We Got Here

God created man in his own image.

Man created AI in his own image.

Once again, with footnotes.

God created man in his[1] own image.[2]

Man[3] created AI in his[3] own image.

At least that is how it started out. But figuring out what our selves are, as machines, is a really difficult task.

We may be stuck in some weird Gödel-like incompleteness world–perhaps we are creatures below some threshold of intelligence which stops us from ever understanding or building an artificial intelligence at our  level. I think most people would agree that that is true of all non-humans on planet Earth–we would be extraordinarily surprised to see a robot dolphin emerge from the ocean, one that had been completely designed and constructed by living dolphins. Like dolphins, and gorillas, and bonobos, we humans may be below the threshold as I discussed under “Hubris and Humility”.

Or, perhaps, as I tend to believe, it is just really hard, and will take a hundred years, or more, of concerted effort to get there. We still make new discoveries in chemistry, and people have tried to understand that, and turn it from a science into engineering, for thousands of years. Human level intelligence may be just as, or even more, challenging.

In my most recent blog post on the origins of Artificial Intelligence, I talked about how all the founders of AI, and most researchers since have really been motivated by producing human level intelligence. Recently a few different and small groups have tried to differentiate themselves by claiming that they alone (each of the different groups…) are interested in producing human level intelligence, and beyond, and have each adopted the name Artificial General Intelligence (AGI) to distinguish themselves from mainstream AI research. But mostly they talk about how great it is going to be, and how terrible it is going to be, and very often both messages at the same time coming out of just one mouth, once AGI is achieved. And really none of these groups have any idea how to get there. They are too much in love with the tingly feelings thinking about it to waste time actually doing it.

Others, who don’t necessarily claim to be AI researchers but merely claim to know all about AI and how it is proceeding (ahem…), talk about the exquisite dangers of what they call Super Intelligence, AI that is beyond human level. They claim it is coming any minute now, especially since as soon as we get AGI, or human level intelligence, it will be able to take over from us, using vast cloud computational resources, and accelerate  the development of AI even further. Thus there will be a runaway development of Super Intelligence. Under the logic of these hype-notists this super intelligence will naturally be way beyond our capabilities, though I do not know whether they believe it will supersede God…  In any case, they claim it will be dangerous, of course, and won’t care about us (humans) and our way of life, and will likely destroy us all. I guess they think a Super Intelligence is some sort of immigrant. But these heralds who have volunteered their clairvoyant services to us also have no idea about how AGI or Super Intelligence will be built. They just know that it is going to happen soon, if not already. And they do know, with all their hearts, that it is going to be bad. Really bad.

It does not help the lay perception at all that some companies claiming to have systems based on Artificial Intelligence are often using humans to handle the hard cases for their online systems, unbeknownst to the users. This can seriously confuse public perception of just where today’s Artificial Intelligence stands, not to mention the inherent lack of privacy since I think we humans are somehow more willing sometimes to share private thoughts and deeds with our machines than we are with other people.

In the interest of getting to the bottom of all this, I have been thinking about what research we need to do, what problems we need to solve, and how close we are to solving all of them in order to get to Artificial General Intelligence entities, or human intelligence level entities. We have been actively trying for 62 years, but apparently it is only just right now that we are about to make all the breakthroughs that will be necessary. That is what this blog is about, giving my best guess at all the things we still don’t know, that will be necessary to know for us to build AGI agents, and then how they will take us on to Super Intelligence. And thus the title of this post: Steps Toward Super Intelligence.

And yes, this title is an homage to Marvin Minsky’s Steps Toward Artificial Intelligence from 1961. I briefly reviewed that paper back in 1991 in my paper Intelligence Without Reason, where I pointed out the five main areas he identifies for research into Artificial Intelligence were search, three ways to control search (for pattern-recognition, learning, and planning), and a fifth topic, induction. Pattern recognition and learning have today clearly moved beyond search. Perhaps my prescriptions for research towards Super Intelligence will also turn out to be wrong before very long. But I am pretty confident that the things that will date my predictions are not yet known by most researchers, and certainly are not the hot topics of today.

This started out as a single long essay, but it got longer and longer. So I split it into four parts, but they are also all long. In any case, it is the penultimate essay in my series on the Future of Robotics and Artificial Intelligence.


Earlier I said this endeavor may take a hundred years. For techno enthusiasts, of which I count myself as one, that sounds like a long time. Really, is it going to take us that long? Well, perhaps not, perhaps it is really going to take us two hundred, or five hundred. Or more.

Einstein predicted gravitational waves in 1916. It took ninety-nine years of people looking before we first saw them in 2015. Rainer Weiss, who won the Nobel prize for it, sketched out the successful method in 1967, fifty-one years after the prediction. And by then the key technologies needed, lasers and computers, were in widespread commercial use. It just took a long time.

Controlled nuclear fusion has been forty years away for well over sixty years now.

Chemistry took millennia, despite the economic incentive of turning lead into gold (and it turns out we still can’t do that in any meaningful way).

P=NP? has been around in its current form for forty-seven years, and its solution would guarantee that whoever found it would be feted as the greatest computer scientist in a generation, at least. No one in theoretical computer science is willing to guess when we might figure that one out. And it doesn’t require any engineering or production. Just thinking.

Some things just take a long time, and require lots of new technology, lots of time for ideas to ferment, and lots of Einstein and Weiss level contributors along the way.

I suspect that human level AI falls into this class, but that it is much more complex than detecting gravitational waves, controlled fusion, or even chemistry, and that it will take hundreds of years.

Being filled with hubris about how tech (and the Valley) can do whatever they put their mind to may just not be nearly enough.

Four Previous Attempts at General AI

Referring again to my blog post of April on the origins of Artificial Intelligence, people have been actively working on a subject explicitly called “Artificial Intelligence” since the summer of 1956. There were precursor efforts for the previous twenty years, but that name had not yet been invented or assigned–and once again I point to my 1991 paper Intelligence Without Reason for a history of the prior work and the first 35 years of Artificial Intelligence.

I count at least four major approaches to Artificial Intelligence over the last sixty-two years. There may well be others that some would want to include.

As I see it, the four main approaches have been, along with approximate start dates:

  1. Symbolic (1956)
  2. Neural networks (1954, 1960, 1969, 1986, 2006, …)
  3. Traditional robotics (1968)
  4. Behavior-based robotics (1985)

Before explaining the strengths and weaknesses of these four main approaches I will justify the dates that I have given above.

For Symbolic I am using the 1956 date of the famous Dartmouth workshop on Artificial Intelligence.

Neural networks have been investigated, abandoned, and taken up again and again. Marvin Minsky submitted his Ph.D. thesis at Princeton in 1954, titled Theory of Neural-Analog Reinforcement Systems and its Application to the Brain-Model Problem; two years later Minsky had abandoned this approach and was a leader in the symbolic approach at Dartmouth. Dead. In 1960 Frank Rosenblatt published results from his hardware Mark I Perceptron, a simple model of a single neuron, and tried to formalize what it was learning. In 1969 Marvin Minsky and Seymour Papert published a book, Perceptrons, analyzing what a single perceptron could and could not learn. This effectively killed the field for many years. Dead, again. After years of preliminary work by many different researchers, in 1986 David Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper, Learning Representations by Back-Propagating Errors, which re-established the field using a small number of layers of neuron models, each much like the Perceptron model. There was a great flurry of activity for the next decade until most researchers once again abandoned neural networks. Dead, again. Researchers here and there continued to work on neural networks, experimenting with more and more layers, and coining the term deep for those many more layers. They were unwieldy and hard to make learn well, and then in 2006 Geoffrey Hinton (again!) and Ruslan Salakhutdinov published Reducing the Dimensionality of Data with Neural Networks, where an idea called clamping allowed the layers to be trained incrementally. This made neural networks undead once again, and in the last handful of years this deep learning approach has exploded into practical machine learning. Many people today know Artificial Intelligence only from this one technical innovation.

I trace Traditional Robotics, as an approach to Artificial Intelligence, to the work of Donald Pieper, The Kinematics of Manipulators Under Computer Control, at the Stanford Artificial Intelligence Laboratory (SAIL) in 1968.  In 1977 I joined what had by then become the “Hand-Eye” group at SAIL, working on the “eye” part of the problem for my PhD.

As for Behavior-based robotics, I trace this to my own paper, A Robust Layered Control System for a Mobile Robot, which was written in 1985, but appeared in a journal in 1986[4], when it was called the Subsumption Architecture. This later became the behavior-based approach, and eventually through technical innovations by others morphed into behavior trees. I am perhaps lacking a little humility in claiming this as one of the four approaches to AI. On the other hand it has led to more than 20 million robots in people’s homes, numerically more robots by far than any other robots ever built, and behavior trees are now underneath the hood of two thirds of the world’s video games, and many physical robots from UAVs to robots in factories. So it has at least been a commercial success.

Now I attempt to give some cartoon level descriptions of these four approaches to Artificial Intelligence. I know that anyone who really knows Artificial Intelligence will feel that these descriptions are grossly inadequate. And they are. The point here is to give just a flavor for the approaches. These descriptions are not meant to be exhaustive in showing all the sub approaches, nor all the main milestones and realizations that have been made in each approach by thousands of contributors. That would require a book length treatment. And a very thick book at that. These descriptions are meant to give just a flavor.

Now to the four types of AI. Note that for the first two, there has usually been a human involved somewhere in the overall usage pattern. This puts a second intelligent agent into the system and that agent often handles ambiguity and error recovery.  Often, then, these sorts of AI systems have had to deliver much less reliability than autonomous systems will demand in the future.

1. Symbolic Artificial Intelligence

The key concept in this approach is one of symbols. In the straightforward (every approach to anything usually gets more complicated over a period of decades) symbolic approach to Artificial Intelligence a symbol is an atomic item which only has meaning from its relationship to other meanings. To make it easier to understand, the symbols are often represented by a string of characters which correspond to a word (in English perhaps), such as cat or animal. Then knowledge about the world can be encoded in relationships, such as instance of and is.

Usually the whole system would work just as well, and consistently, if the words were replaced by, say, g0537 and g0028. We will come back to that.

Meanwhile, here is some encoded knowledge:

  • Every instance of a cat is an instance of a mammal.
  • Fluffy is an instance of cat.
  • Now we can conclude that Fluffy is an instance of a mammal.
  • Every instance of a mammal is an instance of an animal.
  • Now we can conclude that every instance of a cat is an instance of an animal.
  • Every instance of an animal can carry out the action of walking.
  • Unless that instance of animal is in the state of being dead.
  • Every instance of an animal is either in the state of being alive or in the state of being dead — unless the time of now is before the time of (that instance of animal carrying out the action of birth).

While what we see here makes a lot of sense to us, we must remember that as far as an AI program that uses this sort of reasoning is concerned, it might as well have been:

  • Every instance of a g0537 is an instance of a g0083.
  • g2536 is an instance of g0537.
  • Now we can conclude that g2536 is an instance of a g0083.
  • Every instance of a g0083 is an instance of an g0028.
  • Now we can conclude that every instance of a g0537 is an instance of an g0028.
  • Every instance of an g0028 can carry out the action of g0154.
  • Unless that instance of g0028 is in the state of being g0253.
  • Every instance of an g0028 is either in the state of being g0252 or in the state of being g0253 — unless the value(the-computer-clock) < time of (that instance of g0028 carrying out the action of g0161).

In fact it is worse than this. Above, the relationships are still described by English words. As far as an AI program that uses this sort of reasoning is concerned, it might as well have been:

  • For every x where r0002(x, g0537) then r0002(x, g0083).
  • r0002(g2536, g0537).
  • Now we can conclude that r0002(g2536, g0083).
  • For every x where r0002(x, g0083) then r0002(x, g0028).
  • Now we can conclude that for every x where r0002(x, g0537) then r0002(x, g0028).
  • For every x where r0002(x, g0028) then r0005(x, g0154).
  • Unless r0007(x, g0253).
  • For every x where r0002(x, g0028) then either r0007(x, g0252) or r0007(x, g0253), unless the value(the-computer-clock) < p0043(a0027(g0028, g0161)).

Here the relationships like “is an instance of” have been replaced by anonymous symbols like r0002, and the symbol < replaces “before”, etc. This is what it looks like inside an AI program, but even with this the AI program never looks at the names of the symbols, rather just when one symbol in an inference or statement is the same symbol as in another inference or statement. The names are only there for humans to interpret, so when g0537 and g0083 were cat and mammal, a human looking at the program[5] or its input or output could put an interpretation on what the symbols might “mean”.
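To make the point about anonymous symbols concrete, here is a minimal sketch in Python, not taken from any real AI system. The function names `forward_chain` and `rename` are invented for illustration, the rule format is a simplifying assumption, and the "unless dead" exception is left out. It runs the same inference once with English names and once with the g-numbers and r-numbers used above, and checks that the conclusions correspond exactly.

```python
def forward_chain(facts, rules):
    """Repeatedly apply rules until nothing new can be concluded.

    A fact is a triple (relation, thing, class).
    A rule ((rel_in, cls_in), (rel_out, cls_out)) reads:
    for every x where rel_in(x, cls_in) then rel_out(x, cls_out).
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (rel_in, cls_in), (rel_out, cls_out) in rules:
            for rel, x, cls in list(derived):
                new_fact = (rel_out, x, cls_out)
                # Only symbol identity is tested, never what a name "means".
                if rel == rel_in and cls == cls_in and new_fact not in derived:
                    derived.add(new_fact)
                    changed = True
    return derived


def rename(item, names):
    """Replace every symbol in a (possibly nested) tuple using the names table."""
    if isinstance(item, tuple):
        return tuple(rename(part, names) for part in item)
    return names[item]


facts = {("instance_of", "Fluffy", "cat")}
rules = [
    (("instance_of", "cat"), ("instance_of", "mammal")),
    (("instance_of", "mammal"), ("instance_of", "animal")),
    (("instance_of", "animal"), ("can_do", "walking")),
]

# The same anonymous symbols used in the lists above.
names = {
    "instance_of": "r0002", "can_do": "r0005",
    "cat": "g0537", "mammal": "g0083", "animal": "g0028",
    "Fluffy": "g2536", "walking": "g0154",
}

english = forward_chain(facts, rules)
anonymous = forward_chain(
    {rename(f, names) for f in facts},
    [rename(r, names) for r in rules],
)

# Renaming the English conclusions gives exactly the anonymous conclusions.
print(anonymous == {rename(f, names) for f in english})
```

The program only ever compares symbols for equality, so nothing in it could tell the two runs apart.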

And this is the critical problem with symbolic Artificial Intelligence: how the symbols that it uses are grounded in the real world. This requires some sort of perception of the real world, some way to get to and from symbols that connects them to things and events in the real world.

For many applications it is the humans using the system that do the grounding. When we type in a query to a search engine it is we who choose the symbols to make our way into what the AI system knows about the world. It does some reasoning and inference, and then produces for us a list of Web pages that it has deduced match what we are looking for (without actually having any idea that we are something that corresponds to the symbol person that it has in its database). Then it is we who look at the summaries that it has produced of the pages and click on the most promising one or two pages, and then we come up with some new or refined symbols for a new search if it was not what we wanted. We, the humans, are the symbol grounders for the AI system. One might argue that all the intelligence is really in our heads, and that really all the AI powered search engine provides us with is a fancy index and a fancy way to use it.

To drive home this point consider the following thought experiment.

Imagine you are a non Korean speaker, and that the AI program you are interacting with has all its input and output in Korean. Those symbols would not be much help. But suppose you had a Korean dictionary, with the definitions of Korean words written in Korean. Fortunately modern Korean has a finite alphabet and spaces between words (though the rules are slightly different from those of English text), so it will be possible to extract “symbols” from looking at the program output. And then you could look them up in the dictionary, and perhaps eventually infer Korean grammar.

Now it is just possible that you could use your extensive understanding of the human world, to which the Korean dictionary must be referring for many entries, to guess at some of the meanings of the symbols. But if you were a Heptapod from the movie Arrival and it was before (uh-oh…) Heptapods had ever visited Earth then you would not even have this avenue for grounding these entirely alien symbols.

So it really is the knowledge in people’s heads that does the grounding in many situations. In order to get the knowledge into an AI program it needs to be able to relate the symbols to something outside its self consistent Korean dictionary (so to speak). Some hold out hope that our next character in the pantheon of pretenders to the throne of general Artificial Intelligence, neural networks, will play that role. Of course, people have been working on making that so for decades. We’re still a long way off.

To see the richness of sixty plus years of symbolic Artificial Intelligence work I recommend the AI Magazine, the quarterly publication of the Association for the Advancement of Artificial Intelligence. It is behind a paywall, but even without joining the association you can see the tables of contents of all the issues and that will give a flavor of the variety of work that goes on in symbolic AI. And occasionally there will also be an article there about neural networks, and other related types of machine learning.

2.0, 2.1, 2.2, 2.3, 2.4, … Neural networks

These are loosely, very loosely, based on a circa 1948 understanding of neurons in the brain. That is to say they do not bear very much resemblance at all to our current understanding of the brain, but that does not stop the press talking about this approach as being inspired by biology. Be that as it may.

Here I am going to talk about just one particular kind of artificial neural network and how it is trained, namely a feed forward layered network under supervised learning. There are many variations, but this example gives an essential flavor.

The key element is an artificial neuron that has n inputs, flowing along which come n numbers, all between zero and one, namely x1, x2, … xn. Each of these is multiplied by a weight, w1, w2, … wn, and the results are summed, as illustrated in this diagram from Wikimedia Commons.

(And yes, that w3 in the diagram should really be wn.) These weights can have any value (though are practically limited to what numbers can be represented in the computer language in which the system is programmed) and are what get changed when the system learns (and recall that learn is a suitcase word as explained in my seven deadly sins post). We will return to this in just a few paragraphs.

The sum can be any sized number, but a second step compresses it down to a number between zero and one again by feeding it through a logistic or sigmoid function, a common one being:

f(x) = \frac{1}{1+e^{-x}}

I.e., the sum gets fed in as the argument to the function and the expression on the right is evaluated to produce a number that is strictly between zero and one, closer and closer to those extremes as the input gets extremely negative or extremely positive. Note that this function preserves the order between possible inputs x fed to it. I.e., if y < z then f(y) < f(z). Furthermore the function is symmetric about an input of zero, and an output of 0.5.
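As a minimal sketch of the neuron just described, the weighted sum and the squashing step can be written in a few lines of Python. The input and weight values below are made up purely for illustration, and the function names are my own.

```python
import math


def sigmoid(s):
    """The logistic function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-s))


def neuron_output(inputs, weights):
    """Multiply each input by its weight, sum the results, then squash the sum."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(total)


# Illustrative values only: three inputs between zero and one, three weights.
inputs = [0.2, 0.9, 0.5]
weights = [1.5, -2.0, 0.7]

out = neuron_output(inputs, weights)
print(out)  # a number strictly between zero and one
```

Whatever the weights, the output stays strictly between zero and one, and an input sum of zero gives exactly 0.5.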

This particular function is very often used as it has the property that it is easy to compute its derivative for any given output value without having to invert the function to find the input value. In particular, if you work through the normal rules for derivatives and use algebraic simplification you can show that

\dfrac{\mathrm{d}}{\mathrm{d}x}f(x) = \frac{e^x}{(1+e^x)^2} = f(x)(1-f(x))
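For readers who want the algebra, here is one way to work through it, using only the definition of f given above. The intermediate steps are filled in here for illustration.

```latex
\begin{aligned}
f(x) &= \left(1+e^{-x}\right)^{-1} \\
\dfrac{\mathrm{d}}{\mathrm{d}x}f(x)
  &= \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}}
   = \frac{e^{x}}{\left(1+e^{x}\right)^{2}}
   \quad\text{(multiplying top and bottom by } e^{2x}\text{)} \\
1-f(x) &= \frac{e^{-x}}{1+e^{-x}} \\
f(x)\,\bigl(1-f(x)\bigr)
  &= \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}}
   = \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}}
   = \dfrac{\mathrm{d}}{\mathrm{d}x}f(x)
\end{aligned}
```

So the slope at any point can be computed from the output value f(x) alone.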

This turns out to be very useful for the ways these artificial neurons started to be used in the 1980’s resurgence. They get linked together in regular larger networks such as the one below, where each large circle corresponds to one of the artificial neurons above. The outputs on the right are usually labelled with symbols, for instance cat, or car. The smaller circles on the left correspond to inputs to the network which come from some data source.

For instance, the source might be an image, and there might be many thousands of little patches of the image sampled on the left, perhaps with a little bit of processing of the local pixels to pick out local features in the image, such as corners, or bright spots. Once the network has been trained, the output labelled cat should put out a value close to one when there is a cat in the image and close to zero if there is no cat in the image, and the one labelled car should have a value similarly saying whether there is a car in the image. One can think of these numbers as the network saying what probability it assigned to there being a cat, a car, etc. This sort of network thus classifies its input into a finite number of output classes.

But how does the network get trained? Typically one would show it millions of images (yes, millions), for which there was ground truth known about which images contained what objects. When the output lines with their symbols did not get the correct result, the weights on the inputs of the offending output neuron would be adjusted up or down in order to next time produce a better result. The amount to update those weights depends on how much difference a change in weight can make to the output. Knowing the derivative  of where on the sigmoid function the output is coming from is thus critical. During training the proportional amount, or gain, of how much the weights are modified is reduced over time. And a 1980’s invention allowed the detected error at the output to be propagated backward through multiple layers of the network, usually just two or three layers at that time, so that the whole system could learn something from a single bad classification. This technique is known as back propagation.

One immediately sees that the most recent image will have a big impact on the weights, so it is necessary to show the network all the other images again, and gradually decrease how much weights are changed over time. Typically each image is shown to the network thousands, or even hundreds of thousands of times, interspersed amongst millions of other images also each being shown to the network hundreds of thousands of times.
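The following is a minimal sketch of that training loop for a single sigmoid neuron, so there are no hidden layers and no back propagation through layers. Several things in it are my own assumptions for illustration: the toy task (logical OR on two inputs), the extra constant input used as a bias, the squared-error update rule, and the particular gain schedule.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def predict(weights, inputs):
    """Output of one neuron; a constant input of 1.0 is appended as a bias."""
    xs = list(inputs) + [1.0]
    return sigmoid(sum(w * x for w, x in zip(weights, xs)))


def train(examples, epochs=5000, initial_gain=1.0):
    """Show every example again and again, shrinking the gain over time."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * (n_inputs + 1)  # the last weight belongs to the bias input
    for epoch in range(epochs):
        # Assumed schedule: the gain decays smoothly as training proceeds.
        gain = initial_gain / (1.0 + epoch / 1000.0)
        for inputs, target in examples:
            xs = list(inputs) + [1.0]
            out = predict(weights, inputs)
            # Error times the sigmoid's slope, using f'(x) = f(x)(1 - f(x)).
            delta = (target - out) * out * (1.0 - out)
            weights = [w + gain * delta * x for w, x in zip(weights, xs)]
    return weights


# Toy ground truth (an assumption for illustration): logical OR.
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights = train(examples)

for inputs, target in examples:
    print(inputs, target, round(predict(weights, inputs), 3))
```

Each pass through the data nudges the weights a little, and the shrinking gain keeps the most recently shown example from dominating what has been learned.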

That this sort of training works as well as it does is really a little fantastical. But it does work in many cases. Note that a human designs how many layers there are in the network, for each layer how the connections to the next layer of the network are arranged, and what the inputs are for the network. And then the network is trained, using a schedule of what to show it when, and how to adjust the gains on the learning over time as chosen by the human designer. And if after lots of training the network has not learned well, the human may adjust the way the network is organized, and try again.

This process has been likened to alchemy, in contrast to the science of chemistry. Today’s alchemists who are good at it can command six or even seven figure salaries.

In the 1980’s when back propagation was first developed and multi-layered networks were first used, it was only practical from both a computational and algorithmic point of view to use two or three layers. Two decades later the Deep Learning revolution of 2006 included algorithmic improvements, new incremental training techniques, of course lots more computer power, and enormous sets of training data harvested from the fifteen year old World Wide Web. Soon there were practical networks of twelve layers–that is where the word deep comes in–it refers to lots of layers in the network, and certainly not “deep introspection”…

Over the more than a decade since 2006 lots of practical systems have been built.

The biggest practical impact for most people recently, and likely over the next couple of decades, is the impact on speech transcription systems. In the last five years we have moved from speech systems over the phone that felt like “press or say ‘two’ for frustration”, to continuous speech transcription of voice messages, and to home appliances, starting with the Amazon Echo and Google Home, but now extending to our TV remotes, and more and more appliances built on top of the speech recognition cloud services of the large companies.

Getting the right words that people are saying depends on two capabilities. The first is detecting the phonemes, the sub pieces of words, with very different phonemes for different languages, and the second is partitioning a stream of those phonemes, some detected in error, into a stream of words in the target language. With earlier neural networks the feature detectors that were applied to raw sound signals to provide low level clues for phonemes were programs that engineers had built by hand. With Deep Learning, techniques were developed where those earliest features were also learned by listening to massive amounts of speech from different speakers all talking in the target language. This is why today we are starting to think it natural to be able to talk to our machines. Just like Scotty did in Star Trek IV: The Voyage Home.

A new capability was unveiled to the world in a New York Times story on November 17, 2014 where the photo below appeared along with a caption that a Google program had automatically generated: “A group of young people playing a game of Frisbee”.

I think this is when people really started to take notice of Deep Learning. It seemed miraculous, even to AI researchers, and perhaps especially to researchers in symbolic AI, that a program could do this well. But I also think that people confused performance with competence (referring again to my seven deadly sins post). If a person had this level of performance, and could say this about that photo, then one would naturally expect that the person had enough competence in understanding the world, that they could probably answer each of the following questions:

  • what is the shape of a Frisbee?
  • roughly how far can a person throw a Frisbee?
  • can a person eat a Frisbee?
  • roughly how many people play Frisbee at once?
  • can a 3 month old person play Frisbee?
  • is today’s weather suitable for playing Frisbee?

But the Deep Learning neural network that produced the caption above can not answer these questions. It certainly has no idea what a question is, and can only output words, not take them in, but it doesn’t even have any of the knowledge that would be needed to answer these questions buried anywhere inside what it has learned. It has learned a mapping from colored pixels, with a tiny bit of spatial locality, to strings of words. And that is all. Those words only rise up a little beyond the anonymous symbols of traditional AI research, to have a sort of grounding, a grounding in the appearance of nearby pixels. But beyond that those words or symbols have no meanings that can be related to other things in the world.

Note that the medium in which learning happens here is selecting many hundreds of thousands, perhaps millions, of numbers or weights. The way that the network is connected to input data is designed by a human, the layout of the network is designed by a human, the labels, or symbols, for the outputs are selected by a human, and the set of training data has previously been labelled by a human (or thousands of humans) with these same symbols.

3. Traditional Robotics

In the very first decades of Artificial Intelligence, the AI of symbols, researchers sought to ground AI by building robots. Some were mobile robots that could move about and perhaps push things with their bodies, and some were robot arms fixed in place. It was just too hard then to have both, a mobile robot with an articulated arm.

The very earliest attempts at computer vision were then connected to these robots, where the goal was to, first, deduce the geometry of what was in the world, and then to have some simple mapping to symbols, sitting on top of that geometry.

In my post on the origins of AI I showed some examples of how perception was built up by looking for edges in images, and then working through rules on how edges might combine in real life to produce geometric models of what was in the world. I used this example of a complex scene with shadows:

In connecting cameras to computers and looking at the world, the lighting and the things allowed in the field of view often had to be constrained for the computer vision, the symbol grounding, to be successful. Below is a really fuzzy picture of the “copy-demo” at the MIT Artificial Intelligence Laboratory in 1970. Here the vision system looked at a stack of blocks and the robot tried to build a stack that looked the same.

At the same time a team at SRI International in Menlo Park, California, were building the robot Shakey, which operated in a room with large blocks, cubes and wedges, with each side painted in a different matte color, and with careful control of lighting.


By 1979 Hans Moravec at the Stanford Artificial Intelligence Lab had an outdoor capable robot, “The Cart”, in the center of the image here (a photograph that I took), which navigated around polyhedral objects, and other clutter. Since it took about 15 minutes to move one meter it did get a little confused by the high contrast moving shadows.

And here is the Freddy II robot at the Department of Artificial Intelligence at Edinburgh University in the mid 1970’s, stacking flat square and round blocks and inserting pegs into them.

These early experiments combined image to symbol mapping, along with extracting three dimensional geometry so that the robot could operate, using symbolic AI planning programs from end to end.

I think it is fair to say that those end to end goals have gotten a little lost over the years. As the complexities due to uncertainties when real objects are used have been realized, the tasks that AI robotics researchers focus on have been largely driven by a self defined research agenda, with proof of concept demonstrations as the goal.

And I want to be clear here. These AI based robotics systems are not used at all in industry. All the robots you see in factories (except those from my company, Rethink Robotics) are carefully programmed, in detail, to do exactly what they are doing, again, and again, and again. Although the lower levels of modeling robot dynamics and planning trajectories for a robot arm are shared with the AI robotics community, above that level it is very complete and precise scripting. The last forty years of AI research applied to factory robots has had almost no impact in practice.

On the other hand there has been one place from traditional robotics with AI that has had enormous impact. Starting with robots such as The Cart, above, people tried to build maps of the environment so that the system could deliberatively plan a route from one place to another which was both short in time to traverse and which would avoid obstacles or even rough terrain. So they started to build programs that took observations as the robot moved and tried to build up a map. They soon realized that because of uncertainties in how far the robot actually moved, and even more importantly what angle it turned when commanded, it was impossible to put the observations into a simple coordinate system with any certainty, and as the robot moved further and further the inaccuracies relative to the start of the journey just got worse and worse.
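To make that accumulation of error concrete, here is a minimal sketch, not taken from any of the systems mentioned above. It assumes a robot commanded to drive straight in one meter steps while actually turning by a small amount it does not know about on every step. The function names and the one degree error are hypothetical, and a real robot's errors would be random rather than a constant bias; the constant just keeps the example repeatable.

```python
import math

def dead_reckon(steps, step_len=1.0, turn_error_deg=0.0):
    """Integrate straight-line motion commands into (x, y) positions.

    turn_error_deg is an unnoticed heading error added on every step
    (an assumed constant bias, purely for illustration).
    """
    x = y = heading = 0.0
    path = []
    for _ in range(steps):
        heading += math.radians(turn_error_deg)
        x += step_len * math.cos(heading)
        y += step_len * math.sin(heading)
        path.append((x, y))
    return path

def drift(steps, turn_error_deg):
    """Distance between where the robot believes it is and where it really is."""
    believed = dead_reckon(steps)  # the robot assumes it turned by exactly zero
    actual = dead_reckon(steps, turn_error_deg=turn_error_deg)
    return [math.hypot(bx - ax, by - ay)
            for (bx, by), (ax, ay) in zip(believed, actual)]

errors = drift(steps=50, turn_error_deg=1.0)
for n in (1, 10, 50):
    print(f"after {n:2d} m: position error {errors[n - 1]:.2f} m")
```

The error after the first step is tiny, but because every later observation is placed relative to an already wrong heading, the gap between the believed and actual position keeps growing with distance travelled.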

In late 1984 both Raja Chatila from Toulouse, and I, newly a professor at MIT, realized that if the robot could recognize when it saw a landmark a second time after wandering around for a while it could work backwards through the chain of observations made in between, and tighten up all their uncertainties. We did not need to see exactly the same scene as before; all we needed was to locate one of the things that we had earlier labeled with a symbol, and to be sure that the new thing we saw labelled by the same symbol was in fact the same object in the world. This is now called “loop closing” and we independently published papers with this idea in March 1985 at a robotics conference held in St Louis (IEEE ICRA). But neither of us had very good statistical models, and mine was definitely worse than Raja’s.
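The flavor of working backwards through the chain can be shown with a deliberately tiny sketch. This is not the method from either 1985 paper; it is a one dimensional illustration that assumes each odometry step has a known variance, and that re-seeing the landmark tells us the true net displacement around the loop (zero, if we are back where we started). The discrepancy is then spread over the steps in proportion to their variances, which is the least squares answer for this simple case. All names and numbers are hypothetical.

```python
def close_loop(odometry, variances, loop_total=0.0):
    """Correct a chain of 1-D odometry steps using one loop closure.

    odometry   -- measured displacement of each step
    variances  -- assumed uncertainty (variance) of each measurement
    loop_total -- net displacement implied by recognizing the landmark
    """
    residual = loop_total - sum(odometry)
    total_var = sum(variances)
    # Less trusted steps absorb a larger share of the discrepancy.
    return [d + residual * v / total_var for d, v in zip(odometry, variances)]

def poses(steps):
    """Running position after each step, starting from zero."""
    out, x = [], 0.0
    for d in steps:
        x += d
        out.append(x)
    return out

# Out three steps and back three steps; the true motion is +1,+1,+1,-1,-1,-1.
measured = [1.05, 1.10, 0.98, -0.95, -1.02, -0.92]
variances = [0.01, 0.04, 0.01, 0.01, 0.01, 0.04]
corrected = close_loop(measured, variances)

print("end position before closing the loop:", round(poses(measured)[-1], 3))
print("end position after closing the loop: ", round(poses(corrected)[-1], 3))
print("corrected steps:", [round(d, 3) for d in corrected])
```

Note that the two steps with the largest assumed variance are the ones that move the most, which is the sense in which one recognized landmark tightens up every estimate made in between.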

By 1991 Hugh Durrant-Whyte and John Leonard, then both at Oxford, had come up with a much better formalization, which they originally called “Simultaneous Map Building and Localisation” (Oxford English spelling), which later turned into “Simultaneous Localisation and Mapping” or SLAM. Over the next fifteen years, hundreds, if not thousands, of researchers refined the early work, enabled by newly low cost and plentiful mobile robots (my company iRobot was supplying those robots as a major business during the 1990’s). With a formalized well defined problem, low cost robots, adequate computation, and researchers working all over the world, competing on performance, there was rapid progress. And before too long the researchers managed to get rid of symbolic descriptions of the world and do it all in geometry with statistical models of uncertainty.

The SLAM algorithms became part of the basis for self-driving cars, and subsystems derived from SLAM are used in all of these systems. Likewise the navigation and data collection from quadcopter drones is powered by SLAM (along with inputs from GPS).

4. Behavior-Based Robotics

By 1985 I had spent a decade working in computer vision, trying to extract symbolic descriptions of the world from images, and in traditional robotics, building planning systems for robots to operate in simulated or actual worlds.

I had become very frustrated.

Over the previous couple of years as I had tried to move from purely simulated demonstrations to getting actual robots to work in the real world, I had become more and more buried in mathematics that was all trying to estimate the uncertainty in what my programs knew about the real world. The programs were trying to measure the drift between the real world, and the perceptions that my robots were making of the world. We knew by this time that perception was difficult, and that neat mapping from perception to certainty was impossible. I was trying to accommodate that uncertainty and push it through my planning programs, using a mixture of traditional robotics and symbolic Artificial Intelligence. The hope was that by knowing how wide the uncertainty was the planners could accommodate all the possibilities in the actual physical world.

I will come back to the implicit underlying philosophical position that I was taking in the last major blog post in this series, to come out later this year.

But then I started to reflect on how well insects were able to navigate in the real world, and how they were doing so with very few neurons (certainly fewer than the number of artificial neurons in modern Deep Learning networks). In thinking about how this could be I realized that the evolutionary path that had led to simple creatures probably had not started out by building a symbolic or three dimensional modeling system for the world. Rather it must have begun by very simple connections between perceptions and actions.

In the behavior-based approach that this thinking has led to, there are many parallel behaviors running all at once, trying to make sense of little slices of perception, and using them to drive simple actions in the world. Often behaviors propose conflicting commands for the robot’s actuators and there has to be some sort of conflict resolution. But not wanting to get stuck going back to the need for a full model of the world, the conflict resolution mechanism is necessarily heuristic in nature. Just as one might guess, the sort of thing that evolution would produce.
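A minimal sketch of what such heuristic conflict resolution can look like follows. It is an illustration only, not the subsumption architecture as published or the code of any real robot: the sensor names, thresholds, and command units are all made up, and the behaviors are polled one after another on each tick rather than truly running in parallel. Each behavior looks at its own small slice of the sensor readings and either proposes a command or stays silent, and a fixed priority ordering decides who wins.

```python
from typing import Callable, Dict, List, Optional, Tuple

Command = Tuple[float, float]  # (forward speed, turn rate), hypothetical units
Behavior = Callable[[Dict[str, float]], Optional[Command]]

def avoid(sensors: Dict[str, float]) -> Optional[Command]:
    """Turn in place when something is close in front."""
    if sensors["front_range"] < 0.3:
        return (0.0, 1.0)
    return None

def seek_light(sensors: Dict[str, float]) -> Optional[Command]:
    """Steer toward the brighter side when there is a clear difference."""
    diff = sensors["light_left"] - sensors["light_right"]
    if abs(diff) > 0.1:
        return (0.2, 0.5 if diff > 0 else -0.5)
    return None

def wander(sensors: Dict[str, float]) -> Optional[Command]:
    """Default: just keep moving forward."""
    return (0.3, 0.0)

# Highest priority first. The ordering is the heuristic; no world model is consulted.
BEHAVIORS: List[Behavior] = [avoid, seek_light, wander]

def arbitrate(sensors: Dict[str, float]) -> Command:
    """Return the command of the highest-priority behavior that wants control."""
    for behavior in BEHAVIORS:
        command = behavior(sensors)
        if command is not None:
            return command
    return (0.0, 0.0)

print(arbitrate({"front_range": 0.1, "light_left": 1.0, "light_right": 0.0}))
print(arbitrate({"front_range": 2.0, "light_left": 1.0, "light_right": 0.0}))
print(arbitrate({"front_range": 2.0, "light_left": 0.5, "light_right": 0.5}))
```

In the first call both `avoid` and `seek_light` want the actuators, and the obstacle wins simply because it is listed first.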

Behavior-based systems work because the demands of physics on a body embedded in the world force the ultimate conflict resolution between behaviors, and the interactions. Furthermore by being embedded in a physical world, as a system moves about it detects new physical constraints, or constraints from other agents in the world. For synthetic characters in video games under the control of behavior trees, the demands of physics are replaced by the demands of the simulated physics needed by the rendering engine, and other agents in the world are either the human player or yet more behavior-based synthetic characters.

Just in the last few weeks there has been a great example of this that has gotten a lot of press. Here is the original story about MIT Professor Sangbae Kim’s Cheetah 3 robot. The press was very taken with the robot blindly climbing stairs, but if you read the story you will see that the point of the research is not to produce a blind robot per se. Computer vision, even 3-D vision, is not completely accurate. So any robot that tries to climb rough terrain using vision, rather than feel, needs to be very slow, carefully placing its feet one at a time, as it does not know exactly where the solid support in the world is. In this new work, Kim and his team have built a collection of low level behaviors which sense when things have gone wrong and quickly adapt individual legs. To prove the point, they made the robot completely blind–the performance of their robot will only increase as vision gives some high level direction to where the robot should aim its feet, but even so, having these reactive behaviors at the lowest levels makes it much faster and more sure footed.

The behavior-based approach, which leaves the model out in the world rather than inside the agent, has allowed robots to proliferate in number. Unfortunately, I often get attacked by people outside the field, saying in effect, we were promised super intelligent robots and all you have given us is robot vacuum cleaners. Sorry, it is a work in progress. At least I gave you something practical…

Comparing The Four Approaches to AI

In my 1990 paper Elephants Don’t Play Chess, in the first paragraph of the second page I mentioned that the “holy grail” for both classical symbolic AI research, and my own research was “general purpose human level intelligence”–the youngsters today saying that the goal of AGI, or Artificial General Intelligence is a new thing are just plain wrong. All four of the approaches I outlined above have been motivated by eventually getting to human level intelligence, and perhaps beyond.

None of them are yet close by themselves, nor have any combinations of them turned into something that seems close. But each of the four approaches has somewhat unique strengths. And all of them have easily identifiable weaknesses.

The use of symbols in AI research allows one to use them as the currency of composition between different aspects of intelligence, passing the symbols from one reasoning component to another. For neural networks fairly weak symbols appear only as outputs and there is no way to feed them back in, or really to other networks. Traditional robotics trades on geometric relationships and coordinates, which means they are easy to compose, but they are very poor in semantic content. And behavior-based systems are sub-symbolic, although there are ways to have some sorts of proto symbols emerge.

Neural networks have been the most successful approach to getting meaningful symbols out of perceptual inputs. The other approaches either don’t try to do so (traditional robotics) or have not been particularly successful at it.

Hard local coordinate systems, with solid statistical relationships between them, have become the evolved modern approach to traditional robotics. Both symbolic AI and behavior-based systems are weak in having different parts of the systems relate to common, or even well understood relative, coordinate systems. And neural networks simply suck (yes, suck) at spatial understanding.

Of the four approaches to AI discussed here, only the behavior-based approach makes a commitment to an ongoing existence of the system; the others, especially neural networks, are much more transactional in nature. And the behavior-based approach reacts to changes in the world on millisecond timescales, as it is embedded, and “living” in the real world. Or in the case of characters in video games, it is well embedded in the matrix. This ability to be part of the world, and to have agency within it is at some level an artificial sentience. A messy philosophical term to be sure. But I think all people who ever utter the phrase Artificial General Intelligence, or utter, or mutter, Super Intelligence, are expecting some sort of sentience. No matter how far from reality the sentience of behavior-based systems may be, it is the best we have got. By a long shot.

I have attempted to score the four approaches on where they are better and where worse. The scale is one to three with three being a real strength of the approach. Notice that besides the four different strengths I have added a column for how well the approaches deal with ambiguity.

These are very particular capabilities that one or the other of the four approaches does better at. But for general intelligence I think we need to talk about cognition. You can find many definitions of cognition but they all have to do with thinking. And the definitions talk about thinking variously in the context of attention, memory, language understanding, perception, problem solving and others. So my scores are going to be a little subjective, I admit.

If we think about a Super Intelligent AI entity, one might want it to act in the world with some sort of purpose. For symbolic AI and traditional robotics there has been a lot of work on planners, programs that look at the state of the world and try to work out a series of actions that will get the world (and the embedded AI system or robot) into a more desirable state. These planners, largely symbolic, and perhaps with a spatial component for robots, started out relying on full knowledge of the state of the world. In the last couple of decades that work has concentrated on finessing the impossibility of knowing the state of the world in detail. But such planners are quite deliberative in working out what is going to happen ahead of time. By contrast the behavior-based approaches started out as purely reactive to how the world was changing. This has made them much more robust in the real world, which is why the vast majority of deployed robots in the world are behavior-based. With the twenty year old innovation of behavior trees these systems can appear much more deliberative, though they lack the wholesale capability of dynamically re-planning that symbolic systems have. This table summarizes:

Note that neural nets are neither. There has been a relatively small amount of non-mainstream work of getting neural nets to control very simple robots, mostly in simulation only. The vast majority of work on neural networks has been to get them to classify data in some way or another. They have never been a complete system, and indeed all the recent successes of neural networks have had them embedded as part of symbolic AI systems or behavior-based systems.

To end Part I of our Steps Towards Super Intelligence, let’s go back to our comparison of the four approaches to Artificial Intelligence. Let’s see how well we are really doing (in my opinion) by comparing them to a human child.

Recall the scale here is one to three. I have added a column on the right on how well they do at cognition, and a row on the bottom on how well a human child does in comparison to each of the four AI approaches.  One to three.

Note that under this evaluation a human child scores six hundred points whereas the four AI approaches score a total of eight or nine points each. As usual, I think I may have grossly overestimated the current capabilities of AI systems.

Next up: Part II, beyond the Turing Test.

1 This pronoun is often capitalized in this quote, but in my version of the King James Bible, which was presented to my grandmother in 1908, it is just plain “his” without capitalization. Genesis 1:27.

2 In the dedication of this 1973 PhD thesis at the MIT Artificial Intelligence Lab, to the Maharal of Prague–the creator of the best known Golem, Gerry Sussman points out that the Rabbi had noticed that this line was recursive. That observation has stayed with me since I first read it in 1979, and it inspired my first two lines of this blog post.

3 I am using the male form here only for stylistic purposes to resonate with the first sentence.

4 It appeared in the IEEE Journal of Robotics and Automation, Vol. 2, No. 1, March 1986, pp 14–23. Both reviewers for the paper recommended against publishing it, but the editor, Professor George Bekey of USC, used his discretion to override them and to go ahead and put it into print.

5 I chose this form, g0047, for anonymous symbols as that is exactly the form in which they are generated in the programming language Lisp, which is what most early work in AI was written in, and is still out there being used to do useful work.

Bothersome Bystanders and Self Driving Cars

A story on how far away self-driving cars are just came out in The Verge.  It is more pessimistic than most on when we will see truly self-driving cars on our existing roads. For those of you who have read my blog posts on the unexpected consequences and the edge cases for self-driving cars or my technology adoption predictions, you will know that I too am pessimistic about when they will actually arrive. So, I tend to agree with this particular story and about the outstanding problems for AI that are pointed out by various people interviewed for the story.

BUT, there is one section that stands out for me.

Drive.AI founder Andrew Ng, a former Baidu executive and one of the industry’s most prominent boosters, argues the problem is less about building a perfect driving system than training bystanders to anticipate self-driving behavior. In other words, we can make roads safe for the cars instead of the other way around. As an example of an unpredictable case, I asked him whether he thought modern systems could handle a pedestrian on a pogo stick, even if they had never seen one before. “I think many AV teams could handle a pogo stick user in pedestrian crosswalk,” Ng told me. “Having said that, bouncing on a pogo stick in the middle of a highway would be really dangerous.” 

“Rather than building AI to solve the pogo stick problem, we should partner with the government to ask people to be lawful and considerate,” he said. “Safety isn’t just about the quality of the AI technology.”

Now I really hope that Andrew didn’t say all this stuff.  Really, I hope that.  So let’s assume someone else actually said this.  Let’s call him Professor Confused, whoever he was, just so we can reference him.

The quoted section above is right after two paragraphs about recent fatal accidents involving self-driving cars (though probably none of them should have been left unattended by the person in the driver’s seat in each case). Of the three accidents, only one involves an external person, the woman pushing a bicycle across the road in Phoenix this last March, killed by an experimental Uber vehicle driving itself.

In the first sentence Professor Confused seems to be saying that he is giving up on the promise of self-driving cars seamlessly slotting into the existing infrastructure. Now he is saying that every person, every “bystander”, is going to be responsible for changing their behavior to accommodate imperfect self-driving systems. And they are all going to have to be trained! I guess that means all of us.


The great promise of self-driving cars has been that they will eliminate traffic deaths. Now Professor Confused is saying that they will eliminate traffic deaths as long as all humans are trained to change their behavior?  What just happened?

If changing everyone’s behavior is on the table then let’s change everyone’s behavior today, right now, and eliminate the annual 35,000 fatalities on US roads, and the 1 million annual fatalities world-wide. Let’s do it today, and save all those lives.

Professor Confused suggests having the government ask people to be lawful. Excellent idea! The government should make it illegal for people to drive drunk, and then ask everyone to obey that law. That will eliminate half the deaths in the US immediately.  Let’s just do that today!

Oh, wait…

I don’t know who the real Professor Confused is that the reporter spoke to. But whoever it is just completely upended the whole rationale for self-driving cars. Now the goal, according to Professor Confused, as reported here, is self-driving cars, right or wrong, über alles (so to speak). And you people who think you know how to currently get around safely on the street better beware, or those self-driving cars are licensed to kill you and it will be your own damn fault.

PS This is why the world’s relative handful of self-driving train systems have elaborate safeguards to make sure that people can never get on to the tracks. Take a look next time you are at an airport and you will see the glass wall and doors that keep you separated from the track at all times when you are on the platform. And the track does not intersect with any pedestrian or other transport route. The track is routed above and under them all. We are more likely to geofence self-driving cars than accept poor safety from them in our human spaces.

PPS Dear Professor Confused, first rule of product management. If you need the government to coerce a change in behavior of all your potential customers in order for them to become your actual customers, then you don’t got no customers for what you are trying to sell. Hmmm. In this case I guess they are not your customers. They are just the potential literal roadkill in the self-satisfaction your actual customers will experience knowing that they have gotten just the latest gee whiz technology all for themselves.

[FoR&AI] The Origins of “Artificial Intelligence”

Past is prologue1.

I mean that in both of the ways people interpret Shakespeare’s meaning when he has Antonio utter the phrase in The Tempest.

In one interpretation it is that the past has predetermined the sequence which is about to unfold–and so I believe that how we have gotten to where we are in Artificial Intelligence will determine the directions we take next–so it is worth studying that past.

Another interpretation is that really the past was not much and the majority of necessary work lies ahead–that too, I believe. We have hardly even gotten started on Artificial Intelligence and there is lots of hard work ahead.


It is generally agreed that John McCarthy coined the phrase “artificial intelligence” in the written proposal2 for a 1956 Dartmouth workshop, dated August 31st, 1955. It is authored by, in listed order, John McCarthy of Dartmouth, Marvin Minsky of Harvard, Nathaniel Rochester of IBM and Claude Shannon of Bell Laboratories. Later all but Rochester would serve on the faculty at MIT, although by early in the sixties McCarthy had left to join Stanford University. The nineteen page proposal has a title page and an introductory six pages (1 through 5a), followed by individually authored sections on proposed research by the four authors. It is presumed that McCarthy wrote those first six pages which include a budget to be provided by the Rockefeller Foundation to cover 10 researchers.

The title page says A PROPOSAL FOR THE DARTMOUTH SUMMER RESEARCH PROJECT ON ARTIFICIAL INTELLIGENCE. The first paragraph includes a sentence referencing “intelligence”:

The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.

And then the first sentence of the second paragraph starts out:

The following are some aspects of the artificial intelligence problem:

That’s it! No description of what human intelligence is, no argument about whether or not machines can do it (i.e., “do intelligence”), and no fanfare on the introduction of the term “artificial intelligence” (all lower case).

In the linked file above there are an additional four pages dated March 6th, 1956, by Allen Newell and Herb Simon, at that time at the RAND Corporation and Carnegie Institute of Technology respectively (later both were giants at Carnegie Mellon University), on their proposed research contribution. They say that they are engaged in a series of forays into the area of complex information processing, and that a “large part of this activity comes under the heading of artificial intelligence”. It seems that the phrase “artificial intelligence” was easily and quickly adopted without any formal definition of what it might be.

In McCarthy’s introduction, and in the outlines of what the six named participants intend to research there is no lack of ambition.

The speeds and memory capacities of present computers may be insufficient to simulate many of the higher functions of the human brain, but the major obstacle is not lack of machine capacity, but our inability to write programs taking full advantage of what we have.

Some of the AI topics that McCarthy outlines in the introduction are how to get a computer to use human language, how to arrange “neuron nets” (they had been invented in 1943–a little while before today’s technology elite first heard about them and started getting over-excited) so that they can form concepts, how a machine can improve itself (i.e., learn or evolve), how machines could form abstractions from using their sensors to observe the world, and how to make computers think creatively. These topics are expanded upon in the individual work proposals by Shannon, Minsky, Rochester, and McCarthy. The addendum from Newell and Simon adds to the mix getting machines to play chess (including through learning), and prove mathematical theorems, along with developing theories on how machines might learn, and how they might solve problems similar to problems that humans can solve.

No lack of ambition! And recall that at this time there were only a handful of digital computers in the world, and none of them had more than at most a few tens of kilobytes of memory for running programs and data, and only punched cards or paper tape for long term storage.

McCarthy was certainly not the first person to talk about machines and “intelligence”, and in fact Alan Turing had written and published about it before this, but without the moniker of “artificial intelligence”. His best known foray is Computing Machinery and Intelligence3 which was published in October 1950. This is the paper where he introduces the “Imitation Game”, which has come to be called the “Turing Test”, where a person is to decide whether the entity they are conversing with via a 1950 version of instant messaging is a person or a computer. Turing estimates that in the year 2000 a computer with 128MB of memory (he states it as 10^9 binary digits) will have a 70% chance of fooling a person.

Although the title of the paper has the word “Intelligence” in it, there is only one place where that word is used in the body of the paper (whereas “machine” appears at least 207 times), and that is to refer to the intelligence of a human who is trying to build a machine that can imitate an adult human. His aim however is clear. He believes that it will be possible to make a machine that can think as well as a human, and by the year 2000. He even estimates how many programmers will be needed (sixty is his answer, working for fifty years, so only 3,000 programmer years–a tiny number by the standards of many software systems today).
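The two numbers from Turing’s paper are easy to check with a few lines of arithmetic. This is just a unit conversion and a multiplication, using only the figures quoted above; the variable names are mine. It shows that 10^9 binary digits is 125 million bytes, which is why it gets rounded to the familiar power-of-two size of 128MB.

```python
# Turing's storage estimate, converted to modern units.
bits = 10**9
size_bytes = bits // 8
decimal_mb = size_bytes / 10**6   # megabytes counted in powers of ten
binary_mib = size_bytes / 2**20   # mebibytes counted in powers of two

# Turing's staffing estimate.
programmers = 60
years = 50
programmer_years = programmers * years

print(f"{bits:,} bits = {size_bytes:,} bytes = {decimal_mb:.0f} MB = {binary_mib:.1f} MiB")
print(f"{programmers} programmers x {years} years = {programmer_years:,} programmer years")
```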

In a slightly earlier 1948 paper titled Intelligent Machinery but not published4 until 1970, long after his death, Turing outlined the nature of “discrete controlling machines”, what we would today call “computers”, as he had essentially invented digital computers in a paper he had written in 1937. He then turns to making a machine that fully imitates a person, even though, as he reasons, the brain part might be too big to be contained within the locomoting sensing part of the machine, and instead must operate it remotely. He points out that the sensors and motor systems of the day might not be up to it, so concludes that to begin with the parts of intelligence that may be best to investigate are games and cryptography, and to a lesser extent translation of languages and mathematics.

Again, no lack of ambition, but a bowing to the technological realities of the day.

When AI got started the clear inspiration was human level performance and human level intelligence. I think that goal has been what attracted most researchers into the field for the first sixty years. The fact that we do not have anything close to succeeding at those aspirations says not that researchers have not worked hard or have not been brilliant. It says that it is a very hard goal.

I wrote a (long) paper  Intelligence without Reason5 about the pre-history and early days of Artificial Intelligence in 1991, twenty seven years ago, and thirty five years into the endeavor. My current blog posts are trying to fill in details and to provide an update for a new generation to understand just what a long term project this is. To many it all seems so shiny and exciting and new. Of those, it is exciting only.


In the early days of AI there were very few ways to connect sensors to digital computers or to let those computers control actuators in the world.

In the early 1960’s people wanting to work on computer vision algorithms had to take photographs on film, turn them into prints, attach the prints to a drum, then have that drum rotate and move up and down next to a single light brightness sensor to turn the photo into an array of intensities. By the late seventies, with twenty or thirty pounds of equipment, costing tens of thousands of dollars, a researcher could get a digital image directly from a camera into a computer. Things did not become simple-ish until the eighties and they have gotten progressively simpler and cheaper over time.

Similar stories hold for every other sensor modality, and also for output–turning results of computer programs into physical actions in the world.

Thus, as Turing had reasoned, early work in Artificial Intelligence turned towards domains where there was little need for sensing or action. There was work on games, where human moves could easily be input and output to and from a computer via a keyboard and a printer, mathematical exercises such as calculus applied to symbolic algebra, or theorem proving in logic, and to understanding typed English sentences that were arithmetic word problems.

Writing programs that could play games quickly led to the idea of “tree search” which was key to almost all of the early AI experiments in the other fields listed above, and indeed, is now a basic tool of much of computer science. Playing games early on also provided opportunities to explore Machine Learning and to invent a particular variant of it, Reinforcement Learning, which was at the heart of the recent success of the AlphaGo program. I described this early history in more detail in my August 2017 post Machine Learning Explained.
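For readers who have not seen it, here is about the smallest example of game tree search I can think of. It is a generic textbook sketch, not a reconstruction of any of the early game playing programs, and the game is my choice: a pile of stones, each player removes one, two, or three on their turn, and whoever takes the last stone wins. The program looks ahead through every sequence of moves and asks whether the player to move can force a win. The cache is an addition for speed; without it the same positions would be searched over and over.

```python
from functools import lru_cache
from typing import Optional

MOVES = (1, 2, 3)  # stones a player may remove on one turn

@lru_cache(maxsize=None)
def can_win(stones: int) -> bool:
    """True if the player to move can force a win with `stones` left.

    With no stones left the player to move has already lost, because
    the opponent took the last stone; any() over no moves is False.
    """
    return any(not can_win(stones - take) for take in MOVES if take <= stones)

def best_move(stones: int) -> Optional[int]:
    """A move that leaves the opponent in a losing position, if one exists."""
    for take in MOVES:
        if take <= stones and not can_win(stones - take):
            return take
    return None

for pile in (4, 5, 7):
    print(pile, "stones:", "win" if can_win(pile) else "loss", "move:", best_move(pile))

print("losing piles up to 12:", [n for n in range(1, 13) if not can_win(n)])
```

The same idea, searching a tree of possible futures and backing up the results, carries over to much bigger games, where the tree is far too large to search completely and has to be cut off and estimated.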

Before too long a domain known as blocks world was invented where all sorts of problems in intelligence could be explored. Perhaps the first PhD thesis on computer vision, by Larry Roberts at MIT in 1963, had shown that with a carefully lighted scene, all the edges of wooden blocks with planar surfaces could be recovered.

That validated the idea that it was OK to work on complex problems with blocks where the description of their location or their edges was the input to the program, as in principle the perception part of the problem could be solved. This then was a simulated world of perception and action, and it was the principal test bed for AI for decades.

Some people worked on problem solving in a two dimensional blocks world with an imagined robot that could pick up and put down blocks from the top of a stack, or on a simulated one dimensional table.

Others worked on recovering the geometry of the underlying three dimensional blocks from just the input lines, including with shadows, paving the way for future more complete vision systems than Roberts had demonstrated.

And yet others worked on complex natural language understanding, and all sorts of problem solving in worlds with complex three dimensional blocks.

No one worked in these blocks worlds because that was their ambition. Rather they worked in them because with the tools they had available they felt that they could make progress on problems that would be important for human level intelligence. At the same time they did not think that was just around the corner, one magic breakthrough away from all being understood, implemented, and deployed.

Over time many sub-disciplines in AI developed as people got deeper and deeper into the approaches to particular sub-problems that they had discovered. Before long there was enough new work coming out that no-one could keep up with the breadth of AI research. The names of the sub-disciplines included planning, problem solving, knowledge representation, natural language processing, search, game playing, expert systems, neural networks, machine inference, statistical machine learning, robotics, mobile robotics, simultaneous localization and mapping, computer vision, image understanding, and many others.

Breakoff Groups

Often, as a group of researchers found a common set of problems to work on they would break off from the mainstream and set up their own journals and conferences where reviewing of papers could all be done by people who understood the history and context of the particular problems.

I was involved in two such break off groups in the late 1980’s and early 1990’s, both of which still exist today: Artificial Life, and Simulation of Adaptive Behavior. The first of these looks at fundamental mechanisms of order from disorder and includes evolutionary processes. The second looks at how animal behaviors can be generated by the interaction of perception, action, and computation. Both of these groups and their journals are still active today.

Below is my complete set of the Artificial Life journal from when it was published on paper from 1993 through to 2014. It is still published online today, by the MIT Press.

There were other journals on Artificial Life, and since 1989 there have been international conferences on it. I ran the 1994 conference and there were many hundreds of participants and there were 56 carefully reviewed papers published in hard copy proceedings which I co-edited with Pattie Maes; all those papers are now available online.

And here is my collection of the Adaptive Behavior journal from when it was published on paper from 1992 through to 2013. It is still published online today, by Sage.

And there has always been a robust series of major conferences, called SAB, for Simulation of Adaptive Behavior with paper and now online proceedings.

The Artificial Life Conference will be in Tokyo this year in July, and the SAB conference will be in Frankfurt in August. Each will attract hundreds of researchers. And the 20+ volumes of each of the journals above have 4 issues each, so close to 100 issues, with 4 to 10 papers each, so many hundreds of papers in the journal series. These communities are vibrant and the Artificial Life community has had some engineering impact in developing genetic algorithms which are in use in some number of applications.

But neither the Artificial Life community nor the Simulation of Adaptive Behavior community has succeeded at its early goals.

We still do not know how living systems arise from non-living systems, and in fact still do not have good definitions of what life really is. We do not have generally available evolutionary simulations which let us computationally evolve better and better systems, despite the early promise when we first tried it. And we have not figured out how to evolve systems that have even the rudimentary components of a complete general intelligence, even for very simple creatures.

On the SAB side we still can not computationally simulate the behavior of the simplest creature that has been studied at length. That is the tiny worm C. elegans, which has 959 cells total of which 302 are neurons. We know its complete connectome (and even its 56 glial cells), but still we can’t simulate how they produce much of its behavior.

I tell these particular stories not because they were uniquely special, but because they give an idea of how research in hard problems works, especially in academia. There were many, many (at least twenty or thirty) other AI subgroups with equally specialized domains that split off. They sometimes flourished and sometimes died off. All those subgroups gave themselves unique names, but were significant in size, in numbers of researchers and in active sharing and publication of ideas.

But all researchers in AI were, ultimately, interested in full scale general human intelligence. Often their particular results might seem narrow, and in application to real world problems were very narrow. But general intelligence has always been the goal.

I will finish this section with a story of a larger scale specialized research group, that of computer vision. That specialization has had real engineering impact. It has had four or more major conferences per year for thirty five plus years. It has half a dozen major journals. I cofounded one of them in 1987, with Takeo Kanade, the International Journal of Computer Vision, which has had 126 volumes (I only stayed as an editor for the first seven volumes) and 350 issues since then, with 2,080 individual articles. Remember, that is just one of the half dozen major journals in the field. The computer vision community is what a real large push looks like. This has been a sustained community of thousands of researchers world wide for decades.


I think the press, and those outside of the field have recently gotten confused by one particular spin off name, that calls itself AGI, or Artificial General Intelligence. And the really tricky part is that there are a bunch of completely separate spin off groups that all call themselves AGI, but as far as I can see really have very little commonality of approach or measures of progress. This has gotten the press and people outside of AI very confused, thinking there is just now some real push for human level Artificial Intelligence, that did not exist before. They then get confused that if people are newly working on this goal then surely we are about to see new astounding progress. The bug in this line of thinking is that thousands of AI researchers have been working on this problem for 62 years. We are not at any sudden inflection point.

There is a journal of AGI, which you can find here. Since 2009 there have been a total of 14 issues, many with only a single paper, and only 47 papers in total over that ten year period. Some of the papers are predictions about AGI, but most are very theoretical, modest, papers about specific logical problems, or architectures for action selection. None talk about systems that have been built that display intelligence in any meaningful way.

There is also an annual conference for this disparate group, since 2008, with about 20 papers, plus or minus, per year, just a handful of which are online, at the authors’ own web sites. Again the papers range from risks of AGI to very theoretical specialized, and obscure, research topics. None of them are close to any sort of engineering.

So while there is an AGI community it is very small and not at all working on any sort of engineering issues that would result in any actual Artificial General Intelligence in the sense that the press means when it talks about AGI.

I dug a little deeper and looked at two groups that often get referenced by the press in talking about AGI.

One group, perhaps the most referenced group by the press, styles themselves as an East San Francisco Bay Research Institute working on the mathematics of making AGI safe for humans. Making safe human level intelligence is exactly the goal of almost all AI researchers. But most of them are sanguine enough to understand that that goal is a long way off.

This particular research group lists all their publications and conference presentations from 2001 through 2018 on their web site. This is admirable, and is a practice followed by most research groups in academia.

Since 2001 they have produced 10 archival journal papers (but see below), made 29 presentations at conferences, written 9 book chapters, and have 45 additional internal reports, for a total output of 93 things–about what one would expect from a single middle of the pack professor, plus students, at a research university. But 36 of those 93 outputs are simply predictions of when AGI will be “achieved”, so cut it down to 57 technical outputs, and then look at their content. All of them are very theoretical mathematical and logical arguments about representation and reasoning, with no practical algorithms, and no applications to the real world. Nothing they have produced in 18 years has been taken up and used by anyone else in any application or demonstration anywhere.

And the 10 archival journal papers, the only ones that have a chance of being read by more than a handful of people? Every single one of them is about predicting when AGI will be achieved.

This particular group gets cited by the press and by AGI alarmists again and again. But when you look there with any sort of critical eye, you find they are not a major source of progress towards AGI.

Another group that often gets cited as a source for AGI, is a company in Eastern Europe that claims it will produce an Artificial General Intelligence within 10 years. It is only a company in the sense that one successful entrepreneur is plowing enough money into it to sustain it. Again let’s look at what its own web site tells us.

In this case they have been calling for proposals and ideas from outsiders, and they have distilled that input into the following aspiration for what they will do:

We plan to implement all these requirements into one universal algorithm that will be able to successfully learn all designed and derived abilities just by interacting with the environment and with a teacher.

Yeah, well, that is just what Turing suggested in 1948.  So this group has exactly the same aspiration that has been around for seventy years. And they admit it is their aspiration but so far they have no idea of how to actually do it. Turing, in 1948, at least had a few suggestions.

If you, as a journalist, or a commentator on AI, think that the AGI movement is large and vibrant and about to burst onto the scene with any engineered systems, you are confused. You are really, really confused.

Journalists, and general purpose prognosticators, please, please, do your homework. Look below the surface and get some real evaluation on whether groups that use the phrase AGI in their self descriptions are going to bring you human level Artificial Intelligence, or indeed whether they are making any measurable progress towards doing so. It is tempting to see the ones out on the extreme, who don’t have academic appointments, working valiantly, and telling stories of how they are different and will come up with something new and unique, as the brilliant misfits. But in all probability they will not succeed in decades, just as the Artificial Life and the Simulation of Adaptive Behavior groups that I was part of have still not succeeded in their goals of almost thirty years ago.

Just because someone says they are working on AGI, Artificial General Intelligence, that does not mean they know how to build it or how long it might take, or that they are necessarily making any progress at all. These lacks have been the historical norm. Certainly the founding researchers in Artificial Intelligence in the 1950’s and 1960’s thought that they were working on key components of general intelligence. But that does not mean they got close to their goal, even when they thought it was not so very far off.

So, journalists, don’t you dare, don’t you dare, come back to me in ten years and say where is that Artificial General Intelligence that we were promised? It isn’t coming any time soon.

And while we are on catchy names, let’s not forget “deep learning”. I suspect that the word “deep” in that name leads outsiders a little astray. Somehow it suggests that there is perhaps a deep level of understanding that a “deep learning” algorithm has when it learns something. In fact the learning is very shallow in that sense, and not at all what “deep” refers to. The “deep” in “deep learning” refers to the number of layers of units or “neurons” in the network.

When back propagation, the actual learning mechanism used in deep learning, was developed in the 1980’s most networks had only two or three layers. The revolutionary new networks are the same in structure as 30 years ago but have as many as 12 layers. That is what the “deep” is about, 12 versus 3. In order to make learning work on these “deep” networks there had to be lots more computer power (Moore’s Law took care of that over 30 years), a clever change to the activation function in each neuron, and a way to train the network in stages known as clamping. But not deep understanding.
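To make the “12 versus 3” point concrete, here is a minimal sketch in plain Python. Everything specific in it is my assumption for illustration: the layer width of 4, the random weights, and the use of the rectified linear unit (ReLU) as a stand-in for the changed activation function. It does no training at all; it only shows that “deep” is a count of stacked layers.

```python
import random

def relu(x):
    # Rectified linear unit: an assumed stand-in for the changed activation function.
    return x if x > 0.0 else 0.0

def make_network(layer_sizes, seed=0):
    """Build one random weight matrix per layer of units after the input."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        layers.append([[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
                       for _ in range(n_out)])
    return layers

def forward(layers, inputs):
    """Push an input vector through every layer in turn."""
    activations = inputs
    for weights in layers:
        activations = [relu(sum(w * a for w, a in zip(row, activations)))
                       for row in weights]
    return activations

def depth(layers):
    # "Deep" refers to this number and nothing else.
    return len(layers)

shallow = make_network([4] * 4)   # input plus 3 layers of units, 1980's style
deep = make_network([4] * 13)     # input plus 12 layers of units

print(depth(shallow), depth(deep))                 # 3 12
print(len(forward(deep, [1.0, 0.5, -0.5, 2.0])))   # 4
```

The two networks are built by exactly the same code; the only difference is how many times the same kind of layer is stacked.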


Why did I post this? I want to clear up some confusions about Artificial Intelligence, and the goals of people who do research in AI.

There have certainly been a million person-years of AI research carried out since 1956 (much more than the three thousand that Alan Turing thought it would take!), with an even larger number of person-years applied to AI development and deployment.

We are way off the early aspirations of how far along we would be in Artificial Intelligence by now, or by the year 2000 or the year 2001. We are not close to figuring it out. In my next blog post, hopefully in May of 2018 I will outline all the things we do not understand yet about how to build a full scale artificially intelligent entity.

My intent of that coming blog post is to:

  1. Stop people worrying about imminent super intelligent AI (yet, I know, they will enjoy the guilty tingly feeling thinking about it, and will continue to irrationally hype it up…).
  2. Suggest directions of research which can have real impact on the future of AI, and accelerate it.
  3. Show just how much fun research remains to be done, and so encourage people to work on the hard problems, and not just the flashy demos that are hype bait.

In closing, I would like to share Alan Turing’s last sentence from his paper “Computing Machinery and Intelligence”, just as valid today as it was 68 years ago:

We can only see a short distance ahead, but we can see plenty there that needs to be done.

1This whole post started out as a footnote to one of the two long essays in the FoR&AI series that I am working on. It clearly got too long to be a footnote, but is somewhat shorter than my usual long essays.

2I have started collecting copies of hard to find historical documents and movies about AI in one place, as I find them in obscure nooks of the Web, where the links may change as someone reorganizes their personal page, or on a course page. Of course I can not guarantee that this link will work forever, but I will try to maintain it for as long as I am able. My web address has been stable for almost a decade and a half already.

3This version is the original full version as it appeared in the journal Mind, including the references. Most of the versions that can be found on the Web are a later re-typesetting without references and with a figure deleted–and I have not fully checked them for errors that might have been introduced–I have noticed at least one place where 109 has been substituted for 10^9. That is why I have tracked down the original version to share here.

4His boss at the National Physical Laboratory (NPL), Sir Charles Darwin, grandson of that Charles Darwin, did not approve of what he had written, and so the report was not allowed to be published. When it finally appeared in 1970 it was labelled as the “prologue” to the fifth volume of an annual series of volumes titled “Machine Intelligence”, produced in Britain, and in this case edited by Bernard Meltzer and Donald Michie, the latter a war time colleague of Turing at Bletchley Park. They too, used the past as prologue.

5This paper was written on the occasion of my being co-winner (with Martha Pollack, now President of Cornell University) in 1991 of the Computers and Thought award that is given at the bi-annual International Joint Conference on Artificial Intelligence (IJCAI) to a young researcher. There was some controversy over whether at age 36 I was still considered young and so the rules were subsequently tightened up in a way that guarantees that I will forever be the oldest recipient of this award. In any case I had been at odds with the conventional AI world for some time (I seem to remember a phrase including “angry young man”…) so I was very grateful to receive the award. The proceedings of the conference had a six page, double column, limit on contributed papers. As a winner of the award I was invited to contribute a paper with a relaxed page limit. I took them at their word and produced a paper which spanned twenty seven pages and was over 25,000 words long! It was my attempt at a scholarly deconstruction of the field of AI, along with the path forward as I saw it.


Time Traveling Refugees

This last week has seen an MIT Technology Review story about a startup company, Nectome1, that is developing a mind-uploading service that is “100 percent fatal”.

The idea is that when you are ready, perhaps when you are terminally ill, you get connected to a heart-lung machine and then, under anesthesia, you get injected with chemicals that preserve your brain and all its synaptic connections. Then you are dead and embalmed, and you wait.

When sufficiently advanced technology2 is available, future generations will map out all the neurons and connections3, put them into a computer simulation, and voilà you will be alive again, living in the cloud, or whatever it is called at that time–heaven perhaps?

At this point the Y Combinator backed company has no clue on how the brains are going to be brought back to life, and that is not important to their business model. That is for future generations to work out. So far they have been preserving rabbit brains, and intend to move on to larger brains and to eventually get to human brains. They already have 25 human customers who have put down cash deposits to be killed in this way. As the founders say “Product-market fit is people believing that it works.”

If I were a customer I would insist that I be packed along with a rabbit or three who had undergone exactly the same procedure as me. That way my future saviors could make sure their resurrection procedure worked and produced a viable bunny in the cloud before pulling apart my preserved brain to construct the digital model.

But this is not new. Not at all.

I have personally known people, while they were alive, who are now frozen heads floating in liquid nitrogen. Their heads were removed from their bodies right after their natural death, and immediately frozen. All these floating heads, there are hundreds already with thousands more signed up for when their day comes, are waiting for a future society to repair whatever damage there might be to their brains. And then, these kind souls from the future will with some as yet uninvented technology, bring them back for a glorious awakening, perhaps with a newly fabricated body, or perhaps just in a virtual reality world, as is the case for Nectome customers.

In any case, when these friends signed up for having their heads chopped off for an indeterminably long time in limbo, they knew that when they did rise from the dead they would do so in a technological heaven. A place in the future with knowledge and understanding of the universe that seemed unimaginable in their own lifetimes. They knew what a glorious future awaited them.

They had faith (and faith is always an essential part of any expectation of an eternal life) that the company they entrusted their heads to would continue to exist and keep them in as safe an environment as possible. And they had faith that future society would both find a way and would be more than happy to go through the process of raising them from the dead.

Now let’s examine these two assumptions for a moment; firstly that the physical thing that was once your brain is going to stay preserved, and secondly that the archangels of the future are going to go to the trouble of bringing you back to life and make sure that you have a good new life in that future.

Now I am not cynical about people in business. No, not at all, not even a tiny little bit. And certainly not about anyone in Silicon Valley. No, certainly not. But, but, it just does seem a tiny bit convenient for a scam business model to be structured so that all of the people who personally paid up front for future services from you are now dead, and will not be around to complain if you do not deliver as promised. Just saying… In fact, one of the early frozen body (before they realized that people would be just as happy to have just their heads frozen) businesses did not keep things frozen, and eleven bodies ended up decomposing.

But let’s assume that everyone is sincere and really wants you to be around to come back from the dead. Will future society want to put in the effort to make it so?

We have one, just one, partial experience concerning a long frozen time traveller. In 1991 a body was discovered high in the mountains just (about 100 meters) on the Italian side of the border with Austria, sticking out of a melting glacier. At first police checked out whether it was a case of foul play, and ultimately after some years it was ascertained that Otzi, as the living version of the body has come to be called, was indeed a murder victim. But the dastardly deed was performed a bit over 5,000 years ago! Otzi was left for dead on the ice, snow soon fell and covered him, and before long he was frozen solid and remained that way until humankind’s penchant for driving gasoline powered automobiles got the better of him.

Otzi has been a fantastic source for expanding our knowledge of how people lived in Europe before there were written records. It was easy to identify the materials from which his clothes, weapons, and tools were made. His teeth and bones revealed his nutritional history. His body, bones, and scars revealed his injury history. The contents of his stomach revealed what he had eaten as his last meal. And his tools and weapons indicated much about his lifestyle.

Late 20th century medicine had no idea at all about how to bring Otzi back to life. If we could have done it I am very sure that we would have. He and his caretakers would not have spoken a common language, but soon each would have adapted enough to have robust communication. His original language would have been another source of great new knowledge. But more, what he could tell us about his times would have expanded our knowledge at least ten times beyond what was achieved by examining him and his accoutrements.

Of course Otzi would have had no useful skills for the modern world, and it may have taken years for him to adapt, and most likely he never would have become a contributing tax payer to modern society. I am sure, however, that the Italian government would have been more than happy to set him up with a comfortable life for as long as he should live in his new afterlife.

Now, what would have happened if we had discovered two bodies rather than one? I think they would have been treated equally, and hundreds of man years of effort to study Otzi would have been matched by hundreds of man years of effort to study Otzi II.

But what if there had been 100 Otzis, or 10,000 Otzis? Like all the unexamined mummies that lie in museums around the world I don’t think there would have been enough enthusiasm to study each and every one of them with the same intensity as Otzi was studied. Given enough 5,000 year old bodies showing up from glaciers I am not sure we would have even kept them all preserved as we did with Otzi, a meticulous and careful task. Perhaps we would have resorted to simply burying many of them in conventional modern graves.

One refugee from the past is interesting. Ten thousand refugees are not ten thousand times as interesting.

What does this portend for our eager dead heads (and some were even Deadheads!) of today? If at some future time society has the technology to raise these hopeful souls from the dead, will they do it?

Certainly they might for some recently departed well known people. Around 1990 I think if we had known how to resurrect John Lennon everyone would have been clamoring to do it. Apart from rekindling one particular earlier controversy from his career, I think so many people would have wanted to hear his thoughts and hope that perhaps the Beatles really would get back together again, that someone would have stepped forward to pay whatever expenses it might incur. Similarly for John or Robert Kennedy, or for Martin Luther King.

So our future selves might well want to bring back to life famous people who are still in their collective memories. And individuals may be willing to do whatever it takes to bring back their parents or spouse.

Once the time since departure gets longer, and the people of the now current time have no personal connection at all with the person whose brain is preserved, it might get a little more iffy. Very famous people from the past who still figure mightily in the histories that everyone reads would be good candidates to revive. So many unknowns and mysteries left in those histories could be explained in the first person.

However, for people whose mark on the world has long since faded I think it is a real act of faith to imagine that the future us are going to spend many of our resources on reviving, re-educating, and caring for them.

But surely, you say, in a future time everyone will be so very much richer than now that they will gladly make room for these early 21st century refugees. They will provide them with bodies, if that is what they want, nutrition, comfort, education so that they can fit into modern society, and welcome them with kindness and open arms.

Hmmm. Well, think for a minute what today’s society would have looked like to an Elizabethan. In comparison to the well off of the late 16th century, a working class person in the US or Europe has a much longer lifespan, much better health care, much better food, so many more insect-free clothes, houses held at comfortable temperatures year round, a much better selection of food and drink, is much better educated, has a much easier life, has more opportunity to see so much more of the world, etc., etc.

And how are we treating living (i.e., not yet dead) refugees running for their lives from the most horrible depravations we can imagine?

Get in line! Go through the process just as our ancestors did(n’t)! We can’t afford to take these people who don’t understand our culture, are not like us, and don’t have any real skills.

Good luck dead people! You are going to need a lot of it.

1I note that in the “Team” part of the company web page the company founders only give their first names. This is a little strange, and I think not a trend we should hope to see becoming more prevalent. If you are proposing killing every one of your customers then the very least you can do is own up to who you are. The Technology Review story identifies them as Michael McCanna and Robert McIntyre.

2Recall Arthur C. Clarke’s third law: Any sufficiently advanced technology is indistinguishable from magic. Powerful magic is going to be pretty important for this scenario to work out…

3I don’t know what the embalming chemicals are going to do to the chemically enabled “weights” at the synaptic interfaces, or indeed whatever other modifications have happened in the neurons as a result of experience–we don’t yet know what they might be, if they exist, if they are important, or if they are the key to how our brain works. And there is no mention of how the glial cells are preserved or not. My bet is that they will be even more important to how the whole machinery of the brain works than we yet have an inkling. Not to mention the small molecule diffusions that we do know are important, or many other hormonal concentrations. None of these seem likely to get preserved by this process.

The Productivity Gain: Where Is It Coming From And Where Is It Going To?

There are a lot of fears that technology of various sorts is going to reduce the need for human labor to a point where we may need to provide universal basic income, reduce the work week radically, and/or have mass unemployment.

I have a different take on where things are headed.

I think we are undergoing a radical productivity gain in certain aspects of certain jobs. This will lead to lots of dislocation for the workers who are affected by it. It will in some cases be gruesome in the short term.

At the same time I think there will not be enough productivity gain in many parts of the world to compensate for an aging population and lower immigration rates. I am worried about a loss of standard of living because we will have too few human workers.

But in any case, we are going to have to change the relative value of some sorts of work that almost any person could do if sufficiently motivated. We will need to re-evaluate the social standing of various job classes, and encourage more people to take them up.

The politics are going to be nasty.

Some Definitions

I think that most of the disruption that is coming is from digitalization. Note that this word has one more syllable than digitization, and the two words have different meanings. Worse than that, though, there is some disagreement on what each of these words means. I will define them here as I understand them and as I see the more interesting writing using them.

Digitization is the process of taking some object or measurement, and rendering it in digital form as zeros and ones. Scanning a paper document to produce a .pdf file is the digitization of the visible marks on the paper into a form that can be manipulated by a computer; not necessarily at the level of words on the paper, but just where there is ink versus no ink. In automobiles of an earlier age the steering wheel was mechanically linked to the axles of the front wheels so there was a direct mechanical coupling between the steering wheel and the front wheels of the car. Today the position of the steering wheel is digitized: the continuous angle of that wheel, controlled by the human driver, is constantly turned into a very accurate, but nevertheless still approximate, estimate of that angle represented as a string of zeros and ones.
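As a minimal sketch of what digitizing the steering wheel angle means, here is a few lines of Python. The 12 bit resolution and the range of plus or minus 540 degrees are numbers I have assumed purely for illustration; they are not taken from any real car.

```python
def digitize_angle(angle_deg, bits=12, min_deg=-540.0, max_deg=540.0):
    """Return (bit_string, approx_angle) for a continuous steering angle.

    bits, min_deg and max_deg are illustrative assumptions only.
    """
    levels = 2 ** bits
    # Keep the reading inside the range the sensor is assumed to cover.
    clamped = max(min_deg, min(max_deg, angle_deg))
    # Size of one quantization step, in degrees.
    step = (max_deg - min_deg) / (levels - 1)
    # Nearest integer level: this is the digitized value.
    code = round((clamped - min_deg) / step)
    # The angle that the code stands for; close to, but not exactly, the input.
    approx = min_deg + code * step
    return format(code, "0{}b".format(bits)), approx

bit_string, approx = digitize_angle(37.3)
print(bit_string)           # a string of 12 zeros and ones
print(abs(approx - 37.3))   # small, but not zero: accurate yet approximate
```

The error can never be more than half of one step, which is why more bits give a more accurate, but still approximate, estimate.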

Digitalization is replacing old methods of sharing information or the flow of control within a process, with computer code, perhaps thousands of different programs running on hundreds or thousands of computers, that make that flow of information or control process amenable to new variations and rapid redefinition by loading new versions of code into the network of computers.

Digitization of documents originally allowed them to be stored in smaller lighter form (e.g., files kept on a computer disk rather than in a filing cabinet), and to be sent long distances at speed (e.g., the fax machine). Digitalization of office work meant that the contents of those digital representations of those documents were read and turned into digital representations of words that the original creators of the documents had written, and then the ‘meaning’ of those words, or at least a meaning of those words, was used by programs to cross index the documents, process work orders from them, update computational models of ongoing human work, etc., etc. That is the digitalization of a process.

Likewise in automobiles, once every element of the drive train of a car was continuously being digitized, it opened the possibility of computers on board the car changing the operation of the elements of the drive train faster than any human driver could do. That enabled hybrid cars, and eco modes even in all gasoline engines, where the drive train can be exquisitely controlled and the algorithms updated over time. That is digitalization of an automobile.

Where is the productivity gain coming from?

Let’s look at an example of where digitalization has come together to eliminate a whole job class in the United States, the job of being a toll taker on a toll road or a toll bridge.

The tech industry might have gone after this particular job by building a robot which would take toll tickets (they were used to record where the car entered a toll road), and cash, including crumpled bills and unsorted change, from the reached out hand of a driver, then examine exactly what it was given, and finally return change, perhaps in a blowing wind, to the outreached hand of the driver. This is what human toll takers routinely did. To be practical the whole exchange would need to happen at the same speed as with a human toll taker–toll booths were already the choke point on roads and bridges.

It would have been a very hard problem and today’s robotics technology could not have done the job. At the very least there would have had to be changes to what sort of cash could be given; e.g., have the human throw coins into a basket where they are gravity fed into a counter. If it was required to accept paper cash that would be very hard, as the human is not in an ideal situation to feed the bills into a machine, and with wind, etc., it would have been a very difficult task for most people.

Instead the solution that now abounds is to identify a car by a transponder that it carries, and/or by reading its license plate. The car does not have to slow down and so there is an added advantage of reducing congestion.

However, this solution relies on a whole lot more digitalization than simply identifying the car. It relies on there being readable digital records of who was issued what transponder, who owns a car with a given license plate if the car has no transponder, web pages where individuals can go and register their transponder and car, and connect it to a credit card in their name, the ability for a vendor to digitally bill a credit card without any physical presence of the card, and a way for a consumer to have their credit card paid from a bank account electronically, and most likely that bank account having their wages automatically deposited into it without any payday action of the person. There is a whole big digitalized infrastructure, almost all of which was developed for other purposes. But now toll road or toll bridge operators can tap into that infrastructure and eliminate not just toll taker jobs, but the need to handle cash, collect it from the toll booths, physically transport it to be counted, and then have it be physically deposited at a bank.

This solution is typical of how digitalization leads to fewer people being needed for a task. It is not because one particular digital pathway is opened up. Rather it is that an ever increasing collection of digitalized pathways is coming up to speed, and for a particular application there may be enough of them that, when coordinated together in an overall system design, productivity can be increased so that fewer humans than before are needed in some enterprise.

It is not the robot replacing a person. It is a whole network of digitalized pathways being coordinated together to do a task which may have required many different people to support previously.

Digitalization is the source of the productivity increase, the productivity dividend, that we are seeing from computation.

Digitalization does not eliminate every human task, certainly not at this point in history. For instance any task that requires dexterous physical handling of objects is not made easier by digitalization. Perhaps the overall amount of dexterous manipulation can be reduced in a particular business by restructuring how a task is done. But digitalization itself can not replace human dexterity.

As an example think about how fulfillment services such as Amazon have changed the retail industry. Previously goods were transported to many different retail outlets, often just a few miles apart across the whole of the country, were arranged on shelves by stockers for consumers to see, and then they were handled by retail workers when a consumer was buying the object, taking it from the consumer to be scanned (in the olden days that step had not yet been digitalized, and instead at the point of sale the retail clerk had to read the price, and reenter it into some machine), put in a bag or box, and handed back to the consumer. And in between, retail workers had to retidy the shelves when consumers had looked at the goods, picked them up, and then put them down again without making a purchase. There were many, many dexterous steps in the path of a particular object from the factory to the consumer.

Today there are many fewer such steps, and many fewer workers who touch objects than before. Consumers purchase their goods from a web page, and the same digitalized payment chain that is used for toll roads is used to settle their accounts. The goods are taken to just a very few fulfillment centers across the country. Stocking the shelves there is done only once, and there is no need to tidy up after customers. Then the objects are picked and packed for a particular order by a human. After that the manipulation of the consumer goods is again done only in single rectanguloid boxes for the whole order of many goods–much easier to manipulate. Robots bring the shelves to the picker, and take the boxes from the packer. There is much less labor that needs to be done by humans in this digitalized supply chain. Unlike the toll taking application there is still a dexterous step, and that is not yet solved by digitalization.

Increasingly digitalization is making more tasks more human efficient. Fewer people are needed to provide some overall service than were needed before. Sometimes digitalization replaces almost all the people who were necessary for some previous service.

Increasingly digitalization is replacing human cognitive processes that are routine and transactional, even though in the past they required highly educated people. This includes things like looking at a radiological scan, deciding on credit worthiness of a loan applicant, or even constructing a skeleton legal document.

Tasks that are more physical, even where they too are transactional, are not being replaced if they involve variability. This includes almost any dexterous task. For productivity increases in these cases the need for dexterity needs to be eliminated as our machines are not yet dexterous.

Likewise if there is a task step that absolutely involves physical interaction with a human that also is likely not yet ready to be eliminated. Large parts of elder care fall under this–we have no machines that can help an elderly person into or out of bed, that can help an unsteady elderly person get onto and off of a toilet, can wash a person who has lost their own dexterity or cognitive capability, can clean up a table where a person eats, or even deliver food right to their table or bedside.

Hmm.  Not many of these things sound like tasks that lots of people want to do. Nor do they pay well right now. I assume many of these tasks will be hard to get robots to do in the next thirty years, so we as a society, with the support of our politicians, are going to have to make these jobs more attractive along many dimensions.

Where is the productivity gain going to?

First, a disclaimer. I am not an economist.

Second, an admission. I won’t let that stop me from blathering on about economic forces.

Now, to my argument.

The United States, from when it was an unfederated collection of proto-states, through to today, has relied on low cost immigrant labor for its wealth.

In the early days the “low paid” immigrants were brought, against their will, as human slaves. Thankfully those days are gone, if not all the after effects. There has also always been, up until now, a steady flow of “economic refugees”, coming to the United States, and taking on jobs that existing residents were not willing to do. In recent times a distinction has been made between “legal” and “illegal” immigrants. More than 10 million so called “illegal” (I prefer “undocumented”) immigrants currently live in the US and often they are exploited with lower wages than others would earn for the same work, as they have very little safe right of appeal. Now, in the United States, and many other countries, there has been a populist turn against immigrants, and the numbers arriving have dropped significantly. A physical wall has not been necessary to effect this change.

So, the good news is that now, just as we collectively have scared off immigrants who we can exploit1 with low wages, digitalization is coming along with a productivity bonus, which may well be able, in magnitude at least, to plug that labor deficit which is about to hit us. With luck it will even also compensate for the coming elder care tsunami that is about to hit us–in a previous blog post I talked about how this is going to drive robotics development.

The big problem with this scenario is that there is by no means a perfect match between the skills gap demand that both reduced low cost immigrant labor and the need for massively increased elder care and services will drive, and the skills productivity that digitalization will supply.

There is going to be a lot of dislocation for a lot of people.

I am not worried at all that there will not be enough labor demand to go around. In fact I am worried that there still will not be enough labor.

And another piece of “good news” for the dislocated is that the unfilled jobs will not require years of training to do. Almost anyone will be able to find a job that they are mentally and physically capable of performing in this new dislocated labor market.

Easy for me to say.

The bad news is that those jobs may well not seem satisfying, that they will not seem as status admirable as many of the jobs that have disappeared, and that many of these jobs would, in our current circumstances, pay much less than many of the jobs that will have disappeared.

To fix these problems will require some really hard political work and leadership. I wish I could say that our politicians will be up to this task. I certainly fear they will not be.

But I think this is the real problem that we will face. How to make the jobs where we will have massive unfulfilled demand be attractive to those who are displaced by the productivity of digitalization. This is in stark contrast to many of the fears we see that technology is going to take away jobs, and there just will not be any need for the labor of many many people in our society.

The challenge will really be about “different jobs”, not “no jobs”. Solving this actual problem is still going to be a real challenge.

I have not used the term AI

I have not talked about Artificial Intelligence, or AI, in this post.

I was recently at a conference on the future of work, and AI was the buzz word on everyone’s lips. AI was going to do this, AI was going to do that, there was an AI revolution happening. Most of the people saying this would not have heard of “AI” just three years ago, despite the fact that it has been around since 1956. I realized that the phrase “Artificial Intelligence”, or “AI”, has been substituted for the word “tech”. Everything people were saying would have made perfect sense three years ago with the word “tech” rather than today’s “AI”.

In this post I have talked about digitalization. I think that is the overall thing which is changing. Certainly, real actual AI, machine learning (ML), and other things that people understand as AI are going to be able to be deployed more quickly because of digitalization. So that is a big deal. But lots of the productivity gains from digitalization will not particularly rely on AI.

That is, unless we redefine AI as being a superset of any and every sort of digital tech. I am not ready to do that. Others may already be doing it.

1Cynicism alert…

My Dated Predictions

With all new technologies there are predictions of how good they will be for humankind, or how bad they will be. A common thread that I have observed is how people tend to underestimate how long new technologies will take to be adopted after proof of concept demonstrations. I pointed to this as the seventh of seven deadly sins of predicting the future of AI.

For example, recently the early techno-utopianism of the Internet providing a voice to everyone and thus blocking the ability of individuals to be controlled by governments has turned to depression about how it just did not work out that way. And there has been discussion of how the good future we thought we were promised is taking much longer to be deployed than we had ever imagined. This is precisely a realization that the early optimism about how things would be deployed and used just did not turn out to be warranted.

Over the last few months I have been throwing a little cold water over what I consider to be current hype around Artificial Intelligence (AI) and Machine Learning (ML). However, I do not think that I am a techno-pessimist. Rather, I think of myself as a techno-realist.

In my view having ideas is easy. Turning them into reality is hard. Turning them into being deployed at scale is even harder. And in evaluating the likelihood of success at that I think it is possible to sort technology and technology deployment ideas into a spectrum running from relatively easier to very hard.

But simply spouting off about this is rather easy to do as there is no responsibility for being right or wrong. That applies not just to me, but to pundits ranging from physicists to entrepreneurs to academics, who are making wild predictions about AI and ML.

It is the New Year  and there will be many predictions about what will happen in the coming year. I am going to take this opportunity to make predictions myself, not just about the coming year, but rather the next thirty two years. I am going to write them in this blog with explicit dates attached to them. Hence they are my dated predictions. And they will be here on this blog and copies that live on elsewhere in cyberspace for all to see. I am going to take responsibility for what I say, and make it so that people can hold me to whether I turn out to be right or wrong. If I am unfortunate, some of my predictions will at some point seem rather dated!

I chose thirty two years as I will then be 95 years old, and I suspect I’ll be a little too exhausted by then to carry on arguments about why I was right or wrong on particular points. And 32 is a power of 2, so that’s always a good thing. So the furthest out date I am going to consider is January 1st, 2050. And that also means that I am only predicting things for exactly the first half of this century (or at least for the first half of the years starting with “20” — there is a whole argument to be had here into which I am not going to get).

I specify dates in three different ways:

NIML meaning “Not In My Lifetime”, i.e., not until after January 1st, 2050.

NET some date, meaning “No Earlier Than” that date.

BY some date, meaning “By” that date.

Sometimes I will give both a NET and a BY for a single prediction, establishing a window in which I believe it will happen.
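One way to make these three kinds of date mechanically checkable is sketched below. It is only an illustration: the `DatedPrediction` class, its field names, and the choice to represent NET and BY as full calendar dates are my assumptions here, not anything that the predictions themselves depend on.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Horizon taken from the text: nothing is predicted past January 1st, 2050.
HORIZON = date(2050, 1, 1)


@dataclass(frozen=True)
class DatedPrediction:
    """Hypothetical record type; the field names are illustrative only."""
    claim: str
    net: Optional[date] = None  # "No Earlier Than" this date
    by: Optional[date] = None   # "By" this date
    niml: bool = False          # "Not In My Lifetime": not until after HORIZON

    def consistent_with(self, happened_on: date) -> bool:
        """Return True if an event on `happened_on` agrees with the prediction."""
        if self.niml and happened_on <= HORIZON:
            return False  # it happened within the lifetime window
        if self.net is not None and happened_on < self.net:
            return False  # it happened earlier than predicted
        if self.by is not None and happened_on > self.by:
            return False  # it happened later than predicted
        return True


# A made-up prediction with both a NET and a BY, i.e., a window.
window = DatedPrediction("example claim", net=date(2022, 1, 1), by=date(2030, 1, 1))
print(window.consistent_with(date(2025, 6, 1)))  # an event inside the window
```

A check like this only settles the dates. Whether the thing that happened really is the thing that was predicted is the part that people will argue about.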


I am going to try to be very precise about what I am predicting and when. Now in reality precision on defining what I am predicting is almost impossible. Nevertheless I will try.

I had an experience very recently that made me realize just how hard people will try, when challenged, to hold their preconceived notions about technologies and the cornucopia they will provide to humanity. I tweeted out the following (@rodneyabrooks):

When humans next land on the Moon it will be with the help of many, many, Artificial Intelligence and Machine Learning systems.

Last time we got to the Moon and back without AI or ML.

My intent with this tweet was to say that although AI and ML are today very powerful and useful, it does not mean that they are the only way to do things, and it is worth remembering that. They don’t necessarily mean that suddenly everything has changed in the world in some magical way1.

One of the responses to this tweet, which itself was retweeted many, many times, was that Kalman filtering was used to track the spacecraft (completely true), that Kalman filtering uses Bayesian updating (completely true), and that therefore Kalman filtering is an instance of machine learning (complete non sequitur) and that therefore machine learning was used to get to the Moon (a valid inference based on a non-sequitur, and completely wrong).  When anyone says Machine Learning these days (and indeed since the introduction of the term in 1959 by Arthur Samuel (see my post on ML for details)) they mean using examples in some way to induce a representation of some concept that can later be used to select a label or action, based on an input and that saved learned material. Kalman filtering uses multiple data points from a particular process to get a good estimate of what the data is really saying. It does not save anything for later to be used for a similar problem at some future time. So, no, it is not Machine Learning, and no, we did not use Machine Learning to get to the Moon last time, no matter how much you want to believe that Machine Learning is the key to all technological progress.
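To make the distinction concrete, here is a minimal sketch of a Kalman filter for a single scalar quantity. It is a toy under stated assumptions: the function name `kalman_1d`, its parameters, and the constant-state model are mine for illustration, and the filters used for spacecraft tracking estimated many quantities at once rather than one number.

```python
def kalman_1d(measurements, meas_var, process_var=0.0, init_est=0.0, init_var=1e6):
    """Estimate one scalar quantity from noisy measurements of one process.

    All names and defaults are illustrative assumptions, not a real tracking filter.
    """
    est, var = init_est, init_var
    for z in measurements:
        # Predict: the state is modelled as constant, so only uncertainty grows.
        var += process_var
        # Update: blend the prediction and the measurement (the Bayesian step).
        gain = var / (var + meas_var)
        est += gain * (z - est)
        var *= (1.0 - gain)
    return est, var


# Four made-up noisy readings of a quantity near 10.
estimate, variance = kalman_1d([10.2, 9.8, 10.1, 9.9], meas_var=0.04)
print(estimate, variance)
```

In this sketch the estimate and its variance are the only things carried from one measurement to the next, and each call starts afresh. Nothing is induced from examples and stored for use on a different problem later.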

That is why I am going to try to be very specific about what I mean by my predictions, and why, no doubt, I will need to argue back to many people who will want to claim that the things I predict will not happen before some future time have already happened. I predict that people will be making such claims!

What is Easy and What is Hard?

Building electric cars and reusable rockets is easy. Building flying cars, or a hyperloop system (or a palletized underground car transport network) is hard.

What makes the difference?

Cars have been around, and mass produced, for well over a century. If you want to build electric cars rather than gasoline cars, you do not have to invent too much stuff, and figure out how to deploy it at scale.

There has been over a hundred years of engineering and production of windscreen wipers, brakes, wheels, tires, steering systems, windows that can go up and down, car seats, a chassis, and much more. There have even been well over 20 years of large scale production of digitalized drive trains.

To build electric cars at scale, and at a competitive price, and with good range, you may have to be very clever, and well capitalized. But there is an awful lot of the car that you do not need to change. For the majority of the car there are plenty of people around who have worked on those components for decades, and plenty of manufacturing expertise for building the components and assembly.

Although reusable rockets sounds revolutionary there is again prior art and experience. All liquid fuel rockets today owe their major components and capabilities to the V-2 rockets of Wernher von Braun, built for Hitler. It was liquid fueled with high flow turbopumps (580 horsepower!), it used the fuel to cool parts of the engine, and it carried its own liquid oxygen so that it could fly above the atmosphere. It first did so just over 75 years ago. And it was mass produced, with 5,200 of them being built, using slave labor, in just two years.

Since then over 20 different liquid fueled rocket families have been developed around the world, some with over 50 years of operational use, and hundreds of different configurations within those families. Many variations in parameters and trade offs have been examined. Soyuz rockets, a fifty year old family, all lift off with twenty liquid fueled thrust chambers burning. In the Delta family, the Delta IV configuration has a “Heavy” variant, three essentially identical cores in a horizontal line, where the cores are all a first stage of the earlier single core Delta IV.

The technology for soft landing on Earth using jet engine thrusters has been around since the 1950s with the Rolls Royce “flying bedstead”, with the later, at large scale, Harrier fighter jet taking off and landing vertically. A rocket engine for vertical landing was used, without atmosphere, for the manned lunar landings, starting in 1969.

Today’s Falcon rocket uses grid fins to  steer the first stage when it is returning to the launch site or recovery barge to soft land. These were first developed theoretically in Russia in the 1950’s by Sergey Belotserkovskiy and have been used since the 1970’s for many missiles, both ballistic and others, guided bombs, cruise missiles, and for the emergency escape system for manned Soyuz capsules.

There has been a lot of money spent on developing rockets and this has led to many usable technologies, lots of know how, and lots of flight experience.

None of this is to say that developing at scale electric cars or reusable rockets is not brave, hard, and incredibly inventive work. It does however build on large bodies of prior work, and therefore it is more likely to succeed. There is experience out there. There are known solutions to many, many, but not all, problems that will arise. Seemingly revolutionary concepts can arise from clusters of hard and brilliantly thought out evolutionary ideas, along with the braveness and determination to undertake them.

We can make estimates about these technologies being technically successful and deployable at scale with some confidence.

For completely new ideas, however, it is much harder to predict with confidence that the technologies will become deployable in any particular amount of time.

There have been sustained projects working on problems of practical nuclear fusion reactors for power generation since the 1950’s. We know that sustained nuclear fusion “works”. That is how our Sun and every other star shines. And humans first produced short time scale nuclear fusion with the first full scale thermonuclear bomb, “Ivy Mike”, being detonated 65 years ago. But we have not yet figured out how to make nuclear fusion practical for anything besides bombs, and I do not think many people would believe any predicted date for at scale practical fusion power generation. It is a really hard problem.

The hyperloop concept has attracted a bunch of start ups and capital for them, though there has been nothing close in concept that has ever been demonstrated, let alone operated at scale. So besides figuring out how to develop ultrastable cylinders that go for hundreds of miles, containing capsules that are accelerated by external air pressure traveling at hundreds of miles per hour while containing living meat of the human variety there are many, many mundane things to be developed.

One of the many challenges is how to seal the capsules and provide entirely self contained life support within for the duration of the journey. Also the capsules must be able to go past stations at which they are not stopping in a stable manner, so stations will need to be optionally sealed off from the tube for a through capsule, while allowing physical ingress and egress for passengers whose capsule has stopped at the station. There will need to be procedures for when a capsule gets stuck a hundred miles from the nearest station. There will need to be communications with the capsule, even though it is in a pretty good Faraday cage. There will need to be the right seats and restraints developed for the safety of the passengers. There will need to be user experience elements developed for the sanity of the passengers while they are being whizzed at ultra high speed in windowless capsules. And then there are route rights, earthquake protection, and dealing with distortions of the containing cylinder caused by the centimeter or so of drift induced along the route in the course of a year by normal smooth deformations of our tectonic plates. And then there are pricing models, and getting insurance, and figuring out how that interacts with individual passenger insurance. Etc., etc.

There will need to be many, many new technologies and new designs developed for every aspect of the hyperloop. None of them will have existed before. None of them have been demonstrated, nor even enumerated as of today. It is going to take a long time to figure all these things out and build a stable system around them, and to do all the engineering needed on all the components. And it is going to be a hard psychological sell for passengers to ride in these windowless high speed systems, so even when all the technology challenges have been knocked down there will still be the challenge of pace of adoption.

So…while there might be some demonstration of some significance in the next 32 years I am confident in saying that there will be no commercially viable passenger carrying systems for hyperloop within that time frame.

I use this framework in trying to predict timing on various technological innovations. If something has not even been demonstrated yet in the lab, even though the physics says that it will be good to go, then I think it is a long, long way off. If it has been demonstrated in prototypes only, then it is still a long way off. If there are versions of it deployed at scale already, and most of what needs doing is evolutionary, then it may happen before too long. But then again, no-one may want to adopt it, so that will slow things down no matter how much enthusiasm there is by the technologists involved in developing it.


Adoption of new things in technology takes much longer than one might expect. The original version of the Internet used 32 bit addressing, allowing only 4 billion unique addresses for all devices on the network, and using a protocol called IPv4, Internet Protocol version 4. But by the early 1990’s it was recognized that with all the devices that would soon join the network (not just personal devices but so many other things like electricity meters, industrial sensors, traffic sensors and control, TVs, light switches(!), etc., etc.) the world would soon run out of address space.

By 1996 a new protocol, IPv6, Internet Protocol version 6, had been defined, increasing the address space to 128 bits from 32 bits, allowing for 7.9 × 10^28 times as many addresses on the network.
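The address space arithmetic is easy to check. The few lines below are just that check, using nothing beyond the 32 bit and 128 bit sizes given above; the variable names are mine.

```python
# Number of distinct addresses for each address width.
ipv4_addresses = 2 ** 32
ipv6_addresses = 2 ** 128

# How many times larger the IPv6 space is: 2**128 / 2**32 = 2**96.
ratio = ipv6_addresses // ipv4_addresses

print(f"{ipv4_addresses:,}")  # 4,294,967,296, the "4 billion" of IPv4
print(f"{ratio:.1e}")         # 7.9e+28
```

So the 7.9 × 10^28 figure is the ratio between the two address spaces, 2 to the power 96, rather than a count of devices.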

Since 1996 there have been various goal dates specified for when all network traffic should use IPv6 rather than IPv4. In 2010 the target date was 2012. In 2014 fully 99% of all network traffic was still using IPv4 with many, many clever edge systems to cram much more than 4 billion devices into a 4 billion device address space. By the end of 2017 various categories of network traffic running on IPv6 ranged from under 2% to just over 20%. It is still a long way from full adoption of IPv6.

There were no technical things stopping the adoption of IPv6, in fact quite the opposite. As the number of devices that wanted to connect to the Internet grew there had to be many very clever innovations and work arounds in order to limp along with IPv4 rather than adopt IPv6.

Using the heuristics (rate of replacement of equipment, maturity of technical solutions, real need for what it provides, etc.) that I use to make my predictions in this post, I would have thought that IPv6 would have been universal by 2010 or so. I would have been wildly over optimistic about it.


SpaceX first announced their Falcon Heavy rocket in April 2011, broke ground on their Vandenberg AFB, California, launch pad for it in June 2011, and expected a maiden flight in 2013. The rocket was first moved to a launch pad on December 28, 2017, at pad 39A at the Kennedy Space Center in Florida. It is now expected to fly in 2018. Development time has stretched from two years to seven years. So far.

It always takes longer than you think. It just does.


The first three entries in the table below are about flying cars. I am pretty sure that practical flying cars will need to be largely self driving while flying, so they sort of fit the category. By flying cars I mean vehicles that can be driven anywhere a car can be driven. Otherwise it is not a car. And I mean that a person who does not have a pilot’s license, but does perhaps have a few hours of special training, can get into it wearing normal clothing that would be appropriate to wear at an office, and is able to travel 100 miles, say, with much of the journey in the air. It should require no previous arrangement for the journey, no special filing of plans, nothing beyond using a maps-like app on a smartphone in order to know the route to get to the destination. In other words, apart from a little extra training it should be just like an average person today using a conventional automobile to travel 100 miles.

Now let’s talk about self driving cars, or driverless cars. I wrote two blog posts early in 2017 about driverless cars. The first talked about unexpected consequences of driverless cars, in that pedestrians and other drivers will interact with them in different ways than they do with cars with drivers in them, and how the cars may bring out anti-social behavior in humans outside of them. It also pointed out that owners of individual driverless cars may use them in new ways that they could never use a regular car, sometimes succumbing to anti-social behavior themselves. The second post was about edge cases in urban environments where there are temporary signs that drivers must read, where on a regular basis it is impossible to drive according to the letter of the law, where mobility as a service will need to figure out how much control a passenger is allowed to have, and where police and tow truck drivers must interact with these cars, and the normal human to human interaction with drivers will no longer be present nor subjugatable by a position of authority.

For me it seems clear that driverless cars are not going to simply be the same sorts of cars as normal cars, but simply without human drivers. They are going to be fundamentally different beasts with different use modes, and different ways of fitting into the world.

Horseless carriages did not simply one for one replace horse drawn carriages. Instead they demanded a whole new infrastructure of paved roads, a completely new ownership model, a different utilization model, completely different fueling and maintenance procedures, a different rate of death for occupants, a different level of convenience, and ultimately they led to a very different structure for cities as they enabled suburbia.

I think the popular interpretation is that driverless cars will simply replace cars with human drivers, one for one. I do not think that is going to happen at all. Instead our cities will be changed with special lanes for driverless cars, geo-fencing of where they can be and where cars driven by humans can be, a change in the norm for pick up and drop off location flexibility, changes to parking regulations, and in general all sorts of small incremental modifications to our cities.

But first let’s talk about the rate of adoption of driverless cars.

As I pointed out in my seven deadly sins post, in 1987 Ernst Dickmanns and his team at the Bundeswehr University in Munich had their autonomous van drive at 90 kilometers per hour (56mph) for 20 kilometers (12 miles) on a public freeway. Of course there were people inside the van but they had their hands off the controls. For the last 30 years researchers have been improving the ability of cars to drive on public roads, but it has mostly been about the driving, with very little about the interaction, the pick up and drop off of people, the interface with other services and restrictions, and with non-driving passengers inside the cars. All of these will be important.

From one point of view it has been slow, slow, slow incremental progress over the last thirty years, even though the work has been focused on only a small part of the problem. Just about a year ago I saw a tweet which I loved, which said something like “The customers knew that they had gotten a driverless Uber as there were two people in the front seat instead of just one.”. It is only just in the last few weeks that we have started seeing actual unoccupied cars on public roads, from Waymo in Phoenix, Arizona. A tweet about this story referred to them as being the first “driverless driverless cars”…

But adoption is still a ways off. The price of the sensors still needs to come way down, all the operational things about how the cars will be used and interface with passengers still need to be worked out, and the regulatory and liability environment under which they will operate still needs to be put in place. Within some constraints, all these things will eventually be solved. But it is going to be much slower than many expect.

The true test of the viability of driverless cars will be when they are not just in testing or in demonstration, but when the owners of driverless taxis or ride sharing services or parking garages for end consumer self driving cars are actually making money at it. This will happen only gradually and in restricted geographies and markets to start with. My milestone predictions below are not about demonstrations, but about viable sustainable businesses. Without them the deployment of driverless cars will never really take off.

I think the under discussed reality of how driverless cars will get adopted is through geo fencing of where certain activities of those cars can take place, without any human driven cars in that vicinity. Furthermore applications of driverless cars will initially be restricted to certain cities and even areas within those cities, and perhaps even certain times of day and in certain weather conditions. It may be that for quite a while the cars for the first mobility as a service driverless cars (e.g., for Uber and Lyft like services) will only operate in a driverless mode some of the time, and at other times will need to have hired human drivers.

| Prediction [Self Driving Cars] | Date | Comments |
|---|---|---|
| A flying car can be purchased by any US resident if they have enough money. | NET 2036 | There is a real possibility that this will not happen at all by 2050. |
| Flying cars reach 0.01% of US total cars. | NET 2042 | That would be about 26,000 flying cars given today's total. |
| Flying cars reach 0.1% of US total cars. | NIML | |
| First dedicated lane where only cars in truly driverless mode are allowed on a public freeway. | NET 2021 | This is a bit like current day HOV lanes. My bet is the left most lane on 101 between SF and Silicon Valley (currently largely the domain of speeding Teslas in any case). People will have to have their hands on the wheel until the car is in the dedicated lane. |
| Such a dedicated lane where the cars communicate and drive with reduced spacing at higher speed than people are allowed to drive. | NET 2024 | |
| First driverless "taxi" service in a major US city, with dedicated pick up and drop off points, and restrictions on weather and time of day. | NET 2022 | The pick up and drop off points will not be parking spots, but like bus stops they will be marked and restricted for that purpose only. |
| Such "taxi" services where the cars are also used with drivers at other times and with extended geography, in 10 major US cities. | NET 2025 | A key predictor here is when the sensors get cheap enough that using the car with a driver and not using those sensors still makes economic sense. |
| Such "taxi" service as above in 50 of the 100 biggest US cities. | NET 2028 | It will be a very slow start and roll out. The designated pick up and drop off points may be used by multiple vendors, with communication between them in order to schedule cars in and out. |
| Dedicated driverless package delivery vehicles in very restricted geographies of a major US city. | NET 2023 | The geographies will have to be where the roads are wide enough for other drivers to get around stopped vehicles. |
| A (profitable) parking garage where certain brands of cars can be left and picked up at the entrance and they will go park themselves in a human free environment. | NET 2023 | The economic incentive is much higher parking density, and it will require communication between the cars and the garage infrastructure. |
| A driverless "taxi" service in a major US city with arbitrary pick up and drop off locations, even in a restricted geographical area. | NET 2032 | This is what Uber, Lyft, and conventional taxi services can do today. |
| Driverless taxi services operating on all streets in Cambridgeport, MA, and Greenwich Village, NY. | NET 2035 | Unless parking and human drivers are banned from those areas before then. |
| A major city bans parking and cars with drivers from a non-trivial portion of a city so that driverless cars have free rein in that area. | NET 2027; BY 2031 | This will be the starting point for a turning of the tide towards driverless cars. |
| The majority of US cities have the majority of their downtown under such rules. | NET 2045 | |
| Electric cars hit 30% of US car sales. | NET 2027 | |
| Electric car sales in the US make up essentially 100% of the sales. | NET 2038 | |
| Individually owned cars can go underground onto a pallet and be whisked underground to another location in a city at more than 100mph. | NIML | There might be some small demonstration projects, but they will be just that, not real, viable mass market services. |
| First time that a car equipped with some version of a solution for the trolley problem is involved in an accident where it is practically invoked. | NIML | Recall that a variation of this was a key plot aspect in the movie "I, Robot", where a robot had rescued the Will Smith character after a car accident at the expense of letting a young girl die. |

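The flying car rows above rest on a small piece of arithmetic: if 0.01% of US cars is about 26,000 vehicles, the implied total is about 260 million cars, and the 0.1% milestone would be about 260,000 flying cars. A minimal sketch of that calculation follows; the function names are hypothetical, and the 26,000 figure is simply the one quoted in the table, not an independent estimate.

```python
from fractions import Fraction


def implied_total(count, share_percent):
    """Total fleet size implied by `count` vehicles being `share_percent` % of it.

    `share_percent` is passed as a string (e.g. "0.01") so that Fraction
    keeps the arithmetic exact instead of using binary floating point.
    """
    return count / (Fraction(share_percent) / 100)


def cars_at_share(total, share_percent):
    """Number of vehicles that make up `share_percent` % of `total`."""
    return total * Fraction(share_percent) / 100


# Figure quoted in the table: 0.01% of US cars is about 26,000 vehicles.
total_us_cars = implied_total(26_000, "0.01")
print("Implied total US cars:", int(total_us_cars))
print("Flying cars at 0.1%:", int(cars_at_share(total_us_cars, "0.1")))
```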
Predictions about ROBOTICS, AI and ML

Those of you who have been reading my series of blog posts on the future of Robotics and Artificial Intelligence know that I am less sanguine about how fast things will deploy at scale in the real world than many cheerleaders and fear mongers might believe. My predictions here are tempered by that caution.

Some of these predictions are about the public perception of AI (that has been the single biggest thing that has changed in the field in the last three years), some are about technical ideas, and some are about deployments.

| Prediction [AI and ML] | Date | Comments |
|---|---|---|
| Academic rumblings about the limits of Deep Learning. | BY 2017 | Oh, this is already happening... the pace will pick up. |
| The technical press starts reporting about limits of Deep Learning, and limits of reinforcement learning of game play. | BY 2018 | |
| The popular press starts having stories that the era of Deep Learning is over. | BY 2020 | |
| VCs figure out that for an investment to pay off there needs to be something more than "X + Deep Learning". | NET 2021 | I am being a little cynical here, and of course there will be no way to know when things change exactly. |
| Emergence of the generally agreed upon "next big thing" in AI beyond deep learning. | NET 2023; BY 2027 | Whatever this turns out to be, it will be something that someone is already working on, and there are already published papers about it. There will be many claims on this title earlier than 2023, but none of them will pan out. |
| The press, and researchers, generally mature beyond the so-called "Turing Test" and Asimov's three laws as valid measures of progress in AI and ML. | NET 2022 | I wish, I really wish. |
| Dexterous robot hands generally available. | NET 2030; BY 2040 (I hope!) | Despite some impressive lab demonstrations we have not actually seen any improvement in widely deployed robotic hands or end effectors in the last 40 years. |
| A robot that can navigate around just about any US home, with its steps, its clutter, its narrow pathways between furniture, etc. | Lab demo: NET 2026; Expensive product: NET 2030; Affordable product: NET 2035 | What is easy for humans is still very, very hard for robots. |
| A robot that can provide physical assistance to the elderly over multiple tasks (e.g., getting into and out of bed, washing, using the toilet, etc.) rather than just a point solution. | NET 2028 | There may be point solution robots before that. But soon the houses of the elderly will be cluttered with too many robots. |
| A robot that can carry out the last 10 yards of delivery, getting from a vehicle into a house and putting the package inside the front door. | Lab demo: NET 2025; Deployed systems: NET 2028 | |
| A conversational agent that both carries long term context, and does not easily fall into recognizable and repeated patterns. | Lab demo: NET 2023; Deployed systems: 2025 | Deployment platforms already exist (e.g., Google Home and Amazon Echo) so it will be a fast track from lab demo to wide spread deployment. |
| An AI system with an ongoing existence (no day is the repeat of another day as it currently is for all AI systems) at the level of a mouse. | NET 2030 | I will need a whole new blog post to explain this... |
| A robot that seems as intelligent, as attentive, and as faithful, as a dog. | NET 2048 | This is so much harder than most people imagine it to be--many think we are already there; I say we are not at all there. |
| A robot that has any real idea about its own existence, or the existence of humans in the way that a six year old understands humans. | NIML | |

These predictions may seem a little random and disjointed. And they are. But that is the way progress is going to be made in Robotics, AI, and ML. There is not going to be a general intelligence that can suddenly do all sorts of things that humans (or chimpanzees) can do. It is going to be point solutions for a long, long time to come.

Building human level intelligence and human level physical capability is really, really hard. There has been a little tiny burst of progress over the last five years, and too many people think it is all done. In reality we are less than 1% of the way there, with no real intellectual ideas yet on how to get to 5%. And yes, I made up those percentages and can not really justify them. I may well have inflated them by a factor of 10 or more, and for that I apologize.


I have been a fan of spaceflight since my childhood, when every week my father would fly from Adelaide to Woomera, South Australia, to work on the first stage engines of a European satellite launch initiative known as Europa. Every couple of months I would go with him on a Friday evening to meetings of a club of enthusiasts where they would have the latest film footage from NASA which would be projected and discussed.

I decided back then that my life goal was to eventually live on another planet. So far my major progress towards that goal is to have not died on Earth before leaving. In my realistic moments I realize now that I may eventually fail at my goal.

So here are my predictions about space travel. Not as optimistic as I wish I could be. But, realistic, I think.

| Prediction | Date | Comments |
|---|---|---|
| Next launch of people (test pilots/engineers) on a sub-orbital flight by a private company. | BY 2018 | |
| A few handfuls of customers, paying for those flights. | NET 2020 | |
| A regular sub weekly cadence of such flights. | NET 2022; BY 2026 | |
| Regular paying customer orbital flights. | NET 2027 | Russia offered paid flights to the ISS, but there were only 8 such flights (7 different tourists). They are now suspended indefinitely. |
| Next launch of people into orbit on a US booster. | NET 2019; BY 2021; BY 2022 (2 different companies) | Current schedule says 2018. |
| Two paying customers go on a loop around the Moon, launch on Falcon Heavy. | NET 2020 | The most recent prediction has been 4th quarter 2018. That is not going to happen. |
| Land cargo on Mars for humans to use at a later date. | NET 2026 | SpaceX has said by 2022. I think 2026 is optimistic but it might be pushed to happen as a statement that it can be done, rather than for any pressing practical reason. |
| Humans on Mars make use of cargo previously landed there. | NET 2032 | Sorry, it is just going to take longer than everyone expects. |
| First "permanent" human colony on Mars. | NET 2036 | It will be magical for the human race if this happens by then. It will truly inspire us all. |
| Point to point transport on Earth in an hour or so (using a BF rocket). | NIML | This will not happen without some major new breakthrough of which we currently have no inkling. |
| Regular service of Hyperloop between two cities. | NIML | I can't help but be reminded of when Chuck Yeager described the Mercury program as "Spam in a can". |

1 AI and ML have been around for a long time already. I have been in pursuit of their magic for a long time. I have worked in both Artificial Intelligence and Machine Learning for over forty years. My 1977 Master’s thesis used Markov chains to prove the convergence of a particular machine learning algorithm. It was an abysmally terrible thesis.

AI/ML Is Not Uniquely Powerful Enough To Need Controlling

Note: This short post is intended as a counterpoint to some claims that are being made about the need to control AI research. I don’t directly refer to those claims. You can figure it out. 

When humans next land on the Moon it will be with the help of many, many, Artificial Intelligence and Machine Learning systems.

Last time we got to the Moon and back without AI or ML.

I think this highlights the fact that current versions of AI and ML are just technologies. Different technologies can get to the same goal.

Some AI/ML researchers are making a big fuss about how their work needs to be regulated as it is uniquely powerful. I disagree that it is uniquely powerful. Current day AI and ML are nothing like the intelligence or learning possessed by biological systems. They are both very narrow slices of the whole system. They are not particularly powerful.

Modern day Prometheuses rely on all sorts of technologies. Neither AI nor ML gives them a particular leg up, despite how exciting they might seem to current practitioners. It is the goal of a Prometheus that is important, not the particular technological tools that are used to achieve that goal.

Point 1: Swarms of killer drones could just as well be developed without any “AI”, using other technologies. We both got to the Moon, and had precise cruise missiles without any technologies that we would today call AI or ML1. We can develop “slaughterbots” without using anything that practitioners today would call AI or ML. So banning AI or ML in weapons systems will not change outcomes. It is futile. If you don’t like the sorts of things those weapons systems do, then work to ban the things they do, not the particular and very fungible technologies that are just one of many ways to produce that behavior.

Earlier this week, on December 18th, twitter user @ewschaetzle sent out a quote from H. P. Lovecraft from 1928, saying it “seems to capture the (misguided) fear that some have expressed toward .”:

The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents. We live on a placid island of ignorance in the midst of black seas of infinity, and it was not meant we should voyage far. The sciences, each straining in its own direction, have hitherto harmed us little; but some day the piecing together of dissociated knowledge will open up such terrifying vistas of reality, and of our frightful position therein, that we shall either go mad from the revelation or flee from the deadly light into the peace and darkness of a new dark age.

I have not found the full quote elsewhere but here is a partial version of it.

I like this quote a lot.

Three months ago in a long essay blog post (and in a better edited version in Technology Review) I pointed out seven common mistakes that people are making in predicting the future of AI, and by implication, the future of ML. In general they are vastly overestimating both its current power and how quickly it will develop.

Lovecraft’s words give a rationale for why this overestimation leads many otherwise sensible, and even brilliant, entrepreneurs, physicists, and others to say that AI in general is incredibly dangerous and we must control its development. It is complex and they get scared.

Point 2: If one wants to legislate control of “AI research or development” in some way, then one must believe that those rules or laws will change at least one person’s behavior in some way. Without some change in behavior there is no point to legislation or rules, beyond smug self satisfaction that such laws or rules have been enacted. My question to those who say we should have these rules is: Show me one explicit change of behavior that you would like to see. Tell me who would have to do what differently than they currently are doing, and how that would impact the future. Tell me how it would make the world safer from AI.

So far I have not seen anyone suggest any explicit law or rule. All I have heard is “we must control it”. How? Let alone why?

1 Someone on twitter disagreed with my claim that we got to the Moon without ML by saying that Kalman filters, which were developed for navigation in the Apollo missions, use Bayesian statistics, so therefore we did use ML to get to the Moon. That is a silly argument. What ML refers to today is much, much more than Kalman filters, which were developed as state estimators, not as anything to do with learning from datasets. There is no pre-learned anything in using Kalman filters.
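To make the footnote's distinction concrete, here is a minimal sketch of a scalar Kalman filter in Python. It is an illustration under simplifying assumptions (a one dimensional, nearly constant state, with made up noise variances), not the Apollo navigation filter, and the function and parameter names are hypothetical. The thing to notice is that there is no training phase and no dataset: the estimate is built up recursively from the measurements as they arrive, using only a fixed model and the stated noise levels.

```python
def kalman_1d(measurements, meas_var, process_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a (nearly) constant state.

    measurements: noisy readings of the state, in arrival order
    meas_var:     variance of the measurement noise (assumed known)
    process_var:  variance added to the state per step (assumed known)
    x0, p0:       initial estimate and its variance
    Returns the list of state estimates after each measurement.
    """
    x, p = x0, p0  # current estimate and its variance
    estimates = []
    for z in measurements:
        # Predict: the model says the state stays put, but uncertainty grows.
        p = p + process_var
        # Update: blend prediction and measurement using the Kalman gain.
        k = p / (p + meas_var)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# Three identical readings of 5.0, starting from a prior estimate of 0.0.
print(kalman_1d([5.0, 5.0, 5.0], meas_var=1.0, process_var=0.0))
```

Each pass through the loop uses only the previous estimate, its variance, and the new reading, which is what makes it a state estimator rather than something learned from a dataset.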

What If There Were Men On The Moon Today?

I was sitting on the beach looking at the full moon above. I looked through my binoculars to see more detail. And then it occurred to me that I could see on that surface every single location that humans had landed on the surface of a body in space that was not Earth. Six times, stretching from 48 years ago to 45 years ago.

That brought to mind this iconic photograph taken by Michael Collins during the Apollo 11 mission. He was alone in the Command Module, and visible in the foreground is the Lunar Lander with Buzz Aldrin and Neil Armstrong in it. In the background is Earth. This makes Michael Collins the only person who has ever lived who was not inside the frame of this photo. The only person not included.

This got me to thinking. What if, horribly, one of the six Lunar Modules that landed on the Moon with two astronauts on board had failed to take off or failed to reach orbit for docking with the Command Module? If that had happened, every time we look at the Moon today we would see a grave site. It would be the most visible grave site in the world, visible from every place on the surface of the Earth.

How would that have changed the way we viewed mankind’s place in the Universe? Would we have seen exploration as a failure and something we should not do any more? Or would we have been inspired to try harder and not let 45 years pass without a return to space faring?

When we return to the Moon, or go to Mars, we will do it with more intent to stay longer than the six times we visited the Moon. Then we only landed for between 24 and 72 hours. When we spend a longer time, and with more people, on the surface of another body eventually there will be deaths. We saw that with the Shuttle program, and the Soviet Union had their own deaths in their programs. But if the exploration, and indeed settling, is permanent then I don’t think we will have the same empty feeling looking up at the Moon and Mars, as we would today if there had been an Apollo tragedy on the surface of the Moon.

When we go to the Moon, or Mars, the next time, let’s make it for real.


[FoR&AI] The Seven Deadly Sins of Predicting the Future of AI

[An essay in my series on the Future of Robotics and Artificial Intelligence.]

We are surrounded by hysteria about the future of Artificial Intelligence and Robotics. There is hysteria about how powerful they will become and how quickly, and there is hysteria about what they will do to jobs.

As I write these words on September 2nd, 2017, I note just two news stories from the last 48 hours.

Yesterday, in the New York Times, Oren Etzioni, chief executive of the Allen Institute for Artificial Intelligence, wrote an opinion piece titled How to Regulate Artificial Intelligence where he does a good job of arguing against the hysteria that Artificial Intelligence is an existential threat to humanity. He proposes rather sensible ways of thinking about regulations for Artificial Intelligence deployment, rather than the chicken little “the sky is falling” calls for regulation of research and knowledge that we have seen from people who really, really, should know a little better.

Today, there is a story in Market Watch that robots will take half of today’s jobs in 10 to 20 years. It even has a graphic to prove the numbers.

The claims are ludicrous. [I try to maintain professional language, but sometimes…] For instance, it appears to say that we will go from 1 million grounds and maintenance workers in the US to only 50,000 in 10 to 20 years, because robots will take over those jobs. How many robots are currently operational in those jobs? ZERO. How many realistic demonstrations have  there been of robots working in this arena? ZERO. Similar stories apply to all the other job categories in this diagram where it is suggested that there will be massive disruptions of 90%, and even as much as 97%, in jobs that currently require physical presence at some particular job site.

Mistaken predictions lead to fear of things that are not going to happen. Why are people making mistakes in predictions about Artificial Intelligence and robotics, so that Oren Etzioni, I, and others, need to spend time pushing back on them?

Below I outline seven ways of thinking that lead to mistaken predictions about robotics and Artificial Intelligence. We find instances of these ways of thinking in many of the predictions about our AI future. I am first going to list the four general topic areas of such predictions that I notice, along with a brief assessment of where I think they currently stand.

A. Artificial General Intelligence. Research on AGI is an attempt to distinguish a thinking entity from current day AI technology such as Machine Learning. Here the idea is that we will build autonomous agents that operate much like beings in the world. This has always been my own motivation for working in robotics and AI, but the recent successes of AI are not at all like this.

Some people think that all AI is an instance of AGI, but as the word “general” would imply, AGI aspires to be much more general than current AI. Interpreting current AI as an instance of AGI makes it seem much more advanced and all encompassing than it really is.

Modern day AGI research is not doing at all well on being either general or getting to an independent entity with an ongoing existence. It mostly seems stuck on the same issues in reasoning and common sense that AI has had problems with for at least fifty years. Alternate areas such as Artificial Life, and Simulation of Adaptive Behavior did make some progress in getting full creatures in the eighties and nineties (these two areas and communities were where I spent my time during those years), but they have stalled.

My own opinion is that of course this is possible in principle. I would never have started working on Artificial Intelligence if I did not believe that. However perhaps we humans are just not smart enough to figure out how to do this–see my remarks on humility in my post on the current state of Artificial Intelligence suitable for deployment in robotics. Even if it is possible I  personally think we are far, far further away from understanding how to build AGI than many other pundits might say.

[Some people refer to “an AI”, as though all AI is about being an autonomous agent. I think that is confusing, and just as the natives of San Francisco do not refer to their city as “Frisco”, no serious researchers in AI refer to “an AI”.]

B. The Singularity. This refers to the idea that eventually an AI based intelligent entity, with goals and purposes, will be better at AI research than us humans are. Then, with an unending Moore’s law mixed in making computers faster and faster, Artificial Intelligence will take off by itself, and, as in speculative physics going through the singularity of a black hole, we have no idea what things will be like on the other side.

People who “believe” in the Singularity are happy to give post-Singularity AI incredible power, as what will happen afterwards is quite unpredictable. I put the word believe in scare quotes as belief in the singularity can often seem like a religious belief. For some it comes with an additional benefit of being able to upload their minds to an intelligent computer, and so get eternal life without the inconvenience of having to believe in a standard sort of supernatural God. The ever powerful technologically based AI is the new God for them. Techno religion!

Some people have very specific ideas about when the day of salvation will come–followers of one particular Singularity prophet believe that it will happen in the year 2029, as it has been written.

This particular error of prediction is very much driven by exponentialism, and I will address that as one of the seven common mistakes that people make.

Even if there is a lot of computer power around it does not mean we are close to having programs that can do research in Artificial Intelligence, and rewrite their own code to get better and better.

Here is where we are on programs that can understand computer code. We currently have no programs that can understand a one page program as well as a new student in computer science can understand such a program after just one month of taking their very first class in programming. That is a long way from AI systems being better at writing AI systems than humans are.

Here is where we are on simulating brains at the neural level, the other methodology that Singularity worshipers often refer to. For about thirty years we have known the full “wiring diagram” of the 302 neurons in the worm C. elegans, along with the 7,000 connections between them. This has been incredibly useful for understanding how behavior and neurons are linked. But it has been a thirty-year study with hundreds of people involved, all trying to understand just 302 neurons. And according to the OpenWorm project, which is trying to simulate C. elegans bottom up, they are not yet half way there. To simulate a human brain with 100 billion neurons and a vast number of connections is quite a way off. So if you are going to rely on the Singularity to upload yourself to a brain simulation I would try to hold off on dying for another couple of centuries.

Just in case I have not made my own position on the Singularity clear, I refer you to my comments in a regularly scheduled look at the event by the magazine IEEE Spectrum. Here is the 2008 version, and in particular a chart of where the players stand and what they say. Here is the 2017 version, and in particular a set of boxes of where the players stand and what they say. And yes, I do admit to being a little snarky in 2017…

C. Misaligned Values. The third case is that the Artificial Intelligence based machines get really good at execution of tasks, so much so that they are super human at getting things done in a complex world. And they do not share human values and this leads to all sorts of problems.

I think there could be versions of this that are true–if I have recently bought an airline ticket to some city, suddenly all the web pages I browse that rely on advertisements for revenue start displaying ads for airline tickets to the same city. This is clearly dumb, but I don’t think it is a sign of super capable intelligence, rather it is a case of poorly designed evaluation functions in the algorithms that place advertisements.

But here is a quote from one of the proponents of this view (I will let him remain anonymous, as an act of generosity):

The well-known example of paper clips is a case in point: if the machine’s only goal is maximizing the number of paper clips, it may invent incredible technologies as it sets about converting all available mass in the reachable universe into paper clips; but its decisions are still just plain dumb.

Well, no. We would never get to a situation in any version of the real world where such a program could exist. One smart enough that it would be able to invent ways to subvert human society to achieve goals set for it by humans, without understanding the ways in which it was causing problems for those same humans. Thinking that technology might evolve this way is just plain dumb (nice turn of phrase…), and relies on making multiple errors among the seven that I discuss below.

This same author repeatedly (including in the piece from which I took this quote, but also at the big International Joint Conference on Artificial Intelligence (IJCAI) that was held just a couple of weeks ago in Melbourne, Australia) argues that we need research to come up with ways to mathematically prove that Artificial Intelligence systems have their goals aligned with humans.

I think this case C comes from researchers seeing an intellectually interesting research problem, and then using their well known voices to promote it as an urgent research question. Then AI hangers-on take it, run with it, and turn it into an existential problem for mankind.

By the way, I think mathematical provability is a vain hope. With multi-year large team efforts we can not prove that a 1,000 line program can not be breached by external hackers, so we certainly won’t be able to prove very much at all about large AI systems. The good news is that us humans were able to successfully co-exist with, and even use for our own purposes, horses, themselves autonomous agents with on going existences, desires, and super-human physical strength, for thousands of years. And we had not a single theorem about horses. Still don’t!

D. Really evil horrible nasty human-destroying Artificially Intelligent entities. The last category is like case C, but here the supposed Artificial Intelligence powered machines will take an active dislike to humans and decide to destroy them and get them out of the way.

This has been a popular fantasy in Hollywood since at least the late 1960’s with movies like 2001: A Space Odyssey (1968, but set in 2001), where the machine-wreaked havoc was confined to a single space ship, and Colossus: The Forbin Project (1970, and set in those times) where the havoc was at a planetary scale. The theme has continued over the years, and more recently with I, Robot (2004, set in 2035) where the evil AI computer VIKI takes over the world through the instrument of the new NS-5 humanoid robots. [By the way, that movie continues the bizarre convention from other science fiction movies that large complex machines are built with spaces that have multi hundred feet heights around them so that there can be great physical peril for the human heroes as they fight the good fight against the machines gone bad…]

This is even wronger than case C. I think it must make people feel tingly thinking about these terrible, terrible dangers…

In this blog, I am not going to address the issue of military killer robots–this often gets confused in the press with issue D above, and worse it often gets mashed together by people busy fear mongering about issue D. They are very separate issues. Furthermore I think that many of the arguments about such military robots are misguided. But it is a very different issue and will have to wait for another blog post.

Now, the seven mistakes I think people are making. All seven of them influence the assessments about timescales for and likelihood of each of scenarios A, B, C, and D, coming about. But some are more important I believe in the mis-estimations than others. I have labeled in the section headers for each of these seven errors where I think they do the most damage. The first one does some damage everywhere!

1. [A,B,C,D] Over and under estimating

Roy Amara was a futurist and the co-founder and President of the Institute For The Future in Palo Alto, home of Stanford University, countless venture capitalists, and the intellectual heart of Silicon Valley. He is best known for his adage, now referred to as Amara’s law:

We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.

There is actually a lot wrapped up in these 21 words which can easily fit into a tweet and allow room for attribution. An optimist can read it one way, and a pessimist can read it another. It should make the optimist somewhat pessimistic, and the pessimist somewhat optimistic, for a while at least, before each reverting to their norm.

A great example1 of the two sides of Amara’s law that we have seen unfold over the last thirty years concerns the US Global Positioning System. Starting in 1978 a constellation of 24 satellites (30 including spares) was placed in orbit. A ground station that can see 4 of them at once can compute its latitude, longitude, and height above a version of sea level. An operations center at Schriever Air Force Base in Colorado constantly monitors the precise orbits of the satellites and the accuracy of their onboard atomic clocks and uploads minor and continuous adjustments to them. If those updates were to stop, GPS would fail to have you on the correct road as you drive around town after only a week or two, and would have you in the wrong town after a couple of months.
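Why four satellites rather than three? A receiver has four unknowns: its three position coordinates plus the error of its own cheap clock, and each visible satellite contributes one equation. The Python sketch below illustrates this with a plain Newton iteration. Everything in it is an assumption made for illustration: the satellite coordinates, the receiver position, and the clock bias are invented numbers in kilometres, the clock bias is expressed as an equivalent distance, and real receivers apply many corrections that are ignored here.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def locate(sats, pseudoranges, iterations=20):
    """Newton iteration for receiver (x, y, z) and clock bias b (as a distance)."""
    est = [0.0, 0.0, 0.0, 0.0]  # start at Earth's centre with zero bias
    for _ in range(iterations):
        jac, resid = [], []
        for (sx, sy, sz), rho in zip(sats, pseudoranges):
            dx, dy, dz = est[0] - sx, est[1] - sy, est[2] - sz
            dist = math.sqrt(dx * dx + dy * dy + dz * dz)
            resid.append(rho - (dist + est[3]))  # measured minus predicted
            jac.append([dx / dist, dy / dist, dz / dist, 1.0])
        step = solve_linear(jac, resid)
        est = [e + s for e, s in zip(est, step)]
        if max(abs(s) for s in step) < 1e-9:
            break
    return est


# Hypothetical satellite positions (km, Earth-centred) and a made-up receiver.
SATS = [(20000.0, 0.0, 17500.0), (-5000.0, 18000.0, 19000.0),
        (0.0, -20000.0, 17500.0), (10000.0, 10000.0, 22500.0)]
TRUE_POS = (1000.0, 2000.0, 6000.0)
TRUE_BIAS = 30.0  # receiver clock error, expressed in km of range
PSEUDORANGES = [math.dist(s, TRUE_POS) + TRUE_BIAS for s in SATS]

x, y, z, b = locate(SATS, PSEUDORANGES)
print(round(x, 3), round(y, 3), round(z, 3), round(b, 3))
```

With only three satellites the system would have more unknowns than equations; the fourth measurement is what lets the receiver solve for its own clock error instead of carrying an atomic clock.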

The goal of GPS was to allow precise placement of bombs by the US military. That was the expectation for it. The first operational use in that regard was in 1991 during Desert Storm, and it was promising. But during the nineties there was still much distrust of GPS as it was not delivering on its early promise, and it was not until the early 2000’s that its utility was generally accepted in the US military. It had a hard time delivering on its early expectations and the whole program was nearly cancelled again and again.

Today GPS is in the long term, and the ways it is used were unimagined when it was first placed in orbit. My Series 2 Apple Watch uses GPS while I am out running to record my location accurately enough to see which side of the street I ran along. The tiny size and tiny price of the receiver would have been incomprehensible to the early GPS engineers. GPS is now used for so many things that the designers never considered. It synchronizes physics experiments across the globe and is now an intimate component of synchronizing the US electrical grid and keeping it running, and it even allows the high frequency traders who really control the stock market to mostly not fall into disastrous timing errors. It is used by all our airplanes, large and small to navigate, it is used to track people out of jail on parole, and it determines  which seed variant will be planted in which part of many fields across the globe. It tracks our fleets of trucks and reports on driver performance, and the bouncing signals on the ground are used to determine how much moisture there is in the ground, and so determine irrigation schedules.

GPS started out with one goal but it was a hard slog to get it working as well as was originally expected. Now it has seeped into so many aspects of our lives that we would not just be lost if it went away, but we would be cold, hungry, and quite possibly dead.

We see a similar pattern with other technologies over the last thirty years. A big promise up front, disappointment, and then slowly growing confidence, beyond where the original expectations were aimed. This is true of the blockchain (Bitcoin was the first application), sequencing individual human genomes, solar power, wind power, and even home delivery of groceries.

Perhaps the most blatant example is that of computation itself. When the first commercial computers were deployed in the 1950’s there was widespread fear that they would take over all jobs (see the movie Desk Set from 1957). But for the next 30 years computers were something that had little direct impact on people’s lives and even in 1987 there were hardly any microprocessors in consumer devices. That has all changed in the second wave over the subsequent 30 years and now we all walk around with our bodies adorned with computers, our cars full of them, and they are all over our houses.

To see how the long term influence of computers has consistently been underestimated one need just go back and look at portrayals of them in old science fiction movies or TV shows about the future. The three hundred year hence space ship computer in the 1966 Star Trek (TOS) was laughable just thirty years later, let alone three centuries later.  And in Star Trek The Next Generation, and Star Trek Deep Space Nine, whose production spanned 1986 to 1999, large files still needed to be carried by hand around the far future space ship or space station as they could not be sent over the network (like an AOL network of the time). And the databases available for people to search were completely anemic with their future interfaces which were pre-Web in design.

Most technologies are overestimated in the short term. They are the shiny new thing. Artificial Intelligence has the distinction of having been the shiny new thing and being overestimated again and again, in the 1960’s, in the 1980’s, and I believe again now. (Some of the marketing messages from large companies on their AI offerings are truly delusional, and may have very bad blowback for them in the not too distant future.)

Not all technologies get underestimated in the long term, but that is most likely the case for AI. The question is how long is the long term. The next six errors that I talk about help explain how the timing for the long term is being grossly underestimated for the future of AI.

2. [B,C,D] Imagining Magic

When I was a teenager, Arthur C. Clarke was one of the “big three” science fiction writers along with Robert Heinlein and Isaac Asimov. But Clarke was more than just a science fiction writer. He was also an inventor, a science writer, and a futurist.

In February 1945 he wrote a letter⁠2 to Wireless World about the idea of geostationary satellites for research, and in October of that year he published a paper⁠3 outlining how they could be used to provide world-wide radio coverage. In 1948 he wrote a short story The Sentinel which provided the kernel idea for Stanley Kubrick’s epic AI movie 2001: A Space Odyssey, with Clarke authoring a book of the same name as the film was being made, explaining much that had left the movie audience somewhat lost.

In the period from 1962 to 1973 Clarke formulated three adages, which have come to be known as Clarke’s three laws (he said that Newton only had three, so three were enough for him too):

  1. When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
  2. The only way of discovering the limits of the possible is to venture a little way past them into the impossible.
  3. Any sufficiently advanced technology is indistinguishable from magic.

Personally I should probably be wary of the second sentence in his first law, as I am much more conservative than some others about how quickly AI will be ascendant. But for now I want to expound on Clarke’s third law.

Imagine we had a time machine (powerful magic in itself…) and we could transport Isaac Newton from the late 17th century to Trinity College Chapel in Cambridge University. That chapel was already 100 years old when he was there so perhaps it would not be too much of an immediate shock to find himself in it, not realizing the current date.

Now show Newton an Apple. Pull out an iPhone from your pocket, and turn it on so that the screen is glowing and full of icons and hand it to him. The person who revealed how white light is made from components of different colored light by pulling apart sunlight with a prism and then putting it back together again would no doubt be surprised at such a small object producing such vivid colors in the darkness of the chapel. Now play a movie of an English country scene, perhaps with some animals with which he would be familiar–nothing indicating the future in the content. Then play some church music with which he would be familiar. And then show him a web page with the 500 plus pages of his personally annotated copy of his masterpiece Principia, teaching him how to use the pinch gestures to zoom in on details.

Could Newton begin to explain how this small device did all that? Although he invented calculus and explained both optics and gravity, Newton was never able to sort out chemistry and alchemy. So I think he would be flummoxed, and unable to come up with even the barest coherent outline of what this device was. It would be no different to him than an embodiment of the occult–something which was of great interest to him when he was alive. For him it would be indistinguishable from magic. And remember, Newton was a really smart dude.

If something is magic it is hard to know the limitations it has. Suppose we further show Newton how it can illuminate the dark, how it can take photos and movies and record sound, how it can be used as a magnifying glass, and as a mirror. Then we show him how it can be used to carry out arithmetical computations at incredible speed and to many decimal places. And we show it counting his steps as he carries it.

What else might Newton conjecture that the device in front of him could do? Would he conjecture that he could use it to talk to people anywhere in the world immediately from right there in the chapel? Prisms work forever. Would he conjecture that the iPhone would work forever just as it is, neglecting to understand that it needed to be recharged (and recall that we nabbed him from a time 100 years before the birth of Michael Faraday, so the concept of electricity was not quite around)? If it can be a source of light without fire could it perhaps also transmute lead to gold?

This is a problem we all have with imagined future technology. If it is far enough away from the technology we have and understand today, then we do not know its limitations. It becomes indistinguishable from magic.

When a technology passes that magic line anything one says about it is no longer falsifiable, because it is magic.

This is a problem I regularly encounter when trying to debate with people about whether we should fear just plain AGI, let alone cases C or D from above. I am told that I do not understand how powerful it will be. That is not an argument. We have no idea whether it can even exist. All the evidence that I see says that we have no real idea yet how to build one. So its properties are completely unknown, so rhetorically it quickly becomes magical and super powerful. Without limit.

Nothing in the Universe is without limit. Not even magical future AI.

Watch out for arguments about future technology which is magical. It can never be refuted. It is a faith-based argument, not a scientific argument.

3. [A,B,C] Performance versus competence

One of the social skills that we all develop is an ability to estimate the capabilities of individual people with whom we interact. It is true that sometimes “out of tribe” issues tend to overwhelm and confuse our estimates, and such is the root of the perfidy of racism, sexism, classism, etc. In general, however, we use cues from how a person performs some particular task to estimate how well they might perform some different task. We are able to generalize from observing performance at one task to a guess at competence over a much bigger set of tasks. We understand intuitively how to generalize from the performance level of the person to their competence in related areas.

When in a foreign city we ask a stranger on the street for directions and they reply with confidence, in the language we spoke to them, and with directions that seem to make sense, we think it worth pushing our luck and asking them about the local system for paying when you want to take a bus somewhere in that city.

If our teenage child is able to configure their new game machine to talk to the household wifi we suspect that if sufficiently motivated they will be able to help us get our new tablet computer on to the same network.

If we notice that someone is able to drive a manual transmission car, we will be pretty confident that they will be able to drive one with an automatic transmission too. Though if the person is North American we might not expect it to work for the converse case.

If we ask an employee in a large hardware store where to find a particular item, a home electrical fitting say, that we are looking for and they send us to an aisle of garden tools, we will probably not go back and ask that very same person where to find a particular bathroom fixture. We will estimate that not only do they not know where the electrical fittings are, but that they really do not know the layout of goods within the store, and we will look for a different person to ask with our second question.

Now consider a case that is closer to some performances we see for some of today’s AI systems.

Suppose a person tells us that a particular photo is of people playing Frisbee in the park, then we naturally assume that they can answer questions like “what is the shape of a Frisbee?”, “roughly how far can a person throw a Frisbee?”, “can a person eat a Frisbee?”, “roughly how many people play Frisbee at once?”, “can a 3 month old person play Frisbee?”, “is today’s weather suitable for playing Frisbee?”; in contrast we would not expect a person from another culture who says they have no idea what is happening in the picture to be able to answer all those questions.  Today’s image labelling systems that routinely give correct labels, like “people playing Frisbee in a park” to online photos, have no chance of answering those questions.  Besides the fact that all they can do is label more images and can not answer questions at all, they have no idea what a person is, that parks are usually outside, that people have ages, that weather is anything more than how it makes a photo look, etc., etc.

This does not mean that these systems are useless however. They are of great value to search engine companies. Simply labelling images well lets the search engine bridge the gap from search for words to searching for images. Note too that search engines usually provide multiple answers to any query and let the person using the engine review the top few and decide which ones are actually relevant. Search engine companies strive to get the performance of their systems to get the best possible answer as one of the top five or so. But they rely on the cognitive abilities of the human user so that they do not have to get the best answer first, every time. If they only gave one answer, whether to a search for “great hotels in Paris”, or at an e-commerce site only gave one image selection for a “funky neck tie”, they would not be as useful as they are.

Here is what goes wrong. People hear that some robot or some AI system has performed some task. They then take the generalization from that performance to a general competence that a person performing that same task could be expected to have. And they apply that generalization to the robot or AI system.

Today’s robots and AI systems are incredibly narrow in what they can do. Human style generalizations just do not apply. People who do make these generalizations get things very, very wrong.

4. [A,B] Suitcase words

I spoke briefly about suitcase words (Marvin Minsky’s term4) in my post explaining how machine learning works. There I was discussing how the word learning can mean so many different types of learning when applied to humans. And as I said there, surely there are different mechanisms that humans use for different sorts of learning. Learning to use chopsticks is a very different experience from learning the tune of a new song. And learning to write code is a very different experience from learning your way around a particular city.

When people hear that Machine Learning is making great strides and they think about a machine learning in some new domain, they tend to use as a mental model the way in which a person would learn that new domain. However, Machine Learning is very brittle, and it requires lots of human preparation by researchers or engineers, special purpose coding for processing input data, special purpose sets of training data, and a custom learning structure for each new problem domain. Today’s Machine Learning by computers is not at all the sponge like learning that humans engage in, making rapid progress in a new domain without having to be surgically altered or purpose built.

Likewise when people hear that computers can now beat the world chess champion (in 1997) or the world Go champion (in 2016) they tend to think that the computer is “playing” the game just like a human would. Of course in reality those programs had no idea what a game actually was (again, see my post on machine learning), nor that they were playing. And as pointed out in this article in The Atlantic, during the recent Go challenge the human player, Lee Sedol, was supported by 12 ounces of coffee, whereas the AI program, AlphaGo, was running on a whole bevy of machines as a distributed application, and was supported by a team of more than 100 scientists.

When a human plays a game a small change in rules does not throw them off–a good player can adapt. Not so for AlphaGo or Deep Blue, the program that beat Garry Kasparov back in 1997.

Suitcase words lead people astray in understanding how well machines are doing at tasks that people can do. AI researchers, on the other hand, and worse their institutional press offices, are eager to claim that progress in their research is an instance of what a suitcase word applies to for humans. The important phrase here is “an instance”. No matter how careful the researchers are, and unfortunately not all of them are so very careful, as soon as word of the research result gets to the press office and then out into the unwashed press, that detail soon gets lost. Headlines trumpet the suitcase word, and mis-set the general understanding of where AI is, and how close it is to accomplishing more.

And we haven’t even gotten to applying many of Minsky’s suitcase words to AI systems: consciousness, experience, or thinking. For us humans it is hard to think about playing chess without being conscious, or having the experience of playing, or thinking about a move. So far, none of our AI systems have risen to even an elementary level where one of the many ways in which we use those words about humans applies. When we do get to a point where we start using some of those words about particular AI systems, and I tend to think that we will, the press, and most people, will over generalize again.

Even with a very narrow single aspect demonstration of one slice of these words I am afraid people will over generalize and think that machines are on the very door step of human-like capabilities in these aspects of being intelligent.

Words matter, but whenever we use a word to describe something about an AI system, where that can also be applied to humans, we find people overestimating what it means. So far most words that apply to humans when used for machines, are only a microscopically narrow conceit of what the word means when applied to humans.

Here are some of the verbs that have been applied to machines, and for which machines are totally unlike humans in their capabilities:

anticipate, beat, classify, describe, estimate, explain, hallucinate, hear, imagine, intend, learn, model, plan, play, recognize, read, reason, reflect, see, understand, walk, write

For all these words there have been research papers describing a narrow sliver of the rich meanings that these words imply when applied to humans. Unfortunately the use of these words suggests that there is much more there there than is there.

This leads people to misinterpret and then overestimate the capabilities of today’s Artificial Intelligence.

5. [A,B,B,B,…] Exponentials

Many people are suffering from a severe case of “exponentialism”.

Everyone has some idea about Moore’s Law, at least enough to sort of know that computers get better and better on a clockwork like schedule.

What Gordon Moore actually said was that the number of components that could fit on a microchip would double every year.  I published a blog post in February about this and how it is finally coming to an end after a solid fifty year run. Moore had made his predictions in 1965 with only four data points using this graph:

He only extrapolated for 10 years, but instead it has lasted 50 years, although the time constant for doubling has gradually lengthened from one year to over two years, and now it is finally coming to an end.
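To make concrete how much the doubling period matters over a long run, here is a minimal Python sketch. The function name growth_factor and the particular periods compared are illustrative assumptions, not figures taken from Moore's paper; the point is only the arithmetic of repeated doubling.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` if a quantity doubles every period."""
    return 2.0 ** (years / doubling_period_years)


# Ten years of yearly doubling, the span of the original extrapolation.
ten_years_yearly = growth_factor(10, 1.0)

# Fifty years at a one year doubling period versus a two year period.
fifty_years_yearly = growth_factor(50, 1.0)
fifty_years_two_yearly = growth_factor(50, 2.0)

print(f"10 years, doubling yearly:        x{ten_years_yearly:,.0f}")
print(f"50 years, doubling yearly:        x{fifty_years_yearly:,.0f}")
print(f"50 years, doubling every 2 years: x{fifty_years_two_yearly:,.0f}")
```

Stretching the doubling period from one year to two takes the square root of the total growth over the same span, so even a gradual lengthening changes the long run outcome enormously.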

Doubling the components on a chip has led to computers that keep getting twice as fast. And it has led to memory chips that have four times as much memory every two years. It has also led to digital cameras which have had more and more resolution, and LCD screens with exponentially more pixels.

The reason Moore’s law worked is that it applied to a digital abstraction of true or false.  Was there an electrical charge or voltage there or not? And the answer to that yes/no question is the same when the number of electrons is halved, and halved again, and halved again. The answer remains consistent through all those halvings until we get down to so few electrons that quantum effects start to dominate, and that is where we are now with our silicon based chip technology.

Moore’s law, and exponential laws like Moore’s law can fail for three different reasons:

  a. It gets down to a physical limit where the process of halving/doubling no longer works.
  b. The market demand pull gets saturated so there is no longer an economic driver for the law to continue.
  c. It may not have been an exponential process in the first place.

When people are suffering from exponentialism they may gloss over any of these three reasons and think that the exponentials that they use to justify an argument are going to continue apace.

Moore’s Law is now faltering under case (a). But its presence for fifty years has powered the relentless innovation of the technology industry, the rise of Silicon Valley and Venture Capital, and the rise of the geeks to be amongst the richest people in the world, and that has led too many people to think that everything in technology, including AI, is exponential.

It is well understood that many cases of exponential processes are really part of an “S-curve”, where at some point the hyper growth flattens out. Exponential growth of the number of users of a social platform such as Facebook or Twitter must turn into an S-curve eventually as there are only a finite number of humans alive to be new users, and so exponential growth can not continue forever. This is an example of case (b) above.
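The way an S-curve can pass for an exponential early on can be sketched in a few lines of Python. Everything numeric here is a made-up assumption for illustration (a platform starting with 1 million users, doubling yearly, with a ceiling of 8,000 million possible users); the logistic function is one standard way to model an S-curve, not the only one.

```python
import math

INITIAL_USERS = 1.0        # millions of users, hypothetical starting point
CEILING = 8000.0           # millions of users, hypothetical upper bound
RATE = math.log(2.0)       # growth rate that gives yearly doubling early on


def exponential(t: float) -> float:
    """Unbounded exponential growth from INITIAL_USERS."""
    return INITIAL_USERS * math.exp(RATE * t)


def logistic(t: float) -> float:
    """S-curve that starts at INITIAL_USERS and saturates at CEILING."""
    headroom = (CEILING - INITIAL_USERS) / INITIAL_USERS
    return CEILING / (1.0 + headroom * math.exp(-RATE * t))


# Early on the two curves are nearly indistinguishable; later they part ways.
for year in (0, 3, 6, 9, 12, 15, 20):
    print(f"year {year:2d}: exponential {exponential(year):12.1f}"
          f"   s-curve {logistic(year):8.1f}")
```

With only the first few data points in hand there is no way to tell the two curves apart, which is why extrapolating from the early, steep part of a curve is so tempting and so unreliable.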

But there is more to this. Sometimes just the demand from individual users can look like an exponential pull for a while, but then it gets saturated.

Back in the first part of this century when I was running a very large laboratory at M.I.T. (CSAIL) and needed to help raise research money for over 90 different research groups, I tried to show sponsors how things were continuing to change very rapidly through the memory increase on iPods. Unlike Gordon Moore I had five data points! The data was how much storage one got for one’s music in an iPod for about $400.  I noted the dates of new models and for five years in a row, somewhere in the June to September time frame a new model would appear.  Here are the data:

Year GigaBytes
2003 10
2004 20
2005 40
2006 80
2007 160

The data came out perfectly (Gregor Mendel would have been proud…) as an exponential. Then I would extrapolate a few years out and ask what we would do with all that memory in our pockets.

Extrapolating through to today we would expect a $400 iPod to have 160,000 GigaBytes of memory (or 160 TeraBytes). But the top of the line iPhone of today (which costs more than $400) only has 256 GigaBytes of memory, less than double the 2007 iPod, while the top of the line iPod (touch) has only 128 GigaBytes which ten years later is a decrease from the 2007 model.
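That extrapolation is easy to reproduce. The sketch below uses the five data points from the table above and assumes that “today” means 2017, ten years after the last data point; the variable and function names are purely illustrative.

```python
# Storage in an iPod costing about $400, from the table above (year -> GigaBytes).
ipod_gigabytes = {2003: 10, 2004: 20, 2005: 40, 2006: 80, 2007: 160}

years = sorted(ipod_gigabytes)

# Year over year ratios: a perfect exponential gives the same ratio every time.
ratios = [ipod_gigabytes[later] / ipod_gigabytes[earlier]
          for earlier, later in zip(years, years[1:])]


def extrapolate(year: int) -> int:
    """Naively continue the yearly doubling from the last data point."""
    last_year = years[-1]
    return ipod_gigabytes[last_year] * 2 ** (year - last_year)


projected_2017 = extrapolate(2017)   # assumes "today" is 2017
actual_iphone_gigabytes = 256        # top of the line iPhone quoted in the text

print("year over year ratios:", ratios)
print("projected GigaBytes in 2017:", projected_2017)
print("projection / actual:", projected_2017 / actual_iphone_gigabytes)
```

The projection lands at 163,840 GigaBytes, which is the roughly 160,000 GigaBytes quoted above, and it overshoots the actual top of the line iPhone by a factor of several hundred.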

This particular exponential collapsed very suddenly once the amount of memory got to the point where it was big enough to hold any reasonable person’s complete music library, in their hand. Exponentials can stop when the customers stop demanding.

Moving on, we have seen a sudden increase in performance of AI systems due to the success of deep learning, a form of Machine Learning. Many people seem to think that means that we will continue to have increases in AI performance of equal multiplying effect on a regular basis. But the deep learning success was thirty years in the making, and no one was able to predict it, nor saw it coming. It was an isolated event.

That does not mean that there will not be more isolated events, where backwaters of AI research suddenly fuel a rapid step increase in performance of many AI applications. But there is no “law” that says how often they will happen. There is no physical process, like halving the mass of material as in Moore’s Law, fueling the process of AI innovation. This is an example of case (c) from above.

So when you see exponential arguments as justification for what will happen with AI remember that not all so called exponentials are really exponentials in the first place, and those that are can collapse suddenly when a physical limit is hit, or there is no more economic impact to continue them.

6. [C,D] Hollywood scenarios

The plot for many Hollywood science fiction movies is that the world is just as it is today, except for one new twist. Certainly that is true for movies about aliens arriving on Earth. Everything is going along as usual, but then one day the aliens unexpectedly show up.

That sort of single change to the world makes logical sense for aliens but what about for a new technology? In real life lots of new technologies are always happening at the same time, more or less.

Sometimes there is a rational, within Hollywood reality, explanation for why there is a singular disruption of the fabric of humanity’s technological universe. The Terminator movies, for instance, had the super technology come from the future via time travel, so there was no need to have a build up to the super robot played by Arnold Schwarzenegger.

But in other movies it can seem a little silly.

In Bicentennial Man, Richard Martin, played by Sam Neill, sits down to breakfast being waited upon by a walking talking humanoid robot, played by Robin Williams. He picks up a newspaper to read over breakfast. A newspaper! Printed on paper. Not a tablet computer, not a podcast coming from an Amazon Echo like device, not a direct neural connection to the Internet.

In Blade Runner, as Tim Harford recently pointed out, detective Rick Deckard, played by Harrison Ford, wants to contact the robot Rachael, played by Sean Young. In the plot Rachael is essentially indistinguishable from a human being. How does Deckard connect to her? With a pay phone. With coins that you feed in to it. A technology that many of the readers of this blog may never have seen. (By the way, in that same post, Harford remarks: “Forecasting the future of technology has always been an entertaining but fruitless game.” A sensible insight.)

So there are two examples of Hollywood movies where the writers, directors, and producers, imagine a humanoid robot, able to see, hear, converse, and act in the world as a human–pretty much an AGI (Artificial General Intelligence). Never mind all the marvelous materials and mechanisms involved. But those creative people lack the imagination, or will, to consider how else the world may have changed as that amazing package of technology has been developed.

It turns out that many AI researchers and AI pundits, especially those pessimists who indulge in predictions C and D, are similarly imagination challenged.

Apart from the time scale for many C and D predictions being wrong, they ignore the fact that if we are able to eventually build such smart devices the world will have changed significantly from where we are. We will not suddenly be surprised by the existence of such super intelligences. They will evolve technologically over time, and our world will be different, populated by many other intelligences, and we will have lots of experience already.

For instance, in the case of D (evil super intelligences who want to get rid of us) long before we see such machines arising there will be the somewhat less intelligent and belligerent machines. Before that there will be the really grumpy machines. Before that the quite annoying machines. And before them the arrogant unpleasant machines.

We will change our world along the way, adjusting both the environment for new technologies and the new technologies themselves. I am not saying there may not be challenges. I am saying that they will not be as suddenly unexpected as many people think. Free running imagination about shock situations is not helpful–it will never be right, or even close.

“Hollywood scenarios” are a great rhetorical device for arguments, but they usually do not have any connection to future reality.

7. [B,C,D] Speed of deployment

As the world has turned to software the deployment frequency of new versions has become very high in some industries. New features for platforms like Facebook are deployed almost hourly. For many new features, as long as they have passed integration testing, there is very little economic downside if a problem shows up in the field and the version needs to be pulled back–often I find that features I use on such platforms suddenly fail to work for an hour or so (this morning it was the pull down menu for Facebook notifications that was failing) and I assume these are deployment fails. For revenue producing components, like advertisement placement, more care is taken and changes may happen only on the scale of weeks.

This is a tempo that Silicon Valley and Web software developers have gotten used to. It works because the marginal cost of newly deploying code is very very close to zero.

Hardware on the other hand has significant marginal cost to deploy. We know that from our own lives. Many of the cars we are buying today, which are not self driving, and mostly are not software enabled, will likely still be on the road in the year 2040. This puts an inherent limit on how soon all our cars will be self driving. If we build a new home today, we can expect that it might be around for over 100 years. The building I live in was built in 1904 and it is not nearly the oldest building in my neighborhood.

Capital costs keep physical hardware around for a long time, even when there are high tech aspects to it, and even when it has an existential mission to play.

The US Air Force still flies the B-52H variant of the B-52 bomber. This version was introduced in 1961, making it 56 years old. The last one was built in 1963, a mere 54 years ago. Currently these planes are expected to keep flying until at least 2040, and perhaps longer–there is talk of extending their life out to 100 years. (cf. The Millennium Falcon!)

The US land-based Intercontinental Ballistic Missile (ICBM) force is all Minuteman-III variants, introduced in 1970. There are 450 of them. The launch system relies on eight inch floppy disk drives, and some of the digital communication for the launch procedure is carried out over analog wired phone lines.

I regularly see decades old equipment in factories around the world. I even see PCs running Windows 3.0 in factories–a software version released in 1990. The thinking is that “if it ain’t broke, don’t fix it”. Those PCs and their software have been running the same application doing the same task reliably for over two decades.

The principal control mechanism in factories, including brand new ones in the US, Europe, Japan, Korea, and China, is based on Programmable Logic Controllers, or PLCs. These were introduced in 1968 to replace electromechanical relays. The “coil” is still the principal abstraction unit used today, and the way PLCs are programmed is as though they were a network of 24 volt electromechanical relays. Still.  Some of the direct wires have been replaced by Ethernet cables. They emulate older networks (themselves a big step up) based on the RS485 eight bit serial character protocol, which themselves carry information emulating 24 volt DC current switching.  And the Ethernet cables are not part of an open network, but instead individual cables are run point to point, physically embodying the control flow in these brand new ancient automation controllers. When you want to change information flow, or control flow, in most factories around the world it takes weeks of consultants figuring out what is there, designing new reconfigurations, and then teams of tradespeople rewiring and reconfiguring hardware. One of the major manufacturers of this equipment recently told me that they aim for three software upgrades every twenty years.
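To give a feel for what programming “as though they were a network of relays” means, here is a toy Python sketch of a single ladder-logic style rung, the classic start/stop seal-in circuit. It is only an illustration under stated assumptions: the names scan_rung and run are hypothetical, and no real PLC runtime or vendor language is being reproduced here.

```python
def scan_rung(start_pressed: bool, stop_pressed: bool, coil_energized: bool) -> bool:
    """Evaluate one rung, reading it the way relay wiring would be read.

    The START contact is in parallel with the coil's own holding contact,
    and both are in series with a normally closed STOP contact.
    """
    return (start_pressed or coil_energized) and not stop_pressed


def run(button_sequence):
    """Repeatedly scan the rung, as a controller does, and record the coil state."""
    coil = False
    history = []
    for start_pressed, stop_pressed in button_sequence:
        coil = scan_rung(start_pressed, stop_pressed, coil)
        history.append(coil)
    return history


# (start, stop) button states on five successive scans.
presses = [(False, False), (True, False), (False, False), (False, True), (False, False)]
print(run(presses))   # [False, True, True, False, False]
```

The coil stays energized after the start button is released because it holds itself in through its own contact, exactly as a physical relay would. The abstraction is a circuit diagram rather than a program, which is part of why changing control flow looks more like rewiring than like editing software.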

In principle it could be done differently. In practice it is not. And I am not talking about just in technological backwaters. I just this minute looked on a jobs list and even today, this very day, Tesla is trying to hire full time PLC technicians at their Fremont factory. Electromagnetic relay emulation to automate the production of the most AI-software advanced automobile that exists.

A lot of AI researchers and pundits imagine that the world is already digital, and that simply introducing new AI systems will immediately trickle down to operational changes in the field, in the supply chain, on the factory floor, in the design of products.

Nothing could be further from the truth.

The resistance to reconfiguration in automation is shockingly, mind-blowingly high.

You can not give away a good idea in this field. It is really slow to change. The example of the AI system making paper clips deciding to co-opt all sorts of resources to manufacture more and more paper clips at the cost of other human needs is indeed a nutty fantasy. There will be people in the loop worrying about physical wiring for decades to come.

Almost all innovations in Robotics and AI take far, far, longer to get to be really widely deployed than people in the field and outside the field imagine. Self driving cars are an example. Suddenly everyone is aware of them and thinks they will soon be deployed. But it takes longer than imagined. It takes decades, not years. And if you think that is pessimistic you should be aware that it is already provably three decades from first on road demonstrations and we have no deployment. In 1987 Ernst Dickmanns and his team at the Bundeswehr University in Munich had their autonomous van drive at 90 kilometers per hour (56 mph) for 20 kilometers (12 miles) on a public freeway. In July 1995 the first no hands on steering wheel, no feet on pedals, minivan from CMU’s team led by Chuck Thorpe and Takeo Kanade drove coast to coast across the United States on public roads.  Google/Waymo has been working on self driving cars for eight years and there is still no path identified for large scale deployment. It could well be four or five or six decades from 1987 before we have real deployment of self driving cars.

New ideas in robotics and AI take a long long time to become real and deployed.


When you see pundits warn about the forthcoming wonders or terrors of robotics and Artificial Intelligence I recommend carefully evaluating their arguments against these seven pitfalls. In my experience one can always find two or three or four of these problems with their arguments.

Predicting the future is really hard, especially ahead of time.

Pinpoint: How GPS is Changing Technology, Culture, and Our Minds, Greg Milner, W. W. Norton, 2016.

2 “V2s for Ionosphere Research?”, A. C. Clarke, Wireless World, p. 45, February, 1945.

3 “Extra-Terrestrial Relays: Can Rocket Stations Give World-wide Radio Coverage”, Arthur C. Clarke, Wireless World, 305–308, October, 1945.

The Emotion Machine: Commonsense Thinking, Artificial Intelligence, and the Future of the Human Mind, Marvin Minsky, Simon and Schuster, 2006.

[FoR&AI] Machine Learning Explained

[An essay in my series on the Future of Robotics and Artificial Intelligence.]

Much of the recent enthusiasm about Artificial Intelligence is based on the spectacular recent successes of machine learning, itself often capitalized as Machine Learning, and often referred to as ML. It has become common in the technology world that the presence of ML in a company, in a development process, or in a product is viewed as a certification of technical superiority, something that will outstrip all competition.

Machine Learning is what has enabled the new assistants in our houses such as the Amazon Echo (Alexa) and Google Home by allowing them to reliably understand as we speak to them. Machine Learning is how Google chooses what advertisements to place, how it saves enormous amounts of electricity at its data centers, and how it labels images so that we can search for them with key words. Machine learning is how DeepMind (a Google company) was able to build a program called Alpha Go which beat the world Go champion. Machine Learning is how Amazon knows what recommendations to make to you whenever you are at its web site. Machine Learning is how PayPal detects fraudulent transactions. Machine Learning is how Facebook is able to translate between languages. And the list goes on!

While ML has started to have an impact on many aspects of our life, and will more and more so over the coming decades, some sobriety is not out of place. Machine Learning⁠1 is not magic. Neither AI programs, nor robots, wander around in the world ready to learn about whatever there is around them.

Every successful application of ML is hard won by researchers or engineers carefully analyzing the problem that is at hand. They select one or many different ML algorithms, and custom design how to connect them together and to the data. In some cases there is an extensive period of training on very large sets of data before the algorithm can be run on the problem that is being solved. In that case there may be months of work to do in collecting the right sort of data from which ML will actually learn. In other cases the learning algorithm will be integrated in to the application and will learn while doing the task that is desired–it might require some training wheels in the early stages, and they too must be designed. In any case there is always a big design project about how, when the ultimate system is operational, the data that comes in will be organized, processed and mapped before it reaches the ML component of the system.

When we are tending plants we pour water on them and perhaps give them some fertilizer and they grow. I think many people in the press, in management, and in the non-technical world have been dazzled by the success of Machine Learning, and have come to think of it a little like water or fertilizer for hard problems. They often mistakenly believe that a generic version will work on any and all problems. But while ML can sometimes have miraculous results it needs to be carefully customized after the DNA of the problem has been analyzed. And even then it might not be what is needed–to extend the metaphor, perhaps it is the climate that needs to be adjusted and no amount of fertilizer or ML will do the job.

How does Machine Learning work, and is it the same as when a child or adult learns something new? The examples above certainly seem to cover some of the same sort of territory, learning how to understand a human speaking, learning how to play a game, learning to name objects based on their appearance.

Machine Learning started with games

In the early 1940’s as war was being waged world wide there were only a handful of electronic digital computers in existence. They had been built, using the technology of vacuum tubes, to calculate gunnery tables and to decrypt coded military communications of the enemy.  Even then, however, people were starting to think about how these computers might be used to carry out intelligent activities, fifteen years before the term Artificial Intelligence was first floated by John McCarthy.

Alan Turing, who in 1936 had written the seminal paper that established the foundations of modern computation, and Donald Michie, a classics student from Oxford (later he would earn a doctorate in genetics), worked together at Bletchley Park, the famous UK code breaking establishment that Churchill credited with subtracting years from the war. Turing contributed to the design of the Colossus computer there, and through a key programming breakthrough that Michie made, the design of the second version of the Colossus was changed to accommodate his ideas even better. Meanwhile at the local pub the pair had a weekly chess game together and discussed how to program a computer to play chess, but they were only able to get as far as simulations with pen and paper.

In the United States right after the war, Arthur Samuel3, an expert on vacuum tubes, was the leader of an effort to build the ILLIAC computer at the University of Illinois at Urbana-Champaign. While the computer was still being built he planned out how to program it to play checkers (or draughts in British English), but left in 1949 to join IBM before the University computer was completed. At IBM he worked on both vacuum tubes and transistors to bring IBM’s first commercial general purpose digital computers to market. On the side he was able to implement a program that by 1952 could play checkers against a human opponent. This was one of the first non-arithmetical programs to run on general purpose digital computers, and has been called the first AI program to run in the United States.

Samuel continued to improve the program over time and in 1956 it was first demonstrated to the public. But Samuel wondered whether the improvements he was making to the program by hand could be made by the machine itself. In 1959 he published a paper titled “Some Studies in Machine Learning Using the Game of Checkers”⁠2, the first time the phrase “Machine Learning” was used–earlier there had been models of learning machines, but this was a more general concept.

The first sentence in his paper was: “The studies reported here have been concerned with programming of a digital computer to behave in a way which, if done by human beings or animals, would be described as involving the process of learning.” Right there is his justification for using the term learning, and while I would not quibble with it, I think that it may have had some unintended consequences which we will explore towards the end of this post.

What Samuel had realized, demonstrated, and exploited, was that digital computers were by 1959 fast enough to take over some of the fine tuning that a person might do for a program, as he had been doing since the first version of his program in 1952, and ultimately eliminate the need for much of that effort by human programmers by letting the computer do some Machine Learning on appropriate parts of the problem. This is exactly what has led, almost 60 years later, to the great influence that ML is now having on the world.

One of the two learning techniques Samuel described was something he called rote learning, which today would be labelled as the well known programming technique called memoization4, and it sped up the program. The other learning technique that he investigated involved adjusting numerical weights on how much the program should believe each of over thirty measures of how good or bad a particular board position was for the program or its human opponent. This is closer in spirit to techniques in modern ML. By improving this measure the program could get better and better at playing. By 1961 his program had beaten the Connecticut state checker champion. Another first for AI, and enabled by the first ML program.

Arthur Samuel built his AI and ML systems not as an academic researcher but as a scholar working on his own time apart from his day job. However he had an incredible advantage over all the AI academic researchers. Whereas access to computers was rare and precious for them, Samuel’s day job was as a key participant building the first mass produced digital computers, and each one needed to be run for many hours to catch early life defects before it could be shipped. He had a surfeit of free computer time. Just about no one else in the world had such a luxurious computational environment.

Sometimes the less lucky academics had to resort to desperate measures. And so it was for Donald Michie, colleague of Alan Turing back at Bletchley Park. By 1960 he was a Senior Lecturer in Surgical Science at the University of Edinburgh, but his real interests lay in Artificial Intelligence, though he always preferred the term Machine Intelligence.

In 1960 Surgical Science did not have much pull in getting access to a digital computer. So Donald Michie himself built a machine that could learn to play the game of tic-tac-toe (Noughts and Crosses in British English) from 304 matchboxes, small rectangular boxes which were the containers for matches, and which had an outer cover and a sliding inner box to hold the matches. He put a label on one end of each of these sliding boxes, and carefully filled them with precise numbers of colored beads. With the help of a human operator, mindlessly following some simple rules, he had a machine that could not only play tic-tac-toe but could learn to get better at it.

He called his machine MENACE, for Matchbox Educable Noughts And Crosses Engine, and published⁠5 a report on it in 1961. In 1962 Martin Gardner⁠6 reported on it in his regular Mathematical Games column in Scientific American, but illustrated it with a slightly simpler version to play hexapawn, three chess pawns against three chess pawns on a three by three chessboard. This was a way to explain Machine Learning and provide an experimental vehicle to the scientifically interested lay population, who certainly would not have had access to a digital computer at that time. Gardner suggested that people try building a matchbox computer to play simplified checkers with two pieces for each player on a four by four board. But he felt that even the simplest version of chess that he could come up with, on a five by five board would require too many matchboxes to be practical.

I first read about the matchbox computer in 1967 in a book⁠7 published the previous year which was written by a group of teachers at a British high school. They neither attributed the idea to Michie, nor the game they described it learning, hexapawn, to Gardner. As a barely teenager who had to hand build every machine for every experiment I wanted to do in AI, I must admit I thought that the matchbox computer was too simple a project and so did not pursue it. Now, however, I have come to realize that it is the perfect way of introducing how Machine Learning works, as everything is there to see. Even though MENACE is over fifty years old many of the problems that it faced are still relevant to machine learning algorithms today, and it shares many characteristics with almost all of today’s machine learning. Due to its simplicity it can be described in complete detail and no mathematics is needed to get a strong intuitive understanding of how it works.

Today people generally recognize three different classes of Machine Learning, supervised, unsupervised, and reinforcement learning, all three very actively researched, and all being used for real applications.  Donald Michie’s MENACE introduced the idea of ML reinforcement learning, and he explicitly refers to reinforcement as a key concept in how it works.

How a collection of matchboxes plays & learns

I am going to take the details I give here from a retelling of how MENACE worked from a more accessible 1963 paper⁠8. In a very often republished picture from that paper Donald Michie (or at least his hands) can be seen both playing tic-tac-toe against the machine, and operating the machine.

On the sheet of paper in front of him you can just see a large tic-tac-toe diagram. There are stacks of matchboxes toward the rear of the table, most likely glued together so that they stay in place. Some of the matchboxes have their drawers partially pulled out, and he is holding one of the drawers in his left hand. As we will see this image captures most of what is going on as MENACE learns to play better and better tic-tac-toe.

To make the description more clear I am going to introduce a second person; let’s call him Alan. Alan will operate the matchbox machine according to fixed rules, and will not have to make any decisions that are not determined completely by those rules. Donald, the human player will not touch the machine, but instead will write out the moves on a standard three by three grid, accepting the moves of MENACE as delivered to him by Alan, playing his own moves, and being the adjudicator of when he or MENACE has won by getting three in a row.

Michie had MENACE always play first, with a ‘O’ and so we will do that here also. Below is what a game might look like, starting with an empty board, MENACE playing first in the middle of the top row, Donald replying, with an ‘X’ in the top right corner as his first play to be made, and back and forth. Notice that I am using a period, or ‘.’, for a blank, so that I don’t have to draw the customary horizontal and vertical lines to divide the squares. I have put a little indicator under each board position where it is MENACE’s turn to play. At MENACE’s third move it blocks an immediate win where Donald would be able to complete the diagonal from the upper right to the lower left, but Donald replies with a move to the bottom right corner, threatening now two possible three in a rows on his next turn, and MENACE is able to block only one of them so MENACE loses to Donald.

...   .O.   .OX   .OX   .OX   .OX   .OX   OOX   OOX   
...   ...   ...   ...   .X.   .X.   .X.   .X.   .XX   
...   ...   ...   .O.   .O.   OO.   OOX   OOX   OOX 
 ^           ^           ^           ^

The way that MENACE plays is that there is a matchbox for every possible board configuration that could arise in the course of the game when it is MENACE’s turn. There is a box for every configuration where it is MENACE’s first, second, third, or fourth turn, but not for its fifth turn as it has no choice to make there as there will only be one empty square.

The configurations are drawn on a small label pasted to the front of each individual drawer. When it is MENACE’s turn, Alan finds the matchbox with a label that matches the current state of play on the piece of paper on which Donald is keeping track of the game. He opens the drawer which has some number of colored beads in it. Without looking he randomly picks one of the beads. Importantly he leaves the drawer open and after showing the bead to Donald he puts it on the table in front of the open drawer from which it came. The boxes are arranged left to right corresponding to fewer moves played and then more moves played so it is easy to keep track of which bead came from which drawer. There are nine colors of bead, and each color corresponds to one of the nine squares in tic-tac-toe. After seeing the bead Donald writes down an ‘O’ at the appropriate square on his piece of paper, and then writes his own ‘X’ as his next move, and the cycle repeats.

Although the actual colors of the beads do not really matter, here are the colors and their correspondence to the squares that were used in the original experiments (this time I have put in the horizontal vertical lines to divide the color words).

 white | lilac | silver
 black | gold  | green 
 amber |  red  | pink  

How many beads are there in each box?  For all the first move boxes, of which there is one, corresponding to the empty board, there are four beads for each possible move, so 36 in total.  For all possible second moves for MENACE there are only seven possibilities, and each of those empty squares has three beads.  For the third move there are 2 beads for each of the five possibilities, and for each fourth move there is one bead for each of the three possibilities.
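The initial filling of a drawer can be written down in a few lines. This is only a sketch and not Michie's procedure: it assumes a board is held as a nine character string of '.', 'O' and 'X' in reading order, with squares indexed 0 to 8, and the names BEADS_PER_MOVE and initial_box are made up for the illustration. It ignores symmetry, as the description so far does.

```python
# Beads placed in a fresh box for each possible move, keyed by which MENACE
# turn the box is for. The values 4, 3, 2, 1 come from the text above.
BEADS_PER_MOVE = {1: 4, 2: 3, 3: 2, 4: 1}

def initial_box(board):
    """Return {square_index: bead_count} for a 9-character board string.

    Squares are indexed 0..8 in reading order (an assumption of this sketch).
    """
    turn = board.count("O") + 1  # MENACE plays 'O' and always moves first
    empty = [i for i, ch in enumerate(board) if ch == "."]
    return {i: BEADS_PER_MOVE[turn] for i in empty}

first = initial_box(".........")   # empty board
second = initial_box(".OX......")  # MENACE top middle, Donald top right
print(sum(first.values()), sum(second.values()))
```

The dictionary plays the role of the drawer: its keys stand in for bead colors and its values for how many beads of that color are inside.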

To start out MENACE is playing each move completely at random. But suppose MENACE loses a game. Then Alan discards the beads below each open drawer and closes them all. This is negative reinforcement as MENACE lost, and so made moves it should not make in the future. In particular its fourth move, with only one bead, led to a loss that was at that point completely out of its control. So removing that bead means that MENACE will never play that bad move again. Its third move was perhaps a little suspect so that goes down to only one bead instead of two and it is less likely to try that again, but if it does it will not be tricked in exactly the same way again.

If MENACE draws the game then it gets positive reinforcement as each bead that was picked from each drawer is put back in, along with an extra bonus bead of the same color. If it won the game then it gets three additional same colored beads along with the one played at each turn. In this way MENACE tends to do the things that worked in the past, but if the opponent (in this case Donald) finds a new way to win against what used to work then MENACE will gradually adapt to that and avoid that losing line of play.
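Alan's two jobs, drawing a bead and settling up at the end of a game, might be sketched as follows. The data layout is an assumption: boxes maps a board string to a dictionary from square to bead count, and history lists the (board, square) pairs for the drawers left open during the game. The function names are hypothetical. The guard against emptying a box follows the deviation from Michie mentioned later in this essay for the simulation.

```python
import random

def pick_bead(box, rng=random):
    """Draw one bead at random; `box` maps square -> number of beads."""
    squares = list(box)
    weights = [box[s] for s in squares]
    # random.choices picks in proportion to the weights, like a blind draw.
    return rng.choices(squares, weights=weights, k=1)[0]

def reinforce(boxes, history, outcome):
    """Apply the end-of-game rule to every drawer that was left open.

    The drawn bead is outside the drawer during the game, so relative to the
    count before the draw: a loss is -1 (bead discarded), a draw is +1 (bead
    returned plus one bonus), a win is +3 (bead returned plus three more).
    """
    delta = {"loss": -1, "draw": +1, "win": +3}[outcome]
    for board, square in history:
        box = boxes[board]
        if outcome == "loss" and sum(box.values()) <= 1:
            continue  # assumption: never take the very last bead from a box
        box[square] += delta
```

A full game would call pick_bead once per MENACE turn, append the (board, square) pair to history, and call reinforce once when Donald announces the result.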

That is it. With Alan following this simple set of rules MENACE learns to play better and better over time. But there is one point of practicality.

Human structuring of the learning problem

As I described it above there are a lot of matchboxes needed for MENACE. There is 1 for MENACE’s first move, 72 for its second move, 756 for its third, and 1372 for its fourth move, for a total of 2201 matchboxes.
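These counts can be checked by brute force. The sketch below makes two assumptions that the text does not spell out: a box is needed for every arrangement with equally many 'O's and 'X's, and arrangements where somebody already has three in a row are skipped because the game is over. The names LINES, has_line and boxes_for_turn are made up for the illustration.

```python
from itertools import product

# The eight winning lines, as 0-based indices into a board in reading order.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def has_line(cells, player):
    """True if `player` ('O' or 'X') already has three in a row."""
    return any(all(cells[i] == player for i in line) for line in LINES)

def boxes_for_turn(turn):
    """Count boards on which MENACE ('O', moving first) makes move `turn`."""
    k = turn - 1  # both players have already placed k marks
    count = 0
    for cells in product(".OX", repeat=9):
        if cells.count("O") != k or cells.count("X") != k:
            continue
        if has_line(cells, "O") or has_line(cells, "X"):
            continue  # game already decided, no box needed
        count += 1
    return count

counts = [boxes_for_turn(turn) for turn in (1, 2, 3, 4)]
print(counts, sum(counts))
```

Under those assumptions the four counts agree with the figures quoted above. Without the three-in-a-row filter the fourth count would be larger, which suggests that finished games were indeed left out.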

But let’s look at another possible game, where every position is different from the previous game we looked at.

...   ...   ...   ...   ...   O..   O..   O.O   O.O   
...   ..O   ..O   O.O   OXO   OXO   OXO   OXO   OXO   
...   ...   ..X   ..X   ..X   ..X   X.X   X.X   XXX
 ^           ^           ^           ^

But wait, this is really the same game as before just rotated ninety degrees clockwise. It is going to take a lot longer to learn how to play tic-tac-toe if MENACE has to independently learn that its second move in this game is just as bad as its second move in the previous game. To make MENACE learn faster, and to reduce the number of matchboxes down to a more manageable level, Donald Michie took into account that up to eight different patterns of Noughts and Crosses might really be essentially the same.

Here is an example where an original board position is rotated clockwise by a quarter, half, and three-quarter turn, and where a reflection about the vertical axis, the horizontal axis, and the two diagonal axes all give different board positions. Nonetheless these eight positions are essentially the same as far as the rules of the game of tic-tac-toe are concerned.

OX.   ..O   ...   ...   .XO   ...   O..   ...
.O.   .OX   .O.   XO.   .O.   .O.   XO.   .OX
...   ...   .XO   O..   ...   OX.   ...   ..O

Some board positions may not result in so many different looking positions when rotated or reflected. For instance, a single play in the center of the board is not changed at all by these spatial transformations.
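The eight variants can be generated mechanically from one quarter turn and one mirror image. The following is a sketch that assumes boards are nine character strings in reading order; rotate, mirror and symmetries are names chosen for the illustration.

```python
def rotate(board):
    """Rotate a 9-character board a quarter turn clockwise."""
    # New square (r, c) takes its content from old square (2 - c, r).
    return "".join(board[3 * (2 - c) + r] for r in range(3) for c in range(3))

def mirror(board):
    """Reflect a board about the vertical axis (swap left and right columns)."""
    return "".join(board[3 * r + (2 - c)] for r in range(3) for c in range(3))

def symmetries(board):
    """Return the 8 rotations and reflections of a board, repeats included."""
    out, b = [], board
    for _ in range(4):
        out.append(b)           # a rotation
        out.append(mirror(b))   # the same rotation followed by a mirror
        b = rotate(b)
    return out

example = "OX..O...."            # the first board in the row of eight above
print(len(set(symmetries(example))))      # distinct variants of the example
print(len(set(symmetries("....O...."))))  # a lone play in the center
```

Mirroring each of the four rotations is enough to reach the reflections about the horizontal axis and the two diagonals as well, so no separate code is needed for those.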

In any case, by considering all the rotations and reflections of a board position as a single position, and therefore only assigning one matchbox to all of them combined, the requirements for MENACE are reduced to 1 matchbox, as before, for MENACE’s first move, 12 for its second, 108 for its third, and 183 for its fourth move, bringing the total⁠9 to 304. Furthermore by looking at the symmetries in what move is played there are often fewer essentially different moves than there are empty squares. For instance in both these cases MENACE is about to play an ‘O’:

...   .O.
...   .O.
...   X.X

In each case there are only three essentially different moves that can be played, so MENACE’s matchboxes for these moves need only start out with three different colored beads rather than nine or five respectively.
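Both reductions can be reproduced by picking one representative per group of symmetric positions. This is a sketch under assumptions, not Michie's bookkeeping: boards are nine character strings, finished games get no box, and the representative is simply the alphabetically smallest variant. All function names are hypothetical.

```python
from itertools import product

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def build_perms():
    """Return the 8 symmetries as index lists, where new[j] = old[perm[j]]."""
    rot = [3 * (2 - c) + r for r in range(3) for c in range(3)]  # quarter turn
    mir = [3 * r + (2 - c) for r in range(3) for c in range(3)]  # left-right flip
    perms, p = [], list(range(9))
    for _ in range(4):
        perms.append(p)
        perms.append([p[i] for i in mir])
        p = [p[i] for i in rot]
    return perms

PERMS = build_perms()

def apply(board, perm):
    return "".join(board[i] for i in perm)

def canonical(board):
    """One representative (the smallest string) for all symmetric variants."""
    return min(apply(board, perm) for perm in PERMS)

def has_line(board, player):
    return any(all(board[i] == player for i in line) for line in LINES)

def count_boxes(turn):
    """Count essentially different boards awaiting MENACE's move `turn`."""
    k, seen = turn - 1, set()
    for cells in product(".OX", repeat=9):
        if cells.count("O") == k and cells.count("X") == k:
            if not has_line(cells, "O") and not has_line(cells, "X"):
                seen.add(canonical(cells))
    return len(seen)

def essentially_different_moves(board):
    """One representative empty square (0-based) per class of equivalent moves."""
    keep = [perm for perm in PERMS if apply(board, perm) == board]
    return sorted({min(perm[sq] for perm in keep)
                   for sq, ch in enumerate(board) if ch == "."})

print([count_boxes(turn) for turn in (1, 2, 3, 4)])
print(essentially_different_moves("........."))
print(essentially_different_moves(".O..O.X.X"))
```

Two moves count as the same here when some symmetry that leaves the board unchanged carries one square onto the other, which is how the nine and five empty squares in the two examples each collapse to three.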

By taking into account these symmetries the MENACE machine can be much smaller, and the speed of learning will be much faster, as lessons from one symmetric position will be automatically learned at another. The cost is that Alan is going to have to do quite a bit more work to operate MENACE. Now he will have to look at the position that Donald shows on the piece of paper where Donald is playing and not just look for an identical label on the front of a matchbox, but look for one that might be a rotation or reflection of the state of the game. Then, when Alan randomly selects a bead which indicates a particular move on the label on the matchbox from which it came, he will have to figure out which square that corresponds to on Donald’s sheet of paper through the inverse rotation or reflection that he used to select the matchbox. Fortunately this extra work is all quite deterministic and Alan is still following a strict set of rules with no room for judgement to creep in. We will come back to this a little later and mechanize Alan’s tasks through a few sheets of very simple instructions that will do all this extra work.

How well do matchboxes learn?

MENACE is learning what move it should choose in one of 304 essentially different board configurations for its first four moves in a game of tic-tac-toe. Since Alan randomly picks out one bead from the matchbox corresponding to one of those configurations it is making a random move from a small number of moves but the probability of a particular move goes up when there are more beads of a particular color from positive reinforcements from previous games, and the probability of a move which leads to a loss goes down relative to the other possible moves as its beads are removed.

Look back at the two examples just above for a first move and a third move for MENACE. The empty board starts out with 12 beads, four for each of three different colors representing placing an ‘O’ on a corner, in the middle of an edge, or in the middle of the board. The board waiting for MENACE’s third ‘O’ to be played starts out with just six beads, two for each of three different colors, corresponding to playing the blocking move between the ‘X’s, one of the corners, or one of the other two middle edges. We will refer to the number of beads of the same color in a single box as a parameter. By mapping all symmetric situations to a common matchbox and restricting the different moves to essentially different moves, there are 1087 parameters that MENACE adjusts over time through the removal or addition of beads. When MENACE starts off it has a total of 1720 beads representing those 1087 parameters in 304 different matchboxes.
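The parameter and bead totals can be recounted with a small trick: mark the square about to be played with a '*' and count marked boards up to rotation and reflection. This is a sketch under the same assumptions as before (string boards, no boxes for finished games, starting beads of 4, 3, 2 and 1 per move by turn), with made-up names throughout.

```python
from itertools import product

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
BEADS_PER_MOVE = {1: 4, 2: 3, 3: 2, 4: 1}  # starting beads per move, by turn

def variants(board):
    """All 8 rotations and reflections of a 9-character string."""
    out, b = [], board
    for _ in range(4):
        out.append(b)
        out.append("".join(b[3 * r + (2 - c)] for r in range(3) for c in range(3)))
        b = "".join(b[3 * (2 - c) + r] for r in range(3) for c in range(3))
    return out

def has_line(board, player):
    return any(all(board[i] == player for i in line) for line in LINES)

def parameters_per_turn():
    """Count essentially different (position, move) pairs for each turn."""
    seen = {1: set(), 2: set(), 3: set(), 4: set()}
    for cells in product(".OX", repeat=9):
        board = "".join(cells)
        k = board.count("O")
        if k > 3 or board.count("X") != k:
            continue
        if has_line(board, "O") or has_line(board, "X"):
            continue
        for sq in range(9):
            if board[sq] == ".":
                # '*' marks the candidate move; symmetric pairs share a minimum.
                marked = board[:sq] + "*" + board[sq + 1:]
                seen[k + 1].add(min(variants(marked)))
    return {turn: len(s) for turn, s in seen.items()}

params = parameters_per_turn()
total_params = sum(params.values())
total_beads = sum(BEADS_PER_MOVE[t] * n for t, n in params.items())
print(params, total_params, total_beads)
```

Each distinct marked board corresponds to one bead color in one matchbox, so the size of each set is the number of parameters for that turn.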

When MENACE starts out it is playing uniformly randomly over all essentially different moves. If two uniformly random players play against each other, the first to play wins 59% of the time, draws 13%, and loses 28%. This shows the inherent bias in the game for the first player, which makes learning a little easier for MENACE.
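The size of that first player advantage can be computed exactly rather than estimated from trial games. The sketch below makes a slightly different assumption from the one just described: each player chooses uniformly over empty squares, not over essentially different moves, so its percentages can differ a little from those quoted. The names winner and outcome_probs are made up for the illustration.

```python
from fractions import Fraction
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'O' or 'X' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def outcome_probs(board, to_move):
    """Return (P(O wins), P(draw), P(X wins)) when both sides play at random."""
    w = winner(board)
    if w == "O":
        return (Fraction(1), Fraction(0), Fraction(0))
    if w == "X":
        return (Fraction(0), Fraction(0), Fraction(1))
    empty = [i for i, ch in enumerate(board) if ch == "."]
    if not empty:
        return (Fraction(0), Fraction(1), Fraction(0))
    nxt = "X" if to_move == "O" else "O"
    totals = [Fraction(0)] * 3
    for i in empty:
        child = board[:i] + to_move + board[i + 1:]
        for j, p in enumerate(outcome_probs(child, nxt)):
            totals[j] += p / len(empty)  # each empty square is equally likely
    return tuple(totals)

o_wins, draws, x_wins = outcome_probs(".........", "O")
print(float(o_wins), float(draws), float(x_wins))
```

Using exact fractions avoids rounding error, and the cache means each reachable position is evaluated only once.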

In his original paper Michie reported that MENACE became quite a good player after only 220 games and was winning most games, but neither I nor others who report simulating MENACE (you can find many with a web search) saw MENACE doing that well at all. In fact, since a perfect player never loses at tic-tac-toe, two perfect players always draw. It seems likely that Michie was carefully training MENACE with deliberately chosen games, and then playing against it in a fairly random way. He alludes to this when he later converted to a computer simulation of MENACE and mentions that playing against random moves results in much slower learning than playing against a deliberate policy.

To explore this I made a computer simulation of MENACE and three different simulated strategies of Donald playing against MENACE. I let learning proceed for 4,000 games, and did this multiple times against each of the three simulated players. Since there is randomness in picking a bead from a matchbox, the random number generator used by the computer to simulate this ensures that different trials of 4,000 games will lead to different actual games being played. Every so often I turned off the learning and tested how well⁠10 MENACE was currently playing against the three simulated players, including the two it was currently not learning from.

The three simulated players were as follows. Player A played completely randomly at all times. Player B played optimally and was unbeatable. Player C was optimal except that 25% of its moves were random instead of the optimal play. These are the three different versions of Donald that I used in my simulations.
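One way the three opponents might be written is shown below. This is a guess at a reasonable construction and not the simulation code behind the table: the optimal player is assumed to be plain minimax with ties broken at random, the opponent plays 'X' second, and player_a, player_b and player_c are hypothetical names.

```python
import random
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, to_move):
    """Minimax value from X's side: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w:
        return 1 if w == "X" else -1
    if "." not in board:
        return 0
    nxt = "O" if to_move == "X" else "X"
    vals = [value(board[:i] + to_move + board[i + 1:], nxt)
            for i, ch in enumerate(board) if ch == "."]
    return max(vals) if to_move == "X" else min(vals)

def empty_squares(board):
    return [i for i, ch in enumerate(board) if ch == "."]

def player_a(board, rng=random):
    """Player A: a uniformly random empty square."""
    return rng.choice(empty_squares(board))

def player_b(board, rng=random):
    """Player B: a minimax-optimal reply for X, ties broken at random."""
    scored = [(value(board[:i] + "X" + board[i + 1:], "O"), i)
              for i in empty_squares(board)]
    best = max(score for score, _ in scored)
    return rng.choice([i for score, i in scored if score == best])

def player_c(board, rng=random, error_rate=0.25):
    """Player C: optimal, except that a fraction of moves are random."""
    if rng.random() < error_rate:
        return player_a(board, rng)
    return player_b(board, rng)
```

Each function takes the board after MENACE's move and returns the 0-based square for the reply. Note that a random move by Player C can still happen to be the optimal one, so its effective error rate is somewhat below the nominal figure.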

In the table below the first row shows the performance of MENACE before it has learned at all, against each of the three simulated players. Each triple of numbers is the percentage of wins, draws, and losses (these are rounded percentages from a very large number of test games so don’t necessarily add to exactly 100%). As expected it never wins against Player B which plays optimally and can not be beaten.  Player C which makes mistakes 25% of the time can be beaten, but only about a quarter of the time.  In each row below that, MENACE was trained from scratch playing 4,000 games against a different one of these players. In each column we show, with MENACE stopped from further learning and adjusting its parameters, how it typically did against each of the three players once trained. We say “typically” as there is some variation in the resulting percentages between different trials with the same conditions, but only by a few points, and not in all cases.

\   compete |             |             |             |
 \  against | Player A    | Player B    | Player C    |
  \------\  |             |             |             |
trained   \ |             |             |             |
against    \|             |             |             |
no training |  59/ 13/ 28 |   0/ 24/ 76 |  27/ 19/ 53 |
Player A    |  86/  8/  6 |   0/ 28/ 72 |  50/ 20/ 30 |
Player B    |  71/ 15/ 14 |   0/100/  0 |  38/ 48/ 14 |
Player C    |  90/  8/  2 |   0/ 99/  1 |  56/ 42/  2 |

The first thing to notice is that how MENACE plays really does depend on what player it was trained against. When it is trained against Player B, which always plays optimally, it very quickly, usually after only about 200 games, learns to always play to a draw. But with that training (look in the same row) it is really not very good at playing against Player C which plays optimally with a 25% error rate. That is probably because in its training it never got to win at all against Player B, so it has not learned any winning moves to use against Player C.

When MENACE is trained against Player A (look in the row labelled Player A), which plays completely randomly, it does learn to play against it quite well, and it also does reasonably well against Player C, probably because it has accidentally won enough times during training to have boosted some winning moves when they are available. It does dismally against the optimal Player B however. This particular box in the table has the highest variance of all in the table. Sometimes after 4,000 games it is doing less than half as well against Player B as when it started out learning.

When MENACE trains against Player C it does the best overall. It sees enough losses early on that after about 400 games it is starting to get good at avoiding losses, though it is still slowly, slowly getting better at that aspect of its game even after 4,000 games of learning. It usually doesn’t get quite as good as Player B, and very occasionally still loses to it, but it is really good at winning when there are opportunities for it to do so. We can see that against Player C it has learned to take advantage of its mistakes to drive home a win.

While not as good as a person, MENACE does get better against different types of players. It does however end up tuning its game to the type of player it is playing against.

There is also something surprising about the number of beads. MENACE starts off with 1,720 beads, but depending on which of Players A, B, or C, it is learning from it has from 2,300 to 3,200 beads after just 200 games, and always there is at least one parameter with over one hundred beads representing it by that time.  By 4,000 games it may have more than 35,000 beads representing just 1,087 parameters, with as many as 6,500 beads for one of the parameters.  This seems unnecessary, and is perhaps the impact of rewarding all the moves with three beads on a win. However when I changed my simulation to never add more beads to a parameter that already had at least one hundred beads, a practical limit perhaps for a MENACE machine built from physical matchboxes, it tended to slow down learning in most cases represented in the table above, and even had small drops in typical levels of play even after 4,000 games of experience when playing against Players A and C.

Note that besides eliminating ever taking the very last bead away from a matchbox after a loss, this is the only place where I deviated from Michie’s description of his MENACE. Since he chose his plays carefully to instruct MENACE, and since he only played 220 games by hand, he perhaps did not come across the phenomenon of large numbers of beads.

Mechanizing MENACE a little more

In preparation for comparing how MENACE learns to how a person learns I want to make the role of Alan, the human operator of MENACE, a little clearer. In the description derived from Michie’s original paper, Michie himself played the role of both Donald and Alan. In my description above I talked about Alan matching the image of the paper on which Donald was playing to the labels on the matchboxes, possibly having to rotate or reflect the game board. And after randomly selecting a bead from that box, Alan would need to figure out which square that applied to on Donald’s piece of paper.

That sounds a little fuzzy, and perhaps requiring some reasoning on Alan’s part, so now we’ll make explicit a very rule driven approach that we could enforce, to ensure that Alan’s role is completely rote, with no judgement at all required.

We will make the communication between Donald and Alan very simple. Donald will hand Alan a string of nine characters drawn from ‘.’, ‘O’, and ‘X’, representing the board position after his play, and Alan will hand back a string where one of the periods has been replaced by an ‘O’. To enable this we will number the nine positions on the tic-tac-toe board as follows.

123
456
789

The string representing the board is just the contents of these squares in numerical order. So, for instance, if Donald has just played his ‘X’ to make the following board position, then he should give Alan the string printed to the right.

...
.OX       ....OX...
...

We will get rid of the labels, images of tic-tac-toe board positions from the front of the match boxes, and replace them with the numbers 1 through 304, so that each matchbox has a unique numerical label.  We will label the matchbox corresponding to the empty board with 1, as that will be how Alan starts a game, by drawing a bead from there, and he will look up what square that color means in a “Transform #1” on a sheet of paper with eight different transforms listed. Here they are:

Transform #1:
  white =1  lilac =2  silver=3  
  black =4  gold  =5  green =6  
  amber =7  red   =8  pink  =9  

Transform #2:
  white =3  lilac =6  silver=9  
  black =2  gold  =5  green =8  
  amber =1  red   =4  pink  =7  

Transform #3:
  white =9  lilac =8  silver=7  
  black =6  gold  =5  green =4  
  amber =3  red   =2  pink  =1  

Transform #4:
  white =7  lilac =4  silver=1  
  black =8  gold  =5  green =2  
  amber =9  red   =6  pink  =3  

Transform #5:
  white =3  lilac =2  silver=1  
  black =6  gold  =5  green =4  
  amber =9  red   =8  pink  =7  

Transform #6:
  white =7  lilac =8  silver=9  
  black =4  gold  =5  green =6  
  amber =1  red   =2  pink  =3  

Transform #7:
  white =1  lilac =4  silver=7  
  black =2  gold  =5  green =8  
  amber =3  red   =6  pink  =9  

Transform #8:
  white =9  lilac =6  silver=3  
  black =8  gold  =5  green =2  
  amber =7  red   =4  pink  =1  

The eight transforms correspond to four rotations (of zero, one, two and three quarters clockwise), and four reflections.
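One way to check that claim is to rebuild each table from a geometric rule and compare. The Python sketch below does that. It reads each table as a map from the matchbox's own square numbering to Donald's, which is my interpretation, and the helper names are hypothetical.

```python
COLORS = ["white", "lilac", "silver", "black", "gold", "green", "amber", "red", "pink"]

# The eight printed transforms: entry i is the square assigned to COLORS[i].
TRANSFORMS = {
    1: (1, 2, 3, 4, 5, 6, 7, 8, 9),
    2: (3, 6, 9, 2, 5, 8, 1, 4, 7),
    3: (9, 8, 7, 6, 5, 4, 3, 2, 1),
    4: (7, 4, 1, 8, 5, 2, 9, 6, 3),
    5: (3, 2, 1, 6, 5, 4, 9, 8, 7),
    6: (7, 8, 9, 4, 5, 6, 1, 2, 3),
    7: (1, 4, 7, 2, 5, 8, 3, 6, 9),
    8: (9, 6, 3, 8, 5, 2, 7, 4, 1),
}

def from_geometry(move):
    """Build a transform from a function on (row, col) pairs, both in 0..2."""
    out = []
    for i in range(9):
        r, c = move(*divmod(i, 3))
        out.append(3 * r + c + 1)
    return tuple(out)

GEOMETRY = {
    1: lambda r, c: (r, c),          # no rotation
    2: lambda r, c: (c, 2 - r),      # one quarter turn clockwise
    3: lambda r, c: (2 - r, 2 - c),  # two quarter turns
    4: lambda r, c: (2 - c, r),      # three quarter turns clockwise
    5: lambda r, c: (r, 2 - c),      # mirror left to right
    6: lambda r, c: (2 - r, c),      # mirror top to bottom
    7: lambda r, c: (c, r),          # reflect in the main diagonal
    8: lambda r, c: (2 - c, 2 - r),  # reflect in the other diagonal
}

for k in TRANSFORMS:
    assert from_geometry(GEOMETRY[k]) == TRANSFORMS[k]
print("all eight tables match")
```

Under that reading, Transforms #1 through #4 are the four rotations and #5 through #8 are the four reflections.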

The remaining 303 matchboxes correspond to the essentially different board positions for MENACE’s second, third, and fourth moves. Although there are 72 different board positions for MENACE’s second move, there are only twelve that are essentially distinct, and here they all are, numbered 2 through 13 as the next twelve matchboxes after the one for the first move.

#1    #2    #3    #4    #5    #6    #7    #8    #9    #10   #11   #12   #13
 |     |     |     |     |     |     |     |     |     |     |     |     |
...   .O.   .O.   .O.   .O.   X..   .X.   OX.   O.X   O..   O..   O..   XO.   
...   X..   .X.   ...   ...   .O.   .O.   ...   ...   .X.   ..X   ...   ...   
...   ...   ...   X..   .X.   ...   ...   ...   ...   ...   ...   ..X   ...   

When Alan is given a string by Donald (there is only one possible string for the first move, the empty board, but there are 72 possibilities for MENACE’s second move, 756 for the third, and 1372 for the fourth) Alan just mindlessly looks it up in a big table that is printed on a few sheets of paper. Each line has a string representing a board position, a box number, and a transform number. For instance, for the second move for MENACE we talked about above, with string ....OX..., Alan would find it, simply by matching character for character, in the following part of the table (for the first and second moves by MENACE):

.........  Box: #  1, Transform #1

.......OX  Box: # 13, Transform #3
.......XO  Box: #  8, Transform #3
......O.X  Box: #  9, Transform #6
......OX.  Box: #  8, Transform #6
......X.O  Box: #  9, Transform #3
......XO.  Box: # 13, Transform #6
.....O..X  Box: # 13, Transform #8
.....O.X.  Box: #  2, Transform #8
.....OX..  Box: #  4, Transform #8
.....X..O  Box: #  8, Transform #8
.....X.O.  Box: #  2, Transform #3
.....XO..  Box: # 11, Transform #6
....O...X  Box: #  6, Transform #3
....O..X.  Box: #  7, Transform #3
....O.X..  Box: #  6, Transform #4
....OX...  Box: #  7, Transform #2      <== this one
....X...O  Box: # 10, Transform #3
....X..O.  Box: #  3, Transform #3
....X.O..  Box: # 10, Transform #4
....XO...  Box: #  3, Transform #2
...O....X  Box: #  4, Transform #4
...O...X.  Box: #  2, Transform #4
...O..X..  Box: # 13, Transform #4
...O.X...  Box: #  5, Transform #4
...OX....  Box: #  3, Transform #4
...X....O  Box: # 11, Transform #3
...X...O.  Box: #  2, Transform #6
...X..O..  Box: #  8, Transform #4
...X.O...  Box: #  5, Transform #2
...XO....  Box: #  7, Transform #4
..O.....X  Box: #  9, Transform #2
..O....X.  Box: # 11, Transform #2
..O...X..  Box: # 12, Transform #2
..O..X...  Box: #  8, Transform #2
..O.X....  Box: # 10, Transform #2
..OX.....  Box: # 11, Transform #5
..X.....O  Box: #  9, Transform #8
..X....O.  Box: #  4, Transform #3
..X...O..  Box: # 12, Transform #4
..X..O...  Box: # 13, Transform #2
..X.O....  Box: #  6, Transform #2
..XO.....  Box: #  4, Transform #7
.O......X  Box: #  4, Transform #5
.O.....X.  Box: #  5, Transform #1
.O....X..  Box: #  4, Transform #1
.O...X...  Box: #  2, Transform #5
.O..X....  Box: #  3, Transform #1
.O.X.....  Box: #  2, Transform #1
.OX......  Box: # 13, Transform #5
.X......O  Box: # 11, Transform #8
.X.....O.  Box: #  5, Transform #3
.X....O..  Box: # 11, Transform #4
.X...O...  Box: #  2, Transform #2
.X..O....  Box: #  7, Transform #1
.X.O.....  Box: #  2, Transform #7
.XO......  Box: #  8, Transform #5
O.......X  Box: # 12, Transform #1
O......X.  Box: # 11, Transform #7
O.....X..  Box: #  9, Transform #7
O....X...  Box: # 11, Transform #1
O...X....  Box: # 10, Transform #1
O..X.....  Box: #  8, Transform #7
O.X......  Box: #  9, Transform #1
OX.......  Box: #  8, Transform #1
X.......O  Box: # 12, Transform #3
X......O.  Box: #  4, Transform #6
X.....O..  Box: #  9, Transform #4
X....O...  Box: #  4, Transform #2
X...O....  Box: #  6, Transform #1
X..O.....  Box: # 13, Transform #7
X.O......  Box: #  9, Transform #5
XO.......  Box: # 13, Transform #1
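A table like this does not have to be typed in by hand. The Python sketch below regenerates entries from the thirteen canonical positions and the eight transforms. The function names are hypothetical, and the tie-breaking rule (first matching box, then lowest-numbered matching transform) is my assumption: it agrees with the rows I spot-checked, but I have not confirmed that it is how the printed table was produced.

```python
# Canonical positions for boxes 1-13, rows joined top to bottom.
BOXES = {
    1: ".........", 2: ".O.X.....", 3: ".O..X....", 4: ".O....X..",
    5: ".O.....X.", 6: "X...O....", 7: ".X..O....", 8: "OX.......",
    9: "O.X......", 10: "O...X....", 11: "O....X...", 12: "O.......X",
    13: "XO.......",
}

# Entry i of each transform is the square on Donald's board that
# corresponds to square i + 1 of the matchbox's canonical position.
TRANSFORMS = {
    1: (1, 2, 3, 4, 5, 6, 7, 8, 9),
    2: (3, 6, 9, 2, 5, 8, 1, 4, 7),
    3: (9, 8, 7, 6, 5, 4, 3, 2, 1),
    4: (7, 4, 1, 8, 5, 2, 9, 6, 3),
    5: (3, 2, 1, 6, 5, 4, 9, 8, 7),
    6: (7, 8, 9, 4, 5, 6, 1, 2, 3),
    7: (1, 4, 7, 2, 5, 8, 3, 6, 9),
    8: (9, 6, 3, 8, 5, 2, 7, 4, 1),
}

def place(transform, canonical):
    """Lay a canonical position onto Donald's board using one transform."""
    donald = ["."] * 9
    for i, mark in enumerate(canonical):
        donald[transform[i] - 1] = mark
    return "".join(donald)

def lookup(board):
    """Return (box number, transform number) for the first match, or None."""
    for box, canonical in BOXES.items():
        for k, transform in TRANSFORMS.items():
            if place(transform, canonical) == board:
                return box, k
    return None

print(lookup("....OX..."))  # (7, 2)
```

This work happens once, ahead of time. Alan never runs anything like `lookup`; he only reads its printed output.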

This tells Alan that the move given to him by Donald is to be played by matchbox #7, and then he is to use Transform #2, which we saw above, to interpret the color of the drawn bead as to which square is meant. We can see what position box #7 corresponds to above, though Alan does not know that. He simply reaches into box #7 and pulls out a random bead. As it happens, in my simulation of MENACE, where it never tries two moves that are essentially the same, the only beads in #7 are colored white, black, amber, and red, corresponding to essentially different moves down the left column and in the middle of the bottom row, using the original MENACE bead color interpretations. Under Transform #2 we see that those colors correspond to squares 3, 2, 1, and 4, respectively, which are across the top row and the left middle square for the way Donald is playing. So whichever one of those colors is removed from the box, Alan simply goes to that position in the string that was given to him by Donald, and changes the blank to an ‘O’. So suppose that Alan pulls out a black bead. In that case he changes the second element to an ‘O’, and gives it back to Donald who then interprets the string to mean the following new board position:

                .O.
.O..OX...       .OX
                ...

The only remaining thing is the reinforcement signal. Donald, the human player, is the one who is responsible for deciding when the game is over and at that point needs to communicate one of just three options to Alan; L, for loss, meaning forfeit all the beads out of boxes, D, for draw, meaning put the beads back with an extra one of the same color for each, or W, for win, meaning put them back with three extra ones of each.

Summary of What Alan Must Do

With these modifications we have made the job of Alan both incredibly simple and incredibly regimented.

    1. When Donald gives Alan a string of nine characters Alan looks it up in a table, noting the matchbox number and transform number.
      1. He opens the numbered matchbox, randomly picks a bead from it and leaves it on the table in front of the open matchbox.
      2. He looks up the color of the bead in the numbered transform, to get a number between one and nine.
      3. He replaces that numbered character in the string with an ‘O’, and hands the string back to Donald.
    2. When Donald gives Alan a sign for one of L, D, or W, Alan does the following:
      1. For L he removes the beads on the table and closes the open matchboxes.
      2. For D he adds one more bead of the same color to each one on the table, and puts the pairs in the matchboxes behind them, and closes the matchboxes.
      3. For W he adds three more beads of the same color to each one on the table, and puts the sets of four in the matchboxes behind them, and closes the matchboxes.

That is all there is. Alan looks up things on a few sheets of paper, acts on matchboxes, and changes a character in a string.
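To underline how little is being asked of Alan, here is a minimal Python sketch of the two numbered procedures above. The class and attribute names are my own, and the single table row, transform, and box contents are just the ones from the worked example. It follows the summary literally, so it does not include my earlier modification of never removing the very last bead from a box after a loss.

```python
import random

# Transform #2 from the sheet of transforms, as color -> square.
TRANSFORM_2 = {"white": 3, "lilac": 6, "silver": 9, "black": 2, "gold": 5,
               "green": 8, "amber": 1, "red": 4, "pink": 7}

class Alan:
    def __init__(self, table, transforms, boxes, rng=None):
        self.table = table            # board string -> (box number, transform number)
        self.transforms = transforms  # transform number -> {color: square}
        self.boxes = boxes            # box number -> list of bead colors
        self.on_table = []            # (box number, color) drawn so far this game
        self.rng = rng or random.Random()

    def move(self, board):
        """Procedure 1: look up, draw a bead, translate it, write an 'O'."""
        box, k = self.table[board]
        beads = self.boxes[box]       # an empty box would raise an error here
        color = beads.pop(self.rng.randrange(len(beads)))
        self.on_table.append((box, color))
        n = self.transforms[k][color]
        return board[:n - 1] + "O" + board[n:]

    def signal(self, result):
        """Procedure 2: L returns nothing, D returns 2 beads, W returns 4."""
        returned = {"L": 0, "D": 2, "W": 4}[result]
        for box, color in self.on_table:
            self.boxes[box].extend([color] * returned)
        self.on_table = []

alan = Alan(
    table={"....OX...": (7, 2)},
    transforms={2: TRANSFORM_2},
    boxes={7: ["white", "black", "amber", "red"]},
    rng=random.Random(0),
)
reply = alan.move("....OX...")
alan.signal("W")
```

After the win signal, box #7 holds seven beads: the three that were never drawn, plus the drawn bead and three more of its color.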

One could say that Alan is a Turing machine.

The thing that learns how to play tic-tac-toe is a combination of Alan following these completely strict rules, and the contents of the matchboxes, the colored beads, whose number varies over time.

Is this how a person would learn?

For anyone who has played tic-tac-toe the most striking thing about the way that MENACE learns is that it has no concept of “three in a row”. When we teach a child how to play the game that is the most important thing to explain, showing how rows, columns, and diagonals can all give rise to three O’s or X’s in a row. We explain to the child that getting three in a row is the goal of the game. So the first rule for playing tic-tac-toe is to complete three in a row on your move if that option is available. MENACE does not know this at all.

The next thing, or second rule, we might show our tic-tac-toe pupil is that assuming they have no winning move, the next best thing is to block the opponent if they have two of three in a row already with an empty spot to play and complete it. This does not guarantee an eventual win, as there are seventeen essentially different situations where the ‘X’ player may have two three-in-a-row’s ready to play, and the ‘O’ player can only block one of them.  Here are two examples of that.

.O.   XOO
OOX   O..
.XX   .XX

However just these two rules are a marked improvement over random play. If we play tic-tac-toe with the preference of rule 1 if it is applicable, then rule 2 if that is applicable, and if neither is applicable then make a random move, we actually get a pretty good player. Here is the same sort of table as above, with an identical first row showing how well random untrained play succeeds against Players A, B, and C, then in the second row how much the addition of rules 1 and 2 improves a random player.

            | Player A    | Player B    | Player C    |
random play |  59/ 13/ 28 |   0/ 24/ 76 |  27/ 19/ 53 |
+ rules 1&2 |  86/ 10/  4 |   0/ 82/ 18 |  51/ 37/ 13 |
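The two rules are short enough to write down directly. The Python sketch below is one way to do it, not the code behind the numbers in the table. It uses zero-based indices into the nine-character string, and the function names are hypothetical.

```python
import random

# The eight ways of getting three in a row, as indices into the board string.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def completing_square(board, mark):
    """Return an empty square that gives `mark` three in a row, else None."""
    for line in LINES:
        cells = [board[i] for i in line]
        if cells.count(mark) == 2 and cells.count(".") == 1:
            return line[cells.index(".")]
    return None

def rule_player(board, me="O", other="X", rng=random):
    """Rule 1: win if possible. Rule 2: otherwise block. Otherwise play at random."""
    for mark in (me, other):
        square = completing_square(board, mark)
        if square is not None:
            break
    else:
        square = rng.choice([i for i, ch in enumerate(board) if ch == "."])
    return board[:square] + me + board[square + 1:]

print(rule_player("OO..X.X.."))  # OOO.X.X..  (rule 1 beats rule 2)
print(rule_player(".O.OOX.XX"))  # .O.OOXOXX  (blocks one threat, the other remains)
```

Note that the list `LINES` is exactly the geometric knowledge that MENACE lacks: one explicit description of three-in-a-row that applies to every board position.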

Just the addition of those two rules gets to a level of play against a random player (Player A) that MENACE only gets to after about 4,000 games learning from Player A. Against Player B, the optimal player, it does not get as good as it does when it is trained for 200 games by Player B, but it is better against either of the other two players than when it has been trained for 4,000 games against Player B. And against Player C, the player with 25% error rate from optimal play, it is almost as good as it ever gets even being trained by Player C.

Clearly these rules are very powerful. But they are also rather easy for a child to learn as they don’t require thinking ahead beyond the very next move. All the information is right there in the board layout, and there is no need to think ahead about what the opponent might do next once the current move is made. What is it that the child has that MENACE does not?

One answer might be geometrical representations. MENACE does not have any way to represent “three in a row” as a concept that can be applied to different situations. Each matchbox is a kingdom unto itself about one particular essentially unique board configuration. If one particular matchbox learns, through reinforcement, that it is good to place a third ‘O’ to make a diagonal, there is no way to transfer that insight, were MENACE able to have it, to other essentially different situations where there is also a diagonal that can be filled in. And certainly not to a situation about completing a horizontal or vertical row.

As we saw, Michie did incorporate some geometric “knowledge” into MENACE by mapping all rotations and reflections of the tic-tac-toe board to a common matchbox. But the machine itself has no insight into this–it was all done ahead of time by Michie (whose preparation was extended slightly by me so that Alan could be very explicitly machine-like in his tasks) by producing the dictionary of positions that mapped to matchboxes number 1 through 304, and which of the eight inversion lookup tables that mapped from color of bead to numbered square on the board should be used. That manual design process handled some mappings between different aspects of three in a row but not all. In general a researcher or engineer using Machine Learning to solve a problem does something very similar, in reducing the space of inputs. The art of it is to reduce the input space so that learning can happen more quickly, but not over reduce the space so that subtle differences in situations are obliterated by the pre-processing. By mapping from all the general board positions to precisely those that are essentially different, Donald Michie, the Machine Learning engineer in this case, managed to satisfy both those goals.

A child knows something about geometry in a way that MENACE does not. A child can talk about things being in a row independently of learning tic-tac-toe. A child has learned that in-a-row-ness is independent of orientation of the line that defines the row. By a certain age a child comes to know that the left-to-rightness of some ordering depends on the point of view of the observer, so they are able to see that two in a row with an empty third one is an important generalization that applies equally to the horizontal and vertical rows around the edges, thinking about them in both directions, and also applies to the horizontal and vertical rows that go through the middle square, and to the two diagonals that also go through that square. The child may or may not generalize that to two at each end of a row with the middle to be filled in–perhaps that might be a different concept for young children. But the rowness of things is something they have a lot of experience with, and are able to apply to tic-tac-toe. In computer science we would talk about rowness being a first class object for a child–something that can be manipulated by other programs, or in a child by many cognitive systems. In MENACE rowness is hidden in the pre-analysis of the problem that Donald Michie did in order to map tic-tac-toe to a collection of numbered matchboxes with beads in them.

The learning that MENACE does somehow feels different to the learning that a human does when playing tic-tac-toe. That is not to say that all learning that a human does is necessarily completely different from what MENACE does. Perhaps things that humans learn in an unconscious fashion (e.g., how to adjust their stance to stay balanced–negative and positive reinforcement signals based on whether they hit the ground or not), where we have no way to access what is happening inside us, nor an ability to talk about it, are more like MENACE learning.

Not all learning is necessarily the same sort of learning.

Is this how a person would play?

A more fundamental question, perhaps, is whether MENACE plays tic-tac-toe like a person does, and I think the answer is a clear no. The MENACE system consisting of the matchboxes and Alan strictly following rules only fills in part of the role of a normal player. The rest of what is usually a social interaction between two people is all taken on by Donald.

There is no representation inside MENACE (where we include in the definition of MENACE the sheets of paper that Alan consults, and the rules that we have instructed Alan to strictly follow) of tic-tac-toe being a game that is played. MENACE does not know what a game is, or even that it is playing a game. All that happens inside MENACE is that one at a time, either three or four times sequentially, one of its matchbox drawers is opened and a bead is randomly removed, and then either the beads are taken away, or they are put back in the boxes from where they came with either one or three additional beads of the same color, and the boxes are closed.

All the gameness of tic-tac-toe is handled by the human Donald. It is he who initiates the game by handing Alan a string of nine periods. It is he who manages the consistency of subsequent turns by annotating his hand drawn tic-tac-toe board with the moves. It is he who decides when the game has been won, drawn, or lost, and communicates to Alan the reinforcement signal that is to be applied to the open matchboxes. It is he, Donald, who decides whether and when to initiate a new game.

MENACE does not know, nor does it learn, what a game is. The designer of MENACE abstracted that away from the situation, so that MENACE could be a pure learning machine.

That today is both the strength and weakness of modern Machine Learning. Really smart people, researchers or engineers, come up with an abstraction for the problem in the real world that they want to apply ML to. Those same smart people figure out how data should flow to and fro between the learning system and the world to which it is to be applied. They set up machinery which times and gates that information flow from the application. They set up a signaling system on when the learning system is supposed to respond to an input (in MENACE’s case a string of nine characters drawn from ‘.’, ‘X’, and ‘O’) and produce an output. And those same people set up a system which tells the learning system when to learn, to adjust the numbers inside it, in response to a reinforcement signal, or in some other forms of ML a very different, but still similarly abstracted signal–we will see that in the next chapter.

tic-tac-toe machine resonates with modern ML

Although MENACE is well over fifty years old, it shares many properties with modern Machine Learning systems, though of course it is much smaller and simpler than the systems that people use today–one must expect something from 50+ years of hard intellectual work. But the essential problems that MENACE and today’s ML algorithms have are very instructive as they can give some intuition about some of the limits we might expect for modern AI and ML.

Parameters. After the design work was done on MENACE, all that could change during learning was the value of the 1087 parameters, the numbers of various colored beads in various matchboxes. Those numbers impact the probability of randomly picking a bead of a particular color from a matchbox. If the number of red beads goes down and the number of amber beads goes up over time in a single matchbox, then it is more likely that Alan will pick an amber bead at random. In this way MENACE has learned that for the particular situation on a tic-tac-toe board corresponding to that matchbox the square corresponding to the amber bead is a better square to play than the one corresponding to a red bead. All MENACE is doing is juggling these numbers up and down. It does not learn any new structure to the problem while it learns. The structure was designed by a researcher or engineer, in this case Donald Michie.

This is completely consistent with most modern Machine Learning systems. The researchers or engineers structure the system and all that can change during learning is a fixed quantity of numbers or parameters, pushing them up or down, but not changing the structure of the system at all. 1087 may seem like a lot of parameters for playing tic-tac-toe, but really that is the price of eliminating the geometry of the board from the MENACE machine.  In modern applications of Machine Learning there are often many millions of parameters. Sometimes they take on integer values as do the number of beads in MENACE, but more usually these days the parameters are represented as floating point numbers in computers, things that can take on values like 5.37, -201.65, 894.78253, etc.

Notice how simply changing a big bunch of numbers and not changing the underlying abstraction that connected the external problem (playing tic-tac-toe) to a geometry-free internal representation (the numbers of different colored beads in matchboxes) is very different from how we have become familiar with using computers. When we manage our mail box folders, creating special folders for particular categories (e.g., “upcoming trips”, “kids”, etc.) and then sub folders (e.g., “Chicago May 5”, “soccer”, etc.) and then filing emails in those subfolders, we are changing the structure of our representation of the important things in our life which are covered by emails. Machine Learning, as in the case of MENACE, usually has an engineering phase where the problem is converted to a large number of parameters, and after that there is no dynamic updating of structures.

In contrast, I think all our intuitions tell us that our own learning often has our internal mental models tweak and sometimes even radically change how we are categorizing aspects of the skill or capability that we are learning.

Large Parameters. My computer simulations of MENACE soon had the numbers of beads of a particular color in particular boxes ranging from none or one up to many thousand. This intuitively seems strange but is not uncommon in today’s Machine Learning systems. Sometimes there will be parameters that are between zero and one, where just a change of one ten thousandth in value will have drastic effects on the capabilities that the system is learning, while at the same time there will be parameters that are up in the millions. There is nothing wrong with this, but it does feel a little different from our own introspections of how we might weigh things relatively in our own minds.

Many Examples Needed. If we taught tic-tac-toe to an adult we would think that just a few examples would let them get the hang of the game. MENACE on the other hand, even when carefully tutored by Donald Michie, took a couple of hundred examples to get moderately good. My simulation is still making relatively big progress after three thousand games and is often still slowly getting even a little better at four thousand games. In modern Machine Learning systems there may be tens of millions of different examples that are needed to train a particular system to get to adequate performance. But the system does not just get exposed to each of these training examples once. Often each of those millions of examples needs to be shown to the system hundreds of thousands of times. Just being exposed to the examples once leaves way too much bias from the most recently processed examples. Instead by having them re-exposed over and over, after the ML system has already seen all of them many times, the recentness bias gets washed away into more equal influence from all the examples.

Training examples are really important. Learning to play against just one of Player A, B, or C always led to very different performance levels against each of these different players with learning turned off in my computer simulation of MENACE. This too is a huge issue for modern Machine Learning systems. With millions of examples needed there is often a scale issue of how to collect enough training data. In the last couple of years companies have sprung up which specialize in generating training data sets and can be hired for specific projects. But getting a good data set which does not have unexpected biases in it can often be a problem.

When MENACE is trained against Player B, the optimal player that can not be beaten, MENACE does not learn how to win, as it never has an experience of winning so it never receives reinforcement for winning. It does learn how to not be defeated, and so playing against Players A or C its win rate does go up a little as they each sometimes screw up, but MENACE’s winning rate does not go up as much as it does when it trains against those two players. In our example with MENACE my simulations worked best overall when trained against Player C, as that had a mixture of examples that  were tough to win against (when it got through a game without making a random bad choice), and because of its occasional random choices examples which more fully spanned all of the possible playing styles MENACE might meet. In the parlance of Machine Learning we would say that when MENACE was trained only against Player B, the optimal player, it overfit its playing style to the relatively small number of games that it saw (no wins, and few losses) so was not capable when playing against more diverse players.

In general, the more complex the problem for which Machine Learning is to be used, the more training data will be needed. In general, training data sets are a big resource consideration in building a Machine Learning system to solve a problem.

Credit assignment. The particular form of learning that MENACE both first introduced and demonstrates is reinforcement learning, where the system is given feedback only once it has completed a task. If many actions were taken in a row, as is the case with MENACE, either three or four moves of its own before it gets any feedback, then there is the issue of how far back the feedback should be used.

In the original MENACE all three forms of reinforcement, for a win, a draw, or a loss, were equally applied to all the moves. Certainly it makes sense to apply the reinforcement to the last move, as it directly did lead to that win, or a loss. In the case of a draw however, it could in some circumstances not be the best move as perhaps choosing another move would have given a direct win. As we move backward, credit for whether earlier moves were best, worst, or indifferent is a little less certain. In the case of Player A or C as the opponent it may have simply made a bad move in reply to a bad move by MENACE early on, so giving the earlier move three beads for a win may be encouraging something that Player  B, the optimal player, will be able to crush. A natural modification would be three beads for the last move in a winning game, two beads for the next to last, and one bead for the third to last move.  Of course people have tried all these variations and under different circumstances much more complex schemes would be the best. We will discuss this more, a little later.
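The difference between the two schemes for a win is easy to state in code. In this Python sketch both function names are mine, and giving zero beads to a move that is more than two back from the end is my assumption, since the modification above only specifies the last three moves.

```python
def uniform_credit(num_moves, beads):
    """Original MENACE: every move in the game gets the same reward."""
    return [beads] * num_moves

def tapered_credit(num_moves, last=3):
    """`last` beads for the final move and one fewer for each earlier move,
    never dropping below zero. Rewards are listed first move first."""
    return [max(last - (num_moves - 1 - i), 0) for i in range(num_moves)]

print(uniform_credit(3, 3))  # [3, 3, 3]
print(tapered_credit(3))     # [1, 2, 3]
print(tapered_credit(4))     # [0, 1, 2, 3]
```

Either list would then be applied box by box to the beads left on the table, in the order in which they were drawn.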

In modern reinforcement learning systems a big part of the design is how credit is assigned. In fact now it is often the case that the credit assignment itself is also something that is learned by a parallel learning algorithm, trying to optimize the policy based on the particulars of the environment in which the reinforcement learner finds itself.

Getting front end processing right. In MENACE Michie developed what might be called “front end processing” to map all board positions to only those that were essentially distinct. This simultaneously drastically cut down the number of parameters that had to be learned, let the learning system automatically transfer learning across different cases in the full world (i.e., across symmetries in the tic-tac-toe board), and introduced zero entanglements that could confuse the learning process.

Up until a few years ago Machine Learning systems applied to understanding human speech usually had as their front end programs that had been written by people to determine the fundamental units of speech that were in the sound being listened to.

Those fundamental units of speech are called phonemes, and they can be very different for different human languages. Different units of speech lead to different words being heard. For instance, the four English words pad, pat, bad, and bat all have three phonemes with the same middle phoneme corresponding to the vowel sound (in English the same letters may be used to represent different phonemes, so the word paper, while having the same letter ‘a’ for the second phoneme (of four in this word), has a very different sound associated with it, and is therefore a different phoneme). The four different phonemes p, b, d, and t lead to four different words being heard as p and b are varied at the start, and d and t are varied at the end.

In earlier speech understanding systems the specially built front end phoneme detector programs relied on some numerical estimators of certain frequency characteristics of the sounds and produced phoneme labels as their output that were fed into the Machine Learning system to recognize the speech. It turned out that those detectors were limiting the performance of the speech understanding systems no matter how well they learned. Relatively recently those programs were replaced by other machine learning systems, that didn’t necessarily output conventional phoneme representations, and this led to a remarkable overall increase in reliability of speech understanding systems. This is why today, but only in the last few years, many people now have devices in their homes, such as Amazon’s Echo or Google’s Home, that they can easily interact with via voice.

Getting the front end processing right for an ML problem is a major design exercise. Getting it wrong can lead to much larger learning systems than necessary, making learning slower, perhaps impossibly slower, or it can make the learning problem impossible if it destroys vital information from the real domain. Unfortunately, since in general it is not known whether a particular problem will be amenable to a particular Machine Learning technique, it is often hard to debug where things have gone wrong when an ML system does not perform well.  Perhaps inherently the technique being used will not be able to learn what is desired, or perhaps the front end processing is getting in the way of success.

Geometry is hard. Just as MENACE knew no geometry and so tackled tic-tac-toe in a fundamentally different way than how a human would approach it, most Machine Learning systems are not very good at preserving geometry nor therefore are they good at exploiting it. Geometry does not play a role in speech processing, but for many other sorts of tasks there is some inherent value to the geometry of the input data. The engineers or researchers building the front end processing for the system need to find a way to accommodate the poor geometric performance of the ML system being used.

The issue of geometry and the limitations of representing it in a set of numeric parameters arranged in some fixed system, as was the case in MENACE, has long been recognized. It was the major negative result of the book Perceptrons⁠11 written by Marvin Minsky and Seymour Papert in 1969. While people have attributed all sorts of motivations to the authors I think that their insights on this front, formally proved in the limited cases they consider, still ring true today.

Fixed structure stymies generalization. MENACE’s fixed structure meant that anything it implicitly learned about filling or blocking three in a row on a diagonal could not be transferred to filling or blocking a vertical or horizontal row. The fixed structures spanning thousands or millions of variable numerical parameters of most Machine Learning systems likewise stymie generalization. We will see some surprising consequences of this when we look at some of the most recent exciting results in Machine Learning in a later blog post–programs that learn to play a video game but then fail completely and revert to zero capability on exactly the same game when the colors of pixels are mapped to different colorations, or if each individual pixel is replaced by a square of four identical pixels.

Furthermore, any sort of meta-learning is usually impossible too. Since MENACE doesn’t know that it is playing a game, and since there is nothing besides the play and reward mechanism that can access the matchboxes, there is no way that observations of the flow of a game can be ruminated upon. A child might learn a valuable meta-lesson in playing tic-tac-toe, that when you have an opportunity to win take it immediately as it might go away if the other player gets to take a turn. That would correspond to learning rule 1 in our comparison between MENACE and how a person might learn.

Machine Learning engineers and researchers must, at this point in the history of AI, form an optimized and fixed description of the problem and let ML adjust parameters. All possibility of reflective learning is removed from these very impressive learning systems. This greatly restricts how much power of intelligence an AI system with current day Machine Learning systems can tease out of its learning exploits. Humans are generally much much smarter than this.

A Few Developments in Reinforcement Learning

The description of reinforcement learning comes from 1961, and is the first use of the term reinforcement learning when applied to a machine process that I can find. There have been some developments in reinforcement learning since 1961, but only in details as this section shows. The fundamental ideas were all there in Donald Michie’s matchboxes.

Reinforcement learning is still an active field of research and application today. It is commonly used in robotics applications, and for playing games. It was part of the system that beat the world Go champion in 2016, but we will come back to that in a little bit.

After Michie’s first paper, reinforcement learning was formalized over the next twenty years. Without resorting to the mathematical formulation, today reinforcement learning is used where there are a finite number of states that the world can be in.  For MENACE those states correspond to the 304 matchboxes of essentially different tic-tac-toe board positions where it is O’s turn to play. For each state there are a number of possible actions (the different colored beads in each matchbox corresponding to the possible moves). The policy that the system currently has is the probability of each action in each state, which for MENACE corresponds to the number of beads of a particular color in a matchbox divided by the total number of beads in that same matchbox. Reinforcement learning tries to learn a good policy.
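The relationship between beads and policy described above can be sketched in a few lines of Python. This is only an illustration: the matchbox contents, the move names, and the rule that a move always keeps at least one bead are assumptions made for the example, not details of Michie's machine.

```python
import random

# Hypothetical matchbox for one board position: move name -> number of beads.
matchbox = {"cell_0": 4, "cell_4": 8, "cell_8": 4}

def policy(box):
    """Probability of each action: beads of that color / total beads in the box."""
    total = sum(box.values())
    return {move: count / total for move, count in box.items()}

def draw_move(box, rng=random):
    """Draw one bead at random, which samples an action from the policy."""
    moves = list(box)
    weights = [box[m] for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]

def reinforce(box, move, delta):
    """Add beads (delta > 0) or remove beads (delta < 0) for the move played.

    Keeping at least one bead per move is an assumption of this sketch.
    """
    box[move] = max(1, box[move] + delta)
    return box[move]
```

Calling `policy(matchbox)` gives the centre move a probability of one half and each corner a quarter. Learning a better policy then amounts to nothing more than calling `reinforce` on the boxes that were used in a game.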

The structure of states and actions for MENACE, and indeed for reinforcement learning for many games, is a special case, in that the system can never return to a state once it has left it. That would not be the case for chess or Go where it is possible to get back to exactly the same board position that has already been seen.

For many systems of reinforcement learning real numbers are used rather than integers as in MENACE. In some cases they are probabilities, and for a given state they must sum to exactly one. For many large reinforcement learning problems, rather than represent the policy explicitly for each state, it is represented as a function approximated by some other sort of learning system such as a neural network, or a deep learning network. The steps in the reinforcement process are the same, but rather than changing values in a big table of states and actions, the 1087 parameters of MENACE, a learning update is given to another learning system.
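To make the contrast between a big table and a function concrete, here is a rough Python sketch. Both classes, the feature vectors, and the crude "nudge the preference" update are assumptions invented for illustration. The update is not a real policy-gradient rule, and the linear function stands in for a neural network.

```python
import math

def softmax(scores):
    """Turn real-valued preferences into probabilities that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class TablePolicy:
    """One independent row of preferences per state, as in a table of matchboxes."""
    def __init__(self, n_states, n_actions):
        self.prefs = [[0.0] * n_actions for _ in range(n_states)]

    def probs(self, state):
        return softmax(self.prefs[state])

    def update(self, state, action, step):
        self.prefs[state][action] += step

class LinearPolicy:
    """One weight vector per action, shared by every state via its features."""
    def __init__(self, n_features, n_actions):
        self.w = [[0.0] * n_features for _ in range(n_actions)]

    def probs(self, features):
        scores = [sum(wi * f for wi, f in zip(w, features)) for w in self.w]
        return softmax(scores)

    def update(self, features, action, step):
        for i, f in enumerate(features):
            self.w[action][i] += step * f
```

In the table version an update to one state leaves every other state untouched. In the function version the same update shifts the probabilities for any other state whose feature vector overlaps with the one that was updated.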

MENACE, and many other game playing systems, including chess and Go this time, are a special case of reinforcement learning in another way. The learning system can see the state of the world exactly. In many robotics problems where reinforcement learning is used that is not the case. There the robot may have sensors which can not distinguish all the nuances in the world (e.g., for a flying robot it may not know the exact current wind speed and direction ten meters away from it in the direction of travel). For these sorts of reinforcement learning problems the world is referred to as partially observable.

In MENACE any rewards, be they positive or negative, were spread equally over all moves leading up to the win, loss, or draw. But in reality it could be that an early move was good, and just a dumb move at the end was bad. To handle this problem Christopher Watkins came up with a method that became known as Q-learning for his Ph.D. thesis12, titled “Learning from Delayed Rewards”, at Cambridge University in 1989. The Q function that he learns is an estimate of what the ultimate reward will be by taking a particular action in a particular state. Three years later he and Peter Dayan published a paper that proved that under some reasonable assumptions his algorithm always eventually converged on the correct answer as to how the reward should be distributed.
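A minimal Python sketch of the tabular form of the Q-learning update may help here. The four-state chain world, the learning rate `alpha`, the discount `gamma`, and all the names are assumptions chosen for illustration; this is not code from Watkins' thesis.

```python
import random
from collections import defaultdict

ACTIONS = ("left", "right")
GOAL = 3  # hypothetical terminal state at the right end of a 4-state chain

def step(state, action):
    """Toy deterministic world: reward 1.0 only on the move that reaches GOAL."""
    next_state = state + 1 if action == "right" else max(0, state - 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def q_update(Q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Move Q[state, action] toward reward + gamma * best estimate at next_state."""
    if next_state == GOAL:
        best_next = 0.0  # nothing more to earn once the episode has ended
    else:
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # every estimate starts at 0.0
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = rng.choice(ACTIONS)  # purely random exploration
            next_state, reward = step(state, action)
            q_update(Q, state, action, reward, next_state)
            state = next_state
    return Q

Q = train()
```

Only the final move is ever rewarded, yet the estimates for moving right from states 0, 1, and 2 settle near 0.81, 0.9, and 1.0. The delayed reward is passed backward one step at a time, discounted by `gamma`, rather than being spread equally over every move in the game.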

This method, which is at its heart the reinforcement learning of Donald Michie’s MENACE from 1961, is what is powering some of today’s headlines. The London company DeepMind, which was bought by Google, uses reinforcement learning (as they explain here) with the Q-learning implemented in something called deep learning (another popular headline topic). This is how they built their Alpha Go program which recently beat both the human Korean and Chinese Go champions.

As a side note, when I visited DeepMind in June this year I asked how well their program would have done if on the day of the tournament the board size had been changed from 19 by 19 to 29 by 29. I estimated that the human champions would have been able to adapt and still play well. My DeepMind hosts laughed and said that even changing to an 18 by 18 board would have completely wiped out their program…this is rather consistent with what we have observed about MENACE. Alpha Go plays Go in a way that is very different from how humans apparently play Go.

Overloaded words

In English, at least, ships do not swim. Ships cruise or sail, whereas fish and humans swim. However in English planes fly, as do birds. By extension people often fly when they go on vacation or on a business trip. Birds move from one place to another by traveling through the air. These days, so too can people.

But really people do not fly at all like birds fly. Our technology lets us “fly” a quarter of the way around the world, non-stop, in less than a day. Birds who can fly that far non-stop (and there are some) certainly take a lot longer than a day to do that.

If humans could fly like birds we would think nothing of chatting to a friend on the street on a sunny day, and as they walk away, flying up into a nearby tree, landing on a branch, and being completely out of the sun. If I could fly like a bird then when on my morning run I would not have to wait for a bridge to get across the Charles River to get back home, but could choose to just fly across it at any point in its meander.

We do fly. We do not fly like birds. Human flying is very different in scope, in method, and in auxiliary equipment beyond our own bodies.

Arthur Samuel introduced the term Machine Learning for two sorts of things his computer program was doing as it got better and better over time at and through the experience of playing checkers. A person who got better and better over time at and through the experience of playing checkers would certainly be said to be learning to be a better player. With only eight to ten hours experience Samuel’s program (he was so early at this he did not give a name to his program–that innovation had to await the early 1960’s) got better at playing checkers than Samuel himself. Thus, in the first sentence of his paper, again, does Samuel justify the term learning: “The studies reported here have been concerned with programming of a digital computer to behave in a way which, if done by human beings or animals, would be described as involving the process of learning.”

What I have tried to do in this post is to show how Machine Learning works, and to provide an argument that it works in a way that feels very different to how human learning of similar tasks proceeds. Thus, taking an understanding of what it is like for a human to learn something and applying that knowledge to an AI system that is doing Machine Learning may lead to very incorrect conclusions about the capabilities of that AI system.

Minsky13 labels as suitcase words terms like consciousness, experience, and thinking. These are words that have so many different meanings that people can understand different things by them. I think that learning is also a suitcase word. Even for humans it surely refers to many different sorts of phenomena. Learning to ride a bicycle is a very different experience from learning ancient Latin. And there seems to be very little in common in the experience of learning algebra and learning to play tennis. So, too, is Machine Learning very different from any sort of the myriad of different learning capabilities of a person.

The word “learn” can lead to misleading conclusions.


I am going to indulge myself a little by pontificating here. Be warned.

In 1991 I wrote a long (I have been pontificating since I was relatively young) paper14  on the history of Artificial Intelligence and how it had been shaped by certain key ideas. In the final paragraphs of that paper I lamented that there was a bandwagon effect in Artificial Intelligence Research, and said that “[m]any lines of research have become goals of pursuit in their own right, with little recall of the reasons for pursuing those lines”.

I think we are in that same position today in regard to Machine Learning. The papers in conferences fall into two categories. One is mathematical results showing that yet another slight variation of a technique is optimal under some carefully constrained definition of optimality. A second type of paper takes a well known learning algorithm, and some new problem area, designs the mapping from the problem to a data representation (e.g., the mapping from tic-tac-toe board positions to the numbers 1 through 304 for the three hundred and four matchboxes that comprise MENACE), and shows the results of how well that problem area can be learned.

This would all be admirable if our Machine Learning ecosystem covered even a tiny portion of the capabilities of human learning. It does not. And, I see no alternate evidence of admirability.

Instead I see a bandwagon today, where vast numbers of new recruits to AI/ML have jumped aboard after recent successes of Machine Learning, and are running with particular versions of it as fast as they can. They have neither any understanding of how their tiny little narrow technical field fits into a bigger picture of intelligent systems, nor do they care. They think that the current little hype niche is all that matters, are blind to its limitations, and are uninterested in deeper questions.

I recommend reading Christopher Watkins’ Ph.D. thesis12 for an example of something that is admirable. It revitalized reinforcement learning by introducing Q-learning, and that is still having impact today, thirty years later. But more importantly most of the thesis is not about the particular algorithm or proofs about how well it works under some newly defined metric. Instead, most of the thesis is an illuminating discussion about animal and human learning, and attempting to get lessons from there about how to design a new learning algorithm. And then he does it.

1 Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, MIT Press, 2012.

2 “Some Studies in Machine Learning Using the Game of Checkers”, Arthur L. Samuel, IBM Journal of Research and Development, 3(3):210–229, 1959.

3 When I first joined the Stanford Artificial Intelligence Laboratory (SAIL) in 1977 I got to meet Arthur Samuel. Born in 1901 he was certainly the oldest person in the lab at that time. After retiring from IBM in 1966 he had come to SAIL as a researcher. Arthur was a delightful and generous person, and besides his research he worked on systems programming in assembler language for the Lab’s time shared computer. He was the principal author of the full screen editor (a rarity at that time) that we had, called Edit TV, or ET at the command level. He was still programming at age 85, and last logged in to the computer system when he was 88, a few months before he passed away.

4 Perhaps I am wrong about exactly what Samuel was referring to. In his Ph.D. thesis12, which I talk about later in the post, Christopher Watkins allows that perhaps Samuel means what I interpret him to mean, though perhaps there is a smarter version of it that was implemented that involved recomputing the saved computations when more of the game tree had been searched. Watkins was unable to tell exactly from reading the paper.

5 “Trial and Error”, Donald Michie, Penguin Science Survey, vol 2, 1961.

6 “How to build a game-learning machine and then teach it to play, and to win”, Martin Gardner, Scientific American, 206(3):138–153, March 1962.

7 We Built Our Own Computers, A. B. Bolt, J. C. Harcourt, J. Hunter, C. T. S. Mayes, A. P. Milne, R. H Surcombe, and D. A. Hobbs, Cambridge University Press, 1966.

8 “Experiments on the Mechanization of Game-Learning Part I. Characterization of the Mode and its parameters”, Donald Michie, Computer Journal, 6(3):232–236, 1963.

9 Michie reports only 287 essentially different situations so his version of MENACE had only 287 matchboxes (though in a 1986 paper he refers to there being 288 matchboxes). Many people have since built copies of MENACE both physically and in computer simulations, and all the ones that I have found on the web report 304 matchboxes, virtual or otherwise. This matches how I counted them in my simulation of MENACE as a program.

10 In all the test results I give I froze the learning and ran 100,000 games–I found that about that number were necessary to give 2 digits, i.e., a percentage, that was stable for different such trials. Note that in total there are 301,248 different legal ways to play out a game of tic-tac-toe. If we consider only essentially different situations by eliminating rotational and reflective symmetries then that number drops to 31,698.

11 Perceptrons: An Introduction to Computational Geometry, Marvin Minsky and Seymour Papert, MIT Press, 1969.

12 Learning from Delayed Rewards, Christopher J. C. H. Watkins, Ph.D. thesis, King’s College, Cambridge University, May 1989.

13 The Emotion Machine: Commonsense Thinking, Artificial Intelligence, and the Future of the Human Mind, Marvin Minsky, Simon and Schuster, 2006.

14 “Intelligence Without Reason”, Proceedings of 12th International Joint Conference on Artificial Intelligence, Sydney, Australia, August 1991, 569–595.