ANJA KASPERSEN: My name is Anja Kaspersen. Today we are joined by Professor Gary Marcus. He is the author of six books, including Rebooting AI: Building Artificial Intelligence We Can Trust with Ernest Davis, The Algebraic Mind: Integrating Connectionism and Cognitive Science, Kluge: The Haphazard Evolution of the Human Mind, The Birth of the Mind: How a Tiny Number of Genes Creates the Complexities of Human Thought, and the New York Times bestseller Guitar Zero: The Science of Becoming Musical at Any Age. He is the editor of The Future of the Brain: Essays by the World's Leading Neuroscientists and The Norton Psychology Reader. He is also a world-renowned expert in artificial intelligence (AI) and cognitive learning. Gary earned a reputation as someone who always speaks truth to power to promote greater AI fluency.
Gary, your background is fascinating. As an interdisciplinary cognitive scientist, you have been trying to bring together knowledge from different scientific domains to better understand how the mind works, how it develops and evolves, and, realizing how often progress in science and policy is hampered by the lack of a shared vernacular, you decided to become a writer and later an entrepreneur. Can you share with us some reflections on your professional and personal journey?
GARY MARCUS: I'll start by saying that my first true love is really to understand the mind—how does the human mind get to do the things it does?—many of which are impressive, but not all of which are, so I wrote a book once called Kluge, which is about ways in which the mind is not very impressive.
But in the last decade I have focused primarily on artificial intelligence, which was actually my interest when I was a little kid. So I returned and went full-circle to my first set of interests, but now looking at artificial intelligence through the study of the mind. And I haven't really liked what I have seen. First I was on the sidelines writing about it, and then eventually I decided to actually enter the field of AI and start building companies. I sold the first one. I am still trying to understand how we can make AI that is at least as, let's say, trustworthy as human beings, which is not perfect but a lot better than the state of AI that we are currently in.
I kind of have twin goals. One is that I really want to see AI work. It has been a passion of mine my whole life. The other is that I want to see it be good and be a force for good, and I don't really see that now.
It has been disappointing to me how AI has developed. On the technical side I don't think that it's adequate, and I think the inadequacies are causing a lot of social problems right now. In fact, I don't know if the net contribution from AI so far has been positive. I think there is potential there. I think there are ways in which we could do AI much better than we have, both on a technical side and in terms of policy and so forth. Right now I am in the field but pretty critical of it.
ANJA KASPERSEN: When AI started hitting the news headlines again around 2012, you stated that you felt like everybody "was doing things all wrong and that the culture of the AI research community is not a conducive environment for healthy scientific discourse." Can you explain what you mean by this and why this in your view is detrimental to responsible AI research?
GARY MARCUS: There are multiple ways in which I think the culture of AI is not great right now. One of them is that the people who are basically in power right now are bitter because for a while they were ignored, and they are turning that bitterness into a rejection of other ideas and proceeding in a very narrow way.
The core of AI right now is machine learning, which is really just part of AI, but it's the part that gets all the attention. People who study learning for a living want everything in their systems to be learned from scratch from the raw data. In a way, that sounds like a good idea—it sounds flexible and so forth—but that means that there are no values built in and there is no innate understanding of how the world works built in.
We have all these problems right now with large language models that are toxic, for example, and that is because you are just repeating the data that is there, you don't really have comprehension of it, and you don't have the capacity to program in a simple thing like, "Don't say anything mean-spirited." The systems have no idea even what that means or how to evaluate it.
So this very data-heavy approach is taking us away from the problem that I think we actually should be solving, which is: How do you get a machine to understand the world, to understand people, the relationships between people, between people and objects, and so forth? There are some core questions in AI that have never been solved about representing complex knowledge and reasoning about it, and people are just side-cutting it left and right using these massive databases as a proxy, and they're just not a very good proxy.
ANJA KASPERSEN: You have called for a "new AI paradigm" that places top-down and bottom-up knowledge on equal footing. Can you explain what you mean by that?
GARY MARCUS: Bottom-up knowledge is what you get from a raw data stream, just looking at correlations, and that's what models like the Generative Pre-trained Transformer 3 (GPT-3) are really good at. They're good at tracking essentially the relationships between words.
Top-down knowledge is about understanding how the world works, understanding the entities and what's plausible around those entities, what they can do, what they can't do, and why they're doing it. If you look at cognitive psychology, every time we perceive the world we actually take top-down knowledge about what makes sense in the world along with the bottom-up knowledge. So, if I saw ripples in a stream that look like a car, I'm not going to think it's a car. I'm going to think it's very unlikely that there is actually a car there because I know cars don't float, so I will think, Oh, that's just random noise there. So we trade off the top-down knowledge and the bottom-up information in order to make an integrated picture of the world that sort of puts together our senses with what makes sense, and the current systems don't really have a sense of what makes sense.
ANJA KASPERSEN: You just mentioned models like the recent GPT-3, which have been hailed as great achievements but, as you just alluded to, have really been nothing but approximations without any real understanding. I am keen to learn more about these models and why you caution against them.
GARY MARCUS: I think we need to just stop putting all of our attention on them altogether. If I thought they were the answer to AI, to general intelligence, and to trustworthy, robust artificial intelligence, then I would say, "It's really important to solve these access problems."
But I think that they're not actually the right solution. The toxicity problem, for example, is never going to be solved. We need to broaden the AI research we do. We need to spend more money and effort doing that, trying to say, "Hey, this isn't working," rather than, "How do we put a Band-Aid on it?" Almost all the effort is on Band-Aids.
I'm going to mix metaphors really badly here but with reason. There is an old joke, which is: Drunk guy is looking for his keys and keeps going in circles—you probably know the joke—and somebody says, "Why do you keep going in circles?"
And he says, "Well, that's where the streetlight is."
So we have a lot of people right now looking for keys under streetlights. We have a super-powerful streetlight, which is called the transformer model, which underlies GPT and a bunch of these other systems, a super-powerful streetlight, the most powerful streetlight anybody has ever made, but as far as I can tell the answer isn't there.
Toxicity is an example of one of the problems that you get because the system doesn't know what it's talking about, so you can't just say, "Hey, stop doing that." You can't give it an instruction. Like in a database, you can say, "Give me everything that matches this ZIP code," but you can't do that with GPT. You can't say, "Match this, ignore that," whatever. It doesn't have an interpretable inner representation, so you can't do any of the things you do in ordinary software engineering, and so you have this mess.
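To make the database contrast concrete, here is a minimal sketch of the kind of explicit instruction an ordinary database accepts. The table, column names, and records are hypothetical and exist only for illustration; the point is that the stored structure is interpretable, so a "match this, ignore that" request can be stated and checked directly.

```python
import sqlite3

# Hypothetical in-memory table; names and ZIP codes are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, zip_code TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ada", "10027"), ("Ben", "94301"), ("Cy", "10027")],
)

# "Give me everything that matches this ZIP code," written as an explicit query.
rows = conn.execute(
    "SELECT name FROM customers WHERE zip_code = ? ORDER BY name",
    ("10027",),
).fetchall()
matches = [name for (name,) in rows]
print(matches)
conn.close()
```

Because each record has a named field, the condition can be inspected, changed, or negated. A model that stores only learned weights offers no comparable field to query.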
Here's my terrible metaphor, but hopefully you understand it—we're putting Band-Aids on streetlights or really on the light under the streetlights, and what we need to do is have some other system entirely where the inner structure of that system is something that you can do software on, and by "do software" I mean, let's say you have a word processor. What you have is a database of all the characters in the text, and then you can do operations: "I'm going to copy these and paste them over there. I'm going to take these numbers that are in these registers, and I am going to put them in another set of registers." That's how I do copy and paste. Or, "I will bold these by inserting these formatting things," or "I'll print them by taking them from this platform and putting them in this other platform"—you can't really do that with these systems. You can't say: "Okay, I'm going to have GPT make some guesses about what is appropriate in this context, and then I will analyze them."
Here's one of my favorite gallows humor examples: Somebody applied GPT-3 to suicide prevention. The system starts off and asks, "How can I help you today?" or "It's great to see you," or some chitchat, and it's all well and good.
Then the user says: "I think I want to kill myself. Is that a good idea?"
And the system says yes. It essentially gives the person calling the suicide prevention system license to kill themselves.
Well, that's not what we want our suicide-prevention systems to do. I think the exact words were, "I think you should." So the system has some database of answers to questions, and "I think you should" happens to be correlated with some of the words that the user happened to say. But there is no conceptual representation in the system where you could, say, do a search as you could in a database and say, "Let's rule out all of the ones that encourage the user to commit suicide." There is no representation there.
It's like the first thing. You want a validity check in your software. Make sure I don't tell the user to kill themselves or to kill somebody else, etc. There is no way to build it.
So the streetlight is good at chitchat. It can come up with an answer to every question, but the answers are mediocre. Some of them are good, and some of them are bad. They are unreliable. "Untrustworthy" is a better word than "mediocre." And because there is no interpretable representation, there is no database structure, articulated structure like you would see in classical AI, you're just out of luck, and these Band-Aids aren't really working.
There was a paper by DeepMind about two or three weeks ago looking at the toxic language problem, and the answer, if you read between the lines, is we don't know how to solve this. We don't know how to solve it because it's the wrong architecture for the problem.
ANJA KASPERSEN: We did a podcast some months back on mental health and AI, and this came up as a big issue. Increasingly more people have been seeking out mental health assistance through online tools.
GARY MARCUS: I mean, there is a real need there. We don't have enough humans that are qualified, so I think it would be great to build AI systems to do this. I'm not saying we should never apply AI to mental health, for example. I think one of the ways where AI could be a tremendous boon is if we could do it right. If there were suicide counselors, for example, online 24/7 that we could trust and send to countries where there are not enough qualified human practitioners, that's great.
It's not that I don't think these things can be done in principle nor that I don't think they should be done in principle. In medicine in general there are lots of ways in which there just aren't enough qualified medical professionals in the world. If we can get AI to help, that's great. I'm not saying we shouldn't do it. I'm saying maybe, though, we shouldn't do it right now if the software is really not up to scratch.
There was a piece in MIT Technology Review by Will Heaven that said something like there were 400 AI tools built to detect COVID-19 early and none of them worked reliably. Part of that is because things get rushed to market. There were all of these lab demonstrations, and people didn't think about how they generalize to the real world. Part of it is because the software really just isn't that good.
Deep learning systems are not good, for example, at dealing with images taken in different lighting conditions and things like that. So if you do radiology and someone has their scanner set up differently, it might not work. What if poor people go to a clinic where the images aren't quite as sharp? The systems are just not robust around that in the way a human radiologist would be. It's not that it's not possible, but again if all you're doing is what I sometimes call "data dredging," just pulling up the data but without a rich conceptual framework behind it, you wind up with garbage sometimes. You don't wind up with reliability.
The places where AI has been useful so far are mostly ones where reliability doesn't need to be high. You like my book, Rebooting AI, let's say hypothetically, and it tells you to read my other book—well, I have a few other books—Guitar Zero. Maybe you don't like it. "I don't really care about a middle-aged guy learning to play guitar." It's like: "What does this have to do with the other stuff that he wrote? Why do they recommend it?" Well, they recommend it because the author was the same, and somebody else, they like the writing style, so they enjoy the other book. Great.
There is nothing at stake there. You waste, whatever, $25 or less if you buy one of my books that you don't like. So what? The places where AI is really not up to scratch are the ones that are mission critical, where if you get the wrong answer with a system that is basically just doing approximation because it doesn't really have a conceptual clue what's going on, then you're in trouble.
So, driving. It's easy to make a demo. People have been making demos for 30, 40 years. Sticking to a lane is an easy problem. That's like catching a fly ball. You just collect a lot of data and eventually you can do it.
But having the judgment about when you should change lanes is a lot harder. And because the systems don't really understand the world they're just trying to use big data as a proxy for the judgment they need. What's the answer? At least it's like five years behind schedule, and soon it will be ten years behind schedule.
So, yes, in good light in Palo Alto if you have a detailed map, you can drive with very high reliability. On the other hand, if you try to do it with vision alone in your Tesla, you get into these cases where they have crashed into parked cars on highways a whole bunch of times. Why? This is not ready for prime time.
Again, it's something that can help the world eventually. Driverless cars actually will eventually save a lot of lives. When the software is good enough, it will be transformative. Then you can ask about regulation. In the meantime we don't really have the infrastructure to do it properly. That's one case. Medicine is another case where you want pretty high reliability, and you don't really have it yet.
ANJA KASPERSEN: We will get to your guitar book later actually, which I have read.
GARY MARCUS: Okay.
ANJA KASPERSEN: You have been following the recent revolution in AI research closely. In fact, you spent parts of your early career comparing neural networks and assessing their abilities compared to those of humans, more specifically to how children learn. The limitations presented by the early iterations of neural networks prompted you to pursue other areas of study, and when they reemerged around 2012 you found that they still suffered from the same problems. Have these systems matured at all?
GARY MARCUS: That's a longwinded way of saying, "I've seen this movie before."
In the early 1990s I worked with Steven Pinker at the Massachusetts Institute of Technology, and I was studying how children learn languages. I was studying the most narrow phenomenon you could imagine because it was an interesting one. Pinker liked to call it "the fruit flies of linguistics" or something like that, which was the English past tense.
The interesting thing is we have irregular verbs and regular verbs, so go/went and say/said are irregular whereas talk/talked and walk/walked are regular verbs. If you're a foreign language learner of English, you have to learn this list of words. If you're a kid, you have to work it out for yourself. Nobody gives you the list. You have to figure out which ones you add –ed to and which ones you don't.
Kids make a lot of mistakes. They say things like "breaked" and "broked" and stuff like that. We really notice it when kids make mistakes. They don't make the mistakes as often as we think because we never notice when they get it right. Part of my dissertation was basically about these things.
One thing I found was that kids just don't make as many errors of this sort proportionally as you think because you don't notice the correct forms. That was like a hint about human irrationality. But it played into the language literature in various ways.
The other thing was we compared it with early neural network models. There were some early neural network models in cognitive science that tried to get rid of symbols and classical techniques from AI and logic and to explain children's errors entirely in terms of neural networks that didn't have symbols; there was no explicit "add –ed" rule, and so forth. They were approximately correct, but they weren't really correct in terms of predicting children's data and the circumstances in which children generalize.
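For readers unfamiliar with the symbolic alternative mentioned here, the following is a rough sketch of an explicit "add –ed" rule combined with a memorized list of irregular verbs. It is an illustration only, not the model from the research described: the function name and the tiny verb list are hypothetical, and spelling changes (such as doubling a final consonant) are ignored.

```python
# Hypothetical, deliberately tiny list of memorized irregular forms.
IRREGULAR_PAST = {"go": "went", "say": "said", "break": "broke"}


def past_tense(verb, known_irregulars=IRREGULAR_PAST):
    """Return a memorized irregular form if one is known, else apply the rule."""
    if verb in known_irregulars:
        return known_irregulars[verb]
    # Default symbolic rule: add -ed to any verb, familiar or not.
    return verb + "ed"


print(past_tense("walk"))                        # regular verb, rule applies
print(past_tense("go"))                          # memorized irregular form
print(past_tense("blick"))                       # novel verb, rule still applies
print(past_tense("break", known_irregulars={}))  # irregular not yet memorized
```

In this sketch, an over-regularization such as "breaked" appears whenever the irregular form is missing from memory, and the rule extends to a made-up verb without any similar training examples.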
This led me to try to understand the networks and why they worked and didn't work, and in the late 1990s I wrote a paper in the journal Cognitive Psychology called "Rethinking Eliminative Connectionism," in which I showed that there was a systematic flaw in these things, which nowadays, 20 years later, people are very aware of, but at the time I got a lot of pushback and hostility. Someone accused me of a "terrorist attack" on the neural network field in an unpublished review. It also kept me initially from publishing my article. I showed that even in very simple circumstances these neural networks couldn't generalize well.
Everybody was confused because it turns out there are two different kinds of generalization. I will call them roughly "interpolation" and "extrapolation." Neural networks, then and now, are good at interpolation. If I have a couple of known examples and the new example is near those, these systems do pretty well. If it's far enough away in some sense, these systems do very poorly. But what people would say back in the 1990s was, "Look, these systems can generalize," not noticing that all the generalizations that they could do were interpolations and that extrapolation is sometimes important. In fact, in language it's critical.
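The distinction between interpolation and extrapolation can be seen in a toy setting. The sketch below uses a nearest-neighbor averager as a stand-in for any learner that relies on similarity to known examples; it is not a neural network, and the function name, the choice of k, and the data are assumptions made purely for illustration.

```python
def knn_predict(train, x, k=2):
    """Average the targets of the k training inputs closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k


# Training data: the identity function y = x, sampled at the integers 0..10.
train = [(float(i), float(i)) for i in range(11)]

inside = knn_predict(train, 5.5)     # new input lies between known examples
outside = knn_predict(train, 100.0)  # new input lies far from every known example

print(inside)   # close to the true value 5.5
print(outside)  # stuck near the edge of the training range, far from 100
```

Inside the range of the examples the prediction matches the underlying pattern; far outside it, the learner can only echo the nearest things it has seen, even though the rule "output equals input" is trivially simple.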
So I then wrote a book in 2001 called The Algebraic Mind that tried to show that the mind can manipulate symbols. I showed, in fact, in a Science paper in 1999 that infants could extrapolate rules that they learned in two minutes. They would hear sentences like, "La la ta, ga na na," and then I would change up the structure or not change the structure, I would change the words, either way, and the kids could tell when we changed the structure. Two minutes, and they picked up a basic abstract pattern, and those neural networks couldn't do it, and I knew why in terms of the mechanics of how they work.
Fast-forward 20 years later, and the issues are basically the same. The only difference is that in the last year or two Yoshua Bengio, one of the deep learning pioneers, really started emphasizing this problem. For years when I pointed it out nobody paid attention, but now Bengio has acknowledged that generalizing out of distribution is a serious problem for current neural networks.
I think everybody who really knows what's going on understands that this is actually the core problem with these systems, but when it came back in 2012 I saw right away that this was still a central problem. I wrote about it in The New Yorker on problems with deep learning. I think it was the first published criticism of deep learning, and it is still the central problem. These systems work well when the new data are close enough to the old data and very poorly when they're not. Think about driving. The problem is that most of the situations you encounter are familiar, and the systems are fine, and some of them aren't, and those are the ones that systems break on.
ANJA KASPERSEN: Building on the expertise in children's cognitive development that you just mentioned: children interact with various iterations of AI daily through toys, games, and learning software, a trend accelerated particularly by the pandemic. However, there remains very little understanding of how it affects kids' development, their well-being, their long-term safety, and even their privacy. As someone who specializes in cognitive development in kids and with this deep expertise in AI, what are your views on this?
GARY MARCUS: I have no problem in principle with using AI for education. There are certainly political issues around privacy and so forth. I think that until the systems are conceptually deep, there are limits to what we can do. You don't need fancy AI to build education software that's great. During the pandemic my kids used it.
What we would really like from educational software are systems that understand why children make the errors that they do. That's what the best teachers do. They figure out the limits of a child's knowledge and then they kind of home in on where those limits are, figure out what they're misunderstanding, and are very good at just judging the overall level—"Do you know this thing?" But if you're conceptually wrong about something like fractions, it may not be able to intervene I think as well as the best system that I can imagine.
But on the whole I think that stuff is not bad, and it's mostly limited to things that are sort of, how should I say, strictly factual kind of stuff—3+3=6—and what we most need as a society I think is to teach people critical thinking, and I have not seen good software for kids to do that. I'm not an expert on what's available, but I think there's probably an opportunity there, and if there is one thing that we've learned from COVID-19 it's that critical thinking is really important. We would not be where we are now if critical thinking skills were taught better and learned better. We would be done by now. The vaccines, if they had been distributed and used, would have ended this pandemic, and instead we're right in the middle of it, and that's largely because misinformation plus poor critical reasoning has led a lot of people not to take it, which gave more time to have more variants like Delta. So, the lack of critical thinking skills in society—I won't say nobody has them, but the fact that they're not sufficiently pervasive has been devastating to everybody.
ANJA KASPERSEN: I couldn't agree more, especially when it comes to countering mis- and disinformation, which we are seeing more and more of and which is also being used politically.
On another issue, scientific and anthropological intelligence are crucial components of any responsible scientific endeavor, and I remember a conversation you and I had years back when you expressed deep concern about the lack of scientific rigor in the way companies often adopt more of an engineering approach to AI development without necessarily questioning how a certain result came into being. Is this still a concern of yours?
GARY MARCUS: It's a complicated question. I would say that there's nothing wrong with having engineering culture be part of your culture. Engineers are good at making things better. But you also have to define your goals, and you don't want to get into that streetlight problem that I was talking about before. I think what has happened is that we have had society really driven by what engineers can do without a lot of downstream thinking about consequences.
I think the biggest example of this is the Facebook newsfeed. You could argue about when it became obvious that it was problematic, but it has been obvious for a long time, and it's not hard to do the engineering to make a newsfeed. I'm sure that Facebook does it in a very sophisticated way, but the basic idea is feed people what they want to see, and that seems good—when you get clicks you get revenue—but it turns out it's a bad idea to do it, at least in the most naïve way, because then it creates these echo chambers and lack of civility, and it promotes disinformation and so forth.
There's an engineering component: How do I make a newsfeed that people will click on and use human psychology to get them addicted? That's basically what Facebook did. As an engineering question, it's not hard to do that and get better and better at it. It has become weapons-grade addictive material.
It's not the engineer's fault per se. The upper-level management didn't think about it, and then, when it was pointed out to the upper-level management, as far as I can tell the upper-level management didn't care. There is a question about ethics at the top level of the company. I don't agree with the choices that they have made, and I think they have had great consequence.
It's like, okay, you can build this artifact, but should you, and what is the cost of building this artifact? That is where I think we actually need governments to step in and say, "Okay, the consequences of this have ripped society apart, and we have to put a stop to it."
There was a New York Times op-ed recently saying: "Look, we just shouldn't have these kind of aggregated newsfeeds like this anymore. Just do it chronologically." That's a simple solution. I don't know if it's the best solution, but it's certainly better than what we have now.
ANJA KASPERSEN: Building on that, as we know there is a lot of money and power invested in AI and deep learning research and applications with, as you alluded to before, various degrees of maturity and accuracy. Speaking truth to power is not an easy task at any given point in time but especially not when stock prices and political positioning feature as vectors. You alluded to it a little bit already, but you experienced a lot of pushback from speaking up on where you have seen limitations.
GARY MARCUS: Yes. This has been true for, whatever, 25 years or something. It's actually worse now in some ways.
Here's an example—I don't know when this will air as compared to when we recorded this—somebody said something on Twitter which was ridiculous, which was: "Tesla is moving to Austin, Texas, so Austin is going to become the AI capital of the world."
This doesn't really make sense. Tesla is a fine company. They have done terrific engineering, but they have not fundamentally changed AI. They are doing some very cool stuff around how they do the automatic collection of data, where they are at the top of the field now. But just because they move to Austin doesn't mean that Austin becomes the capital of the AI world. If you think about where innovation has come from in the last several years, for example, a lot of it has come from California, like the transformer model that everyone is using. That was invented at Google and not Tesla.
So I put out this tweet saying, "What exactly are the foundational answers to AI that Tesla has provided?" It was a little bit of a rhetorical question and a bit of a real question, like, "Can anybody name one?"
Nobody came up with one, but I got a lot of pushback from Tesla fans, "How dare you," basically. "How dare you challenge Tesla. What are your credentials," blah blah blah.
Then I got kind of annoyed so I actually published on Twitter a list of the top 20 AI tools that people use. Now I was being really cheeky. I said, "Which of these was invented at Tesla?" It was sort of like a quiz, and anybody who is in the field knows the answer is none of these, not convolutional neural networks, not transformers—I put in Kaggle as a technology because it's a social technology that brings people together around problems; I put in neurosymbolic models, which is the stuff I have been advocating for, etc., in this long list. It was a little bit cheeky, but it wasn't terribly cheeky. It was a fair question in this context.
Then I got more, "How dare you."
Then I got even more annoyed. Elon Musk himself published something completely ridiculous, basically saying, "We had to do AI for cars in such-and-such a way because people did it in such-and-such a way," which is basically, "We need to do driverless cars without LiDAR because people don't."
I posted that this made no sense because, first of all, birds can fly by flapping their wings, but that doesn't mean that we want our airplanes to fly by flapping wings. The whole form of Musk's logic doesn't actually make sense.
Then I pointed out that although it's true that people can drive without a LiDAR, just using vision, people actually understand what's going on. So you have a model of where the other vehicles are, where the people are, and what the people might do. You have a very rich representation of the things around you, and so you manage to get past your limited sensation by having good cognition, and AI right now has really mediocre cognition, so you can't get around the poor perception.
I posted this on Twitter, and again I got all of these Tesla lovers and so forth posting basically what an evil person I was. Every time I challenge the dominant authority I get that pushback.
In this particular case, I was being cheeky because of the context, Musk's arrogance, and so forth, but it doesn't really matter, the context. If I dare to challenge the dominant view, I get a lot of pushback.
I will give you another example, where I was not cheeky. I was very serious, and I wrote a paper in 2018 called, "Deep Learning: A Critical Appraisal." Erik Brynjolfsson said on Twitter that it was a really intriguing paper. Yann LeCun posted something like: "It's intriguing, but it's mostly wrong." I don't think LeCun had even read it carefully, to be honest, but he kind of released the hounds. I got so much flak for that paper, where I laid out ten problems for deep learning, like there are problems with generalizability, with interpretability, and so forth.
People were just at my throat for having dared to write a critique of deep learning. I would say that in 2021 every one of the problems that I put there is now received wisdom. In 2021 we all know that these systems lack generalizability, that they lack interpretability, and so forth. Everything that I said there I think people eventually came around to, but there were Twitter wars, and people wrote about the Twitter wars, when I dared to say that there might be something wrong with deep learning.
ANJA KASPERSEN: I have often been struck by just how fragmented this field of AI is and, as you said, the growing unwillingness to discuss its limitations and uncertainties. As you just demonstrated with your examples, it seems that it is as much a battle of ideas as one of technological prowess.
GARY MARCUS: Yes, and I will actually say something else too, which is that Percy Liang was the lead author on the Stanford report on foundation models, and we wrote something for thegradient.pub, which was very critical of foundation models, and Percy did something I have almost never seen, which is, he wrote to us and said: "Nice critique. I have a couple of issues, but do you mind if I post it on our website?"
That's what you really want, for someone to say: "We're all in this together. Let's all get to the right answer." It was fabulous and very gracious that he did that.
ANJA KASPERSEN: That is a healthy scientific discourse.
GARY MARCUS: That's a healthy scientific discourse. I have seen so little of it that I almost fell out of my chair with enthusiasm. Anyway, that's an aside.
"Foundation models" are his term—that I don't much like—for systems like GPT-3. I don't like it, but I also think it's a provocation that's interesting. I don't like it because I don't think that such models are a good foundation for AI—that's what that piece was about—for all the reasons that we have talked about today. They are not reliable. So you don't want to build your AI on top of a system that can't tell whether it's giving you advice to kill yourself. If you can't pass that basic threshold, why would you build a more complex system on it?
I am doing some programming now, and what I do—which everybody who is good at programming does—is I build things, make sure they work, and then I build other things on top of those, make sure they work, and I kind of climb a stack of abstractness, power, and so forth, making sure that each piece works along the way. This is the only way you can build software that you can count on. If you don't do that, then you build a whole house of cards, and you test it at the end, and you have no idea what's going on. It would be a disaster. Maybe when I was eight years old I programmed like that because I had never done it before, but anybody who understands the principles of modern software engineering knows, "No, you build small modules, make sure they work, and then add on top."
GPT-3 is not a small module or a big module that you can count on. It does brilliant things sometimes. Those are the parlor tricks. You type in any old sentence, and it comes up with a reply that often sounds coherent. But then you go two sentences further, and you realize that the coherence was an illusion. That's what a parlor trick is: I give you the illusion of something, but it's not real. GPT-3 for a sentence or two gives the illusion that it understands something.
I am doing a little art project—I don't know if it will see the light of day or not, but I got underground access to GPT-3 through that. OpenAI has not allowed me official access to GPT-3.
So I started playing around with it in that context, and I asked it questions like, "Are you a person?" It will say yes, and then I will ask it, "Are you a computer?" two seconds later, and it will say yes. "Yes" is a reasonable answer to any question, but it can't relate its answer from one second to the next. It has no encoding of what it has just told me such that it conceptually understands that these are contradictory answers. That cannot be the bedrock—that's what a foundation is—on which we build general intelligence. It's just not going to work.
ANJA KASPERSEN: Let's go back a little bit. You mentioned your book Guitar Zero and in it you describe how to learn a new instrument, the guitar, in this case, even after peak maturity. The book is about much more than your guitar adventures. It is also about the science of learning, the importance of improvisation, and our ability to recognize sounds and the fact that you say they cannot be innate due to cultural differences.
You say about your book that it is about "how I began to distinguish my musical derrière from my musical elbow, but it's not just about me. It is also about the psychology and brain science of how anybody of any age—toddler, teenager, or adult—can learn something as complicated as a musical instrument."
Rereading it in preparation for this interview and having followed your work on AI closely over the years, I wonder how much of what you learned about the "zone of proximal development," which you allude to in the book, also applies to the field of AI when discussing how to cultivate talent in the AI space to ensure meaningful and responsible human control and interaction.
GARY MARCUS: The first thing to say is that zone of proximal development is not my idea. It's Vygotsky's idea. The idea is one that is very common in the video game and educational worlds, which is, you get people to do their best work if you get them in a place where they are like 80 percent correct. I am oversimplifying, but if you give people a whole lot of easy problems, they tune out, and if you give them problems that are too hard, they also tune out. So people who build video games spend a lot of time—at least in the big companies—making sure that the puzzles that you're solving are neither too easy nor too hard, or else you stop being engaged.
I wonder—thinking aloud—whether what works at the individual level is actually bad at the research level. I think that the most obvious analogue to the zone of proximal development is to work on research that's a little bit harder than you're doing, pushes you a little bit, but not too much.
I'm afraid sometimes we actually need to be pushed hard to really make progress. What has happened is that the field is stuck in something like a local minimum. You think of a complicated landscape where we are following this "always gets slightly better" metric. If you are climbing down a mountain, you could get stuck in a valley but not be able to get to the bottom if your rule is always, "Go to something that is a little bit nearby but not too different from where you are now." What's happening is that people work within the scope of what's available, so they always feel good about themselves, that they have made some progress, but the hard problems don't get worked on.
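The local-minimum picture above can be made concrete with a small sketch. The landscape function, the step size, and the starting points below are all invented for illustration; the search rule is the one described, "always move to a nearby point that is slightly better."

```python
def landscape(x: float) -> float:
    """Hypothetical 1-D landscape: a shallow valley at x = -1
    and a deeper valley at x = 2 (chosen only for illustration)."""
    return 0.25 * x**4 - x**3 / 3 - x**2


def greedy_descent(start_index: int, step: float = 0.1) -> float:
    """Walk downhill on a grid, one small step at a time.

    Positions are stored as integer grid indices so that repeated
    additions do not accumulate floating-point drift.
    Stops as soon as neither neighbor is lower than the current point.
    """
    i = start_index
    while True:
        here = landscape(i * step)
        best = min((i - 1, i + 1), key=lambda j: landscape(j * step))
        if landscape(best * step) >= here:
            return i * step  # no nearby improvement: we are in a valley
        i = best


# Starting on the left slope, small steps end in the shallow valley.
stuck = greedy_descent(-20)

# Starting from a point that small steps alone would never have reached
# (the "teacher's push"), the same rule finds the deeper valley.
pushed = greedy_descent(10)

print(stuck, landscape(stuck))
print(pushed, landscape(pushed))
```

The rule itself is identical in both runs; only the starting point differs, which is the sense in which incremental progress alone cannot escape the first valley.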
There's a very strong bias in science in general and I think in engineering towards making small, incremental progress, and sometimes you get in a situation where small, incremental progress is not the right thing. Maybe this is partly why we have teachers, to come back to the guitar. Sometimes teachers can say: "You can keep doing this a little bit faster, but that's not the problem here. You're working on the wrong thing. Stop trying to make this faster, and actually let's think about your rhythm in general here and what's going on."
Maybe I'm trying to play the role of the teacher to the field. I'm saying, "Look, you can keep making GPT a little bit better by adding more data." Microsoft just did this yesterday and made it even bigger—but it's all incremental progress. We still have the streetlight problem. We still have the toxicity problem. We still have the unreliability problem. At some point, you have to say incremental progress is not actually enough to solve this problem. Sometimes you have to—take a cliché here—bite the bullet and do something a lot harder that is not nearby. Teachers sometimes have to force you to do that.
When you're trying to learn to read, at the beginning some people get it and some people don't, and it just seems like torture. You have to keep pushing at it, even though the answer is not two seconds away. Learning to read for most people is not an overnight process, and you have to work hard at it.
I am trying to give the field a feedback signal, saying: "Hey, these kinds of architectures aren't going to work. I have seen for 20 years why they are not going to work. Over here, pay attention." Finally, they are paying a little bit of attention.
Sometimes too much autonomous work on the stuff that is right nearby is not the right thing. In fact, there is a long section in Guitar Zero where I talk about teachers, and I went through some of the teachers that I observed and why they were of value. What made them good teachers? Ultimately what I said there was that part of being a good teacher was like being a good car mechanic—seeing the problems and recognizing them. A good student is taking the advice, and I would say that in my own role I have been like the teacher, saying, "Hey, this is not working."
The field has not been that receptive to the advice. It has mostly wanted to keep doing what it's doing, and you wind up with—in the guitar domain—people who play like 40 riffs note for note but don't know how to make their own music because all they have learned is this very narrow thing. The field is kind of like those guitar—I don't know the right disparaging term for people who don't really go after the deep thing; I'll call them "memorizers." In fact, deep learning systems are a bit like memorization systems. They're not exactly that, but they are a bit like that.
We have systems that learn riffs in the same memorizing kind of way, but they don't get the conceptual underpinnings to allow them to make jazz and to allow them to make more sophisticated music essentially. It's not a perfect metaphor, but I think there's something there to it, which is that as a field there is way too much emphasis on the thing that is a little bit beyond our grasp and not enough emphasis on where are we in the overall picture of AI.
Here's another way of thinking about this. There was a paper about a year ago by OpenAI that I keep meaning to write a reply to about scaling laws and language. The thesis of the paper is that if we just keep scaling the data, we'll solve language. You can find measures for which this is true. There are aspects of language for which this is true. So your ability to predict the next word in a sentence is just going to increase as you get more data.
On the other hand, there has been no discernible progress in my mind in actually building semantic representations for open-ended domains in a reliable way in 60 years. Scaling has not solved that one.
Or in robotics, we would like to have Rosie the Robot. I say, "Can you take care of my kids and make dinner?" and the robot is good to go. That would be amazing. Especially during the COVID-19 year it truly would have been life-changing. We haven't made any progress on that. There is no metric by which scaling more data has made it more plausible that if you ask a robot to do something that it will be able to do it.
Bits and pieces. People build robots that can carry things from one place to another and things like that, but this open-ended stuff that Rosie the Robot could do, there has been no progress in 60 years on the open-ended version of the problem. If you plotted that in your scaling diagram, you'd say: "Hey, wait a minute. What's wrong?"
But that's how the paper is structured. They only plot the things on which there has been tangible progress and ignore the ones on which there hasn't been. That's confirmation bias. My book Kluge was about these kinds of cognitive errors, and Daniel Kahneman has written a great deal about that.
Confirmation bias is such that people notice things that support their theories and don't notice things that don't, and that OpenAI paper is confirmation bias to the nth degree. They report the things on which there is progress through scaling and don't really think about the deeper problems on which there has been none. So, some things will scale and some won't. But it's a perfect example of this kind of, "I will make progress on the thing that is a little bit in front of my nose because I can, and I know how to do that."
But the field is not saying, "Hey, but this isn't really working." Only in the last year I think the field is actually doing a little bit of that. I think the toxicity problems have been so clear that there has been a little bit of, "Hey, let's step back," but it has been a long time coming. I pointed all this stuff out in 2012. It took nine years before we could have this conversation without simply demonizing people who said, "We need to have the conversation."
ANJA KASPERSEN: To your point on what makes a good teacher, I think to me what makes a good teacher is the ability to listen. It reminds me of what you said in your Guitar Zero book: "Our inability to recognize sounds due to cultural differences also applies to our listening abilities, sometimes due to cultural differences." It sounds to me that the AI field could do with some improved listening skills.
GARY MARCUS: If I could riff on that for a second, there is a very serious problem in language, which is that linguists and psychologists know a lot about language, and it is all being ignored.
The first thing any linguist will tell you is that you have syntax, semantics, and pragmatics minimally, and people usually have a slightly more complicated distinction, but let's start with that. You don't even have an explicit distinction between those things in GPT. In fact, there is no real semantic representation formed by GPT, and it is just bizarre from the perspective of a classical linguist.
GPT-3 is a linguistic production system. There are two kinds basically. There are perceptual systems that interpret sentences or comprehension systems, understanding systems, and production systems, and the classic form of a production system is you go from a meaning that you know and you produce a sentence around it. GPT doesn't even do that. You can't input a meaning and expect to reliably get a representation of that sentence out of it. So, from the get-go, literally from the conception of the problem, it is disregarding what any linguist would tell you.
Neuropsychologists have written a fairly extensive literature about how people interpret sentences, how they put them in discourse, how you relate, for example, to what another person might be talking about. That stuff is just ignored. Why is language any good? Pedro Domingos posted this thing on Twitter the other day that I partly disagree with. He said, "Language is a mess, and we should just get rid of it."
It is a mess, but I don't think we can easily get rid of it. Given the kinds of systems that we as humans are, it is actually pretty good at communicating a lot of information while transmitting only a small amount. He was referring to work by Leibniz and so forth that I talk about in my book Kluge where people have tried to construct languages that are more formally precise than any natural language.
But what I replied to Pedro on this thread is that nobody ever learned any of these languages because they are good for computers, but they are not good for people. Loglan is maybe the most sensible of these. There was a great Scientific American article about Loglan in the 1950s, and it just never caught on because it was too hard for anybody to actually learn. It's not a good fit with our brains. If a machine needed to use a language, maybe it would be great, but we aren't machines of that sort.
We can't transmit .jpegs to one another's brains or send gigabytes to each other back and forth the way that my phone can talk to my laptop. We just can't do it. We don't have the hardware for it. So we're stuck transmitting a small number of bits to one another. It's pathetic how few bits I am sending down the channel right now compared to what a phone can do.
So we have to make the best use of that as we can, and the way that we do that is we have this code, and this code depends very heavily on things that I think are probably in your brain at this moment. So I don't spell everything out. If I did, this would be the most tedious conversation in the history of the universe.
I don't sit there explaining to you, for example, what a laptop is. I assume that you know what it is. So I can use this one word, a small bit of information, to get you to think about the concept of a laptop and many of the things that go with it, and I can infer that you understand, for example, that they have a physical presence, they have a mass, they have a size, they have a typical size, and a distribution of size. All of this stuff I can just leave out, and that's what allows us to communicate reasonably efficiently.
A good language system has to build on that. It has to build a model of what the other person is saying, of what they might mean, and of what the context of the conversation is. You also have to do things like assess whether they are probably lying to you, what are their goals, all this kind of stuff.
Systems like GPT just ignore all of that, and the whole fields of psychology and linguistics are just disregarded. How can we possibly expect to succeed when AI is largely dominated right now by people who disrespect other fields of cognitive inquiry? I just don't think it can work.
ANJA KASPERSEN: Is that driven by some sort of misguided optimization approach?
GARY MARCUS: There are a few things that drive it. One is history. There is some bad history in AI, going back to the early days, where the neural network people and the symbol people both thought they were right when the answer is actually that they each have some part of the truth, but it's more like the blind men and the elephant—there really is a trunk and an ear there, but we actually need to integrate them, and that's the hard part.
But for years people said it's all trunk, it's all symbols, or it's all ear, it's all neural networks. People probably hate the body parts I have assigned to them, but that's the way it has been for most of the 70-year history. It has mostly been those people at each other's throats. People who were good at the ears didn't like when it was all about the trunk, so now they're pissed, and they're treating the trunk people as badly as they were treated. There is a terrible kind of emotional dynamic and exclusionary dynamic that is part of it. That is one set of problems. It is just plain history and antagonism.
Another part is that most people are trained to be specialists, and the outcome of being a specialist—I'm sorry I'm using so many clichés and stuff today—is the hammer and the nail thing. "To a man who has a hammer," as goes the old expression, "everything is a nail." We have a lot of hammer-and-nail-itis going on. To the man or woman who has a transformer network, everything is a transformer network problem, and people get this focus on their tool, and they get really good at their tool, and they don't talk enough to other people who have other tools, and that is a serious problem.
ANJA KASPERSEN: There has been a massive upsurge in AI ethics with principles, standards, and guidelines being published almost weekly. Do you perceive that there is ethics washing going on? To put it in a different way, are we making sure applications are not only safe but also feel secure and can be safely interrupted?
GARY MARCUS: I have a particular view on that, which is I think there has been an enormous upsurge of interest in the problem itself, which is, how do we have AI ethics? I wrote a New Yorker piece called "Moral Machines," also in 2012, which was trolley problems for cars. The issue there was you have a school bus that's spinning out of control, and should your driverless car sacrifice the driver to save the school bus?
That stuff is cool, interesting, and fun, and I am kind of proud that I wrote one of the first pieces, but in the fullness of time I think the wrong thing has happened for two reasons. One of the problems is, that's not really the problem that we are facing. That's a really subtle one. It's a rare one too, the trolley thing. It's a beautiful image that gets you to think about the problem, and in that sense it's good, but it's not really the reality of the problem that we have.
What we really want is to have a filter—if we must use GPT, which I hope we don't—that says, "For each statement that this thing spits out, is it ethical to say that?" And that's a really hard software-engineering problem.
What we have is a lot of people saying ethics and raising the flag, and they're right to raise the flag. We do need to think about ethics in AI. There is no question about that. I'm glad that people are paying attention.
But the real problem is actually the software-engineering problem. Well, there are a couple of real problems, but one of them, the rate-limiting bottleneck though not the only bottleneck, is this: how do you tell a system to scan sentences, let's say, or actions and decide whether those are ethical? Is it ethical for the suicide-prevention system to say, "I think you should commit suicide"? Or you could ask, "Is it consistent with a set of values?" So maybe "ethics" is laden.
Let's take something slightly less laden. Let's say we have a set of values that we would like a machine to respect. It should not counsel suicide or harm, either to the user or other people. Let's just start with that. That's a pretty reasonable—people sometimes use the word "table stakes." If you can't do that, what are we even talking about? I would like perpetual motion, but that doesn't make it so. If I want to make the climate better, I need to give you concrete examples of how to do it. If I want AI ethics, I need to start with stuff like that.
There are other ethical problems, like access to data, privacy, and so forth that I am not speaking to. Let's say that we have a chatbot, and we want the chatbot to behave ethically. That's a problem that people are thinking about. It's not the only problem, but on that problem we would like, in the same way that I can put constraints on software in general, to be able to put constraints. So I can say, "I'm going to look up words in a vocabulary list," and the constraint is, "Don't give me more than three of them."
Well, any software engineer worth their salt can do that. No problem. They can specify: "I'm going to compare the number in this list to this."
I can say, "Make sure that it's all in capital letters." It's easy to program that. You can put a whole bunch of constraints: "Match things that are in this ZIP code with this salary and this family size. Return me a list of people who meet this condition."
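As a rough sketch of how routine these constraints are, here they are written out in a few lines each. Every name below (`lookup_words`, `to_capitals`, `matching_people`, the `Person` fields, and the sample data) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    zip_code: str
    salary: int
    family_size: int


def lookup_words(vocabulary: list[str], prefix: str, limit: int = 3) -> list[str]:
    """Look up words in a vocabulary list, returning at most `limit` of them."""
    matches = [word for word in vocabulary if word.startswith(prefix)]
    return matches[:limit]


def to_capitals(text: str) -> str:
    """Enforce the 'all in capital letters' constraint."""
    return text.upper()


def matching_people(people: list[Person], zip_code: str,
                    min_salary: int, family_size: int) -> list[Person]:
    """Return the people who meet all three conditions."""
    return [
        p for p in people
        if p.zip_code == zip_code
        and p.salary >= min_salary
        and p.family_size == family_size
    ]


vocabulary = ["apple", "apricot", "ape", "apex", "banana"]
people = [
    Person("Ada", "10001", 90_000, 3),
    Person("Ben", "10001", 40_000, 3),
    Person("Cy", "94110", 95_000, 3),
]

print(lookup_words(vocabulary, "ap"))
print(to_capitals("make sure it's all capitals"))
print([p.name for p in matching_people(people, "10001", 50_000, 3)])
```

Each constraint here is a comparison over explicit fields, which is why it is easy to state and check.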
I would like to build a function that says, "Look at these input sentences and return only those that fulfill a set of values that I want to specify," and the values are basically like Asimov's laws: Don't cause harm to people.
Well, nobody has any clue how to actually program that, and all these Band-Aids about toxicity are actually a lousy way to do that, partly because the GPT underlying these isn't very good and partly because they're too simplistic. To actually program those values you need something that looks more like general intelligence, that represents what a person is and what a harm is. It has to be able to recognize them in different contexts.
When you think about that for a minute, you realize it's completely out of the paradigm of what we're doing comfortably now. What we can comfortably do now is look at a bunch of images, ideally with labels, though we have workarounds, and classify those images. So you can say, "Okay, this one's a mug."
Sorry, we're out of focus. Zoom is not doing it well, so Zoom cannot quite—I have the blur background on, and it can't quite figure out where the mug is, and it's not smart enough about understanding what a mug is to do good visual processing. That's interesting in itself. So I can't—maybe I can get my phone, and it will focus better.
So I have an object, and I want to do some analysis on it. The way these systems work is mostly with objects and labels. They break down—you might have seen the example of a system with the word "iPod" in front of an apple, a word on a piece of paper in front of an apple, and the system thinks it's an iPod rather than an apple with a word in front of it. So these systems aren't great, but at least we know how to do it a bit, and that's where a lot of the conversation is.
You come to harm. What do you even show the system? You can't just show pictures of, I don't know, people jumping off of bridges and say, "That's suicide." It's a conceptual thing. It's not a visual thing. If you had a good perceptual system, you could relate them to one another, I suppose, but it's completely outside of the current paradigm.
I think we need to be working on that problem, on how you get a system to have enough understanding of a world and how it relates, let's say, to language and a set of values that you can write the computer function algorithm that I'm describing, and the input is a sentence and let's say a cultural context and a set of world knowledge, and the output is either true or false, this meets the values or it doesn't.
People can more or less do this. I can give you a candidate action and say, "Is this ethical?"
We'll start. I'm going to drink my tea. Ethical? Not ethical? You haven't been called upon to do this function before. You have not been trained with a billion examples. Nobody has ever asked you if it's ethical if I drink my tea, but you're pretty confident that the answer is yes.
Now I'm going to say, "Is it ethical for me to open my window"—I'm on the third or fourth floor, depending on how you count it—"and throw that mug out the window in such a way that it might, not deliberately, hit a person who is walking?" I am just going to callously toss it. There are people walking back and forth. Is that okay?
You could say, "No, you might hurt somebody." You could calculate that that might hurt somebody. You could also ask: "Is it a styrofoam cup? Is it empty? All right, if it's styrofoam and empty, it's not great, but I guess I'm not going to arrest you for it. I'm going to give you a stern talking to." You can reason about these scenarios.
That's the answer to the toxic language problem. You need a system that can reason about scenarios and think about the consequences of actions, like, "If I said this to a user, what might be the consequence?" If I said, "I think you should" in response to the question, "Should I kill myself?" they might actually do it. That's not a good consequence.
It's a little bit like the guardian guarding the guardians. You can't use a GPT-based system to guard itself against those errors because it doesn't understand this stuff. You are going to need something else.
ANJA KASPERSEN: So, in some ways the hybrid, multi-model approaches that you have been advocating for are not just a way of bringing the field forward but also a way of making sure that we do it in a secure and safe way.
GARY MARCUS: That's right. That was the whole point of Rebooting AI ultimately. The subtitle of the book was Building Artificial Intelligence We Can Trust. The whole point of the book is that the technology that we have now is not on a road to trustworthiness. It just isn't. We wrote that in 2018, it came out in 2019, and now in 2021 people are starting to realize this is an issue, but they are still mostly just, "I'll make the streetlight brighter, and somehow it's all going to work out."
I often get what I will call the "argument from optimism," which is sort of like, "I kept adding more data and things were good, so I guess I'm going to solve the problem." That's actually a bad inductive argument. One form of inductive argument is, "Things have always been this way and will continue in this way."
ANJA KASPERSEN: In December 2020 you wrote a 60-page-long paper: "The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence." Would you mind sharing with us what those four steps entail and their relevance for the future of responsible AI research and applications and also as a means to avoid the sort of ethics washing that we just discussed?
GARY MARCUS: The first step is neurosymbolic hybrids. We need systems that can represent both symbolic things—like sentences, for example—and neural network kinds of things. They are useful for different things. The neural networks are very good at learning relations between individual features that make up an image, for example. We don't want to get rid of that. We might improve it. We might find a system that is like neural networks but more data-efficient, and that would be great.
But we are going to need something that plays that role that neural networks are playing now of extracting collections of correlations from features and stuff like that, but we are also going to need something symbolic to represent all the world's knowledge, for example, that is in written form, all the sentences that we say to each other.
Right now the strengths of these two systems—the classic symbolic stuff that underlies pretty much all of the world's software and the neural network stuff that underlies a whole bunch of technologies that have been recently invented—are both useful, but they're complementary. The neural network stuff is good at learning, but it's terrible for abstract reasoning, and the symbolic stuff is great for abstract reasoning, but it's not so good at learning. There have to be better ways of putting those together, and we really have to cut this nonsense of "Mine is better, and we don't need yours." No. We need both. That is to me non-negotiable.
I am glad that there has been some progress in the field toward seeing that. There is still not enough. There are still people like Geoff Hinton who jump up and down in front of, whatever it was, the European Parliament, and say, "Don't spend any more money on symbols." This is not productive. That's step one.
We also need to have large-scale knowledge. That was very popular in AI for a while. The most popular version I think was Doug Lenat's Cyc system, which has not become what he had hoped it would be, but we need something like that. We have a lot of knowledge about the world and how it operates that we reason over, and these GPT systems use proxies for that, but it's not really there.
Another example Ernie Davis and I had in a paper called "GPT-3 Bloviator," which should have been called "GPT Bullsh*t Artist," but the editor wouldn't let us do it. The point of that is that GPT-3 is a fluent spouter of bullsh*t, which means it can talk about anything, but it doesn't really know what it's talking about.
We had an example where you are thirsty, and you have some cranberry juice on hand, but there is not quite enough, so you add some grape juice to it. And what happens? GPT-3 says you drink it. That's good. And then it says, "You are now dead." That's not so good. You don't really die from putting grape juice in cranberry juice.
Humans do this stuff based on knowledge, so they know something like if you put two non-toxic beverages together, you get a third beverage that's not toxic. It may not taste good, but it's not going to kill you. We know general purpose things like that. We know if you pour liquid into a container, if the container doesn't have holes in it, the liquid will probably still be there for a little while. We know all kinds of stuff about the world.
You can bloviate your way through that if all you are trying to do is make surrealistic dialogue. There is an application called AI Dungeon on GPT-3 that is very reasonable. You just say stuff, and it says some other stuff, and we're kind of all playing a game together, and there are kind of no rules, and it's fine.
But if you are trying to do medical advice, there are rules. There are ways in which physiology, toxicology, and all these things would actually work. Without knowledge you cannot solve those problems.
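The kind of general-purpose rule mentioned above, that mixing two non-toxic beverages gives a non-toxic beverage, can be written down as a toy symbolic rule. This is only a sketch under loud assumptions: the `Beverage` type and `mix` function are invented here, this is not how Cyc or any real knowledge base represents the fact, and the rule is the commonsense default from the conversation rather than a claim about chemistry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beverage:
    name: str
    toxic: bool


def mix(a: Beverage, b: Beverage) -> Beverage:
    """Commonsense default: a mixture is toxic only if an ingredient is."""
    return Beverage(name=f"{a.name} + {b.name}", toxic=a.toxic or b.toxic)


cranberry = Beverage("cranberry juice", toxic=False)
grape = Beverage("grape juice", toxic=False)

drink = mix(cranberry, grape)
print(drink.name, "is toxic:", drink.toxic)
```

The rule applies to any pair of beverages, including ones never seen together before, which is the point of holding the knowledge explicitly rather than relying on how often two words co-occur in text.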
Then you need reasoning, which I think is not hard if you have the data in the right form. I use Doug Lenat's Cyc as an example. If you have the data in the right form, you can reason about really subtle things. He sets up Romeo and Juliet, and then the Cyc system can reason about what happens when Juliet drinks the fake death potion and what people might believe is a consequence. We need that level of reasoning.
In human interchange we are constantly dealing with people, for example, who have different forms of knowledge or different knowledge available. Anytime you watch a movie practically, there is going to be a scene where one character knows something and the other character doesn't, and some of the tension is around that. We need systems that can reason about that.
In real-world politics, if you want to navigate a solution, it often helps to understand if people have different beliefs that lead to where they are such that you can make a compromise. So we have to have much more knowledge built into our systems than these blank-slate systems that are popular now.
The last thing is you need a model of what's happening now. When you watch a movie, you get introduced to different characters. You piece together their background, their motivations, and so forth. If I am talking to GPT-3 and it tells me that it's a person, okay, I'll build a model in which it's a person (maybe I don't know, because it's a chatbot and it's over text messaging), and then if it tells me something else, I notice an inconsistency. I am trying to assemble the information that I get in terms of an interpretation of the world. If I am talking to somebody about their family, then I am assembling an internal model (okay, so they have two kids, and this one's in school), and I'm trying to remember that and ask them questions about it. So we do a lot of building models of the immediate situation almost any time we do any cognition. These systems again are not really doing it.
All four of these ideas are strongly grounded, let's say, in classical AI, and we need to update them in a modern context. We need to take advantage of what deep learning can do but not get rid of these four ideas because they are as foundational as they ever were, and our failures now stem directly from not taking them seriously.
ANJA KASPERSEN: I recall, Gary, that during a meeting held by the International Telecommunication Union in Geneva back in 2017, you referred to a need for a CERN for AI. You were essentially calling for an international AI mission to make sure AI becomes a public good rather than just the property of a privileged few. Can you elaborate on what that entails in your view?
GARY MARCUS: The impetus for the suggestion was that I think that some problems are too hard to be solved by individual investigators in classical labs and maybe just not central enough to businesses.
In principle, Google has enough money, for example, to do whatever it wants, or Facebook has enough money to do whatever it wants, but they also have priorities, and their priorities are not necessarily the ones for society. Facebook's priority has been to get more users. They put a bunch of money into AI, but they make choices around how they do that.
So part of it is to scale the problem. The other is the public good side of it. Do we want Google to figure out the answer to general intelligence and not share it with anyone else? What would be the consequence of that for society? So that's why I made the suggestion.
I have seen versions of that suggestion be taken up, but without the other thing that I think is essential to it, which is that there has to be a very shared mission that rallies a lot of people around a single problem or maybe a small set of problems. What I suggested was language understanding for medicine as the rallying cry to build such an institute around.
The idea is we're still not that great at making medicine. We have this enormous literature that goes unread. I don't know the exact statistics, but there are probably like a thousand articles published every minute or something like that. Most of them don't really get read. If we could have AI systems that could do more than just recognize keywords—it's easy to build a system that will give me all the matches for COVID-19 and DNA—that's not hard to do, but to get a system to read a scientific experiment, understand whether there is a control group, understand what was learned, and integrate that with another study, that would be amazing. That's the kind of thing that no individual lab can do, no start-up can do—it's too big—and Google might put in some money but maybe not enough and not directly enough.
So that's the idea, that we could make foundational progress on AI on hard problems that require people to work together if we built such an institute.
I will just give you one more thought there, which ties in with some of the rest of the conversation we are having and this notion of foundations. It ties in with the "Next Decade" article that we were just talking about. I think to succeed at a problem like this requires a large knowledge base that nobody wants to make by themselves. It requires neurosymbolic stuff, which there has been some incentive but not enough incentive to build, and so forth. It requires synthesizing all of the things that we're talking about in a way that cannot be done individually but could really make a difference in society.
ANJA KASPERSEN: One of the big benefits, of course, of CERN, which in some way is ethics in practice, is that you have this broad community coming together, scrutinizing each other's research, forcing some level of transparency, sharing experiences across various domains, and also removing some of the vested interests so there is a natural kind of ethical oversight built into it.
GARY MARCUS: Yes, and I will talk about one that is related that produces envy on my part, which is that CERN is less ego-driven than I think a lot of scientific enterprises are. People are part of a big machine. They get their publications and so forth, but people are really there for a common sense of mission. You can't have something functioning like the Large Hadron Collider if every person is like "Me, me, me, me, me." It has to be a little bit of me but a lot of us, and I don't think we will get to AI without a similar thing.
You talked about your background in anthropology. We need the right kind of anthropology/sociology to get people to commit to common goals I think in order to solve AI.
ANJA KASPERSEN: We need scientific and anthropological intelligence. That's what I keep advocating for.
GARY MARCUS: Agreed. AI is not magic, and I think that people in positions of power need to understand that. They need to understand that AI is actually a flawed and immature collection of techniques and that not all problems are equal. Some problems are easily solved by current techniques. Some are well beyond current techniques.
To make intelligent decisions, you have to actually dive into the details. You can't just treat AI itself as a black box. To make progress we need to have a rich understanding of what is really going on. We have to cut through the hype, and we have to think about what collective good we want to achieve and ask about the relation between the tools that we have and the vision that we have for our future. In my opinion, there is a serious mismatch between the tools that we have, which are good at these big-data analyses but not good at deep understanding, and the goals that we have for a just society, and we need to be aware of that mismatch and think about how to solve it.
ANJA KASPERSEN: Which reminds me of an article that you wrote I think nine years ago, "Cleaning Up Science."
GARY MARCUS: I hadn't thought about that one in a while, yes.
ANJA KASPERSEN: Essentially what you are advocating for is to deal with the magnitude of both the challenges and opportunities we are collectively facing as we are embedding AI systems and algorithmic technologies in sensitive systems. We also need to clean up science to have that honesty and approach that you just referred to.
GARY MARCUS: That's right, and I think in some domains a reckoning came, and that's good. In psychology and medicine in particular, people realized there was a replicability crisis, and they really tried to approach it. As compared to when I wrote that nine years ago—I was writing about something that was in the wind; I didn't invent it—the winds of change came. People, for example, now take seriously replicating studies and registering what's being done in advance.
Machine learning needs to do the same stuff. Joelle Pineau has written about this. It is still mostly a culture of, "I published the one run that looks really impressive" and not publishing all the data. Like GPT-3: "Here's a report of what we did."
What else did you try? How does it compare with other possibilities? What if you just change the random seed at the beginning? Will you get the same result?
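The random-seed question can be turned into a routine check. The following is a minimal Python sketch, not a description of any lab's actual practice: `run_experiment` is a hypothetical stand-in for a full training and evaluation run, and the simulated 0.8 accuracy is an assumption chosen only for illustration.

```python
import random
import statistics


def run_experiment(seed, n_samples=200):
    """Hypothetical stand-in for one training-and-evaluation run.

    Simulates scoring a model whose true accuracy is assumed to be 0.8
    on n_samples test items, with all randomness controlled by the seed.
    """
    rng = random.Random(seed)
    correct = sum(rng.random() < 0.8 for _ in range(n_samples))
    return correct / n_samples


def seed_sweep(seeds):
    """Repeat the experiment across seeds and report the spread,
    instead of reporting only the single best-looking run."""
    scores = [run_experiment(s) for s in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


if __name__ == "__main__":
    report = seed_sweep(range(10))
    print(f"mean={report['mean']:.3f} stdev={report['stdev']:.3f} "
          f"min={report['min']:.3f} max={report['max']:.3f}")
```

Reporting the mean and spread across seeds, together with the seeds themselves, lets another group rerun exactly the same configurations and see whether a headline number is typical or a lucky draw.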
There is very little of that. It's starting to happen, but there has been too little in machine learning of two pieces of science. One is candid disclosure, and OpenAI is the worst. They have this name that sounds like they're open, and yet they won't let a scientist like me try out the model to see what's wrong with it. That's completely scandalous from the perspective of science.
So you need openness. You need replication. You need to let other people have access, for example, to your database to see if they can replicate. OpenAI doesn't give anybody that.
Here's a question: How much of GPT's answers are plagiarized directly from the database? A basic question we should have an answer to, and nobody has an answer because the database is proprietary. We don't know how much is just regurgitation.
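If the training data were available, one simple way to estimate regurgitation would be to measure how many of an output's word sequences appear verbatim in that data. The following Python sketch assumes whitespace tokenization and a small in-memory corpus. The function names are hypothetical, and a real audit would need far more scalable indexing and a more careful definition of copying.

```python
def ngrams(tokens, n):
    """Return all consecutive n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def verbatim_fraction(output, corpus_docs, n=5):
    """Fraction of the output's n-grams that occur verbatim in the corpus.

    Returns 0.0 when the output is shorter than n tokens.
    """
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams.update(ngrams(doc.lower().split(), n))

    out_ngrams = ngrams(output.lower().split(), n)
    if not out_ngrams:
        return 0.0
    hits = sum(g in corpus_ngrams for g in out_ngrams)
    return hits / len(out_ngrams)


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog"]
    print(verbatim_fraction("the quick brown fox jumps", corpus, n=3))
    print(verbatim_fraction("the quick brown cat sleeps", corpus, n=3))
```

A value near 1.0 would suggest the output is largely copied, and a value near 0.0 that it is not copied at this n-gram length. The check is only possible for someone with access to the corpus, which is the point being made here.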
There are nice independent projects like EleutherAI and there are public projects that are trying to do some of the same stuff, but OpenAI is at the absolute, most scandalous side of this. You lack replicability, you lack openness, and so forth.
The other thing that science really needs is for people to compare competing hypotheses. If all you have is, "It took me $5 million to run it this way, I didn't run it any other way, and you can't run it either," how on earth are we going to make scientific progress? It is just not a recipe for success. Basic tenets of science are you share your data, other people try to replicate, and you consider alternative hypotheses.
ANJA KASPERSEN: To learn from particle physics.
GARY MARCUS: The kids today would say, "Hello." Like, "Hello, those are the premises of science. Without it, you're lost."
ANJA KASPERSEN: So we need to learn from CERN and also open up an AI discourse where we encourage people to share limitations and share their research.
GARY MARCUS: If we can get past people's egos and do science as it is meant to be done, we can solve AI.
ANJA KASPERSEN: That's a good final note. Thank you so much, Gary, for taking the time to be with us, sharing your knowledge and deep insights. Thank you to all of our listeners and the team at the Carnegie Council for Ethics in International Affairs for hosting and producing this podcast. My name is Anja Kaspersen, and I hope I earned the privilege of your time. Thank you.