I was asked recently by a student how machine learning could happen. I started out by talking about human learning: how we don’t consider mere parroting of received information to be the same as learning, but that we can make the leap from some examples we have seen to a new situation or problem that we haven’t seen before. Granted, there need to be some similarities (shared structure or domain of discourse; we don’t become experts on European Union economics merely by learning to distinguish different types of wine), but what makes learning meaningful and fun for us is the ability to make a leap, to solve a previously inaccessible problem or deduce (really it’s ‘induce’) a new categorization.
In response, the student asked how machines could do that. I replied that not only do we give them many examples to learn from, but we also give them algorithms (ways to deal with examples) that are inspired by how natural systems work: inspired by ants or honeybees, genetics, the immune system, evolution, languages, social networks and ideas (memes), and even just the mammalian brain. (One difference is that, so far, we are not trying to make general-purpose consciousness in machines; we are only trying to get them to solve well-defined problems very well and, increasingly these days, not-so-well-defined problems as well.)
So, then the student asked how machines could make the leap just like we can. This led me to bring up overfitting and how to avoid it. I explained that if a machine learns the examples it is given all too well, it will not be able to see the forest for the trees: it will be overly rigid, and will want to make all novel experiences fit the examples in its training. For new examples that do not fit, it will reject them (if we build that ability into it), or it will make choices that its training seems to justify but that are nonetheless wrong. It will ‘overfit’, in the language of machine learning.
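To make this concrete, here is a toy illustration (numbers I made up for this post, not anything from that conversation): a degree-9 polynomial has enough freedom to pass through ten noisy training points exactly, and that is precisely what makes it unreliable on new points drawn from the same underlying curve, compared with a plain cubic fit.

```python
# Toy illustration of overfitting: a degree-9 polynomial interpolates ten
# noisy training points exactly, while a cubic captures only the broad trend.
# The comparison to watch is the error on fresh test points.
import numpy as np

rng = np.random.default_rng(0)

def true_curve(x):
    return np.sin(2 * np.pi * x)

# Ten noisy training examples, plus fresh test examples from the same process.
x_train = np.sort(rng.uniform(0, 1, 10))
y_train = true_curve(x_train) + rng.normal(0, 0.1, 10)
x_test = rng.uniform(0, 1, 200)
y_test = true_curve(x_test) + rng.normal(0, 0.1, 200)

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit the polynomial
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The degree-9 training error comes out essentially zero because the curve threads every training point; the number that tells the real story is its error on the points it never saw.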
Then it occurred to me that humans do this, too. We’ve all probably heard the argument that stereotypes are there for a reason. In my opinion, they are there because of the power of confirmation bias (not to mention, at times, selection bias as well; consider the humorous example of the psychiatrist who believes everyone is psychotic).
Just as a machine-learning algorithm that has been presented with a set of data will learn the idiosyncrasies of that data set if not kept from overfitting by early stopping, prestructuring, or some other measure, people also overfit to their early-life experiences. However, we have one additional pitfall that machines lack: we continue to encounter new situations, and we filter them through confirmation bias so that they seem to verify our misinformed or under-informed early notions. Confirmation bias preserves our good feelings about ourselves. Machines so far do not have this weakness, so they are limited only by the data we give them; they cannot filter out inconvenient data the way we do.
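For the machine, those guard rails can be very simple. Here is a minimal sketch of early stopping (a toy setup I made up, not code from any real system): gradient descent on a deliberately overparameterized linear model is halted once the error on a held-out validation set stops improving, rather than being run until the training error has been squeezed to zero.

```python
# Minimal sketch of early stopping on a toy problem: many more features than
# training examples, so the model can easily memorize noise. Training stops
# when the held-out validation error has not improved for `patience` steps.
import numpy as np

rng = np.random.default_rng(1)

n_train, n_val, n_features = 20, 20, 100        # far more features than examples
X = rng.normal(size=(n_train + n_val, n_features))
true_w = np.zeros(n_features)
true_w[:5] = rng.normal(size=5)                 # only a handful of features matter
y = X @ true_w + rng.normal(0, 0.5, n_train + n_val)

X_tr, y_tr = X[:n_train], y[:n_train]           # training set
X_val, y_val = X[n_train:], y[n_train:]         # held-out validation set

w = np.zeros(n_features)
lr, patience, tol = 0.01, 25, 1e-6
best_val, best_w, steps_since_best = np.inf, w.copy(), 0

for step in range(5000):
    grad = X_tr.T @ (X_tr @ w - y_tr) / n_train  # gradient of the training MSE
    w -= lr * grad
    val_mse = np.mean((X_val @ w - y_val) ** 2)
    if val_mse < best_val - tol:                 # genuine improvement on validation
        best_val, best_w, steps_since_best = val_mse, w.copy(), 0
    else:
        steps_since_best += 1
    if steps_since_best >= patience:             # validation has stalled: stop early
        break

train_mse = np.mean((X_tr @ best_w - y_tr) ** 2)
print(f"stopped after {step + 1} steps: train MSE = {train_mse:.3f}, "
      f"best validation MSE = {best_val:.3f}")
```

The weights kept at the end are the ones from the best validation point, not from the last step; the model is told, in effect, to stop trusting its training set the moment an outside check stops agreeing with it.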
Another aspect of this conversation turned out to be pertinent to what I do every day. Not learning the example set too well is advantageous not only for machines but for people as well, specifically for people who teach.
I have been teaching at the college level since January 1994, continuously since probably 2004, and full-time since 2010, three or four quarters per year, anywhere from two to five courses per quarter. I list all this because I need to point out, for the sake of my next argument, that I seem to be a good teacher. (I got tenured at a teaching institution that has no research requirement but very high teaching standards.) So, let’s assume that I can teach well.
I was, for the most part, not a good student. Even today, I’m not the fastest at catching on, whether it’s a joke, an insult, or a mathematical derivation. (I’m nowhere near the slowest, but I’m definitely not among the geniuses.) I think this is a big part of why I’m a good teacher: I know what it’s like not to get it, and I know what I have had to do to get it. Hence, I know how to present anything to those who don’t get it, because, chances are, I didn’t get it right away either.
But there is more to this than speed. I generate analogies like crazy, both for myself and for teaching. Unlike people who can operate solely at the abstract level, I make connections to other domains—that’s how I learn; I don’t overfit my training set. I can take it in a new direction more easily, perhaps, than many super-fast thinkers. They’re right there, at a 100% match to the training set. I wobble around the training set, and maybe even map it to n+1 dimensions when it was given in only n.
Overfitting is not only harmful to machines. In people, it causes undeserved confidence in prejudices and stereotypes, and makes us less able to relate to others or think outside the box.
One last thought engendered by my earlier conversation with this student: The majority of machine-learning applications, at least until about 2010 or maybe 2015, were for well-defined, narrow problems. What happens when machines that are capable of generalizing well from examples in one domain, and in another, and in another, achieve meta-generalization from entire domains to new ones we have not presented them with? Will they attain strong AI as a consequence of this development (after some time)? If so, will they, because they’ve never experienced the evolutionary struggle for survival, never develop the violent streak that is the bane of humankind? Or will they come to despise us puny humans?