When Patterns Mislead

Conrad Wolfram, strategic director and European cofounder/CEO of Wolfram Research. Image source https://www.conradwolfram.com

One of the devices regular writers and bloggers turn to when an article is due and nothing quite right has cropped up is to look for anniversaries that occur when the article is published. Which is what led to this month’s post.

In this case, I did have a topic (mathematics education and assessment in today’s world), and I’ll get to it, but it wasn’t sufficient to spend an entire post on, since I’ve written many times here and elsewhere about how technological developments have changed the way mathematics is used, and therefore may (in fact do) require changes in how it is taught and how it is assessed. So, as I do from time to time, I looked back at the posts I published in the same month (June in this case) ten, twenty, and twenty-five years ago (Devlin’s Angle started in January 1996) to see if there was a connection to the topic I wanted to cover. (Good teachers regularly scan the news to see if there is a story they can use to show the relevance of the day’s classroom topic(s); it not only introduces an element of topicality, but the link often provides a new perspective from which to look at the topic.) Here’s what I found:

June 2013:  Will Cantor’s Paradise Ever Be of Practical Use? A discussion of whether the theory of infinitary sets will ever find an application in real life.

June 2003:  When is a proof? Reflections on the nature of mathematical proof, and whether the very notion of “fully proved” (as opposed to proved beyond reasonable doubt) is ever truly achieved.

June 1998:  The Bible Code That Wasn't. Reflections on whether the claims made in the bestselling book The Bible Code (I’m not linking to the book) about hidden messages in The Bible were true. (Answer: no.)

That last one was a match for my chosen topic (the effectiveness of current mathematics assessments), which I identified when I read the 23 March 2023 blog post by the UK-based Conrad Wolfram, titled Game Over for Maths A-level.

As Wolfram reported, when the Wolfram Research team combined ChatGPT (which I wrote about in last month’s post) with a Wolfram Alpha plug-in, the system scored 96% in a UK Maths A-level paper. That’s the exam taken at age 18 at the end of high school, and is a crucial metric for university entrance.

On its own, ChatGPT had scored 43%.

Towards the end of the first half of my (half-century) career (spent mostly in Europe), I became interested in AI (to the point of being offered a Chair in AI at a major UK university). But, after moving to the US in 1987, I spent the entire second half of my career working on large-scale applied technology projects that involved human-computer interaction, eventually running three Stanford research centers (two of which I co-founded) that focused on that broad, interdisciplinary topic. In that work, I collaborated closely with experts in linguistics, socio-linguistics, ethnography, and systems design, as a result of which I have for several decades been a fully paid-up skeptic about the possibility of creating digital systems capable of understanding and intelligence (as we understand those terms).

I wrote about that skepticism (by no means unique to me) in last month’s post. A similar view was superbly articulated by robotics pioneer Rodney Brooks in an interview for IEEE Spectrum published last month. Do read it. (If this were one of my classes, it would figure heavily in the quiz at the end. And it would be far more than multiple-choice.)

Rodney Brooks, robotics pioneer. Image source https://spectrum.ieee.org/gpt-4-calm-down

Plainly expressed, there is no possibility of AI (at least as presently conceived) getting smarter than people (in the sense we mean by “smart”).

What there is a danger of — and it is, I think, significant — is that our societies will be seduced by an appearance of intelligence into incorporating AI systems into our lives and our institutions to autonomously perform tasks that require intelligence. (And it’s already happening with the legal system.)

The Media Equation, Byron Reeves and Clifford Nass, CSLI Publications, Stanford University (2003)

The point is, evolution has shaped us to automatically ascribe intelligent agency to any behavioral entity we interact with, based on a very small number of clues. How remarkably little it takes for us to ascribe agency was demonstrated by the studies made by Stanford’s Byron Reeves and (the late) Cliff Nass and published in their book The Media Equation, which shows us just how vulnerable we are to being misled. (The book’s subtitle is “How People Treat Computers, Television, and New Media Like Real People and Places”. This is not just theory; the authors are/were experimentalists.)

If ever there was a dramatic, and very early, indication of this danger when it comes to AI, it was Joseph Weizenbaum’s “AI system” Eliza, created in 1964-66. Check that out too if you are not already familiar with it.

Current Large Language Models such as ChatGPT are really just far more sophisticated, and vastly scaled-up, versions of Eliza (built on a very different architecture).
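For anyone who has never seen Eliza in action, here is a bare-bones sketch of the kind of trick it relied on. (This is my own toy illustration in Python; the rules and wording are invented, and Weizenbaum’s actual DOCTOR script was considerably more elaborate.) A handful of keyword patterns plus pronoun reflection is enough to produce replies that can feel attentive, with nothing that could be called understanding anywhere in sight.

```python
import random
import re

# Crude Eliza-style rules: a regex keyword pattern and some canned reply templates.
# These particular rules are illustrative only, not Weizenbaum's originals.
REFLECT = {"i": "you", "me": "you", "my": "your", "am": "are", "your": "my", "you": "I"}
RULES = [
    (r"i need (.*)", ["Why do you need {0}?", "Would it really help you to get {0}?"]),
    (r"i am (.*)",   ["How long have you been {0}?", "Why do you think you are {0}?"]),
    (r"(.*)",        ["Please tell me more.", "I see. Go on."]),
]

def reflect(phrase):
    # Swap first- and second-person words so the reply reads back naturally.
    return " ".join(REFLECT.get(word, word) for word in phrase.lower().split())

def respond(sentence):
    # Use the first rule whose pattern matches, filling in the reflected fragment.
    for pattern, replies in RULES:
        match = re.match(pattern, sentence.lower())
        if match:
            return random.choice(replies).format(*(reflect(g) for g in match.groups()))

print(respond("I need a break from grading"))     # e.g. "Why do you need a break from grading?"
print(respond("I am worried about my students"))  # e.g. "How long have you been worried about your students?"
```

That, in essence, is the whole mechanism; yet many of Eliza’s early users confidently attributed understanding to the program.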

The publishing success of The Bible Code shows how susceptible people are to ascribing some form of significance to a pattern. In itself, that’s a good thing. Indeed, pattern recognition is the basis of rationality. Discerning patterns and acting accordingly is what brains (in general, not just humans’ or primates’ brains) evolved to do.

Pattern recognition is so fundamental, and so important to our survival, that we seek, and usually find, an explanatory “theory” whenever we discern a pattern. We do that even if the pattern we discern is in random data, which means that the pattern is actually in our minds, and we have somehow managed to superimpose it on the data. That’s what occurred with The Bible Code.
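To see how readily “hidden messages” emerge from pure noise, here is a small computational sketch. (It is my own illustration with arbitrarily chosen parameters, not a reconstruction of the actual Bible Code analysis.) It generates a long string of random letters and then searches it for words hidden as equidistant letter sequences, the kind of pattern the book claimed to uncover.

```python
import random
import string

def find_els(text, word, max_skip=40):
    """Return (start, skip) if `word` occurs in `text` as letters spaced a
    constant `skip` apart (an "equidistant letter sequence"), else None."""
    n, k = len(text), len(word)
    for skip in range(1, max_skip + 1):
        for start in range(n - (k - 1) * skip):
            if all(text[start + i * skip] == word[i] for i in range(k)):
                return start, skip
    return None

random.seed(0)
# 100,000 random lowercase letters: pure noise, with no author and no message.
noise = "".join(random.choice(string.ascii_lowercase) for _ in range(100_000))

for word in ["code", "omen", "doom"]:
    print(word, find_els(noise, word))
# Short words routinely show up as ELSs in random text; the "pattern" lives in
# the search procedure we bring to the data, not in the data itself.
```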

When we see signs that this has happened, our response should be to examine “what went wrong” and adjust our future behavior accordingly. There is usually a valuable lesson to be learned from making that response. In the case of The Bible Code, the lesson was an increased awareness of the often significant, and potentially misleading, role played by how we categorize the data.

Large Language Models such as ChatGPT work by recognizing patterns in language; specifically, statistical frequency patterns in large corpora of text. That’s it. Not meaning or truth, just the frequencies of tokens (words and word fragments) in the textual database.
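To make that concrete, here is a toy sketch of generating text purely from frequency counts. (It is my own illustration; GPT-style models use a trained neural network over tokens, not a simple word-pair tally, but the currency is the same: statistical regularities in text.) Tally which word follows which in a corpus, then sample continuations from those tallies; meaning and truth never enter the computation.

```python
import random
from collections import Counter, defaultdict

# A tiny "corpus"; the sentences are illustrative only.
corpus = ("the proof is in the pudding . "
          "the proof is by induction . "
          "the pudding is ready .").split()

# Tally frequency patterns: how often each word follows each other word.
follows = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    follows[word][nxt] += 1

def generate(word, length=8):
    # Extend the text by repeatedly sampling a next word in proportion to its frequency.
    out = [word]
    for _ in range(length):
        options = follows.get(word)
        if not options:
            break
        word = random.choices(list(options), weights=list(options.values()))[0]
        out.append(word)
    return " ".join(out)

random.seed(2)
print(generate("the"))  # locally plausible word sequences, produced with no notion of meaning
```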

The reason all the experts were surprised by ChatGPT’s outputs was that no one appreciated just how fluent and natural the output would be when the training data reached a sufficiently large size. But since meaning and truth are not input factors, they cannot be reliably ascribed to the outputs.

Just as we had to accept that AI systems’ performance in playing chess did not mean they were smarter than us (AIs don’t “play chess” in our sense), we now have to adjust to the reality that their performance in generating fluent, and plausible, text likewise (to borrow from The Princess Bride) “does not mean what you think it means.”

One of the lessons I learned from working with ethnographers and ethnomethodologists was that, while we have to categorize data in order to make use of it, the moment we introduce a categorization, we lose the data itself, and are then working with categorized data, a fictitious entity of our creation, which carries with it any and all assumptions (conscious or otherwise) we made in performing the categorization. (Exercise for the reader: How does ChatGPT categorize the language data, both in the training phase and the input instructions we give it? How do we categorize (and hence classify) the language output from the system? Discuss any differences and possible implications thereof, paying particular attention to what Reeves and Nass taught us.)
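Here is a deliberately mundane illustration of that general point, using hypothetical exam scores and an arbitrary grading scheme of my own choosing (nothing to do with ChatGPT’s internals). The moment the raw scores are replaced by grades, the original information is gone, and every later conclusion inherits the assumptions baked into the cutoffs.

```python
# Hypothetical raw exam scores (made up for illustration).
raw_scores = [49.4, 50.1, 69.9, 70.0, 88.5]

def grade(score, boundaries=((70, "A"), (50, "B"))):
    # The cutoffs encode our assumptions; different choices tell a different story.
    for cutoff, label in boundaries:
        if score >= cutoff:
            return label
    return "C"

grades = [grade(s) for s in raw_scores]
print(grades)  # ['C', 'B', 'B', 'A', 'A']
# 69.9 and 70.0 were all but identical in the data, yet the categories declare
# them as different as B and A, while 50.1 and 69.9 now look the same.
```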

Truly rational behavior on our part requires that we are always ready to revise our categorization, should evidence arise that causes us to question it. That is not always easy, especially if we have considerable investment in the categorization and all that depends on it.

Just look at how much difficulty today’s societies are having in coming to terms with the relatively recent realization, at the societal level, that our once seemingly simple categorization of people into two genders was in fact simplistic, and does not accurately reflect reality; rather, we imposed it on the population, based on a (totally natural-seeming) belief system. Many cultures throughout history categorized people into three genders.

In the case of mathematics education and assessment in the era of both Wolfram Alpha and systems like ChatGPT, the lesson to learn from the recent Maths A-level success of the combination of those two (as reported by Wolfram) is that our system of assessing mathematics performance needs to be revised, likely considerably so. The goal of mathematics assessment has always been (in theory) to determine how capable someone is of using mathematics effectively. The current assessments, by and large, worked sufficiently well for enough students in earlier times, though they were always assessments by proxy, since they assessed only one of several abilities required to use mathematics effectively, namely the ability to perform well in solving carefully selected, highly constrained, well-defined problems having single, precise answers. (Such problems rarely arise in real-world applications of mathematics.)

I concur with Conrad Wolfram’s conclusion here. Neither ChatGPT nor Wolfram Alpha poses a threat to mathematics education and assessment as important enterprises to be supported, but they do show us we have a major problem with how those enterprises are currently undertaken.

The A-level success of the ChatGPT–Wolfram Alpha team makes it clear that it is now (well past) time to change both the way we teach mathematics and the way we assess it. That lesson should have been learned (but wasn’t) back in the late 1980s, when systems such as Wolfram Mathematica came out, as I have argued many times on this forum and elsewhere. But with the addition of ChatGPT into the mix resulting in an A-level superstar, that change can no longer be put off. The time for change is now.

As Brooks observed, we need to stop confusing performance with competence. He was referring to AI systems. But the same is equally true for mathematics education and mathematics assessment.

Both Wolfram Alpha and ChatGPT show us just how much better computers are than us when it comes to performance. But when it comes to intelligence, they have none. They are just very good at faking it. Survival in the world of natural selection (which is the one we live in) depends on adapting to changing circumstances. Researchers currently sounding the alarm do have a point; there is a danger posed by AI. But it is not a danger caused by their superior intelligence; rather, the danger is that we do not use ours, and fail to recognize that, now that some extremely good performers have entered the picture (performers far better than us at what we used to think of as the core, and hence a reliable metric, of intelligence), we need to learn to distinguish performance from competence.

IMAGE CREDIT: The thumbnail image for this month’s post is a Getty image used in the article Constellations Across Cultures: How Our Visual Systems Pick Out Patterns in the Night Sky, Association for Psychological Science, April 29, 2022. You should definitely check out that article; it is very relevant to this post.

USEFUL SOURCE: I was alerted to the Rodney Brooks article by the May 23 blog post of John Naughton, whose daily Memex posts I subscribe to and highly recommend. Naughton is a fascinating guy with a wide range of talents, as a quick look at his Wikipedia entry (linked) will indicate. His eclectic daily musings, which frequently veer into MAA territory, make an excellent start to the day.

NOTE: Negotiating the Devlin’s Angle archives is a bit tricky, since they are spread over four generations of the MAA website. Here is a quick guide (but note that each site has its own navigation system, and you may have to search for the index on the respective home page):

Jan 1996 – Dec 2003: https://www.maa.org/external_archive/devlin/devlin_archives.html

Jan 2004 – July 2013: https://www.maa.org/external_archive/devlin/devangle.html

Aug 2011 – Dec 2018: http://devlinsangle.blogspot.com/2019/

Jan 2018 – date: https://mathvalues.squarespace.com/masterblog/category/Devlin%27s+Angle