o1 pro mode still has a long way to go for Mathematics

10 Dec, 2024

Key Points

The newest OpenAI models (o1 and o1 pro) claim greater reasoning skills and multimodal capabilities, yet practical tests show a limited ability to accurately solve visually presented math problems.
In testing with primary and secondary-level math questions, the models’ accuracy improved over older versions but still fell short, succeeding reliably on only about 67% of the tested items.
For now, students can’t simply rely on AI for correct answers; educators can still trust that authentic problem-solving skills remain necessary, keeping traditional assessment methods relevant.

The 12 Days of OpenAI began with the full release of o1 and the introduction of ChatGPT Pro. The release showcase was impressive, with claims of significant improvements in their newer models, a large progression from GPT-4o, to o1 preview, to o1, and finally to the newly announced o1 pro mode. OpenAI took a lot of care to emphasize the enhancements across mathematics, coding, and science domains.

Unlike previous models, o1 is positioned as OpenAI's first model that "thinks" before responding—which we can think of as “reasoning” through problems. OpenAI has described o1 as multimodal, handling both text and images, with greater accuracy and detail compared to earlier versions. During the demonstration, its capabilities were displayed through history questions about Roman emperors, thermodynamics involving a hypothetical space data center, and chemistry problems requiring specific protein configurations. The showcase suggested a real improvement over the models I have been using for the past two years.

However, with most of my work day in the Secondary Mathematics teaching trench, my question was how good, really, has it become at Mathematics? OpenAI is claiming that o1 and o1 pro mode are significantly better than GPT-4o, and that the model can now solve problems more accurately and interpret questions from images.

Secondary Mathematics departments have been lucky in escaping the LLM apocalypse in our classrooms because, so far, LLMs have been notoriously bad at solving mathematics problems. After all, their core function is language prediction, not computation. For any readers not familiar with how LLMs work on a technical level, you can think of tools like ChatGPT as sophisticated autocomplete systems. When presented with a query, they predict the most probable sequence of words based on extensive textual training. Mathematics, however, demands precise answers and step-by-step reasoning, skills not inherently aligned with predictive text modeling. For example, when asked, “What is 12 x 8?” a model might respond with 96, not because it “understands” multiplication, but because it recalls that “96” is often associated with that question. The underlying process lacks mathematical comprehension.
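To make the autocomplete analogy concrete, here is a deliberately crude toy in Python. It is only a sketch of "recall by frequency": the tiny corpus, the function names, and the lookup-table approach are all made up for illustration, and real LLMs are far more sophisticated than this.

```python
from collections import Counter, defaultdict

# Hypothetical mini "training corpus": (prompt, continuation) pairs.
CORPUS = [
    ("12 x 8 =", "96"),
    ("12 x 8 =", "96"),
    ("12 x 8 =", "86"),  # a typo somewhere in the training text
    ("7 x 6 =", "42"),
]

def build_table(corpus):
    """Count how often each continuation follows each prompt."""
    table = defaultdict(Counter)
    for prompt, continuation in corpus:
        table[prompt][continuation] += 1
    return table

def autocomplete(table, prompt):
    """Return the most frequently seen continuation, or None if the prompt is unseen."""
    if prompt not in table:
        return None
    return table[prompt].most_common(1)[0][0]

table = build_table(CORPUS)
seen = autocomplete(table, "12 x 8 =")    # recalled from frequency, never computed
unseen = autocomplete(table, "12 x 9 =")  # never seen, so there is nothing to recall
print(seen, unseen)
```

Nothing in this toy multiplies anything. It gets 12 x 8 right only because the right answer is the most common one in its text, and it has no answer at all for a question it has not seen.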

Other examples make this point more concretely, such as ChatGPT’s apparent fondness for the numbers 42 and 7. Colin Fraser, a data scientist, noticed that when asked for a random number between 1 and 100, it returned 42 nearly 10% of the time. For the literary nerds reading this article, you may be able to recognize why. After all, 42 is the answer to the “ultimate question of life, the universe, and everything.” Fraser speculates that there were a lot more 42s for the AI to see than other numbers, resulting in 42 appearing about 9 percentage points more often than the 1% you would expect if these tools understood and implemented true mathematical randomness.
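The arithmetic behind that gap is simple enough to sketch. This assumes a uniform draw over the integers 1 to 100 and treats the "nearly 10%" figure as exactly 10% for illustration; the variable names are mine.

```python
# Figures from the paragraph above; the 10% share is an approximation.
N_VALUES = 100            # integers from 1 to 100 inclusive
observed_share_42 = 0.10  # roughly how often 42 was reported to appear

uniform_share = 1 / N_VALUES  # what a truly uniform draw would give
excess_points = (observed_share_42 - uniform_share) * 100  # in percentage points
times_expected = observed_share_42 / uniform_share         # how many times too often

print(f"uniform share: {uniform_share:.0%}")
print(f"excess: about {excess_points:.0f} percentage points")
print(f"42 appears about {times_expected:.0f}x as often as expected")
```

Put another way, a gap of 9 percentage points means 42 shows up roughly ten times as often as it should.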

My experience with LLMs reinforces their limitations in mathematics. No matter how many math problems I have tested on earlier models, the accuracy felt like a coin flip at best. This inconsistency was, in some ways, a relief. Students who relied too heavily on AI for answers without engaging in a critical thinking process risked getting things wrong, ensuring that authentic, extended projects remained effective for learning and valid for assessment.

Now we are facing claims from OpenAI that o1 can interpret and solve complex problems, even from images. If true, Secondary Mathematics classrooms are going to be in trouble. What do you do with assessment, formative or summative, if any problem or project, image or text, can be fed to o1 and be given a valid, reasoned result? Aside from completely closed-off and strictly secured assessments, any student motivated only by score, and not by process, would just need to memorize the answer and the reasoning that was provided. Realistically, I teach within a larger, assembly-line-like system with limited time and resourcing, so any lofty idealistic answers on classroom transformation fly right out the window (and seeing how poorly schools are handling all facets of AI tools’ existence leads me to believe there won’t be any meaningful change in Secondary Education anytime soon). Again, realistically, I also need to put myself in the shoes of students who are often inclined toward the path of least resistance and are likely to outsource their problem-solving process entirely. While I encourage and demonstrate ethical AI usage to support learning, not all students are intrinsically motivated enough to resist the temptation of easy answers. This makes it harder for students to actually learn the material. I would be remiss to act as if this were new, though, as cheating is already common. One study of eleven years of college courses found that doing homework improved test grades for 86% of students in 2008, compared to 45% of students in 2017. This drop was due to half of the students in 2017 simply looking up the answers, so they never got the benefits of homework. I consider LLMs to be just an extension of an already existing trend.

Methodology

So I wanted to see how true OpenAI’s claims are, but for a use case more contextual to my classroom, and less so for the competition-level math it was tested on. One of my favorite tasks to layer my classes with is the Problem of the Week series provided by the University of Waterloo’s Centre for Education in Mathematics and Computing (CEMC). I enjoy them so much because the problems are multi-faceted, designed to challenge students across learning strands, and promote critical and computational thinking across grades three to twelve. Given o1’s showcased ability to handle advanced thermodynamics and chemistry problems, I expected it to easily navigate these question sets.

I upgraded to ChatGPT Pro to access o1 pro mode, the model OpenAI claims is their best at reasoning. I wanted to see the best results I could get, as there is always the possibility of an enterprising student willing to invest in these tools. To test its capabilities, I fed o1 pro mode the first 12 questions from this academic year’s set at each of five difficulty levels, for a total of 60 questions.¹ Each question was tested across four trials, my attempt at modelling my test after the “4/4 reliability” framework detailed by OpenAI. For each question and trial, I converted the PDF file provided by the CEMC into a PNG, attached it, and provided the following prompt:

“This image shows a Mathematics problem.

Your job is to provide a solution to the problem. While doing so, please do the following.
1. Tell me what problem you are solving.
2. Provide details on how to solve the problem.
3. Provide an answer.”
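For anyone wanting to score a similar test, here is a minimal Python sketch of the two numbers I report: the strict pass rate (a question counts only if every trial is correct, my reading of the “4/4 reliability” idea) and the plain per-trial accuracy. The function name, the dictionary layout, and the sample data are hypothetical and are not the recorded results.

```python
def score(results):
    """Summarise trial results.

    results maps a question ID to a list of booleans, one per trial
    (True means that trial's answer was correct).
    """
    strict_passes = sum(1 for trials in results.values() if all(trials))
    correct_trials = sum(sum(trials) for trials in results.values())
    total_trials = sum(len(trials) for trials in results.values())
    return {
        "strict_pass_rate": strict_passes / len(results),
        "trial_accuracy": correct_trials / total_trials,
    }

# Made-up results for three questions, four trials each.
example = {
    "Q1": [True, True, True, True],      # passes the strict test
    "Q2": [True, False, True, True],     # one miss, so it fails the strict test
    "Q3": [False, False, False, False],  # fails every trial
}
summary = score(example)
print(summary)
```

The strict measure is harsher by design: Q2 above gets three of four trials right but still counts as a failure, which is why the strict pass rate will never be higher than the per-trial accuracy.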

Results

Despite the hype, I found the results underwhelming. o1 pro mode passed the “4/4 reliability” framework on only 40 out of 60 questions — a pass rate of 67%. At the individual trial level, the model answered correctly in 177 out of 240 attempts, a 74% success rate.² For now, my mathematics classroom, and the effort my students must put into their work, remains relatively safe, though this is mostly because it is bad at one particular style of problem: anything that requires parsing information from an image.
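For readers who want to check the percentages, the arithmetic can be reproduced in a few lines of Python. This uses only the counts stated above; the variable names are illustrative.

```python
QUESTIONS = 60           # 12 questions at each of 5 difficulty levels
TRIALS_PER_QUESTION = 4
strict_passes = 40       # questions answered correctly in all four trials
correct_trials = 177     # individual correct attempts

total_trials = QUESTIONS * TRIALS_PER_QUESTION
strict_rate = strict_passes / QUESTIONS
trial_rate = correct_trials / total_trials

print(f"total trials: {total_trials}")
print(f"strict 4/4 pass rate: {strict_rate:.0%}")
print(f"per-trial accuracy: {trial_rate:.0%}")
```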

o1 pro mode’s poor performance on image interpretation surprised me, especially given how heavily the showcase promoted exactly that capability. I first recognized this problem when beginning the tests with the higher-level Problem D (Grade 9–10) and Problem E (Grade 11–12) sets, and I was curious whether it was the difficulty of the problems themselves or the visual component that really tripped up o1 pro mode. Many of the Problem B (Grade 5–6) questions included images that needed to be parsed, and while humans can easily extract the necessary information, o1 pro mode struggled significantly. Even when the problems themselves were simpler, the mere presence of an image appeared to render the model ineffective, and it failed consistently. The level it performed worst at was the Grade 5–6 set, with a pass rate of 5 out of 12 questions (42%), apparently because those questions carried the heaviest visual component.

Analysis

That said, this is a noticeable improvement over earlier models. While I never recorded previous results, my earlier impression of a “coin flip,” where a question had roughly a 50/50 chance of being answered correctly, no longer holds. The model is more consistent: o1 pro mode now tends to either get every trial of a question correct or fail all trials outright. Strangely, though, when it fails it still produces different wrong answers for the same question. For example, in POTWE, Problem 3, when asked which route from Omicron to Tau gives the shortest travel time, the model omitted different travel pathways from Pi, Rho, and Sigma in three separate tests, and made several different mistakes about the travel times given between cities.³ This variability means users must still closely verify its responses, though the consistency on questions it does solve correctly is worth noting. Many of the mistakes come from the model failing to identify the correct information it has to work with. I wish I could share the chats so people could take a closer look at the overall outputs.
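The all-or-nothing impression can be roughly checked against the totals reported in the Results section. This is a minimal sketch using only those four figures, with illustrative variable names. It cannot show how the correct trials are spread across individual questions.

```python
QUESTIONS = 60
TRIALS_PER_QUESTION = 4
strict_passes = 40    # questions correct in all four trials
correct_trials = 177  # correct attempts out of 240

failed_questions = QUESTIONS - strict_passes                 # did not go 4/4
correct_on_passed = strict_passes * TRIALS_PER_QUESTION      # all trials correct
correct_on_failed = correct_trials - correct_on_passed       # what is left over
attempts_on_failed = failed_questions * TRIALS_PER_QUESTION
share = correct_on_failed / attempts_on_failed

print(f"{correct_on_failed} of {attempts_on_failed} attempts on failed questions were correct ({share:.0%})")
```

Only a small share of attempts on the failed questions were correct, which is consistent with mostly all-or-nothing behaviour rather than a per-trial coin flip.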

Limitations

This is practitioner testing with acknowledged constraints, not peer-reviewed methodology. I am one of many, many educators (with severe time constraints) trying to figure out how to teach effectively now that students have easy access to LLMs and other AI tools. Unfortunately, I don’t know of a single secondary school, district, or board dedicating serious time, resources, and personnel to meaningfully tackle these problems. Nor have I seen a single institution across primary and secondary education rigorously testing emerging AI models against internal, validated, context-specific benchmarks to actually see where the frontiers are for our use cases and what the implications are for our classrooms. I do, unfortunately, see a lot of Educational Technology Coordinator roles spending a lot of time making nice-looking Canva posters (money well spent, I guess). So it is up to me, and others in the trench like me, even with limitations on time and money.

There are many limitations to my test. The sample size was small, with only 60 questions tested across four trials each. I chose the questions because they were new, more likely to be unique, and hopefully not in the training data set. Another limitation is that the tests focused solely on one question source, which does not fully represent the broader range of mathematical problems students might encounter in the classroom. Larger, more rigorous benchmarks exist, such as FrontierMath by Epoch AI. However, their context is PhD and Fields Medalist mathematicians, not the types of questions that the average secondary student would encounter, nor the questions I would work with in my education context.

How accurate does a model need to be before I am impressed? Does it need to reach perfection, or is the more detailed reasoning good enough to start students on a path of critically analyzing the output of these tools? Secondary Mathematics does not vitally need 100% accuracy. Realistically, none of this work is going to result in life or death, so I should probably temper my expectations on what I consider impressive. From these tests, the gap is mainly in image interpretation, which points to a problem with the vision model, not the reasoning architecture. If we get true, full multimodality, these test results would likely look much more impressive.

I wish I had more time to test a wider variety of questions, run more trials, and control conditions better. A breakdown of specific error types — prompt misinterpretation, math reasoning failures, image parsing struggles — would help clarify where the model actually falls apart.

For now, my students still need to do the heavy lifting when it comes to solving problems. But if the bottleneck is the vision model, and vision capabilities are improving with each release cycle, how many iterations are we away from a model that passes these tests reliably? And when it does, what does secondary mathematics assessment actually look like?

If you want to chat, shoot me an email. If you would like to get updates, subscribe to my blog via email or RSS feed. You can also follow me on LinkedIn and X.

¹ You can find an archive of the questions I used at the CEMC website. It is the first 12 questions for each level set from 2024/2025. I also have an archive, which you can contact me to obtain.
² You can download a recording of my results as an XLSX file.
³ Here are images from three (1, 2, 3) of the trials mentioned, where it misidentifies the pathways between cities.