60

I did a small informal test over the last few weeks to see if AI could give helpful answers to MathOverflow questions. I want to discuss the results, in case it helps the community work through possible policies around AI; think of this as data to complement the recent discussion around a literature-review answer. The question I was interested in:

Can a publicly available AI system create an accepted proof for a MathOverflow problem?

The short answer: Yes, but it's a huge amount of work to do responsibly! The technology is both better and worse than I had expected. I won't be using AI again on this site unless the technology improves and there's a clear policy allowing it. However, it's not 100% "garbage" (as described in this 2022 post) either. Perhaps more like 90-95% garbage, which is a significant difference.

The long answer is below. In fact I apologize for the length of this post, but this new technology is quite weird, and I think details matter. For context: My interest comes from the fact that I'm a computer scientist who studies AI, and also a longtime MathOverflow member. I'm curious about how AI might affect the site when used in friendly hands, i.e., not just trying to amass points but rather as part of contributing to the community.

What I did and what happened

I pasted about 15 MathOverflow questions into OpenAI's o1 model (the $20/month version). I got a solution once, a helpful lead once, and chaos in the remaining cases. I also tried a few of these questions with Claude and Gemini, but with uniformly worse results, so I made o1 my workhorse. Here's the breakdown:

One answer. The one direct hit was this question on combinatorics. There is a tiny "insight" that makes the answer obvious, which o1 found and described clearly. It felt like spooky magic to see a computer explain this. I triple-checked the answer, since I didn't want to add slop to a site which I am very fond of. The o1 answer felt very wordy and long to me, so I wrote my own version using o1's idea, which is now accepted with double-digit upvotes. (Like many of my 100%-human answers on this site, I've edited it a few times since then for clarity.) Granted, this is not a particularly deep solution (unlike the subsequent human answer to the same question) and in retrospect I felt slightly annoyed that I hadn't just tried to solve the problem myself. As a point of reference, however, both Claude and Gemini hallucinated a "proof" of a wrong answer to this question.

One useful lead. This question on differential topology asked for a non-handwaving answer or reference. In response, the o1 model gave me a plausible but handwaving answer based on stable mappings. I actually think o1 was basically right, but in tracking down theorems to fill the gaps, I immediately found a direct reference to the theorem in question. That answer is now accepted with multiple upvotes. I do worry my reference may have short-circuited useful discussion that could have shed more light on the problem.

These two examples were the only answers I posted, because the rest were...

A ton of time-consuming chaos. For the other questions, I can only describe the results as chaotic. In a few cases, o1 produced work that was just flat-out incorrect. It produced GPT-1 level gibberish for a question on algebra. I saw it sail perfectly through some choppy mathematical waters only to run aground by claiming that the sum of two irrational numbers had to be irrational. It invented a complicated fake Lyapunov function which took me some time to realize couldn't work.

But worse than crisp-but-wrong answers, it produced plausible arguments for many other questions that, because they were outside my areas of expertise, I had to give up on checking carefully. The arguments sounded reasonable, yet didn't seem airtight, and it was hard to distinguish high-level reasoning from AI hunches. At first it was fun to look through the AI answers, but after a while it felt like o1 was just a brilliant yet overconfident colleague who hears a problem, says "that obviously follows from compactness and Hilbert's work" and then suddenly remembers an urgent meeting when you ask what that means.

There was at least one "good" answer I didn't recognize. After thinking for 58 seconds, o1 gave what now appears to be the correct diagonalization in response to this matrix question. However, its proof was an intricate and (to me) opaque wall of algebra, which I decided I didn't have the patience to verify. So I didn't post anything. I think this was the right decision, but it did mean the OP had to wait for five days, not 58 seconds, for an answer.

What does this mean for MathOverflow?

The fact that AI could produce any useful answer at all exceeded my expectations going in. However, to avoid polluting this site, I spent many hours checking (or trying to check and then giving up on) AI-generated proofs. I'm pretty certain that if I'd spent the same amount of time thinking about the questions, I could have gotten at least as many useful answers, and it would have been more fun.

My biggest concern is that someone who is insufficiently worried about junk will post AI proofs without realizing how hard they are to check. Maybe systems trained on human feedback are ultimately learning to make convincing proofs, rather than correct proofs. Every chatbot I tried wrote confident, fluid, initially persuasive arguments. It felt harder to check these than, say, student answers on an exam. Here are potential ways the community might guard against problems:

  • Add a warning about how hard AI proofs are to check, right by the "submit" button for an answer.
  • Explicitly ask people to rewrite any AI answers. Even when it had good ideas, not once did I see an AI response that was sufficiently concise and well-organized to make a good MathOverflow answer—not even close.
  • Add an "AI checkbox" for people to disclose AI usage. This could alert others to check extra carefully, and give the community information about whether AI use is widespread (and whether or not it's helpful). This could be a formalized version of some of Will Sawin's suggested policies.

In addition, we may want to think about:

  • How to handle cases where the AI model guesses a plausible formula. Would it have been better for me to post a comment with the system's correct, but unverified, formula for this question, with the disclaimer that it came from AI? Would that have helped other people get the final answer faster? Or would it have added noise to the system, demotivating other users? If I had verified it with a computer algebra program (see the sketch below), the OP would have gotten a correct answer faster—but not nearly as understandable as the one that came five days later.
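
To make the "verify it with a computer algebra program" idea concrete, here is a minimal sketch of what such a check could look like. The identity below is a made-up stand-in for illustration (counting subsets with no two consecutive elements), not the formula from the question above; the code simply compares a suggested closed form against brute-force enumeration for small cases:

```python
# Hypothetical example: suppose a chatbot suggests that the number of subsets
# of {1, ..., n} with no two consecutive elements equals Fibonacci(n + 2).
# Before posting, one could at least check the claim numerically for small n.
from itertools import combinations
from sympy import fibonacci

def count_by_brute_force(n):
    """Count subsets of {1, ..., n} containing no two consecutive integers."""
    total = 0
    for k in range(n + 1):
        for subset in combinations(range(1, n + 1), k):
            if all(b - a > 1 for a, b in zip(subset, subset[1:])):
                total += 1
    return total

for n in range(1, 13):
    assert count_by_brute_force(n) == fibonacci(n + 2), f"mismatch at n = {n}"
print("Suggested closed form matches brute force for n = 1, ..., 12")
```

Of course, passing such a check only rules out small counterexamples; it says nothing about whether the AI's argument for the formula is actually correct.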

However, I also think it's worth discussing now what happens if AI becomes significantly more capable. Progress has been fast in the past few years, and even as I finished my experiment a new $200/month version of o1 was released, which I have not yet tested. If the problems I saw become less salient, we might see people simply use AI privately, rather than asking questions on this site. Perhaps in the future so-called "soft questions" will turn out to be critical to maintaining participation, and standards for closing questions should be adjusted accordingly. I don't know!

Final thoughts

I did hesitate before trying this test, partly because I deliberately hid my use of AI to avoid biasing the results. If one sees the reputation system as a game, then using AI presumably counts as cheating. But I'm not in love with gamification of math, so that concern didn't carry much weight for me. Much more salient was the worry that bad answers would waste other people's time, and that good answers might reduce human conversations and connections.

I decided that if I could address questions that posters seemed to genuinely want an answer to, it was OK to do this in moderation. I promised myself to write the post you're reading now, partly for transparency and partly as a "carbon offset" for any conversation I reduced! I also added links to this thread to the two answers where I used AI. That said, I'm definitely open to hearing that this whole thing wasn't a good idea.

I'm probably not the first to try this, although I haven't seen a detailed public description. I'd be extremely interested to hear other people's experiences. One thing I don't know is how my own level of skill factors into this. If I were better at math (whatever that means) would I be able to spot or fix flaws in AI proofs more quickly? Or would I just need AI less?

Finally, although I don't plan to use AI on MathOverflow in the near future, I probably would try it for my own mathematical work if there were some small, crisp problem that I desperately needed to solve. Filling a key missing step in a larger proof might justify the 5-10% hit rate I saw.

14
  • 14
    This is very interesting and cool, thank you for the hard work Martin. Commented Dec 28, 2024 at 2:25
  • 2
    This essay, bit-player.org/2023/ai-and-the-end-of-programming by Brian Hayes, might be of interest. Hayes asks an LLM to write some computer code; it writes nonsense, but with some back-and-forth he can get something useful out of it. Commented Dec 28, 2024 at 3:37
  • 3
    I assume that, when you pasted the MO questions into AI, those questions had not yet received any answers on MO. If you ask AI an MO question that already has good answers on MO, paraphrasing the question instead of copy-pasting, how does AI perform? Commented Dec 28, 2024 at 7:37
  • 2
    @GerryMyerson That is a great article, and very much mirrors my experience with AI coding. Commented Dec 28, 2024 at 12:38
  • 3
    @Dan Yes, I only did this for unanswered questions. However, I just tried it for a question I answered in 2015 and o1 gave a different, nonsense proof—for both a paraphrase and an exact replica of the question. That actually surprises me; it would be easy to imagine MO was in the training set. Commented Dec 28, 2024 at 12:44
  • 9
    Thank you for taking on and documenting this experiment! It seems like an excellent summary of what the state of the art for this narrow set of tasks is. And the way you carried it out also seems most respectful of everyone's time on MO. If similar experiments are tried as the technology keeps advancing, I hope they will use this approach as a model. Commented Dec 28, 2024 at 18:07
  • 1
    Nice question. I posted a ChatGPT-generated solution to an introductory measure theory problem about a week ago and it was not popular. I think humans still write more readable mathematics than LLMs in most cases. From an energy perspective (post-training) I think LLMs show their use case, but the ethics of delegating editing to AI is tricky and borders on censorship if there are no humans in the loop. Commented Jan 6 at 22:47
  • I believe that AI has great potential, but it shouldn't be used to tell us what is easy. An AI might think a problem as simple-sounding as the twin prime conjecture is easy, but this is wrong: many people have devoted years to it. Commented Jan 9 at 23:43
  • I think this assessment needs an update; recently I have noticed that the models have made some insane advances. My personal experience is with Gemini 2.5 Pro (Math, Reasoning). I have noticed it has the ability to solve some really difficult questions. In fact, it apparently solved an open conjecture (conjecture 3 in arxiv.org/pdf/2310.06058), according to the author of the paper. Commented Aug 21 at 17:59
  • 1
    @user127776: In the preprint you linked I could not find a conjecture 3, nor any mention of Gemini 2.5 Pro. Commented Aug 21 at 20:43
  • @JochenGlueck Sorry, it is conjecture 3.7. I am not sure whether the public version can solve it or not (I am not qualified to verify its solution). My statement about Gemini 2.5 Pro is based on my own experience, and it is very recent. Here is the author talking about it: youtu.be/QoXRfTb7ves?si=LfukvHVuWI2dqckm Commented Aug 21 at 21:33
  • 2
    @user127776: The video you linked is a promo video by Google DeepMind that cuts together a few quotes by van Garrel without providing any context. Moreover, the first version of the arXiv preprint you linked, uploaded in October 2023, contains a proof of that conjecture. Commented Aug 21 at 21:51
  • @JochenGlueck True, there is no way to know for sure what that problem was. This problem was the one being typed into Gemini, but the video does not say that this is the problem that was solved, especially since it is already proved in the paper. Whatever it was, it had a different approach according to the author. It is still impressive, and it seems to indicate that things are improving very rapidly and we are not at a plateau. Commented Aug 21 at 21:59
  • 1
    @user127776: I don't think that a promo video by a company about its own product indicates anything. Maybe things are developing rapidly, maybe not. We'll see. Commented Aug 21 at 22:09

3 Answers

38

I despise AI generated content. It's like an ever-rising sea of garbage crowding out the useful parts of the internet.

That being said, I don't think there is anything wrong with using it like you describe. I really don't care where someone gets their ideas (though for me personally, working something out on my own sounds way more fun than trying to decode the effluent spewing out of a chatbot to find the hidden gem). What is important to me is that:

  1. The person posting the content understands every detail of any argument they give, and has personally carefully checked any references to make sure they prove exactly what they claim to prove.

  2. They write it up entirely in their own words. Nothing in their answer should come from a chatbot, or even be lightly paraphrased from one.

If someone did that, then their answers would be indistinguishable from normal MO answers.

However, given how toxic the current AI stuff is, I think it is good to also require anyone using AI to acknowledge it and to explain exactly what it did for them. Eventually that might not be strictly necessary, and the citation standards for AI can be the same as for everything else. I don't, for example, say who first explained something to me at tea, but I do give credit for significant ideas when I am able to.

13
  • 8
    Why "toxic"? I would rather say "unreliable". Commented Dec 28, 2024 at 11:55
  • 4
    I came to the same conclusion as you: it's way more fun to think things through myself than clean up AI answers! But that might not be true for everyone. It's conceivable that one day "mining AI for gems" could be a valuable supporting role. I do like the idea of marking AI-assisted content as such, partly because it can help us evaluate the actual level of toxicity—now and as the technology changes. Commented Dec 28, 2024 at 13:04
  • A similar suggestion from Will Sawin on another Meta thread: make it a (formal or informal) rule to require people to disclose that they used AI - meta.mathoverflow.net/a/6109/25028 Commented Dec 28, 2024 at 13:47
  • 20
    @FrancescoPolizzi: It's toxic because it gives people the illusion of understanding. I can teach mathematics to people who want to learn, but I cannot teach people who are ignorant of their own ignorance. Commented Dec 28, 2024 at 15:01
  • 13
    @AndyPutman: FWIW, I disagree with the use of the adjective "toxic" for whatever we do not like. A human behaviour can be "toxic", but not a piece of software. Telling students to use an AI for solving their math problems could be considered a piece of toxic advice (for the reasons you said above), but the AI itself is just an inanimate, (often) unreliable tool. Commented Dec 29, 2024 at 8:56
  • 7
    @FrancescoPolizzi: Since I don't think we disagree about the aspects of my answer that are germane to MO, I'll let you have the last word. Commented Dec 29, 2024 at 14:36
  • 2
    @FrancescoPolizzi Still, let me try another argument, since I think it is very important. For me, toxicity is primarily in plain simple deception. For me, AI-generated content will remain toxic until attribution becomes obligatory. I think it is absolutely necessary to accompany all kinds of AI-generated content with information about which version of which software generated it. I realize that frequently it is a mixture of AI and human activity, but still, if AI is involved, its presence must be clearly identifiable. Commented Jan 1 at 6:03
  • @FrancescoPolizzi - your argument on people vs. tools sounds familiar to anyone who has ever looked at the arguments of firearms proponents in the US. They also argue that guns by themselves don't kill, thus guns are OK. Commented Mar 11 at 17:50
  • @DimaPasechnik: In fact, guns themselves are not toxic. They are dangerous, which is a different thing. Commented Mar 11 at 20:37
  • Just to be clear: since they are dangerous, they are not OK. Commented Mar 11 at 20:44
  • AI tools, in particular LLM-based ones, are dangerous too (just like powerful cars with faulty brakes, but worse). If proving a theorem using an LLM-based tool produces $10^7$ tons of CO2, is it a toxic tool? Commented Mar 11 at 21:55
  • Come on. Everything can be dangerous if used badly. No theorem proved by an LLM-based tool has deserved $10^7$ tons of CO2 so far; this is a slippery slope argument. Commented Mar 11 at 22:29
  • Or, perhaps, I got the pun too late... Commented Mar 12 at 11:22
8

I believe that AI is currently being optimized to sound plausible. And there is, rightfully, quite a bit of hype around AI, because it has the chance to perform better than current alternatives (like googling).

Today it does not meet the standard of providing correct proofs and arguments that we would require for an answer, and the fact that it sounds plausible makes it harder to disregard without checking in detail.

Maybe in the future we will have AIs that are built to generate correct proofs; these would have great potential.

-4

In August 2025, I tried this a bit with GPT-5. I took 5 random recent questions from MO with accepted answers (trying to avoid questions the AI could have been trained on).

In 3 out of 5 cases, the answer was the same as the accepted one. In the other two cases the answer differed in a way I could not judge because of a lack of expertise (e.g., it provided a different counterexample).

As AI evolves, it might be interesting to follow its performance on MO questions. In other areas, like computer science, the number of Stack Exchange questions has already dropped significantly, simply because AI gives good answers there. So AI might heavily impact MO in a similar way one day: many questions might no longer be asked and discussed in public.

In any case, MO questions can serve as a nice benchmark for how an AI performs in research math; in particular, one can always compare with an accepted answer and thus, in many cases, judge even without being an expert in the field.

4
  • 7
    I wrote a comment about how this is not so compelling a test because the AI may well have read the question and answer. I see now that you wrote "(trying to avoid questions the AI could have been trained on)", but I would still be suspicious about this... the standard AI systems all have access to the Internet, so even recent things may be available to them. The serious attempts to use mathematics to benchmark AI, like epoch.ai/frontiermath, all crucially use secret questions that have not been publicly posted anywhere. Commented Aug 21 at 18:49
  • 4
    Standard LLMs use the internet. Just ask about a mathematician who has some web presence but no especially significant scientific contribution. Most will even give you the citation where the information comes from. If you want to test it properly, vary those questions in a small but significant way. Commented Aug 22 at 14:16
  • 2
    I expect that AI will decrease the number of questions on MO and SE in the future. In fact, an AI has many advantages for a questioner: it answers questions immediately, i.e. one doesn't have to wait an indefinite amount of time until a human does (or does not) provide an answer. Also, an AI won't say "your question is not research level, I refuse to answer that", nor will it close a well-posed question by stating "context is missing" without giving any reason as to what is unclear. The AI will just help a user without all these quirks and is more user-friendly. Commented Aug 23 at 3:39
  • 5
    It will "help" a user, but the settings are such that the LLM will give you a positive experience rather than output "I don't know", and as a result it will make up something or give a bland non-answer couched as if it solves your problem. It will be user-friendly to the point of praising random ideas as amazing when they are going nowhere. Commented Aug 26 at 6:07
