I did a small informal test over the last few weeks to see if AI could give helpful answers to MathOverflow questions. I want to discuss the results, in case it helps the community work through possible policies around AI; think of this as data to complement the recent discussion around a literature-review answer. The question I was interested in:
Can a publicly available AI system create an accepted proof for a MathOverflow problem?
The short answer: Yes, but it's a huge amount of work to do responsibly! The technology is both better and worse than I had expected. I won't be using AI again on this site unless the technology improves and there's a clear policy allowing it. However, it's not 100% "garbage" (as described in this 2022 post) either. Perhaps more like 90-95% garbage, which is a significant difference.
The long answer is below. I apologize for the length of this post, but this new technology is quite weird, and I think the details matter. For context: I'm a computer scientist who studies AI and a longtime MathOverflow member. I'm curious about how AI might affect the site when used in friendly hands, i.e., not just to amass points but as part of contributing to the community.
What I did and what happened
I pasted about 15 MathOverflow questions into OpenAI's o1 model (the $20/month version). I got a solution once, a helpful lead once, and chaos in the remaining cases. I also tried a few of these questions with Claude and Gemini, but with uniformly worse results, so I made o1 my workhorse. Here's the breakdown:
One answer. The one direct hit was this question on combinatorics. There is a tiny "insight" that makes the answer obvious, which o1 found and described clearly. It felt like spooky magic to see a computer explain this. I triple-checked the answer, since I didn't want to add slop to a site which I am very fond of. The o1 answer felt very wordy and long to me, so I wrote my own version using o1's idea, which is now accepted with double-digit upvotes. (Like many of my 100%-human answers on this site, I've edited it a few times since then for clarity.) Granted, this is not a particularly deep solution (unlike the subsequent human answer to the same question) and in retrospect I felt slightly annoyed that I hadn't just tried to solve the problem myself. As a point of reference, however, both Claude and Gemini hallucinated a "proof" of a wrong answer to this question.
One useful lead. This question on differential topology asked for a non-handwaving answer or reference. In response, the o1 model gave me a plausible but handwaving answer based on stable mappings. I actually think o1 was basically right, but in tracking down theorems to fill the gaps, I immediately found a direct reference to the theorem in question. That answer is now accepted with multiple upvotes. I do worry my reference may have short-circuited useful discussion that could have shed more light on the problem.
These two examples were the only answers I posted, because the rest were...
A ton of time-consuming chaos. For the other questions, I can only describe the results as chaotic. In a few cases, o1 produced work that was just flat-out incorrect. It produced GPT-1-level gibberish for a question on algebra. I saw it sail perfectly through some choppy mathematical waters only to run aground by claiming that the sum of two irrational numbers had to be irrational. It invented a complicated fake Lyapunov function; it took me some time to realize it couldn't work.
Worse than the crisp-but-wrong answers, though, were the plausible arguments it produced for many other questions, which I had to give up on checking carefully because they were outside my areas of expertise. The arguments sounded reasonable, yet didn't seem airtight, and it was hard to distinguish high-level reasoning from AI hunches. At first it was fun to look through the AI answers, but after a while it felt like o1 was just a brilliant yet overconfident colleague who hears a problem, says "that obviously follows from compactness and Hilbert's work" and then suddenly remembers an urgent meeting when you ask what that means.
There was at least one "good" answer I didn't recognize. After thinking for 58 seconds, o1 gave what now appears to be the correct diagonalization in response to this matrix question. However, its proof was an intricate and (to me) opaque wall of algebra, which I decided I didn't have the patience to verify. So I didn't post anything. I think this was the right decision, but it did mean the OP had to wait for five days, not 58 seconds, for an answer.
What does this mean for MathOverflow?
The fact that AI could produce any useful answer at all beat my expectations going in. However, to avoid polluting this site, I spent many hours checking (or trying to check and then giving up on) AI-generated proofs. I'm pretty certain that if I'd spent the same amount of time thinking about the questions myself, I could have gotten at least as many useful answers, and it would have been more fun.
My biggest concern is that someone who is insufficiently worried about junk will post AI proofs without realizing how hard they are to check. Maybe systems trained on human feedback are ultimately learning to produce convincing proofs rather than correct ones. Every chatbot I tried wrote confident, fluid, initially persuasive arguments, and these felt harder to check than, say, student answers on an exam. Here are some potential ways the community might guard against these problems:
- Add a warning about how hard AI proofs are to check, right by the "submit" button for an answer.
- Explicitly ask people to rewrite any AI answers. Even when it had good ideas, not once did I see an AI response that was sufficiently concise and well-organized to make a good MathOverflow answer—not even close.
- Add an "AI checkbox" for people to disclose AI usage. This could alert others to check extra carefully, and give the community information about whether AI use is widespread (and whether or not it's helpful). This could be a formalized version of some of Will Sawin's suggested policies.
In addition, we may want to think about:
- How to handle cases where the AI model guesses a plausible formula. Would it have been better for me to post a comment with the system's correct, but unverified, formula for this question, with the disclaimer that it came from AI? Would that have helped other people get the final answer faster? Or would it have added noise to the system, demotivating other users? If I had verified it with a computer algebra program (the kind of quick check sketched below), the OP would have gotten a correct answer faster, though not nearly as understandable as the one that came five days later.
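For concreteness, here is the sort of check I had in mind. I'm not reproducing the actual question or the AI's formula here; the identity below is a made-up stand-in, and the snippet simply compares a conjectured closed form against brute-force evaluation for small cases, using SymPy as the computer algebra tool.

```python
# Hypothetical stand-in: suppose the AI "guessed" that
# sum_{k=0}^{n} k*C(n,k) equals n*2^(n-1).
from sympy import binomial

def brute_force(n):
    # Direct, term-by-term evaluation of the quantity in question.
    return sum(k * binomial(n, k) for k in range(n + 1))

def conjectured(n):
    # The guessed closed form to be checked.
    return n * 2**(n - 1)

# Spot-check the conjecture for small n.
for n in range(1, 31):
    assert brute_force(n) == conjectured(n), f"mismatch at n = {n}"
print("Matches brute force for n = 1..30.")
```

Of course, a finite numerical check like this only raises confidence in the formula; it is evidence, not a proof.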
However, I also think it's worth discussing now what happens if AI becomes significantly more capable. Progress has been fast in the past few years, and even as I finished my experiment, a new $200/month version of o1 was released, which I have not yet tested. If the problems I saw become less salient, we might see people simply use AI privately, rather than asking questions on this site. Perhaps in the future so-called "soft questions" will turn out to be critical to maintaining participation, and standards for closing questions should be adjusted accordingly. I don't know!
Final thoughts
I did hesitate before trying this test, partly because running it meant deliberately hiding my use of AI to avoid biasing the results. If one sees the reputation system as a game, then using AI presumably counts as cheating. But I'm not in love with the gamification of math, so that concern didn't carry much weight for me. Much more salient was the worry that bad answers would waste other people's time, and that good answers might reduce human conversations and connections.
I decided that if I could address questions that posters seemed to genuinely want an answer to, it was OK to do this in moderation. I promised myself to write the post you're reading now, partly for transparency and partly as a "carbon offset" for any conversation I reduced! I also added links to this thread to the two answers where I used AI. That said, I'm definitely open to hearing that this whole thing wasn't a good idea.
I'm probably not the first to try this, although I haven't seen a detailed public description. I'd be extremely interested to hear other people's experiences. One thing I don't know is how my own level of skill factors into this. If I were better at math (whatever that means) would I be able to spot or fix flaws in AI proofs more quickly? Or would I just need AI less?
Finally, although I don't plan to use AI on MathOverflow in the near future, I probably would try it for my own mathematical work if there were some small, crisp problem that I desperately needed to solve. Filling a key missing step in a larger proof might justify the 5-10% hit rate I saw.