It’s been a full year of trying to build the best coding agent!
I didn’t know that my world was about to change last July 4th, at a hackathon where I first prototyped a CLI coding tool that became Codebuff. What a ride it’s been!
From leaving Manifold, to doing YC F24, to hiring, to competing with Claude Code, all while averaging ~70-hour weeks and working most weekends — it’s been a lot!
We may not have won the first round, but I’m more fired up and excited for the future than ever.
Our bet
We got so many things right initially:
CLI first. Scoping down to just a command line tool helped us focus on the core of a coding agent.
Inject more context. Immediately reading a dozen files related to the user prompt gave a huge advantage over competitors.
No permissions checks. We ran in full YOLO mode from the very beginning, which was positively heretical at the time.
Premium tool. It makes sense to spend more when developer salaries are the alternative.
Knowledge files. We came up with the idea of knowledge.md files that are checked in to your codebase. Codebuff would automatically update these files as it learned.
Most of these are standard or becoming standard in coding agents today!
What didn’t work out
For the first 10 months, we always thought we were weeks away from breaking out and growing exponentially. During YC, we even did grow exponentially, to $5k MRR.
We regularly got people saying it was the best coding agent. But it wasn’t always reliable.
Our file editing strategy was flaky for months, much worse than Cursor’s, which uses a custom model to rewrite files.
Even after we adopted Relace’s fast rewriter model, our product still had a long tail of issues that made ~5-10% of tasks fail. Some of these issues just take time to isolate and fix, but we could have prioritized better.
Without reliability, we could not have high retention. Without high retention, Codebuff could not grow.
What we should have done
Here’s what I’d do differently after an extensive retrospective.
Build end-to-end evals and run them nightly
This would get us regular quantified feedback on how Codebuff performs as a coding agent. It would help solve reliability issues AND allow us to test hypotheses on how to further improve our product.
Because we did not have this, we spent way too much time manually testing Codebuff after every change or when evaluating whether to switch models.
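To make this concrete, here is a minimal sketch of the kind of nightly harness I mean. The task list, fixture repos, and agent command are hypothetical stand-ins, not our actual setup:

```typescript
// Minimal sketch of a nightly end-to-end eval harness.
// The tasks, fixture repos, and agent invocation are hypothetical.

import { execSync } from "node:child_process";

interface EvalTask {
  name: string;
  repoDir: string; // fixture repo the agent works in
  prompt: string;  // what we ask the agent to do
  check: (repoDir: string) => boolean; // did the change land correctly?
}

// Placeholder: swap in however you invoke your agent (CLI, SDK, etc.).
function runAgent(repoDir: string, prompt: string): void {
  execSync(`your-agent-cli "${prompt.replace(/"/g, '\\"')}"`, {
    cwd: repoDir,
    stdio: "inherit",
    timeout: 10 * 60 * 1000, // kill runaway tasks after 10 minutes
  });
}

const tasks: EvalTask[] = [
  {
    name: "add-health-endpoint",
    repoDir: "fixtures/express-app",
    prompt: "Add a GET /health endpoint that returns 200 with { ok: true }",
    check: (dir) => {
      try {
        execSync("npm test", { cwd: dir, stdio: "ignore" });
        return true;
      } catch {
        return false;
      }
    },
  },
  // ...more tasks covering edits, refactors, failing tests, model swaps, etc.
];

let passed = 0;
for (const task of tasks) {
  try {
    runAgent(task.repoDir, task.prompt);
    if (task.check(task.repoDir)) passed += 1;
    else console.error(`FAIL (check): ${task.name}`);
  } catch (err) {
    console.error(`FAIL (crash): ${task.name}`, err);
  }
}
console.log(`Nightly evals: ${passed}/${tasks.length} passed`);
```

The point is that every task exercises the full loop (prompt in, verified change out), so a regression in file editing, context selection, or model choice shows up as a number the next morning.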
Cut every feature that is not core
We thought we scoped down a lot by sticking to the CLI, but we should have cut even more. Elon Musk was right when he said you must first “delete the part!”.
Here are a few features we should have cut earlier:
Magic detection of whether the input is supposed to be a terminal command or prompt
Automatic knowledge file updates, which we tweaked for months before largely scrapping
A pseudo-terminal library (node-pty) for color output & aliases, which was recently named our biggest black-hole feature ever
Get the whole team improving the core product
I took on too much of the core system and left my cofounder to deal with other tasks that may not have been as impactful. Getting all hands in the game helps both focus and morale.
Live in the future
Never stop thinking about how to disrupt your current product. What is the next thing? What experiments can we try today to make it work?
Monthly retrospectives
One bit of process that could have helped us achieve the above is monthly retrospective meetings. Schedule these on your calendar and set aside an hour for everyone to answer these questions and discuss them:
What should we double down on?
What should we cut?
What should we explore next?
Next steps for Codebuff
In the last couple months, we’ve done more reflection and exploration as competitors such as Claude Code have entered the market with similar ideas.
(Incidentally, I believe Claude Code succeeded in part by having a more focused bet: client-side only, search-replace file editing only, agentic-RAG only.)
We’ve been dreaming of the next thing, and now I’m confident we know what it is.
Our new multi-agent product is live!
Our multi-agent framework, launched two days ago, is already boosting our eval scores!
I’m happy to say that, as of two days ago, we’ve soft-launched our multi-agent architecture, where agents spawn other agents with different roles.
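To give a flavor of the pattern, here is a toy sketch of an agent delegating to sub-agents with different roles. The names and interfaces are illustrative, not Codebuff’s actual API:

```typescript
// Toy illustration of agents spawning sub-agents with different roles.
// Names and interfaces are hypothetical, not Codebuff's actual API.

type Role = "planner" | "editor" | "reviewer";

interface AgentResult {
  role: Role;
  output: string;
}

// Stand-in for a real LLM call; in practice each role gets its own
// system prompt, tools, and model choice.
async function callModel(role: Role, input: string): Promise<string> {
  return `[${role}] handled: ${input}`;
}

async function spawnAgent(role: Role, task: string): Promise<AgentResult> {
  return { role, output: await callModel(role, task) };
}

// A top-level agent that delegates the work to specialized sub-agents.
async function runTask(userPrompt: string): Promise<string> {
  const plan = await spawnAgent("planner", userPrompt);
  const edits = await spawnAgent("editor", plan.output);
  const review = await spawnAgent("reviewer", edits.output);
  return review.output;
}

runTask("Add input validation to the signup endpoint").then(console.log);
```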
The reception so far has been overwhelmingly positive, even though this is just the beginning. My cofounder says we’re only scratching the surface of what’s possible in this framework: “it feels like an infinite world of possibilities.”
I agree — check it out! And stay tuned for a bigger launch soon!
If we got so many things right about what was coming for coding agents last year, can we do it again? I think so!
Here are my forecasts:
The multi-agent paradigm will win. Our experience is that it’s possible to rapidly improve capabilities by delegating tasks to specialized agents.
“Live learning” will be standard. Having the coding agent learn as it does tasks is extremely powerful.
Coding agents will flip the initiative. We’ll see a shift from the user always initiating prompts, to the coding agent more often coming up with tasks for the user, e.g. to review key decisions.
Coding agents will close the loop. Instead of just proposing code changes, they will also use the product itself to perform QA and evals, and commit the changes autonomously.
Recursively improving coding agents will start working. And all the top coding agents will be a flavor of this.
xAI will gain a sizable lead. The multi-polar era will recede as xAI pulls decisively ahead in model quality and intelligence.
The best model will not matter as much as it does today. Instead, it will be the network of agents that distinguishes the best product.
It’s been a blast
Thanks for reading, and cheers to another year of:
Big ideas, grinding, new employees, office snacks, customers that want to acquire us, offsites in Tokyo, afternoon breaks for running or basketball, and late night coding sessions.
May the best coding agent win!
James
P.S. Come help us build the world’s best coding agent!
You can join as a founding engineer and possibly have a stake in the first $10 trillion startup once agents rule the world. Email james@codebuff.com. We also offer referral bonuses!
Every day, there’s a new AI launch and we feel further behind. Gemini 2.5 Pro. The new DeepSeek V3. Claude Code.
Every day, there’s a new bug report (or three or five). Our npm package is bricked, throwing an error on startup. A service we are using is down. We deleted a user’s code, and they didn’t have a backup — they lost 7 hours of work.
But also, every day, there’s a new comment on our Discord, on Twitter, or on Bookface saying they like our product best: it’s faster, and it solved something another service couldn’t.
The rollercoaster is real
This is us
Our company is 7 months old. We are three people. We have >100 paying subscribers, but we’re struggling to make the product reliable enough to scale further.
It turns out that making something “just work” means fixing 1,000 papercuts, one at a time. That means debugging websocket connections, staring at logs to understand which of 10 steps went wrong in applying a file edit, and responding to angry users to get more insight into what failed this time.
(And if you’re me, you might need to deploy a hotfix for your hotfix for your hotfix haha.)
Cursor and/or the labs will crush you
Codebuff (left) vs Claude Code (right). We fulfilled the user’s prompt 5 times faster
Meanwhile, competitors are forging ahead. Is there room for another coding agent startup when billions in funding have already been distributed?
We have the best code output by some measures (speed × quality). Moreover, I can see how we can stay ahead, at least for a while.
Cursor/Windsurf are limited by $20/month plans and aren’t optimizing for a fully capable coding agent.
Claude Code can only use Claude. The best codegen will use many models.
Quickly injecting better code context will continue to be an advantage, even as models get smarter.
I recently met another codegen founder, who was seemingly resigned to the “bitter lesson”.
He said new models will keep coming out, and they’ll be better than whatever you’re working on. Just stop trying to compete on code quality; compete on infra and marketing — those will be the only enduring advantages.
I don’t agree. I continue to believe even very smart models will improve with more context.
Even models trained to be more agentic, where Sonnet 3.7 is just the start, will have a similar opportunity for improvement.
Historically, startups are the most responsive to customer needs. I think a startup will beat the labs at this task of cobbling together all the right pieces of context.
You need to do more marketing
With any startup, you talk to enough people while building it that you develop an intuition for what everyone else thinks you’re lacking.
Commonly, they will tell you that you just need to get your product out in front of more people.
Like, you need to personally go to events and tell people to use your product. You need to post on social media. Or, you need to email Bill Gates (something my Dad mentioned for my first startup haha).
Maybe.
But founders have limited time and need to prioritize. Assuming you have enough users that are providing feedback daily (a big assumption), focusing on building a high quality product that people can’t live without is often the better strategy.
The returns to marketing are sharply limited if your user retention sucks. Conversely, marketing is way easier if you have a killer product.
So, yes we need to do more marketing, but we also need to fix a high proportion of our bugs.
Why is your team so small???
—say the VCs hoping we’ll take more investment and dilute ourselves beyond what’s necessary.
We (intentionally!) raised less than our peers in YC
It’s human nature to over-hire. Nearly every first-time founder gets it wrong.
Your company is more impressive the more people it has. It can also get more done. It’s a given that more hours worked collectively means better results, right?
Wrong.
If the scope of what you need to build is one hairy bit of software, adding employees is unlikely to speed that up, since most code should have only one owner.
It’s also dangerous to not keep a focused vision. New employees, even great ones, might want to take the company in a different direction. Even if they’re right, they’re wrong. Your startup can only go one way with full conviction or it will fail.
All that said, hiring our first employee did seem to take some of the weight off our backs, given that we had one more person we could trust to handle user issues. It helps that he’s a cracked engineer!
If you’re interested in joining us to build the best coding agent, email me (james@codebuff.com)!
The grind continues
I work every day. Frequently, for the entire day. I like it. It’s rewarding, albeit tiring.
But those sweet moments when users are happy,
when they’re rooting for us, when they say we’re about to blow up,
and when the roller coaster races, faster and faster, up the incline toward a new peak—
make the struggle all worth it.
Today I took a break and went to the blog club at a local coworking space to hang out with friends, resulting in this post.
Thanks for reading!
In that piece, I postulate that LLMs’ killer use case is generating code. When software becomes one-tenth as costly to write, it will unleash the “Crazed-Super-Scientist Barons”, i.e. entrepreneurs who use this newfound power to build amazing things.
I didn’t imagine that I would be building a coding tool that could help make this a reality!
AI Grant application
Here’s an excerpt from my application to the AI Grant accelerator, which does a good job explaining Manicode (more discussion below!):
Run manicode in your terminal. Ask it to do any coding task. It will make changes to your files.
...and it will do a really good job. Why?
It has full access to read and write to your files, run terminal commands, and scrape the web
It can: grab files it needs for context, edit multiple files at once (no copy-pasting), run the type checker, run tests, install dependencies, and search for documentation.
These abilities are key to doing a good job and will only become more powerful as LLMs continue to level up.
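As a rough sketch of what "full access" means in practice, the tool surface can be as small as a handful of functions like these (illustrative only, not Manicode's actual implementation):

```typescript
// Sketch of a minimal tool surface for a full-access coding agent.
// Illustrative only, not Manicode's actual implementation.

import { readFileSync, writeFileSync } from "node:fs";
import { execSync } from "node:child_process";

// Pull any file the agent asks for into the model's context.
function readFile(path: string): string {
  return readFileSync(path, "utf8");
}

// Write edits straight to disk, no copy-pasting from a chat window.
function writeFile(path: string, contents: string): void {
  writeFileSync(path, contents, "utf8");
}

// Run arbitrary commands: the type checker, tests, installs, and so on.
function runCommand(cmd: string, cwd = process.cwd()): string {
  return execSync(cmd, { cwd, encoding: "utf8" });
}

// Fetch documentation or other pages the agent wants to read (Node 18+).
async function fetchPage(url: string): Promise<string> {
  const res = await fetch(url);
  return res.text();
}

// Example: the agent makes an edit, then checks its own work.
writeFile("hello.ts", "export const greet = (name: string) => `hi ${name}`;\n");
console.log(runCommand("npx tsc --noEmit hello.ts"));
```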
It uses so-called "knowledge" files
LLMs perform so much better with extra context!
With Manicode, we've come up with the idea of checking in knowledge.md files in any directory and writing down extra bits of context, like which 3 files you need to edit in order to create a new endpoint. Or which patterns are being deprecated and which should be used. Or which directories can import from other directories.
Every codebase has lots of implicit knowledge like this that you have to impart to your engineers. Once written down, it makes Claude really fly! It's truly a night and day difference.
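For example, a knowledge.md for an API directory might look something like this (a made-up illustration; the file and function names are hypothetical):

```markdown
## Adding a new endpoint
Edit these 3 files: `routes.ts` (register the route), `handlers/` (add the
handler), and `schema.ts` (add request/response validation).

## Deprecations
`legacyFetch()` is deprecated; use `apiClient.get()` for all new code.

## Import rules
Files in `src/api/` may import from `src/lib/`, but never from `src/ui/`.
```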
It's synchronous, and you can give feedback
You're chatting with it. It takes ~30 seconds to get back to you and then you can tell it what you want to do next or what it did wrong.
This keeps Manicode on track and aligned.
It learns
The flow of using Manicode is:
Ask it to do something
If it fails, point out its error
Manicode fixes the error and automatically writes down how it can improve for next time in a knowledge file
You push the commit, and now Manicode has become even more capable when the next engineer runs it in the codebase.
This is the magic loop that will make Manicode productive for experienced engineers in giant codebases.
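A minimal sketch of that loop might look like the following. The helper names are hypothetical; the real product routes this through the LLM itself:

```typescript
// Sketch of the learn-from-failure loop. Helper names are hypothetical,
// not Manicode's actual code.

import { appendFileSync } from "node:fs";

interface TaskResult {
  ok: boolean;
  userCorrection?: string; // what the user said went wrong
}

// Stand-in for asking the agent to attempt (or retry) the task.
// In this stub, the first try "fails" and the retry with a hint succeeds.
async function attemptTask(prompt: string, hint?: string): Promise<TaskResult> {
  return hint ? { ok: true } : { ok: false, userCorrection: "edited the wrong file" };
}

async function runWithLearning(prompt: string, knowledgePath = "knowledge.md") {
  let result = await attemptTask(prompt);
  while (!result.ok && result.userCorrection) {
    // Record the lesson so the next run (by anyone on the team) starts smarter.
    appendFileSync(knowledgePath, `\n- Lesson: ${result.userCorrection}\n`);
    result = await attemptTask(prompt, result.userCorrection);
  }
  return result;
}

runWithLearning("Add a logout button to the navbar").then((r) =>
  console.log(r.ok ? "Done, and the lesson is saved for next time" : "Still stuck")
);
```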
We're unafraid to spend for better results
We can use as many parallel API calls, with as much context, as it takes to produce the best code, because we know that the alternative is human labor, which is much more expensive.
We're targeting the largest market for software engineers
It's a tool for daily use by experts (not just junior engineers)
It's for software maintainers (not just people starting new projects)
We're starting with a console application, because it's simple and has great distribution
Every developer knows how to install new packages with npm or pip.
Most developers already have the terminal accessible: as a pane in your vscode window, for example.
The timing is right
Claude Sonnet 3.5 passed some bar of coding competence, and the form factor of a fully capable agent that can change any file works now, whereas before you could only reliably edit one function at a time.
There is a moat after all
Handling every tech stack well, knowing when to run tests and the type checker, integrating with git, Linear, Slack, and email, supporting database migrations, etc., etc., etc. You can build hundreds or thousands of special-case bits of prompt magic so that it always just magically works the first time. A startup arriving at this 6 months late wouldn't catch up.
Intro video
Demo video
Why Manicode is the right product
"Wow, manicode is pretty great. I think by the end of the hour I'll have all the schema changes." - Manifold dev while I was writing this Substack piece
Manicode is just an LLM wrapper. But I think this is what works best.
The best code will continue to be generated by the best LLM on the market for the foreseeable future.
You could build a smaller, custom model. But it probably won’t have seen as much code as OpenAI’s or Anthropic’s models. It’s hard to compete with billions of dollars of investment!
Manicode gets user feedback quickly.
LLMs are not quite good enough to go off on their own and continue to make progress.
They get stuck: sometimes they can’t fix a type error or a test. They go off in the wrong direction: they pick the wrong design and keep building on it.
The best experience thus comes from a quicker feedback loop, where the human can direct the AI every 30 seconds on what to do next or what it did wrong. This is why Manicode is a synchronous conversation with the AI.
Manicode fills context well
Manicode’s genuinely new idea is to write extra context in knowledge.md files and check them into your codebase, side by side with the actual code. Knowledge is anything you write down that helps the agent actually work!
Other than that, Manicode knows the directory structure and chooses other relevant files to include automatically. This makes a big difference, especially because it can use the knowledge files to help it pick relevant files.
Manicode has full access
It can read and write to your files, as well as run potentially risky commands in your terminal without any confirmation from the user.
That sounds incredibly scary, but it’s actually much less risky in practice, especially if you have version control.
What command could it run that would mess things up that much? The extra abilities only make it more useful.
This quality of doing something that normal people think "goes too far" or seems unsafe is correlated with good startup ideas, because it means fewer people are likely to have thought of it. (E.g. For Airbnb: You let random strangers sleep in your house? Or Manifold: You let anyone ask and judge the resolution of their own question?)
The race is on
The most valuable application for LLMs is right in front of us.
There are probably a hundred-plus startups vying to win, with billions in investment. There could hardly be more at stake. The winner could be the next FAANG star that brings the creative power of LLMs to the world.
Watch the video here, or proceed to read the lightly edited transcript.
This is pretty close to what Manifest looked like haha.
The talk
Welcome! I'm so glad you're all here. All of you have come from far-away places. Welcome to Manifest! I guess this is the first talk. Manifest this year is twice as big as last year. It's going to be crazy. I hope you guys have a great time. I hope you have twice as much fun. I think my talk is not going to be twice as good as last year, but yeah, welcome.
My name is James. I'm one of the co-founders of Manifold. It's nice to meet you if we haven't met already. I'm going to dive right into my talk, which is on the future of prediction markets. I hope that you find it interesting.
I'm most interested in the potential use cases for prediction markets. I'm going to outline four of them in my talk. Basically, prediction markets are sort of going places these days. They're getting more popular. We even have people writing articles about prediction markets. Some of them are saying things like, "Oh, they're not going to grow very much more" or "They're actually not as good as polls." But when you have these kinds of articles coming out that are criticizing them, I think that's actually a good sign. That means we're on to something. So I'm happy about that.
The original idea of Manifold was to take prediction markets and have them be run by a creator where one person would ask the question, they would set the resolution criteria, they would allow people to trade, and they would earn trading fees. So they would sort of run this whole thing. They provide a lot of value by doing that, and then they would earn from the trading fees. So it turns it into almost like a mini business. That was the original idea that we proposed to Scott Alexander.
But we decided to move into play money. So it was never quite the case that you could create any question and if it got popular enough, actually earn income from it. However, Manifold has recently announced that we are going to introduce these cash prizes using the sweepstakes model. I'm happy to say that in about a month - it's not launched yet - you will be able to earn cash prizes and perhaps run markets and run them like a business and possibly earn a profit by being a creator of markets. So I'm really excited for that.
Also at this Manifest, you will be able to learn more about sweepstakes, including there will be a live theatrical performance and musical that will go over the details. I hope you will attend that. I think it'll be in the park in like a couple of hours. So look forward to that.
And then without further ado, I'll continue to enlighten you on the four use cases that I think are really valuable for prediction markets.
I. Running a market profitably
The first one I've sort of outlined already, which is to run a market in a way that's net profitable. On Manifold so far, with the play money world, we actually were not running these zero-sum markets. In fact, we were printing a bunch of Mana. We were giving out lots of bonuses because that made it a lot easier for someone who creates a market to earn Mana from doing it. So we would give you a bonus for every unique trader that trades on your market.
It was common for people to make a profit by creating markets on all these topics. Post pivot, we have eliminated all those bonuses, and it's very sad. All the markets are zero-sum actually, and we're charging trading fees. So when users trade, a little bit of their bet actually goes to the creator of the market now. We are trying to bring back the original vision of Manifold, which is that the creator can make a profit on their market.
Let me tell you an anecdote. Just yesterday, you know that Starship succeeded at taking off and re-entering, and it didn't explode. That's amazing. It's really significant, but it's significant for another reason, which is that one of our users, Chris J Billington, created a market on whether the Starship would not explode. He subsidized it with 50,000 of his own Mana, and in this zero-sum environment, he was able to earn a profit for the first time as a creator because lots of people bet on it. He managed to close the market before the spaceship launched, so that meant he was able to get back his liquidity.
The way liquidity works is a little complicated, but if the probability gets bet to an extreme, the liquidity ends up getting eaten up, and that money goes to the traders who bet on it. Closing the market early is kind of a novel thing, because in the play-money version of Manifold people just aren't that concerned about play money. But in this world, it's what justifies adding a large amount of subsidy. In this case the subsidy was basically $50, 50,000 Mana, and I hope that we can scale that. So this is like a proof of concept: something that works with a little bit of subsidy, $50, is something we can magnify, and we can eventually have $50,000 as a subsidy.
If you can manage liquidity right, then you can sort of turn the market into a profitable endeavor. So I think that that was something very significant that just happened yesterday.
Imagine if you have these engines for prediction markets where each one can actually become profitable. Then that will inspire a lot of people in a decentralized way to create markets on their niche, on what they're knowledgeable about. I think that is a great thing, and that's the first thing that I'm very jazzed about.
II. Buying information
Okay, the second thing I'm jazzed about for prediction markets: not every topic can support enough traders or enough betting volume in order to earn enough fees. So essentially, it's like what if you have a question that's not very popular? What if it's a niche question? What if it's some obscure scientific fact that you want a forecast on, but maybe it's not quite profitable?
In that case, prediction markets are actually still useful. Here's how: basically, you flip the script and you say instead of trying to earn a profit, I am going to purchase information. I am going to subsidize this market. So you put in however much money you think it's worth to answer this question, and then you just create the market. Then in a crowdsourced way, lots of people from around the world see that, and then they bet on it if they think it's profitable. Through this magic, the prediction market will create a forecast to answer your question.
So the second case is basically buying information. I think that that's really cool.
I've done this, for example. I wanted to know what the daily active users for Manifold would be in July, so in the future. I set it up. I asked this question using our experimental numeric distribution format, which I think is also novel. Essentially, it allows people to bet on this complex continuous distribution. It's actually not continuous; we break it into little buckets, but it's approximately continuous. Users can choose a range and then bet an amount within that range, and it will adjust the distribution to be higher where they bet it.
I paid 10,000 Mana, and I got as a result this nice distribution which said, "Here's the expected daily active users for Manifold in July." That's an amazing service. I think that's useful. I think that's worth $10, and I think it might be worth more. The more Mana you put in as subsidy, the more effort traders will put into getting that distribution right.
Maybe they think that the chance is basically centered around where our current users are, and then there's this long tail where maybe we're going to grow a lot. I'm happy about that case, but it basically said there was like a 5% chance that we would grow by 50% or more. If I were interested in those tails and I wanted it to be more precise, it would actually just work to add more subsidy to the market so that people have more incentive to bet it to be accurate.
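As a toy illustration of the bucketed format (not Manifold's actual market maker, just the intuition), a bet on a range pushes probability mass toward those buckets:

```typescript
// Toy illustration of a bucketed numeric-distribution market.
// Not Manifold's actual mechanism, just the intuition: bets on a range
// push probability mass toward that range.

function betOnRange(
  dist: number[], // probability mass per bucket, sums to 1
  lo: number,     // first bucket index of the chosen range
  hi: number,     // last bucket index of the chosen range
  amount: number  // bet size relative to existing liquidity
): number[] {
  const updated = dist.map((p, i) =>
    i >= lo && i <= hi ? p + amount / (hi - lo + 1) : p
  );
  const total = updated.reduce((a, b) => a + b, 0);
  return updated.map((p) => p / total); // renormalize to a distribution
}

// Start from a uniform guess over 10 buckets of daily-active-user counts,
// then someone bets that the outcome lands in buckets 6 through 8.
let dist: number[] = Array(10).fill(0.1);
dist = betOnRange(dist, 6, 8, 0.5);
console.log(dist.map((p) => p.toFixed(3)));
```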
There are tons of examples within Manifold because we use markets all the time for everything. We're buying information or we're doing a brainstorming session where we're getting users to propose features to us, we're getting users to find bugs for us. There are many formats where you can ask either an open-ended question and get free response answers.
The T-shirt design for Manifest that you have is created through a market, which was basically a contest that says, "Who will design the best T-shirt?" Another example that's pretty cool is that every month, the Manifold team does a retrospective of what went well and what didn't go so well in the last month. I create two markets for that. These are free response markets that mostly the Manifold team uses where we submit answers of what we thought went especially well and what we thought maybe didn't. This helps us improve.
Basically, I will just look at those answers and then subjectively be like, "I think this is pretty important" and "This one's not that important," and I come up with weights for them. Then I just resolve the market according to that. So it provides an incentive for people to submit and bet on answers where I think it will be judged as relatively important. It's just one more use case for how we use markets internally. I think that the ability to earn cash prizes and subsidize stuff is going to be a major deal, which might unlock a lot of use cases for other companies to use these as well.
III. Economic hedging
Okay, the third use case for prediction markets that I think is valuable and will be interesting going forward is economic hedging. Usually, when you make a bet, you're betting it because you think you're going to make money in expectation, like you're going to get more out of it than you put in. Sometimes you don't even need that in order for it to be profitable for you.
The way that works is because people have utility functions that are basically risk-averse. They might want to hedge the downside so that maybe they're losing money in expectation, but they're also making sure that the worst case is not so bad. This is kind of like an insurance market.
I created a market for myself because I ride an electric unicycle, and it's very dangerous. So I created a market like, "Will I have an accident or will I have an injury?" The market itself can first find the right price, so it discovers what is the right price for my insurance, essentially. It'll be like, "Okay, I broke my collarbone last year right after last Manifest riding an electric skateboard, so I switched to unicycle because I thought it was safer." So you have a base rate.
I created that market and got it started: I was like, "I'll buy yes at 4% that I will have an accident in the rest of 2024." Then people started betting it up, and now it's like a 10% chance that I'll have an accident. So I ended up just posting a limit order to buy yes at 10%. People will fill that. I've gotten some of it filled already. Then I have insurance: I'll actually make a profit if I do end up breaking another bone.
Economic hedging is probably less useful personally. I think the main use cases are usually for businesses. There are tons of use cases for this. There's like, "Will the weather be bad on this day that makes the event not work?" or "Will Trump be elected, and then somehow that changes something for you in your business?" or "Taiwan is invaded, and that's not good for your business for some reason." So you can hedge all of those outcomes.
What I will say is that the user-created model is basically an amazing combination with this economic hedging use case because you can figure out what you want to hedge, and then you create exactly that question and hedge it. Like I could create exactly the question I wanted on my electric unicycle. But if you're a business, I think that that's like a superpower. If you're going through traditional finance and stuff, it's not that easy. So economic hedging is pretty cool.
IV. Matchmaking
Okay, the fourth use case that I think is really valuable is matchmaking. I would put this in a generalized way. There's basically hiring, where you're matching a job seeker to a company. There's networking, like who should I meet, who should I talk to. There are friendships, like who would I really vibe with. And there's dating, of course. I mean, obviously.
After last Manifest, Robin Hanson gave a speech, and he said, "You guys are doing cool stuff, but you're kind of not getting to the really valuable use cases. You need to think long and hard about where prediction markets are going to be the most valuable. Is it really about predicting whether the ball pit is going to materialize at Manifest?" He suggested that hiring markets, which help companies hire employees, are obviously really valuable.
I heard that, and then I was like, "I know exactly what to build," and that was Manifold Love. I don't know if you guys know, but basically, the way I'll explain very briefly, the idea is that you create a public profile, a dating profile. You upload photos, you answer questions, and then you bet on who among these profiles is going to date who.
Secondly, I'd say that there were definitely some issues where people were not that interested in browsing on behalf of someone else. People are definitely more interested in themselves and browsing for themselves and betting on their own prospects. But I think that with the addition of sweepstakes and cash prizes, that could help incentivize people to go out and find matches for other people. So I think it's possible that Manifold Love could be rebooted at some point. I know if Stephen were here, my brother, he would be like, "No, no, no, we're not doing that at all. Just forget about Manifold Love." But yeah, I think it's really cool.
Also, we are planning to actually do the hiring use case. Manifold is not really hiring at the moment, but we might soon. When we are, I will create a market on who we will hire. I will subsidize it with lots of money so that you'll make at least $10,000, maybe more, by betting on who we're going to hire. I actually think that this is a promising use case. I think that crowdsourcing who we would hire, like people have good ideas, they know people, and the mechanism is correct. It's like they're going to bet if they think we're going to hire them, that it's profitable in expectation for them to bet on it. That will surface to us the most promising candidates. We will just look at them and see which one has the highest probability of being hired, then we'll interview those people. I think it's going to work. So we will dog food that, and if it's sufficiently successful, then we can just try to make that work for everyone, try to run that as a product for other companies. I think that's really exciting.
V. AI
Okay, so those are four use cases. This is a talk about the future of prediction markets, so I have to mention AI at some point. Basically, my take on AI is that it will take all of those use cases and it will supercharge them because AI is actually going to make all of those way better.
Let me run through them again. So we have running a profitable market. Imagine that it's an AI that creates the market. It's the AI that has to judge whether it happened. I actually think AIs are going to be great judges, like an impartial source. Instead of relying on a human that's emotional and maybe they didn't read the evidence correctly, or they woke up and were sick that day, you don't know. But if it's an AI, you're like, "Okay, that's objective." I think they're going to be great market makers and resolvers. They can do it for cheap, which is really like another superpower.
So it means that we can have markets on everything because it's going to be so cheap for AIs to create and resolve markets. You can ask the AI questions about the resolution criteria, and it will always be there to respond and give you clarifications. So I think it's going to be a better service, and it'll be cheaper. That's amazing because we actually can just support markets on everything. Then you can just type into Manifold any question, and we already have a lot of questions - we have like 100,000 - but then we'll just have like 100 million or something. There'll just be so many questions.
That's just the first use case - running markets as a business. It's going to be profitable for those AIs in particular because they just don't need to be paid much.
Then there's another use case, which is doing research or subsidizing markets. Why would an AI pay to create a market and subsidize it? I think we're going to enter this world of AI agents that are trying to understand the world. They're trying to make progress on their own. The world is complex, and the AI might not know everything. I think there could be domain expert AI agents that know about certain things.
Basically, what I think will happen is that prediction markets will be like a native technology of AIs. When there's something they don't know, they will ask a question using APIs, and they will subsidize it. Then other AIs will answer, will bet on it, and so it can, in a matter of seconds, sort of figure something out about the world. This is all just information technology. It's just whatever it wants to learn, it can do. There will be these AI agents that perform this role because it's profitable to bet on these markets.
They will be running businesses, and then they will be hedging their business economically using prediction markets because it just makes sense. It just produces value. So I think that prediction markets are actually a really good match with AI. I think humans are actually not a good match for prediction markets when you think about it. It doesn't come naturally, like thinking in probabilities, figuring out exactly how much to bet. That's like only weird humans do that. It's not something that all of us just natively think in this way economically and in probabilities. But computers are so good at this, and they're already trading on markets. Like most of the volume in stock markets is all algorithms and stuff. So basically, I want to submit that prediction markets, and in particular user-created prediction markets, will be a useful tool for AIs to do a bunch of things using all the use cases I outlined.
One ending anecdote might be if, in the future, you have your personal AI and it knows everything about you. But people are protective of their data, so actually only your AI has that data. It's kind of like it sees your whole life and it knows everything you say. So it has a really good grasp on you personally. That AI could go out and bet in prediction markets on who you will marry, on which jobs you will take. You will benefit from this because you will get all these nice forecasts of what you should be doing. It will make your life a lot better. I think that that would be a truly amazing world.
We’re quickly moving up the abstraction ladder for software development! Claude 3.5 Sonnet is more evidence that the cutting edge is continuing to improve.
I propose 5 levels of automation for software development, akin to the levels used for self-driving cars.
Levels of Software Automation
In the last few years, we’ve moved from no automation, to auto-completing lines of code, to writing whole functions.
We are currently at level II. See Cursor for the state of the art.
For experienced engineers, levels I and II are tools that yield only a modest speedup, currently between 1x and 2x.
However, I think we’re on the brink of a major shake-up.
Level III automation
The next level of automation, where a human guides the AI toward implementing whole features, will be totally different. Software will become much, much cheaper.
A feature that might take 3 hours of concentrated work today could be done in 15 minutes by spec’ing it out in a paragraph and leaving a few comments on an AI agent’s proposed changes.
This is the vision of the startup Mentat.ai, which claims the highest score on a software engineering benchmark.
First, I created an issue on our open source repo, and tagged “@MentatBot” to trigger it:
I asked the AI agent bot to do some work for me.
The resulting Pull Request seemed impressive at first, but on closer inspection, almost every change it made was a little bit wrong. The bot edited some of the right files, but didn’t call the helper function that it created elsewhere. It also edited some wrong files that were more like library code. It created type errors that it couldn’t fix, and didn’t always follow my instructions.
Still, MentatBot is a promising early stab at the problem, currently powered by GPT-4o (they hope to upgrade to Claude Sonnet 3.5 soon).
With another year of improvements to base LLMs, plus further unhobbling via efforts of startups to chain LLM calls productively, I can imagine us at automation level III in a year (50% chance) or two years (75% chance).
Below, I created a market on roughly the criteria for level III automation by July 2025:
The world will change appreciably with a 10x speedup in software creation.
There are 4.4M software engineers in the US. They collectively earn approximately $500B per year. If we’re able to do all that work with 10% of the engineers, that naively implies ~$450 billion in value created.
Of course, decreasing the cost of software by 90% will dramatically increase the demand, as economists know. That’s why the value created from level III automation is likely much larger, though hard to predict. With an explosion in use cases for cheap software, the value created could be in the trillions annually.
Suddenly, software will become more polished. Bank apps will take less time to load. There will be fewer bugs in day-to-day usage.
Most importantly, there’ll be an explosion of startups. Niches will be filled where it was not profitable previously. We’ll have more personalized software, generated even for individuals. And of course faster software development will feed into better AI.
However, the key unlock of level III automation is not cost savings. It’s iteration speed.
Crazed-Super-Scientist Barons
It’s well-known Manifold lore that we push changes at a breakneck pace, sometimes to the detriment of our users.
For moving this quickly as a small team of 6 full-timers, Manifold was said to be a “fiefdom run by crazed scientist barons.”
I ran with this idea and proposed that all organizations would be more effective if they operated on this model. See my “Mad Scientists Theory of Governance” market for further elaboration!
Now, imagine what will happen when you hand us another 10x speedup. I added “Super-” into the phrase, but it’s hard to envision where exactly that will lead us.
In day-to-day work, we receive a giant volume of requests for bug fixes and features. There are so many ideas and experiments to try that we are incredibly bottlenecked on execution.
Going ten times faster could change this bottleneck from execution to getting feedback (and deciding what to do next). You need to test product changes on users to see whether your idea was good, and that takes time in the real world.
Running A/B tests can take a while to give statistically significant results. However, qualitative feedback can be richer and faster. I predict startups like ours will collect more individual feedback from users, because they will have the capacity to act on it. (Just like Discord has been critical for increasing user feedback in our journey so far.)
If today it takes two years to find product-market fit for a new product, then crazed-super-scientist barons should be able to do it in a few months.
We’ll thus see an acceleration of the serial-entrepreneur phenomenon, including more parent companies that spin up dozens of products. Such could be the future of Manifold. Our name allows for it, at least!
The speed limit of progress
Innovation is currently driven by small teams pushing hard against the frontier. This sets the global speed limit for progress.
In the next 1-2 years, AI will increase that speed limit by a factor of ten, at least for software startups. Exciting times!
As AI continues to develop, and the level of automation increases, the speed limit will continue to be pushed back across all fields. I look forward to this world of abundant frontier advances!