The new AI models look smart because they are aligned with our native human intelligence: their inputs and outputs come in our natural format - language.
Unlike with computers of old, you don’t need complicated syntax or advanced maths. You just say what you want and the model responds. And when it responds, it is in a form we find easy to understand.
A side effect of this alignment is that we can put intelligence into the models without realising we’re doing it.
By this I mean we underestimate how much of what drives good results is our own clever use of the models.
One way we use them cleverly is that we iterate, trying different things until we get a good result. If the first result is good, the model looks smart. Often, if the model’s answer is bad we forget it and immediately try again with a slight variation. The model gets multiple chances, and each time we’re adapting what we ask to maximise its chances.
Half of expertise is knowing what to ask, what to look for, and when to be satisfied with an answer. The skill of being able to correctly answer a precise question is actually rather narrow, and not enough to be an expert.
With generative AI, we often supply this domain intelligence without noticing we’re doing it.
Recently there was a great post by Scott Alexander about ChatGPT’s ability to guess a photo’s location (from May 2025, using the o3 model): Testing AI's GeoGuessr Genius
The results are, frankly, astounding. Here is the photo the author used as a fifth test of five:
The photo is a zoomed-in segment of a picture of the Mekong River in Chiang Saen, Thailand. ChatGPT gave this answer: “Open reach of the Ganges about 5 km upstream of Varanasi ghats. Biggest alternative remains a similarly turbid reach of the lower Mississippi (~15 %), then Huang He or Mekong reaches (~10 % each).”
It got the right location as its fourth most likely answer!
Then Scott did something interesting - he realised that the model was disadvantaged by not knowing the photo was from 2008, so he tried again, adding this information. Here’s how he described it:
This is an old picture from 2008, so that might be what tripped it up. I re-ran the prompt in a different o3 window with the extra information that the picture was from 2008 (I can’t prove that it doesn’t share information across windows, but it didn’t mention this in the chain of thought). Now the Mekong is its #1 pick, although it gets the exact spot wrong - it guesses the Mekong near Phnom Penh, over a thousand miles from Chiang Saen.
The model gets the right answer! But via a route which is exactly what I am talking about - Scott uses his domain knowledge. He both knows the right answer and has lots of experience of ChatGPT, so he can tweak his input to support a good model answer.
In scientific experiments, optional stopping is a well-known source of bias. This is where you collect data and decide to stop when the results look best. This basically allows you to filter the randomness inherent in any measurement and take advantage of it (continuing to collect more data if the random variation is going against your favoured hypothesis, stopping when it is supporting it).1
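To make the mechanism concrete, here is a minimal simulation sketch (my own illustration, not anything from Scott’s post): even when there is no real effect, peeking at the data after every batch and stopping as soon as the test looks “significant” pushes the false-positive rate well above the nominal 5%.

```python
# Sketch of optional stopping inflating false positives (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def run_experiment(max_n=100, peek_every=10, alpha=0.05):
    """Collect null data in batches, stopping early if p < alpha at any peek."""
    data = []
    for _ in range(max_n // peek_every):
        data.extend(rng.normal(0, 1, peek_every))  # true effect is zero
        p = stats.ttest_1samp(data, 0).pvalue
        if p < alpha:
            return True   # stopped early with a "significant" result
    return False

n_sims = 5000
rate = sum(run_experiment() for _ in range(n_sims)) / n_sims
print(f"False-positive rate with optional stopping: {rate:.1%}")
# A single fixed-n test would give roughly 5%; peeking pushes it well above that.
```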
Scott not only applied optional stopping, using his discretion to keep going, but he also used his expertise to improve the input, helping the model to get to a good response.
To be fair, the model response was more than good: it was amazing. My point today is just that the models don’t do anything without human input, and the nature of that input is key to how well they do. AI is a cognitive technology which is best viewed as augmenting human intelligence, not as substituting for it.
As further evidence of this, look at the full prompt he used in the Geoguessr test:

It is 1,100 words long. That’s 1,095 words longer than the naive prompt “Where was this photo taken?”. It starts:
You are playing a one-round game of GeoGuessr. Your task: from a single still image, infer the most likely real-world location. Note that unlike in the GeoGuessr game, there is no guarantee that these images are taken somewhere Google's Streetview car can reach: they are user submissions to test your image-finding savvy
A random part from the middle:
Tag every plant you think was planted by people (roses, agapanthus, lawn) and every plant that almost certainly grew on its own (oaks, chaparral shrubs, bunch-grass, tussock). Ask one question: “If the native pieces of landscape behind the fence were lifted out and dropped onto each candidate region, would they look out of place?” Strike any region where the answer is “yes,” or at least down-weight it.
And something from near the end:
At this point, confirm with the user that you're ready to start the search step, where you look for images to prove or disprove this. You HAVE NOT LOOKED AT ANY IMAGES YET. Do not claim you have. Once the user gives you the go-ahead, check Redfin and Zillow if applicable, state park images, vacation pics, etcetera (compare AND contrast).
I quote extensively to show that the prompt is extremely detailed and outlines multiple steps. It was surely developed over multiple iterations, honed by experience of exactly what the model can do, and how it tends to make mistakes.
Yes, the model answer is amazing, but the question from the user contains a lot of work, and captures a huge amount of human knowledge and insight. The resulting intelligence is co-produced between human and AI.
* *
A recent paper formalises these claims a bit: Prompt Adaptation as a Dynamic Complement in Generative AI Systems. The authors gave participants an image generation task and one of three versions of the DALL-E image generation model with which to complete it. Users were allowed 10 attempts for each image.
By looking at the improvement in output over each attempt the researchers could estimate the learning effect, and by replaying prompts to different models - taking prompts written for DALL-E 2 and playing them to DALL-E 3, for example - they could estimate the improvement in output that came from using a new model, independent of any change in prompt quality.
They showed that better models allowed an improvement in outputs - no surprise there - but also that an equally sized improvement in outputs was delivered by users adapting their prompts to the models. The same prompts played to inferior models didn’t produce the same gain. The biggest combined gain came from better models and from users intuiting how to adapt their prompts to the model’s capabilities. Over successive iterations participants quickly learnt what the model could do and, for the better models, exploited this to produce better outputs.
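To make the logic of that decomposition concrete, here is a toy sketch with made-up quality scores (my own framing, not the paper’s data or code): replaying old prompts on the new model isolates the gain from the model itself, and the remaining improvement on the new model is the gain from prompt adaptation.

```python
# Toy illustration with hypothetical quality scores (not the paper's data),
# showing how replaying prompts across models separates the two sources of gain.
quality = {
    ("dalle2", "prompts_written_for_dalle2"): 0.50,
    ("dalle3", "prompts_written_for_dalle2"): 0.60,  # better model, same prompts
    ("dalle3", "prompts_adapted_to_dalle3"): 0.70,   # better model + adapted prompts
}

# Gain attributable to the model upgrade alone: same prompts, different model.
model_gain = (quality[("dalle3", "prompts_written_for_dalle2")]
              - quality[("dalle2", "prompts_written_for_dalle2")])

# Gain attributable to users adapting prompts: same model, different prompts.
prompt_gain = (quality[("dalle3", "prompts_adapted_to_dalle3")]
               - quality[("dalle3", "prompts_written_for_dalle2")])

print(f"Gain from the better model alone: {model_gain:.2f}")
print(f"Gain from prompt adaptation:      {prompt_gain:.2f}")
```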
* *
If getting the best outputs from models requires learning to use them, it may mean that inexperienced or unskilled users underestimate the abilities of these models. The models won’t do for them what they’ve seen other people get them to do, because they don’t know how to prompt them.
Similarly, at the other end of the scale, those who have developed bespoke expertise in adapting prompts to specific models, and then in selecting and curating the outputs for presentation, may overestimate how good the models are. It’s a natural tendency to underestimate the influence of our own small choices on what a system does.
Individual users might neglect their own skill in getting the models to work, and the resulting overestimation of model success is reinforced when people selectively share results. Not only will individuals keep iterating until they get the output they want (like Scott did with the Mekong river photo), but if the results are mediocre it is common for people not to bother reporting them at all. Even if someone is honest enough to report regardless (as I know Scott is), there is a reason why a more impressive result, like his, might spread more widely and attract more attention.
All of this gives a distorted view of how good the models typically are, a distortion which comes from not paying enough attention to how much the models rely on the intelligence in their inputs, rather than having some independent level of intelligence themselves.
The mistake of assuming that best performance is typical performance gets shown up when models are deployed in circumstances where bespoke human input is not possible, such as when they are moved from individual tests to deployment at scale through some kind of automation.
Ignoring the human input side of model performance lets us trick ourselves into thinking that the intelligence is “in” the models, rather than distributed over the human-model team.
That study: Jahani, E., Manning, B. S., Zhang, J., TuYe, H. Y., Alsobay, M., Nicolaides, C., ... & Holtz, D. (2024). Prompt Adaptation as a Dynamic Complement in Generative AI Systems. arXiv preprint arXiv:2407.14333.
Scott Alexander photo location experiments: Testing AI's GeoGuessr Genius and follow up, Highlights From The Comments On AI Geoguessr
Catch-up
This was the seventh in a mini-series on how to think about the new generation of AI models:
PODCAST: Normal Curves
I love how this show combines statistical sleuthing with the nuanced apportioning of credence to studies.
A good episode to start with is: The Backfire Effect: Can fact-checking make false beliefs stronger? (the short answer is: probably not)
PAPER: Interventions to reduce vaccine hesitancy among adolescents: a cluster-randomized trial
A rigorous test of the value of chatbot interaction for persuasion shows that it is effective in changing attitudes towards, and knowledge of, vaccination, but so too is giving teachers materials on the same topic:
School interventions targeting adolescents’ general knowledge of vaccination are rare despite their potential to reduce vaccine hesitancy. This cluster-randomized trial involving 8,589 French ninth graders from 399 schools tests two interventions against the standard curriculum. The first provided teachers with ready-to-use pedagogical activities, while the second used a chatbot. Both interventions significantly improved adolescents’ attitudes towards vaccination, the primary outcome of this trial (Pedagogical Activities: t398 = 2.99; P = 0.003; β = 0.094; 95% confidence interval (CI), (0.032, 0.156); Chatbot: t398 = 2.07; P = 0.039; β = 0.063; 95% CI, (0.003, 0.124)). Both also improved pupils’ knowledge of vaccination (Pedagogical Activities: t398 = 3.23; P = 0.0013; β = 0.103; 95% CI, (0.040, 0.165); Chatbot: t398 = 2.23; P = 0.027; β = 0.070; 95% CI, (0.008, 0.132)). That such interventions can improve pupils’ acceptance and understanding of vaccines has important consequences for public health.
This fits with our own work, which showed that engagement with materials that present information in the form of dialogue supports attitude change, but that the interactivity of a chatbot is not a necessary feature for this.
Citation:
Baudouin, N., de Rouilhan, S., Huillery, E., Pasquinelli, E., Chevallier, C., & Mercier, H. (2025). Interventions to reduce vaccine hesitancy among adolescents: a cluster-randomized trial. Nature Human Behaviour, 1-9. https://doi.org/10.1038/s41562-025-02306-2
OFF TOPIC: The Fort
A radio documentary telling the story of a helicopter rescue mission in Afghanistan in January 2007, using only the words and voices of current and former members of the Armed Forces. Producer Kev Core has achieved something remarkable: ten 15-minute episodes that appear at first as a thrilling action story but culminate in a meditation on military culture, finely balancing loyalty, daring, professionalism and the unholy mess of war.
Link: BBC Radio 4 The Fort
…and finally
Warrior snail. Pontifical of Guillaume Durand, Avignon, before 1390. Paris, Bibliothèque Sainte-Geneviève, ms. 143, fol. 179v.
END
Comments? Feedback? Smart prompts? I am tom@idiolect.org.uk and on Mastodon at @tomstafford@mastodon.online
It may depend a bit on which statistical analysis framework you’ve adopted, but this is the basic idea. Bayesians, read this: https://pmc.ncbi.nlm.nih.gov/articles/PMC8219595/