Last month, I explained how I am currently building an app, a “digital companion” that helps users pace the stuff that matters to them through their day. It’s a large project, so for now I am focusing on an MVP, but even that is a rather significant amount of work. Which is OK; I have a clear overview, and I know what needs to be done. And I’m doing it.

As one usually does, I break complex problems down into smaller problems, and sometimes a smaller problem is interesting and general enough that it’s worth sharing on its own. That happened this month: I published a small library that solves an interesting problem, and wrote a (separate) blog post about it.

What’s next? Yesterday, I added additional shortcuts to these programs I use, including key combinations that move the mouse cursor around and simulate mouse clicks, because these programs have functions that cannot otherwise be activated from the keyboard. I also printed a cheat sheet of all the keyboard shortcuts I need and taped it to my desk. We shall see.

The state of software accessibility is a tragedy.

(What are these offending programs, you may ask? You may take a guess: we are talking about the web interfaces of Anthropic Claude, Google Gemini, ChatGPT, as well as the Cursor editor.)
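(For the concretely curious, here is a minimal sketch of the kind of workaround I mean, assuming Python with the third-party pyautogui and keyboard packages; these are not necessarily the tools I actually used, and the screen coordinates are made up.)

```python
# Hypothetical stand-in for the shortcuts described above: bind a key
# combination that warps the mouse pointer to a fixed screen position and
# clicks, so an otherwise mouse-only button becomes keyboard-reachable.
# Assumes the third-party packages `pyautogui` and `keyboard`
# (pip install pyautogui keyboard); the coordinates are invented.

import keyboard   # global hotkey registration
import pyautogui  # mouse movement and click simulation

# Screen coordinates of the un-keyboardable button (hypothetical values).
SEND_BUTTON = (1480, 920)

def click_send_button():
    """Move the pointer to the button and left-click it."""
    pyautogui.moveTo(*SEND_BUTTON, duration=0.1)
    pyautogui.click()

# Ctrl+Alt+S now "presses" the button without touching the mouse.
keyboard.add_hotkey("ctrl+alt+s", click_send_button)

keyboard.wait()  # block and keep listening for the hotkey
```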

❦❦❦

This also provides me with a convenient segue to mention the most significant thing that happened in AI-land last month: some researchers at Apple published a paper titled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. It has been discussed by virtually every journal, website, blog, podcast and discussion forum I’ve encountered. “It made a splash” is an understatement.

In a nutshell, the scientific result here is that the “intelligence” displayed by so-called “reasoning” LLMs seems to break down completely after problems reach a certain degree of complexity. Not like “the problem is more complex so I need more time / CPUs to solve it”, more like “this problem is more complex so blealghhd ssdk adf;l ;fsdf sfksdf ;’dfks”. The authors call this an “accuracy collapse.”

The going theory is that the “reasoning” the machine was doing was not actual reasoning; instead it may have been some kind of cosplay of someone thinking: an “imitation” of the structure of reasoning chains present in the training set. In this theory, when the problems these models face are more complex than the most complex problem in their training set, there is simply no corresponding learned reasoning chain to imitate from which to derive a solution. And there is no “latent intelligence” left over to infer additional reasoning steps.

It would not be fair for me to say this was an “I told you so” moment. I had already built my own understanding that there is no “latent intelligence” to be found in LLMs, but my position comes from another angle (namely, that actual intelligence hinges on interactions with the physical world). This “accuracy collapse” came as a surprise to me too. We now need further research to fully understand what is going on here.

❦❦❦

The result above mirrors my own experience, in a way: something I had read about before but had not yet experienced personally.

As you’d imagine, I have used some AI programming assistance in my latest project. At the very beginning, the robot made a lot of progress quickly, with me just stating what I wanted as the end result. Then, after my project reached a certain level of complexity, the tooling stopped making progress altogether when given “holistic” inputs. It would simply give up, or give me back complete gibberish. I had to start using my own prior skills to break the project down into well-defined and well-documented components with clear interfaces, and to break my desired output down into bite-sized tasks that the robot could still process. All the while, I was not shy about spending money on tooling: it could use all the CPU it deemed necessary. And yet, I saw this “accuracy collapse” happen in practice.

❦❦❦

In extremely related news, Anysphere (the company that makes the Cursor IDE) is raising the price of its “standard” (fully featured) subscription to $60/month. The previous “standard” tier remains available at $20/month, but its capabilities were downgraded this week. The word on the interweb streets is that they will likely wait for most professional users, funded by cheap startup money, to upgrade, and then introduce a new $200/month plan while downgrading the $60 one.

The reality is that Anysphere is caught between a rock and a hard place. The rock is that the current technology, as imperfect as it is, is expensive to operate, and it would be tough for Anysphere to make a profit on anything below $200/month under heavy load. The hard place is that the large majority of their users (by count) are non-technical hobbyists who are leveraging Cursor as much as they can to crank out mediocre software projects. Because the tool is imperfect, these users struggle and thus send far more LLM requests than they would at a higher skill level.

What’s likely to happen next? My prediction is that Anysphere (probably like all the others in the field) will move in two directions simultaneously. I think they will indeed raise their prices, which will cause a significant number of users to “fall off” because they can’t afford more than $20/month. (For most of the world, $20/month is a luxury.) I also think they will develop strategies to wean low-skill users off LLMs for code generation, to reduce load. Maybe they will partner with (or acquire) other products, like “low-code” tooling.

This line of thought slightly worries me. If we follow this potential future further, we see a growing gap between the AI “haves” and “have-nots”. People who already have system skills will see their productivity multiplied and will earn enough to pay the exorbitant price of their tools; people who don’t will be left out, unable to afford the tooling. Wars have been fought over less than that.

Perhaps the more optimistic future would be to see a resurgence of schooling materials that teach low-skill users when it is and is not a good time to ask an LLM for help, so they can reduce their usage-based expenses to just the problems they can’t solve in other ways. This schooling might even become a cottage industry for intermediate-skill users. Maybe there are opportunities here to redefine the essence of education.

❦❦❦

As an interlude, consider this intriguing thought. As you might know already, the output of generative AI largely mirrors what was in its training set. As it happens, the LLMs available today are based on training sets that were cut off around mid-2024. Now, consider that throughout 2024 and 2025, loads of human authors have been pumping fresh content online that says, in many different ways, “the AI does a good job most of the time, but it’s still full of inaccuracies.” What do you think will happen when LLM training starts using this content as input? If the LLM only reproduces what is in the training set, and the training set repeatedly says (paraphrasing) that “the LLM is often inaccurate”… aren’t we likely to see inaccuracies “locked in” as an expected feature of responses?

(Yes, I understand that, taken at face value, what I’m saying here is not technically possible. But there are a few things unique to the 2024-2025 training sets referring to LLM performance that will start to “bloom” at the end of this year, perhaps in 2026. These will be interesting times.)

❦❦❦

As another angle to better understand the “quality” of LLM outputs, this month I also spent some time on a side quest comparing OpenAI (GPT-4.1 and o4-mini), Google (Gemini 2.5 Pro) and Anthropic (Claude Sonnet 4 & Opus). I also looked at DeepSeek R1, but overall I was disappointed by its outputs, so I did not look at it as much and won’t mention it further.

The way I did this was to select a few private input data sets that I personally have a lot of indirect knowledge about, then query the LLMs to see how well they would discover the same knowledge, and how the prompting changes the quality of the output.

My findings so far (summarized):

  • OpenAI’s stuff is still extremely sycophantic, to the point that it is annoying. It also does an OK job at extracting latent information from data sets, but it struggles to integrate it into a wider context: it modifies the surrounding document too much in the process. I think this is related to its more limited token context.
  • Anthropic’s Claude is extremely good at recognizing patterns that it has in its training data. Like, some things I fed in had a structural relationship with well-known previous work, and Claude was the only one of the three that spotted it immediately. Claude is also amazing at merging a piece of new data/knowledge into a larger story or document. I found its responses rather terse, though (compared to the other models), as if the system prompt were restricting it to only express things it is 100% certain about. This is a good property for precision work (like programming, which it was specialized to be good at), but not so much for exploratory work.
  • Gemini blew both out of the water in the accuracy and depth of its responses. (I’m talking about 2.5 Pro here; the Flash version was a dud.) It’s also rather good at meshing new stuff and old stuff together. However, I found Gemini much worse at staying coherent when generating a longer text/story: when prompted to generate prose around a sequence of logical arguments, it mixes parts of the arguments together, or reorders them, and then becomes unable to fix these errors when prompted afterwards. (Claude and GPT did not have this issue. I did not compare with o3.)

The part that made me raise my eyebrows is that I had carried out a simpler version of this experiment two months ago, and at the time I felt that ChatGPT (o4-mini) was clearly superior for general tasks, and Claude superior for technical tasks. Gemini 2.5 Pro really moved the goalposts here, in such a short time to boot. This makes me curious, though: where will the next spearhead be? Should I automate my experiment somehow to stay on top of things? I hope to spend some quality time with other curious folk in my town and discuss these things together.

Incidentally, OpenRouter is a gamechanger and I highly recommend it.
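If I do end up automating the comparison, OpenRouter would make the plumbing trivial. Below is a minimal sketch of how such a comparison could be scripted against OpenRouter’s OpenAI-compatible API, assuming the openai Python package; the model identifiers and the prompt are illustrative, not the ones I actually used.

```python
# Minimal sketch: send the same prompt to several models via OpenRouter's
# OpenAI-compatible endpoint and print the answers side by side.
# Assumes the `openai` package (pip install openai); the model identifiers
# and the prompt below are placeholders, not my actual experiment.

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter API key
)

MODELS = [
    "openai/gpt-4.1",
    "anthropic/claude-sonnet-4",
    "google/gemini-2.5-pro",
]

PROMPT = "Here is a private data set: ... What latent patterns do you see?"

def ask(model: str, prompt: str) -> str:
    """Send the prompt to one model and return its answer text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Collect one answer per model so they can be compared side by side.
answers = {model: ask(model, PROMPT) for model in MODELS}
for model, text in answers.items():
    print(f"--- {model} ---\n{text}\n")
```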

❦❦❦

All this being said, there’s also another thing I learned through all these “experiments”. I do not like the way I think after I interact with these tools. The process of reading subtly-wrong answers, over and over, pointing out the mistakes, hoping for a correction and often not getting it, feels not too different from spending too much time with a bad person trying to gaslight me constantly.

Jim Rohn once mused, “We are the average of the five people we spend the most time with”. There is real neuropsychology behind this observation. Meanwhile, right now, many folk (including me) are spending more time with ChatGPT (and other bots) than with real people. That will change us in the long term, in ways we do not fully understand yet.

I feel lucky that I still spend more time reading texts written by real humans than AI-generated texts. So I can still feel the difference, and this makes me sensitive to when AI-generated content twists my thinking. It feels distinctly “icky”! And I know how to take breaks away from it, meditate, read other things, etc. I just wonder how many other people realize this, and/or have the luxury of a more diverse set of inputs.

Beyond the pricing of Cursor & co, maybe people’s attitude vis-à-vis AI generators (active vs. passive) will be the cause of the greatest social divide in the coming decade. Glimpses of an unexpected-looking zombie war loom at the corner of my imagination. (I had already had these thoughts about social media and the effects of “doom scrolling”. Now I feel there are two zombie viruses to deal with.)

❦❦❦

Here are two thought-provoking articles you can take away:

In why you should be smelling more things, Adam Aleksic points out that we are in the midst of an “authenticity crisis”, and the best activism we can do to counter this trend is to go out and literally smell stuff. He also wrote other very good things, which I will let you contemplate.

Meanwhile, in Smartphones: Parts of Our Minds? Or Parasites?, Rachael Brown and Robert Brooks offer a view that smartphones are best seen as symbiotic with us, often with parasitic traits. My personal experience is that this is more true of certain devices than others, and my deliberate choice to use older and more limited technology has been shielding me from the more “parasitic” impediments described in the article.

❦❦❦
