So far in this series of posts I’ve mainly talked about LLMs as models and synthesizers of natural language. I’ve been interested in the way having access to a pattern synthesizer operating in the domain of natural language affects our self-understanding as cognitive agents who narrate our activities and reason by means of rhetoric. I’ve paid less attention to LLMs as pattern synthesizers operating in the domain of formal languages such as computer programming languages — as code reviewers, debuggers, and sometimes even authors — because this domain has seemed to me to require separate investigation.
To put it very schematically, having been surprised by the ability of LLMs to do things with meaning (semantic plausibility), I have been separately and equally surprised by their ability to do things with structure (syntactic competence). It isn’t immediately obvious that these are two facets of the same surprising thing, although I think we must inevitably conclude that they are. The question is therefore: how?
I don’t think I have ever seen ChatGPT write a grammatically ill-formed or unclear English sentence. It’s “better” (in the sense of “more reliably regular”) at grammar than the broad majority of human beings, which is interesting given that its training corpus presumably includes a lot of quite grammatically variant text. It can also translate quite convincingly into Middle English, Cockney or AAVE. So there is already evident a kind of syntactic competence in the construction and interpretation of complex sentences. Here, for example, is a West Country Judith Butler:
Well now, it used t’be folks reckoned capital were what shaped how people gets on wi’ each other, all tidy-like and samey most the time. But then folk started thinkin’ more like, power do work different — it comes round again, joins up in odd ways, gets said different each time. An’ that got ‘em ponderin’ on time itself, like how change don’t just sit outside structure, but’s right there in it. So they moved off from that old Althusser feller’s way o’ seein’ big fixed systems, and started talkin’ more ‘bout how structure might come an’ go, dependin’ on where power crops up, and how it gets spoke or done again.
This might be considered cheating a bit, as Butler’s original passage is notorious, and widely glossed, so ChatGPT has more to draw on than the pure letter of the text in front of it. But it nevertheless demonstrates a fairly routine capability of these kinds of systems.
With this in mind, my surprise at LLMs’ fluency with code perhaps needs a bit of explaining. First of all, it’s worth noting that the models that work with code have typically been trained differently, on code-rich corpora with programming language-aware tokenisation. However, the underlying architecture is the same: pattern prediction contextualised over a sliding attention window. My intuition — partially incorrect, as it turns out — was that the sufficiency of this approach for modelling the fuzzy, redundant and forgiving syntax of natural language would break down when faced with the brittle, recursive, and referential structure of code, where validity often hinges on long-range dependencies, strict typing, or consistent state across widely separated parts of a program. It’s one thing to recognise a for-loop, another to understand the call-graph, nested scopes, and variable bindings by which it is circumstanced.
As tends to be the case with LLMs, the source of the surprise isn’t a magical “deep” capability of the model, but the unexpected availability within its domain of heuristics that its mode of prediction and synthesis can successfully bind to. The world of code, it turns out, is more “shallow” than I had supposed. By this I mean that it’s usually possible to make good(-ish) guesses about what’s going on within a given section of code without having a complete and detailed map of its wider syntactic environment. And this is, I think, for the obvious (in retrospect) reason that the human programmers who produced the initial training data didn’t have a complete and detailed map of the wider syntactic environment either, but relied on conventions of usage, pattern languages both formal and informal, and a broad professional courtesy of not making other programmers work too hard to understand what’s going on. The LLM’s surprising competence reveals the degree to which our own programming languages are already shaped by a kind of narrativised surface regularity — something historically produced to be intelligible to ourselves, and hence trivially replayable by a non-understanding agent.
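To make the point concrete, here is a small invented Kotlin fragment (none of the types or names come from any real codebase). A reader who has never seen the surrounding system can still make a decent guess at what it does, purely from naming conventions and a familiar filter-then-act shape:

import java.time.LocalDate

// Invented types, for illustration only.
data class User(val id: Int, val lastLogin: LocalDate?)

interface UserRepository {
    fun findAll(): List<User>
    fun deactivate(id: Int)
}

// Without tracing the wider call-graph, the names and the familiar shape are
// enough to guess what this does; that surface regularity is exactly what a
// pattern synthesiser can bind to.
fun deactivateDormantUsers(repo: UserRepository, cutoffDays: Long) {
    val cutoff = LocalDate.now().minusDays(cutoffDays)
    repo.findAll()
        .filter { it.lastLogin?.isBefore(cutoff) == true }
        .forEach { repo.deactivate(it.id) }
}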
This is key to a question I was considering the other day, about whether LLM-generated code needed to be “good” code to be readable and correctly modifiable by subsequent passes of LLM-directed automation. Human-authored code is “good” if it fulfils two purposes: it must tell the computer what to do, and it must tell other programmers what it is telling the computer to do. Code that succeeds at the first of these things and fails at the second is a social menace, because it delivers functionality that users may come to rely on — generally a good thing — but is opaque and resistant to modification, and introduces brittleness and friction into the ongoing process of adjusting systems to meet changing and expanding needs.
LLMs are quite good at untangling “spaghetti” code and deciphering data and execution flows that a human reader might find aggravatingly entangled. If they sometimes have a slightly better-than-human fluency with bad code, it’s because they rely less on human cognitive affordances such as metaphor and narrative to find their way around: for them a statistically recurring pattern in a sequence of tokens is recognisable as a pattern regardless of what kinds of tokens are involved, or whether they are ornamented by heuristic and mnemonic devices that help human programmers to keep track of commitments and consequences. “Minify” a program, syntactically condensing it — stripping spacing and indentation, renaming variables to single letters, and so on — and an LLM will parse it undeterred where a human reader might see only line noise.
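By way of a made-up illustration (not taken from any real program), here is the same trivial function written for human eyes and then minified in the sense just described; the statistical shape of “sum the values, then divide by the count” survives even when every human-facing affordance has been stripped away:

// Written for human readers: layout, indentation, descriptive names.
fun averageOf(values: List<Double>): Double {
    if (values.isEmpty()) return 0.0
    var total = 0.0
    for (value in values) {
        total += value
    }
    return total / values.size
}

// The same logic, syntactically condensed: single-letter names, no layout.
fun a(v: List<Double>): Double { if (v.isEmpty()) return 0.0; var t = 0.0; for (x in v) t += x; return t / v.size }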
This is evidently true for small passages of code, but is it true in the large? Given a “nightmare” system without any attempt at modular decomposition, where all state is held in global variables subject to ad hoc overwriting from every direction, an LLM will have no particular advantage over a human being in making informed guesses about the likely behaviour of a local portion of code. And indeed even in comparatively well-factored codebases, systematic change crossing module boundaries is something that LLM coding assistants seem to struggle with — they can make what look to them like the “right” changes across multiple sites, but are often terrible at keeping those changes aligned.
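A caricature of the kind of code I mean (entirely invented): looking only at applyDiscount, neither a human nor an LLM has much basis for guessing what total or discountRate hold when it runs, because anything anywhere else in the program may have overwritten them.

// An invented "nightmare" fragment: all state global, overwritten ad hoc.
var total = 0.0
var discountRate = 0.0
var mode = "retail"

fun recalc(items: List<Double>) {
    total = items.sum()
    discountRate = if (mode == "wholesale") 0.2 else 0.05
}

fun applyDiscount() {
    // What do total and discountRate hold here? It depends entirely on which
    // of the other functions happened to run last, somewhere else in the program.
    total -= total * discountRate
}

fun resetForNewOrder() {
    total = 0.0
    discountRate = 0.0
    mode = "retail"
}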
I noticed a telling glitch when working with ChatGPT to prototype a Kotlin DSL (Domain-Specific Language: a small special-purpose language, in this case embedded within a general-purpose host language) for writing Z80 assembler. I wanted to see whether I could leverage the power of a programming language I knew well to help me write assembly language code for the Sinclair ZX Spectrum, by expressing macros (templates for common forms) in the higher-level language.
In assembler, a loop in execution flow is created through very concrete logic: store the number of times you want the loop to run in a register, and at the end of the loop decrement the register, then make a conditional jump back to the start of the loop if the value hasn’t yet reached zero. If you happen to overwrite the register with another value inside the loop, or to jump somewhere outside it, the loop won’t behave as expected. It seemed to me that my higher-level language could express this behaviour as a re-usable construct.
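Something along these lines is what I had in mind. It is only a sketch: the Emitter class and the countedLoop helper below are hypothetical stand-ins rather than the real DSL’s builders, but they show the shape of the construct, a single run-time loop using the Z80’s B register and DJNZ instead of repeated emission of the body.

// A sketch only: Emitter and countedLoop are hypothetical stand-ins for the
// real DSL's builders, not actual API.
class Emitter {
    val lines = mutableListOf<String>()
    fun emit(line: String) { lines.add(line) }
}

// Emits ONE loop into the generated assembly: load the count into B, lay down
// the body once, then DJNZ (decrement B, jump back while non-zero). The Kotlin
// lambda runs once, at generation time; the repetition happens at run time.
fun Emitter.countedLoop(count: Int, label: String, body: Emitter.() -> Unit) {
    require(count in 1..256) { "B holds an 8-bit count (ld b,0 means 256)" }
    emit("    ld b, ${if (count == 256) 0 else count}")
    emit("$label:")
    body()
    emit("    djnz $label")
}

fun main() {
    val asm = Emitter()
    asm.countedLoop(16, "copy_loop") {
        emit("    ld a, (hl)")
        emit("    ld (de), a")
        emit("    inc hl")
        emit("    inc de")
    }
    println(asm.lines.joinToString("\n"))
}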
However, when I asked ChatGPT for an example, it showed me a nested loop in Kotlin that, instead of writing assembly code to perform an action 2048 times, would have emitted 2048 copies of the assembly code to perform that action once:
fun emitBlitter(bufferAddr: Int, screenAddr: Int) = macro {
    for (y in 0 until 128) {
        for (x in 0 until 16) {
            val srcOffset = y * 16 + x
            val dstAddr = zxScreenAddr(y, x * 8)
            ld("hl", bufferAddr + srcOffset)
            ld("a", "(hl)")
            ld("de", dstAddr)
            ld("(de)", "a")
        }
    }
}
Given the memory constraints of the (48K) ZX Spectrum, this is very undesirable behaviour! But it also demonstrates a failure to track the distinction between two abstraction layers: that of the DSL, which organises the generation of code, and that of the generated code itself.
As I went on with the session, it became apparent that the LLM’s pattern synthesis capabilities simply couldn’t help blurring the lines between these layers. ChatGPT could discuss fluently with me the need to uphold this distinction, and could “understand” and explain its mistake when I pointed it out, but it couldn’t reliably operate at both a “concrete” and a “meta” level at the same time. Something like this blurring often happens when the LLM is asked to work across different frames (personae, descriptive levels, etc.), because it doesn’t do the book-keeping needed to remember which frame it is currently meant to be inside: it ends up mixing modes and signals.
For programming this can be a serious problem, though not for all kinds of programming: there are many everyday cases where it’s just a matter of tweaking some procedural logic to change what we’re telling the computer to do, with little need for holistic structural awareness. The early stages of running up an application prototype are often like this, which I think is why the effectiveness of LLM tooling so impresses people whose primary ambition is to make “apps” and whose awareness of technical systems barely extends past the buttons and dials on the UI.
On the other hand, I’ve found ChatGPT a valuable companion when plotting out architectural decisions, such as those underlying versatile, an IoC (Inversion of Control) library for Python. Versatile makes a number of trade-offs based on things I like and dislike about the Java IoC framework Spring, and on things that Python users are likely to find intuitive (“Pythonic”) or unintuitive (“un-Pythonic”). In several matters, ChatGPT demonstrated ready command of the problem space, fetching examples from a number of languages and use cases, and helped me think through how I wanted my solution to sit within that space.
So the problem is not that it is only suited to making small local changes in procedural code, and cannot “think” more broadly: the problem is that pattern synthesis is an unreliable tool for bridging across technical registers, weaving together local ontologies (the kind of thing you do in low-level code, versus the kind of thing you do in high-level design) into systemic consistency. It is a fluent generic mimic when given clear genre boundaries within which to operate, but it lacks dialectical capability (possibly one of the hallmarks of “true” AGI). Which is unfortunate, because that — once you get serious about it — is what programming is.