SrslyJosh 6 days ago

> Reading through these commits sparked an idea: what if we treated prompts as the actual source code? Imagine version control systems where you commit the prompts used to generate features rather than the resulting implementation.

Please god, no, never do this. For one thing, why would you not commit the generated source code when storage is essentially free? That seems insane for multiple reasons.

> When models inevitably improve, you could connect the latest version and regenerate the entire codebase with enhanced capability.

How would you know if the code was better or worse if it was never committed? How do you audit for security vulnerabilities or debug with no source code?

  • gizmo686 5 days ago

    My work has involved a project that has been almost entirely generated code for over a decade. Not AI-generated; the actual work of the project is in creating the code generator.

    One of the things we learned very quickly was that having generated source code in the same repository as actual source code was not sustainable. The nature of reviewing changes is just too different between them.

    Another thing we learned very quickly was that attempting to generate code, then modify the result is not sustainable; nor is aiming for a 100% generated code base. The end result of that was that we had to significantly rearchitect the project for us to essentially inject manually crafted code into arbitrary places in the generated code.

    Another thing we learned is that any change in the code generator needs to have a feature flag, because someone was relying on the old behavior.

    • saagarjha 5 days ago

      I think the biggest difference here is that your code generator is probably deterministic and you likely are able to debug the results it produces rather than treating it like a black box.

      • buu700 5 days ago

        Overloading of the term "generate" is probably creating some confused ideas here. An LLM/agent is a lot more similar to a human in terms of its transformation of input into output than it is to a compiler or code generator.

        I've been working on a recent project with heavy use of AI (probably around 100 hours of long-running autonomous AI sprints over the last few weeks), and if you tried to re-run all of my prompts in order, even using the exact same models with the exact same tooling, it would almost certainly fall apart pretty quickly. After the first few, a huge portion of the remaining prompts would be referencing code that wouldn't exist and/or responding to things that wouldn't have been said in the AI's responses. Meta-prompting (prompting agents to prepare prompts for other agents) would be an interesting challenge to properly encode. And how would human code changes be represented, as patches against code that also wouldn't exist?

        The whole idea also ignores that AI being fast and cheap compared to human developers doesn't make it infinitely fast or free, or put it in the same league of quickness and cheapness as a compiler. Even if this were conceptually feasible, all it would really accomplish is making it so that any new release of a major software project takes weeks (or more) of build time and thousands of dollars (or more) burned on compute.

        It's an interesting thought experiment, but the way I would put it into practice would be to use tooling that includes all relevant prompts / chat logs in each commit message. Then maybe in the future an agent with a more advanced model could go through each commit in the history one by one, take notes on how each change could have been better implemented based on the associated commit message and any source prompts contained therein, use those notes to inform a consolidated set of recommended changes to the current code, and then actually apply the recommendations in a series of pull requests.

      • tptacek 5 days ago

        People keep saying this and it doesn't make sense. I review code. I don't construct a theory of mind of the author of the code. With AI-generated code, if it isn't eminently reviewable, I reflexively kill the PR and either try again or change the tasking.

        There's always this vibe that, like, AI code is like an IOCCC puzzle. No. It's extremely boring mid-code. Any competent developer can review it.

        • buu700 5 days ago

          I assumed they were describing AI itself as a black box (contrasting it with deterministic code generation), not the output of AI.

          • tptacek 5 days ago

            Right, I get that, and an LLM call by itself clearly is a black box. I just don't get why that's supposed to matter. It produces an artifact I can (and must) verify myself.

            • buu700 5 days ago

              Because if the LLM is a black box and its output must ultimately be verified by humans, then you can't treat conversion of prompts into code as a simple build step as though an AI agent were just some sort of compiler. You still need to persist the actual code in source control.

              (I assume that isn't what you're actually arguing against, in which case at least one of us must have misread something from the parent chain.)

              • tptacek 5 days ago

                Right, you definitely can't do that. People do talk as if the question was whether we could stick LLM calls into Makefiles. Nobody would ever do that, at least not with the technology we have at hand.

                • baq 4 days ago

                  Ever is a long time. I expect first products built this exact way working reliably and having happy customers in the next five years, pessimistically. Optimistically this is probably happening somewhere as we speak.

        • daveguy 4 days ago

          You construct a theory of mind of the author of a work whether you recognize you are doing it or not. There are certain things everyone assumes about code based on the fact that we expect someone who writes code to have simple common sense. Which, of course, LLMs do not.

          When you are talking to a person and interpreting what they mean, you have an inherent theory of mind whether you are consciously thinking "how does this person think" or not. It's how we communicate with other people efficiently and it's one of the many things missing with LLM roulette. It's not that you generate a new "theory of mind" with every interaction. It's not something you have to consciously do (although you can, like breathing).

    • overfeed 5 days ago

      > One of the things we learned very quickly was that having generated source code in the same repository as actual source code was not sustainable

      My rule of thumb is to have both in the same repo, but treat generated code like binary data. This was informed by a time I was burned by a tooling regression that broke the generated code, and the investigation was complicated by having to correlate commits across different repositories.

      • dkubb 5 days ago

        I love having generated code in the same repo as the generator because with every commit I can regenerate the code and compare it to make sure it stays in sync. Then it forms something similar to a golden test, where if something unexpected changes it gets noticed in review.
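
        A minimal sketch of that kind of check, assuming a hypothetical generate.py that writes its output into generated/ (the names are illustrative, not any particular project's layout):

          #!/usr/bin/env python3
          """Fail CI if the committed generated code drifts from what the generator produces."""
          import filecmp
          import subprocess
          import sys
          import tempfile
          from pathlib import Path

          def main() -> int:
              committed = Path("generated")  # generated code checked into the repo
              with tempfile.TemporaryDirectory() as tmp:
                  fresh = Path(tmp) / "generated"
                  # Re-run the (hypothetical) generator into a scratch directory.
                  subprocess.run([sys.executable, "generate.py", "--out", str(fresh)], check=True)
                  cmp = filecmp.dircmp(committed, fresh)
                  # Top-level comparison only, for brevity; a real check would recurse.
                  drift = cmp.diff_files + cmp.left_only + cmp.right_only
                  if drift:
                      print("Generated code is out of sync:", drift)
                      return 1
              print("Generated code matches the generator output.")
              return 0

          if __name__ == "__main__":
              sys.exit(main())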

    • mschild 5 days ago

      > One of the things we learned very quickly was that having generated source code in the same repository as actual source code was not sustainable.

      Keeping the prompts or other commands in a separate repository is fine, but not committing the generated code at all I find questionable at best.

      • diggan 5 days ago

        If you can 100% reproduce the same generated code from the same prompts, even 5 years later, given the same versions and everything, then I'd say "Sure, go ahead and don't save the generated code, we can always regenerate it". As someone who spent some time in frontend development: we've been doing it like that for a long time with (MB+ of) generated code, and keeping it in scm just isn't feasible long-term.

        But given this is about LLMs, which people tend to run with temperature>0, this is unlikely to be true, so then I'd really urge anyone to actually store the results (somewhere, maybe not in scm specifically) as otherwise you won't have any idea about what the code was in the future.

        • overfeed 5 days ago

          > If you can 100% reproduce the same generated code from the same prompts, even 5 years later

          Reproducible builds with deterministic stacks and local compilers are far from solved. Throwing in LLM randomness just makes not committing the generated code an even spicier proposition.

        • layer8 4 days ago

          Temperature > 0 isn’t a problem as long as you can specify/save the random seed and everything else is deterministic. Of course, “as long as” is still a tall order here.
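
          As a sketch of what "saving the seed" can look like in practice: some hosted APIs (OpenAI's, for example) expose a best-effort seed parameter plus a system_fingerprint identifying the backend configuration, though that still falls short of a hard determinism guarantee. Model name and prompt are placeholders.

            from openai import OpenAI

            client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model
                messages=[{"role": "user", "content": "Write a function that parses RFC 3339 timestamps."}],
                temperature=0,  # greedy(ish) decoding
                seed=42,        # best-effort reproducibility, not a guarantee
            )

            # If system_fingerprint changes between runs, identical (prompt, seed)
            # pairs may still yield different output.
            print(resp.system_fingerprint)
            print(resp.choices[0].message.content)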

          • dragonwriter 4 days ago

            My understanding is that the implementation of modern hosted LLMs is nondeterministic even with known seed because the generated results are sensitive to a number of other factors including, but not limited to, other prompts running in the same batch.

            • westurner 4 days ago

              Gemini, for example, launched implicit caching on or about 2025-05-08: https://developers.googleblog.com/en/gemini-2-5-models-now-s... :

              > Now, when you send a request to one of the Gemini 2.5 models, if the request shares a common prefix as one of previous requests, then it’s eligible for a cache hit. We will dynamically pass cost savings back to you, providing the same 75% token discount.

              > In order to increase the chance that your request contains a cache hit, you should keep the content at the beginning of the request the same and add things like a user's question or other additional context that might change from request to request at the end of the prompt.

              From https://news.ycombinator.com/item?id=43939774 re: same:

              > Does this make it appear that the LLM's responses converge on one answer when actually it's just caching?

          • westurner 4 days ago

            Have any of the major hosted LLMs ever shared the temperature parameters that prompts were generated with?

      • djtango 5 days ago

        I didn't read it as that - If I understood correctly, generated code must be quarantined very tightly. And inevitably you need to edit/override generated code and the manner by which you alter it must go through some kind of process so the alteration is auditable and can again be clearly distinguished from generated code.

        Tbh this all sounds very familiar and like classic data management/admin systems for regular businesses. The only difference is that the data is code and the admins are the engineers themselves so the temptation to "just" change things in place is too great. But I suspect it doesn't scale and is hard to manage etc.

      • saagarjha 5 days ago

        I feel like a compiler is in a sense a code generator where you don't commit the actual output

        • lelanthran 5 days ago

          > I feel like a compiler is in a sense a code generator where you don't commit the actual output

          Compilers are deterministic. Given the same input you always get the same output so there's no reason to store the output. If you don't get the same output we call it a compiler bug!

          LLMs do not work this way.

          (Aside: Am I the only one who feels that the entire AI industry is predicated on replacing only development positions? We're looking at, what, 100bn invested, with almost no reduction in customers' operating costs unless the customer has developers.)

          • cesarb 5 days ago

            > Compilers are deterministic. Given the same input you always get the same output

            Except when they aren't. See for instance https://gcc.gnu.org/onlinedocs/gcc-15.1.0/gcc/Developer-Opti... or the __DATE__/__TIME__ macros.

            • lelanthran 5 days ago

              From the link:

              > You can use the -frandom-seed option to produce reproducibly identical object files.

              Deterministic.

              Also, with regard to __DATE__/__TIME__ macros, those are deterministic, because the current date and time are part of the inputs.

              • wat10000 4 days ago

                Determinism is predicated on what you consider to be the relevant inputs.

                Many compilers are not deterministic when only considering the source files or even the current time. For example, any output produced by iterating over a hash table with pointer keys is likely to depend on ASLR and thus be nondeterministic unless you consider the ASLR randomization to be one of the inputs. Any output that depends on directory iteration order is likely to be consistent on a single computer but vary across computers.

                LLMs aren’t magic. They’re software running on inputs like anything else, which means they’re deterministic if you constrain all the inputs.
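
                A small illustration of that kind of hidden input, assuming CPython (whose default object hash is derived from the object's memory address): the "source" below never changes, but the printed order can differ from run to run.

                  class Node:
                      def __init__(self, name: str):
                          self.name = name

                  # Identity-hashed objects: their hashes depend on where the allocator put them.
                  nodes = {Node(n) for n in ["alpha", "beta", "gamma", "delta", "epsilon"]}

                  # Set iteration follows hash-table slots, i.e. memory addresses, so two runs
                  # of this "deterministic" program can print the names in different orders.
                  print([node.name for node in nodes])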

                • kiitos 4 days ago

                  LLMs are 100% absolutely not deterministic even if you constrain all of their inputs. This is obviously the case, apparent from any even cursory experimentation with any LLM available today. Equating the determinism of a compiler given some source code as input with the determinism of an LLM given some user prompt as input is disingenuous in the extreme!

                  • wat10000 4 days ago

                    Most LLM software isn’t deterministic, sure. But LLMs are just doing a bunch of arithmetic. They can be 100% deterministic if you want them to be.

                    • kiitos 4 days ago

                      In practice, they definitely are not.

                      • wat10000 4 days ago

                        Only because nobody cares to. Just like compilers were not deterministic in practice until reproducible builds started getting attention.

          • tptacek 5 days ago

            Why does it matter to you if the code generator is deterministic? The code is.

            If LLM generation was like a Makefile step, part of your build process, this concern would make a lot of sense. But nobody, anywhere, does that.

            • rawling 4 days ago

              > If LLM generation was like a Makefile step, part of your build process, this concern would make a lot of sense. But nobody, anywhere, does that.

              Top level comment of this thread, quoting the article:

              > Reading through these commits sparked an idea: what if we treated prompts as the actual source code? Imagine version control systems where you commit the prompts used to generate features rather than the resulting implementation.

              • tptacek 4 days ago

                Ohhhhhhh. Thanks for clearing this up for me. I felt like I was going a little crazy (because, having missed that part of the thread, I sort of was). Appreciated!

          • Atotalnoob 5 days ago

            LLMs CAN be deterministic. You can control the temperature to get the same output repeatedly.

            Although I don’t really understand why you’d only want to store prompts…

            What if that model is no longer available?

            • saagarjha 5 days ago

              They're typically not, since they rely on operations that aren't deterministic (e.g. atomics).

        • mschild 5 days ago

          Sure, but compilers are arguably deterministic: same code input, same output. LLMs certainly are not.

          • saagarjha 5 days ago

            Yeah, I fully agree (in the other comments here, no less). I just think "I don't commit my code" reflects a specific mindset about what code actually is.

    • skywhopper 5 days ago

      There’s a huge difference between deterministic generated code and LLM generated code. The latter will be different every time, sometimes significantly so. Subsequent prompts would almost immediately be useless. “You did X, but we want Y” would just blow up if the next time through the LLM (or the new model you’re trying) doesn’t produce X at all.

    • cimi_ 5 days ago

      I will guess that you are generating orders of magnitude more lines of code with your software than people do when building projects with LLMs - if this is true I don't think the analogy holds.

    • david-gpu 5 days ago

      Please tell us which company you are working for so that we don't send our resumes there.

      Jokes aside, I have worked in projects where auto-generating code was the solution that was chosen and it's always been 100% auto-generated, essentially at compilation time. Any hand-coded stuff needed to handle corner cases or glue pieces together was kept outside of the code generator.

    • potholereseller 4 days ago

      > The end result of that was that we had to significantly rearchitect the project for us to essentially inject manually crafted code into arbitrary places in the generated code.

      This sounds like putting assembly in C code. What was the input language? These two bits ("Not AI generated", "a feature flag") suggest that the code generator didn't have a natural language frontend, but rather a real programming language frontend.

      Did you or anyone else inform management that a code generator is essentially a compiler with extra characters? [0] If yes, then what was their response?

      I am concerned that your current/past work might have been to build a Compiler-as-a-Service (CaaS). [1] No shade, I'm just concerned that other managers might read all this and then try to build their own CaaS.

      [0] Yes, I'm implying that LLMs are compilers. Altman has played us for fools; he's taught a billion people the worst part of programming: fighting the compiler to give you the output you want.

      [1] Compiler-as-a-Service is the future our forefathers couldn't imagine warning us about. LLMs are CaaS's; time is a flat circle; where's the exit?; I want off this ride.

      • gizmo686 4 days ago

        The input was a highly structured pdf specification of a family of protocols and formats. Essentially, a real language with very stupid parsing requirements and the occasional typo. The PDF itself was clearly intended for human consumption, but I'm 99% sure that someone somewhere at some point had a machine readable specification that was used to generate most of the PDF. Sadly, no one seems to know where to even start looking for such a thing.

        > Did you or anyone else inform management that a code generator is essentially a compiler with extra characters?

        The output of the code generator was itself fed into a compiler that we also built; and about half of the codegen team (myself included) were themselves developers for the compiler.

        I think management is still scarred by the 20-year-old M4 monstrosity we are still maintaining because writing a compiler would be "too complex".

  • lowsong 5 days ago

    I'm the first to admit that I'm an AI skeptic, but this goes way beyond my views about AI and is a fundamentally unsound idea.

    Let's assume that a hypothetical future AI is perfect. It will produce correct output 100% of the time, with no bugs, errors, omissions, security flaws, or other failings. It will also generate output instantly and cost nothing to run.

    Even with such perfection this idea is doomed to failure because it can only write code based on information in the prompt, which is written by a human. Any ambiguity, unstated assumption, or omission would result in a program that didn't work quite right. Even a perfect AI is not telepathic. So you'd need to explain and describe your intended solution extremely precisely without ambiguity. Especially considering in this "offline generation" case there is no opportunity for our presumed perfect AI to ask clarifying questions.

    But, by definition, any language which is precise and clear enough to not produce ambiguity is effectively a programming language, so you've not gained anything over just writing code.

    • gitgud 4 days ago

      This is so eloquently put and really describes the absurdity of the notion that code itself will become redundant to building a software system

    • handoflixue 5 days ago

      We already have AI agents that can ask a human for help / clarification in those cases.

      It could also analyze the company website, marketing materials, and so forth, and use that to infer the missing pieces. (Again, something that exists today)

      • layer8 4 days ago

        If the AI has to ask for clarification, you can’t run it as a reproducible build step as envisaged. It’s as if your compiler would pause to ask clarifying questions on each CI run.

        If the company website, marketing materials, and so forth become part of the input, you’ll have to put those in version control as well, as any change is likely to result in a different application being generated (which may or may not be what you want).

  • fastball 6 days ago

    The idea as stated is a poor one, but a slight reshuffling and it seems promising:

    You generate code with LLMs. You write tests for this code, either using LLMs or on your own. You of course commit your actual code: it is required to actually run the program, after all. However you also save the entire prompt chain somewhere. Then (as stated in the article), when a much better model comes along, you re-run that chain, presumably with prompting like "create this project, focusing on efficiency" or "create this project in Rust" or "create this project, focusing on readability of the code". Then you run the tests against the new codebase and if the suite passes you carry on, with a much improved codebase. The theoretical benefit of this over just giving your previously generated code to the LLM and saying "improve the readability" is that the newer (better) LLM is not burdened by the context of the "worse" decisions made by the previous LLM.

    Obviously it's not actually that simple, as tests don't catch everything (tho with fuzz testing and complete coverage and such they can catch most issues), but we programmers often treat them as if they do, so it might still be a worthwhile endeavor.
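
    A rough sketch of that replay step, assuming the prompt chain is saved as JSON and a hypothetical run_agent helper standing in for whatever coding agent is used (none of these names refer to a real tool):

      import json
      import subprocess
      from pathlib import Path

      def run_agent(model: str, prompt: str, cwd: str) -> None:
          # Placeholder for whatever actually applies a prompt to the working tree
          # (Claude Code, aider, a custom harness, ...).
          raise NotImplementedError("agent-specific")

      def regenerate(prompt_chain_path: str, workdir: str, model: str) -> bool:
          """Replay a saved prompt chain against a newer model, then gate on the test suite."""
          for step in json.loads(Path(prompt_chain_path).read_text()):
              run_agent(model=model, prompt=step["prompt"], cwd=workdir)
          # The committed test suite is the acceptance gate for the regenerated code.
          return subprocess.run(["pytest", "-q"], cwd=workdir).returncode == 0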

    • stingraycharles 6 days ago

      Means the temperature should be set to 0 (which not every provider supports) so that the output becomes entirely deterministic. Right now with most models if you give the same input prompt twice it will give two different solutions.

      • NitpickLawyer 6 days ago

        Even at temp 0, you might get different answers, depending on your inference engine. There might be hardware differences, as well as software issues (e.g. vLLM documents this, if you're using batching, you might get different answers depending on where in the batch sequence your query landed).

      • weird-eye-issue 6 days ago

        Claude Code already uses a temperature of 0 (just inspect the requests) but it's not deterministic

        Not to mention it also performs web searches, web fetching etc which would also make it not deterministic

      • singhrac 5 days ago

        Production inference is not deterministic because of sharding (i.e. parameter weights on several GPUs on the same machine or MoE), timing-based kernel choices (e.g. torch.backends.cudnn.benchmark), or batched routing in MoEs. Probably best to host a small model yourself.

      • derwiki 6 days ago

        Two years ago when I was working on this at a startup, setting OAI models’ temp to 0 still didn’t make them deterministic. Has that changed?

      • afiori 5 days ago

        Do LLM inference engines have a way to seed their randomness, so as to have reproducible outputs but still some variance if desired?

        • bavell 5 days ago

          Yes, although it's not always exposed to the end user of LLM providers.

      • fastball 6 days ago

        I would only care about more deterministic output if I was repeating the same process with the same model, which is not the point of the exercise.

      • shthed 4 days ago

        This is good: run it n times, have the model review them and pick the best one.
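
        A sketch of that best-of-n loop with the model as judge (model name and task are placeholders; any chat-completion API with a temperature knob would do):

          from openai import OpenAI

          client = OpenAI()
          TASK = "Write a Python function that merges overlapping integer intervals."

          # 1. Generate n candidates; temperature > 0 so they actually differ.
          candidates = [
              client.chat.completions.create(
                  model="gpt-4o-mini",
                  messages=[{"role": "user", "content": TASK}],
                  temperature=0.8,
              ).choices[0].message.content
              for _ in range(3)
          ]

          # 2. Ask the model to review the candidates and pick one.
          review = "\n\n".join(f"--- Candidate {i} ---\n{c}" for i, c in enumerate(candidates))
          verdict = client.chat.completions.create(
              model="gpt-4o-mini",
              messages=[{"role": "user", "content": f"Task: {TASK}\n\n{review}\n\nReply with only the number of the best candidate."}],
              temperature=0,
          ).choices[0].message.content

          print("Model picked candidate", verdict)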

    • maxemitchell 5 days ago

      Your rephrasing better encompasses my idea, and I should have emphasized in the post that I do not think this is a good idea (nor possible) right now; it was more of a hand-wavy "how could we rethink source control in a post-LLM world" passing thought I had while reading through all the commits.

      Clearly it struck a chord with a lot of the folks here though, and it's awesome to read the discourse.

    • layer8 4 days ago

      One reason we treat tests that way is that we don’t generally rewrite the application from scratch, but usually only refactor parts of the existing code or make smaller changes. If we regularly did the former, test suites would have to be much more comprehensive than they typically are. Not to mention that the tests need to change when the API changes, so you generally have to rewrite the unit tests along with the application and can’t apply them unchanged.

  • rectang 6 days ago

    >> what if we treated prompts as the actual source code?

    You would not do this because: unlike programming languages, natural languages are ambiguous and thus inadequate to fully specify software.

    • squillion 5 days ago

      Exactly!

      > this assumes models can achieve strict prompt adherence

      What does strict adherence to an ambiguous prompt even mean? It’s like those people asking Babbage if his machine would give the right answer when given the wrong figures. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a proposition.

    • a012 6 days ago

      Prompts are like user stories on the board, and as with engineers, the generated source code can vary depending on the model's understanding. Saying the prompts could be the actual code is a wrong and dangerous thought.

  • Xelbair 5 days ago

    Worse. Models aren't deterministic! They use a temperature value to control randomness, just so they can escape local minima!

    Regenerated code might behave differently, have different bugs(worst case), or not work at all(best case).

    • chrishare 5 days ago

      Nitpick - it's the ML system that is sampling from model predictions that has a temperature parameter, not the model itself. Temperature and even model aside, there are other sources of randomness like the underlying hardware that can cause the havoc you describe.
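
      For the mechanics: temperature is applied when sampling the next token from the model's output distribution, roughly as in this sketch (NumPy, with made-up logits):

        import numpy as np

        def sample_next_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
            """Sample a token id from raw logits; temperature rescales the distribution."""
            if temperature == 0:
                return int(np.argmax(logits))      # greedy: always the most likely token
            scaled = logits / temperature           # <1 sharpens, >1 flattens the distribution
            probs = np.exp(scaled - scaled.max())   # numerically stable softmax
            probs /= probs.sum()
            return int(rng.choice(len(probs), p=probs))

        rng = np.random.default_rng(seed=42)  # even at temperature > 0, a fixed seed pins the draw
        logits = np.array([2.0, 1.0, 0.5, -1.0])
        print(sample_next_token(logits, temperature=0.7, rng=rng))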

  • kace91 5 days ago

    Plus, commits depend on the current state of the system.

    What sense does “getting rid of vulnerabilities by phasing out {dependency}” make, if the next generation of the code might not rely on the mentioned library at all? What does “improve performance of {method}” mean if the next generation used a fully different implementation?

    It makes no sense whatsoever except for a vibecoder’s script that’s being extrapolated into a codebase.

  • pollinations 5 days ago

    I'd say commit a comprehensive testing system with the prompts.

    Prompts are in a sense what higher-level programming languages were to assembly. Sure, there is a crucial difference, which is reproducibility. I could try to write down my thoughts on why I think in the long run it won't be so problematic. I could be wrong of course.

    I run https://pollinations.ai which serves over 4 million monthly active users quite reliably. It is mostly coded with AI. There hasn't been a significant human commit in about a year. You can check the codebase. It's messy but not more messy than my codebases were pre-LLMs.

    I think prompts + tests in code will be the medium-term solution. Humans will be spending more time testing different architecture ideas and be involved in reviewing and larger changes that involve significant changes to the tests.

    • maxemitchell 5 days ago

      Agreed with the medium-term solution. I wish I put some more detail into that part of the post, I have more thoughts on it but didn't want to stray too far off topic.

  • never_inline 5 days ago

    Apart from obvious non-reproducibility, the other problem is lack of navigable structure. I can't command+click or "show usages" or "show definition" any more.

    • saagarjha 5 days ago

      Just ask the AI for those obviously

  • tayo42 6 days ago

    I'm pretty sure most people aren't doing "software engineering" when they program. There's a whole world of WordPress and Dreamweaver-like programming out there too, where the consequences of messing up aren't really important.

    LLMs can be configured to have deterministic output too

  • dragonwriter 4 days ago

    Also, while it is in principle possible to have a deterministic LLM, the ones used by coding assistants aren't deterministic, so the prompts would not reliably reproduce the same software.

    There is definitely an argument for also committing prompts, but it makes no sense to only commit prompts.

  • 7speter 5 days ago

    I think the author is saying you commit the prompt with the resulting code. You said it yourself, storage is free, so commit the prompt along with the output (not instead of it, if I'm not being clear); it would show the developer's intent and, to some degree, almost always contribute to the documentation process.

    • maxemitchell 5 days ago

      Author here :). Right now, I think the pragmatic thing to do is to include all prompts used in either the PR description and/or in the commit description. This wouldn't make my longshot idea of "regenerating a repo from the ground up" possible, but it still adds very helpful context to code reviewers and can help others on your team learn prompting techniques.
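
      Purely as illustration, one possible shape for that in a commit message (the trailers and content here are made up, not an existing convention):

        Add PKCE support to the OAuth token endpoint

        Prompt: Add PKCE (RFC 7636) to the token exchange. Reject requests
          whose code_verifier does not hash to the stored code_challenge.
        Prompt: Now add a regression test for the S256 verification path.
        Model: <model and version used>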

  • paxys 4 days ago

    Forget different model versions. The exact same model with the exact same prompt will generate vastly different code each subsequent time you invoke it.

  • TechDebtDevin 4 days ago

    Some code is generated on the fly, like LLM UI/UX that writes Python code to do math.

    Idk kinda different tho.

  • visarga 5 days ago

    The idea is good, but we should commit both documentation and tests. They allow regenerating the code at will.

  • Sevii 6 days ago

    There are lots of reasons not to do it. But if LLMs get good enough that it works consistently people will do it anyway.

    • minimaxir 6 days ago

      What will people call it when coders rely on vibes even more than vibe coding?

      • brookst 6 days ago

        Writing specs

        • auggierose 5 days ago

          Exactly my thought. This is just natural language as a specification language.

          • kiitos 5 days ago

            ...as an ambiguous and inadequately-specified specification language.

            • auggierose 4 days ago

              In the end, every specification is specified via natural language, this is just where the buck stops. All math books are written in natural language, even the ones about specification languages.

              • kiitos 4 days ago

                Huh? Is ABNF a "natural language"? Is the Go language spec a "natural language"?

  • croes 5 days ago

    You couldn’t even tell in advance if the prompt produces code at all.

  • mellosouls 5 days ago

    Yes, it's too early to be doing that now, but if you see the move to AI-assisted code as at least the same magnitude of change as the move from assembly to high level languages, the argument makes more sense.

    Nobody commits the compiled code; this is the direction we are moving in, high level source code is the new assembly.

starkparker 5 days ago

> Almost every feature required multiple iterations and refinements. This isn't a limitation—it's how the collaboration works.

I guess that's where a big miss in understanding so much of the messaging about generative AI in coding happens for me, and why the Fly.io skepticism blog post irritated me so much as well.

It _is_ how collaboration with a person works, but when you have to fix the issues that the tool created, you aren't collaborating with a person, you're making up for a broken tool.

I can't think of any field where I'd be expected to not only put up with, but also celebrate, a tool that screwed up and required manual intervention so often.

The level of anthropomorphism that occurs in order to advocate on behalf of generative AI use leads to saying things like "it's how collaboration works" here, when I'd never say the same thing about the table saw in my woodshop, or even the relatively smart cruise control on my car.

Generative AI is still just a tool built by people following a design, and which purportedly makes work easier. But when my saw tears out cuts that I have to then sand or recut, or when my car slams on the brakes because it can't understand a bend in the road around a parking lane, I don't shrug and ascribe them human traits and blame myself for being frustrated over how they collaborate with me.

  • pontifier 3 days ago

    Garbage in, Garbage out... My experiment with vibe coding was quite nice, but it did require a collaborative back and forth, mostly because I didn't know exactly what I wanted. It was easiest to ask for something, then describe how what it gave me needed to be changed. The cost of this type of interaction was much easier than trying to craft the perfect prompt on the first go. My first prompts were garbage, but the output gradually converged to something quite good.

  • hooverd 4 days ago

    Your table saw hungers for fingers.

  • isaacremuant 4 days ago

    Likewise when they use all these benchmarks for "intelligence" and the tool will do the silliest things that you'd consider unacceptable from a person once you've told them a few times not to do a certain thing.

    I love the paradigm shift but hate when the hype is uninformed or dishonest or not treating it with an eye for quality.

declan_roberts 6 days ago

These posts are funny to me because prompt engineers point at them as evidence of the fast-approaching software engineer obsolescence, but the amount of software engineering experience necessary to even guide an AI in this way is very high.

The reason he keeps adjusting the prompts is because he knows how to program. He knows what it should look like.

It just blurs the line between engineer and tool.

  • spaceman_2020 5 days ago

    The argument is that this stuff will so radically improve senior engineer productivity that the demand for junior engineers will crater. And without a pipeline of junior engineers, the junior-to-senior trajectory will radically atrophy

    Essentially, the field will get frozen where existing senior engineers will be able to utilize AI to outship traditional senior-junior teams, even as junior engineers fail to secure employment

    I don’t think anything in this article counters this argument

    • tptacek 5 days ago

      I don't know why people don't give more credence to the argument that the exact opposite thing will happen.

      • dcre 5 days ago

        Right. I don’t understand why everyone thinks this will make it impossible for junior devs to learn. The people I had around to answer my questions when I was learning knew a whole lot less than Claude and also had full time jobs doing something other than answering my questions.

        • Ataraxic 5 days ago

          Junior devs using AI can get a lot better at using AI and learn the existing patterns it generates, but I notice, for myself, that if I let AI write a lot of the code, I remember and thereby understand it less well later on. This applies in school and when trying to learn new things: the act of writing down the solution and working out the details yourself trains your own brain. I'd say that this has been a practice for over a thousand years, and I'm skeptical that this will make junior devs grow their own skills faster.

          I think asking questions to the AI for your own understanding totally makes sense, but there is a benefit when you actually create the code versus asking the AI to do it.

          • tptacek 5 days ago

            I'm sure there is when you're just getting your sea legs in some environment, but at some point most of the code you write in a given environment is rote. Rote code is both depleting and mutagenic --- if you're fluent and also interested in programming, you'll start convincing yourself to do stupid stuff to make the code less rote ("DRY it up", "make a DSL", &c) that makes your code less readable and maintainable. It's a trap I fall into constantly.

            • kiitos 5 days ago

              > but at some point most of the code you write in a given environment is rote

              "Most of the code one writes in a given environment is rote" is true in the same sense that most of the words one writes in any given bit of text are rote e.g. conjunctions, articles, prepositions, etc.

              • tptacek 5 days ago

                Some writers I know are convinced this is true, but I still don't think the comparison is completely apt, because deliberately rote code with modulated expressiveness is often (even usually) a virtue in coding, and not always so with writing. For experienced or enthusiastic coders, that is to say, the effort is often in not doing stupid stuff to make the code more clever.

                Straight-line replacement-grade mid code that just does the things a prompt tells it to in the least clever most straightforward way possible is usually a good thing; that long clunky string of modifiers goes by the name "maintainability".

            • ehutch79 5 days ago

              If your code is so boilerplate that you can do it by rote, you need to abstract it already. Or write a generator/snippet.

              • tptacek 5 days ago

                See, I get what you're saying, but this is my whole thing. No. Abstracting code out or building a bespoke codegen system is not always or even usually an improvement on straight-line code that just does what it says it does.

        • fch42 5 days ago

          It won't make it impossible for junior engineers to learn.

          It will simply reduce the amount of opportunities to learn (and not just for juniors), by virtue of companies' beancounters concluding "two for one" (several juniors) doesn't return the same as "buy one get one free" (existing staff + AI license).

          I dread the day we all "learn from AI". The social interaction part of learning is just as important as the content of it, really, especially when you're young; none of that comes across yet in the pure "1:1 interaction" with AI.

          • auggierose 5 days ago

            I learnt programming on my own, without any social interaction involved. In fact, I loved programming because it does not involve any social interaction.

            Programming has become more of a "social game" in the last 15 years or so. AI is a new superpower for people like me, bringing balance to the Force.

            • layer8 4 days ago

              To me, LLMs are just a different kind of social interaction I mostly don’t want, tedious and frustrating.

              • auggierose 4 days ago

                But it is not a social interaction. An LLM is a machine.

                I think there is also a big difference between being forced to use an LLM in a certain way, and being able to completely design your interaction with the LLM yourself. The former I imagine can be indeed tedious and frustrating, the latter is just miraculous.

                • layer8 4 days ago

                  No one is forcing me to use LLMs, so that’s not it. The interaction is social in the sense that it is natural-language based and nondeterministic, and that LLMs exhibit a certain “character”. They have been trained to mimic certain kinds of human social interaction.

                  • auggierose 4 days ago

                    It probably also depends on what your favourite weapon of choice is. Mine was always written language, and code is just a particular manifestation of it.

        • delegate 5 days ago

          You learn by doing.. eg typing the code. It's not just knowledge, it's the intuition you develop when you write code yourself. Just like physical exercise. Or playing an instrument. It's not enough to know the theory, practice is key.

          AI makes it very easy to avoid typing and hence make learning this skill less attractive.

          But I don't necessarily see it as doom and gloom, what I think will happen - juniors will develop advanced intuition about using AI and getting the functionality they need, not the quality of the code, while at the same time the AI models will get increasingly better and write higher quality code.

      • spaceman_2020 5 days ago

        If a junior engineer ships a similar repo to this with the help of AI, sure, I'll buy that.

        But as of now, it's senior engineers who really know what they're doing who can spot the errors in AI code.

        • tptacek 5 days ago

          Hold on. You said "really know what they're doing". Yes, I agree with that. What I don't buy is the coupling of that concept with "seniority".

          • danielbln 5 days ago

            Have a better term for "knows what they're doing" other than senior?

            • tptacek 5 days ago

              That's not what "senior" means.

              • etothet 5 days ago

                This is not necessarily true in practical terms when it comes to hiring or promoting. Often a senior dev becomes a senior because of having an advanced skillset, despite fewer years on the job. Similarly, developers who have been on the job for many years often aren't ready for senior because of their lack of soft and hard skills.

                • tptacek 5 days ago

                  Oh, that's one of the ways a senior dev becomes senior.

              • danielbln 5 days ago

                Maybe you could enlighten the rest of us then. According to your favorite definition, what does senior mean, what does seniority mean, and what's a term for someone who knows what they're doing?

                • tptacek 5 days ago

                  Seniority means you've held the role for a long time.

                  • ehutch79 5 days ago

                    Time is required to be a senior engineer, but time does not _make_ you a senior engineer.

                    You need time to accumulate experience. You need experience, time in the proverbial trenches, to be a senior engineer.

                    You need to be doing different things too, not just implementing the same cookie-cutter code repeatedly. If you are doing that, and haven't automated it, you are not a senior engineer.

                  • thi2 5 days ago

                    There is no real definition of a senior engineer. Just looking at years served seems wrong imho.

                    • tptacek 5 days ago

                      There's what "senior"-level developers say about themselves, and there's what's actually generally true about them. The two notions are, of course, not the same.

      • dimal 4 days ago

        If I’ve learned anything from the past few decades, something completely unexpected and even weirder than both will happen.

    • runeks 4 days ago

      > The argument is that this stuff will so radically improve senior engineer productivity that the demand for junior engineers will crater.

      What makes people think that an increase in senior engineer productivity causes demand for junior engineers to decrease?

      I think it will have the opposite effect: an increase in senior engineer productivity enables the company to add more features to its products, making it more valuable to its customers, who can therefore afford to pay more for the software. With this increase in revenue, the company is able to hire more junior engineers.

  • latexr 5 days ago

    > It just blurs the line between engineer and tool.

    I realise you meant it as “the engineer and their tool blend together”, but I read it like a funny insult: “that guy likes to think of himself as an engineer, but he’s a complete tool”.

  • visarga 5 days ago

    > prompt engineers point at them as evidence of the fast-approaching software engineer obsolescence

    Maybe journalists and bloggers angling for attention do it, prompt engineers are too aware of the limitations of prompting to do that.

  • tptacek 6 days ago

    I don't know why that's funny. This is not a post about a vibe coding session. It's Kenton Varda['s coding session].

    later

    updated to clarify kentonv didn't write this article

    • kevingadd 6 days ago

      I think it makes sense that GP is skeptical of this article considering it contains things like:

      > this tool is improving itself, learning from every interaction

      which seem to indicate a fundamental misunderstanding of how modern LLMs work: the 'improving' happens by humans training/refining existing models offline to create new models, and the 'learning' is just filling the context window with more stuff, not enhancement of the actual model or the model 'learning' - it will forget everything if you drop the context and as the context grows it can 'forget' things it previously 'learned'.

      • BurritoKing 5 days ago

        When you consider the "tool" as more than just the LLM model, but the stuff wrapped around calling that model, then I feel like you can make a good argument it's improving when it keeps context in a file on disk and constantly updates and edits that file as you work through the project.

        I do this routinely for large initiatives I'm kicking off through Claude Code - it writes a long detailed plan into a file and as we work through the project I have it constantly updating and rewriting that document to add information we have jointly discovered from each bit of the work. That means every time I come back and fire it back up, it's got more information than when it started, which looks a lot more like improvement from my perspective.

        • tptacek 5 days ago

          I would love to hear more about this workflow.

    • kiitos 5 days ago

      The sequence of commits talked about by the OP -- i.e. kenton's coding session's commits -- are like one degree removed from 100% pure vibe coding.

      • tptacek 5 days ago

        Your claim here being that Kenton Varda isn't reading the code he's generating. Got it. Good note.

        • rossjudson 3 days ago

          You ever get the feeling someone didn't look up Kenton Varda before criticizing the code he's generating?

          I guarantee you that Kenton Varda's generators generate more code than any other code generators that aren't compilers. ;)

        • kiitos 5 days ago

          No, that's not at all my claim, as it's obvious from the commit history that Kenton is reading the code he's generating before committing it.

          • kentonv 5 days ago

            What do you mean by "one degree removed from 100% pure vibe coding", then? The definition of vibe coding is letting the AI code without review...

            • kiitos 5 days ago

              > one degree removed

              You're letting Claude do your programming for you, and then sweeping up whatever it does afterwards. Bluntly, you're off-loading your cognition to the machine. If that's fine by you then that's fine enough, it just means that the quality of your work becomes a function of your tooling rather than your capabilities.

              • kentonv 5 days ago

                I don't agree. The AI largely does the boring and obvious parts. I'm still deciding what gets built and how it is designed, which is the interesting part.

                • tptacek 5 days ago

                  It's the same with me, with the added wrinkle of pulling each PR branch down and refactoring things (and, ironically, introducing my own bugs).

                • kiitos 5 days ago

                  > I'm still deciding what gets built and how it is designed, which is the interesting part.

                  How, exactly? Do you think that you're "deciding what gets built and how it's designed" by iterating on the prompts that you feed to the LLM that generates the code?

                  Or are you saying that you're somehow able to write the "interesting" code, and can instruct the LLM to generate the "boring and obvious" code that needs to be filled-in to make your interesting code work? (This is certainly not what's indicated by your commit history, but, who knows?)

                  • kentonv 5 days ago

                    Did you actually read the commit history?

                    My prompts specify very precisely what should be implemented. I specified the public API and high-level design upfront. I let the AI come up with its own storage schema initially but then I prompted it very specifically through several improvements (e.g. "denormalize this table into this other table to eliminate a lookup"). I designed the end-to-end encryption scheme and told it in detail how to implement it. I pointed out bugs and explained how to fix them. And so on.

                    All the thinking happened in those prompts. With the details I provided, combined with the OAuth spec, there was really very little room left for any creativity in the code. It was basically connect-the-dots at that point.

                    • kiitos 4 days ago

                      Right, so -- 'you think that you're "deciding what gets built and how it's designed" by iterating on the prompts that you feed to the LLM that generates the code'

                      > My prompts specify very precisely what should be implemented.

                      And the precision of your prompt's specifications has no reliable impact on exactly what code the LLM returns as output.

                      > With the details I provided, combined with the OAuth spec, there was really very little room left for any creativity in the code. It was basically connect-the-dots at that point.

                      I truly don't know how you can come to this conclusion, if you have any amount of observed experience with any of the current-gen LLM tools. No amount of prompt engineering gets you a reliable mapping from input query to output code.

                      > I designed the end-to-end encryption scheme and told it in detail how to implement it. I pointed out bugs and explained how to fix them. And so on.

                      I guess my response here is that, if you think that this approach to prompt engineering gets you a generated code result that is in any sense equivalent, or even comparable, in terms of quality, to the work that you could produce yourself, as a professional and senior-level software engineer, then, man, we're on different planets. Pointing out bugs and explaining how to fix them in your prompts in no way gets you deterministic, reliable, accurate, high-quality code as output. And actually forget about high-quality, I mean even just bare minimum table-stakes requirements-satisfying stuff.. !

                      • tptacek 4 days ago

                        Nobody has claimed to be getting deterministic outputs from LLMs.

                        • kiitos 4 days ago

                          > My prompts specify very precisely what should be implemented. I specified the public API and high-level design upfront. I let the AI come up with its own storage schema initially but then I prompted it very specifically through several improvements (e.g. "denormalize this table into this other table to eliminate a lookup"). I designed the end-to-end encryption scheme and told it in detail how to implement it. I pointed out bugs and explained how to fix them. And so on.

                          OK. Replace "[expected] deterministic output" with whatever term best fits what this block of text is describing, as that's what I'm talking about. The claim is that a sufficiently-precisely-specified prompt can produce reliably-correct code. Which is just clearly not the case, as of today.

                          • tptacek 4 days ago

                            I don't even think anybody expects reliably-correct code. They expect code that can be made as reliably as they themselves could make code, with some minimal amount of effort. Which clearly is the case.

                            • kiitos 4 days ago

                              Forget about reliably-correct. The code that any current-gen LLM generates, no matter how precise the prompt it's given, is never even close to the quality standards expected of any senior-level engineer, in any organization I've been a part of, at any point in my career. They very much never produce code that is as good as what I can create. If the LLM-generated code you're seeing passes this level of muster, in your view, then that's really a reflection on your situation(s), and 100% not any kind of truth that you can claim as part of a blog post or whatever...

                              • kentonv 4 days ago

                                > The code that any current-gen LLM generates, no matter how precise the prompt it's given, is never even close to the quality standards expected of any senior-level engineer, in any organization I've been a part of, at any point in my career.

                                You are just making assertions here with no evidence.

                                If you prompt the LLM for code, and then you review the code, identify specific problems, and direct the LLM to fix those problems, and repeat, you can, in fact, end up with production-ready code -- in less time than it would take to write by hand.

                                Proof: My project. I did this. It worked. It's in production.

                                It seems like you believe this code is not production-ready because it was produced using an LLM which, you believe, cannot produce production-ready code. This is a cyclic argument.

                                • kiitos 3 days ago

                                  > If you prompt the LLM for code, and then you review the code, identify specific problems, and direct the LLM to fix those problems, and repeat, you can, in fact, end up with production-ready code

                                  I guess I will concede that this is possible, yes. I've never seen it happen, myself, but it could be the case, at some point, in the future.

                                  > in less time than it would take to write by hand.

                                  This is my point of contention. The process you've described takes ages longer than however much time it would take a competent senior-level engineer to just type the code from first principles. No meaningful project has ever been bottle-necked on how long it takes to type characters into editors.

                                  All of that aside, the claim you're making here is that, speaking as a senior IC, the code that an LLM produces, guided by your prompt inputs, is more or less equivalent to any code that you could produce yourself, even controlling for time spent. Which just doesn't match any of my experiences with any current-gen LLM or agent or workflow or whatever. If your universe is all about glue code, where typing is enemy no. 1, and details don't matter, then fair enough, but please understand that this is not usually the domain of senior-level engineers.

                                  • simonw 3 days ago

                                    "the code that an LLM produces, guided by your prompt inputs, is more or less equivalent to any code that you could produce yourself, even controlling for time spent"

                                    That's been my personal experience over the past 1.5 years. LLMs, prompted and guided by me, write code that I would be proud to produce without them.

                                  • kentonv 3 days ago

                                    I have only claimed that for this particular project it worked really well, and was much faster than writing by hand. This particular project was arguably a best-case scenario: a greenfield project implementing a well-known standard against a well-specified design.

                                    I have tried using AI to make changes to the Cloudflare Workers Runtime -- my usual main project, which I started, and know like the back of my hand, and which incidentally handles over a trillion web requests every day -- and in general in that case I haven't found it saved me much time. (Though I've been a bit surprised by the fact that it can find its way around the code at all, it's a pretty complicated C++ codebase.)

                                    It really depends on the use case.

                                • 59nadir 3 days ago

                                  It's possible kiitos has (or had?) a higher standard in mind for what should constitute a senior/"lead engineer" at Cloudflare and how much they should be constrained by typing as part of implementation.

                                  Out of interest: How much did the entire process take and how much would you estimate it to take without the LLM in the loop?

                                  • kentonv 3 days ago

                                    > It's possible kiitos has (or had?) a higher standard in mind for what should constitute a senior/"lead engineer" at Cloudflare and how much they should be constrained by typing as part of implementation.

                                    See again here, you're implying that I or my code is disappointing somehow, but with no explanation for how except that it was LLM-assisted. I assert that the code is basically as good as if I'd written it by hand, and if you think I'm just not a competent engineer, like, feel free to Google me.

                                    It's not the typing itself that constrains, it's the detailed but non-essential decision-making. Every line of code requires making several decisions, like naming variables, deciding basic structure, etc. Many of these fine-grained decisions are obvious or don't matter, but it's still mentally taxing, which is why nobody can write code as fast as they can type even when the code is straightforward. LLMs can basically fill in a bunch of those details for you, and reviewing the decisions -- especially the fine-grained ones that don't matter -- is a lot faster than making them.

                                    > How much did the entire process take and how much would you estimate it to take without the LLM in the loop?

                                    I spent about five days mostly focused on prompting the LLM (although I always have many things interrupting me throughout the day, so I wasn't 100% focused). I estimate it would have taken me 2x-5x as long to do by hand, but it's of course hard to say for sure.

                                    • 59nadir 2 days ago

                                      > See again here, you're implying that I or my code is disappointing somehow, but with no explanation for how except that it was LLM-assisted. I assert that the code is basically as good as if I'd written it by hand, and if you think I'm just not a competent engineer, like, feel free to Google me.

                                      I think you're reading a bit too deeply into what I wrote; I explained what I interpreted kiitos' posts as essentially saying. I realize that you've probably had to deal with a lot of people being skeptical to the point of "throwing shade", as it were, so I understand the defensive posture. I am skeptical, but the reason I'm asking questions (alongside the previous bit) is because I'm actually curious about your experiment.

                                      > It's not the typing itself that constrains, it's the detailed but non-essential decision-making. Every line of code requires making several decisions, like naming variables, deciding basic structure, etc. Many of these fine-grained decisions are obvious or don't matter, but it's still mentally taxing, which is why nobody can write code as fast as they can type even when the code is straightforward. LLMs can basically fill in a bunch of those details for you, and reviewing the decisions -- especially the fine-grained ones that don't matter -- is a lot faster than making them.

                                      In your estimation, what is your mental code coverage of the code you ended up with? Do you feel like you have a complete mapping of it, i.e. you could get an external request for change and map it quickly to where it needs to be made and why exactly there?

                                      • kentonv 2 days ago

                                        > In your estimation, what is your mental code coverage of the code you ended up with? Do you feel like you have a complete mapping of it, i.e. you could get an external request for change and map it quickly to where it needs to be made and why exactly there?

                                        I know the code structure about as well as if I had written it.

                                        Honestly the code structure is not very complicated. It flows pretty naturally from the interface spec in the readme, and I'd expect anyone who knows OAuth could find their way around pretty easily.

                                        But yes, as part of prompting improvements to the code, I had to fully understand the implementation. My prompts are entirely based on reading the code and deciding what needed to be changed -- not based on any sort of black-box testing of the code (which would be "vibe coding").

                                • bdangubic 4 days ago

                                  100% this. I have same proof… In productions… across 30+ services… hourly…

                                • tptacek 4 days ago

                                  The genetic fallacy is a hell of a drug.

              • ramchip 5 days ago

                Personally, I spend _more_ time thinking with Claude. I can focus on the design decisions while it does the mechanical work of turning that into code.

                Sometimes I give the agent a vague design ("make XYZ configurable") and it implements it the wrong way, so I'll tell it to do it again with more precise instructions ("use a config file instead of a CLI argument"). The best thing is you can tell it after it wrote 500 lines of code and updated all the tests, and its feelings won't be hurt one bit :)

                It can be useful as a research tool too, for instance I was porting a library to a new language, and I told the agent to 1) find all the core types and 2) for each type, run a subtask to compare the implementation in each language and write a markdown file that summarizes the differences with some code samples. 20 min later I had a neat collection of reports that I could refer to while designing the API in the new language.

                • steveklabnik 4 days ago

                  I was recently considering a refactoring in a work codebase. I had an interesting discussion with Claude about the tradeoffs, then had it show me what the code would look like after the refactor, both in a very simple case, and in one of the most complex cases. All of this informed what path I ended up taking, but especially the real-world examples meant this was a much better informed decision than just "hmm, yeah seems like it would be a lot of work but also probably worth it."

  • thegrim33 5 days ago

    I mean yeah, the very first prompt given to the AI was put together by an experienced developer; a bunch of code telling the AI exactly what the API should look like and how it would be used. The very first step in the process already required an experienced developer to be involved.

eviks 5 days ago

> Imagine version control systems where you commit the prompts used to generate features rather than the resulting implementation.

So every single run will result in a different, non-reproducible implementation with unique bugs requiring manual expert intervention. How is this better?

SupremumLimit 6 days ago

It's an interesting review but I really dislike this type of techno-utopian determinism: "When models inevitably improve..." Says who? How is it inevitable? What if they've actually reached their limits by now?

  • Dylan16807 6 days ago

    Models are improving every day. People are figuring out thousands of different optimizations to training and to hardware efficiency. The idea that right now in early June 2025 is when improvement stops beggars belief. We might be approaching a limit, but that's going to be a sigmoid curve, not a sudden halt in advancement.

    • a2128 5 days ago

      I think at this point we're reaching more incremental updates, which can score higher on some benchmarks but simultaneously behave worse on real-world prompts, especially prompts engineered for a specific model. I recall Google updating their Flash model on their API with no way to revert to the old one; it caused a lot of people to complain that everything they'd built no longer worked because the model simply behaved differently than when they wrote all their prompts.

      • whbrown 5 days ago

        Isn't it quite possible they replaced that Flash model with a distilled version, saving money rather than increasing quality? This just speaks to the value of open-weights more than anything.

    • deadbabe 6 days ago

      5 years ago a person would be blown away by today’s LLMs. But people today will merely say “cool” at whatever LLMs are in use 5 years from now. Or maybe not even that.

      • tptacek 6 days ago

        For most of the developers I know personally who have been radicalized by coding agents, it happened within the past 9 months. It does not feel like we are in a phase of predictable boring improvement.

        • keybored 5 days ago

          Radicalized? Going with the flow and wishes of the people who are driving AI is the opposite of that.

          To have their minds changed drastically, sure..

          • tptacek 5 days ago

            Sorry I have no idea what you're trying to say here.

          • lcnPylGDnU4H9OF 5 days ago

            > very different from the usual or traditional

            https://www.merriam-webster.com/dictionary/radical

            Going from deciding that AI is going nowhere to suddenly deciding that coding agents are how they will work going forward is a radical change. That is what they meant.

            • keybored 5 days ago
              • Dylan16807 5 days ago

                Can you explain exactly what you meant by your second paragraph? The ambiguity is why you got that reply.

                If your second paragraph makes that reply irrelevant, are you saying the meaning was "Your use of 'radicalized' is technically correct but I still think you shouldn't have used it here"?

      • dingnuts 6 days ago

        5 years ago GPT-2 was already outputting largely coherent text; there's been progress, but it's not all that shocking

    • sitkack 6 days ago

      It is copium to think that it will suddenly stop and the world they knew before will return.

      ChatGPT came out in Nov 2022. "Attention Is All You Need" was 2017, so we were already 5 years in the past -- or rather, there were 5 years of research to catch up on. And from 2022 to now, papers and research have been increasing exponentially. Even if SOTA models were frozen, we would still have years of research to apply and optimize in various ways.

      • BoorishBears 5 days ago

        I think it's equally copium that people keep assuming we're just going to compound our way into intelligence that generalizes enough to stop us from handholding the AI, as much as I'd genuinely enjoy that future.

        Lately I spend all day post-training models for my product, and I want to say 99% of the research specific to LLMs doesn't reproduce and/or matter once you actually dig in.

        We're getting exponentially more papers on the topics and they're getting worse on average.

        Every day there's a new paper claiming an X% gain by post-training some ancient 8B parameter model and comparing it to a bunch of other ancient models after they've overfitted on the public dataset of a given benchmark and given the model a best of 5.

        And benchmarks won't ever show it, but even GPT-3.5 Turbo has better general world knowledge than a lot of models people consider "frontier" models today, because post-training makes it easy to cover up those gaps with very impressive one-prompt outputs and strong benchmark scores.

        -

        It feels like things are getting stuck in a local maximum: we are making forward progress, and the models are useful and getting more useful, but the future people are envisioning requires reaching a completely different goalpost, and I'm not at all convinced we're making exponential progress towards it.

        There may be an exponential number of techniques claiming to be groundbreaking, but what has actually unlocked new capabilities that can't just as easily be attributed to how much more focused post-training has become on coding and math?

        Test-time compute feels like the only one, and we're already seeing the cracks form in terms of its effect on hallucinations; there's also a clear ceiling on the performance the current iteration unlocks, as all these models are converging on pretty similar performance after just a few model releases.

      • rxtexit 5 days ago

        The copium, I think, is that many people got comfortable post-financial-crisis with nothing much changing or happening. Many people really liked a decade-long stretch with not much more than web framework updates and smartphone versioning.

        We are just back on track.

        I just read "Oracular Programming: A Modular Foundation for Building LLM-Enabled Software" the other day.

        We don't even have a new paradigm yet. I would be shocked if, in 10 years, I don't look back at this time of writing a prompt into a chatbot and then pasting the code into an IDE as completely comical.

        The most shocking thing to me is we are right back on track to what I would have expected in 2000 for 2025. In 2019 those expectations seemed like science fiction delusions after nothing happening for so long.

        • sitkack 5 days ago

          Reading the Oracular paper now, https://news.ycombinator.com/item?id=44211588

          It feels a bit like Halide, where the goal and the strategy are separated so that each can be optimized independently.

          Those new paradigms are being discovered by hordes of vibecoders, myself included. I am having wonderful results with TDD and AI assisted design.

          IDEs are now mostly browsers for code, and I no longer copy and paste with a chatbot.

          Curious what you think about the Oracular paper. One area that I have been working on for the last couple weeks is extracting ToT for the domain and then using the LLM to generate an ensemble of exploration strategies over that tree.

  • groby_b 6 days ago

    It is "inevitable" in the sense that in 99% of the cases, tomorrow is just like yesterday.

    LLMs have been continually improving for years now. The surprising thing would be them not improving further. And if you follow the research even remotely, you know they'll improve for a while, because not all of the breakthroughs have landed in commercial models yet.

    It's not "techno-utopian determinism". It's a clearly visible trajectory.

    Meanwhile, if they didn't improve, it wouldn't make a significant change to the overall observations. It's picking a minor nit.

    The observation that strict prompt adherence plus prompt archival could shift how we program is true, and it's a phenomenon we've observed several times in the past. Nobody keeps the assembly output from the compiler around anymore, either.

    There's definitely valid criticism of the passage, and it is overly optimistic - in that most non-trivial prompts are still underspecified and have multiple possible implementations, not all of them correct. That's both a more useful criticism and one not tied to LLM improvements at all.

    • double0jimb0 6 days ago

      Are there places that follow the research that speak to the layperson?

  • its-kostya 5 days ago

    What is ironic: if we buy into the theory that AI will write the majority of code in the next 5-10 years, what is it going to train on afterwards? ITSELF? It seems this theoretical trajectory of "will inevitably get better" is only true if humans are producing quality training data. The quality of code LLMs create is roughly proportional to how mature and ubiquitous the languages/projects are.

    • solarwindy 5 days ago

      I think you neatly summarise why the current pre-trained LLM paradigm is a dead end. If these models were really capable of artificial reasoning and learning, they wouldn’t need more training data at all. If they could learn like a human junior does, and actually progress to being a senior, then I really could believe that we’ll all be out of a job—but they just do not.

  • sumedh 6 days ago

    More compute means faster processing and more context.

  • Sevii 6 days ago

    Models have improved significantly over the last 3 months. Yet people have been saying 'What if they've actually reached their limits by now?' for pushing 3 years.

    • BoorishBears 5 days ago

      This is just people talking past each other.

      If you want a model that's getting better at helping you as a tool (which for the record, I do), then you'd say in the last 3 months things got better between Gemini's long context performance, the return of Claude Opus, etc.

      But if your goal post is replacing SWEs entirely... then it's not hard to argue we definitely didn't overcome any new foundational issues in the last 3 months, and not too many were solved in the last 3 years even.

      In the last year the only real foundational breakthrough would be RL-based reasoning with test-time compute delivering real results. But what that does to hallucinations, plus DeepSeek catching up with just a few months of post-training, shows that in its current form the technique doesn't blow through the barriers the way people were originally touting it would.

      Overall models are getting better at things we can trivially post-train and synthesize examples for, but it doesn't feel like we're breaking unsolved problems at a substantially accelerated rate (yet.)

    • greyadept 6 days ago

      For me, improvement means no hallucination, but that only seems to have gotten worse and I'm interested to find out whether it's actually solvable at all.

      • tptacek 6 days ago

        Why do you care about hallucination for coding problems? You're in an agent loop; the compiler is ground truth. If the LLM hallucinates, the agent just iterates. You don't even see it unless you make the mistake of looking closely.

        • kiitos 5 days ago

          What on earth are you talking about??

          If the LLM hallucinates, then the code it produces is wrong. That wrong code isn't obviously or programmatically determinable as wrong, the agent has no way to figure out that it's wrong, it's not as if the LLM produces at the same time tests that identify that hallucinated code as being wrong. The only way that this wrong code can be identified as wrong is by the human user "looking closely" and figuring out that it is wrong.

          You seem to have this fundamental belief that the code that's produced by your LLM is valid and doesn't need to be evaluated, line-by-line, by a human, before it can be committed?? I have no idea how you came to this belief but it certainly doesn't match my experience.

          • tptacek 5 days ago

            No, what's happening here is we're talking past each other.

            An agent lints and compiles code. The LLM is stochastic and unreliable. The agent is ~200 lines of Python code that checks the exit code of the compiler and relays it back to the LLM. You can easily fool an LLM. You can't fool the compiler.
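
            Roughly, that kind of loop looks like the following minimal sketch (ask_llm and apply_patch are stand-ins, and cargo check is only an example of a ground-truth build command):

                import subprocess

                def ask_llm(prompt: str) -> str:
                    """Stand-in for a call to whatever model the agent drives."""
                    ...

                def apply_patch(patch: str) -> None:
                    """Stand-in for writing the proposed change into the repo."""
                    ...

                def agent_loop(task: str, max_iters: int = 10) -> None:
                    prompt = task
                    for _ in range(max_iters):
                        apply_patch(ask_llm(prompt))    # the model proposes, the agent applies
                        build = subprocess.run(["cargo", "check"],
                                               capture_output=True, text=True)
                        if build.returncode == 0:
                            return                      # compiles clean; human review happens next
                        # relay the compiler's complaints back and iterate
                        prompt = "The build failed:\n" + build.stderr + "\nPlease fix the code."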

            I didn't say anything about whether code needs to be reviewed line-by-line by humans. I review LLM code line-by-line. Lots of code that compiles clean is nonetheless horrible. But none of it includes hallucinated API calls.

            Also, where did this "you seem to have a fundamental belief" stuff come from? You had like 35 words to go on.

            • kiitos 5 days ago

              > If the LLM hallucinates, then the code it produces is wrong. That wrong code isn't obviously or programmatically determinable as wrong, the agent has no way to figure out that it's wrong, it's not as if the LLM produces at the same time tests that identify that hallucinated code as being wrong. The only way that this wrong code can be identified as wrong is by the human user "looking closely" and figuring out that it is wrong

              The LLM can easily hallucinate code that will satisfy the agent and the compiler but will still fail the actual intent of the user.

              > I review LLM code line-by-line. Lots of code that compiles clean is nonetheless horrible.

              Indeed most code that LLMs generate compiles clean and is nevertheless horrible! I'm happy that you recognize this truth, but the fact that you review that LLM-generated code line-by-line makes you an extraordinary exception vs. the normal user, who generates LLM code and absolutely does not review it line-by-line.

              > But none of [the LLM generated code] includes hallucinated API calls.

              Hallucinated API calls are just one of many, many possible kinds of hallucinated code that an LLM can generate; by no means does "hallucinated code" describe only "hallucinated API calls"!

              • tptacek 5 days ago

                When you say "the LLM can easily hallucinate code that will satisfy the compiler but still fail the actual intent of the user", all you are saying is that the code will have bugs. My code has bugs. So does yours. You don't get to use the fancy word "hallucination" for reasonable-looking, readable code that compiles and lints but has bugs.

                I think at this point our respective points have been made, and we can wrap it up here.

                • kiitos 5 days ago

                  > When you say "the LLM can easily hallucinate code that will satisfy the compiler but still fail the actual intent of the user", all you are saying is that the code will have bugs. My code has bugs. So does yours. You don't get to use the fancy word "hallucination" for reasonable-looking, readable code that compiles and lints but has bugs.

                  There is an obvious and categorical difference between the "bugs" that an LLM produces as part of its generated code, and the "bugs" that I produce as part of the code that I write. You don't get to conflate these two classes of bugs as though they are equivalent, or even comparable. They aren't.

                  • tptacek 5 days ago

                    They obviously are.

                    • kiitos 4 days ago

                      I get that you think this is the case, but it really very much isn't. Take that feedback/signal as you like.

              • simonw 5 days ago

                You seem to be using "hallucinate" to mean "makes mistakes".

                That's not how I use it. I see hallucination as a very specific kind of mistake: one where the LLM outputs something that is entirely fabricated, like a class method that doesn't exist.

                The agent compiler/linter loop can entirely eradicate those. That doesn't mean the LLM won't make plenty of other mistakes that don't qualify as hallucinations by the definition I use!

                It's newts and salamanders. Every newt is a salamander, not every salamander is a newt. Every hallucination is a mistake, not every mistake is a hallucination.

                https://simonwillison.net/2025/Mar/2/hallucinations-in-code/
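
                A toy illustration of the difference (hypothetical snippets, not from any real project): the first fails immediately in any lint/type-check/test step an agent runs, the second runs fine and is simply a bug.

                    # Hallucination: sum() returns an int, and ints have no
                    # .to_decimal() method, so this raises AttributeError as
                    # soon as it runs -- an agent loop catches it automatically.
                    total = sum([1, 2, 3]).to_decimal()

                    # Ordinary bug: perfectly valid code that computes the wrong
                    # thing (it silently drops the last element). No tool flags it.
                    def average(xs):
                        return sum(xs[:-1]) / len(xs)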

                • kiitos 5 days ago

                  I'm not using "hallucinate" to mean "makes mistakes". I'm using it to mean "code that is syntactically correct and passes tests but is semantically incoherent". Which is the same thing that "hallucination" normally means in the context of a typical user LLM chat session.

                  • tptacek 3 days ago

                    Why would you merge code that was "semantically incoherent"? And how does the answer to that question, about "hallucinations" that matter in practice, allow you to then distinguish between "hallucinations" and "bugs"?

            • someothherguyy 5 days ago

              Linting isn't verification of correctness, and yes, you can fool the compiler, linters, etc. Work with some human interns; they are great at it. Agents will do crazy things to get around linting errors, including removing functionality.

              • fragmede 5 days ago

                have you no tests?

                • kiitos 5 days ago

                  Irrelevant, really. Tests establish a minimum threshold of acceptability, they don't (and can't) guarantee anything like overall correctness.

                  • tptacek 5 days ago

                    Just checking off the list of things you've determined to be irrelevant. Compiler? Nope. Linter? Nope. Test suite? Nope. How about TLA+ specifications?

                    • skydhash 5 days ago

                      TLA+ specs don't verify code. They verify design. Such a design can be expressed in whatever you like, including pseudocode (think of the algorithm notation in textbooks). Then you write the TLA+ specs that judge whether invariants are truly respected. Once you're sure of the design, you can go and implement it, but there are no hard constraints like a type system.

                      • tptacek 5 days ago

                        At what level of formal methods verification does the argument against AI-generated code fall apart? My expectation is that the answer is "never".

                        The subtext is pretty obvious, I think: that standards, on message boards, are being set for LLM-generated code that are ludicrously higher than would be set for people-generated code.

            • saagarjha 5 days ago

              My guy didn't you spend like half your life in the field where your job was to sift through code that compiled but nonetheless had bugs that you tried to exploit? How can you possibly have this belief about AI generated code?

              • tptacek 5 days ago

                I don't understand this question. Yes, I spent about 20 years learning the lesson that code is profoundly knowable; to start with, you just read it. What challenge do you believe AI-generated code presents to me?

          • lcnPylGDnU4H9OF 5 days ago

            > You seem to have this fundamental belief that the code that's produced by your LLM is valid and doesn't need to be evaluated, line-by-line, by a human, before it can be committed??

            This is a mistaken understanding. The person you responded to has written on these thoughts already and they used memorable words in response to this proposal:

            > Are you a vibe coding Youtuber? Can you not read code? If so: astute point. Otherwise: what the fuck is wrong with you?

            It should be obvious that one would read and verify the code before they commit it. Especially if one works on a team.

            https://fly.io/blog/youre-all-nuts/

            • kasey_junk 5 days ago

              We should go one step past this and come up with an industry practice where we get someone other than the author to read the code before we merge it.

              • lcnPylGDnU4H9OF 5 days ago

                I don’t understand your point. Are you saying that it sounds like that wouldn’t happen?

                • kasey_junk 4 days ago

                  I’m being sarcastic. The person you are responding to is implying that reading code carefully before merging it is some daunting or new challenge. In fact it’s been standard practice in our industry for 2 or more people to do that as a matter of course.

      • dymk 6 days ago

        All the benchmarks would disagree with you

        • BoorishBears 5 days ago

          The benchmarks also claim random 32B parameter models beat Claude 4 at coding, so we know just how much they matter.

          It should be obvious to anyone with a cursory interest in model training that you can't trust benchmarks unless they're fully private black boxes.

          If you can get even a hint of the shape of the questions on a benchmark, it's trivial to synthesize massive amounts of data that help you beat the benchmark. And given the nature of funding right now, you're almost silly not to do it: it's not cheating, it's "demonstrably improving your performance at the downstream task"

        • thuuuomas 6 days ago

          Today’s public benchmarks are yesterday’s training data.

_pdp_ 5 days ago

I commented on the original discussion a few days ago but I will do it again.

Why is this such a big deal? This library is not even that interesting. It is a very straightforward task that I expect most programmers would be able to pull off easily. 2/3 of the code is type interfaces and comments. The rest is a by-the-book implementation of a protocol that is not even that complex.

Please, there are some React JSX files in your code base with a lot more complexities and intricacies than this.

Has anyone even read the code at all?

  • JackSlateur 4 days ago

    Of course, this is a pathetic commercial, nothing serious

    As you say, the code is not interesting, it deals with a well known topic

    And it required lots of manpower to get done

    tldr: this is a non-event disguised as an incredible success. No doubt Cloudflare is making money with that AI crap, somehow.

thorum 6 days ago

Humorous that this article has a strong AI writing smell - the author should publish the prompts they used!

  • dcre 5 days ago

    I don’t like to accuse, and the article is fine overall, but this stinks: “This transparency transforms git history from a record of changes into a record of intent, creating a new form of documentation that bridges human reasoning and machine implementation.”

    • keybored 5 days ago

      > I don’t like to accuse, and the article is fine overall, but this stinks:

      Now consider your reasonable instinct not to accuse other people, coupled with the possibility of setting an AI loose with "write a positive article about AI where you have some paragraphs about the current limitations based on this link. write like you are just following the evidence." Meanwhile we are supposed to sit here and weigh every word.

      This reminds me to write a prompt for a blog post: how AI could be used to make the kind of personal-looking site of a tech guy who meditates and runs. (Do we have the technology? Yes we do.)

    • ZephyrBlu 5 days ago

      Also: "This OAuth library represents something larger than a technical milestone—it's evidence of a new creative dynamic emerging"

      Em-dash baby.

      • mpalmer 5 days ago

        The sentence itself is a smeLLM. Grandiose pronouncements aren't a bot exclusive, but man do they love making them, especially about creative paradigms and dynamics

      • ZeroTalent 5 days ago

        I have used Em-dashes in many of my comments for years. It's just a result of reading books, where Em-dashes happen a lot.

      • latexr 5 days ago

        Can we please stop using the em-dash as a metric to “detect” LLM writing? It’s lazy and wrong. Plenty of people use em-dashes, it’s a useful punctuation mark. If humans didn’t use them, they wouldn’t be in the LLM training data.

        There are better clues, like the kind of vague pretentious babble bad marketers use to make their products and ideas seem more profound than they are. It’s a type of bad writing which looks grandiose but is ultimately meaningless and that LLMs heavily pick up on.

        • grey-area 5 days ago

          Very few people use n dashes in internet writing as opposed to dashes as they are not available on the default keyboard.

          • purplesyringa 5 days ago

            This is a post with formatting and we're programmers here. I can assure you their editor (or Markdown) supports em-dash in some fashion.

          • latexr 5 days ago

            That’s not true at all. Apple’s OS by default have smart punctuation enabled and convert -- (two hyphens) into — (“em-dash”; not an “en-dash”, which has a different purpose), " " (dumb quotes) into “ ” (smart quotes), and so forth.

            Furthermore, on macOS there are simple key combinations (e.g. with ⌥) to make all sort of smart punctuation even if you don’t have the feature enabled by default, and on iOS you can long press on a key (such as the hyphen) to see alternates.

            The majority of people may not use correct punctuation marks, but enough do that assuming a single character immediately means they used an LLM is just plain wrong. I have never used an LLM to write a blog post, internet comment, or anything of the sort, and I have used smart punctuation in all my writing for over a decade. Same with plenty of other HN commenters, journalists, writers, editors, and on and on. You don’t need to be a literal machine to care about correct character use.

            • grey-area 5 days ago

              So we’ve established the default is a hyphen, not an em dash.

              You can certainly select an em dash but most don’t know what it means and don’t use it.

              It’s certainly not infallible proof but multiple uses of it in comments online (vs published material or newspapers) are very unusual, so I think it’s an interesting indicator. I completely agree it is common in some texts, usually ones from publishing houses with style guides but also people who know about writing or typography.

            • ramchip 5 days ago

              > assuming a single character immediately means they used an LLM is just plain wrong

              I don't see anyone doing that here. LLM writing was brought up because of the writing style, not the dash. It just reinforces the suspicion.

          • thoroughburro 5 days ago

            On the “default keyboard” of most people (a phone), you just long-press hyphen to choose any dash length.

            • grey-area 5 days ago

              But who does? Not many.

        • isaacremuant 4 days ago

          It's not lazy and wrong. It's a fantastic indicator.

          > If humans didn’t use them, they wouldn’t be in the LLM training data.

          Humans weren't using them in every context as they are now. They might've been used in books but blog posts and work documents weren't full of them.

          It's not a definite thing but it's absolutely a good indicator.

          • latexr 3 days ago

            Blog posts, news articles, and other web texts have been using correct punctuation marks for a long time. I know because I’ve been noticing misuses (usually having switched or repeated characters for quotes) for over a decade.

            Plenty of people care about typographic punctuation, and others use software (such as Apple’s OSs, markdown converters, publishing and editing tools) which auto-converts smart punctuation. Heck, tools for doing that are older than Markdown, and that is already two decades old.

            https://daringfireball.net/projects/smartypants/

            Look, nowhere have I said using an em-dash can’t be an indicator; my objection is people using it as the indicator. It’s become a meme. Too many people act as if the existence of a single em-dash immediately and conclusively proves it was written by an LLM. It does not.

          • never_inline 4 days ago

            They may be overrepresented in the RLHF data

        • vovavili 4 days ago

          It's not a guarantee, but it does make it so much more likely. Therefore, it is an extremely useful prior to hold.

    • OjotCewIo 5 days ago

      > this stinks: “This transparency transforms git history from a record of changes into a record of intent, creating a new form of documentation that bridges human reasoning and machine implementation.”

      That's where I stopped reading. If they needed "AI" for turning their git history into a record of intent ("transparency"), then they had been doing it all wrong, previously. Git commit messages have always been a "form of documentation that bridges human reasoning" -- namely, with another human's (the reader's) reasoning.

      If you don't walk your reviewer through your patch, in your commit message, as if you were teaching them, then you're doing it wrong.

      Left a bad taste in my mouth.

  • maxemitchell 5 days ago

    I did human notes -> had Claude condense and edit -> manually edit. A few of the sentences (like the stinky one below) were from Claude which I kept if it matched my own thoughts, though most were changed for style/prose.

    I'm still experimenting with it. I find it can't match style at all, and even with the manual editing it still "smells like AI" as you picked up. But, it also saves time.

    My prompt was essentially "here are my old blog posts, here's my notes on reading a bunch of AI generated commits, help me condense this into a coherent article about the insights I learned"

    • layer8 4 days ago

      I wonder if those notes wouldn’t have been more interesting as-is, and possibly also more condensed.

      • nothrabannosir 4 days ago

        I wish there were a way to opt-out of LLM generated text and see the prompt. In any context. It's always more informative, more human, more memorable, more accurate, and more representative of what the author was actually trying to convey.

    • thorum 5 days ago

      Makes sense, I could see the human touch on the article too, so I figured it was something like that.

Fischgericht 5 days ago

So, it means that you and the LLM together have managed to write SEVEN lines of trivial code per hour. On a protocol that is perfectly documented, where you can look at about one million other implementations when in doubt.

It is not my intention to hurt your feelings, but it sounds like you and/or the LLM are not really good at their job. Looking at programmer salaries and LLM energy costs, this appears to be a very very VERY expensive OAuth library.

Again: Not my intention to hurt any feelings, but the numbers really are shockingly bad.

  • kentonv 5 days ago

    I spent about 5 days semi-focused on this codebase (though I always have lots of people interrupting me all the time). It's about 5000 lines (if you count comments, tests, and documentation, which you should). Where do you get 7 lines per hour?

  • nojito 5 days ago

    >So, it means that you and the LLM together have managed to write SEVEN lines of trivial code per hour.

    Here's their response

    >It took me a few days to build the library with AI.

    >I estimate it would have taken a few weeks, maybe months to write by hand.

    >That said, this is a pretty ideal use case: implementing a well-known standard on a well-known platform with a clear API spec.

    https://news.ycombinator.com/item?id=44160208

    Lines of code per hour is a terrible metric to use. Additionally, it's far easier to critique code that's already written!

  • Fischgericht 5 days ago

    Yes, my brain got confused on who wrote the code and who just reported about it. I am truly sorry. I will go see my LLM doctor to get my brain repaired.

moron4hire 5 days ago

I'm sorry, this all sounds like a fucking miserable experience. Like, if this is what my job becomes, I'll probably quit tech completely.

  • kentonv 5 days ago

    That's exactly what I thought, too, before I tried it!

    Turns out it feels very different than I expected. I really recommend trying it rather than assuming. There's no learning curve, you just install Claude Code and run it in your repo and ask it for things.

    (I am the author of the code being discussed. Or, uh, the author of the prompts at least.)

Arainach 5 days ago

>Around the 40-commit mark, manual commits became frequent

This matches my experience: some shiny (even sometimes impressive) greenfield demos, but it's dramatically less useful for maintaining a codebase - which for any successful product is 90% of the work.

IncreasePosts 6 days ago

I asked this in the other thread (no response, but I was a bit late)

How does anyone using AI like this have confidence that they aren't unintentionally plagiarizing code and violating the terms of whatever license it was released under?

For random personal projects I don't see it mattering that much. But if a large corp is releasing code like this, one would hope they've done some due diligence to confirm they haven't just stolen the code from some similar repo on GitHub, laundered through an LLM.

The only relevant section in the readme doesn't mention checking similar projects or libraries for common code:

> Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.

  • akdev1l 6 days ago

    > How does anyone using AI like this have confidence that they aren't unintentionally plagiarizing code and violating the terms of whatever license it was released under?

    They don’t and no one cares

  • tptacek 6 days ago

    Most of the code generated by LLMs, and especially the code you actually keep from an agent, is mid, replacement-level, boring stuff. If you're not already building projects with LLMs, I think you need to start doing that first before you develop a strong take on this. From what I see in my own work, the code being generated is highly unlikely to be distinguishable. There is more of me and my prompts and decisions in the LLM code than there can possibly be defensible IPR from anybody else, unless the very notion of, like, wrapping a SQLite INSERT statement in Golang is defensible.

    The best way I can explain the experience of working with an LLM agent right now is that it is like if every API in the world had a magic "examples" generator that always included whatever it was you were trying to do (so long as what you were trying to do was within the obvious remit of the library).

  • simonw 5 days ago

    All of the big LLM vendors have a "copyright shield" indemnity clause for their paying customers - a guarantee that if you get sued over IP for output from their models their legal team will step in to fight on your behalf.

  • saghm 6 days ago

    Safety in the shadow of giant tech companies. People were upset when Microsoft released Copilot trained on GitHub data, but nobody who cared could do anything about it, and nobody who could have done something about it cared, so it just became the new norm.

  • kentonv 5 days ago

    I'm fairly confident that it's not just plagiarizing because I asked the LLM to implement a novel interface with unusual semantics. I then prompted for many specific fine-grain changes to implement features the way I wanted. It seems entirely implausible to me that there could exist prior art that happened to be structured exactly the way I requested.

    Note that I came into this project believing that LLMs were plagiarism engines -- I was looking for that! I ended up concluding that this view was not consistent with the output I was actually seeing.

  • ryandrake 6 days ago

    This is an excellent question that the AI-boosters always seem to dance around. Three replies already are saying “Nobody cares.” Until they do. I’d be willing to bet that some time in the near future, some big company is going to care a lot and that there will be a landmark lawsuit that significantly changes the LLM landscape. Regulation or a judge is going to eventually decide the extent to which someone can use AI to copy someone else’s IP, and it’s not going to be pretty.

    • SpicyLemonZest 5 days ago

      It just presumes a level of fixation in copyright law that I don’t think is realistic. There was a landmark lawsuit MAI v. Peak Computer in 1993, where judges determined that repairing a computer without the permission of the operating system’s author is copyright infringement, and it didn’t change the landscape at all because everyone immediately realized it’s not practical for things to work that way. There’s no realistic world where AI tools end up being extremely useful but nobody uses them because of a court ruling.

  • cavisne 5 days ago

    Some APIs (Gemini, at least) run a search over their outputs to see if the model is reciting data from its training set.

    So direct copies like the ones you're talking about would be picked up.

    As for copying concepts from other libraries, that seems like a problem with or without LLMs.
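
    A crude local analogue of that kind of recitation check (this says nothing about how Gemini actually does it; the names and threshold here are made up for illustration) is to compare long token n-grams of the output against a corpus of known code:

        def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
            toks = text.split()
            return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

        def looks_recited(generated: str, corpus: list[str], threshold: float = 0.3) -> bool:
            # flag output that shares a large fraction of its long n-grams
            # with any single known source file
            gen = ngrams(generated)
            if not gen:
                return False
            return any(len(gen & ngrams(src)) / len(gen) > threshold for src in corpus)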

  • aryehof 5 days ago

    The consensus, right or wrong, is that LLM-produced code (unless repeated verbatim) is equivalent to you or me legitimately stating our novel understanding of mixed sources, some of which may be copyrighted.

  • throwawaysleep 6 days ago

    As an individual dev, I simply don’t care. Not my problem.

    Companies are satisfied with the indemnity provided by Microsoft.

viraptor 6 days ago

The documentation angle is really good. I've noticed it with the .mdc files and the llms.txt semi-standard. Documentation is often treated as just extra cost and a chore. Now, a good description of the project structure and good examples suddenly become something devs want ahead of time. Even if the reason is not perfect, I appreciate this shift we'll all benefit from.

drodgers 6 days ago

> Prompts as Source Code

Another way to phrase this is LLM-as-compiler and Python (or whatever) as an intermediate compiler artefact.

Finally, a true 6th generation programming language!

I've considered building a toy of this with really aggressive modularisation of the output code (e.g. Python) and a query-based caching system, so that each module of code output only changes when the relevant part of the prompt or upstream modules change (the generated code would be committed to source control like a lockfile).

I think that (+ some sort of WASM-encapsulated execution environment) would be one of the best ways to write one-off things like scripts, which don't need to incrementally get better and more robust over time in the way that ordinary code does.
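
A minimal sketch of that lockfile-style cache (all names here are illustrative, and generate_with_llm stands in for the model call): a module only gets regenerated when the hash of its prompt plus its upstream modules' generated code changes.

    import hashlib, json, pathlib

    CACHE = pathlib.Path("generated.lock.json")   # committed to the repo, like a lockfile

    def generate_with_llm(prompt: str, upstream: list[str]) -> str:
        """Stand-in for the actual LLM call."""
        ...

    def build_module(name: str, prompt: str, upstream: list[str]) -> str:
        lock = json.loads(CACHE.read_text()) if CACHE.exists() else {}
        key = hashlib.sha256("\n".join([prompt, *upstream]).encode()).hexdigest()
        entry = lock.get(name)
        if entry and entry["key"] == key:
            return entry["code"]                   # nothing relevant changed: reuse as-is
        code = generate_with_llm(prompt, upstream)
        lock[name] = {"key": key, "code": code}
        CACHE.write_text(json.dumps(lock, indent=2))
        return code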

  • sumedh 6 days ago

    > Finally, a true 6th generation programming language!

    Karpathy already said English is the new programming language.

kookamamie 5 days ago

> Treat prompts as version-controlled assets

This only works if the model and its context are immutable. None of us really control the models we use, so I'd be sceptical about reproducing the artifacts later.

lmeyerov 5 days ago

If/when to commit prompts has been fascinating as we have been doing similarly to build Louie.ai. I now have several categories with different handling:

- Human reviewed: Code guidelines and prompt templates are essentially dev tool infra-as-code and need review

- Discarded: Individual prompt commands I write, and the implementation-plan progress files the AI writes, both get trashed; they're even part of my .gitignore. They were kept by Cloudflare, but we don't keep these.

- Unreviewed: Claude Code does not do RAG in the usual sense, so it is on us to create guides for how we do things like use big frameworks. They are basically indexes for speeding up AI with less grepping + hallucinating across memory compactions. The AI reads and writes these, and we largely stay out of it.

There are weird cases I am still trying to figure out. Ex:

- a feature implementation might start with the AI coming up with the product spec, so having that spec maintained as the AI progresses, and committed, is a potentially useful artifact

- how prompt templates get used is helpful for their automated maintenance.

mastazi 5 days ago

> Around the 40-commit mark, manual commits became frequent—styling, removing unused methods, the kind of housekeeping that coding models still struggle with. It's clear that AI generated >95% of the code, but human oversight was essential throughout.

But things like styling and unused-code removal have been automated for a long time already, thanks to non-AI tools; assuming the AI agent has access to those tools (e.g. it can trigger a linter), the engineer could have just included those steps in the prompts instead of running them manually.

EDIT - I still think there are aspects where AI is obviously lacking; I just think those specific examples are not among them
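
For the linter/formatter part specifically, the deterministic cleanup can just be a step the agent (or a git hook) runs after every change, with the output fed back into the next prompt when something fails. A rough sketch (the tool choice is illustrative; any formatter/linter pair works the same way):

    import subprocess

    def tidy() -> str | None:
        """Run deterministic cleanup; return tool output for the agent if anything fails."""
        for cmd in (["prettier", "--write", "."], ["eslint", "--fix", "."]):
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                return result.stdout + result.stderr   # goes back into the agent's next prompt
        return None                                    # formatted and lint-clean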

  • buu700 5 days ago

    Speaking of which, something funny I've noticed when using agents with prettier in a pre-commit hook is that the logs occasionally include the model thanking "me" for cleaning up its code formatting.

axi0m 5 days ago

>> what if we treated prompts as the actual source code?

And they probably will be. It looks like prompts are becoming the new higher-level coding language, in the same way that a language like JavaScript is a human-friendly abstraction over lower-level languages like C, which are themselves a more accessible way to write assembly, which in turn stands in for the underlying binary code... I guess we've eventually reached the final step in the development chain, bridging the gap between hardware instructions and human language.

  • dgb23 4 days ago

    C, JS, etc. are abstractions in the Dijkstra sense. Coding agents aren't.

GPerson 4 days ago

What are some ethical ways to oppose this? I’ll continue to make clear that at least one voice out here opposes AI in all forms.

UltraSane 5 days ago

I was thinking that if you had a good enough verified mathematical model of your code, using TLA+ or similar, you could then use an LLM to generate your code in any language and be confident it is correct. This would be declarative programming: instead of putting in a lot of work writing code that MIGHT do what you intend, you put more work into creating the verified model, and then the LLM generates code that does what the model intends.
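
The cheap approximation of this available today is weaker than real TLA+ model checking, but the shape is the same: keep a small trusted model (a reference implementation or a set of properties) and only accept the LLM's code if it agrees with the model on a large number of checked inputs. A rough sketch, with random differential testing standing in for actual model checking and all names made up:

    import random

    def reference_sort(xs: list[int]) -> list[int]:
        return sorted(xs)               # the trusted "model": slow but obviously right

    def generated_sort(xs: list[int]) -> list[int]:
        """Stand-in for whatever the LLM produced."""
        ...

    def conforms(trials: int = 10_000) -> bool:
        # accept the generated code only if it matches the model on every input we try
        for _ in range(trials):
            xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
            if generated_sort(list(xs)) != reference_sort(xs):
                return False
        return True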

never_inline 5 days ago

> Don't be afraid to get your hands dirty. Some bugs and styling issues are faster to fix manually than to prompt through. Knowing when to intervene is part of the craft.

This has been my experience as well. It's why I always run the CLI tool in the bottom pane of an IDE and not in a standalone terminal.

ianks 4 days ago

<thinking>I’m trying to remember if oauth has a specification or not, but I’m getting conflicting thoughts</thinking>

Lerc 5 days ago

>Treat prompts as version-controlled assets. Including prompts in commit messages creates valuable context for future maintenance and debugging.

I think this is valuable data, but it is also out-of-distribution data. Prior to AI models writing code, this wasn't present in the training set. Additional training will probably be needed to correlate better results with the new input stream, and also to learn that some of the records reflect its own unreliability and to develop a healthy scepticism of what it has said in the past.

There's a lot of talk about model collapse with models training purely on their own output, or AI slop infecting training data sets, but ultimately it is all data. Combined with a signal to say which bits were ultimately beneficial, it can all be put to use. Even the failures can provide a good counterfactual signal for contrastive learning.

fpgaminer 6 days ago

I used almost 100% AI to build a SCUMM-like parser, interpreter, and engine (https://github.com/fpgaminer/scumm-rust). It was a fun workflow; I could generally focus on my usual work and just pop in occasionally to check on and direct the AI.

I used a combination of OpenAI's online Codex and Claude Sonnet 4 in VSCode agent mode. It was nice that Codex was more automated and had an environment it could work in, but its thought-logs are terrible. Iteration was also slow because it takes a while for it to spin the environment up. And while you _can_ have multiple requests running at once, it usually doesn't make sense for a single, somewhat small project.

Sonnet 4's thoughts were much more coherent, and it was fun to watch it work and figure out problems. But there's something broken in VSCode right now that makes its ability to read console output inconsistent, which made things difficult.

The biggest issue I ran into is that both are set up to seek out and read only small parts of the code. While they're generally good at getting enough context, it does cause some degradation in quality. A frequent issue was replication of CSS styling between the Rust side of things (which creates all of the HTML elements) and the style.css side of things. Like it would be working on the Rust code and forget to check style.css, so it would just manually insert styles on the Rust side even though those elements were already styled on the style.css side.

Codex is also _terrible_ at formatting and will frequently muck things up, so it's mandatory to use it with an autoformatter and instructions to use it. Even with that, Codex will often say that it ran it, but didn't actually run it (or ran it somewhere in the middle instead of at the end) so its pull requests fail CI. Sonnet never seemed to have this issue and just used the prevailing style it saw in the files.

Now, when I say "almost 100% AI", it's maybe 99% because I did have to step in and do some edits myself for things that both failed at. In particular neither can see the actual game running, so they'd make weird mistakes with the design. (Yes, Sonnet in VS Code can see attached images, and potentially can see the DOM of vscode's built in browser, but the vision of all SOTA models is ass so it's effectively useless). I also stepped in once to do one major refactor. The AIs had decided on a very strange, messy, and buggy interpreter implementation at first.

brador 5 days ago

Many of you are failing to comprehend the potential scale of AI-generated codebases.

Take note - there is no limit. Every feature you or the AI can prompt can be generated.

Imagine if you were immortal and given unlimited storage. Imagine what you could create.

That’s a prompt away.

Even now you’re still restricting your thinking to the old ways.

  • _lex 5 days ago

    You're talking ahead of the others in this thread, who do not understand how you got to what you're saying. I've been doing research in this area. You are not only correct, but the implications are staggering, and go further than what you have mentioned above. This is no cult, it is the reorganization of the economics of work.

    • OjotCewIo 5 days ago

      > it is the reorganization of the economics of work

      and the overwhelming majority of humanity will be worse off for it

  • latexr 5 days ago

    You’re sounding like a religious zealot recruiting for a cult.

    No, it is not possible to prompt every feature, and I suspect people who believe LLMs can accurately program anything in any language are frankly not solving any truly novel or interesting problems, because if they were they’d see the obvious cracks.

    • nojito 5 days ago

      > I suspect people who believe LLMs can accurately program anything in any language are frankly not solving any truly novel or interesting problems, because if they were they’d see the obvious cracks.

      The vast majority of problems in programming aren't novel or interesting.

      • latexr 4 days ago

        Which in no way contradicts my point. There is still a chasm of difference between a tool which can aid with a majority of issues and one which can solve everything, which is what the commenter I replied to is preaching.

  • politelemon 5 days ago

    > That’s a prompt away.

    Currently, it's 6 prompts away, 5 of which are me guiding the LLM to output the answer that I already have in mind.

  • zeofig 5 days ago

    Think about this for a second

    E = MC^2 + AI

    A new equation for physics?

    The potential of AI is unlimited.