Of course, in theory a magic fairy-tale Java compiler could look at the code, recognize all of those structural problems (boxed types, poor memory layout, and so on) and fix them; the language's semantics would even allow this to happen without any changes. But no Java compiler even comes close...
Profile-guided optimization in particular was hailed as the next big thing: supposedly only JIT-ed languages would be truly fast, because after some warmup time they would adapt to your data and your hardware. Java is still dog-slow...
I was there in the 90s and people at Sun were saying that Java would be faster than C++. This kind of talk even begat the Jikes RVM, a Java-in-Java JVM from IBM.
To me, the best argument for constraints imposed by a language is around correctness, not efficiency. When we write a program, we tend to have an idea of how we want it to behave, but it's a pretty universal fact that this is hard to get exactly right on the first try (and often it can be hard to even tell without extensive testing, which is why bugs aren't all fixed by the time software actually gets released). In a certain sense, the act of writing and debugging a program can be thought of as searching through the space of all the possible programs that you could be writing, repeatedly narrowing down the set of potential choices by ruling out ones you know are incorrect, and then eventually picking one to use as the candidate you think is the one you want. From this perspective, language constraints can help with this process pretty much every step of the way; some programs are ruled out because you can't even express them in the first place, others are able to be rejected as you narrow down the set you're looking at based on each new line of code you write and how the constraints interact with that, and potentially even with debugging after the fact when trying to figure out what went wrong with an incorrect selection (i.e. one that has bugs).
When we're using Turing complete languages for pretty much everything, constraints are pretty much the only thing that semantically differentiates the code we write in them at all. To me, this is basically the concept people are trying to convey when they argue for picking "the right tool for the right job". At the end of the day, what makes a language useful is just as much defined by what you _can't_ say as by what you can.
Problem is that those constraints are often very hard to express, even in more expressive languages.
One example would be integers. There are bigint-by-default languages like Haskell, where the type you usually use is an arbitrary-sized bigint. But usually you don't need that, and you usually know that something like a 32-bit integer is sufficient. Often you get an int32 type that does that stuff for you. But then the question becomes one of overflow behaviour, signedness, the existence of -0, the behaviour of negating INT_MIN, and things like that. Even in C, you are in undefined-behaviour-launch-the-rockets territory very quickly. And there are usually no "screw it, I don't care, give me whatever the CPU does" types. You don't get to choose your overflow/underflow behaviour. Often there are no bounded types either (an unsigned integer day-of-year from 1 to 365). And if those types are available in libraries, the compiler won't be able to optimize them. (A small C illustration of the overflow point follows below.)
There are tons more of those examples. It's always a compromise between the expressiveness that we want and that would be great, and the complexity that will make an implementation non-viable. So you get as much expressiveness as the language designer thinks he could maybe cram into his compiler. But it's always too much to be really optimized all the way through. And it's always too little for what one would actually need to be really exact.
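To make the integer example concrete, here is a minimal C sketch of the overflow behaviours you can actually get (the function names are mine; __builtin_add_overflow is a GCC/Clang extension, not standard C):

    #include <stdint.h>
    #include <stdio.h>

    /* Unsigned arithmetic is defined to wrap modulo 2^32. */
    uint32_t wrap_add(uint32_t a, uint32_t b) { return a + b; }

    /* Signed overflow is undefined behaviour: the compiler may assume it
       never happens, so there is no portable "whatever the CPU does" type. */
    int32_t ub_add(int32_t a, int32_t b) { return a + b; }

    /* Explicitly checked addition via a compiler builtin. */
    int checked_add(int32_t a, int32_t b, int32_t *out) {
        return !__builtin_add_overflow(a, b, out);   /* 1 = ok, 0 = overflowed */
    }

    int main(void) {
        int32_t r;
        printf("%u\n", (unsigned) wrap_add(UINT32_MAX, 1u));  /* prints 0 */
        printf("%d\n", checked_add(INT32_MAX, 1, &r));        /* prints 0: overflow caught */
        return 0;
    }

A bounded type like "day of year, 1 to 365" has no direct equivalent at all; you build it out of one of the above plus asserts or a library wrapper, and the compiler rarely exploits the bound.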
Reminds me of the talk on Rust's LLVM IR optimization issues. Constrained languages like Rust can be optimized way better than C/C++. Seen similar points made before.
I don't disagree, but I think the optimization potential tends to be limited to trivial automations in practice. The idioms we use to code at a higher level are mostly wrappers on something we know how to do at a lower level with a macro. The compiler still helps because it guard-rails it and puts you on the path to success as you build more complex algorithms, but you have to get into very specific niches to reach both of "faster than C" and "easier to code the idiom".
As ever, hardware is an underappreciated factor. We have make-C-go-fast boxes today. That's what the engineering effort has been thrown towards. They could provide other capabilities that prefer different models of computing, but they don't because the path dependency effect is very strong. The major exceptions come from including alternate modalities like GPU, hardware DSP and FPGA in this picture, where the basic aim of the programming model is different.
> As ever, hardware is an underappreciated factor. We have make-C-go-fast boxes today.
That is a very common misconception. There have been numerous attempts at architectures that cater to higher-level languages. Java-bytecode-interpreting CPUs have been tried and were slower than contemporary "normal" CPUs at executing bytecode. Phones and smartphones were a supposed hot market for those, but it didn't fly; native bytecode execution is dead nowadays. Object-orientation in CPUs has been tried, like in Intel's iAPX 432. Type-tagged pointers and typed memory have been tried. HLLCAs were all the rage for some time ( https://en.wikipedia.org/wiki/High-level_language_computer_a... ). Early CISC CPUs had linked lists as a fundamental data type. Lisp machines did a ton of stuff in hardware: GC, types, tagging, polymorphism. None of it worked out; more primitive hardware with more complex interpreters and compilers always won. Not because C was all the rage at the time, but because high-level features in hardware are slow and inflexible.
What came of it was the realization that a CPU must be fast and flexible, not featureful. That's why we got RISC. That's why CISC processors like x86 internally translate down to RISC-like microcode. The only thing that added a little more complexity were SIMD and streaming architectures, but only in the sense that you could do more of the same in parallel. Not that HLL constructs were directly implemented into a CPU.
Memory tagging is making a comeback, and the primitives/API being "fast and flexible" doesn't account for the ridiculous complexity that goes into a CPU, which does indirectly help with linked lists, object orientation, etc.
Also, C is in no way special, not even particularly low-level.
I think a good example here is Unity's Burst compiler - it uses a subset of C#, which allows for more assumptions and better optimization. Compared to regular Unity Mono environment, the code can be faster by one or two orders of magnitude and I've read in some cases it's even faster than C.
I have constructed a language which can only say "hello world" and there is only 1 way to do it.
It is the most easily optimized language I know. If you write a working program in it, you know you're doing it in the most optimal way for that language already.
Everything constrained is easier to optimize. In the limit where you have one or zero possible ways to do something, all solutions are necessarily already optimal.
But the potential of easy optimization does not mean that they will be optimized. Compilers for higher-level languages are full of broken promises around their potential. "Faster than C because we can optimize" is never really realized, except in maybe one benchmark that only works in that one exact scenario and nowhere in my code...
> Compilers for higher-level languages are full of broken promises
Sometimes that's because C is the lingua franca of low-level programming.
Some noalias optimizations Rust relied on didn't work in LLVM, because no one had bothered using them from C before.
This goes even further, down to the hardware. SIMD search over a C-like null-terminated string is faster than over a saner (pointer + length) string view, so it can be faster to append a null to the end of a string view and invoke the null-terminated SIMD routine than to run it on the string slice directly.
And here we go again with the compiler excuses.
C standards with the 'restrict' keyword to allow aliasing optimisations have existed for longer than Rust. LLVM just never bothered, despite the early claims that its intermediate language and "more modern" design would enable more optimisation potential. LLVM is still slower than GCC, even in plain C.
What is the problem with quickly null-terminating a pointer+len string and then using the quick null-terminated SIMD search? It should only be one move slower than the native C. "Space after that string" shouldn't be a problem, since in higher-level languages the allocator can make enough room; it is higher-level, after all (and you need null-termination for syscalls anyway).
It is always the same story. Compiler and programming language people make claims about future optimisations but never ever follow through. It's always just theory, never implementation.
LLVM did do some work with noalias, but it wasn't very mature: it wasn't broadly useful in C/C++ because it was too hard to use correctly.
Then Rust made it trivial to use correctly, and found that what LLVM did have was quite buggy, because it hadn’t been exercised. These were bugs that could in theory have been found in C/C++, but in practice never would have been. And so for quite some years Rust would toggle the noalias optimisations on and off, depending on whether there were any known bad-codegen issues at the time. I think the last toggle was a few years ago by now, this stuff is finally actually stable and appreciated.
My recollection is that the figures for the Rust compiler are in the vicinity of a 5% improvement from emitting noalias in the LLVM IR.
And it’s likely that quite a bit more is possible.
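For anyone who hasn't seen what that aliasing information actually buys the optimizer, here is a minimal C sketch using 'restrict' (illustrative, not taken from rustc or LLVM):

    /* Without restrict, the compiler must assume dst and src may overlap,
       so it has to be conservative (e.g. emit runtime overlap checks before
       vectorizing). With restrict, it may assume the buffers are disjoint
       and vectorize the loop unconditionally. */
    void scale(float *restrict dst, const float *restrict src,
               float factor, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * factor;
    }

Rust's &mut rules let the frontend attach this kind of noalias guarantee to essentially every reference, which is why rustc exercised those LLVM code paths far harder than hand-annotated C ever did.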
> LLVM just never bothered, despite the early claims that its intermediate language and "more modern" design would enable more optimisation potential.
They did bother but no one seemed to be using it, so a bug snuck in. It took Rust exercising that corner of LLVM to find the bug.
> LLVM is still slower than GCC, even in plain C.
Citation needed. Last time I checked LLVM beat GCC on -O3. Unless you mean compilation performance.
> What is the problem with quickly null-terminating a pointer+len string and then using the quick null-terminated SIMD search?
Why should two nearly identical operations have such wildly different performance? And why isn't the safer/saner interface more performant?
My point is: if compilers and hardware are optimizing for C, it's no surprise no one can approach its speed.
It's like asking why when everyone is optimizing for Chrome no other web browser can approach its speed.
>> LLVM is still slower than GCC, even in plain C.
> Citation needed. Last time I checked LLVM beat GCC on -O3.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
https://www.phoronix.com/review/gcc-clang-eoy2023/8
There is tons more, just ask your favourite internet search engine.
> Why should two nearly identical operations have such wildly different performance? And why isn't the safer/saner interface more performant?
Because in hardware, checking a zero flag on a register is very cheap: a big NOR over all bits of a sub-register (one flag per lane), and a big OR over those per-lane flags for the whole SIMD register. Very short path, very few transistors.
Checking a length is expensive: increment a length register through addition, including a carry, which is a long path. Compare with another register to find out if you are at the last iteration. Compare usually is also expensive since it is subtract with carry internally, even though you could get away with xor + zero flag. But you can't get around the length register addition.
There can be an optimisation because you do have to increment the address you fetch from in any case. But in most modern CPUs, that happens in special pointer registers internally; if you do additional arithmetic on those, it'll be expensive again. x86 actually has more complex addressing modes that handle those, but compilers never really used them, so those were dropped for SIMD.
A loop that's basically a counted for-loop over pointer+length is going to compile down to a pattern that ends each iteration with an increment, a compare against the length, and a conditional branch (a sketch of both loop shapes follows this comment). (Okay, to be fair, compilers tend to change the loop bounds so that the final comparison is a comparison against 0. This will reduce the register usage of the loop, but otherwise doesn't change the analysis.) Taking into account micro-op fusion, there are at best two actual hardware ops in that assembly pattern, since the comparison and branch will be fused, and the increment(/decrement) is also a decent candidate for fusion. The branch can also be predicted with a high degree of fidelity--an advanced processor should be able to predict the branch with exactly 100% accuracy.
A null-terminated loop instead compiles down to a pattern that ends each iteration with a load of the next character, an increment, a compare against 0, and the branch. You still have the same basic pattern at the end of the loop, except that the INC (which might not have been previously fused) is no longer in the critical path, but now the critical path instead has a load in it. That load is a much slower instruction than a test-against-0 or a comparison: no load completes in one cycle, whereas both tests (bitwise and) and comparisons (subtraction) complete in one cycle. Splitting hairs by saying that the test-against-0 is going to be less expensive than a subtract-and-check-carry, so therefore null-terminated is faster, ignores the fact that they execute in the same time and brushes aside the extra load, which is definitely slower, that exists in the null-terminated loop.
It gets worse, though. When you have a pointer+length combo, the length of the array, and the legal dereferenceability, is now known to the compiler in symbolic form that is easy to compute. That makes it possible to do things like unroll loops, or vectorize loops, or rewrite loop bounds in better forms, etc. When it's null-terminated, the length comes from reading memory, and the amount of memory you can read is not known ahead of time. Unrolling the loop or other steps are no longer really possible since you don't know how many iterations the loop will take, and the extra loop exits you'd generate tend not to be tolerable in optimization.
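For concreteness, here is the shape of the two loops being compared, as a small C sketch with the per-iteration tail spelled out in comments (the function names and register names are illustrative only):

    #include <stddef.h>

    /* Counted search over ptr+len: the iteration tail is roughly
       INC i ; CMP i,len ; JNZ loop (or a decrement toward zero). */
    const char *find_counted(const char *s, size_t len, char c) {
        for (size_t i = 0; i < len; i++) {
            if (s[i] == c)
                return &s[i];
        }
        return NULL;
    }

    /* Null-terminated search: the iteration tail is roughly
       LOAD r0,[p] ; INC p ; CMP r0,0 ; JNZ loop. The load now sits on
       the path that decides whether the loop keeps running. */
    const char *find_nul(const char *s, char c) {
        for (; *s != '\0'; s++) {
            if (*s == c)
                return s;
        }
        return NULL;
    }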
Storing the string length explicitly as an 8-byte integer does have a measurable cost. Consider llvm::Twine as an example, it supports storing a null-terminated string and a ptr+len string (among other options). I once changed the implementation to store string literals (length known at compile-time) as ptr+len instead of a pointer to a C string, with the intention of avoiding the strlen in the callee on constant strings. However, this change made things slower, because of the cost of storing the length everywhere. (That's why I never proposed such a change upstream.)
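A rough sketch of that trade-off (the type and function names here are mine, not LLVM's): the view variant saves the strlen() in the callee, but every argument, field and temporary now carries an extra 8 bytes that has to be materialized and copied around.

    #include <stddef.h>
    #include <string.h>

    /* 8 bytes to pass or store; the callee recomputes the length. */
    size_t use_cstr(const char *s) { return strlen(s); }

    /* 16 bytes to pass or store; the length is always written out,
       even at call sites that never end up needing it. */
    struct str_view { const char *ptr; size_t len; };
    size_t use_view(struct str_view s) { return s.len; }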
The critical (data) path of the null-terminated loop, however, does not include the load -- the actually loaded value is not a loop-carried dependency in your example. The re-steering of the branch at the end of the loop might then happen much later, though.
Vectorization with null-terminated strings is possible and done, but requires alignment checking, which adds some cost.
No. First, that isn't SIMD code. Second, CMP r0,0 is redundant: operating on a register that is zero will set the zero flag automatically. You also skipped the string search/comparison part, but that isn't relevant, except that the operation will implicitly set the zero flag. Generally, plain JZ is faster than CMP + JZ. Even more so if it is SIMD.
And loads are amortized by tons of speculative preloading. Which isn't (yet, unfortunately) optimized by preload-length hints.
Oh, and since we are talking about comparing strings, there is no extra load. This isn't strlen(), this is strchr() or strstr() we are talking about. You always have to load the string into some register.
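For what it's worth, this is roughly what such a null-terminated SIMD search looks like; an illustrative SSE2 sketch, not any particular libc's implementation. Note the up-front alignment (the cost mentioned a few comments up) and that every 16-byte chunk gets loaded regardless, which is the "no extra load" point:

    #include <emmintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Search a NUL-terminated string for c, 16 bytes at a time. A 16-byte
       aligned load never crosses a page boundary, so it cannot fault even
       if it reads past the terminator -- the usual argument real
       implementations rely on. */
    const char *simd_strchr(const char *s, char c) {
        const __m128i zero   = _mm_setzero_si128();
        const __m128i target = _mm_set1_epi8(c);
        const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
        unsigned ignore = (1u << ((uintptr_t)s & 15)) - 1;  /* bytes before s */
        for (;;) {
            __m128i chunk = _mm_load_si128((const __m128i *)p);
            unsigned hit = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, target)) & ~ignore;
            unsigned nul = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero)) & ~ignore;
            if (hit | nul) {
                unsigned first_hit = hit ? (unsigned)__builtin_ctz(hit) : 16u;
                unsigned first_nul = nul ? (unsigned)__builtin_ctz(nul) : 16u;
                return first_hit <= first_nul ? p + first_hit : NULL;
            }
            p += 16;
            ignore = 0;
        }
    }

A pointer+length version of the same search can skip the NUL comparison and knows its trip count up front, but it has to handle the final partial chunk some other way; that difference is what this subthread is arguing about.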
Your Phoronix link shows clang as faster than GCC (more is better on the graph). The benchmark game results seem to go back and forth.
I don't think that it's convincing evidence that clang is slower than GCC.
Yes, you are right. I think they are head-to-head nowadays.
I was thinking about, and meaning to post that one, but got the wrong search result:
https://www.phoronix.com/review/clang20-gcc15-amd-znver5/5
But they swapped places a few times over the last years, so basically there doesn't seem to be a fundamental difference in performance anymore.
It's also that whenever an hour is poured into Haskell, Scheme, or OCaml, there's hundreds of hours poured into Javascript, Python, C, Java, ...
Eventually that lets you optimize things more at the compiler level because you have the extra time to maintain it as well.
My primary argument is that a more restrictive language, if designed correctly, has programs which are resistant to bit-rot over time. You can take a program from 15 years ago and compile it. It'll run faster due to optimizations and hardware improvements. And the program will still be the "right thing to do."
In contrast, if you take C code that's older, you'd shudder and immediately begin identifying things left and right which need to be redone, because the modern machine is different from the aged machine. Part of the efficiency comes from the fact that we are building the C code such that it has affordances for efficient execution on current hardware. But when the hardware updates, the code needs amendment to keep up. Meanwhile, your SQL statement is the same because it's written declaratively.
> Meanwhile, your SQL statement is the same because it's written declaratively.
I must have imagined everything I've read about query planners.
I don't know about "faster than C", but usually the promise is that you can get to within the same order of magnitude of performance with higher level abstractions, more guarantees of correctness (or at least higher likelihood), and for far less effort than the optimal handwritten C equivalent.
This is just one aspect of the principle of least power:
https://www.w3.org/DesignIssues/Principles.html#PLP
By restricting the power of a language, you enable it to be used in more ways than just “execute it and see what the result is”.
While I subscribe to the "principle of least power", Rice's theorem extends even to total functions in finite time. So while you can make choices about what to over/under constrain, you are still stuck with “execute it and see what the result is” in the general case.
The vast majority of code isn't the general case. Very few practical applications legitimately intend to execute halt-undecidable code for instance.
But if the language itself strongly obfuscates reliable relationships (or makes them difficult or impossible to express or guarantee), you miss out on all of the desired, and leverageable, certainty that you still intend to rely on.
I used to argue with the Python crowd about that. In Python, any code can, if it works at it, find and mess with any data object anywhere, even in another thread. So the compiler can't examine a function, decide that it can't possibly do something, and optimize. That's held back PyPy.
Most useful Python programs could be hard-compiled to machine code, because they're not doing anything weird enough to require a JIT compiler. But some low-level library might be, which breaks hard compilation.
("But then my domain-specific language / interface to another language / code generator / tracing package / debugger won't work!")
If the compiler is examining the function to know what to optimize, why does it not know what that function does, such that it can't optimize? That is, I assume you are correct, but you lost me in the logic. Is it eval?
Because, in Python, you can monkey-patch anything from anywhere. It's not done much, but the capability is there.
For example, you can create an instance x of class Foo, change its one method x.bar to do something completely different at runtime, and any existing code that calls Foo.bar will happily execute your version. Which means Python cannot optimize any function that calls Foo.bar even if it "knows" what Foo.bar does.
The issue in dynamic languages like python is that almost every little thing is impossible to be sure at compile time:
* The type of arguments passed to the function. Everything might point to the function always being passed values of a particular type, but you still can't be sure.
* If you call another function, you can't be sure that you'll actually call the function you expect, because any function or global variable may be overwritten at runtime.
* There are many places in the language where an invisible function call might happen (operator overloads, garbage-collection methods, etc.), and any of those invisible calls could potentially monkey patch something
All of these complexities, and more, are why JIT compilers like PyPy exist. Instead of optimizing at compile time, they observe things at runtime, optimize optimistically assuming that the program won't try anything funny, but stay ready to roll back the optimizations in case the program does try something funny.
> why does it not know what that function does, such that it can't optimize?
Because that seems to require the compiler to be an AGI. It's very hard to explain why it's hard to make a truly intelligent program, but you just have to trust that tons of people have tried for entire careers to solve this and so far none have solved it.
Initial comments as I write this are all negative, and also responding to something the blog post didn't claim. The only time it says faster than C is when talking about a language targeting the GPU; it is not controversial that GPUs perform better than CPUs for many workloads.
On the other hand, the results of this true fact are already seen in practice in real systems. Rust's iterator adapters, for example, allow high level "functional style" with closures to be compiled to code that is exactly the same as hand rolled C loops without unamortized bounds checks - despite guaranteeing the absence of out of bound accesses even in the face of programmer error - because the constraints of these interfaces and the design of the language enable tons of optimizations by LLVM. This has been true since Rust 1.0 more than a decade ago, it's not a promise of the future.
Brings back nostalgic memories from Java days in the 2000s. "Java will be faster than C" for exactly the same reasons laid out in the article.
And here we are, 20 years later.
Yes, 20 years later, the JVM is considered a marvel of engineering, and, surprisingly, some of those marketing claims have in fact come true. Or at least, more than would be expected after doing the hype-correction always called for in such situations. Yes, Java can be faster in certain contexts: the kind of contexts Java is usually employed in addressing these days. Stuff like enterprise backend systems and big data processing, where startup time or precise memory control are not important. Then there's GraalVM which allows for AOT compilation, etc.
I'd like to see those numbers please. While there have been improvements, Java performance work seems to have stopped after 2011 or so. Mobile CPUs became fast enough for Android, so no movement there. And commercial software got stuck on numerous proprietary Java VMs that didn't improve after the Oracle takeover of Sun. Alternative VMs improved, but didn't get adopted, fragmented, and in the end didn't really raise the bar a lot.
To this day, Java applications are the slowest and most memory hungry long-running server applications by far. Only some scripting languages are worse, but only by performance, almost never by memory.
> To this day, Java applications are the slowest and most memory hungry long-running server applications by far
I am not a fan of Java, but the slow part is just plain wrong. On properly sized hardware Java is blisteringly fast. Did it exceed the speed of C as hyped? No, in most cases it falls solidly short, but I'm sure there are isolated cases where it does. Regardless, 2-3x slower than C using a JIT is a serious feat of engineering.
All GC languages require more memory, and Java is even worse given lack of value types, but again, memory/hardware is cheap compared to programmer time and if you have enough of it, it generally isn't a problem.
All this does mean you will need a bit more hardware than a C program, so perhaps that is the poster's critique, and that is true, but compared to other high level languages, it does great in the speed dept, and is often only a bit higher in the memory dept.
I can only assume that you stopped paying attention in 2011 or so because "slow", Java is not. Memory hungry, sure, I don't think anyone would deny that.
The core of the hype at the time was parallel processing. And to be honest, here Java absolutely beats C, which most of the time will simply refuse to parallelize, because it can't be done maintainably and reliably in the language without some very extreme cost. It's simply too dangerous. Meanwhile, in Java people could freely experiment with concurrent data structures and algorithms that will simply be in a completely different league than the serial C program. Sure, maybe the latter's hot loop is faster, but it doesn't mean much if I can just run 100s of threads with Java.
Your second paragraph is just thoroughly uninformed, though. And unused RAM is useless, so in a server application not making use of it is just dumb. Maybe think about why Java is used extensively by pretty much every FAANG company. Also, Java has improved steadily after Oracle's takeover, and is in fact the open-source reference implementation. Again, uninformed bullshit.
> Maybe think about why Java is used extensively by pretty much every FAANG company.
Java programmers are a dime a dozen. Universities churn out heaps of Java code slaves, so of course every big company will use Java for their shovelware.
There are even more JS and Python devs. Maybe have a go at my other points instead.
Seriously? Since 2011, so since Java 7, stagnation in performance? That's a startling claim. I'd like to see some numbers as well. I'll start the first round with this article:
https://kstefanj.github.io/2021/11/24/gc-progress-8-17.html
Note: this is comparing Java 17 to Java 8 (released in 2014), and we are now at Java 24!
That is garbage collection performance, something that only GC languages even have to consider as additional overhead. It also doesn't even try to compare against anything non-Java.
There are no serious Java performance comparisons on stuff like transaction throughput, which would be what the "long running server application" usecase should aim for. You yourself claimed that enterprise data processing is what you are talking about. There are benchmarks that even with long warmup times make Java come up far worse than all the compiled languages (even Golang nowadays). https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
Imho, people have stopped benchmarking Java ages ago, since it is known to be hopeless.
Calling GC time "additional overhead" is not entirely correct, imo. Non-GC languages often spend plenty of time dealing with malloc/free, and I consider both GC and malloc/free to be "memory management". Both are non-trivial!
GC usually also comes in "copying" variants, which can have advantages like reducing memory fragmentation.
Granted, malloc/free isn't totally free. But malloc/free languages always seem to be faster than Java in benchmarks, at least the ones I've read and referenced already in the comments around this article. GC also, even if it were faster, comes at the expense of eating RAM like mad, at least if judged by the RAM Java likes to eat. Usually, Java software just consumes 50% of your RAM until the first GC pause. And even before that GC pause, it's dog slow.
Memory fragmentation can more easily be reduced by pooled allocators. For long-running server processes, it can also be possible to just kill the worker processes/threads from time to time. And I've never seen memory fragmentation in manually allocated languages lead to more memory consumption or worse performance than in Java.
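For readers who haven't seen one: a pooled (fixed-size) allocator really is only a few dozen lines of C. A minimal sketch, with names of my own choosing, no thread safety and no growth:

    #include <stddef.h>
    #include <stdlib.h>

    /* All objects share one slab; freed slots go on a free list, so
       same-sized allocations never fragment the heap. */
    typedef struct pool_node { struct pool_node *next; } pool_node;

    typedef struct {
        unsigned char *slab;
        pool_node *free_list;
        size_t obj_size;
    } pool;

    int pool_init(pool *p, size_t obj_size, size_t count) {
        if (obj_size < sizeof(pool_node)) obj_size = sizeof(pool_node);
        p->slab = malloc(obj_size * count);
        if (!p->slab) return -1;
        p->obj_size = obj_size;
        p->free_list = NULL;
        for (size_t i = 0; i < count; i++) {   /* thread every slot onto the free list */
            pool_node *n = (pool_node *)(p->slab + i * obj_size);
            n->next = p->free_list;
            p->free_list = n;
        }
        return 0;
    }

    void *pool_alloc(pool *p) {
        pool_node *n = p->free_list;
        if (!n) return NULL;                   /* pool exhausted */
        p->free_list = n->next;
        return n;
    }

    void pool_free(pool *p, void *obj) {
        pool_node *n = obj;                    /* slot goes back on the free list */
        n->next = p->free_list;
        p->free_list = n;
    }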
https://www.techempower.com/benchmarks/#section=data-r23&tes...
Here is your transaction throughput. But I'm sure AWS's whole platform team, Alibaba, third of Google, etc don't know what they are doing.
No big difference for stuff that waits for I/O, at least if it doesn't do it in an inefficient way. No Java in sight in the top lot for "Cached queries" which is the one where language makes a difference.
I think you need to say what you mean by "far worse".
If you look at the extremes, it is up to a factor of 3: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
If you exclude stuff like SSE intrinsics, it's usually 50% slower.
To be clear, you're saying for the benchmark tasks with the worst performing Java programs; the Java programs perform worse than the Go programs for the benchmark tasks with the worst performing Go programs?
What about the median performance benchmark tasks?
I'm saying that Golang is one of the worst-performing compiled languages, and even Golang is faster than Java for the respectively fastest program on the page I've referenced. Other compiled languages are even better than Golang.
The Java programs median shows slightly faster than the Go programs median, in the first chart on that page.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
> While there have been improvements, Java performance work seems to have stopped after 2011 or so.
No idea how you got that impression. Here is an incomplete set of important changes:
- major developments on Graal and native code compilation for applications that prioritize startup time and memory use (I think Graal was started before 2011, but has changed a huge amount since then).
- the concurrent G1 garbage collector became the default, making most applications have much lower pause times (I believe it was released in Java 7 in 2011, but was not mature, only became a default in Java 9, and has regular improvements since then).
- the Shenandoah and ZGC concurrent garbage collectors were not started until well after 2011
- Project Lilliput, which reduces the header size of Java objects from 16 to 8 bytes, is landing in Java 25, which comes out this fall
- specializing strings to use 1 byte/character for ascii strings, and 2 bytes for non-ascii strings
- virtual threads shipped in Java 21
- work on a vectorization API
- plenty of work on reducing the startup time of the JDK and modularizing it
One where I'm not sure about the timeline:
- https://docs.oracle.com/javase/8/docs/technotes/guides/vm/cl... class data sharing to lower overhead of running multiple JVMs on a single node
P.S. I'm not writing this to comment on the parent's claim that Java is faster in some contexts.
The grandparent made specific claims about scenarios where Java is supposedly faster:
> Yes, Java can be faster in certain contexts: the kind of contexts Java is usually employed in addressing these days. Stuff like enterprise backend systems and big data processing, where startup time or precise memory control are not important.
The only thing that might be relevant in your list is the vectorization API.
The rest of your list is just a bunch of band-aids for the worst problems that Java specifically has, that other languages lack. That is at best playing catch-up with everything else, since in C there is no GC overhead, almost no startup time, no object overhead, no GC-induced memory hogging, no GC pauses. Yes, I agree that things did happen. But nothing in there makes Java any more desirable for relevant workloads, it just makes it less undesirable.
And I'm gonna walk everywhere on foot from now on, because I don't have to drink petrol.
Are you even aware of the word "tradeoff"?
Ah, now the excuse for being slow is "tradeoff".
"Slow" means nothing without context. The proper context is the performance, productivity, future maintainability, debuggability and a bunch of other factors in relation to the given business requirements and goals.
The current context is compiler optimizations, as per the original article. That means we are talking about performance as a compiler produces it (or doesn't).
TFA only talks about having strong or weak static guarantees. There's more that typically comes with a "high-level" language like Java: Nonzero-cost abstractions like boxed objects, automatic memory management and more.
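To make the boxed-object point concrete, a small illustrative Java snippet (names are mine): the boxed version pays an object per element plus an unboxing on every access, while the primitive version is flat array arithmetic that the JIT can optimize well.

    // Illustrative only: the cost of boxing vs. primitives.
    public class BoxingCost {
        static long sumBoxed(Integer[] xs) {
            long total = 0;
            for (Integer x : xs) total += x;  // implicit unboxing each iteration
            return total;
        }

        static long sumPrimitive(int[] xs) {
            long total = 0;
            for (int x : xs) total += x;      // plain primitive arithmetic
            return total;
        }

        public static void main(String[] args) {
            int n = 1_000_000;
            Integer[] boxed = new Integer[n];
            int[] primitive = new int[n];
            for (int i = 0; i < n; i++) { boxed[i] = i; primitive[i] = i; }
            System.out.println(sumBoxed(boxed) == sumPrimitive(primitive));  // true
        }
    }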
I dunno, I was there in the 2000s and can't remember anyone saying that. I remember a lot of people saying Java's slightly slower execution will not matter that much when processing hardware gets better.
I was also there in the 2000s, and I remember people saying Java can be just as fast as C.
And in particular that garbage collection can be faster than malloc/free.
This is technically true in the case where you have a naive C program. There are tons of C programs out there, and some are slow. (But there are more slow Java programs out there.)
But it's 100% clear now that GC has a significant cost, and while you can optimize a bad C program, it can be very hard in many cases to optimize the cost of GC away.
C offers more freedom and thus more footguns, but the freedom also means you don't really have a performance cap.
And I prefer GC -- most programs I write will have GC; I just recognize the performance cap and design accordingly. (In fact I wrote a GC for a subset of C++ for https://oils.pub -- I actually think this is a language with nicer performance properties than Java, mainly due to being AOT.)
If you bring parallel processing into the picture (shared memory model), then the achievable performance cap per unit of developer effort changes significantly, though.
I will very easily write a faster parallelizable program in Java before I get the C one even remotely correct, and that's before we've taken future maintenance costs into account. Also, the way C devs often hide the cost is simply... less efficient code, like copying stuff right and left, which can actually be more "bravely" avoided in managed languages.
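For what it's worth, here is roughly what I mean on the Java side (a minimal, illustrative sketch): turning a sequential reduction into a parallel one is a one-word change, and the runtime handles the work splitting and result combining.

    // Illustrative sketch: parallelizing a reduction with parallel streams.
    import java.util.stream.LongStream;

    public class ParallelSumSketch {
        public static void main(String[] args) {
            long sum = LongStream.rangeClosed(1, 100_000_000L)
                                 .parallel()  // the only change from the sequential version
                                 .sum();
            System.out.println(sum);  // 5000000050000000
        }
    }

Getting the equivalent C version correct (work distribution, synchronisation, no data races) is where the extra effort goes.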
>I will very easily write a faster parallelizable program in Java before I get the C one even remotely correct, and that's before we've taken future maintenance costs into account.
Even better in Rust. You get great memory safety guarantees from the borrow checker helping guide your parallelization, while still being fast and GCless.
It's the best of both worlds.
>Also, the way C devs often hide the cost is simply... less efficient code, like copying stuff right and left, which can actually be more "bravely" avoided in managed languages.
I'd say Rust is again the easiest way to avoid unnecessary copies without sacrificing safety.
Then pray tell, how come the most parallel and most demanding computations, like weather simulations, fluid dynamics and the like, avoid superior Java like the plague? Shouldn't it have been a blessing, with its easier parallelizable code?
The reality is quite the opposite. Java's boxed types, lack of control over memory layout (leading to cache thrashing and false sharing), lack of fast synchronisation primitives, and lack of a compiler capable of hinted or automatic loop parallelization make Java totally unsuitable here. The crown goes to FORTRAN in that regard, with C and C++ taking second and third place.
Of course, theoretically a magic fairy-tale Java compiler could look at the code, recognize all the aforementioned problems and fix them, and the language semantics would even allow it. But no Java compiler even comes close...
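To illustrate the false-sharing point for readers who haven't run into it, a rough Java sketch (the 16-slot spacing is an assumption about typical 64-128 byte cache lines; the numbers vary by machine, and Java gives you no portable way to control the layout, which is exactly the complaint):

    // Rough, illustrative sketch of false sharing: two threads increment
    // counters that sit either in adjacent array slots (likely the same
    // cache line) or 16 slots apart (assumed to land on different lines).
    import java.util.concurrent.atomic.AtomicLongArray;

    public class FalseSharingSketch {
        static final AtomicLongArray counters = new AtomicLongArray(32);
        static final long ITERATIONS = 50_000_000L;

        static void run(int a, int b) throws InterruptedException {
            Thread t1 = new Thread(() -> { for (long i = 0; i < ITERATIONS; i++) counters.getAndIncrement(a); });
            Thread t2 = new Thread(() -> { for (long i = 0; i < ITERATIONS; i++) counters.getAndIncrement(b); });
            long start = System.nanoTime();
            t1.start(); t2.start();
            t1.join();  t2.join();
            System.out.printf("slots %2d,%2d: %d ms%n", a, b, (System.nanoTime() - start) / 1_000_000);
        }

        public static void main(String[] args) throws InterruptedException {
            run(0, 1);   // adjacent: counters likely share a cache line
            run(0, 16);  // spread out: typically several times faster
        }
    }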
There were tons of people making those claims. https://stackoverflow.com/questions/64582/c-java-performance...
Profile-guided optimization especially was hailed as the next big thing: JIT-ed languages were supposedly the ones that would be fast, because after some warmup time they would adapt to your data and hardware. Java is still dog-slow...
I was there in the 2000s and remember several people saying that.
Look at your sibling comments. Some people are still saying it now.
I was there in the 90s and people at Sun were saying that Java would be faster than C++. This kind of talk even begat Jikes RVM, a Java-in-Java JVM from IBM.
To me, the best argument for constraints imposed by a language is around correctness, not efficiency. When we write a program, we tend to have an idea of how we want it to behave, but it's a pretty universal fact that this is hard to get exactly right on the first try (and often it can be hard to even tell without extensive testing, which is why bugs aren't all fixed by the time software actually gets released). In a certain sense, the act of writing and debugging a program can be thought of as searching through the space of all the possible programs you could be writing, repeatedly narrowing down the set of potential choices by ruling out ones you know are incorrect, and then eventually picking one as the candidate you think you want. From this perspective, language constraints can help with this process pretty much every step of the way: some programs are ruled out because you can't even express them in the first place, others can be rejected as you narrow down the set you're looking at based on each new line of code you write and how the constraints interact with it, and constraints even help with debugging after the fact, when you're trying to figure out what went wrong with an incorrect selection (i.e. one that has bugs).
When we're using Turing-complete languages for pretty much everything, constraints are pretty much the only thing that semantically differentiates the code we write in them. To me, this is basically the concept people are trying to convey when they argue for picking "the right tool for the right job". At the end of the day, what makes a language useful is defined just as much by what you _can't_ say as by what you can.
Problem is that those constraints are often very hard to express, even in more expressive languages.
One example would be integers. There are bigint-by-default languages like Haskell, where the type you usually use is an arbitrary-sized bigint. But usually you don't need that: you know that something like a 32-bit integer is sufficient, and often you get an int32 type for it. But then the questions become overflow behaviour, signedness, the existence of -0, the behaviour of negating INT_MIN, and so on. Even in C you are in undefined-behaviour-launch-the-rockets territory very quickly. And there are usually no "screw it, I don't care, give me whatever the CPU does" types; you don't get to choose your overflow/underflow behaviour. Often there are no bounded types either (say, an integer day constrained to 1 through 365). And when such types are available as libraries, the compiler can't optimize them well.
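As a concrete managed-language illustration of the overflow point (Java here, purely as an example): the language picks one behaviour for you (silent wrapping), opting into another is a library call rather than a type-level choice, and there is no saturating, trapping-by-default, or range-bounded ("1 to 365") integer type at all.

    // Illustrative: Java fixes the overflow behaviour of int for you.
    public class OverflowChoices {
        public static void main(String[] args) {
            int wrapped = Integer.MAX_VALUE + 1;  // defined: wraps around
            System.out.println(wrapped);          // -2147483648

            try {
                System.out.println(Math.addExact(Integer.MAX_VALUE, 1));  // traps instead
            } catch (ArithmeticException e) {
                System.out.println("overflow detected: " + e.getMessage());
            }
        }
    }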
There are tons more of those examples. It's always a compromise between the expressiveness that we want, and that would be great to have, and the complexity that will make an implementation non-viable. So you get as much expressiveness as the language designer thinks they can maybe cram into their compiler. But it's always too much to be really optimized all the way through, and always too little for what one would actually need to be really exact.
Reminds me of the talk on Rust's LLVM IR optimization issues. Constrained languages like Rust can be optimized way better than C/C++. Seen similar points made before.
I don't disagree, but I think the optimization potential tends to be limited to trivial automations in practice. The idioms we use to code at a higher level are mostly wrappers around something we already know how to do at a lower level with a macro. The compiler still helps, because it guard-rails the idiom and puts you on the path to success as you build more complex algorithms, but you have to get into very specific niches to hit both "faster than C" and "easier to code the idiom".
As ever, hardware is an underappreciated factor. We have make-C-go-fast boxes today. That's what the engineering effort has been thrown towards. They could provide other capabilities that prefer different models of computing, but they don't because the path dependency effect is very strong. The major exceptions come from including alternate modalities like GPU, hardware DSP and FPGA in this picture, where the basic aim of the programming model is different.
> As ever, hardware is an underappreciated factor. We have make-C-go-fast boxes today.
That is a very common misconception. There have been numerous attempts at architectures that cater to higher-level languages. Java-bytecode-interpreting CPUs were tried and were slower at executing bytecode than contemporary "normal" CPUs. Phones and smartphones were a supposed hot market for those; it didn't fly, and native bytecode execution is dead nowadays. Object orientation in CPUs has been tried, as in Intel's iAPX 432. Type-tagged pointers and typed memory have been tried. HLLCAs were all the rage for a while ( https://en.wikipedia.org/wiki/High-level_language_computer_a... ). Early CISC CPUs had linked lists as a fundamental data type. Lisp machines did a ton of stuff in hardware: GC, types, tagging, polymorphism. None of it worked out; more primitive hardware with more complex interpreters and compilers always won. Not because C was all the rage at the time, but because high-level features in hardware are slow and inflexible.
What came of it was the realization that a CPU must be fast and flexible, not featureful. That's why we got RISC. That's why CISC processors like x86 internally translate down to RISC-like microcode. The only things that added a little more complexity were SIMD and streaming architectures, and only in the sense that you could do more of the same in parallel, not that HLL constructs were implemented directly in the CPU.
Memory tagging is making a comeback, and the primitives/API being "fast and flexible" doesn't account for the ridiculous complexity that goes into a CPU, which does indirectly help with linked lists, object orientation, etc.
Also, C is in no way special, not even particularly low-level.
I think a good example here is Unity's Burst compiler: it uses a subset of C#, which allows for more assumptions and better optimization. Compared to the regular Unity Mono environment, the code can be faster by one or two orders of magnitude, and I've read that in some cases it's even faster than C.
I have constructed a language which can only say "hello world" and there is only 1 way to do it.
It is the most easily optimized language I know. If you write a working program in it, you know you're doing it in the most optimal way for that language already.
They are also easier to statically analyze and write tooling around.
Then where is my easy Haskell tooling to find the place where I should put a strictness hint? Where is the equivalent of kcachegrind?
That it's supposedly easier might be true. But that usually doesn't lead to anyone actually doing it.
In the same way, I find that constrained languages like Go are a great choice for achieving high-quality code generation with LLMs.
Everything constrained is easier to optimize. In the limit where you have one or zero possible ways to do something, all solutions are necessarily already optimal.
Thank you Captain Obvious!
But easier to optimize doesn't necessarily mean that they can be optimized to be more performant than less constrained languages.