mekpro a few seconds ago

How does this improvement translate into real-world agentic coding tasks?

dwohnitmok 3 hours ago

Is everyone just glossing over the first place score of 118/120 on the Putnam?! I mean we'll see how it does on the upcoming 2025 test, but that's insane!

We've seen absolutely ridiculous progress in model capability over the past year (which is also quite terrifying).

  • N_Lens 22 minutes ago

    Also impressive: on the IMO-ProofBench Basic benchmark, the model achieved nearly 99% accuracy, though it fell slightly behind Gemini Deep Think on the Advanced subset.

    The approach shifts from "result-oriented" to "process-oriented" verification, particularly important for theorem proving where rigorous step-by-step derivation matters more than just numerical answers.

N_Lens 21 minutes ago

The core innovation is a verifier-generator dual architecture that enables the model to self-check reasoning rigor, addressing the fundamental problem that correct answers don't guarantee correct reasoning processes.
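
In pseudocode, that loop might look like the following. This is a minimal sketch under my own assumptions; the function names, scoring, and best-of-n selection are illustrative, not the paper's actual design:

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    proof: str
    score: float  # verifier's rigor score in [0, 1]

def generate(problem: str, seed: int) -> str:
    """Stand-in for sampling one candidate proof from the generator model."""
    return f"proof of {problem} (variant {seed})"

def verify(proof: str) -> float:
    """Stand-in for the verifier model grading step-by-step rigor,
    not just whether the final answer is right."""
    return 1.0 if "variant 2" in proof else 0.5

def best_of_n(problem: str, n: int = 4) -> Attempt:
    """Sample n candidate proofs and keep the one the verifier rates highest."""
    attempts = []
    for i in range(n):
        proof = generate(problem, i)
        attempts.append(Attempt(proof, verify(proof)))
    return max(attempts, key=lambda a: a.score)

best = best_of_n("x^2 >= 0")
print(best.score)  # prints 1.0
```

The key design point is that selection is driven by the verifier's judgment of the reasoning process rather than by matching a final numerical answer.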

awei 5 hours ago

Something weird here: why is it so hard to have a deterministic program capable of checking a proof or anything math-related? Isn't math super deterministic, while natural language is not? From first principles, it should be possible to do this without an LLM verifier.

  • JacobiX 4 hours ago

    I think that mathematical proofs, as they are actually written, rely on natural language and on a large amount of implicit shared knowledge. They are not formalized in the Principia Mathematica sense, and they are even further from the syntax required by modern theorem provers. Even the most rigorous proofs such as those in Bourbaki are not directly translatable into a fully formal system.
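
    To make the gap concrete, here is a rough Lean 4 rendering (my own sketch, using only core Lean) of the one-line informal fact "the sum of two even numbers is even"; every unpacking step the natural-language proof leaves implicit has to be spelled out:

```lean
-- Informal proof: "write the numbers as 2a and 2b; then 2a + 2b = 2(a + b)."
-- The formal version must unpack each existential and justify the algebra.
theorem even_add_even (m n : Nat)
    (hm : ∃ a, m = 2 * a) (hn : ∃ b, n = 2 * b) :
    ∃ c, m + n = 2 * c :=
  match hm, hn with
  | ⟨a, ha⟩, ⟨b, hb⟩ => ⟨a + b, by rw [ha, hb, Nat.mul_add]⟩
```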

  • xemdetia 4 hours ago

    Maths can be super deterministic but often difficult to compute because of concepts like proof by induction. I personally had to unlearn and rebase my computation-based understanding of math to 'get' pure maths. Another example is set building: in pure math you often don't need to compute the members of a set, you just need to agree that some members exist that meet the criteria. How many there are, or how many things fall outside the set, often isn't meaningful; you accept it and move on with the proof. From the computing perspective this can be difficult to put together.

  • blazespin 3 hours ago

    Verifying math requires something like Lean, which is a huge bottleneck, as the paper explains.

    Plus there isn't a lot of training data in Lean.

    Most gains come from training on material that's already out there, not really the RLVR part, which just amps it up a bit.

  • jebarker 5 hours ago

    I haven’t read the paper yet, but I’d imagine the issue is converting the natural language generated by the reasoner into a form where a formal verifier can be applied.

  • riku_iki 5 hours ago

    Such a high-performance program could indeed be superior, if it existed (this area is very underdeveloped; there is no well-established, widely deployed solution that can handle a large domain) and if math were formalized in that program's DSL, which also hasn't happened yet.

photon_lines 5 hours ago

Exciting stuff from a fantastic team.

zaxioms 5 hours ago

It's cool, but I genuinely cannot fathom why they are targeting natural language proofs instead of a proof assistant.

  • mamami 5 hours ago

    Natural language is a lot more, well, readable than, say, Lean. With Lean you get a lot less intuition about, and understanding of, what the model is attempting to do in the first place.

  • blazespin 3 hours ago

    More training data on advanced math. Lean is cool, but it's mostly about formalizing stuff we already know.

gunalx 3 hours ago

If I read it right, it uses multiple samples of itself to verify accuracy, but isn't this problematic?

  • viraptor 2 hours ago

    In what way? The panel-of-experts approach has been a thing for a while now, and it's documented to improve quality.
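
    The mechanism is easy to sketch: sample several independent verdicts and accept the majority. A toy illustration (the verdict strings are made up; real systems sample a model several times):

```python
from collections import Counter

def panel_vote(verdicts: list[str]) -> str:
    """Return the most common verdict among independent samples."""
    winner, _ = Counter(verdicts).most_common(1)[0]
    return winner

# Three of five simulated expert samples agree, so the panel says "valid".
samples = ["valid", "valid", "invalid", "valid", "invalid"]
print(panel_vote(samples))  # prints "valid"
```

    Independent errors tend to be averaged out by the vote; correlated errors across samples are not, which is the usual caveat with self-verification.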

  • zamadatix 2 hours ago

    Problematic in that it's still not formal verification, not problematic as in "it's worse to do this than not".

agentultra 5 hours ago

So it's designed for informal proofs and it "verifies" based on a rubric fitting function and human interaction, is that right?

What's the use case for a system like this?

  • blazespin 3 hours ago

    Advanced math solving, as the results indicate. Informal proof reasoning is advancing faster than formal proof reasoning because the latter is slow and compute intensive.

    I suspect it's also because there isn't a lot of data to train on.

newyankee 4 hours ago

That is amazing, if they can do all of this at < 10% of the cost of the frontier labs. Of course, they work in the shadow of the great work done and shared by the frontier labs, but there is some exceptionally high-speed execution happening behind the scenes. This is clearly a race, but a race where China is happy to be #2 as long as the gap is not significant and the costs are reasonable.