codevoid.de/1/hn/comments_48503014.gph

        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   Maxproof
       
       
        rapsacnz wrote 22 hours 25 min ago:
        How about "Proofmaxxing"
       
        korbonits wrote 1 day ago:
        Proves the need for more formal verification :)
       
        minimaxir wrote 1 day ago:
        not a good day to be named Max
       
        thierrydamiba wrote 1 day ago:
        Is the harness more valuable than the weights?
       
        pfannl wrote 1 day ago:
        The real AGI test is apparently not solving the IMO, but getting caught
        in the same scoring traffic jam as 46 teenagers.
       
        daquisu wrote 1 day ago:
        "I thought it was interesting and a bit underappreciated that the
        fraction of gold medalists at the 2025 IMO (72/630 = 11.4%) is the
        highest itâs been since 1981.
        
        Crudely, IMO gold medals are awarded to the highest-scoring 1/12 of
        contestants.1 However, because scores are integers up to 42 and
        thereâs no provision for tiebreaking, itâs possible for a lot of
        contestants to be tied around the threshold. In that case, either all
        of them get a gold medal or none do, and the fraction of gold medalists
        might deviate substantially from 1/12. Thatâs what happened this
        year: 46 contestants all won a gold medal by scoring exactly 35 points.
        
        In fact, bizarrely, 35 is the mode of the scores this year; the last
        time the modal score was a gold medal score was in 1994. And, of
        course, 35 is the same score claimed by AI systems from Google, OpenAI,
        and others."
        
        From
        
  HTML  [1]: https://blog.vero.site/post/imo-2025
       
          dooglius wrote 1 day ago:
          This is not bizarre, it's a reflection of how the IMO is scored: 6
          questions with scores from 0-7 but partial credit is rare. It's
          really a score of 5/6.
       
            gus_massa wrote 1 day ago:
            > Partial credit is rare
            
            It's not rare at all. I can't find the 2026 results, but here are
            the 2025 ones. [1] The top of the table is full of 7 and the bottom
            is full of 0, but in the middle there are a lot of intermediate
            points. It's not uncommon "7 ? 0 7 ? 0" because the 1st and 4th are
            usually the easiest and the 3rd and 6th the hardest. But there are
            a of of other combinations due to stupid mistakes and lucky
            solutions and different personal styles/preferences that make some
            problems easier/harder for each contestant.
            
  HTML      [1]: https://www.imo-official.org/results/individual/year/2025/
       
              dooglius wrote 22 hours 34 min ago:
              You're right, it's less uncommon than I recalled, and thanks for
              the source. But I don't think that 35/42 is suspicious/bizarre,
              and it does look like the 5x7 scores make up the bulk of the 35s.
       
                gus_massa wrote 18 hours 44 min ago:
                I agree. If 6 was too hard and 3 "easy", then it would be
                common to get 777770.
       
          quibono wrote 1 day ago:
          I was under the impression that IMO is conducted in an official
          "exam" capacity, on site and in a very formal setting. So I find it
          hard to believe _direct_ LLM usage would be a factor
          Then again - it very well could be a factor in the training and
          preparation? I imagine "Write me a prep document for the IMO" will
          surface all kinds of interesting things from the training set.
       
            quietbritishjim wrote 1 day ago:
            > And, of course, 35 is the same score claimed by AI systems from
            Google, OpenAI, and others.
            
            This is the part of the quote your6 replying about.
            
            You seemed to take "of course" as an implication that the
            contestants used LLMs, and that's why they got the same score as
            the LLMs.
            
            I took it to mean: since this was the modal score, there seemed to
            be 35 points worth of significantly easier answers (relatively
            speaking) than the remaining points, so it's not a surprise that
            LLMs got the same easier bits right. (Though I doubt all
            contestants got their points on exactly the same answers.)
            
            But it's certainly unclear what exactly the author meant.
       
              daquisu wrote 1 day ago:
              Later in the same blog post, the author says:
              
              > We can also consider the IMO 2025 problems individually. In the
              Epoch AI newsletter, Greg Burnham combines a subjective analysis
              with Evan Chenâs MOHS ratings to argue that the first five
              problems at IMO 2025 were unusually easy and the sixth was
              unusually hard, so itâs not surprising that the first five
              problems were exactly the ones solved by these AIs. Though Iâm
              not sure the MOHS scale is rigorous enough to make sense as the
              x-axis of a bar chart itâs easy to corroborate the high-level
              story with the official IMO statistics. Based on average scores,
              this yearâs Problem 6 was the fourth hardest and its Problem 3
              was by far the easiest of all Problem 3s and 6s since 2000.
              
              In the linked MaxProof paper, in the section "6.3.1. Per-Problem
              Analysis" it shows the same behavior: 7/7 in the first 5
              problems, 0/7 in the last problem.
       
       
   DIR <- back to front page