Each LLM is given the same 1000 chess puzzles to solve. See puzzles.csv. Benchmarked on Mar 25, 2024.

| Model | Solved | Solved % | Illegal Moves | Illegal Moves % | Adjusted Elo |
|---|---|---|---|---|---|
| gpt-4-turbo-preview | 229 | 22.9% | 163 | 16.3% | 1144 |
| gpt-4 | 195 | 19.5% | 183 | 18.3% | 1047 |
| claude-3-opus-20240229 | 72 | 7.2% | 464 | 46.4% | 521 |
| claude-3-haiku-20240307 | 38 | 3.8% | 590 | 59.0% | 363 |
| claude-3-sonnet-20240229 | 23 | 2.3% | 663 | 66.3% | 286 |
| gpt-3.5-turbo | 23 | 2.3% | 683 | 68.3% | 269 |
| claude-instant-1.2 | 10 | 1.0% | 707 | 70.7% | 245 |
| mistral-large-latest | 4 | 0.4% | 813 | 81.3% | 149 |
| mixtral-8x7b | 9 | 0.9% | 832 | 83.2% | 136 |
| gemini-1.5-pro-latest* | FAIL | - | - | - | - |
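
For anyone checking the numbers: the two percentage columns appear to be just the raw counts divided by the 1000 puzzles. A minimal sketch of that arithmetic, assuming nothing beyond the counts in the table (the helper name `rate` is made up for illustration, and the Adjusted Elo column is left out because the post does not say how it is computed):

```python
TOTAL_PUZZLES = 1000  # every model is given the same puzzle set


def rate(count: int, total: int = TOTAL_PUZZLES) -> str:
    """Format a count as a percentage of the puzzle set, one decimal place."""
    return f"{100 * count / total:.1f}%"


# (solved, illegal moves) counts copied from the table
counts = {
    "gpt-4-turbo-preview": (229, 163),
    "gpt-4": (195, 183),
    "claude-3-opus-20240229": (72, 464),
}

for model, (solved, illegal) in counts.items():
    print(f"{model}: solved {rate(solved)}, illegal moves {rate(illegal)}")
```

The same helper can be run against any other row to check that its percentage matches its count.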

Published by the CEO of Kagi!

  • bionicjoey@lemmy.ca
    9 months ago

    If I tried to make an illegal move 20% of the time, would you also say I am good at chess?

      • bionicjoey@lemmy.ca
        9 months ago

        Okay. What if it’s because I’m just recalling a bunch of chess puzzle solutions I’ve seen before and regurgitating the one I think is correct for this particular puzzle, without really understanding the rules of chess?

        • General_Effort@lemmy.world
          9 months ago

          That’s another thing I’m wondering about, as is everyone. I’d still want to know why GPT-4 does so much better than the others.