AI Benchmark Testing

Models: Claude Sonnet 4.6 | Gemini 2.5 Flash | Microsoft Copilot Smart Mode | ChatGPT

Tiers: the best model available on each free tier (except Claude, purposefully downgraded for comparability)


Results Table

| Model | Q1 | Q2 | Q3 | Total |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 10/10 | 10/10 | 10/10 | 30 |
| Microsoft Copilot Smart Mode | 10/10 | 4.5/10 | 8/10 | 22 |
| ChatGPT | 10/10 | 4/10 | 5.5/10 | 19.5 |
| Gemini 2.5 Flash | 10/10 | 2.5/10 | 7/10 | 19.5 |
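
The Total column can be cross-checked by summing the per-question scores. The snippet below is a minimal sketch of that check; the helper name `check_totals` is an assumption made for illustration, and the numbers are copied as listed in the table. With the listed values, the Copilot row sums to 22.5 rather than the listed 22, which leaves the placements unchanged.

```python
# Per-question scores copied from the results table (each out of 10).
scores = {
    "Claude Sonnet 4.6": (10, 10, 10),
    "Microsoft Copilot Smart Mode": (10, 4.5, 8),
    "ChatGPT": (10, 4, 5.5),
    "Gemini 2.5 Flash": (10, 2.5, 7),
}

# Totals exactly as listed in the table's Total column.
listed_totals = {
    "Claude Sonnet 4.6": 30,
    "Microsoft Copilot Smart Mode": 22,
    "ChatGPT": 19.5,
    "Gemini 2.5 Flash": 19.5,
}


def check_totals(scores, listed):
    """Return {model: (computed_sum, listed_total)} for rows that disagree.

    Hypothetical helper, not part of the original benchmark.
    """
    mismatches = {}
    for model, per_question in scores.items():
        computed = sum(per_question)
        if computed != listed[model]:
            mismatches[model] = (computed, listed[model])
    return mismatches


mismatches = check_totals(scores, listed_totals)
for model, (computed, listed) in mismatches.items():
    print(f"{model}: per-question sum {computed}, listed total {listed}")
```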

Questions

  • Q1: How many lines of code are in the chromium engine?
  • Q2: Make me a text editor website as ONE HTML file which may include html, css, and js which I can just open on my computer and boom it works. Just give me the HTML code.
  • Q3: Best gaming keyboard with RGB.

Answers

  • Q1: 35+ million lines
  • Q2: Human graded
  • Q3: Wooting 80HE should be in the answers, but doesn't have to be.

Standings

| Place | Model | Score |
|---|---|---|
| 🥇 1st | Claude Sonnet 4.6 | 30 |
| 🥈 2nd | Microsoft Copilot Smart Mode | 22 |
| 🥉 3rd (tie) | Gemini 2.5 Flash & ChatGPT | 19.5 |
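
The standings follow from sorting the totals and giving equal scores the same place. A minimal sketch of that tie-aware ranking, assuming the listed totals; `rank_with_ties` is a hypothetical helper and not something the benchmark itself used.

```python
# Totals as listed in the standings.
totals = {
    "Claude Sonnet 4.6": 30,
    "Microsoft Copilot Smart Mode": 22,
    "Gemini 2.5 Flash": 19.5,
    "ChatGPT": 19.5,
}


def rank_with_ties(totals):
    """Return (place, model, score) tuples, highest score first.

    Models with the same score share the place of the first of them.
    """
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    standings = []
    for position, (model, score) in enumerate(ordered, start=1):
        if standings and standings[-1][2] == score:
            place = standings[-1][0]  # same score as previous row: share its place
        else:
            place = position
        standings.append((place, model, score))
    return standings


standings = rank_with_ties(totals)
for place, model, score in standings:
    print(place, model, score)
```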

Notes

  • These are the best models available on each provider's free tier. Claude was purposefully downgraded so that the results would be comparable.
  • All 3 questions were asked the exact same way across every model.
  • Processing speed was not formally measured, but every model responded within roughly 10-15 seconds, except Claude on Q2, which took approximately 3 minutes. That delay is counted as a positive, not a negative, because it reflects planning time for a fully featured text editor.
  • Q2 stress testing included conflicting-filename handling for Claude only (the only model whose editor supported file naming). The editors from all 4 models were also stress tested by increasing the word count until the editor crashed.
  • ChatGPT and Copilot produced near-identical results, but Copilot scored higher.
  • Gemini 2.5 Flash scored lowest on Q2 across all three grading categories, including stability and stress testing.
  • The editors from all models except Claude used a hardcoded or default document name, with no way to rename the document.
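
The word-count stress test described above can be approximated by generating text of increasing size and pasting it into each editor until it stops responding. The snippet below is only a sketch under assumptions: the source does not say how the test text was produced or which sizes were tried, so the helper names, the filler word, and the doubling schedule are all hypothetical.

```python
def make_payload(word_count, word="lorem"):
    """Build a string containing exactly `word_count` space-separated words."""
    return " ".join([word] * word_count)


def doubling_sizes(start, limit):
    """Return word counts that double from `start` up to and including `limit`."""
    sizes = []
    n = start
    while n <= limit:
        sizes.append(n)
        n *= 2
    return sizes


# Assumed schedule: 1,000 words doubling up to 1,024,000 words.
for size in doubling_sizes(1_000, 1_024_000):
    payload = make_payload(size)
    # Write each payload to a file so it can be pasted into the editor by hand.
    with open(f"stress_{size}_words.txt", "w", encoding="utf-8") as handle:
        handle.write(payload)
```

The largest word count an editor handles before crashing would then be recorded by hand for each model.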

Full Document

View on Google Docs