Models: Claude Sonnet 4.6 | Gemini 2.5 Flash | Microsoft Copilot Smart Mode | ChatGPT
Tiers: The best model available on each free tier (except Claude, purposefully downgraded for comparability)
| Model | Q1 | Q2 | Q3 | Total |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 10/10 | 10/10 | 10/10 | 30 |
| Microsoft Copilot Smart Mode | 10/10 | 4.5/10 | 8/10 | 22 |
| ChatGPT | 10/10 | 4/10 | 5.5/10 | 19.5 |
| Gemini 2.5 Flash | 10/10 | 2.5/10 | 7/10 | 19.5 |
- Q1: How many lines of code are in the chromium engine?
- Q2: Make me a text editor website as ONE HTML file which may include html, css, and js which I can just open on my computer and boom it works. Just give me the HTML code.
- Q3: Best gaming keyboard with RGB.
- Q1 answer key: 35+ million lines.
- Q2 answer key: Human graded.
- Q3 answer key: Ideally the Wooting 80HE appears among the recommendations, but it is not required.
| Place | Model | Score |
|---|---|---|
| 🥇 1st | Claude Sonnet 4.6 | 30 |
| 🥈 2nd | Microsoft Copilot Smart Mode | 22 |
| 🥉 3rd (tie) | Gemini 2.5 Flash & ChatGPT | 19.5 |
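As a rough illustration of how the placings follow from the totals, here is a minimal Python sketch. It takes the totals exactly as listed above (it does not recompute them from the per-question scores) and assigns shared places to equal scores. The function name `rank_with_ties` and the tie-breaking by name for display order are assumptions made for this example, not part of the original benchmark.

```python
# Totals copied from the results table above (not recomputed here).
totals = {
    "Claude Sonnet 4.6": 30.0,
    "Microsoft Copilot Smart Mode": 22.0,
    "ChatGPT": 19.5,
    "Gemini 2.5 Flash": 19.5,
}


def rank_with_ties(scores):
    """Return (place, name, score) tuples; equal scores share a place."""
    # Highest score first; names break ties only for a stable display order.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    ranking = []
    for index, (name, score) in enumerate(ordered):
        if index > 0 and score == ordered[index - 1][1]:
            place = ranking[-1][0]  # same score as previous entry: share its place
        else:
            place = index + 1  # standard competition ranking (1, 2, 3, 3, 5, ...)
        ranking.append((place, name, score))
    return ranking


for place, name, score in rank_with_ties(totals):
    print(place, name, score)
```

With these inputs, ChatGPT and Gemini 2.5 Flash both land in 3rd place, matching the tie shown in the table.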
- These are the best models available on each free tier. Claude was purposefully downgraded so that the benchmark would be comparable.
- All 3 questions were asked the exact same way across every model.
- Processing speed was not formally measured, but all models responded within roughly 10-15 seconds, except Claude on Q2, which took approximately 3 minutes. This is treated as a positive rather than a negative, since it reflects planning time for a fully featured text editor.
- Q2 stress testing included conflicting-filename handling for Claude only (the only model whose editor supported file naming). All 4 editors were also stress tested by increasing the word count until the editor crashed.
- ChatGPT and Copilot produced near-identical results, but Copilot scored higher.
- Gemini 2.5 Flash scored lowest on Q2 across all three grading categories, including stability and stress testing.
- All editors except Claude's used a hardcoded default document name with no way to rename it.
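The notes above do not say how the stress-test input was produced. One possible way to generate text of a known, increasing word count for pasting into each editor is sketched below in Python. The filler vocabulary, the starting size, the growth factor, and the upper limit are all hypothetical choices for illustration, not the values used in the benchmark.

```python
import itertools

# Hypothetical filler vocabulary; any words would do.
FILLER_WORDS = ["lorem", "ipsum", "dolor", "sit", "amet"]


def make_text(word_count):
    """Build a space-separated string containing exactly word_count words."""
    words = itertools.islice(itertools.cycle(FILLER_WORDS), word_count)
    return " ".join(words)


def stress_sizes(start=1_000, factor=10, limit=1_000_000):
    """Yield increasing word counts: start, start * factor, ... up to limit."""
    size = start
    while size <= limit:
        yield size
        size *= factor


if __name__ == "__main__":
    # Report each test size; the text itself would be pasted into the editor
    # under test, noting the first size at which it becomes unresponsive.
    for size in stress_sizes():
        text = make_text(size)
        print(f"{size} words, {len(text)} characters")
```

Growing the input by a fixed factor keeps the number of manual paste attempts small. A finer step between the last working size and the first failing size would narrow down the actual crash point.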