Beyond Metrics: Why Traditional AI Benchmarks Fail Humans—and How W...