Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models

Mazda Moayeri; Vidhisha Balachandran; Varun Chandrasekaran; Safoora Yousefi; Thomas Fel; S. Feizi; Besmira Nushi; Neel Joshi; Vibhav Vineet

ICLR 2025 | October 2024

As models grow stronger, evaluations have grown more complex, testing multiple skills within one benchmark and even within a single instance. However, skill-wise performance is obscured when inspecting aggregate accuracy alone, which under-utilizes the rich signal that modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant to any evaluation instance by inspecting model-generated rationales. After validating the relevance of rationale-parsed skills and inferring skills for \(46\)k instances over \(12\) benchmarks, we observe that many skills are common across benchmarks, which allows us to curate hundreds of skill-slices (i.e., sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is \(18\%\) more accurate in "computing molar mass", but \(19\%\) less accurate in "applying constitutional law", despite the overall accuracies of the three models differing by a mere \(0.4\%\). Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill-slice analysis can generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a \(3\%\) accuracy improvement over our \(12\)-dataset corpus. Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
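To make the idea of skill-slices and skill-based routing concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names `slice_accuracy` and `route`, the record format, the placeholder model names, and the choice to score a model by its mean slice accuracy over an instance's skills are all hypothetical, since the abstract does not specify how routing is carried out.

```python
from collections import defaultdict


def slice_accuracy(records):
    """Compute per-model accuracy on each skill-slice.

    `records` is an iterable of (model, skills, correct) tuples, where
    `skills` lists the skills annotated for one instance (hypothetical format).
    Returns {model: {skill: accuracy}}.
    """
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for model, skills, correct in records:
        for skill in skills:
            totals[model][skill] += 1
            hits[model][skill] += int(correct)
    return {
        m: {s: hits[m][s] / totals[m][s] for s in totals[m]}
        for m in totals
    }


def route(skills, acc, default_model):
    """Pick the model with the highest mean slice accuracy on `skills`.

    Assumption: averaging slice accuracies is one simple scoring rule;
    skills a model was never evaluated on are ignored, and if no model
    has any of the skills, `default_model` is returned.
    """
    best_model, best_score = default_model, -1.0
    for model in sorted(acc):  # sorted for deterministic tie-breaking
        known = [acc[model][s] for s in skills if s in acc[model]]
        if not known:
            continue
        score = sum(known) / len(known)
        if score > best_score:
            best_model, best_score = model, score
    return best_model


# Toy evaluation records (made-up outcomes, placeholder model names).
records = [
    ("model_a", ["computing molar mass"], True),
    ("model_a", ["computing molar mass"], True),
    ("model_a", ["applying constitutional law"], False),
    ("model_b", ["computing molar mass"], False),
    ("model_b", ["computing molar mass"], True),
    ("model_b", ["applying constitutional law"], True),
]
acc = slice_accuracy(records)
chemistry_choice = route(["computing molar mass"], acc, "model_a")
law_choice = route(["applying constitutional law"], acc, "model_a")
```

In this toy data the two models have the same overall accuracy on their three records, yet the per-slice view separates them, which is the kind of trade-off the abstract describes.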

Related tools

Skill Slice Insights