FrontierOR: Benchmarking LLMs’ Capacity for Efficient Algorithm Design in Large-Scale Optimization

Minwei Kong; Chonghe Jiang; Ao Qu; Wenbin Ouyang; Zhaoming Zeng; Xiaotong Guo; Zhekai Li; Junyi Li; Yingying Fan; Xinshou Zheng; Xibin Jing; Yikai Zhang; Zhiwei Liang; Seong-Hee Kim; Runqing Yang; Zijian Zhou; Sirui Li; Han Zheng; Wangyang Ying; Ou Zheng; Chong Wang; Jing Zhao; Hanzhang Qin; Cathy Wu; Paul Liang; Jinhua Zhao; Hai Wang

May 2026

arXiv

Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models in both one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi on both solution quality and computational efficiency in only 31% of cases, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, enabling future LLMs and agents to be systematically tested on whether they can move beyond correct formulations toward feasible, high-quality, and efficient algorithms. Code and data are publicly released at https://github.com/Minw913/FrontierOR.