FrontierOR: Benchmarking LLMs’ Capacity for Efficient Algorithm Design in Large-Scale Optimization

  • Minwei Kong
  • Chonghe Jiang
  • Ao Qu
  • Wenbin Ouyang
  • Zhaoming Zeng
  • Xiaotong Guo
  • Zhekai Li
  • Junyi Li
  • Yingying Fan
  • Xinshou Zheng
  • Xibin Jing
  • Yikai Zhang
  • Zhiwei Liang
  • Seong-Hee Kim
  • Runqing Yang
  • Zijian Zhou
  • Sirui Li
  • Han Zheng
  • Wangyang Ying
  • Ou Zheng
  • Chong Wang
  • Jing Zhao
  • Hanzhang Qin
  • Cathy Wu
  • Paul Liang
  • Jinhua Zhao
  • Hai Wang

arXiv

Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models in both one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi on both solution quality and computational efficiency in only 31% of cases, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, enabling future LLMs and agents to be systematically tested on whether they can move beyond correct formulations toward feasible, high-quality, and efficient algorithms. Code and data are publicly released at https://github.com/Minw913/FrontierOR.
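
The headline comparison above counts a case only when a model's algorithm beats the solver baseline on both solution quality and computational efficiency. The abstract does not give the exact scoring rule, so the following is only a minimal sketch of how such a joint win rate could be computed. The names `Run`, `beats_baseline`, and `win_rate`, the minimization convention, and the tolerance `tol` are all assumptions made for illustration and are not taken from the FrontierOR code.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One solve attempt on one instance (hypothetical record layout)."""
    objective: float  # objective value; minimization is assumed here
    runtime: float    # wall-clock seconds
    feasible: bool    # whether the returned solution passed feasibility checks


def beats_baseline(candidate: Run, baseline: Run, tol: float = 1e-6) -> bool:
    """Return True only if the candidate is feasible, no worse in objective
    (within an assumed absolute tolerance), and strictly faster."""
    if not candidate.feasible:
        return False
    no_worse_quality = candidate.objective <= baseline.objective + tol
    faster = candidate.runtime < baseline.runtime
    return no_worse_quality and faster


def win_rate(pairs: list[tuple[Run, Run]]) -> float:
    """Fraction of (candidate, baseline) pairs in which the candidate wins
    on both criteria at once."""
    if not pairs:
        return 0.0
    wins = sum(beats_baseline(cand, base) for cand, base in pairs)
    return wins / len(pairs)
```

Requiring both conditions at once is what makes such a score strict: a heuristic that is fast but returns a worse objective, or an exact method that matches the objective but runs longer than the baseline, would not count as a win under this sketch.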