Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models

Ziqian Bi1,2*, Keyu Chen1,3*, Chiung-Yi Tseng1,4*, Danyang Zhang1,5*, Tianyang Wang1,
Hongying Luo1, Lu Chen1, Junming Huang1, Jibin Guan6, Junfeng Hao6, Junhao Song7†
1AI Agent Lab, Vokram Group, UK; 2Purdue University, USA; 3Georgia Tech, USA;
4LuxMuse AI, USA; 5ByteDance Inc, USA; 6University of Minnesota, USA; 7Imperial College London, UK

*Equal contribution; †Corresponding author

Abstract

In August 2025, OpenAI released the GPT-OSS models, its first open-weight large language models since GPT-2 in 2019, comprising two mixture-of-experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open source large language models, ranging from 14.7B to 235B parameters and representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantised form under standardised inference settings, with statistical validation using McNemar's test and effect size analysis. Results show that gpt-oss-20B consistently outperforms gpt-oss-120B on several benchmarks, such as HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid-tier overall performance within the current open source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks. These findings provide empirical evidence that scaling in sparse architectures may not yield proportional performance gains, underscoring the need for further investigation into optimisation strategies and informing more efficient model selection for future open source deployments.
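
The paired statistical comparison mentioned above can be illustrated with a short sketch. The following Python snippet is a minimal, hypothetical example of McNemar's exact test applied to per-item 0/1 correctness flags from two models on the same benchmark items; the names (mcnemar_exact, results_a, results_b) are illustrative and do not reproduce the paper's actual evaluation harness.

    # Minimal sketch (assumed setup): McNemar's exact test on paired 0/1
    # correctness flags for two models over the same benchmark items.
    from math import comb

    def mcnemar_exact(results_a, results_b):
        # b = items only model A answered correctly; c = items only model B did.
        b = sum(1 for a, y in zip(results_a, results_b) if a == 1 and y == 0)
        c = sum(1 for a, y in zip(results_a, results_b) if a == 0 and y == 1)
        n = b + c
        if n == 0:
            return b, c, 1.0  # the models disagree on no items
        # Under H0 the discordant pairs follow Binomial(n, 0.5); two-sided exact p-value.
        k = min(b, c)
        p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
        return b, c, min(p, 1.0)

    # Toy usage with made-up correctness vectors for ten items.
    print(mcnemar_exact([1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
                        [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]))
    # -> (3, 0, 0.25): model A uniquely correct on 3 items, model B on none

Reporting b and c alongside the p-value also gives a simple effect-size view of how often each model wins on the items where the two disagree.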


BibTeX

@misc{bi2025gptossgoodcomprehensiveevaluation,
  title={Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models},
  author={Ziqian Bi and Keyu Chen and Chiung-Yi Tseng and Danyang Zhang and Tianyang Wang and Hongying Luo and Lu Chen and Junming Huang and Jibin Guan and Junfeng Hao and Junhao Song},
  year={2025},
  eprint={2508.12461},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.12461}
}