Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models

Ziqian Bi1,2*, Keyu Chen1,3*, Chiung-Yi Tseng1,4*, Danyang Zhang1,5*, Tianyang Wang1,
Hongying Luo1, Lu Chen1, Junming Huang1, Jibin Guan6, Junfeng Hao6, Junhao Song7†
1AI Agent Lab, Vokram Group, UK; 2Purdue University, USA; 3Georgia Tech, USA;
4LuxMuse AI, USA; 5ByteDance Inc, USA; 6University of Minnesota, USA; 7Imperial College London, UK

*Equal contribution; †Corresponding author

Abstract

In August 2025, OpenAI released the GPT-OSS models, its first open-weight large language models since GPT-2 in 2019, comprising two mixture-of-experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open source large language models, ranging from 14.7B to 235B parameters and representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantised form under standardised inference settings, with statistical validation using McNemar's test and effect size analysis. Results show that gpt-oss-20B consistently outperforms gpt-oss-120B on several benchmarks, such as HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid-tier overall performance within the current open source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks. These findings provide empirical evidence that scaling in sparse architectures may not yield proportional performance gains, underscoring the need for further investigation into optimisation strategies and informing more efficient model selection for future open source deployments.
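
The paired statistical comparison mentioned above can be illustrated with a short sketch. The following Python snippet is a minimal, hypothetical example of McNemar's exact test applied to per-item 0/1 correctness flags from two models on the same benchmark items; the names (mcnemar_exact, results_a, results_b) are illustrative and do not reproduce the paper's actual evaluation harness.

    # Minimal sketch (assumed setup): McNemar's exact test on paired 0/1
    # correctness flags for two models over the same benchmark items.
    from math import comb

    def mcnemar_exact(results_a, results_b):
        # b = items only model A answered correctly; c = items only model B did.
        b = sum(1 for a, y in zip(results_a, results_b) if a == 1 and y == 0)
        c = sum(1 for a, y in zip(results_a, results_b) if a == 0 and y == 1)
        n = b + c
        if n == 0:
            return b, c, 1.0  # the models disagree on no items
        # Under H0 the discordant pairs follow Binomial(n, 0.5); two-sided exact p-value.
        k = min(b, c)
        p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
        return b, c, min(p, 1.0)

    # Toy usage with made-up correctness vectors for ten items.
    print(mcnemar_exact([1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
                        [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]))
    # -> (3, 0, 0.25): model A uniquely correct on 3 items, model B on none

Reporting b and c alongside the p-value also gives a simple effect-size view of how often each model wins on the items where the two disagree.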


BibTeX

@misc{bi2025gptossgoodcomprehensiveevaluation,
  title={Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models},
  author={Ziqian Bi and Keyu Chen and Chiung-Yi Tseng and Danyang Zhang and Tianyang Wang and Hongying Luo and Lu Chen and Junming Huang and Jibin Guan and Junfeng Hao and Junhao Song},
  year={2025},
  eprint={2508.12461},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.12461}
}