OpenAI Unveils New Open-Weight AI Models for Content Moderation

Introducing gpt-oss-safeguard-120b and -20b: Customizable AI for policy-driven content labeling.

OpenAI has released two new open-weight reasoning models, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, designed to classify content against provided policies while offering customization and detailed reasoning.

By Sarah Kline

November 2, 2025

4 min read

Key Facts

  • OpenAI released gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, open-weight reasoning models.
  • These models are designed for content classification based on provided policies.
  • They are available under the Apache 2.0 license and OpenAI's gpt-oss usage policy.
  • The models offer full chain-of-thought (CoT), variable reasoning efforts, and Structured Outputs.
  • Safety evaluations were conducted even for unintended chat settings due to their open-weight nature.

Why You Care

Ever worry about navigating the complex world of online content moderation? What if you could direct an AI to understand your specific rules for what’s acceptable? OpenAI has just made a significant move in this direction. They’ve announced two new open-weight AI models, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b. These tools are built to help you label content according to your own guidelines. This release could make your content management tasks much more efficient and precise.

What Actually Happened

OpenAI has released gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, as detailed in the technical report. These are open-weight reasoning models, meaning their trained weights are openly available for inspection and use. They were developed by post-training existing gpt-oss models. Their primary function is to reason from a given policy and then label content accordingly. These text-only models are compatible with OpenAI’s Responses API. What’s more, they are available under the Apache 2.0 license and OpenAI’s gpt-oss usage policy, according to the announcement.
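To make the policy-driven workflow concrete, here is a minimal sketch of what a classification call might look like. It assumes the 20b model is self-hosted behind an OpenAI-compatible endpoint (for example via vLLM) and uses the widely supported Chat Completions-style interface rather than the Responses API mentioned above; the base URL, served model name, and policy text are all illustrative, not taken from the release.

```python
from openai import OpenAI

# Assumed local endpoint; point this at wherever you serve the model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# A hypothetical policy; in practice this would be your own guidelines.
POLICY = """\
Label the user's content as VIOLATING or NON-VIOLATING.
VIOLATING: attacks on a person or group based on a protected
attribute such as race, religion, or gender.
NON-VIOLATING: everything else, including criticism of ideas.
"""

response = client.chat.completions.create(
    model="gpt-oss-safeguard-20b",  # name the model is served under; an assumption
    messages=[
        {"role": "system", "content": POLICY},  # the policy travels with the request
        {"role": "user", "content": "Comment to classify goes here."},
    ],
)
print(response.choices[0].message.content)  # label plus the model's explanation
```

Because the policy is just part of the request, changing your rules means editing a string, not retraining a model.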

These new models offer several key features. They provide full chain-of-thought (CoT) — a step-by-step explanation of their reasoning. They also support different reasoning efforts, from low to high, and can produce Structured Outputs. The team revealed that these models were developed with valuable feedback from the open-source community. They are specifically recommended for classifying content against a provided policy, not for direct end-user interaction.
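Building on the sketch above, the snippet below illustrates how Structured Outputs and a reasoning-effort setting might be requested. Whether a given serving stack honors the OpenAI-style response_format and reasoning_effort parameters varies, so treat both as assumptions to verify against your deployment; the schema itself is hypothetical.

```python
# Requires `client` and POLICY from the previous sketch.
schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "moderation_label",
        "schema": {
            "type": "object",
            "properties": {
                "violates_policy": {"type": "boolean"},
                "rule_cited": {"type": "string"},
            },
            "required": ["violates_policy", "rule_cited"],
            "additionalProperties": False,
        },
    },
}

response = client.chat.completions.create(
    model="gpt-oss-safeguard-20b",
    messages=[
        {"role": "system", "content": POLICY},
        {"role": "user", "content": "Another comment to classify."},
    ],
    response_format=schema,   # Structured Outputs: assumes server support
    reasoning_effort="high",  # low/medium/high; also assumes server support
)
print(response.choices[0].message.content)
# e.g. {"violates_policy": true, "rule_cited": "protected-attribute attack"}
```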

Why This Matters to You

These new gpt-oss-safeguard models offer significant advantages if you manage any form of online content. Imagine you run a social media platform or a forum. You need to enforce specific community guidelines. These models can help you automate the process of identifying content that violates those rules. They allow for highly customizable content classification, according to the technical report.

Consider this scenario: You have a strict policy against hate speech. You can feed this policy directly to the gpt-oss-safeguard model. The AI will then use its reasoning capabilities to flag content that breaches your specific definition. This level of control and transparency is a major step forward. How might custom AI moderation improve your online community’s health?

The models’ ability to provide a “full chain-of-thought” is particularly useful. This means you don’t just get a ‘yes’ or ‘no’ answer. Instead, the AI explains why it made a particular classification. This transparency helps you understand and refine your policies. The technical report states, “They are customizable, provide full chain-of-thought (CoT), can be used with different reasoning efforts (low, medium, high), and support Structured Outputs.”
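As an illustration, here is how that separated reasoning might be read back, assuming a server that returns the chain-of-thought alongside the final answer (vLLM, for instance, can expose it as a reasoning_content field when a reasoning parser is enabled; field names vary by stack, so this is an assumption to check).

```python
# Continues from the earlier sketches.
msg = response.choices[0].message
chain_of_thought = getattr(msg, "reasoning_content", None)  # field name varies by server
final_answer = msg.content

if chain_of_thought:
    print("Why:", chain_of_thought)  # the step-by-step policy reasoning
print("Label:", final_answer)        # the classification itself
```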

Here’s a quick look at some key features:

Feature             Benefit for You
Open-Weight         Transparency and customizability
Policy-Driven       Enforce your specific content rules
Chain-of-Thought    Understand the AI’s reasoning, not just its output
Structured Output   Easier integration into existing systems

The Surprising Finding

What’s particularly interesting is how OpenAI evaluated these models. While they are designed for content classification, the team also evaluated their safety in chat settings. This might seem counterintuitive since the gpt-oss-safeguard models are “not intended for this use,” as mentioned in the release. However, because they are open models, someone could potentially use them in chat applications. Therefore, OpenAI wanted to ensure they met safety standards even in unintended scenarios. This proactive approach to potential misuse is a notable aspect of their evaluation strategy.

This finding challenges the assumption that models are only evaluated for their intended use cases. It highlights OpenAI’s commitment to broader safety considerations. They acknowledged the possibility of off-label use and took steps to address it. What’s more, the paper states that the models were trained without additional biological or cybersecurity data, meaning previous worst-case scenario estimates for gpt-oss models still apply.

What Happens Next

These gpt-oss-safeguard models are available now under the Apache 2.0 license, so developers and content creators can begin integrating them into their systems immediately. We can expect to see these models deployed in various content moderation tools over the next few months. For example, a video platform might use gpt-oss-safeguard-120b to automatically filter comments based on its user guidelines, significantly reducing manual review times.
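As a sketch of that comment-filtering scenario, the loop below reuses the client, policy, and Structured Outputs schema from the earlier examples; the helper name, served model name, and placeholder comments are all hypothetical.

```python
import json

def is_violation(comment: str) -> bool:
    """Classify one comment against the policy; True means it violates."""
    response = client.chat.completions.create(
        model="gpt-oss-safeguard-120b",  # served model name; an assumption
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": comment},
        ],
        response_format=schema,  # the Structured Outputs schema from earlier
    )
    return json.loads(response.choices[0].message.content)["violates_policy"]

incoming = ["great upload!", "an abusive remark about a group"]  # placeholders
held_for_review = [c for c in incoming if is_violation(c)]
```

Parsing a boolean from a structured response, rather than matching strings in free-form text, keeps the moderation pipeline robust as the model’s explanations vary.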

If you’re a developer, consider exploring the customization options these models offer. You can supply your specific policy documents at inference time to achieve highly accurate content labeling, and the open weights also allow further fine-tuning if needed. The industry implications are vast, suggesting a future where AI-powered content moderation is more transparent and tailored. The team revealed that an initial evaluation of multi-language performance in chat settings was also conducted. This suggests future enhancements could include more multilingual content classification capabilities. Keep an eye out for updates on their performance in diverse linguistic contexts.
