How to Run OpenAI GPT-OSS AI Locally

The Transformers library by Hugging Face provides a flexible and powerful framework for running large language models both locally and in production environments. In this guide, you’ll learn how to use OpenAI’s gpt-oss-20b and gpt-oss-120b models with Transformers—whether through high-level pipelines for rapid prototyping or low-level generation interfaces for fine-tuned control. We’ll also explore how to serve these models via a local API, structure chat inputs, and scale inference using multi-GPU configurations.

The GPT-OSS series of open-weight models, released by OpenAI, represents a major step toward transparent and self-hosted LLM deployments. Designed to run on local or custom infrastructure, GPT-OSS integrates seamlessly with the Hugging Face Transformers ecosystem. This article outlines what’s possible with GPT-OSS models, including optimized inference paths, deployment strategies, API compatibility, and toolchain integration.

Overview of GPT-OSS Models

gpt-oss-20b

Size: 20 billion parameters
Hardware Requirements: ~16GB VRAM with MXFP4 quantization
Use Case: High-end consumer GPUs like RTX 3090, 4090, or newer
Ideal For: Local development and experimentation

gpt-oss-120b

Size: 120 billion parameters
Hardware Requirements: ≥60GB VRAM or multi-GPU (e.g. 4× A100s, 1× H100)
Use Case: Datacenter-class inference workloads
Ideal For: Enterprises, hosted APIs, research institutions

Both models are MXFP4 quantized by default, which dramatically reduces memory usage and boosts inference speeds. MXFP4 is supported on NVIDIA Hopper and newer (e.g. H100, RTX 50xx).

Deployment Modes Using Transformers

Transformers supports multiple levels of abstraction for working with GPT-OSS models. Your choice depends on the use case: simple prototyping, production serving, or customized generation.

1. High-Level Pipelines

Use pipeline("text-generation") to quickly load and run the model
Automatically handles GPU placement with device_map="auto"
Great for simple input/output interfaces

2. Low-Level Inference with `.generate()`

Gives you full control over generation parameters
Supports chat-style prompting with roles (system, user, assistant)
Best for custom logic, intermediate outputs, and tool integration

3. API Serving with `transformers serve`

Serves your GPT-OSS model over HTTP on localhost:8000
Compatible with OpenAI-style endpoints (e.g. /v1/responses)
Supports streaming and batched completions
Ideal for replacing OpenAI APIs with local inference

Chat Templates and Structured Conversations

GPT-OSS supports OpenAI-style structured messages. Hugging Face provides built-in support for chat formatting via apply_chat_template(). This ensures that roles, prompts, and generation tokens are cleanly aligned.

For more control, the openai-harmony library allows you to:

Explicitly define message roles and structure
Add developer instructions (mapped to system prompts)
Render messages into token IDs for generation
Parse responses back into structured assistant messages

Harmony is particularly useful for tools that require intermediate reasoning steps or tool calling behavior.

Inference at Scale: Multi-GPU and Optimized Kernels

Running gpt-oss-120b requires careful consideration of hardware. Transformers provides utilities to help:

Tensor Parallelism: Automatically splits model layers across GPUs with tp_plan="auto"
Expert Parallelism: More advanced distribution for transformer blocks
Flash Attention: Enables faster inference with custom attention kernels
Accelerate / Torchrun: Easy launch tools for distributed inference

Using these features, gpt-oss-120b can be deployed on machines with multiple GPUs or cloud setups with H100s. This enables low-latency, high-throughput inference for demanding workloads.

Fine-Tuning Possibilities

Though not required for most applications, you can fine-tune GPT-OSS models using the Hugging Face Trainer and Accelerate libraries. This enables:

Instruction tuning for task-specific behavior
Domain adaptation (e.g. legal, technical, medical)
Custom prompt-response formats

Fine-tuning requires significant resources, especially for 120B. Most users will benefit from prompt engineering and chat templating instead.

Watch this video on YouTube.

Learn more about running AI locally with a selection of our previous articles :

Tool Ecosystem Compatibility

GPT-OSS is designed to integrate smoothly with modern LLM development tools:

Hugging Face Transformers: Full support for loading, inference, serving
transformers serve: Drop-in replacement for OpenAI-style APIs
openai-harmony: Structured prompt rendering and parsing
LangChain & LlamaIndex: Compatible with custom LLM wrappers
Cursor / IDE assistants: Works with transformer-based backends
Gradio / Streamlit: Easy to wrap models with visual interfaces

This allows developers to build local-first or hybrid tools that can fully replace cloud-based LLM APIs without compromising on UX or performance.

Summary: Why Use GPT-OSS with Transformers

Freedom to run powerful language models on your own hardware
No vendor lock-in or usage-based billing
Customizable prompting, formatting, and serving options
Fine-tuned control over performance and hardware utilization

Whether you’re building a developer assistant, a local chatbot, or an inference cluster, GPT-OSS with Transformers provides the transparency, control, and performance needed to move beyond proprietary APIs.

Recommended Setup at a Glance

Best for Local Development: gpt-oss-20b + MXFP4 + single RTX 4090
Best for Production Inference: gpt-oss-120b + Flash Attention + multi-H100
Best for API Replacement: transformers serve with chat template or harmony

gpt-oss + Transformers provides an extremely capable, modular, and open-source alternative to proprietary LLM APIs. Whether you’re developing a local assistant, scaling a distributed inference pipeline, or building a developer tool, you can select the model size and deployment strategy that fits your hardware and use case.

With full integration into Hugging Face’s pipeline, generate, and serve interfaces—as well as tools like openai-harmony for structured chat and reasoning—GPT-OSS offers unmatched flexibility for developers looking to take control of their LLM workflows.

By abstracting complexity and embracing open weights, GPT-OSS empowers a new generation of AI applications that are transparent, portable, and free from vendor lock-in.

For code examples and more information, visit the official OpenAI GPT-OSS Transformers Guide.

Source: OpenAI

Filed Under: AI, Guides

Disclosure: Some of our articles include affiliate links. If you buy something through one of these links, Geeky Gadgets may earn an affiliate commission. Learn about our Disclosure Policy.

How to Run OpenAI GPT-OSS AI Locally with Hugging Face Transformers

Overview of GPT-OSS Models

gpt-oss-20b

gpt-oss-120b

Deployment Modes Using Transformers

1. High-Level Pipelines

2. Low-Level Inference with `.generate()`

3. API Serving with `transformers serve`

Chat Templates and Structured Conversations

Inference at Scale: Multi-GPU and Optimized Kernels

Fine-Tuning Possibilities

Tool Ecosystem Compatibility

Summary: Why Use GPT-OSS with Transformers

Recommended Setup at a Glance

About Us

Further Reading

Overview of GPT-OSS Models

gpt-oss-20b

gpt-oss-120b

Deployment Modes Using Transformers

1. High-Level Pipelines

2. Low-Level Inference with .generate()

3. API Serving with transformers serve

Chat Templates and Structured Conversations

Inference at Scale: Multi-GPU and Optimized Kernels

Fine-Tuning Possibilities

Tool Ecosystem Compatibility

Summary: Why Use GPT-OSS with Transformers

Recommended Setup at a Glance

Footer

About Us

Further Reading

2. Low-Level Inference with `.generate()`

3. API Serving with `transformers serve`