TREND ANALYSIS · STACKER

From Training to Inference: The Component Map Redrawn by Agentic Chips

The Groq 3 LPU, the Feynman architecture, and other SRAM-based inference chips unveiled at GTC 2026 require a fundamentally different component mix from existing training chips. We analyze the key components facing supply shortages and the companies positioned to benefit.

DATE 2026-03-18
TYPE Trend
STYLE Stacker
HORIZON Mid (1-6M)
SIGNAL OW · +50
$571B · Inference TAM 2033E · CAGR 27.4%
500MB · Groq 3 On-chip SRAM · 150 TB/s BW
7.7x · CoWoS Supply/Demand Gap · 1M vs 130K WPM
67% · Inference Share 2026E · Up 2x from 33%

Executive Summary

A structural inflection point has arrived in the AI chip industry. At GTC 2026, NVIDIA unveiled its first dedicated inference chip, the Groq 3 LPU, rather than another GPU. The chip integrates technology from Groq, which NVIDIA acquired for ~$20 billion. It uses 500MB of on-chip SRAM instead of HBM, provides 150TB/s of bandwidth, and is claimed to achieve 35 times higher inference throughput than Blackwell NVL72. FACT

Key Finding: The BOM (Bill of Materials) of an inference chip is fundamentally different from that of a training chip. SRAM accounts for 70-80% of the die area, and direct chip-to-chip interconnects (96 C2C links, 112Gbps each) are used instead of HBM. This structural change creates new shortages at points in the supply chain that are entirely different from the bottlenecks for existing training chips. INFERENCE
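As a rough check on the interconnect figures above, the aggregate chip-to-chip bandwidth follows from the link count and the per-link rate. The sketch below assumes that the 96 links are per chip and that 112Gbps is a raw per-link, per-direction rate with no encoding or protocol overhead; the source specifies neither, and the function name is illustrative.

```python
def aggregate_c2c_bandwidth(links: int, gbps_per_link: float) -> tuple[float, float]:
    """Return (Tbps, TB/s) for `links` links at `gbps_per_link` each.

    Assumption: raw line rate, no encoding or protocol overhead.
    """
    total_gbps = links * gbps_per_link
    tbps = total_gbps / 1_000   # gigabits -> terabits
    tb_per_s = tbps / 8         # bits -> bytes
    return tbps, tb_per_s


tbps, tb_per_s = aggregate_c2c_bandwidth(links=96, gbps_per_link=112)
print(f"{tbps:.3f} Tbps = {tb_per_s:.3f} TB/s")
```

Under these assumptions the links total about 10.75 Tbps (roughly 1.34 TB/s), around two orders of magnitude below the 150 TB/s quoted for on-chip SRAM.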

The Feynman architecture, scheduled for release in 2028, introduces the TSMC A16 process, 3D die stacking (SRAM-over-compute), and silicon photonics, and uses Intel EMIB packaging. Complete hardware separation of training and inference is becoming an industry standard. FACT

◆ INVESTMENT THESIS
"Own the Bottleneck" is the key strategy in the training→inference transition. Shortage severity by component: CoWoS packaging (CRITICAL) > SRAM die area (CRITICAL) > HBM4 (HIGH) > T-glass substrate (HIGH) > MLCC (MODERATE-HIGH). Companies with pricing power over the most severe bottlenecks (TSMC, SK Hynix, Samsung Electronics, and Broadcom) stand to benefit structurally.

Training vs Inference — Major Architectural Shift

For the first time in the history of AI chips, training and inference are separated into physically different hardware. Training chips (GPUs) are optimized for massively parallel matrix multiplication, while inference chips (LPUs) focus on ultra-low latency for sequential token generation. This structural difference fundamentally changes the BOM configuration. FACT

TRAINING CHIP (Blackwell/Rubin GPU)
  • Memory: External HBM4 stack (22 TB/s)
  • SRAM: 200-400MB (for cache)
  • Die: GPU logic + HBM chiplet
  • Packaging: CoWoS 2.5D large interposer (1,700mm²)
  • Scheduling: Runtime dynamic scheduling
  • Key Indicator: TFLOPS (computing power)
  • Precision: BF16/FP8
  • Bottleneck: HBM bandwidth
INFERENCE CHIP (Groq 3 LPU / Agentic Chip)
  • Memory: On-chip SRAM 500MB (150 TB/s) FACT
  • SRAM: Occupies 70-80% of die area
  • Die: SRAM-oriented + Compute logic
  • Packaging: HBM stacking unnecessary, C2C direct connection
  • Scheduling: Compiler static scheduling (deterministic)
  • Key Indicator: Tokens/sec, tail latency
  • Precision: INT4/INT8 quantization
  • Bottleneck: KV cache management + decode latency
CHART · BOM SHIFT
Changes in demand by component when transitioning from training chip → inference chip
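The bottleneck rows in the comparison above can be made concrete with a common back-of-envelope model: in single-stream decode, each generated token reads the active weights once, so memory bandwidth sets an upper bound on tokens per second. The sketch below applies that bound to the two bandwidth figures in the comparison. The weight footprint is a hypothetical value chosen for illustration. The model ignores KV cache traffic, batching, and compute limits, and it ignores the fact that a single LPU's 500MB of SRAM holds only a shard of a large model.

```python
def decode_tokens_per_sec_bound(bandwidth_tb_s: float, weight_bytes: float) -> float:
    """Upper bound on single-stream decode rate if each token streams all weights once."""
    return bandwidth_tb_s * 1e12 / weight_bytes


# Hypothetical footprint: 70e9 parameters at INT8 (1 byte each) = 70 GB of weights.
WEIGHT_BYTES = 70e9

hbm_bound = decode_tokens_per_sec_bound(22.0, WEIGHT_BYTES)    # external HBM4 figure
sram_bound = decode_tokens_per_sec_bound(150.0, WEIGHT_BYTES)  # on-chip SRAM figure

print(f"HBM-bound:  {hbm_bound:,.0f} tokens/s")
print(f"SRAM-bound: {sram_bound:,.0f} tokens/s")
print(f"Ratio:      {sram_bound / hbm_bound:.1f}x")
```

The ratio (about 6.8x) depends only on the two bandwidth figures, not on the assumed model size, which is why the list names decode latency rather than raw TFLOPS as the key constraint for inference.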

GTC 2026 — Groq 3 LPU & Feynman Architecture

Groq 3 LPU — NVIDIA's First Dedicated Inference Chip

At the GTC 2026 keynote, Jensen Huang set aside NVIDIA's long-standing "one GPU handles everything" philosophy for the first time and unveiled dedicated inference hardware, the Groq 3 LPU. It is the first product of the integration that followed the ~$20 billion acquisition of Groq in December 2025. FACT

Groq 3 LPX Rack Specifications: 256 LPUs, 128GB of on-chip SRAM in total, and 40PB/s of rack-level bandwidth. The 32 compute trays (8 LPUs each) are directly connected by a copper spine. NVIDIA claims 35 times higher throughput than Blackwell NVL72 and 1,500 tokens/sec on trillion-parameter models. FACT
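The rack figures can be cross-checked against the per-chip numbers quoted earlier (500MB of SRAM and 150 TB/s per LPU). The sketch below assumes rack totals are simple sums over the LPUs and uses decimal units; the constant names are illustrative.

```python
LPUS_PER_TRAY = 8
TRAYS_PER_RACK = 32
SRAM_PER_LPU_MB = 500    # per-chip on-chip SRAM
BW_PER_LPU_TB_S = 150    # per-chip SRAM bandwidth

lpus = LPUS_PER_TRAY * TRAYS_PER_RACK           # LPUs per rack
rack_sram_gb = lpus * SRAM_PER_LPU_MB / 1_000   # MB -> GB (decimal)
rack_bw_pb_s = lpus * BW_PER_LPU_TB_S / 1_000   # TB/s -> PB/s

print(lpus, rack_sram_gb, rack_bw_pb_s)
```

The LPU count (256) and total SRAM (128GB) match the stated specifications exactly. The summed bandwidth comes to 38.4 PB/s, so the quoted 40PB/s appears to be a rounded figure.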

Samsung Electronics is mass-producing the Groq 3 LPU on its 4nm (SF4X) process. With a die size of 700mm² or more, the number of chips obtained per wafer is very low (approximately 64). The goal is to raise wafer shipments by ~70%, from 9,000 to 15,000 wafers per year. FACT
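The chips-per-wafer figure can be sanity-checked with a widely used gross-die approximation, which subtracts an edge-loss term from the ratio of wafer area to die area. The sketch below assumes a 300mm wafer and ignores scribe lines, edge exclusion, and defect yield. None of these details are given in the source, so treat the result as an estimate of gross candidates only.

```python
import math


def gross_dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> int:
    """Approximate gross dies per wafer: area ratio minus an edge-loss term."""
    radius = wafer_diameter_mm / 2
    area_term = math.pi * radius**2 / die_area_mm2
    edge_term = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return math.floor(area_term - edge_term)


for area in (700, 800):
    print(f"{area} mm^2 -> {gross_dies_per_wafer(area)} gross dies")

# Planned wafer increase stated in the text: 9,000 -> 15,000 per year.
growth = (15_000 - 9_000) / 9_000
print(f"wafer growth: {growth:.1%}")
```

With these assumptions a 700mm² die gives about 75 gross candidates and an 800mm² die about 64, so the ~64 chips per wafer is consistent with a die somewhat larger than 700mm² before any defect losses. The planned increase works out to about 67%, which the text rounds to ~70%.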

Feynman Architecture — 2028 Inference Native Platform

At GTC 2026, NVIDIA previewed Feynman, the next-generation architecture after Vera Rubin. It includes the TSMC A16 (1.6nm) process, silicon photonics (optical NVLink), 3D die stacking (SRAM-over-compute), a custom Rosa CPU, and BlueField 5. NVIDIA claims 14x the performance of Blackwell. FACT

Intel supplies EMIB (Embedded Multi-die Interconnect Bridge) packaging technology for Feynman. The combination of TSMC A16 and Intel EMIB is the industry's first cross-foundry advanced packaging collaboration. FACT

◆ VERA RUBIN + GROQ 3 = INFERENCE DISAGGREGATION
Within the Vera Rubin platform, the LPX uses the Attention-FFN Disaggregation (AFD) architecture. The Rubin GPU handles KV cache-based attention, while the LPX accelerates the feed-forward (FFN) and MoE layers. This structural innovation goes beyond physically separating training from inference and subdivides workloads within inference itself. FACT
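The split described above can be pictured as a placement table that maps layer types to device classes. The following is a toy sketch only: the `Layer` type, the device labels, and the `place` and `handoffs` helpers are hypothetical and do not represent any NVIDIA API or scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    kind: str  # "attention", "ffn", or "moe"


# Hypothetical placement table following the AFD split described in the text.
PLACEMENT = {"attention": "rubin_gpu", "ffn": "lpx", "moe": "lpx"}


def place(layers: list[Layer]) -> list[tuple[str, str]]:
    """Map each layer to the device class that would run it."""
    return [(layer.name, PLACEMENT[layer.kind]) for layer in layers]


def handoffs(plan: list[tuple[str, str]]) -> int:
    """Count device changes between consecutive layers."""
    return sum(1 for a, b in zip(plan, plan[1:]) if a[1] != b[1])


block = [
    Layer("attn_0", "attention"), Layer("ffn_0", "ffn"),
    Layer("attn_1", "attention"), Layer("moe_1", "moe"),
]
plan = place(block)
print(plan)
print("handoffs:", handoffs(plan))
```

In this toy model every attention-to-FFN boundary is a device change, which suggests why the interconnect between GPU and LPX would matter under such a split.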