NVIDIA H100 Build Components

A high-performance enterprise AI system build for training, HPC, and large-scale inference workloads

System Overview

This enterprise-grade build is designed for the most demanding AI training, high-performance computing (HPC), and large-scale inference workloads. Centered around NVIDIA's flagship H100 GPUs, the system delivers exceptional performance with careful attention to scalability, power efficiency, and thermal management.

Performance

3,026 TFLOPS (FP8, sparse) per GPU with 80GB HBM2e memory

Scalability

Supports multi-node clusters over 400Gb/s InfiniBand or Ethernet

Flexibility

PCIe Gen5 version for standard server compatibility

1. GPUs: NVIDIA H100

PCIe Gen5 version for standard server integration

Model: NVIDIA H100 PCIe 80GB
Quantity: 4-8x (depending on chassis)
Key Specs:
  • 80GB HBM2e memory with 2TB/s bandwidth
  • 3,026 TFLOPS (FP8, sparse) / 51 TFLOPS (FP64 Tensor Core)
  • PCIe Gen5 x16 support
  • Transformer Engine for accelerated AI
Why: The PCIe variant offers easier integration into standard servers than SXM5 modules (a quick detection check is sketched below).
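
The sketch below is a minimal Python check that the installed cards are visible and report the expected memory. It assumes the NVIDIA data center driver and a CUDA-enabled PyTorch build are already in place; the expected GPU count and memory floor are assumptions to adjust for your chassis.

# Minimal visibility check for the H100 PCIe cards. Assumes the NVIDIA
# driver and a CUDA-enabled PyTorch build are installed; adjust the
# expected count (4-8x depending on chassis) and memory floor as needed.
import torch

EXPECTED_GPUS = 4     # assumption: 4-GPU configuration
MIN_MEMORY_GIB = 79   # 80GB card, minus a small driver reservation

def check_gpus() -> None:
    assert torch.cuda.is_available(), "CUDA not available - check driver install"
    count = torch.cuda.device_count()
    print(f"Detected {count} GPU(s), expected {EXPECTED_GPUS}")
    for i in range(count):
        props = torch.cuda.get_device_properties(i)
        mem_gib = props.total_memory / 1024**3
        print(f"  GPU {i}: {props.name}, {mem_gib:.0f} GiB, "
              f"compute capability {props.major}.{props.minor}")
        assert mem_gib >= MIN_MEMORY_GIB, f"GPU {i} reports less memory than expected"

if __name__ == "__main__":
    check_gpus()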

2. Server Chassis/Rack

4U form factor for optimal GPU density

Primary Model: Supermicro AS-4125GS-TNRT
Features:
  • 4U rackmount, 8x PCIe Gen5 slots
  • Redundant power supplies
  • Optimized airflow for high-TDP components
  • Tool-less design for easy maintenance
Alternatives:
  • Dell PowerEdge R760xa (2U, up to 4x GPUs)
  • Lenovo ThinkSystem SR670 V2 (NVIDIA-Certified)

3. CPU

Dual-socket configuration for maximum PCIe lanes

Model: Dual AMD EPYC 9654 "Genoa"
Specifications:
  • 96 cores, 192 threads per CPU
  • 128 PCIe Gen5 lanes per CPU
  • 384MB L3 cache per CPU
  • 360W TDP per CPU
Why: Provides ample PCIe Gen5 lanes for multi-GPU connectivity plus NVMe and networking, along with high memory bandwidth (a rough lane budget is sketched below).
Alternative: Intel Xeon Platinum 8490H (60 cores, 120 threads)
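
A back-of-the-envelope lane budget for the 4-GPU configuration, assuming one x16 NIC and four x4 NVMe drives alongside the GPUs; the exact number of usable lanes depends on how the board allocates lanes to inter-socket links and onboard devices.

# Rough PCIe Gen5 lane budget for a 4-GPU node. NIC and NVMe lane counts
# are illustrative assumptions; actual routing depends on the motherboard.
GPU_LANES  = 4 * 16   # 4x H100 PCIe at x16
NIC_LANES  = 1 * 16   # 1x ConnectX-7 400Gb/s NIC at x16
NVME_LANES = 4 * 4    # 4x U.2 NVMe at x4

used = GPU_LANES + NIC_LANES + NVME_LANES

# A dual-socket SP5 platform typically exposes on the order of 128-160
# usable PCIe Gen5 lanes; the remainder carry inter-socket links and
# onboard devices.
usable_low, usable_high = 128, 160

print(f"Lanes used: {used}")
print(f"Estimated headroom: {usable_low - used} to {usable_high - used} lanes")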

4. Motherboard

Foundation for high-performance components

Model: Supermicro H13DSH-T
Key Features:
  • Dual-socket SP5 for AMD EPYC
  • 24x DDR5 DIMM slots (12-channel per CPU, one DIMM per channel)
  • 7x PCIe Gen5 x16 slots
  • 10x NVMe U.2 bays
  • IPMI 2.0 with dedicated LAN
Note: Supports 4x H100 GPUs at full bandwidth with remaining slots for NVMe/network cards.

5. Memory (RAM)

High-capacity DDR5 for data-intensive workloads

Configuration: 1.5TB DDR5 ECC RDIMM
Details:
  • 24x 64GB modules @ 4800 MT/s (DDR5-4800)
  • Registered ECC for data integrity
  • 12-channel memory per CPU
  • Up to ~460GB/s memory bandwidth per CPU (worked out below)
Why ECC: Critical for mission-critical workloads to prevent data corruption.
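
A quick check of the bandwidth figure above, assuming DDR5-4800 across all 12 channels per socket (theoretical peak, not sustained throughput):

# Theoretical peak DDR5 bandwidth: transfer rate x bus width x channels.
transfers_per_sec  = 4800e6   # DDR5-4800 = 4800 MT/s
bytes_per_transfer = 8        # 64-bit channel
channels_per_cpu   = 12       # EPYC 9004 "Genoa"

per_channel = transfers_per_sec * bytes_per_transfer   # 38.4 GB/s
per_socket  = per_channel * channels_per_cpu           # 460.8 GB/s

print(f"Per channel: {per_channel / 1e9:.1f} GB/s")
print(f"Per socket:  {per_socket / 1e9:.1f} GB/s")
print(f"Dual socket: {2 * per_socket / 1e9:.1f} GB/s aggregate")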

6. Storage

Tiered storage for performance and capacity

Primary (NVMe):

  • 4x Samsung PM1743 3.84TB NVMe SSDs
  • PCIe Gen5 x4 (up to 13,000 MB/s read)
  • DWPD: 1.0 (7.0PB endurance per drive)

Secondary (HDD):

  • 8x Seagate Exos X20 20TB HDDs
  • 7200 RPM, 256MB cache
  • Configured in RAID 10 for bulk storage (usable capacity worked out below)
Controller: Broadcom 9600-16i SAS/SATA RAID card
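
A rough usable-capacity tally for the two tiers, under the assumption that the NVMe drives serve as unprotected scratch/working space while the HDD tier runs RAID 10:

# Usable capacity per tier. RAID 10 (mirror + stripe) halves raw capacity;
# the NVMe scratch tier is assumed to run without redundancy.
nvme_raw = 4 * 3.84   # TB, Samsung PM1743
hdd_raw  = 8 * 20.0   # TB, Seagate Exos X20

nvme_usable = nvme_raw
hdd_usable  = hdd_raw / 2

print(f"NVMe tier: {nvme_raw:.2f} TB raw, ~{nvme_usable:.2f} TB usable")
print(f"HDD tier:  {hdd_raw:.0f} TB raw, ~{hdd_usable:.0f} TB usable (RAID 10)")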

7. Power Supply (PSU)

High-efficiency redundant power

Model: Dual 3000W Titanium PSUs
Features:
  • 80+ Titanium efficiency (96% at 50% load)
  • Hot-swappable redundant configuration
  • 200-240V input required
Power Calculation: 4x H100 (350W each) = 1,400W + dual CPUs (720W) + RAM/storage/fans (~300W) ≈ 2,420W peak, which a single 3000W PSU can still carry if its partner fails (budget sketched below).
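
The same budget as a short script, including the load that would land on one PSU if its partner failed; the per-component figures are the nominal values above, not measured draw.

# Peak power budget using the nominal figures from this build.
# Real draw varies with workload, fan speed, and PSU efficiency.
gpus = 4 * 350             # H100 PCIe, 350W each
cpus = 2 * 360             # EPYC 9654, 360W TDP each
ram_storage_misc = 300     # RAM, NVMe, HDDs, fans (rough allowance)

peak_w = gpus + cpus + ram_storage_misc
psu_w  = 3000              # each of the two redundant PSUs

print(f"Estimated peak draw: {peak_w} W")                             # ~2,420 W
print(f"Load on one PSU if its partner fails: {peak_w / psu_w:.0%}")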

8. Cooling

Optimized for high thermal loads

System Cooling: Active cooling with redundant fans (chassis default)
Advanced Option: Direct liquid cooling (DLC) for dense deployments
Requirements: Data center environment with adequate airflow or liquid cooling infrastructure.

9. Networking

High-speed interconnects for multi-node clusters

Primary NIC:

NVIDIA ConnectX-7 Dual-Port 400Gb/s (NDR InfiniBand / 400GbE)

  • InfiniBand or Ethernet mode
  • RDMA support for low-latency transfers
  • GPUDirect RDMA for direct GPU-to-NIC transfers between nodes

Switch (for clusters):

NVIDIA Quantum-2 InfiniBand Switch

  • 64x 400Gb/s ports
  • 51.2Tb/s aggregate bandwidth
  • Sub-600ns latency
Note: For smaller setups, 100GbE networking may be sufficient but will limit multi-node scaling; the NCCL check sketched below can exercise whichever fabric is installed.
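
A minimal all-reduce check using PyTorch's NCCL backend; within one node it runs over PCIe, and across nodes it exercises the ConnectX-7 fabric (and GPUDirect RDMA where enabled). The torchrun launch line and tensor size are illustrative, and NVIDIA's nccl-tests suite remains the usual tool for serious fabric benchmarking.

# Minimal NCCL all-reduce timing check. Launch on one node with e.g.:
#   torchrun --nproc_per_node=4 <this_script>.py
# (add --nnodes/--rdzv options to span multiple nodes over the fabric).
import os
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 256 MiB of float32 per rank - large enough to reflect link bandwidth.
    tensor = torch.ones(64 * 1024 * 1024, device="cuda")
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    dist.all_reduce(tensor)
    end.record()
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        ms = start.elapsed_time(end)
        print(f"all_reduce of {tensor.numel() * 4 / 2**20:.0f} MiB took {ms:.2f} ms")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()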

10. Software Stack

Optimized for AI and HPC workloads

Base System:

  • OS: Ubuntu 22.04 LTS or RHEL 9
  • GPU Drivers: NVIDIA Data Center GPU Driver (v550+)
  • Container Runtime: Docker CE or Podman

AI Stack:

  • CUDA 12.2 + cuDNN 8.9 + NCCL 2.18
  • PyTorch/TensorFlow with H100 (Hopper) optimizations (a minimal bf16 training check is sketched at the end of this section)
  • NVIDIA Triton Inference Server
Orchestration: Kubernetes with NVIDIA GPU Operator for containerized workloads
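
A minimal bf16 mixed-precision training step to confirm the stack is wired up end to end; the model and data are placeholders, and FP8 via NVIDIA's Transformer Engine library would be a further optimization not shown here.

# Single-GPU bf16 sanity check for the AI stack. The toy model and random
# data are placeholders; no gradient scaler is needed with bf16 (unlike fp16).
import torch
import torch.nn as nn

device = "cuda"
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# TF32 matmuls are another inexpensive win on Hopper-class GPUs.
torch.backends.cuda.matmul.allow_tf32 = True

x = torch.randn(32, 4096, device=device)
target = torch.randn(32, 4096, device=device)

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    optimizer.step()
    if step % 5 == 0:
        print(f"step {step}: loss {loss.item():.4f}")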

Estimated System Cost

Pricing as of mid-2025 (market dependent)

Component                   Quantity   Unit Price   Total
NVIDIA H100 80GB PCIe       4          $30,000      $120,000
Supermicro AS-4125GS-TNRT   1          $8,000       $8,000
AMD EPYC 9654 CPU           2          $5,000       $10,000
1.5TB DDR5 ECC RDIMM        1          $12,000      $12,000
Storage (NVMe + HDD)        1          $15,000      $15,000
Estimated Total                                     ~$165,000

Note: Prices vary by vendor, region, and market conditions. Networking switches and additional infrastructure not included.

Key Considerations

Important factors when planning your H100 deployment

Scalability

  • For large LLMs, consider multi-node clusters (NVLink/NVSwitch within HGX nodes, InfiniBand between nodes)
  • PCIe version limits GPU-to-GPU bandwidth compared to SXM
  • Plan for future expansion with additional nodes

Power Requirements

  • Requires 200-240V circuits (e.g., 30A); standard 120V/15A outlets won't suffice
  • Data center environment strongly recommended
  • Consider power redundancy for mission-critical workloads

Thermal Management

  • 4x H100s generate ~1,400W heat (plus CPUs)
  • Dedicated cooling required (inlet air 25°C or below recommended; cooling load converted below)
  • Liquid cooling options for dense deployments
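
Since essentially all electrical draw ends up as heat, the cooling load can be sized from the ~2,420W peak estimate in the power section; a quick conversion:

# Cooling load per node, reusing the ~2,420W peak power estimate above.
peak_w = 2420

btu_per_hr = peak_w * 3.412            # 1 W = 3.412 BTU/hr
tons_of_cooling = btu_per_hr / 12000   # 1 ton of refrigeration = 12,000 BTU/hr

print(f"Cooling load: ~{btu_per_hr:,.0f} BTU/hr (~{tons_of_cooling:.2f} tons) per node")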

Cloud Alternatives

  • AWS EC2 P5 instances (8x H100 per instance)
  • Azure ND H100 v5-series VMs
  • Google Cloud A3 VMs with H100

Final Notes

Implementation recommendations

This build is optimized for enterprise AI training, HPC, or large-scale inference workloads. Key recommendations:

  • For smaller setups: Reduce to 2x H100 GPUs and scale down CPU/RAM accordingly
  • For maximum performance: Consider NVIDIA HGX H100 systems with SXM modules and NVLink
  • Vendor support: Engage with Dell, Supermicro, or Lenovo for pre-configured, supported solutions
  • Implementation: Work with certified NVIDIA partners for optimal deployment

The PCIe version offers the best balance of performance and flexibility for standard server deployments, while the HGX platform (with SXM modules) provides higher performance for specialized installations.
