Saturday, January 31, 2026

Checkpointless training on Amazon SageMaker HyperPod: Production-scale training with faster fault recovery


Foundation model training has reached an inflection point where traditional checkpoint-based recovery methods are becoming a bottleneck to efficiency and cost-effectiveness. As models grow to trillions of parameters and training clusters expand to thousands of AI accelerators, even minor disruptions can result in significant costs and delays.

In this post, we introduce checkpointless training on Amazon SageMaker HyperPod, a paradigm shift in model training that reduces the need for traditional checkpointing by enabling peer-to-peer state recovery. Results from production-scale validation show an 80–93% reduction in recovery time (from 15–30 minutes or more to under 2 minutes) and up to 95% training goodput on clusters with thousands of AI accelerators.

Understanding goodput

Foundation model training is one of the most resource-intensive processes in AI, often involving millions of dollars in compute spend across thousands of AI accelerators running for days to months. Because of the inherent all-or-none distributed synchrony across all ranks, even the loss of a single rank due to a software or hardware fault brings the training workload to a complete halt. To mitigate such localized faults, the industry has relied on checkpoint-based recovery: periodically saving training state (checkpoints) to a durable store at a user-defined checkpoint interval. When a fault occurs, the training workload resumes by restoring from the most recent saved checkpoint. This traditional restart-to-recover model has become increasingly untenable as model sizes grow from billions to trillions of parameters and training workloads grow from hundreds to thousands of AI accelerators.

This challenge of sustaining efficient training operations at scale has led to the concept of goodput: the actual useful work achieved by an AI training system compared to its theoretical maximum capacity. In foundation model training, goodput is impacted by system failures and recovery overhead. The gap between the system's theoretical maximum throughput and its actual productive output (goodput) grows larger with increased frequency of failures (which rises with cluster size), longer recovery times (which scale with model size and cluster size), and higher costs of idle resources during recovery. This definition helps frame why measuring and optimizing goodput becomes increasingly critical as AI training scales to larger clusters and more complex models, where even small inefficiencies can result in significant financial and time costs.

Consider a pre-training workload on a HyperPod cluster with 256 P5 instances, checkpointing every 20 minutes. Each disruption carries two costs: an average of 10 minutes of lost work plus 10 minutes for recovery. With ml.p5.48xlarge instances costing $55 per hour, each disruption costs $4,693 in compute time. For a month-long training run, daily disruptions would accumulate to $141,000 in additional cost and delay completion by 10 hours.
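To make the arithmetic explicit, the following sketch reproduces the cost figures from this example (the per-hour price and downtime values come from the scenario above, not from a pricing quote):

instances = 256
price_per_instance_hour = 55.0               # USD per hour, from the example above
downtime_hours = (10 + 10) / 60              # 10 minutes of lost work + 10 minutes of recovery
cost_per_disruption = instances * price_per_instance_hour * downtime_hours
print(f"${cost_per_disruption:,.0f} per disruption")                       # about $4,693
print(f"${cost_per_disruption * 30:,.0f} over 30 days of daily disruptions")  # about $141,000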

As cluster sizes grow, the likelihood and frequency of failures increase.

Because training spans thousands of nodes, disruptions caused by faults become increasingly frequent. Meanwhile, recovery becomes slower because the workload reinitialization overhead grows linearly with cluster size. The cumulative impact of large-scale AI training failures can reach millions of dollars annually and translates directly into delayed time-to-market, slower model iteration cycles, and competitive disadvantage. Every hour of idle GPU time is an hour not spent advancing model capabilities.

Checkpoint-based recovery

Checkpoint-based recovery in distributed training is far more complex and time-consuming than commonly understood. When a failure occurs in traditional distributed training, the restart process involves far more than loading the last checkpoint. Understanding what happens during recovery reveals why it takes so long and why the entire cluster must sit idle.

The all-or-none cascade

A single failure (one GPU error, one network timeout, or one hardware fault) can trigger a complete training cluster shutdown. Because distributed training treats all processes as tightly coupled, any single failure necessitates a complete restart. When any process fails, the orchestration system (for example, TorchElastic or Kubernetes) must terminate every process across the job and restart from scratch. Each restart requires navigating a complex, multi-stage recovery process where every stage is sequential and blocking (a rough latency model follows the list):

  • Stage 1: Training job restart – The training job orchestrator detects a failure and terminates all processes on all nodes, followed by a cluster-wide restart of the training job.
  • Stage 2: Process and network initialization – Every process must re-execute the training script from the beginning. That includes rank initialization, loading Python modules from durable storage such as Network File System (NFS) or object storage, and establishing the training topology and communication backend through peer discovery and process group creation. Process group initialization alone can take tens of minutes on large clusters.
  • Stage 3: Checkpoint retrieval – Each process must first identify the last fully saved checkpoint, then retrieve it from persistent storage (for example, NFS or object storage) and load multiple state dictionaries: the model's parameters and buffers, the optimizer's internal state (momentum, variance, and so on), the learning rate scheduler, and training loop metadata (epoch, batch number). This step can take tens of minutes or longer depending on cluster and model size.
  • Stage 4: Data loader initialization – The data-loading ranks have the additional responsibility of initializing the data buffers. That includes retrieving the data checkpoint from durable storage such as Amazon FSx or Amazon Simple Storage Service (Amazon S3) and prefetching the training data to start the training loop. Data checkpointing is a critical step to avoid processing the same data samples multiple times or skipping samples after a training disruption. Depending on the data mix strategy, data locality, and bandwidth, this process can take several minutes.
  • Stage 5: First step overhead – After the checkpoint and training data are retrieved and loaded, there is additional overhead to run the first training step, which we call first step overhead (FSO). During this first step, time is typically spent on memory allocation, creating and setting up the CUDA context for communication with GPUs, CUDA graph compilation, and so on.
  • Stage 6: Lost steps overhead – Only after all previous stages complete successfully can the training loop resume regular progress. Because training resumes from the last saved model checkpoint, all steps computed between that checkpoint and the fault are lost. These lost steps must be recomputed, which we call lost steps overhead (LSO). Following this recomputation phase, the training job resumes productive work that directly contributes to goodput.
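To see how these sequential stages add up, here is a rough back-of-the-envelope latency model; the per-stage durations are illustrative placeholders, not measurements:

# Rough latency model for the recovery stages above (all durations are
# illustrative placeholders): total downtime is the sum of the sequential
# stages plus the lost-steps recomputation (LSO).
STAGE_MINUTES = {
    "stage_1_job_restart": 2.0,
    "stage_2_process_and_network_init": 10.0,
    "stage_3_checkpoint_retrieval": 10.0,
    "stage_4_data_loader_init": 3.0,
    "stage_5_first_step_overhead": 2.0,
}

def lost_steps_overhead(checkpoint_interval_min: float, step_time_min: float) -> float:
    # On average, a fault lands halfway through the checkpoint interval,
    # and all steps since the last checkpoint must be recomputed.
    return (checkpoint_interval_min / 2) // step_time_min * step_time_min

downtime = sum(STAGE_MINUTES.values()) + lost_steps_overhead(20.0, 0.5)
print(f"~{downtime:.0f} minutes of downtime per fault under these assumptions")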

How checkpointless training eliminates these bottlenecks

The stages outlined above (termination and restart, process discovery and network setup, checkpoint retrieval, GPU context reinitialization, and training loop resumption) represent the fundamental bottlenecks in checkpoint-based recovery. Every stage is sequential and blocking, and training recovery can take minutes to several hours for large models. Critically, the entire cluster must wait for every stage to complete before training can resume.

Checkpointless training eliminates this cascade. It preserves model state coherence across the distributed cluster, eliminating the need for periodic snapshots. When failures occur, the system recovers quickly by using healthy peers, avoiding both the storage I/O operations and the full process restarts typically required by traditional checkpointing approaches.

Checkpointless training architecture


Checkpointless training is built on five components that work together to eliminate the traditional checkpoint-restart bottlenecks. Each component addresses a specific bottleneck in the recovery process, and together they enable automatic detection of and recovery from infrastructure faults in minutes with zero manual intervention, even with thousands of AI accelerators.

Component 1: TCPStore-less/root-less NCCL and Gloo initialization (optimizing stage 2)

In a typical distributed training setup (for example, using torch.distributed), all ranks must initialize a process group. The process group creates a communication layer, allowing all processes (or ranks, that is, individual nodes) to be aware of one another and exchange information. A TCPStore is often used as a rendezvous point where all ranks check in to discover one another's connection information. When thousands of ranks try to contact a designated root server (typically rank 0) at once, it becomes a bottleneck. This leads to a flood of simultaneous network requests to a single root server that can cause network congestion, increase latency by tens of minutes, and further slow the communication process.

Checkpointless training eliminates this centralized dependency. Instead of funneling all connection requests through a single root server, the system uses a symmetric address pattern where each rank independently computes peer connection information using a global group counter. Ranks connect directly to one another using predetermined port assignments, avoiding the TCPStore bottleneck. Process group initialization drops from tens of minutes to seconds, even on clusters with thousands of nodes. The system also eliminates the single-point-of-failure risk inherent in root-based initialization.
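Conceptually, the symmetric address pattern means every rank derives the same (host, port) endpoint for a given peer and process group, so no rendezvous server is needed. The sketch below only illustrates that idea; helpers such as BASE_PORT and node_address() are hypothetical and are not part of the HyperPod API:

# Illustrative sketch only: a deterministic, rendezvous-free address scheme.
# The real rootless NCCL/Gloo initialization in the checkpointless training
# container is more involved.
import os

BASE_PORT = 29500          # assumed base port range reserved for training traffic
PORTS_PER_GROUP = 8        # assumed ports reserved per process group

def node_address(rank: int) -> str:
    """Resolve a peer's host deterministically, e.g. from a stable hostname pattern."""
    node_id = rank // int(os.environ.get("GPUS_PER_NODE", "8"))
    return f"trainer-{node_id}.training-svc"   # hypothetical stable DNS name

def peer_endpoint(rank: int, group_counter: int) -> tuple[str, int]:
    # Every rank computes the same (host, port) for a given peer and process group,
    # so no TCPStore or root rank is needed to exchange connection info.
    port = BASE_PORT + (group_counter * PORTS_PER_GROUP) % 1000
    return node_address(rank), port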

Component 2: Memory-mapped data loading (optimizing stage 4)

One of the hidden costs in traditional recovery is reloading training data. When a process restarts, it must reload batches from disk, rebuild data loader state, and carefully position itself to avoid processing duplicate samples or skipping data. On large-scale training runs, this data loading can add minutes to every recovery cycle.

Checkpointless training uses memory-mapped data loading to maintain cached data across accelerators. Training data is mapped into shared memory regions that persist even when individual processes fail. When a node recovers, it doesn't reload data from disk but reconnects to the existing memory-mapped cache. The data loader state is preserved, helping to ensure that training continues from the correct position without duplicate or skipped samples. MMAP also reduces host CPU memory usage by maintaining just one copy of data per node (compared to eight copies with traditional data loaders on 8-GPU nodes), and training can resume immediately using cached batches while the data loader concurrently prefetches the next data in the background.
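The core mechanism can be pictured with a file-backed memory map in shared memory: the cache outlives any single training process, so a restarted worker re-attaches instead of re-reading from storage. This is a minimal sketch of the idea, not the HyperPod MMAP data loader; the cache path and shapes are assumptions:

# Minimal sketch: batches are staged in a file-backed memory map under /dev/shm
# so a restarted worker can re-attach to the same cache instead of re-reading
# and re-sharding data from durable storage.
import numpy as np

CACHE_PATH = "/dev/shm/train_batch_cache.npy"   # hypothetical per-node cache location
BATCHES, SEQ_LEN = 1024, 4096

def create_cache() -> np.memmap:
    # Called once per node; the backing file (and page cache) outlives any
    # single training process, so it survives individual process restarts.
    return np.memmap(CACHE_PATH, dtype=np.int32, mode="w+", shape=(BATCHES, SEQ_LEN))

def attach_cache() -> np.memmap:
    # A recovering process maps the existing file instead of reloading from disk or S3.
    return np.memmap(CACHE_PATH, dtype=np.int32, mode="r+", shape=(BATCHES, SEQ_LEN))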

Memory-mapped data loading workflow


Component 3: In-process recovery (optimizing stages 1, 2, and 5)

Traditional checkpoint-based recovery treats failures as job-level events: a single GPU error triggers termination of the entire distributed training job. Every process across the cluster must be killed and restarted, even though only one component failed.

Checkpointless training uses in-process recovery to isolate failures at the process level. When a GPU or process fails, only the failed process executes an in-process restart to rejoin the training loop within seconds, overcoming recoverable or transient errors. Healthy processes continue running without interruption. The failed process stays alive (avoiding a full process teardown), preserving the CUDA context, compiler cache, and GPU state, thereby eliminating minutes of reinitialization overhead. In cases where the error is non-recoverable (such as a hardware failure), the system automatically swaps the faulty component with a pre-warmed hot spare, enabling training to continue without disruption.

This eliminates the need for full cluster termination and restart, dramatically reducing recovery overhead.
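The pattern is conceptually similar to wrapping the training loop in a retry boundary that re-enters the same live process on recoverable faults. The sketch below illustrates that idea only; it is not the HPWrapper implementation, and the error type and restart limit are assumptions:

# Conceptual sketch of in-process recovery: the training loop is wrapped so a
# recoverable fault re-enters the same code block in the same live process,
# keeping the CUDA context warm, instead of triggering a job-level restart.
import time

class RecoverableTrainingError(RuntimeError):
    """Stand-in for transient faults (e.g. a collective timeout) flagged by a health check."""

def run_with_inprocess_recovery(train_loop, max_restarts: int = 10):
    restarts = 0
    while True:
        try:
            return train_loop()                 # restart code block (RCB)
        except RecoverableTrainingError as err:
            restarts += 1
            if restarts > max_restarts:
                raise                           # escalate to process-level recovery
            print(f"in-process restart {restarts}: {err}")
            time.sleep(1)                       # re-sync with peers, then re-enter the loop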

Component 4: Peer-to-peer state replication (optimizing stages 3 and 6)

Checkpoint-based recovery requires loading model and optimizer state from persistent storage (such as Amazon S3 or FSx for Lustre). For models with billions to trillions of parameters, this means transferring tens to hundreds of gigabytes over the network, deserializing state dictionaries, and reconstructing optimizer buffers, which can take tens of minutes and creates a massive I/O bottleneck.

The most significant innovation in checkpointless training is continuous peer-to-peer state replication. Instead of periodically saving model state to centralized storage, each GPU maintains redundant copies of its model shards on peer GPUs. When a failure occurs, the recovering process doesn't load from Amazon S3. It copies state directly from a healthy peer over the high-speed Elastic Fabric Adapter (EFA) network interconnect. This peer-to-peer architecture eliminates the I/O bottleneck that dominates traditional checkpoint recovery. State transfer happens in seconds, compared to minutes for loading multi-gigabyte checkpoints from storage. The recovering node pulls only the specific shards it needs, further reducing transfer time.
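The recovery path can be illustrated with plain torch.distributed point-to-point calls, assuming a simple pairing in which adjacent ranks mirror each other's shards. This is a simplified illustration, not the HyperPod replication protocol, which uses its own replication groups and EFA-optimized transfers:

# Simplified illustration of peer-to-peer state recovery with torch.distributed.
# Assumes the process group is already initialized and that ranks are paired
# (rank ^ 1) so each rank's shard is mirrored on its partner.
import torch
import torch.distributed as dist

def replica_peer(rank: int) -> int:
    # Assumption: adjacent ranks hold redundant copies of each other's shards.
    return rank ^ 1

def recover_state_from_peer(model: torch.nn.Module, recovering: bool):
    peer = replica_peer(dist.get_rank())
    for param in model.parameters():
        if recovering:
            dist.recv(param.data, src=peer)   # pull the shard directly over the fabric
        else:
            dist.send(param.data, dst=peer)   # healthy peer serves its redundant copy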

Component 5: SageMaker HyperPod training operator (optimizing all stages)

The SageMaker HyperPod training operator orchestrates the checkpointless training components, serving as the coordination layer that ties together initialization, data loading, checkpointless recovery, and checkpoint fallback mechanisms. It maintains a centralized control plane with a global view of training process health across the entire cluster, coordinating fault detection, recovery decisions, and cluster-wide synchronization.

The operator implements intelligent recovery escalation: it first attempts an in-process restart for failed components, and if that isn't possible (for example, because of container crashes or node failures), it escalates to process-level recovery. During a process-level recovery, instead of restarting the entire job, the operator restarts only the training processes, keeping the containers alive. As a result, recovery times are faster than a job-level restart, which requires tearing down and recreating the training infrastructure, involving pod rescheduling, container pulls, environment initialization, and reloading from checkpoints. When failures occur, the operator broadcasts coordinated stop signals to prevent cascading timeouts and integrates with the SageMaker HyperPod health-monitoring agent to automatically detect hardware issues and trigger recovery without manual intervention.
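The escalation policy can be summarized as a simple decision function. This is an illustrative sketch of the logic described above, not code that runs inside the operator:

# Sketch of the escalation policy: the cheapest recovery path is tried first,
# and only non-recoverable faults lead to node replacement.
from enum import Enum, auto

class Fault(Enum):
    TRANSIENT = auto()      # recoverable error, e.g. a transient CUDA/collective failure
    PROCESS_CRASH = auto()  # training process died, container still healthy
    NODE_FAILURE = auto()   # hardware fault, container or node unusable

def recovery_action(fault: Fault) -> str:
    if fault is Fault.TRANSIENT:
        return "in-process restart: re-enter the training loop in the live process"
    if fault is Fault.PROCESS_CRASH:
        return "process-level restart: relaunch training processes, keep containers alive"
    return "replace the node with a pre-warmed spare and recover state from peers"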

Getting started with checkpointless training

This section guides you through setting up and configuring checkpointless training on SageMaker HyperPod to reduce fault recovery from hours to minutes.

Prerequisites

Before integrating checkpointless training into your training workload, verify that your environment meets the following requirements:

Infrastructure requirements:

Software requirements:

  • Supported frameworks: NeMo, PyTorch, PyTorch Lightning
  • Training data formats: JSON, JSONGZ (compressed JSON), or ARROW
  • Amazon Elastic Container Registry (Amazon ECR) repository for container images. Use the HyperPod checkpointless training container, which is required for rootless NCCL initialization (Tier 1) and peer-to-peer checkpointless recovery (Tier 4):
658645717510.dkr.ecr..amazonaws.com/sagemaker-hyperpod/pytorch-training:2.3.0-checkpointless

Checkpointless training workflow

Checkpointless training is designed for incremental adoption. You can start with basic capabilities and progressively enable advanced features as your training scales. The integration is organized into four tiers, each building on the previous one:

Tier 1: NCCL initialization optimization

NCCL initialization optimization eliminates the centralized root process bottleneck during initialization. Nodes discover and connect to peers independently using infrastructure signals. This enables faster process group initialization (seconds instead of minutes) and eliminates the single point of failure during startup.

Integration steps: Enable the environment variables as part of the job specification and verify that the job runs with the checkpointless training container.

# kubernetes job spec
env:
  - name: HPCT_USE_CONN_DATA # Enable rootless initialization
    value: "1"
  - name: TORCH_SKIP_TCPSTORE # Enable TCPStore removal
    value: "1"

Tier 2: Memory-mapped data loading

Memory-mapped data loading keeps training data cached in shared memory across process restarts, eliminating data reload overhead during recovery. This enables instant data access during recovery, with no need to reload or re-shuffle data when a process restarts.

Integration steps: Augment the existing data loader with a memory-mapped cache.

from hyperpod_checkpointless_training.dataloader.mmap_data_module import MMAPDataModule
from hyperpod_checkpointless_training.dataloader.config import CacheResumeMMAPConfig

base_data_module = MY_DATA_MODULE(...)  # Customer's own data module

mmap_config = CacheResumeMMAPConfig(
    cache_dir=self.cfg.mmap.cache_dir,
)

mmap_dm = MMAPDataModule(
    data_module=base_data_module,
    mmap_config=mmap_config,
)

Tier 3: In-process recovery

In-process recovery isolates failures to individual processes instead of requiring full job restarts. Failed processes recover independently while healthy processes continue training. It enables sub-minute recovery from process-level failures: healthy processes stay alive, while failed processes recover on their own.

Integration steps:

from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck
from hyperpod_checkpointless_training.inprocess.wrap import HPCallWrapper, HPWrapper
from hyperpod_checkpointless_training.inprocess.train_utils import HPAgentK8sAPIFactory

@HPWrapper(
    health_check=CudaHealthCheck(),
    hp_api_factory=HPAgentK8sAPIFactory(),
    abort_timeout=60.0,
)
def re_executable_codeblock():  # The re-executable code block defined by the user, usually the main function or training loop
    ...

Tier 4: Checkpointless (peer-to-peer recovery) (NeMo integration)

Checkpointless recovery enables full peer-to-peer state replication and recovery. Failed processes recover model and optimizer state directly from healthy peers without loading from storage. This step eliminates checkpoint loading: failed processes recover model and optimizer state from healthy replicas over the high-speed EFA interconnect.

Integration steps:

from hyperpod_checkpointless_training.inprocess.train_utils import wait_rank
# (imports of the checkpointless NeMo integration classes used below are omitted)

wait_rank()

def main():
    @HPWrapper(
        health_check=CudaHealthCheck(),
        hp_api_factory=HPAgentK8sAPIFactory(),
        abort_timeout=60.0,
        checkpoint_manager=PEFTCheckpointManager(enable_offload=True),
        abort=CheckpointlessAbortManager.get_default_checkpointless_abort(),
        finalize=CheckpointlessFinalizeCleanup(),
    )
    def run_main(cfg, caller: Optional[HPCallWrapper] = None):
        ...
        trainer = Trainer(
            strategy=CheckpointlessMegatronStrategy(...,
                num_distributed_optimizer_instances=2),
            callbacks=[..., CheckpointlessCallback(...)],
        )
        trainer.fresume = resume
        trainer._checkpoint_connector = CheckpointlessCompatibleConnector(trainer)
        trainer.wrapper = caller

wait_rank: All ranks wait for the rank information from the HyperPod training operator infrastructure.

HPWrapper: A Python function wrapper that enables restart capabilities for a restart code block (RCB). The implementation uses a context manager instead of a Python decorator because the call wrapper lacks information about the number of RCBs it should monitor.

CudaHealthCheck: Helps ensure that the CUDA context for the current process is in a healthy state. It synchronizes with the GPU and uses the device corresponding to the LOCAL_RANK environment variable, or the main thread's default CUDA device if LOCAL_RANK was not specified in the environment.

HPAgentK8sAPIFactory: The API that checkpointless training uses to learn the training status of the other pods in a Kubernetes training cluster. It also provides an infrastructure-level barrier, which makes sure every rank can successfully perform the abort and restart.

CheckpointManager: Manages in-memory checkpoints and peer-to-peer recovery for checkpointless fault tolerance.

We recommend starting with Tier 1 and validating it in your environment. Add Tier 2 when data loading overhead becomes a bottleneck. Adopt Tier 3 and Tier 4 for maximum resilience on the largest training clusters.

For NeMo users and HyperPod recipe users, Tier 4 is available out of the box with minimal configuration changes for the Llama and GPT open source recipes. NeMo examples for Llama and GPT open source models can be found in SageMaker HyperPod checkpointless training.

Performance results

Checkpointless training has been validated at production scale across multiple cluster configurations. The latest Amazon Nova models were trained using this technology on tens of thousands of AI accelerators.

In this section, we present results from extensive testing across a range of cluster sizes, spanning 16 GPUs to 2,304 GPUs. Checkpointless training demonstrated significant improvements in recovery time, consistently reducing downtime by 80–93% compared to traditional checkpoint-based recovery.

Cluster (H100s) | Model                      | Traditional recovery | Checkpointless recovery | Improvement
2,304 GPUs      | Internal model             | 15–30 minutes        | Less than 2 minutes     | ~87–93% faster
256 GPUs        | Llama-3 70B (pre-training) | 4 min, 52 sec        | 47 seconds              | ~84% faster
16 GPUs         | Llama-3 70B (fine-tuning)  | 5 min, 10 sec        | 50 seconds              | ~84% faster

These recovery time improvements have a direct relationship to ML goodput, defined as the percentage of time your cluster spends making forward progress on training rather than sitting idle during failures. As clusters scale to thousands of nodes, failure frequency increases proportionally. At the same time, traditional checkpoint-based recovery times also increase with cluster size because of growing coordination overhead. This creates a compounding problem: more frequent failures combined with longer recovery times rapidly erode goodput at scale.
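A back-of-the-envelope model shows why this compounding matters. In the sketch below, the per-accelerator failure rate and the recovery and lost-work costs are illustrative assumptions, not measured values:

def expected_goodput(accelerators: int,
                     failures_per_accelerator_per_day: float = 0.002,
                     recovery_minutes: float = 30.0,
                     lost_work_minutes: float = 10.0) -> float:
    # Fraction of wall-clock time spent making forward progress, given that every
    # failure idles the whole cluster for recovery plus lost-work recomputation.
    failures_per_day = accelerators * failures_per_accelerator_per_day
    lost_minutes_per_day = failures_per_day * (recovery_minutes + lost_work_minutes)
    return 1.0 - lost_minutes_per_day / (24 * 60)

for n in (256, 1024, 2304):
    traditional = expected_goodput(n)
    checkpointless = expected_goodput(n, recovery_minutes=2.0, lost_work_minutes=0.5)
    print(f"{n:>5} accelerators: traditional ~{traditional:.1%}, checkpointless ~{checkpointless:.1%}")

Under these assumptions, traditional recovery drops below 90% goodput at a few thousand accelerators, while a two-minute peer-to-peer recovery keeps goodput above 95%, which is consistent with the results reported below.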

Checkpointless training optimizes across the entire recovery stack, enabling greater than 95% goodput even on clusters with thousands of AI accelerators. Based on our internal studies, we consistently observed goodput upwards of 95% across massive-scale deployments exceeding 2,300 GPUs.

We also verified that model training accuracy is not impacted by checkpointless training. Specifically, we measured checksum matching for traditional checkpoint-based training and checkpointless training, and at every training step verified a bit-wise match on training loss. The following plot shows the training loss for a Llama-3 70B pre-training workload on 32 ml.p5.48xlarge instances for both traditional checkpointing and checkpointless training.

Conclusion

Foundation model training has reached an inflection point. As clusters scale to thousands of AI accelerators and training runs extend to months, the traditional checkpoint-based recovery paradigm is increasingly becoming a bottleneck. A single GPU failure that previously would have caused minutes of downtime now triggers tens of minutes of cluster-wide idle time on thousands of AI accelerators, with cumulative costs reaching millions of dollars annually.

Checkpointless training rethinks this paradigm entirely by treating failures as local, recoverable events rather than cluster-wide catastrophes. Failed processes recover state from healthy peers in seconds, enabling the rest of the cluster to continue making forward progress. The shift is fundamental: from "How do we restart quickly?" to "How do we avoid stopping at all?"

This technology has enabled greater than 95% goodput when training on SageMaker HyperPod. Our internal studies on 2,304 GPUs show recovery times dropping from 15–30 minutes to under 90 seconds, translating to an over 80% reduction in idle GPU time per failure.

To get started, explore What is Amazon SageMaker AI?. Sample implementations and recipes are available in the AWS GitHub HyperPod checkpointless training and SageMaker HyperPod recipes repositories.


About the Authors

Anirudh Viswanathan is a Senior Product Manager, Technical, at AWS with the SageMaker team, where he focuses on Machine Learning. He holds a Master's in Robotics from Carnegie Mellon University and an MBA from the Wharton School of Business. Anirudh is a named inventor on more than 50 AI/ML patents. He enjoys long-distance running, exploring art galleries, and attending Broadway shows. You can connect with Anirudh on LinkedIn.

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises, train and deploy foundation models efficiently on AWS. He has a background in microprocessor engineering and is passionate about computational optimization problems and improving the performance of AI workloads. You can connect with Roy on LinkedIn.

Fei Wu is a Senior Software Developer at AWS on the SageMaker team. Fei's focus is on ML systems and distributed training techniques. He holds a PhD in Electrical Engineering from Stony Brook University. Outside of work, Fei enjoys playing basketball and watching movies. You can connect with Fei on LinkedIn.

Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. At AWS, Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.

Anirban Roy is a Principal Engineer at AWS with the SageMaker team, primarily focusing on AI training infrastructure, resiliency, and observability. He holds a Master's in Computer Science from the Indian Statistical Institute in Kolkata. Anirban is a seasoned distributed software systems builder with more than 20 years of experience and multiple patents and publications. He enjoys road biking, reading non-fiction, gardening, and nature travel. You can connect with Anirban on LinkedIn.

Arun Nagarajan is a Principal Engineer on the Amazon SageMaker AI team, where he currently focuses on distributed training across the entire stack. Since joining the SageMaker team during its launch year, Arun has contributed to multiple products within SageMaker AI, including real-time inference and MLOps solutions. When he's not working on machine learning infrastructure, he enjoys exploring the outdoors in the Pacific Northwest and hitting the slopes for snowboarding.
