Monday, February 9, 2026

Scale LLM fine-tuning with Hugging Face and Amazon SageMaker AI


Enterprises are increasingly shifting from relying solely on giant, general-purpose language models to building specialized large language models (LLMs) fine-tuned on their own proprietary data. Although foundation models (FMs) offer impressive general capabilities, they often fall short when applied to the complexities of enterprise environments, where accuracy, security, compliance, and domain-specific knowledge are non-negotiable.

To meet these demands, organizations are adopting cost-efficient models tailored to their internal data and workflows. By fine-tuning on proprietary documents and domain-specific terminology, enterprises are building models that understand their unique context, resulting in more relevant outputs, tighter data governance, and simpler deployment across internal tools.

This shift is also a strategic move to reduce operational costs, improve inference latency, and maintain greater control over data privacy. As a result, enterprises are redefining their AI strategy around customized, right-sized models aligned to their business needs.

Scaling LLM fine-tuning for enterprise use cases presents real technical and operational hurdles, which are being overcome through the powerful partnership between Hugging Face and Amazon SageMaker AI.

Many organizations face fragmented toolchains and growing complexity when adopting advanced fine-tuning techniques like Low-Rank Adaptation (LoRA), QLoRA, and Reinforcement Learning from Human Feedback (RLHF). Additionally, the resource demands of large model training, including memory limitations and distributed infrastructure challenges, often slow down innovation and strain internal teams.

To overcome this, SageMaker AI and Hugging Face have joined forces to simplify and scale model customization. By integrating the Hugging Face Transformers libraries into SageMaker’s fully managed infrastructure, enterprises can now:

  • Run distributed fine-tuning jobs out of the box, with built-in support for parameter-efficient tuning methods
  • Use optimized compute and storage configurations that reduce training costs and improve GPU utilization
  • Accelerate time to value by using familiar open source libraries in a production-grade environment

This collaboration helps businesses focus on building domain-specific, right-sized LLMs, unlocking AI value faster while maintaining full control over their data and models.

In this post, we show how this integrated approach transforms enterprise LLM fine-tuning from a complex, resource-intensive challenge into a streamlined, scalable solution for achieving better model performance in domain-specific applications. We use the meta-llama/Llama-3.1-8B model and execute a Supervised Fine-Tuning (SFT) job to improve the model’s reasoning capabilities on the MedReason dataset, using distributed training and optimization techniques such as Fully Sharded Data Parallel (FSDP) and LoRA with the Hugging Face Transformers library, executed with Amazon SageMaker Training Jobs.

Understanding the core concepts

The Hugging Face Transformers library is an open source toolkit designed to fine-tune LLMs by enabling seamless experimentation and deployment with popular transformer models.

The Transformers library supports a variety of methods for aligning LLMs to specific objectives, including:

  • Thousands of pre-trained models – Access to a vast collection of models like BERT, Meta Llama, Qwen, T5, and more, which can be used for tasks such as text classification, translation, summarization, question answering, object detection, and speech recognition.
  • Pipelines API – Simplifies common tasks (such as sentiment analysis, summarization, and image segmentation) by handling tokenization, inference, and output formatting in a single call (see the short example after this list).
  • Trainer API – Provides a high-level interface for training and fine-tuning models, supporting features like mixed precision, distributed training, and integration with popular hardware accelerators.
  • Tokenization tools – Efficient and flexible tokenizers for converting raw text into model-ready inputs, supporting multiple languages and formats.
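
The following minimal sketch illustrates the Pipelines API; the task name is a standard library task, the default checkpoint is downloaded on first use, and the example sentence is illustrative:

    from transformers import pipeline

    # Build a sentiment-analysis pipeline; a default pre-trained checkpoint is
    # downloaded the first time this runs.
    classifier = pipeline("sentiment-analysis")

    # Tokenization, inference, and output formatting happen in a single call.
    print(classifier("Fine-tuning a domain-specific LLM improved our answer quality."))
    # Example output: [{'label': 'POSITIVE', 'score': 0.99...}]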

SageMaker Training Jobs is a fully managed, on-demand machine learning (ML) service that runs remotely on AWS infrastructure to train a model using your data, code, and chosen compute resources. This service abstracts away the complexities of provisioning and managing the underlying infrastructure, so you can focus on developing and fine-tuning your ML and foundation models. Key capabilities offered by SageMaker training jobs are:

  • Fully managed – SageMaker handles resource provisioning, scaling, and management for your training jobs, so you don’t need to manually set up servers or clusters.
  • Flexible input – You can use built-in algorithms, pre-built containers, or bring your own custom training scripts and Docker containers to execute training workloads with the most popular frameworks, such as the Hugging Face Transformers library.
  • Scalable – It supports single-node or distributed training across multiple instances, making it suitable for both small and large-scale ML workloads.
  • Integration with multiple data sources – Training data can be stored in Amazon Simple Storage Service (Amazon S3), Amazon FSx, and Amazon Elastic Block Store (Amazon EBS), and output model artifacts are saved back to Amazon S3 after training is complete.
  • Customizable – You can specify hyperparameters, resource types (such as GPU or CPU instances), and other settings for each training job.
  • Cost-efficient options – Features like managed Spot Instances, flexible training plans, and heterogeneous clusters help optimize training costs.

Solution overview

The following diagram illustrates the solution workflow of using the Hugging Face Transformers library with a SageMaker Training job.

The workflow consists of the following steps:

  1. The user prepares the dataset by formatting it with the specific prompt style used for the chosen model.
  2. The user prepares the training script by using the Hugging Face Transformers library to start the training workload, specifying the configuration for the chosen distribution option, such as Distributed Data Parallel (DDP) or Fully Sharded Data Parallel (FSDP).
  3. The user submits an API request to SageMaker AI, passing the location of the training script, the Hugging Face training container URI, and the required training configurations, such as the distribution algorithm, instance type, and instance count.
  4. SageMaker AI uses the training job launcher script to run the training workload on a managed compute cluster. Based on the selected configuration, SageMaker AI provisions the required infrastructure, orchestrates distributed training, and, upon completion, automatically decommissions the cluster.

This streamlined architecture delivers a fully managed user experience, helping you quickly develop your training code, define training parameters, and select your preferred infrastructure. SageMaker AI handles the end-to-end infrastructure management with a pay-as-you-go pricing model that bills only for the net training time in seconds.

Prerequisites

You must complete the following prerequisites before you can run the Meta Llama 3.1 8B fine-tuning notebook:

  1. Make the following quota increase requests for SageMaker AI. For this use case, you will need to request a minimum of 1 p4d.24xlarge instance (with 8 x NVIDIA A100 GPUs) and scale to more p4d.24xlarge instances (depending on the time-to-train and cost-to-train trade-offs for your use case). To help determine the right cluster size for the fine-tuning workload, you can use tools like the VRAM Calculator or "Can it run LLM". On the Service Quotas console, request the following SageMaker AI quotas:
    • P4D instances (ml.p4d.24xlarge) for training job usage: 1
  2. Create an AWS Identity and Access Management (IAM) role with the managed policies AmazonSageMakerFullAccess and AmazonS3FullAccess to give SageMaker AI the access required to run the examples.
  3. Assign the following policy as a trust relationship to your IAM role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "sagemaker.amazonaws.com"
                    ]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
    

  4. (Optional) Create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. You can also use JupyterLab in your local setup.

These permissions grant broad access and are not recommended for use in production environments. See the SageMaker Developer Guide for guidance on defining more fine-grained permissions.

Prepare the dataset

To prepare the dataset, you need to load the UCSC-VLAA/MedReason dataset. MedReason is a large-scale, high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in LLMs. The following table shows an example of the data.

dataset_name | id_in_dataset | question | answer | reasoning | options
medmcqa | 7131 | Urogenital diaphragm is made up of the following… | Colle’s fascia. Explanation: Colle’s fascia do… | Finding reasoning paths:\n1. Urogenital diaphr… | Answer Choices:\nA. Deep transverse Perineus\n…
medmcqa | 7133 | Child with Type I Diabetes. What is the advise… | After 5 years. Explanation: Screening for diab… | **Finding reasoning paths:**\n\n1. Type 1 Diab… | Answer Choices:\nA. After 5 years\nB. After 2 …
medmcqa | 7134 | Most sensitive test for H pylori is- | Biopsy urease test. Explanation:\n\nDavidson&… | **Finding reasoning paths:**\n\n1. Consider th… | Answer Choices:\nA. Fecal antigen test\nB. Bio…

We want to use the following columns for preparing our dataset:

  • question – The question being posed
  • answer – The correct answer to the question
  • reasoning – A detailed, step-by-step logical explanation of how to arrive at the correct answer

We can use the following steps to format the input in the proper form used for Meta Llama 3.1, and configure the data channels for SageMaker training jobs on Amazon S3:

  1. Load the UCSC-VLAA/MedReason dataset, using the first 10,000 rows of the original dataset:
    from datasets import load_dataset
    dataset = load_dataset("UCSC-VLAA/MedReason", split="train[:10000]")
  2. Apply the proper chat template to the dataset by using the apply_chat_template method of the tokenizer:
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    def prepare_dataset(sample):

        system_text = (
            "You are a deep-thinking AI assistant.\n\n"
            "For every user question, first write your thoughts and reasoning inside ... tags, then provide your answer."
        )

        messages = []

        messages.append({"role": "system", "content": system_text})
        messages.append({"role": "user", "content": sample["question"]})
        messages.append(
            {
                "role": "assistant",
                "content": f"\n{sample['reasoning']}\n\n{sample['answer']}",
            }
        )

        # Apply the chat template to build the final training text
        sample["text"] = tokenizer.apply_chat_template(
            messages, tokenize=False
        )

        return sample
    

    The prepare_dataset function iterates over the elements of the dataset and uses the apply_chat_template function to produce a prompt template in the following form:

    system
    {{SYSTEM_PROMPT}}
    user
    {{QUESTION}}
    assistant
    
    {{REASONING}}
    
    
    {{FINAL_ANSWER}}
    

    The following code is an example of the formatted prompt:

    <|begin_of_text|><|start_header_id|>system<|end_header_id|> 
    You are a deep-thinking AI assistant. 
    For every user question, first write your thoughts and reasoning inside ... tags, then provide your answer.
    <|eot_id|><|start_header_id|>user<|end_header_id|> 
    A 66-year-old man presents to the emergency room with blurred vision, lightheadedness, and chest pain that started 30 minutes ago. The patient is awake and alert. 
    His history is significant for uncontrolled hypertension and coronary artery disease, and he previously underwent percutaneous coronary intervention. 
    He is afebrile. The heart rate is 102/min, the blood pressure is 240/135 mm Hg, and the O2 saturation is 100% on room air. 
    An ECG is performed and shows no acute changes. A rapid intravenous infusion of a drug that increases peripheral venous capacitance is started. 
    This drug has an onset of action of less than 1 minute and a rapid serum clearance that necessitates a continuous infusion. What is the most severe side effect of this medication?
    <|eot_id|><|start_header_id|>assistant<|end_header_id|> 
     
    ### Finding Reasoning Paths: 
    1. **Blurred vision, lightheadedness, and chest pain** → Malignant hypertension → Rapid IV antihypertensive therapy. 
    2. **Uncontrolled hypertension and coronary artery disease** → Malignant hypertension → Rapid IV antihypertensive therapy. 
    3. **Severe hypertension (BP 240/135 mm Hg)** → Risk of end-organ damage → Malignant hypertension → Rapid IV antihypertensive therapy. 
    4. **Chest pain and history of coronary artery disease** → Risk of myocardial ischemia → Malignant hypertension → Rapid IV antihypertensive therapy. --- 
    
    ### Reasoning Process: 
    1. **Clinical Presentation and Diagnosis**:  - The patient presents with blurred vision...
    ...
     
    
    Cyanide poisoning
    <|eot_id|><|end_of_text|>
    

  3. Split the dataset into train, validation, and test datasets:
    from datasets import Dataset, DatasetDict
    from random import randint

    train_dataset = Dataset.from_pandas(train)
    val_dataset = Dataset.from_pandas(val)
    test_dataset = Dataset.from_pandas(test)

    dataset = DatasetDict({"train": train_dataset, "val": val_dataset})
    train_dataset = dataset["train"].map(
        prepare_dataset, remove_columns=list(train_dataset.features)
    )

    val_dataset = dataset["val"].map(
        prepare_dataset, remove_columns=list(val_dataset.features)
    )
    

  4. Prepare the training and validation datasets for the SageMaker training job by saving them as JSON files and constructing the S3 paths where these files will be uploaded:
    ...
     
    train_dataset.to_json("./data/train/dataset.jsonl")
    val_dataset.to_json("./data/val/dataset.jsonl")


    s3_client.upload_file(
        "./data/train/dataset.jsonl", bucket_name, f"{input_path}/train/dataset.jsonl"
    )
    s3_client.upload_file(
        "./data/val/dataset.jsonl", bucket_name, f"{input_path}/val/dataset.jsonl"
    )
    

Prepare the training script

To fine-tune meta-llama/Llama-3.1-8B with a SageMaker Training job, we prepared the train.py file, which serves as the entry point of the training job to execute the fine-tuning workload.

The training process can use the Trainer or SFTTrainer classes to fine-tune our model. This simplifies the process of continued pre-training for LLMs and makes fine-tuning efficient for adapting pre-trained models to specific tasks or domains.

The Trainer and SFTTrainer classes both facilitate model training with Hugging Face Transformers. The Trainer class is the standard high-level API for training and evaluating transformer models on a wide range of tasks, including text classification, sequence labeling, and text generation. The SFTTrainer is a subclass built specifically for supervised fine-tuning of LLMs, particularly for instruction-following or conversational tasks.
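
Although this post uses the Trainer class, for comparison the following is a minimal SFTTrainer sketch (assuming the trl library is installed; the dataset path and configuration values are illustrative, not the exact settings used in this post):

    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    # Dataset with a "text" column already formatted with the chat template
    train_ds = load_dataset("json", data_files="./data/train/dataset.jsonl", split="train")

    # SFTTrainer wraps Trainer with defaults suited to supervised fine-tuning of LLMs
    trainer = SFTTrainer(
        model="meta-llama/Llama-3.1-8B",   # model id or a preloaded model object
        train_dataset=train_ds,
        args=SFTConfig(
            output_dir="/opt/ml/model",
            dataset_text_field="text",     # column containing the formatted prompt
            per_device_train_batch_size=4,
            num_train_epochs=2,
            bf16=True,
        ),
    )
    trainer.train()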

To accelerate the model fine-tuning, we distribute the training workload by using the FSDP technique. FSDP is an advanced parallelism technique designed to train large models that might not fit in the memory of a single GPU, with the following benefits (see the short FSDP sketch after this list):

  • Parameter sharding – Instead of replicating the full model on each GPU, FSDP splits (shards) model parameters, optimizer states, and gradients across GPUs
  • Memory efficiency – By sharding, FSDP drastically reduces the memory footprint on each device, enabling training of larger models or larger batch sizes
  • Synchronization – During training, FSDP gathers only the necessary parameters for each computation step, then releases memory immediately after, further saving resources
  • CPU offload – Optionally, FSDP can offload some data to CPUs to save even more GPU memory
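
To make the sharding idea concrete, the following is a minimal PyTorch-level sketch of wrapping a model with FSDP when launched with torchrun (in our training script, FSDP is instead driven by the fsdp settings in TrainingArguments shown later):

    import os

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from transformers import AutoModelForCausalLM

    # torchrun launches one process per GPU and sets LOCAL_RANK for each of them
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    # Shard parameters, gradients, and optimizer states across ranks instead of
    # replicating the full model on every GPU
    model = FSDP(model, device_id=local_rank)
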
  1. In our example, we use the Trainer class and define the required TrainingArguments to execute the FSDP distributed workload:
    from transformers import (
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    trainer = Trainer(
        model=model,
        train_dataset=train_ds,
        eval_dataset=test_ds if test_ds is not None else None,
        args=TrainingArguments(
            **training_args,
        ),
        callbacks=callbacks,
        data_collator=DataCollatorForLanguageModeling(
            tokenizer, mlm=False
        )
    )
    

  2. To further optimize the fine-tuning workload, we use the QLoRA technique, which quantizes a pre-trained language model to 4 bits and attaches small low-rank adapters, which are then fine-tuned:
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(script_args.model_id)

    # Define the PAD token
    tokenizer.pad_token = tokenizer.eos_token

    # Configure 4-bit quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_storage=torch.bfloat16
    )

    # Load the model
    model = AutoModelForCausalLM.from_pretrained(
        script_args.model_id,
        trust_remote_code=True,
        quantization_config=bnb_config,
        use_cache=not training_args.gradient_checkpointing,
        cache_dir="/tmp/.cache",
        **model_configs,
    )
    

  3. The script_args and training_args are provided as hyperparameters for the SageMaker Training job in a configuration recipe .yaml file and parsed in the train.py file by using the TrlParser class provided by Hugging Face TRL (a minimal parsing and LoRA setup sketch follows this list):
    model_id: "meta-llama/Llama-3.1-8B-Instruct"      # Hugging Face mannequin id
    # sagemaker particular parameters
    output_dir: "/choose/ml/mannequin"                       # path to the place SageMaker will add the mannequin 
    checkpoint_dir: "/choose/ml/checkpoints/"            # path to the place SageMaker will add the mannequin checkpoints
    train_dataset_path: "/choose/ml/enter/knowledge/practice/"   # path to the place S3 saves practice dataset
    val_dataset_path: "/choose/ml/enter/knowledge/val/"       # path to the place S3 saves check dataset
    save_steps: 100                                   # Save checkpoint each this many steps
    token: ""
    # coaching parameters
    lora_r: 32
    lora_alpha:64
    lora_dropout: 0.1                 
    learning_rate: 2e-4                    # studying price scheduler
    num_train_epochs: 2                    # variety of coaching epochs
    per_device_train_batch_size: 4         # batch measurement per gadget throughout coaching
    per_device_eval_batch_size: 2          # batch measurement for analysis
    gradient_accumulation_steps: 4         # variety of steps earlier than performing a backward/replace move
    gradient_checkpointing: true           # use gradient checkpointing
    bf16: true                             # use bfloat16 precision
    tf32: false                            # use tf32 precision
    fsdp: "full_shard auto_wrap offload"   #FSDP configurations
    fsdp_config: 
        backward_prefetch: "backward_pre"
        cpu_ram_efficient_loading: true
        offload_params: true
        forward_prefetch: false
        use_orig_params: true
    warmup_steps: 100
    weight_decay: 0.01
    merge_weights: true                    # merge weights within the base mannequin
    

    For this use case, we decided to fine-tune the adapter with the following values:

    • lora_r: 32 – Allows the adapter to capture more complex reasoning transformations.
    • lora_alpha: 64 – Given the reasoning task we are trying to improve, this value allows the adapter to have a strong influence on the base model.
    • lora_dropout: 0.05 – We want to preserve reasoning connections by avoiding breaking important ones.
    • warmup_steps: 100 – Gradually increases the learning rate to the specified value. For this reasoning task, we want the model to learn a new structure without forgetting its prior knowledge.
    • weight_decay: 0.01 – Maintains model generalization.
  4. Prepare the configuration file for the SageMaker Training job by saving it as a YAML file and constructing the S3 path where the file will be uploaded:
    import os

    if default_prefix:
        input_path = f"{default_prefix}/datasets/llm-fine-tuning-modeltrainer-sft"
    else:
        input_path = "datasets/llm-fine-tuning-modeltrainer-sft"

    train_config_s3_path = f"s3://{bucket_name}/{input_path}/config/args.yaml"

    # upload the model yaml file to s3
    model_yaml = "args.yaml"
    s3_client.upload_file(model_yaml, bucket_name, f"{input_path}/config/args.yaml")
    os.remove("./args.yaml")

    print("Training config uploaded to:")
    print(train_config_s3_path)
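
As referenced in step 3, the following is a minimal sketch of how the train.py entry point could parse this recipe with TrlParser and build the LoRA adapter from the lora_r, lora_alpha, and lora_dropout values (the ScriptArguments dataclass, its fields, and the target_modules choice are illustrative assumptions; the actual script in the repository may differ):

    from dataclasses import dataclass, field

    from peft import LoraConfig, get_peft_model
    from transformers import TrainingArguments
    from trl import TrlParser

    @dataclass
    class ScriptArguments:
        # Illustrative subset of the recipe fields
        model_id: str = field(default="meta-llama/Llama-3.1-8B-Instruct")
        train_dataset_path: str = field(default="/opt/ml/input/data/train/")
        val_dataset_path: str = field(default="/opt/ml/input/data/val/")
        lora_r: int = field(default=32)
        lora_alpha: int = field(default=64)
        lora_dropout: float = field(default=0.1)
        merge_weights: bool = field(default=True)

    # Reads the --config /opt/ml/input/data/config/args.yaml hyperparameter
    parser = TrlParser((ScriptArguments, TrainingArguments))
    script_args, training_args = parser.parse_args_and_config()

    # Map the recipe values onto a PEFT LoRA adapter configuration
    peft_config = LoraConfig(
        r=script_args.lora_r,
        lora_alpha=script_args.lora_alpha,
        lora_dropout=script_args.lora_dropout,
        target_modules="all-linear",
        task_type="CAUSAL_LM",
    )
    # The quantized base model loaded earlier would then be wrapped with the adapter:
    # model = get_peft_model(model, peft_config)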

SFT training using a SageMaker Training job

To run a fine-tuning workload using the SFT training script and SageMaker Training jobs, we use the ModelTrainer class.

The ModelTrainer class is a newer and more intuitive approach to model training that significantly improves the user experience and supports distributed training, Bring Your Own Container (BYOC), and recipes. For more information, refer to the SageMaker Python SDK documentation.

Set up the fine-tuning workload with the following steps:

  1. Specify the instance type, the container image for the training job, and the checkpoint path where the model will be saved:
    instance_type = "ml.p4d.24xlarge"
    instance_count = 1
    
    image_uri = image_uris.retrieve(
        framework="huggingface",
        area=sagemaker_session.boto_session.region_name,
        model="4.56.2",
        base_framework_version="pytorch2.8.0",
        instance_type=instance_type,
        image_scope="coaching",
    )
    

  2. Define the source code configuration by pointing to the created train.py:
    from sagemaker.train.configs import SourceCode

    source_code = SourceCode(
        source_dir="./scripts",
        requirements="requirements.txt",
        entry_script="train.py",
    )
    

  3. Configure the training compute by optionally providing the parameter keep_alive_period_in_seconds to use managed warm pools, which retain and reuse the cluster during the experimentation phase:
    from sagemaker.train.configs import Compute

    compute_configs = Compute(
        instance_type=instance_type,
        instance_count=instance_count,
        keep_alive_period_in_seconds=0,
    )
    

  4. Create the ModelTrainer by providing the required training setup, and define the argument distributed=Torchrun() to use torchrun as the launcher to execute the training job in a distributed manner across the available GPUs in the selected instance:
    from sagemaker.train.configs import (
        CheckpointConfig,
        OutputDataConfig,
        StoppingCondition,
    )
    from sagemaker.train.distributed import Torchrun
    from sagemaker.train.model_trainer import ModelTrainer


    # define the training job name
    job_name = f"train-{model_id.split('/')[-1].replace('.', '-')}-sft"

    # define the OutputDataConfig path
    output_path = f"s3://{bucket_name}/{job_name}"

    # Define the ModelTrainer
    model_trainer = ModelTrainer(
        training_image=image_uri,
        source_code=source_code,
        base_job_name=job_name,
        compute=compute_configs,
        distributed=Torchrun(),
        stopping_condition=StoppingCondition(max_runtime_in_seconds=18000),
        hyperparameters={
            "config": "/opt/ml/input/data/config/args.yaml"  # path to the TRL config uploaded to S3
        },
        output_data_config=OutputDataConfig(s3_output_path=output_path),
        checkpoint_config=CheckpointConfig(
            s3_uri=output_path + "/checkpoint", local_path="/opt/ml/checkpoints"
        ),
    )
    

  5. Set up the input channels for the ModelTrainer by creating InputData objects from the provided S3 bucket paths for the training and validation datasets, and for the configuration parameters:
    from sagemaker.train.configs import InputData

    # Pass the input data
    train_input = InputData(
        channel_name="train",
        data_source=train_dataset_s3_path,  # S3 path where the training data is stored
    )
    val_input = InputData(
        channel_name="val",
        data_source=val_dataset_s3_path,  # S3 path where the validation data is stored
    )
    config_input = InputData(
        channel_name="config",
        data_source=train_config_s3_path,  # S3 path where the configuration is stored
    )
    # Collect the configured input channels
    data = [train_input, val_input, config_input]
    

  6. Submit the training job:
    model_trainer.train(input_data_config=data, wait=False)

The training job with Flash Attention 2 for one epoch with a dataset of 10,000 samples takes approximately 18 minutes to complete.

Deploy and test the fine-tuned Meta Llama 3.1 8B on SageMaker AI

To evaluate your fine-tuned model, you have several options. You can use an additional SageMaker Training job to evaluate the model with Hugging Face LightEval on SageMaker AI, or you can deploy the model to a SageMaker real-time endpoint and interactively test it by using techniques like LLM-as-a-judge to compare generated content with ground truth content. For a more comprehensive evaluation that demonstrates the impact of fine-tuning on model performance, you can use the MedReason evaluation script to compare the base meta-llama/Llama-3.1-8B model with your fine-tuned version.

In this example, we use the deployment approach, iterating over the test dataset and evaluating the model on those samples using a simple loop.

  1. Select the instance type and the container image for the endpoint:
    import boto3

    sm_client = boto3.client("sagemaker", region_name=sess.boto_region_name)

    image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/vllm:0.13-gpu-py312"
    

  2. Create the SageMaker model using the container URI for vLLM and the S3 path to your model. Set your vLLM configuration, including the number of GPUs and max input tokens. For a full list of configuration options, see vLLM engine arguments.
    env = {
        "SM_VLLM_MODEL": "/opt/ml/model",
        "SM_VLLM_DTYPE": "bfloat16",
        "SM_VLLM_GPU_MEMORY_UTILIZATION": "0.8",
        "SM_VLLM_MAX_MODEL_LEN": json.dumps(1024 * 16),
        "SM_VLLM_MAX_NUM_SEQS": "1",
        "SM_VLLM_ENABLE_CHUNKED_PREFILL": "true",
        "SM_VLLM_KV_CACHE_DTYPE": "auto",
        "SM_VLLM_TENSOR_PARALLEL_SIZE": "4",
    }

    model_response = sm_client.create_model(
        ModelName=f"{model_id.split('/')[-1].replace('.', '-')}-model",
        ExecutionRoleArn=role,
        PrimaryContainer={
            "Image": image_uri,
            "Environment": env,
            "ModelDataSource": {
                "S3DataSource": {
                    "S3Uri": f"s3://{bucket_name}/{job_prefix}/{job_name}/output/model.tar.gz",
                    "S3DataType": "S3Prefix",
                    "CompressionType": "Gzip",
                }
            },
        },
    )
    

  3. Create the endpoint configuration by specifying the type and number of instances:
    instance_count = 1
    instance_type = "ml.g5.12xlarge"
    health_check_timeout = 700

    endpoint_config_response = sm_client.create_endpoint_config(
        EndpointConfigName=f"{model_id.split('/')[-1].replace('.', '-')}-config",
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": f"{model_id.split('/')[-1].replace('.', '-')}-model",
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
                "ModelDataDownloadTimeoutInSeconds": health_check_timeout,
                "ContainerStartupHealthCheckTimeoutInSeconds": health_check_timeout,
                "InferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-3-1",
            }
        ],
    )
    

  4. Deploy the model:
    endpoint_response = sm_client.create_endpoint(
        EndpointName=f"{model_id.split('/')[-1].replace('.', '-')}-sft",
        EndpointConfigName=f"{model_id.split('/')[-1].replace('.', '-')}-config",
    )
    

SageMaker AI will now create the endpoint and deploy the model to it. This can take 5–10 minutes. Afterwards, you can test the model by sending some example inputs to the endpoint. You can use the invoke_endpoint method of the sagemaker-runtime client to send the input to the model and get the output:

import json
import pandas as pd

eval_dataset = []

for index, el in enumerate(test_dataset, 1):
    print("Processing item ", index)

    payload = {
        "messages": [
            {
                "role": "system",
                "content": "You are a deep-thinking AI assistant.\n\nFor every user question, first write your thoughts and reasoning inside ... tags, then provide your answer.",
            },
            {"role": "user", "content": el["question"]},
        ],
        "max_tokens": 4096,
        "stop": ["<|eot_id|>", "<|end_of_text|>"],
        "temperature": 0.4,
        "top_p": 0.9,
        "repetition_penalty": 1.15,
        "no_repeat_ngram_size": 3,
        "do_sample": True,
    }

    # predictor is a boto3 sagemaker-runtime client created earlier
    response = predictor.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )

    result = json.loads(response["Body"].read().decode())
    eval_dataset.append([el["question"], result["choices"][0]["message"]["content"]])

    print("**********************************************")

eval_dataset_df = pd.DataFrame(
    eval_dataset, columns=["question", "answer"]
)

eval_dataset_df.to_json(
    "./eval_dataset_results.jsonl", orient="records", lines=True
)

The following are some examples of generated answers:

Query: "Perl's stain or prussion blue check is for:"

Answer fine-tuned: """

The Perl's stain or Prussian blue test is used to detect the presence of iron in biological samples. 
It involves adding potassium ferrocyanide (K4[Fe(CN)6]) to the sample, 
which reacts with the iron ions present in it to form a dark blue-colored compound known as ferric ferrocyanide. 
This reaction can be observed visually, allowing researchers to determine if iron is present in the sample.


In simpler terms, the Perl's stain or Prussian blue test is used to identify iron in biological samples.
"""

The fine-tuned model shows strong reasoning capabilities, providing structured, detailed explanations with a clear thought process and breaking the concepts down step by step before arriving at the final answer. This example showcases the effectiveness of our fine-tuning approach using Hugging Face Transformers and a SageMaker Training job.

Clean up

To clean up your resources and avoid incurring additional charges, follow these steps:

  1. Delete any unused SageMaker Studio resources.
  2. (Optional) Delete the SageMaker Studio domain.
  3. Verify that your training job is no longer running. To do so, on the SageMaker console, under Training in the navigation pane, choose Training jobs.
  4. Delete the SageMaker endpoint.

Conclusion

In this post, we demonstrated how enterprises can efficiently scale fine-tuning of both small and large language models by using the integration between the Hugging Face Transformers library and SageMaker Training jobs. This powerful combination transforms traditionally complex and resource-intensive processes into streamlined, scalable, and production-ready workflows.

Using a practical example with the meta-llama/Llama-3.1-8B model and the MedReason dataset, we showed how to apply advanced techniques like FSDP and LoRA to reduce training time and cost without compromising model quality.

This solution highlights how enterprises can effectively address common LLM fine-tuning challenges such as fragmented toolchains, high memory and compute requirements, multi-node scaling inefficiencies, and GPU underutilization.

By using the integrated Hugging Face and SageMaker architecture, businesses can now build and deploy customized, domain-specific models faster, with greater control, cost-efficiency, and scalability.

To get started with your own LLM fine-tuning project, explore the code samples provided in our GitHub repository.


About the Authors

Florent Gbelidji is a Machine Learning Engineer for Customer Success at Hugging Face. Based in Paris, France, Florent joined Hugging Face 3.5 years ago as an ML Engineer in the Expert Acceleration Program, helping companies build solutions with open source AI. He is now the Cloud Partnership Tech Lead for the AWS account, driving integrations between the Hugging Face environment and AWS services.

Bruno Pistone is a Senior Worldwide Generative AI/ML Specialist Solutions Architect at AWS based in Milan, Italy. He works with AWS product teams and large customers to help them fully understand their technical needs and design AI and machine learning solutions that take full advantage of the AWS cloud and the Amazon ML stack. His expertise includes distributed training and inference workloads, model customization, generative AI, and end-to-end ML. He enjoys spending time with friends, exploring new places, and traveling to new destinations.

Louise Ping is a Senior Worldwide GenAI Specialist, where she helps partners build go-to-market strategies and leads cross-functional initiatives to expand opportunities and drive adoption. Drawing from her diverse AWS experience across Storage, APN Partner Marketing, and AWS Marketplace, she works closely with strategic partners like Hugging Face to drive technical collaborations. When not working at AWS, she attempts home improvement projects, ideally with limited mishaps.

Safir Alvi is a Worldwide GenAI/ML Go-To-Market Specialist at AWS based in New York. He advises strategic global customers on scaling their model training and inference workloads on AWS, and drives adoption of Amazon SageMaker AI Training Jobs and Amazon SageMaker HyperPod. He specializes in optimizing and fine-tuning generative AI and machine learning models across diverse industries, including financial services, healthcare, automotive, and manufacturing.
