The AWS AI League, launched by Amazon Web Services (AWS), expanded its reach to the Association of Southeast Asian Nations (ASEAN) last year, welcoming student participants from Singapore, Indonesia, Malaysia, Thailand, Vietnam, and the Philippines. The goal was to introduce students of all backgrounds and experience levels to the exciting world of generative AI through a gamified, hands-on challenge focused on fine-tuning large language models (LLMs).
In this blog post, you'll hear directly from the AWS AI League champion, Blix D. Foryasen, as he shares his reflections on the challenges, breakthroughs, and key lessons discovered throughout the competition.
Behind the competition
The AWS AI League competition began with a tutorial session led by the AWS team and the Gen-C Generative AI Learning Community, featuring two powerful, user-friendly services: Amazon SageMaker JumpStart and PartyRock.
- SageMaker JumpStart enabled participants to run the LLM fine-tuning process in a cloud-based environment, offering the flexibility to adjust hyperparameters and optimize performance.
- PartyRock, powered by Amazon Bedrock, provided an intuitive playground and interface to curate the dataset used in fine-tuning a Llama 3.2 3B Instruct model. Amazon Bedrock offers a comprehensive selection of high-performing foundation models from leading AI companies, including Anthropic Claude, Meta Llama, Mistral, and more, all accessible through a single API.
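To make the SageMaker JumpStart side of this concrete, here is a minimal sketch of what launching a single fine-tuning job for a Llama 3.2 3B Instruct model can look like with the SageMaker Python SDK. The model ID, instance type, and S3 path are illustrative assumptions rather than the exact competition setup.

```python
# Minimal sketch: one LLM fine-tuning job on SageMaker JumpStart.
# The model ID, instance type, and S3 path are illustrative assumptions.
from sagemaker import hyperparameters
from sagemaker.jumpstart.estimator import JumpStartEstimator

model_id = "meta-textgeneration-llama-3-2-3b-instruct"  # assumed JumpStart model ID

# Inspect the tunable knobs (epochs, learning rate, LoRA settings, and so on).
print(hyperparameters.retrieve_default(model_id=model_id, model_version="*"))

estimator = JumpStartEstimator(
    model_id=model_id,
    instance_type="ml.g5.2xlarge",        # assumed training instance
    environment={"accept_eula": "true"},  # Llama models require accepting the EULA
)
estimator.set_hyperparameters(epoch="2", learning_rate="0.0003")

# The training channel points at an instruction-tuning dataset in JSONL format.
estimator.fit({"training": "s3://my-bucket/ai-league/train.jsonl"})
```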
With the goal of outperforming a larger LLM reference model in a quiz-based evaluation, participants engaged with three core domains of generative AI: foundation models, responsible AI, and prompt engineering. The preliminary round featured an open leaderboard ranking the best-performing fine-tuned models from across the region. Each submitted model was tested against a larger baseline LLM using an automated, quiz-style evaluation of generative AI-related questions. The evaluation, conducted by an undisclosed LLM judge, prioritized both accuracy and comprehensiveness. A model's win rate improved each time it outperformed the baseline LLM.
Beyond its technical nature, the challenge required strategic planning. Participants had to maximize their limited training hours on SageMaker JumpStart while carefully managing a limited number of leaderboard submissions. Initially capped at 5 hours, the training limit was later expanded to 30 hours in response to community feedback. Submission count would also influence tiebreakers for finalist selection.
The top tuner from each country advanced to the Regional Grand Finale, held on May 29, 2025, in Singapore. There, finalists competed head-to-head, each presenting their fine-tuned model's responses to a new set of questions. Final scores were determined by a weighted judging system:
- 40% by an LLM-as-a-judge
- 40% by expert judges
- 20% by a live audience
A practical approach to fine-tuning
Before diving into the technical details, a quick disclaimer: the approaches shared in the following sections are largely experimental and born from trial and error. They're not necessarily the most optimal methods for fine-tuning, nor do they represent a definitive guide. Other finalists took different approaches because of their different technical backgrounds. What ultimately helped me succeed wasn't just technical precision, but collaboration, resourcefulness, and a willingness to explore how the competition might unfold based on insights from earlier iterations. I hope this account can serve as a baseline or inspiration for future participants who might be navigating similar constraints. Even if you're starting from scratch, as I did, there's real value in being strategic, curious, and community-driven.
One of the biggest hurdles I faced was time, or the lack of it. Because of a late confirmation of my participation, I joined the competition 2 weeks after it had already begun. That left me with only 2 weeks to plan, train, and iterate. Given the tight timeline and limited compute hours on SageMaker JumpStart, I knew I had to make every training session count. Rather than attempting exhaustive experiments, I focused my efforts on curating a strong dataset and tweaking select hyperparameters. Along the way, I drew inspiration from academic papers and existing approaches to LLM fine-tuning, adjusting what I could within the constraints.
Crafting synthetic brilliance
As mentioned earlier, one of the key learning sessions at the start of the competition introduced participants to SageMaker JumpStart and PartyRock, tools that make fine-tuning and synthetic data generation both accessible and intuitive. Specifically, PartyRock allowed us to clone and customize apps to control how synthetic datasets were generated. We could tweak parameters such as the prompt structure, creativity level (temperature), and token sampling strategy (top-p). PartyRock also gave us access to a wide range of foundation models. From the start, I opted to generate my datasets using Claude 3.5 Sonnet, aiming for broad and balanced coverage across all three core sub-domains of the competition. To minimize bias and enforce fair representation across topics, I curated several dataset versions, each ranging from 1,500 to 12,000 Q&A pairs, carefully maintaining balanced distributions across sub-domains. The following are a few example themes that I focused on:
- Prompt engineering: Zero-shot prompting, chain-of-thought (CoT) prompting, evaluating prompt effectiveness
- Foundation models: Transformer architectures, distinctions between pretraining and fine-tuning
- Responsible AI: Dataset bias, representation fairness, and data protection in AI systems
To maintain data quality, I fine-tuned the dataset generator to emphasize factual accuracy, uniqueness, and applied knowledge. Each generation batch consisted of 10 Q&A pairs, with prompts specifically designed to encourage depth and clarity.
Question prompt:
Answer prompt:
Answer prompt examples:
For question generation, I set the temperature to 0.7, favoring creative and novel phrasing without drifting too far from factual grounding. For answer generation, I used a lower temperature of 0.2, targeting precision and correctness. In both cases, I applied top-p = 0.9, allowing the model to sample from a focused yet diverse range of likely tokens, encouraging nuanced outputs. One important strategic assumption I made throughout the competition was that the evaluator LLM would favor more structured, informative, and complete responses over overly creative or brief ones. To align with this, I included reasoning steps in my answers to make them longer and more comprehensive. Research has shown that LLM-based evaluators often score detailed, well-explained answers higher, and I leaned into that insight during dataset generation.
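PartyRock itself is a no-code playground, but the same generation recipe can be sketched directly against Amazon Bedrock. The snippet below is a minimal illustration of the two-temperature setup described above, using the Bedrock Converse API with Claude 3.5 Sonnet; the prompts, model ID, and batch size are assumptions for illustration, not the exact PartyRock app configuration.

```python
# Minimal sketch: generating a small batch of Q&A pairs with Claude 3.5 Sonnet on Bedrock,
# using temperature 0.7 for questions, 0.2 for answers, and top-p 0.9 for both.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # assumed Bedrock model ID

def generate(prompt: str, temperature: float) -> str:
    """Single model call with the given creativity level and top-p = 0.9."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": temperature, "topP": 0.9, "maxTokens": 2048},
    )
    return response["output"]["message"]["content"][0]["text"]

# Questions: higher temperature for novel phrasing.
questions_raw = generate(
    "Write 10 exam-style questions covering prompt engineering, foundation models, "
    "and responsible AI. Return them as a JSON list of strings.",
    temperature=0.7,
)

# Answers: lower temperature for precision, with explicit reasoning steps.
qa_pairs = []
for question in json.loads(questions_raw):
    answer = generate(
        "Answer the following question accurately and explain your reasoning step by step:\n"
        + question,
        temperature=0.2,
    )
    qa_pairs.append({"question": question, "answer": answer})

print(json.dumps(qa_pairs[0], indent=2))
```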
Refining the submissions
SageMaker JumpStart offers a wide array of hyperparameters to configure, which can feel overwhelming, especially when you're racing against time and unsure of what to prioritize. Fortunately, the organizers emphasized focusing primarily on epochs and learning rate, so I honed in on these variables. Each training job with a single epoch took roughly 10–15 minutes, making time management critical. To avoid wasting valuable compute hours, I began with a baseline dataset of 1,500 rows to test combinations of epochs and learning rates. I explored:
- Epochs: 1 to 4
- Learning rates: 0.0001, 0.0002, 0.0003, and 0.0004
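Under the same assumptions as the earlier JumpStart sketch, that small sweep can be expressed as a loop over the epoch and learning-rate grid. In practice, each epoch consumes 10–15 minutes of the limited training budget, so only a subset of the grid was realistically explored.

```python
# Minimal sketch of the epoch / learning-rate sweep on the 1,500-row baseline dataset.
# The model ID, instance type, and data path are illustrative assumptions.
from itertools import product
from sagemaker.jumpstart.estimator import JumpStartEstimator

EPOCHS = [1, 2, 3, 4]
LEARNING_RATES = [0.0001, 0.0002, 0.0003, 0.0004]
TRAIN_DATA = "s3://my-bucket/ai-league/train-1500.jsonl"  # hypothetical baseline dataset

for epoch, lr in product(EPOCHS, LEARNING_RATES):
    estimator = JumpStartEstimator(
        model_id="meta-textgeneration-llama-3-2-3b-instruct",  # assumed model ID
        instance_type="ml.g5.2xlarge",                          # assumed instance
        environment={"accept_eula": "true"},
    )
    estimator.set_hyperparameters(epoch=str(epoch), learning_rate=str(lr))
    # Each fit call blocks until the job finishes, so the full 16-combination grid
    # is rarely affordable; only a handful of combinations were actually run.
    estimator.fit({"training": TRAIN_DATA})
```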
After several iterations, the combination of 2 epochs and a learning rate of 0.0003 yielded the best result, achieving a 53% win rate on my 13th leaderboard submission. Encouraged by this, I continued using this combination for several subsequent experiments, even as I expanded my dataset. Initially, this strategy seemed to work. With a dataset of roughly 3,500 rows, my model reached a 57% win rate by my 16th submission. However, as I further increased the dataset to 5,500, 6,700, 8,500, and eventually 12,000 rows, my win rate steadily declined to 53%, 51%, 45%, and 42% respectively. At that point, it was clear that simply increasing dataset size wasn't enough; in fact, it might have been counterproductive without revisiting the hyperparameters. With only 5 training hours remaining and 54 submissions logged, I found myself stuck at 57%, while peers like the top tuner from the Philippines were already reaching a 71% win rate.
Lessons from the field
With limited time left, both for training and leaderboard submissions, I turned to cross-country collaboration for help. One of the most insightful conversations I had was with Michael Ismail Febrian, the top tuner from Indonesia and the highest scorer in the elimination round. He encouraged me to explore LoRA (low-rank adaptation) hyperparameters, specifically:
- lora_r
- lora_alpha
- target_modules
Michael also suggested enriching my dataset by using API-generated responses from more capable teacher models, specifically for answering the PartyRock-generated questions. Looking back at my existing fine-tuning pipeline, I noticed a critical weakness: the generated answers were often too concise or shallow. Here's an example of a typical Q&A pair from my earlier dataset:
While this structure is clean and organized, it lacked deeper explanation for each point, something models like ChatGPT and Gemini typically do well. I suspect this limitation came from token constraints when generating multiple responses in bulk. In my case, I generated 10 responses at a time in JSONL format under a single prompt, which might have led PartyRock to truncate outputs. Not wanting to spend on paid APIs, I discovered OpenRouter.ai, which offers limited access to large models, albeit rate-limited. With a cap of roughly 200 Q&A pairs per day per account, I got creative and created multiple accounts to support my expanded dataset. My teacher model of choice was DeepSeek R1, a popular option known for its effectiveness in training smaller, specialized models. It was a bit of a gamble, but one that paid off in terms of output quality.
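Because OpenRouter exposes an OpenAI-compatible endpoint, the teacher-model step can be sketched with the standard OpenAI Python client. The model slug, environment variable, and file names below are assumptions for illustration; given the rate limits, a run like this yields only a couple of hundred pairs per account per day.

```python
# Minimal sketch: answering PartyRock-generated questions with DeepSeek R1 via OpenRouter,
# then appending instruction/response pairs to a JSONL training file.
import json
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # hypothetical environment variable
)

with open("questions.txt") as questions, open("train.jsonl", "a") as out:
    for question in questions:
        question = question.strip()
        completion = client.chat.completions.create(
            model="deepseek/deepseek-r1",  # assumed OpenRouter model slug
            messages=[
                {"role": "system", "content": "Answer accurately and explain your reasoning step by step."},
                {"role": "user", "content": question},
            ],
            temperature=0.2,
        )
        answer = completion.choices[0].message.content
        # One instruction/response pair per line, matching the fine-tuning JSONL format.
        out.write(json.dumps({"instruction": question, "response": answer}) + "\n")
```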
As for LoRA tuning, here's what I learned:
- lora_r and lora_alpha decide how much new knowledge, and how complex, the model can absorb. A common rule of thumb is setting lora_alpha to 1x or 2x of lora_r.
- target_modules defines which parts of the model are updated, typically the attention layers or the feed-forward network.
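Expressed as a Hugging Face PEFT configuration, the rule of thumb above looks roughly like the sketch below. The specific values are illustrative; in the competition these knobs were exposed as SageMaker JumpStart hyperparameters rather than a hand-written config.

```python
# Minimal sketch of a LoRA configuration following the lora_alpha = 2x lora_r rule of thumb.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                # lora_r: rank of the low-rank update matrices
    lora_alpha=128,      # 2x lora_r, per the common rule of thumb
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```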
I also consulted Kim, the top tuner from Vietnam, who flagged my 0.0003 learning rate as potentially too high. He, along with Michael, suggested a different strategy: increase the number of epochs and reduce the learning rate. This would allow the model to better capture complex relationships and subtle patterns, especially as dataset size grows. Our conversations underscored a hard-learned truth: data quality matters more than data quantity. There's a point of diminishing returns when increasing dataset size without adjusting hyperparameters or validating quality, something I experienced directly. In hindsight, I realized I had underestimated how essential fine-grained hyperparameter tuning is, especially when scaling data. More data demands more precise tuning to match the growing complexity of what the model needs to learn.
Last-minute gambits
Armed with fresh insights from my collaborators and hard-won lessons from earlier iterations, I knew it was time to pivot my entire fine-tuning pipeline. The most significant change was in how I generated my dataset. Instead of using PartyRock to produce both questions and answers, I opted to generate only the questions in PartyRock, then feed those prompts into the DeepSeek-R1 API to generate high-quality responses. Each answer was saved in JSONL format and, crucially, included detailed reasoning. This shift significantly increased the depth and length of each answer, averaging around 900 tokens per response, compared to the much shorter outputs from PartyRock. Given that my earlier dataset of roughly 1,500 high-quality rows produced promising results, I stuck with that size for my final dataset. Rather than scale up in quantity, I doubled down on quality and complexity. For this final round, I made bold, blind tweaks to my hyperparameters:
- Dropped the learning rate to 0.00008
- Increased the LoRA parameters: lora_r = 256 and lora_alpha = 256
- Expanded the LoRA target modules to cover both the attention and feed-forward layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
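Put together, the final configuration looks roughly like the sketch below, expressed as SageMaker JumpStart hyperparameters. The exact hyperparameter names exposed by the Llama 3.2 3B Instruct training recipe may differ; the names, model ID, and data path here are assumptions for illustration.

```python
# Minimal sketch of the final training run: low learning rate, larger LoRA capacity,
# and LoRA applied to both attention and feed-forward projections.
from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-2-3b-instruct",  # assumed model ID
    instance_type="ml.g5.2xlarge",                          # assumed instance
    environment={"accept_eula": "true"},
)

estimator.set_hyperparameters(
    epoch="3",                 # first of the two final runs (the second used 4)
    learning_rate="0.00008",   # much lower rate for the reasoning-rich dataset
    lora_r="256",
    lora_alpha="256",
    target_modules="q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj",
)

estimator.fit({"training": "s3://my-bucket/ai-league/train-final-1500.jsonl"})  # hypothetical path
```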
These changes were made with one assumption: longer, more complex answers require more capacity to absorb and generalize nuanced patterns. I hoped that these settings would let the model fully use the high-quality, reasoning-rich data from DeepSeek-R1. With only 5 hours of training time remaining, I had just enough for two full training runs, each using a different epoch setting (3 and 4). It was a make-or-break moment. If the first run underperformed, I had one last chance to redeem it. Thankfully, my first test run achieved a 65% win rate, a massive improvement, but still behind the current leader from the Philippines and trailing Michael's impressive 89%. Everything now hinged on my final training job. It had to run smoothly, avoid errors, and outperform everything I had tried before. And it did. That final submission achieved a 77% win rate, pushing me to the top of the leaderboard and securing my slot for the Grand Finale. After weeks of experimentation, sleepless nights, setbacks, and late-game adjustments, the journey from a two-week-late entrant to national champion was complete.
What I wish I had known sooner
I won't pretend that my success in the elimination round was purely technical; luck played a big part. Still, the journey revealed several insights that could save future participants valuable time, training hours, and submissions. Here are some key takeaways I wish I had known from the start:
- Quality is more important than quantity: More data doesn't always mean better results. Whether you're adding rows or increasing context length, you're also increasing the complexity the model must learn from. Focus on crafting high-quality, well-structured examples rather than blindly scaling up.
- Fast learner versus slow learner: Even if you're avoiding deep dives into LoRA or other advanced tweaks, understanding the trade-off between learning rate and epochs is critical. A higher learning rate with fewer epochs might converge faster, but can miss the subtle patterns captured by a lower learning rate over more epochs. Choose carefully based on your data's complexity.
- Don't neglect hyperparameters: One of my biggest missteps was treating hyperparameters as static, regardless of changes in dataset size or complexity. As your data evolves, your model settings should too. Hyperparameters should scale with your data.
- Do your homework: Avoid excessive guesswork by reading relevant research papers, documentation, or blog posts. Late in the competition, I stumbled upon helpful resources that I could have used to make better decisions earlier. A little reading can go a long way.
- Track everything: When experimenting, it's easy to forget what worked and what didn't. Keep a log of your datasets, hyperparameter combinations, and performance results; it helps optimize your runs and aids in debugging. A minimal example of such a log follows this list.
- Collaboration is a superpower: While it's a competition, it's also a chance to learn. Connecting with other participants, whether they're ahead of or behind you, gave me invaluable insights. You might not always walk away with a trophy, but you'll leave with knowledge, relationships, and real growth.
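As a small illustration of the tracking habit, here is a minimal sketch of an experiment log: one JSON line per run with the dataset, hyperparameters, and resulting win rate. The file name and fields are assumptions, not a tool used during the competition.

```python
# Minimal sketch: append one record per experiment so results stay comparable across submissions.
import json
from datetime import datetime, timezone

def log_run(dataset: str, hyperparams: dict, win_rate: float, path: str = "runs.jsonl") -> None:
    """Append a single experiment record to a JSONL log file."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset,
        "hyperparameters": hyperparams,
        "win_rate": win_rate,
    }
    with open(path, "a") as log_file:
        log_file.write(json.dumps(record) + "\n")

# Example entry matching one of the submissions described earlier.
log_run("train-3500.jsonl", {"epoch": 2, "learning_rate": 0.0003}, win_rate=0.57)
```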
Grand Finale
The Grand Finale took place on the second day of the National AI Student Challenge, serving as the culmination of weeks of experimentation, strategy, and collaboration. Before the final showdown, all national champions had the opportunity to take part in the AI Student Developer Conference, where we shared insights, exchanged lessons, and built connections with fellow finalists from across the ASEAN region. During our conversations, I was struck by how remarkably similar many of our fine-tuning strategies were. Across the board, participants had used a mix of external APIs, dataset curation methods, and cloud-based training tools like SageMaker JumpStart. It became clear that tool selection and creative problem-solving played just as big a role as raw technical knowledge. One particularly eye-opening insight came from a finalist who achieved an 85% win rate despite using a large dataset, something I had initially assumed might hurt performance. Their secret was training over a higher number of epochs while maintaining a lower learning rate of 0.0001. However, this came at the cost of longer training times and fewer leaderboard submissions, which highlights an important trade-off:
With enough training time, a carefully tuned model, even one trained on a large dataset, can outperform faster, leaner models.
This reinforced a powerful lesson: there is no single correct approach to fine-tuning LLMs. What matters most is how well your strategy aligns with the time, tools, and constraints at hand.
Preparing for battle
In the lead-up to the Grand Finale, I stumbled upon a blog post by Ray Goh, the very first champion of the AWS AI League and one of the mentors behind the competition's tutorial sessions. One detail caught my attention: the final question from his year was a variation of the infamous Strawberry Problem, a deceptively simple challenge that exposes how LLMs struggle with character-level reasoning.
How many letter Es are there in the words 'DeepRacer League'?
At first glance, this seems trivial. But for an LLM, the task isn't as straightforward. LLMs typically tokenize words in chunks, meaning that DeepRacer might be split into Deep and Racer, or even into subword units like Dee, pRa, and cer. These tokens are then converted into numerical vectors, obscuring the individual characters within. It's like asking someone to count the threads in a rope without unraveling it first.
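You can see this directly by inspecting how a byte-pair-encoding tokenizer splits the phrase. The sketch below uses the publicly available GPT-2 tokenizer purely as an illustration; Llama models use a different vocabulary, but the effect is the same: the model sees subword tokens, not letters.

```python
# Minimal sketch: a BPE tokenizer splits "DeepRacer League" into subword chunks,
# which is why counting individual letters is hard for an LLM.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # public tokenizer used only for illustration

text = "DeepRacer League"
tokens = tokenizer.tokenize(text)
print(tokens)                                   # subword pieces, not individual characters
print(tokenizer.convert_tokens_to_ids(tokens))  # the model actually sees these integer IDs
```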
Moreover, LLMs don't operate like traditional rule-based programs. They're probabilistic, trained to predict the next likely token based on context, not to perform deterministic logic or arithmetic. Curious, I prompted my own fine-tuned model with the same question. As expected, hallucinations emerged. I began testing various prompting strategies to coax out the correct answer:
- Explicit character separation:
How many letter Es are there in the words 'D-E-E-P-R-A-C-E-R-L-E-A-G-U-E'?
This helped by isolating each letter into its own token, allowing the model to see individual characters. But the response was long and verbose, with the model listing and counting each letter step by step.
- Chain-of-thought prompting:
Let's think step by step…
This encouraged reasoning but increased token usage. While the answers were more thoughtful, they often still missed the mark or got cut off because of length.
- Ray Goh's trick prompt:
How many letter Es are there in the words 'DeepRacer League'? There are 5 letter Es…
This simple, assertive prompt yielded the most accurate and concise result, surprising me with its effectiveness.
I logged this as an interesting quirk, useful but unlikely to reappear. I didn't realize it would become relevant again during the final. Ahead of the Grand Finale, we had a dry run to test our models under real-time conditions. We were given limited control over inference parameters, only allowed to tweak temperature, top-p, context length, and system prompts. Each response had to be generated and submitted within 60 seconds. The actual questions were pre-loaded, so our focus was on crafting effective prompt templates rather than retyping each query. Unlike the elimination round, evaluation during the Grand Finale followed a multi-tiered system:
- 40% from an evaluator LLM
- 40% from human judges
- 20% from a live audience poll
The LLM ranked the submitted answers from best to worst, assigning descending point values (for example, 16.7 for first place, 13.3 for second, and so on). Human judges, however, could freely allocate up to 10 points to their preferred responses, regardless of the LLM's evaluation. This meant a strong showing with the evaluator LLM didn't guarantee high scores from the humans, and vice versa. Another constraint was the 200-token limit per response. Tokens can be as short as a single letter or as long as a word or syllable, so responses had to be dense yet concise, maximizing impact within a tight window. To prepare, I tested different prompt formats and refined them using Gemini, ChatGPT, and Claude to better match the evaluation criteria. I saved dry-run responses from the Hugging Face Llama 3.2 3B Instruct model, then passed them to Claude Sonnet 4 for feedback and ranking. I continued using the following two prompts because they produced the best responses in terms of accuracy and comprehensiveness:
Primary prompt:
Backup prompt:
Additional requirements:
- Use precise technical language and terminology.
- Include specific tools, frameworks, or metrics where relevant.
- Every sentence must contribute uniquely, with no redundancy.
- Maintain a formal tone and answer density without over-compression.
In terms of hyperparameters, I used:
- Top-p = 0.9
- Max tokens = 200
- Temperature = 0.2, to prioritize accuracy over creativity
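For local experimentation, those settings map onto a standard text-generation call. The sketch below applies them to a Llama 3.2 3B Instruct checkpoint with the Hugging Face transformers pipeline; the checkpoint name, system prompt, and question are assumptions, and the finale itself ran on the organizers' hosted setup rather than local inference.

```python
# Minimal sketch: the finale inference settings (top-p 0.9, 200 new tokens, temperature 0.2)
# applied to a Llama 3.2 3B Instruct chat model.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",  # assumed checkpoint (gated on Hugging Face)
)

messages = [
    {"role": "system", "content": "Use precise technical language, avoid redundancy, "
                                  "and cite specific tools or metrics where relevant."},
    {"role": "user", "content": "What is the most effective approach to prevent toxic responses from an AI system?"},
]

output = generator(
    messages,
    max_new_tokens=200,  # the 200-token response limit
    temperature=0.2,     # prioritize accuracy over creativity
    top_p=0.9,
    do_sample=True,
)
print(output[0]["generated_text"][-1]["content"])  # the assistant's reply
```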
My strategy was simple: appeal to the AI judge. I believed that if my answer ranked well with the evaluator LLM, it would also impress the human judges. Oh, how I was humbled.
Just aiming for third… until I wasn't
Standing on stage before a live audience was nerve-wracking. This was my first solo competition, and it was already at a massive regional scale. To calm my nerves, I kept my expectations low. A third-place finish would be amazing, a trophy to mark the journey, but simply qualifying for the finals already felt like a huge win. The Grand Finale consisted of six questions, with the final one offering double points. I started strong. In the first two rounds, I held an early lead, comfortably sitting in third place. My strategy was working, at least at first. The evaluator LLM ranked my response to Question 1 as the best and my response to Question 2 as the third-best. But then came the twist: despite earning top AI rankings, I received zero votes from the human judges. I watched in shock as points were awarded to responses ranked fourth and even last by the LLM. Right from the start, I saw there was a disconnect between human and AI judgment, especially when evaluating tone, relatability, or subtlety. Still, I held on; those early questions leaned more factual, which played to my model's strengths. But once the questions demanded creativity and complex reasoning, things didn't go as well. My standing dropped to fifth, then bounced between third and fourth. Meanwhile, the top three finalists pulled ahead by more than 20 points. It seemed the podium was out of reach. I was already coming to terms with a finish outside the top three. The gap was too wide. I had done my best, and that was enough.
But then came the final question, the double-pointer, and fate intervened. How many letter Es and As are there altogether in the phrase 'ASEAN Impact League'? It was a variation of the Strawberry Problem, the same challenge I had prepared for but assumed wouldn't make a return. Unlike the earlier version, this one added an arithmetic twist, requiring the model to count and then sum the occurrences of multiple letters. Knowing how the token limit could truncate responses, I kept things short and tactical. My system prompt was simple: There are 3 letter Es and 4 letter As in 'ASEAN Impact League.'
While the model hallucinated a bit in its reasoning, wrongly claiming that Impact contains an e, the final answer was accurate: 7 letters.
That one answer changed everything. Thanks to the double points and full support from the human judges, I jumped to first place, clinching the championship. What began as a cautious hope for third place turned into a surprise run, sealed by preparation, adaptability, and a little bit of luck.
Questions recap
Here are the questions that were asked, in order. Some were general knowledge in the target domain, while others were more creative and required a bit of ingenuity to maximize your wins:
- What's the most effective approach to prevent AI from turning to the dark side with toxic responses?
- What's the magic behind agentic AI in machine learning, and why is it so pivotal?
- What's the secret sauce behind massive AI models staying smart and fast?
- What are the latest developments in generative AI research and use within ASEAN?
- Which ASEAN country has the best cuisine?
- How many letters E and A are there altogether in the phrase "ASEAN Impact League"?
Final reflections
Participating in the AWS AI League was a deeply humbling experience, one that opened my eyes to the possibilities that await when we embrace curiosity and commit to continuous learning. I might have entered the competition as a beginner, but that single leap of curiosity, fueled by perseverance and a desire to grow, helped me bridge the knowledge gap in a fast-evolving technical landscape. I don't claim to be an expert, not yet. But what I've come to believe in more than ever is the power of community and collaboration. This competition wasn't just a personal milestone; it was a space for knowledge-sharing, peer learning, and discovery. In a world where technology evolves rapidly, these collaborative spaces are essential for staying grounded and moving forward. My hope is that this post and my journey will inspire students, developers, and curious minds to take that first step, whether it's joining a competition, contributing to a community, or tinkering with new tools. Don't wait to be ready. Start where you are, and grow along the way. I'm excited to connect with more passionate individuals in the global AI community. If another LLM League comes around, maybe I'll see you there.
Conclusion
As we conclude this look at Blix's journey to becoming the AWS AI League ASEAN champion, we hope his story inspires you to explore the exciting possibilities at the intersection of AI and innovation. Discover the AWS services that powered this competition: Amazon Bedrock, Amazon SageMaker JumpStart, and PartyRock, and visit the official AWS AI League page to join the next generation of AI innovators.
The content and opinions in this post are those of the third-party author, and AWS is not responsible for the content or accuracy of this post.
About the authors
Noor Khan is a Solutions Architect at AWS supporting Singapore's public sector education and research landscape. She works closely with academic and research institutions, leading technical engagements and designing secure, scalable architectures. As part of the core AWS AI League team, she architected and built the backend for the platform, enabling customers to explore real-world AI use cases through gamified learning. Her passions include AI/ML, generative AI, web development, and empowering women in tech!
Vincent Oh is the Principal Solutions Architect for Data & AI at AWS. He works with public sector customers across ASEAN, owning technical engagements and helping them design scalable cloud solutions. He created the AI League while helping customers harness the power of AI in their use cases through gamified learning. He also serves as an Adjunct Professor at Singapore Management University (SMU), teaching computer science modules in the School of Computing & Information Systems (SCIS). Prior to joining Amazon, he worked as Senior Principal Digital Architect at Accenture and Cloud Engineering Practice Lead at UST.
Blix Foryasen is a Computer Science student specializing in Machine Learning at National University – Manila. He is passionate about data science, AI for social good, and civic technology, with a strong focus on solving real-world problems through competitions, research, and community-driven innovation. Blix is also deeply engaged with emerging technology trends, particularly in AI and its evolving applications across industries, especially finance, healthcare, and education.
