SK Telecom improves telco-specific Q&A by fine-tuning Anthropic’s Claude models in Amazon Bedrock
This post has been co-written with Seunghyun Jeong, Sunwoo Lee, and Eric Davis from SK Telecom.
SK Telecom (SKT), South Korea’s leading telecommunications company serving 30 million customers, is at the forefront of AI innovation. In line with its AI Pyramid Strategy, which aims to unlock AI’s potential for anyone, anywhere, anytime, SKT has collaborated with the AWS Generative AI Innovation Center (GenAIIC) Custom Model Program to explore domain-trained models using Amazon Bedrock for telco-specific use cases.
This collaboration aligns with SKT’s vision of using AI expertise and strategic partnerships to develop innovative AI-based products and services. One such initiative focused on developing a custom solution for grounded question answering (Q&A) based on reference documents.
Retrieval Augmented Generation (RAG) is a popular technique for Q&A tasks, offering improved factual accuracy and knowledge grounding. However, RAG poses two challenges for telco use cases: generated responses can miss the preferred tone, style, and manner, and retrieval of irrelevant documents can lead to inaccurate answers. To address this, SKT and AWS GenAIIC aimed to use model customization to improve Anthropic’s Claude models on Amazon Bedrock in three key areas:
- Providing concise and informative answers
- Correctly referencing links from retrieved documents
- Answering in a tone and style consistent with SKT and similar to ground truth answers
Additionally, the team explored boosting the performance of smaller models with synthetic data generated by larger, more capable large language models (LLMs), both for knowledge distillation and for scenarios with limited labeled training data.
Amazon Bedrock is a fully managed service that offers a variety of LLMs and foundation models (FMs), along with capabilities such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, and Amazon Bedrock Guardrails that can expedite many generative AI use cases. It is the only fully managed service that lets you fine-tune Anthropic’s Claude models, and it provides an intuitive and secure fine-tuning workflow. A fine-tuned Claude model can be deployed through Amazon Bedrock and used seamlessly with the service’s other capabilities, for example Amazon Bedrock Knowledge Bases for telco domain-specific RAG or Amazon Bedrock Agents for agentic use cases.
In this post, we share how SKT customizes Anthropic’s Claude models in Amazon Bedrock for telco-specific Q&A grounded in SKT’s technical telecommunications documents.
Solution overview
The team explored combinations of prompt optimization, customization (fine-tuning), and data augmentation with synthetic data. This multifaceted approach aimed to maximize the benefits of each technique for the grounded Q&A generation task.
In the following sections, we explore these methods in more detail.
Anthropic’s Claude customization with prompt optimization
Fine-tuning, which is available through Amazon Bedrock for various FMs, including Anthropic’s Claude, allows adaptation of pre-trained language models for specific use cases. It’s particularly effective for tailoring response style and format adherence.
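For reference, a minimal sketch of starting such a fine-tuning job with the AWS SDK for Python (Boto3) might look like the following. The S3 paths, role ARN, base model identifier, and hyperparameter values are placeholder assumptions, not SKT’s actual configuration; check the Amazon Bedrock documentation for the models and hyperparameters supported for fine-tuning in your Region.

```python
import boto3

# Minimal sketch: start a model customization (fine-tuning) job in Amazon Bedrock.
# All names, ARNs, and values below are illustrative placeholders.
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Training data is a JSONL file in S3. For Anthropic's Claude, each line pairs a
# system prompt with user/assistant turns, for example:
# {"system": "...", "messages": [{"role": "user", "content": "..."},
#                                {"role": "assistant", "content": "..."}]}
response = bedrock.create_model_customization_job(
    jobName="telco-qa-finetuning-job",
    customModelName="telco-qa-claude",
    roleArn="arn:aws:iam::111122223333:role/BedrockFineTuningRole",
    baseModelIdentifier="anthropic.claude-3-haiku-20240307-v1:0:200k",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://amzn-s3-demo-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://amzn-s3-demo-bucket/output/"},
    hyperParameters={
        "epochCount": "2",
        "batchSize": "4",
        "learningRateMultiplier": "1.0",
    },
)
print(response["jobArn"])  # Track job status with get_model_customization_job
```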
The team first optimized the system prompt, implementing standardized guidelines for answer formatting and document citation based on Anthropic’s prompting best practices. Key focus areas, illustrated in the prompt sketch after this list, included:
- Clear presentation of system commands
- Consistent use of code block formatting
- Context-based tailored responses
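The following is a hypothetical system prompt in the spirit of these guidelines. SKT’s production prompt is not published in this post, so the wording, rules, and the {retrieved_documents} placeholder are all illustrative assumptions.

```python
# A hypothetical system prompt illustrating the guideline areas above;
# this is not SKT's actual production prompt.
SYSTEM_PROMPT = """You are a customer support assistant for a telecommunications company.

Follow these rules when answering:
1. Answer concisely and informatively, using only facts from the retrieved documents.
2. Cite the source of each claim by copying the document's reference link verbatim.
3. If the documents do not contain the answer, say so instead of guessing.
4. Keep a polite, professional tone consistent with the company's style guide.

<documents>
{retrieved_documents}
</documents>"""
```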
This prompt engineering, combined with fine-tuning, yielded substantial improvements:
- Over 50% increase in ROUGE-3 score
- Over 25% improvement in ROUGE-L score
- Over 4% increase in embedding similarity score
- Significant progress in accurate reference citation
The iterative enhancement process demonstrated cumulative benefits, with prompt updates alone showing 35–40% improvements in key metrics and the final customized model achieving 50–60% gains in some metrics.
This progression clearly illustrates the cumulative benefits of model customization through RAG, prompt engineering, and fine-tuning, resulting in a model that significantly outperformed both the baseline and the prompt-updated version in ROUGE scores and citation accuracy. The ROUGE score measures similarity between ground truth and generated text by computing N-gram word overlap. The following table summarizes these improvements; the ROUGE-3, ROUGE-L, and citation accuracy columns show relative improvement over the baseline.
| LLM | Prompt update | Fine-tuning | ROUGE-3 | ROUGE-L | Citation accuracy |
| --- | --- | --- | --- | --- | --- |
| Anthropic’s Claude 3 Sonnet | – | – | Baseline | Baseline | Baseline |
| Anthropic’s Claude 3 Sonnet | ✓ | – | +38.30% | +13.40% | +52.94% |
| Anthropic’s Claude 3 Sonnet | ✓ | ✓ | +58.10% | +26.80% | +70.59% |
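To make the ROUGE metrics concrete, the following minimal sketch computes ROUGE-3 and ROUGE-L with the open source rouge-score package (a common choice, not part of Amazon Bedrock); the example strings are illustrative, not SKT evaluation data.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# ROUGE-3 counts overlapping word trigrams; ROUGE-L uses the longest common
# subsequence between the reference and the generated answer.
scorer = rouge_scorer.RougeScorer(["rouge3", "rougeL"], use_stemmer=True)

reference = "Restart the router, then check that the signal light turns green."
generated = "Please restart the router and check that the signal light turns green."

for metric, score in scorer.score(reference, generated).items():
    # Each score carries precision, recall, and F-measure.
    print(f"{metric}: F1={score.fmeasure:.3f}")
```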
Synthetic data for fine-tuning
To address the challenge of limited high-quality labeled training data, the team explored synthetic data generation techniques. This approach also facilitates knowledge distillation from larger LLMs to smaller, more targeted models, offering benefits such as lower latency and cost.
The team conducted controlled experiments using:
- A baseline set of 500 ground truth samples
- An augmented set of 500 original plus 1,500 synthetic samples
- A larger original set of 2,000 samples
Synthetic data was generated using Anthropic’s Claude 3 Sonnet, creating new question-answer pairs over the same retrieved documents used in the ground truth examples.
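A minimal sketch of this generation step, using the Amazon Bedrock Converse API, might look like the following; the model ID, prompt wording, and JSON output convention are illustrative assumptions rather than SKT’s actual pipeline.

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def generate_synthetic_qa(document: str) -> dict:
    """Ask Claude to write one new question-answer pair grounded in a document."""
    prompt = (
        "Based only on the document below, write one realistic customer question "
        "and a concise answer that cites the document's reference link. "
        'Respond with JSON containing "question" and "answer" keys only.\n\n'
        f"<document>\n{document}\n</document>"
    )
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.7},
    )
    # In practice the JSON should be validated before being added to training data.
    return json.loads(response["output"]["message"]["content"][0]["text"])
```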
The results were evaluated using both LLM-based comparison and human preference evaluation. Human evaluators blindly ranked model outputs, with scores assigned based on preference (Best: 4, Second: 3, Third: 2, Worst: 1). The following table shows the results of the human preference evaluation scores.
| Rank | Model | Cumulative score (best possible: 160) |
| --- | --- | --- |
| 1 | Fine-tuned with 2,000 original samples | 114 |
| 2 | Fine-tuned with 500 original and 1,500 synthetic samples | 112 |
| 3 | Fine-tuned with 500 original samples | 85 |
| 4 | No fine-tuning (baseline) | 84 |
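For illustration, cumulative scores of this kind can be computed from per-question rankings with a few lines of Python. The rankings below are made-up placeholders; a best possible score of 160 at 4 points per question implies 40 evaluated questions.

```python
# Map each rank to its preference score (Best: 4 ... Worst: 1).
RANK_TO_SCORE = {1: 4, 2: 3, 3: 2, 4: 1}

# One dict per evaluated question: model name -> rank assigned by the evaluator.
# These rankings are made-up placeholders, not the actual evaluation data.
rankings = [
    {"2000-original": 1, "500-original+1500-synthetic": 2, "500-original": 3, "baseline": 4},
    {"500-original+1500-synthetic": 1, "2000-original": 2, "baseline": 3, "500-original": 4},
    # ... 40 questions in total gives a best possible cumulative score of 160
]

cumulative = {}
for ranking in rankings:
    for model, rank in ranking.items():
        cumulative[model] = cumulative.get(model, 0) + RANK_TO_SCORE[rank]

for model, score in sorted(cumulative.items(), key=lambda item: -item[1]):
    print(f"{model}: {score}")
```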
Some key findings include:
- Small training sets (500 samples) showed minimal improvement over baseline
- Larger training sets (2,000 samples) scored considerably higher
- Synthetically augmented data performed similarly to equivalent-sized original data
Although a large volume of domain-specific training data is ideal, many businesses have only limited labeled datasets. In such scenarios, synthetic data can stand in for original data, and these results demonstrate its potential for model customization.
Conclusion
SK Telecom’s collaboration with AWS GenAIIC showcases the company’s commitment to developing innovative AI solutions for telco challenges. By using Amazon Bedrock to customize Anthropic’s Claude models, SKT has achieved significant performance improvements for telco-specific, Korean-language use cases without building models from scratch. The proof of concept demonstrated:
- ~58% increase in ROUGE-3 score
- ~27% increase in ROUGE-L score
- Substantial improvement in returning correct reference links
This approach, combined with synthetic data generation techniques, aligns with SKT’s AI Pyramid Strategy, enabling faster testing and development of new approaches. As SKT continues to focus on key areas such as personal AI assistants, AI healthcare, and AI data centers, this collaboration with AWS represents a significant step in their AI evolution and long-term competitiveness in the global AI landscape.
For those interested in working with AWS on similar projects, visit Generative AI Innovation Center.