Tracking and managing assets used in AI development with Amazon SageMaker AI 

Building custom foundation models requires coordinating multiple assets across the development lifecycle, such as data assets, compute infrastructure, model architectures and frameworks, lineage, and production deployments. Data scientists create and refine training datasets, develop custom evaluators to assess model quality and safety, and iterate through fine-tuning configurations to optimize…

Shared by AWS Machine Learning December 17, 2025

How Tata Power CoE built a scalable AI-powered solar panel inspection solution with Amazon SageMaker AI and Amazon Bedrock

This post is co-written with Vikram Bansal from Tata Power, and with Gaurav Kankaria and Omkar Dhavalikar from Oneture. The global adoption of solar energy is rapidly increasing as organizations and individuals transition to renewable energy sources. India is on the brink of a solar energy revolution, with a national goal…

Shared by AWS Machine Learning December 16, 2025

Operationalize generative AI workloads and scale to hundreds of use cases with Amazon Bedrock – Part 1: GenAIOps

Enterprise organizations are rapidly moving beyond generative AI experiments to production deployments and complex agentic AI solutions, facing new challenges in scaling, security, governance, and operational efficiency. This blog post series introduces generative AI operations (GenAIOps), the application of DevOps principles to generative AI solutions, and demonstrates how to…

Shared by AWS Machine Learning December 15, 2025

Customize agent workflows with advanced orchestration techniques using Strands Agents

Large language model (LLM) agents have revolutionized how we approach complex, multi-step tasks by combining the reasoning capabilities of foundation models with specialized tools and domain expertise. While single-agent systems using frameworks like ReAct work well for straightforward tasks, real-world challenges often require multiple specialized agents working in coordination.

Shared by AWS Machine Learning December 15, 2025
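The coordination pattern this post describes can be sketched in a few lines: an orchestrator routes each task to a specialized agent. This is a minimal illustration only; the agents here are stub functions and the keyword router is a stand-in for an LLM-based classifier, not the Strands Agents API itself.

```python
# Toy sketch of orchestrated multi-agent routing. Agent internals are
# stubbed with plain functions; a real system would back each with a
# foundation model and tools (for example via the Strands Agents SDK).

def research_agent(task: str) -> str:
    # Stub: a real agent would call an LLM with retrieval tools.
    return f"[research] findings for: {task}"

def coding_agent(task: str) -> str:
    # Stub: a real agent would call an LLM with code-execution tools.
    return f"[code] implementation for: {task}"

AGENTS = {"research": research_agent, "code": coding_agent}

def orchestrate(task: str) -> str:
    """Route a task to a specialist. This keyword heuristic is purely
    illustrative; a production router would use an LLM classifier."""
    route = "code" if any(k in task.lower() for k in ("implement", "write", "fix")) else "research"
    return AGENTS[route](task)

print(orchestrate("Implement a retry decorator"))   # routed to coding_agent
print(orchestrate("Summarize recent LLM papers"))   # routed to research_agent
```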

Adaptive infrastructure for foundation model training with elastic training on SageMaker HyperPod

Modern AI infrastructure serves multiple concurrent workloads on the same cluster, from foundation model (FM) pre-training and fine-tuning to production inference and evaluation. In this shared environment, demand for AI accelerators fluctuates continuously as inference workloads scale with traffic patterns and experiments complete and release resources. Despite this…

Shared by AWS Machine Learning December 15, 2025
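The elastic idea behind this post can be illustrated with a toy example: when accelerators join or leave, the job recomputes each worker's data shard instead of failing. The `shard` function below is illustrative, not the SageMaker HyperPod API.

```python
# Toy sketch of elastic re-sharding: as the worker count changes, each
# worker's slice of the dataset is recomputed so coverage is preserved.

def shard(dataset_size: int, world_size: int, rank: int) -> range:
    """Contiguous shard of sample indices for one worker; the last
    rank absorbs the remainder."""
    per = dataset_size // world_size
    start = rank * per
    end = dataset_size if rank == world_size - 1 else start + per
    return range(start, end)

# Cluster scales from 4 workers down to 3: shards are recomputed and
# the remaining workers still cover the whole dataset.
for world_size in (4, 3):
    shards = [shard(1000, world_size, r) for r in range(world_size)]
    covered = sum(len(s) for s in shards)
    print(world_size, covered)  # dataset stays fully covered: 1000
```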

Checkpointless training on Amazon SageMaker HyperPod: Production-scale training with faster fault recovery

Foundation model training has reached an inflection point where traditional checkpoint-based recovery methods are becoming a bottleneck to efficiency and cost-effectiveness. As models grow to trillions of parameters and training clusters expand to thousands of AI accelerators, even minor disruptions can result in significant costs and delays. In this…

Shared by AWS Machine Learning December 15, 2025
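A toy version of the traditional checkpoint-based recovery this post contrasts with: training state is periodically serialized so a restart resumes from the last checkpoint, replaying any steps lost since then. All names and the `CKPT_EVERY` value are illustrative, not part of SageMaker HyperPod.

```python
# Illustrative checkpoint-based recovery loop. The save stalls and the
# replay of steps since the last checkpoint are the costs that
# checkpointless approaches aim to eliminate.
import os
import pickle

CKPT_EVERY = 100  # steps between checkpoints (illustrative value)

def save_checkpoint(path, step, weights):
    with open(path, "wb") as f:
        pickle.dump({"step": step, "weights": weights}, f)

def load_checkpoint(path):
    # Fresh start if no checkpoint exists yet.
    if not os.path.exists(path):
        return {"step": 0, "weights": 0.0}
    with open(path, "rb") as f:
        return pickle.load(f)

def train(path, total_steps):
    state = load_checkpoint(path)        # resume point after a failure
    step, weights = state["step"], state["weights"]
    while step < total_steps:
        step += 1
        weights += 0.01                  # stand-in for a gradient update
        if step % CKPT_EVERY == 0:
            save_checkpoint(path, step, weights)
    return step, weights
```

With `CKPT_EVERY = 100`, a failure at step 3,950 restarts from step 3,900 and replays 50 steps of work, in addition to the time spent writing each checkpoint.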

Building a voice-driven AWS assistant with Amazon Nova Sonic

As cloud infrastructure becomes increasingly complex, the need for intuitive and efficient management interfaces has never been greater. Traditional command-line interfaces (CLIs) and web consoles, while powerful, can create barriers to quick decision-making and operational efficiency. What if you could speak to your AWS infrastructure and get immediate, intelligent…

Shared by AWS Machine Learning December 12, 2025