Accelerate custom LLM deployment: Fine-tune with Oumi and deploy to Amazon Bedrock

This post is cowritten by David Stewart and Matthew Persons from Oumi. Fine-tuning open-source large language models (LLMs) often stalls between experimentation and production. Training configurations, artifact management, and scalable deployment each require different tools, creating friction when moving from rapid experimentation to secure, enterprise-grade environments. In this …

Read More
Shared by AWS Machine Learning March 11, 2026
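The fine-tune-and-deploy workflow in the post above ends with getting trained weights into Amazon Bedrock. A minimal sketch of that import step using Bedrock Custom Model Import follows, assuming the fine-tuned checkpoint has already been uploaded to S3; the job name, model name, role ARN, and S3 URI are placeholders, and whether the post follows this exact path is an assumption based on the title.

```python
"""Illustrative sketch: importing a fine-tuned checkpoint into Amazon Bedrock
with Custom Model Import. All names, ARNs, and URIs are placeholders."""
import boto3

bedrock = boto3.client("bedrock")

# The fine-tuned model artifacts (e.g., safetensors produced by the training run)
# are assumed to already be uploaded to S3.
job = bedrock.create_model_import_job(
    jobName="oumi-finetune-import",
    importedModelName="my-finetuned-llm",
    roleArn="arn:aws:iam::123456789012:role/BedrockModelImportRole",
    modelDataSource={"s3DataSource": {"s3Uri": "s3://my-bucket/oumi-output/"}},
)
print(job["jobArn"])
```

The returned job ARN can then be polled (for example with get_model_import_job) until the imported model is ready to invoke.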

Access Anthropic Claude models in India on Amazon Bedrock with Global cross-Region inference

The adoption and implementation of generative AI inference have increased as organizations build more operational workloads that use AI capabilities in production at scale. To help customers scale their generative AI applications, Amazon Bedrock offers cross-Region inference (CRIS) profiles. CRIS is a powerful feature that organizations …

Read More
Shared by AWS Machine Learning March 10, 2026
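For readers wondering what using CRIS looks like in application code: a cross-Region inference profile ID simply takes the place of a model ID in the runtime call. A minimal sketch with boto3 and the Converse API follows; the profile ID shown is illustrative, and the exact global profile ID for a given Claude model should be taken from the Bedrock console or documentation.

```python
"""Illustrative sketch: invoking a Claude model on Amazon Bedrock through a
cross-Region inference (CRIS) profile. The profile ID below is a placeholder."""
import boto3

# bedrock-runtime in the Region your application calls (ap-south-1 = Mumbai);
# CRIS routes the request to capacity in other Regions within the profile's geography.
client = boto3.client("bedrock-runtime", region_name="ap-south-1")

# A cross-Region inference profile ID is used in place of a plain model ID.
# "global."-prefixed profiles can route worldwide; this exact ID is illustrative.
inference_profile_id = "global.anthropic.claude-sonnet-4-20250514-v1:0"

response = client.converse(
    modelId=inference_profile_id,
    messages=[{"role": "user", "content": [{"text": "Summarize CRIS in one sentence."}]}],
    inferenceConfig={"maxTokens": 256},
)
print(response["output"]["message"]["content"][0]["text"])
```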

Building a custom model provider for Strands Agents with LLMs hosted on SageMaker AI endpoints

Organizations increasingly deploy custom large language models (LLMs) on Amazon SageMaker AI real-time endpoints using their preferred serving frameworks, such as SGLang, vLLM, or TorchServe, to gain greater control over their deployments, optimize costs, and align with compliance requirements. However, this flexibility introduces a critical technical challenge: response format incompatibility …

Read More
Shared by AWS Machine Learning March 6, 2026
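The response format incompatibility the teaser refers to is the gap between what a vLLM-, SGLang-, or TorchServe-backed SageMaker endpoint returns and what an agent framework expects. The sketch below shows the normalization step a custom Strands Agents model provider has to perform; the endpoint name and OpenAI-style payload shape are assumptions, and the Strands Model interface itself is not reproduced here.

```python
"""Illustrative sketch of the response-adaptation problem: a serving container on a
SageMaker AI endpoint typically returns an OpenAI-style JSON body, which a custom
model provider translates into the plain message the agent layer expects."""
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def invoke_custom_llm(prompt: str, endpoint_name: str = "my-vllm-endpoint") -> str:
    # OpenAI-compatible chat payload commonly accepted by vLLM/SGLang servers.
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    body = json.loads(response["Body"].read())
    # Normalization step: pull the assistant text out of the framework-specific
    # response so the agent sees a consistent format regardless of serving stack.
    return body["choices"][0]["message"]["content"]
```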

Drive organizational growth with Amazon Lex multi-developer CI/CD pipeline

As your conversational AI initiatives evolve, developing Amazon Lex assistants becomes increasingly complex. When multiple developers work on the same shared Lex instance, configuration conflicts, overwritten changes, and slower iteration cycles follow. Scaling Amazon Lex development requires isolated environments, version control, and automated deployment pipelines. By adopting well-structured continuous …

Read More
Shared by AWS Machine Learning March 6, 2026
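One concrete building block of such a pipeline is exporting the bot definition so it can live in version control and be promoted across isolated environments. A minimal sketch using the Lex V2 CreateExport API follows; the bot ID is a placeholder and the polling loop is simplified.

```python
"""Illustrative sketch: export a Lex V2 bot definition as LexJson so it can be
committed to version control and promoted through a CI/CD pipeline."""
import time
import boto3

lex = boto3.client("lexv2-models")

def export_bot_definition(bot_id: str, bot_version: str = "DRAFT") -> str:
    export = lex.create_export(
        resourceSpecification={
            "botExportSpecification": {"botId": bot_id, "botVersion": bot_version}
        },
        fileFormat="LexJson",
    )
    export_id = export["exportId"]
    # Poll until the export completes, then return the pre-signed download URL
    # for the zip archive containing the bot definition.
    while True:
        status = lex.describe_export(exportId=export_id)
        if status["exportStatus"] == "Completed":
            return status["downloadUrl"]
        if status["exportStatus"] == "Failed":
            raise RuntimeError(status.get("failureReasons"))
        time.sleep(5)
```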

How Ricoh built a scalable intelligent document processing solution on AWS

This post is cowritten by Jeremy Jacobson and Rado Fulek from Ricoh. It demonstrates how enterprises can overcome document processing scaling limits by combining generative AI, serverless architecture, and standardized frameworks. Ricoh engineered a repeatable, reusable framework using the AWS GenAI Intelligent Document Processing (IDP) Accelerator. This framework …

Read More
Shared by AWS Machine Learning March 5, 2026

Embed Amazon Quick Suite chat agents in enterprise applications

Organizations face two critical challenges with conversational AI. First, users need answers where they work (in their CRM, support console, or analytics portal), not in separate tools. Second, implementing secure embedded chat in their applications can require weeks of development to build authentication, token validation, domain security, and global …

Read More
Shared by AWS Machine Learning March 5, 2026