Amazon Web Services spent its second re:Invent keynote this week sketching a future where AI agents handle everything from debugging code to booking flights. Dr Swami Sivasubramanian, VP of AWS Agentic AI, opened with nostalgia about programming calculators in high school, then pivoted to the present: agents that don’t just suggest solutions but execute them, agents that learn from outcomes, agents that supposedly work alongside humans as reliable teammates.
The pitch is seductive. Agents that write code, analyse server logs, create bug tickets, and propose fixes without human handholding. Agents that remember your travel preferences and adjust bookings based on whether you’re solo or wrangling children through an airport. Agents that run simulations, iterate designs, and optimise performance whilst you sleep.
But between the glossy demos and the ambitious promises sits a more complicated reality. AWS is betting heavily that businesses are ready to hand over meaningful autonomy to AI systems, yet the same keynote spent considerable time explaining why those systems can’t quite be trusted yet.
The production gap nobody wants to talk about
Sivasubramanian acknowledged what many developers already know: most AI agent experiments never make it to production. He called it “proof of concept jail,” and the metaphor isn’t accidental. Prototypes that work brilliantly on a laptop collapse when faced with real-world scale, security requirements, and the messy complexity of enterprise systems.
AWS’s answer is a suite of tools designed to bridge that gap. Amazon Bedrock AgentCore promises to handle identity management, tool connectivity, session isolation, and observability whilst letting developers use any framework or model they prefer. The pitch is modularity: use only what you need, integrate with what you already have, and let AWS handle the infrastructure headaches.
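To make the modularity pitch concrete, here is a rough sketch of the architecture it describes: a runtime that owns sessions, tool registration, and tracing while the model stays swappable. Every name below is invented for illustration; this is not the AgentCore SDK.

```python
# Illustrative only: hypothetical names sketching the separation of concerns
# the AgentCore pitch describes. Identity/auth would wrap around run();
# it is omitted here for brevity.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentRuntime:
    model: Callable[[str], str]                # any LLM callable you prefer
    tools: dict[str, Callable] = field(default_factory=dict)
    sessions: dict[str, list] = field(default_factory=dict)  # per-user state

    def register_tool(self, name: str, fn: Callable) -> None:
        self.tools[name] = fn                  # tool connectivity is plug-in

    def run(self, session_id: str, prompt: str) -> str:
        history = self.sessions.setdefault(session_id, [])  # session isolation
        history.append(prompt)
        reply = self.model(prompt)
        # observability hook: in a managed service this would emit traces
        print(f"[trace] session={session_id} turns={len(history)} tools={list(self.tools)}")
        return reply

# Usage: swap in any model; the surrounding plumbing stays the same.
rt = AgentRuntime(model=lambda p: f"(model reply to: {p})")
rt.register_tool("create_ticket", lambda title: {"id": 1, "title": title})
print(rt.run("user-42", "summarise last night's server errors"))
```

The point of the shape is that identity, tool connectivity, session isolation, and observability live in the plumbing, so swapping the model or the framework around it does not mean rebuilding the rest.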
Cox Automotive appeared in a customer testimonial, claiming they’ve moved “aggressively towards disruptive positive value” and reduced a two-day estimation process to under 30 minutes per vehicle. Blue Origin, similarly featured, described 2,700 agents in production across 70% of their workforce, driving 3.5 million interactions monthly.
These aren’t trivial deployments. But they’re also companies with substantial engineering resources and high tolerance for complexity. The question isn’t whether AI agents work at AWS scale or Blue Origin scale. It’s whether they work for organisations without dedicated AI teams, massive compute budgets, and the capacity to absorb failures gracefully.
Customisation as competitive advantage (or necessity)
Much of the keynote focused on model customisation. Off-the-shelf models, Sivasubramanian argued, lack the efficiency and domain-specific intelligence that production systems require. Fine-tuning, distillation, and reinforcement learning can create agents optimised for specific workflows, but these techniques have historically demanded PhD-level expertise and months of development time.
AWS introduced reinforcement fine-tuning in Bedrock, claiming it delivers 66% accuracy gains on average whilst automating the complex implementation details. SageMaker AI now offers serverless model customisation, promising to compress what used to take months into days. Nova Forge, announced the previous day, lets organisations train custom frontier models using Amazon’s intermediate checkpoints and curated data.
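For a sense of the developer surface, supervised fine-tuning on Bedrock is already a single boto3 call today; whether the new reinforcement variant shares this entry point is not something AWS detailed, and every ARN, bucket path, and base model identifier below is a placeholder.

```python
# Sketch of submitting a Bedrock fine-tuning job via boto3. The call shown
# (create_model_customization_job) exists for supervised fine-tuning; the
# reinforcement fine-tuning announced at the keynote may expose different
# parameters. All identifiers are placeholders.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_model_customization_job(
    jobName="support-agent-ft-001",
    customModelName="support-agent-custom",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",  # placeholder
    baseModelIdentifier="amazon.titan-text-express-v1",                 # placeholder
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-bucket/train.jsonl"},         # placeholder
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},               # placeholder
    hyperParameters={"epochCount": "2", "learningRate": "0.00001"},
)
print(response["jobArn"])
```

The job runs asynchronously; the returned ARN is what you poll for status before the custom model becomes available for inference.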
The subtext here is clear: generic models aren’t good enough if you’re serious about agents. You’ll need to customise, which means committing to AWS’s ecosystem and trusting that their tools can deliver what previously required specialist teams.
Dhan India got name-checked for building a 7-billion-parameter model specialised for Indian financial markets, using SageMaker for training and Bedrock for teacher model support. It runs on a single GPU and apparently outperforms state-of-the-art models 88% of the time. That’s impressive, but it’s also the kind of project that requires knowing exactly what you’re doing and having the data to back it up.
Trust, reliability, and the automated reasoning gambit
Byron Cook, AWS’s distinguished scientist in automated reasoning, took the stage to address the trust problem directly. Large language models hallucinate. They make logical errors. They can be tricked by adversarial inputs. If you’re building agents that handle money, sensitive data, or critical infrastructure, statistical methods aren’t sufficient.
Cook’s solution is neurosymbolic AI: combining LLMs with formal mathematical reasoning to verify outputs, constrain agent behaviour, and provide deterministic guarantees about what agents can and cannot do. AWS has used automated reasoning internally for over a decade, applying it to virtualisation, cryptography, identity management, and networking.
Now they’re bringing it to agents. Kiro (previously Amazon Q Developer) uses specifications to guide code generation and can verify programs against formal models of AWS APIs. AgentCore Policy, announced the day prior, translates natural language constraints into Cedar policies that can be formally verified, ensuring agents respect access controls even when facing unexpected scenarios.
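Cedar, for the unfamiliar, is the open-source authorisation language AWS built precisely so policies could be analysed mathematically rather than just tested. A natural-language constraint such as “the travel assistant may book flights under $500, and only for the trips team” might translate into something like the policy below; Cedar itself is real, but the entity types and attributes here are invented, shown as a string for readability.

```python
# A hypothetical Cedar policy of the kind AgentCore Policy is described as
# generating from natural-language constraints. The Agent/Action/Team
# entity types and the price/requester attributes are illustrative.
CEDAR_POLICY = """
permit(
    principal == Agent::"travel-assistant",
    action == Action::"BookFlight",
    resource
)
when {
    resource.price <= 500 &&
    context.requester in Team::"trips"
};
"""
print(CEDAR_POLICY)
```

Because Cedar was designed so that policy analysis is decidable, tools can prove properties of a policy like this before any agent runs, which is what separates the “formally verified” claim from ordinary guardrails.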
This is technically sophisticated work, but it’s also an acknowledgment that agents aren’t inherently trustworthy. AWS is building guardrails because they have to, not because they want to. The promise of autonomy comes with the caveat that you’ll need formal verification to sleep soundly.
Nova Act and the reliability question
The general availability of Amazon Nova Act represents AWS’s attempt to solve the reliability problem for browser automation. Traditional robotic process automation broke when interfaces changed. Early LLM-based approaches could handle variation but struggled with consistency. Nova Act claims 90% reliability in enterprise workflow settings by tightly integrating the model, orchestrator, and actuators into a single system.
The model isn’t trained separately and then bolted onto execution infrastructure. It’s trained end-to-end on the same stack it’ll use in production, learning from reinforcement rather than imitation. AWS built hundreds of “RL gyms” (simulated enterprise environments where agents practice workflows through trial and error) to teach Nova Act patterns that generalise across real systems.
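The gym idea is easier to picture in miniature. The toy below has an agent practise a simulated three-step workflow by trial and error, reinforcing the action sequence that completes it. This is illustrative only; AWS’s actual environments and training stack are not public.

```python
# A toy "RL gym": an agent learns a simulated workflow by trial and error,
# reinforcing actions that lead to task completion. Purely illustrative.
import random
from collections import defaultdict

random.seed(0)

ACTIONS = ["open_form", "fill_fields", "submit", "give_up"]
GOAL = ["open_form", "fill_fields", "submit"]   # the workflow to learn

q = defaultdict(float)   # learned value of taking `action` at `step`

for episode in range(5000):
    trace = []
    for step in range(len(GOAL)):
        if random.random() < 0.2:                       # explore
            action = random.choice(ACTIONS)
        else:                                           # exploit learned values
            action = max(ACTIONS, key=lambda a: q[(step, a)])
        trace.append((step, action))
        if action != GOAL[step]:                        # wrong step fails the run
            break
    reward = 1.0 if [a for _, a in trace] == GOAL else 0.0
    for step, action in trace:                          # nudge values toward outcome
        q[(step, action)] += 0.1 * (reward - q[(step, action)])

# After training, the greedy policy should read back the workflow.
print([max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(len(GOAL))])
```

Scale that loop up to hundreds of simulated enterprise systems and far richer action spaces, and you have the rough shape of what AWS describes.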
Ninety per cent reliability sounds impressive until you consider what happens during the other 10% of attempts. If an agent books the wrong flight, updates incorrect customer records, or completes a hardware request incorrectly, the consequences cascade quickly. AWS is betting that 90% is sufficient for production use, but that threshold will vary dramatically depending on what’s at stake.
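The caveat compounds, too. If the 90% figure applies per task and tasks chain into longer workflows, then, assuming failures are independent (a simplification), end-to-end reliability erodes quickly:

```python
# Back-of-envelope arithmetic: per-task reliability p compounds across a
# workflow of n chained tasks as p**n, assuming independent failures.
p = 0.9
for n in (1, 3, 5, 10):
    print(f"{n:2d} tasks -> {p ** n:.0%} end-to-end")
# prints: 90%, 73%, 59%, 35%
```

A ten-step workflow at 90% per step finishes cleanly barely a third of the time, which is why a per-task headline number says little about whole-process dependability.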
The human-agent collaboration narrative
Colleen Aubrey, SVP of applied AI solutions, delivered a live demo showing how Amazon Connect uses agents to assist human customer service representatives. A simulated fraud case involved an agent analysing transactions, flagging suspicious patterns, creating a police report, and setting up account monitoring whilst the human investigator focused on customer interaction.
It’s a compelling vision of collaboration rather than replacement. The agent handles tedious analysis and automation whilst the human provides empathy, judgement, and relationship-building. But it’s also carefully choreographed to show agents in their best light. Real-world implementations will encounter edge cases, miscommunications, and situations where the handoff between human and agent breaks down.
Connect launched eight new capabilities this week, including Nova 2 Sonic integration for natural-sounding voice conversations and AI-powered predictive insights based on clickstream data. These aren’t trivial features, but they’re also iterative improvements on existing customer service AI rather than fundamental breakthroughs.
What AWS isn’t saying
The keynote showcased impressive demos, substantial customer deployments, and genuine technical innovation. But several tensions remained unresolved.
First, the cost structure. Model customisation, reinforcement learning infrastructure, and formal verification don’t come cheap. AWS offered serverless options and managed services to reduce operational overhead, but the compute requirements for serious agent deployments remain substantial. Organisations will need to justify these costs against measurable business outcomes, and AWS didn’t spend much time discussing ROI calculations or break-even timelines.
Second, the data requirements. Effective customisation depends on having quality training data that captures the workflows you want to automate. Many organisations don’t have clean, well-labelled datasets ready for fine-tuning. They’ll need to invest in data preparation before they can benefit from AWS’s customisation tools, and that work often takes longer than the training itself.
Third, the governance implications. Agents that act autonomously need clear boundaries, oversight mechanisms, and accountability structures. AWS provided tools like AgentCore Policy and automated reasoning, but organisational policies, legal frameworks, and ethical guidelines remain largely undefined. Who’s responsible when an agent makes a costly mistake? How do you audit agent decisions retrospectively? What recourse exists when agents behave unexpectedly?
Fourth, the competitive landscape. AWS isn’t the only company building agent infrastructure. Microsoft, Google, Anthropic, and numerous startups are pursuing similar visions with different technical approaches. AWS’s advantage lies in its cloud infrastructure footprint and enterprise relationships, but vendor lock-in concerns will make some organisations hesitant to commit deeply to Bedrock and SageMaker.
The bigger picture
AWS’s agent vision represents a significant bet on how software development and business operations will evolve. If they’re right, the companies that master agent deployment will gain substantial competitive advantages. If they’re wrong (or premature), organisations that invest heavily in agent infrastructure may find themselves managing complex systems that don’t deliver proportional value.
The truth likely sits somewhere in between. Agents will prove transformative for specific use cases where the workflows are well-defined, the risks are manageable, and the benefits clearly outweigh the costs. They’ll struggle wherever nuanced judgement is required, situations are truly novel, or mistakes carry severe consequences.
Sivasubramanian closed by celebrating community builders: Indonesian high schoolers creating 15,000 GenAI apps, global hackathon participants from 127 countries, and Ajito Nelson’s EcoLafaek agent addressing waste management in Timor-Leste. These stories humanise the technology and highlight genuine innovation happening outside major tech companies.
But they also reveal the gap between experimental creativity and production reliability. Building an agent that works is different from building an agent that works consistently, securely, and predictably under pressure. AWS is offering tools to bridge that gap, but the bridge isn’t complete yet. Organisations willing to invest the resources, tolerate the complexity, and accept the risks can start deploying agents now. Everyone else might want to watch how those early deployments unfold before committing fully to the agent future AWS envisions.