How an AI Agent Chooses What to Do Under Tokens, Latency, and Tool-Call Budget Constraints?

Cost-Aware Planning for AI Agents

This tutorial introduces a cost-aware AI agent designed to balance output quality with real-world constraints like token usage, latency, and tool-call budgets. The system generates candidate actions, estimates their costs and benefits, and selects a plan to maximize value while adhering to strict resource limits.

Why This Matters

LLM-based agents are often built as if resources were unlimited, which wastes compute and drives up cost. In practical deployments, especially on edge devices or under tight service level agreements, resource limits dominate, and “always use the LLM” is not a viable strategy. Unchecked LLM usage can escalate costs quickly and leave an agent unusable, so resource awareness has to be an explicit part of planning.

Key Insights

Beam search with redundancy penalty: Improves plan diversity and avoids repeating similar actions, resulting in better overall value.
Budget abstraction: Modeling tokens, latency, and tool calls as first-class quantities enables precise cost tracking and enforcement.
Workflow orchestration: Frameworks such as Temporal, used by Stripe and Coinbase, demonstrate the practical utility of orchestration for managing complex agent actions.
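To make the first insight concrete, here is a minimal sketch of beam search with a redundancy penalty under a three-part budget. It is not the tutorial's planner: the `Action` fields, the `kind` label used to detect repeated work, the example actions, and all numeric values are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Action:
    name: str
    kind: str         # hypothetical category; two actions of one kind count as redundant
    value: float      # estimated benefit (illustrative numbers)
    tokens: int
    latency_ms: int
    tool_calls: int


def plan_cost(plan: Tuple[Action, ...]) -> Tuple[int, int, int]:
    """Total (tokens, latency_ms, tool_calls) of a plan."""
    return (sum(a.tokens for a in plan),
            sum(a.latency_ms for a in plan),
            sum(a.tool_calls for a in plan))


def plan_score(plan: Tuple[Action, ...], redundancy_penalty: float) -> float:
    """Sum of values minus a penalty for each repeated kind."""
    kinds = [a.kind for a in plan]
    repeats = len(kinds) - len(set(kinds))
    return sum(a.value for a in plan) - redundancy_penalty * repeats


def beam_search(actions: List[Action], limits: Tuple[int, int, int],
                beam_width: int = 3, max_steps: int = 3,
                redundancy_penalty: float = 1.0) -> Tuple[Action, ...]:
    beams: List[Tuple[Action, ...]] = [()]
    best: Tuple[Action, ...] = ()
    for _ in range(max_steps):
        expanded = []
        for plan in beams:
            for action in actions:
                if action in plan:
                    continue
                candidate = plan + (action,)
                # Discard any candidate that breaks a budget dimension.
                if all(c <= l for c, l in zip(plan_cost(candidate), limits)):
                    expanded.append(candidate)
        if not expanded:
            break
        expanded.sort(key=lambda p: plan_score(p, redundancy_penalty), reverse=True)
        beams = expanded[:beam_width]  # keep only the top-scoring partial plans
        if plan_score(beams[0], redundancy_penalty) > plan_score(best, redundancy_penalty):
            best = beams[0]
    return best


ACTIONS = [
    Action("outline_local", "outline", 1.0, 0, 20, 0),
    Action("outline_llm", "outline", 1.4, 400, 900, 1),
    Action("risks_local", "risks", 0.8, 0, 20, 0),
    Action("polish_llm", "polish", 1.5, 600, 1200, 1),
]
LIMITS = (800, 2500, 1)  # max tokens, max latency in ms, max tool calls

plan = beam_search(ACTIONS, LIMITS)
print([a.name for a in plan], plan_cost(plan))
```

With these numbers the single allowed tool call goes to the polish step, and the two local steps fill the rest of the plan. A fuller planner would also deduplicate plans that differ only in ordering, which this sketch does not do.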

Working Example

import os, time, math, json, random
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Tuple, Any
from getpass import getpass

USE_OPENAI = True
if USE_OPENAI:
    # Prompt for the API key only if it is not already set in the environment.
    if not os.getenv("OPENAI_API_KEY"):
        os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY (hidden): ").strip()
    try:
        from openai import OpenAI
        client = OpenAI()
    except Exception as e:
        print("OpenAI SDK import failed. Falling back to offline mode.\nError:", e)
        USE_OPENAI = False


def approx_tokens(text: str) -> int:
    # Rough heuristic: about four characters per token, never less than one.
    return max(1, math.ceil(len(text) / 4))


@dataclass
class Budget:
    # Hard limits for a single agent run.
    max_tokens: int
    max_latency_ms: int
    max_tool_calls: int


@dataclass
class Spend:
    # Resources consumed so far (or by a single step).
    tokens: int = 0
    latency_ms: int = 0
    tool_calls: int = 0

    def within(self, b: Budget) -> bool:
        # True only if every dimension is at or below its limit.
        return (self.tokens <= b.max_tokens and
                self.latency_ms <= b.max_latency_ms and
                self.tool_calls <= b.max_tool_calls)

    def add(self, other: "Spend") -> "Spend":
        # Returns a new Spend; neither operand is modified.
        return Spend(
            tokens=self.tokens + other.tokens,
            latency_ms=self.latency_ms + other.latency_ms,
            tool_calls=self.tool_calls + other.tool_calls
        )
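One way the `Budget` and `Spend` types above might be used is to project each step's cost before committing to it. The sketch below restates both classes compactly so it runs on its own; the `run_within_budget` helper, the step names, and the cost figures are hypothetical and not part of the original tutorial.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Budget:
    max_tokens: int
    max_latency_ms: int
    max_tool_calls: int


@dataclass
class Spend:
    tokens: int = 0
    latency_ms: int = 0
    tool_calls: int = 0

    def within(self, b: Budget) -> bool:
        return (self.tokens <= b.max_tokens and
                self.latency_ms <= b.max_latency_ms and
                self.tool_calls <= b.max_tool_calls)

    def add(self, other: "Spend") -> "Spend":
        return Spend(self.tokens + other.tokens,
                     self.latency_ms + other.latency_ms,
                     self.tool_calls + other.tool_calls)


def run_within_budget(steps: List[Tuple[str, Spend]],
                      budget: Budget) -> Tuple[List[str], Spend]:
    """Hypothetical helper: accept steps in order, skipping any that would overshoot."""
    spent = Spend()
    accepted: List[str] = []
    for name, cost in steps:
        projected = spent.add(cost)      # add() returns a new Spend
        if not projected.within(budget):
            continue                     # skip this step, try cheaper ones later
        spent = projected
        accepted.append(name)
    return accepted, spent


# Illustrative estimates only.
steps = [
    ("outline_local", Spend(0, 20, 0)),
    ("draft_llm", Spend(700, 1500, 1)),
    ("polish_llm", Spend(500, 1200, 1)),
    ("risk_register_local", Spend(0, 30, 0)),
]
accepted, spent = run_within_budget(steps, Budget(1000, 3000, 2))
print(accepted, spent)
```

Because `add` returns a new object, the projected total can be checked and thrown away without touching the running total. Here the polish step is skipped because it would push tokens past the limit, while the cheaper local step after it still fits.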

Practical Applications

Logistics Dashboard Pilot: An agent tasked with creating a project proposal can prioritize locally executed steps (outline, risk register) to minimize API calls, reserving LLM access for polish and refinement.
Pitfall: Over-reliance on LLM-based steps without cost estimation can quickly exhaust token budgets and lead to incomplete tasks or failed agent runs.
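The pitfall above can be avoided with a pre-flight estimate before each LLM call. The following is a minimal sketch, assuming the four-characters-per-token heuristic from `approx_tokens` and a caller-supplied guess for output length; `choose_step` and its parameters are hypothetical names introduced here.

```python
import math
from typing import Tuple


def approx_tokens(text: str) -> int:
    # Same rough heuristic as the working example: about four characters per token.
    return max(1, math.ceil(len(text) / 4))


def choose_step(prompt: str, expected_output_tokens: int,
                tokens_remaining: int) -> Tuple[str, int]:
    """Hypothetical pre-flight check: pick the LLM only if the estimate fits."""
    estimated = approx_tokens(prompt) + expected_output_tokens
    if estimated <= tokens_remaining:
        return "llm", estimated
    return "local", 0  # fall back to a locally executed step that costs no tokens


prompt = "x" * 400  # stands in for a real prompt; about 100 tokens by the heuristic
print(choose_step(prompt, expected_output_tokens=300, tokens_remaining=500))
print(choose_step(prompt, expected_output_tokens=300, tokens_remaining=350))
```

The estimate is deliberately crude. Its purpose is to catch calls that clearly will not fit, not to predict billing exactly.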

References:

https://www.marktechpost.com/2026/01/23/how-an-ai-agent-chooses-what-to-do-under-tokens-latency-and-tool-call-budget-constraints/
