How an AI Agent Chooses What to Do Under Token, Latency, and Tool-Call Budget Constraints

These articles are AI-generated summaries. Please check the original sources for full details.

Cost-Aware Planning for AI Agents

This tutorial introduces a cost-aware AI agent designed to balance output quality with real-world constraints like token usage, latency, and tool-call budgets. The system generates candidate actions, estimates their costs and benefits, and selects a plan to maximize value while adhering to strict resource limits.
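As a rough illustration of that generate, estimate, select loop, the sketch below scores a few hypothetical candidate actions and picks the subset with the highest total value that fits every budget. The action names, cost figures, and value scores are invented for illustration and do not come from the tutorial. Exhaustive search is used only because the candidate set is tiny.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Candidate:
    name: str
    tokens: int       # estimated token cost (hypothetical numbers)
    latency_ms: int   # estimated latency
    tool_calls: int   # estimated number of tool calls
    value: float      # estimated benefit, a unitless score

def best_plan(cands, max_tokens, max_latency_ms, max_tool_calls):
    """Return the feasible subset of candidates with the highest total value."""
    best, best_value = (), 0.0
    for r in range(1, len(cands) + 1):
        for combo in combinations(cands, r):
            fits = (sum(c.tokens for c in combo) <= max_tokens
                    and sum(c.latency_ms for c in combo) <= max_latency_ms
                    and sum(c.tool_calls for c in combo) <= max_tool_calls)
            total = sum(c.value for c in combo)
            if fits and total > best_value:
                best, best_value = combo, total
    return list(best), best_value

candidates = [
    Candidate("outline_local", 0, 20, 0, 3.0),
    Candidate("risk_register_local", 0, 30, 0, 2.0),
    Candidate("llm_draft", 900, 1500, 1, 6.0),
    Candidate("llm_polish", 600, 1200, 1, 4.0),
]
plan, value = best_plan(candidates, max_tokens=1000,
                        max_latency_ms=2000, max_tool_calls=1)
print([c.name for c in plan], value)
```

With these numbers both LLM steps cannot fit together, so the selection keeps the two free local steps and spends the single tool call on the higher-value draft.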

Why This Matters

Many LLM-based agents are built as if tokens, time, and tool calls were unlimited, which wastes compute and inflates costs. In practical deployments, especially on edge devices or under tight service-level agreements, those resources are capped, so “always use the LLM” is not a viable strategy. Uncontrolled LLM usage can escalate costs quickly and leave an agent unusable, which is why agents need explicit resource awareness and planning.

Key Insights

  • Beam search with redundancy penalty: Improves plan diversity and avoids repeating similar actions, resulting in better overall value.
  • Budget abstraction: Modeling tokens, latency, and tool calls as first-class quantities enables precise cost tracking and enforcement.
  • Workflow orchestration: Frameworks such as Temporal, used by Stripe and Coinbase, demonstrate the practical utility of orchestration tooling for managing complex agent actions.
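The first insight can be sketched as follows. This summary does not show the tutorial's scoring function, so the penalty form used here (a fixed deduction whenever an action's `kind` already appears in the plan) is an assumption, as are the `Action` fields, the action list, and the single token budget.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    kind: str      # actions sharing a kind are treated as redundant
    tokens: int
    value: float

def plan_score(plan, penalty):
    """Total value minus a penalty for each action whose kind was already used."""
    seen, score = set(), 0.0
    for a in plan:
        score += a.value - (penalty if a.kind in seen else 0.0)
        seen.add(a.kind)
    return score

def beam_search(actions, max_tokens, beam_width=3, max_steps=3, penalty=3.0):
    beams, best = [()], ()
    for _ in range(max_steps):
        expanded = {}  # frozenset of names -> plan, so permutations collapse
        for plan in beams:
            used = sum(a.tokens for a in plan)
            for a in actions:
                if a in plan or used + a.tokens > max_tokens:
                    continue  # skip repeats and over-budget extensions
                new_plan = plan + (a,)
                expanded.setdefault(frozenset(x.name for x in new_plan), new_plan)
        if not expanded:
            break
        ranked = sorted(expanded.values(),
                        key=lambda p: plan_score(p, penalty), reverse=True)
        beams = ranked[:beam_width]  # keep only the top-scoring partial plans
        if plan_score(beams[0], penalty) > plan_score(best, penalty):
            best = beams[0]
    return [a.name for a in best]

ACTIONS = [
    Action("draft_a", "draft", 400, 5.0),
    Action("draft_b", "draft", 400, 4.5),
    Action("outline", "outline", 100, 3.0),
    Action("risks", "risks", 150, 2.5),
]
print(beam_search(ACTIONS, max_tokens=900, penalty=3.0))
print(beam_search(ACTIONS, max_tokens=900, penalty=0.0))
```

With the penalty on, the search prefers one draft plus the outline and risk steps. With the penalty set to zero, it spends most of the budget on two near-duplicate drafts.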

Working Example

import os, time, math, json, random
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Tuple, Any
from getpass import getpass

# Set to False to skip the OpenAI client and run offline.
USE_OPENAI = True

if USE_OPENAI:
    # Prompt for the key only when it is not already in the environment.
    if not os.getenv("OPENAI_API_KEY"):
        os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY (hidden): ").strip()
    try:
        from openai import OpenAI
        client = OpenAI()
    except Exception as e:
        print("OpenAI SDK import failed. Falling back to offline mode.\nError:", e)
        USE_OPENAI = False


def approx_tokens(text: str) -> int:
    # Rough estimate: about four characters per token, never less than 1.
    return max(1, math.ceil(len(text) / 4))


@dataclass
class Budget:
    # Hard limits for a single agent run.
    max_tokens: int
    max_latency_ms: int
    max_tool_calls: int


@dataclass
class Spend:
    # Resources consumed so far, or the estimated cost of one step.
    tokens: int = 0
    latency_ms: int = 0
    tool_calls: int = 0

    def within(self, b: Budget) -> bool:
        # True only if every dimension is at or under its limit.
        return (self.tokens <= b.max_tokens and
                self.latency_ms <= b.max_latency_ms and
                self.tool_calls <= b.max_tool_calls)

    def add(self, other: "Spend") -> "Spend":
        # Returns a new Spend; neither operand is modified.
        return Spend(
            tokens=self.tokens + other.tokens,
            latency_ms=self.latency_ms + other.latency_ms,
            tool_calls=self.tool_calls + other.tool_calls
        )
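A short usage sketch may help show how these pieces fit together. The classes are restated in compact form only so the snippet runs on its own. The prompt text, the assumed 120-token reply, and the latency figure are hypothetical values chosen for illustration.

```python
import math
from dataclasses import dataclass

# Compact restatement of the definitions above so this snippet is self-contained.
def approx_tokens(text: str) -> int:
    return max(1, math.ceil(len(text) / 4))

@dataclass
class Budget:
    max_tokens: int
    max_latency_ms: int
    max_tool_calls: int

@dataclass
class Spend:
    tokens: int = 0
    latency_ms: int = 0
    tool_calls: int = 0

    def within(self, b: Budget) -> bool:
        return (self.tokens <= b.max_tokens
                and self.latency_ms <= b.max_latency_ms
                and self.tool_calls <= b.max_tool_calls)

    def add(self, other: "Spend") -> "Spend":
        return Spend(self.tokens + other.tokens,
                     self.latency_ms + other.latency_ms,
                     self.tool_calls + other.tool_calls)

budget = Budget(max_tokens=200, max_latency_ms=3000, max_tool_calls=1)
prompt = "Summarize the project risks in three bullet points."

# Hypothetical per-step estimate: prompt tokens plus an assumed 120-token reply.
step = Spend(tokens=approx_tokens(prompt) + 120, latency_ms=1200, tool_calls=1)

after_one = Spend().add(step)    # cost of running the step once
after_two = after_one.add(step)  # cost of running it a second time

print(after_one, after_one.within(budget))
print(after_two, after_two.within(budget))
```

Because `add` returns a new `Spend`, a planner can check `within` on a projected total before committing to a step, and the running total stays unchanged if the step is rejected.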

Practical Applications

  • Logistics Dashboard Pilot: An agent tasked with creating a project proposal can prioritize locally executed steps (outline, risk register) to minimize API calls, reserving LLM access for polish and refinement.
  • Pitfall: Over-reliance on LLM-based steps without cost estimation can quickly exhaust token budgets and lead to incomplete tasks or failed agent runs.
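One way to guard against that pitfall is to estimate each LLM-backed step before running it and skip it when the estimate no longer fits. The sketch below is a simplified illustration, not the tutorial's implementation: the step list, the expected reply sizes, and the `plan_steps` helper are hypothetical, and no model is actually called.

```python
import math

def approx_tokens(text: str) -> int:
    # Same rough heuristic as the working example: about 4 characters per token.
    return max(1, math.ceil(len(text) / 4))

def plan_steps(steps, token_budget):
    """Decide per step whether to run locally, call the LLM, or skip.

    Each step is (name, prompt, expected_reply_tokens); a prompt of None
    marks a step that runs locally and costs no tokens.
    """
    tokens_left = token_budget
    decisions = []
    for name, prompt, reply_tokens in steps:
        if prompt is None:
            decisions.append((name, "local"))
            continue
        estimate = approx_tokens(prompt) + reply_tokens
        if estimate <= tokens_left:
            tokens_left -= estimate  # reserve the tokens before the call
            decisions.append((name, "llm"))
        else:
            decisions.append((name, "skipped"))
    return decisions, tokens_left

steps = [
    ("outline", None, 0),
    ("risk_register", None, 0),
    ("polish", "Polish the draft for tone and clarity.", 150),
    ("refine", "Tighten the executive summary further.", 150),
]
decisions, tokens_left = plan_steps(steps, token_budget=200)
print(decisions, tokens_left)
```

Here the local steps always run, the first LLM step fits the budget, and the second is skipped up front instead of failing partway through the run.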
