File tools and validate: the agent's feedback loop

Terraform PR Agent June 25, 2026 · 17 min read

bedrock
pydantic-ai
terraform
validation

What this post covers

This post gives the agent real work. Previous posts gave it only toy tasks to prove that the infrastructure, monitoring, and audit logs work. Now we build a harness that lets the agent write Terraform and validate it.

To do this we give the agent a few tools:

Sandboxed File System Tools:
- list_files: list files in a folder.
- read_file: read a file.
- write_file: write a file.
- edit_file: edit a file (search and replace based).
- delete_file: delete a file.

Terraform Tools:
- terraform_init: initialize a Terraform workspace.
- terraform_validate: validate a Terraform workspace.
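
A search-and-replace edit tool like edit_file could be sketched as below. This is a minimal illustration rather than the post's actual implementation: the function name `apply_edit`, the exactly-one-match rule, and the error wording are all assumptions, chosen so that a failed edit tells the model what to try next.

```python
def apply_edit(text: str, search: str, replace: str) -> str:
    """Replace exactly one occurrence of `search`, or explain why not.

    Hypothetical helper: the error messages are written for the model, so each
    one names the next step instead of only reporting the failure.
    """
    if not search:
        raise ValueError("search string is empty; provide the exact text to replace")
    count = text.count(search)
    if count == 0:
        raise ValueError(
            "search string not found; read the file again and copy the text exactly"
        )
    if count > 1:
        raise ValueError(
            f"search string matches {count} times; include more surrounding "
            "context so it is unique"
        )
    return text.replace(search, replace, 1)


original = 'bucket = "a"\nacl    = "private"\n'
edited = apply_edit(original, 'acl    = "private"', 'acl    = "public-read"')
```

Requiring a unique match is a design choice: it keeps an ambiguous edit from silently changing the wrong block.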

The sandboxed file system tools themselves are simple; most of the work is returning good error messages to the model and making sure it cannot break out of the sandbox folder under /tmp. The terraform tools, on the other hand, are more complex and have consequences for the Lambda setup: they need to actually execute the terraform binary. For this post that meant changing the Lambda from a Python zip package to a Docker-based Lambda, which required changing the build process and the infrastructure setup.
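
The sandbox boundary check can be sketched as follows. This is an illustration under assumptions, not the post's code: `resolve_in_sandbox` and `SandboxError` are hypothetical names, and the real tools may enforce the boundary differently. The idea is to resolve the requested path first and only then compare it with the workspace root.

```python
from pathlib import Path


class SandboxError(ValueError):
    """Raised when a requested path would leave the workspace root."""


def resolve_in_sandbox(root: Path, relative: str) -> Path:
    """Resolve `relative` under `root`, refusing anything that escapes it."""
    base = root.resolve()
    candidate = (base / relative).resolve()
    # resolve() collapses ".." and follows symlinks, so the comparison is made
    # on the real location, not on the string the model supplied. An absolute
    # input replaces `base` in the join and is rejected by the same check.
    if not candidate.is_relative_to(base):
        raise SandboxError(
            f"path {relative!r} is outside the workspace; "
            "use a path relative to the workspace root"
        )
    return candidate
```

A file tool would call this before every read or write and pass the error message back to the model unchanged.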

A typical agent flow looks like this:

[Figure: Logfire trace UI showing one terraform-pr-agent invocation: a Mistral Large (latest) invocation with the user prompt, then the agent invoking the file write tool, terraform init, and terraform validate.]

Notice the new AWS Lambda span, which was not there in the previous post. I extended the instrumentation because more now happens than just the agent run: after the run we also re-lock the Terraform dependency lock file for additional architectures, and I wanted timings and memory usage for the external terraform process. On AWS Lambda, memory is both a constraint and a cost factor, made worse by the fact that a new sandbox AWS account (like mine) is capped at 3008 MB of memory. Raising that cap is a slow process on basic support.

The system prompt in this iteration is not optimized (we’ll do that in a later post). We just tell the agent what tools it has and that it needs to run terraform_init before the first validate. list_files and read_file are not that important yet, because this post does not hook up GitHub or existing Terraform code. They will matter more when the agent needs to assess how to integrate its code into an existing project and follow that project’s best practice guidelines when doing so.

You are the terraform-pr-agent. You operate on a Terraform workspace
through file tools (list_files, read_file, write_file, edit_file, delete_file)
and two terraform tools. Use the file tools to explore, write, and edit.
Run terraform_init before your first validate and again whenever you add
or change provider or module requirements. Call terraform_validate after
you write or change files to confirm the workspace still parses; treat its
output as feedback and edit until it is clean.

When you add the AWS provider, give it a version constraint such as "~> 6.0"
rather than leaving it unconstrained. terraform init records the exact resolved
version and checksums in .terraform.lock.hcl, which travels with the workspace
and is the reproducibility record, so the constraint does not need to be an exact
pin. Pin an exact version only when the user asks for one. Example:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 6.0"
    }
  }
}

Similarly, the user prompt is basic, with not much thought put into it. It is passed in the invoke event payload.

Set up a new terraform project, creating a best practice s3 bucket.

We store the end result in an S3 bucket. With no GitHub integration yet, that is a simple way to inspect the files the agent produced and validate the outcome.

Architecture

Post 2 wrapped the agent in a Lambda and stood up the dual-sink span pipeline behind it; post 3 made the model a runtime choice (an SSM model registry, a Mistral provider beside Bedrock, and EMF metrics feeding one CloudWatch dashboard). Post 4 keeps all of that as is. The Lambda itself changes from a zip package to a container image so we can ship the terraform CLI alongside the Python runtime; a new ECR repository holds the image, and terraform seeds it with a placeholder so the function can be created before the first real push, the container twin of post 2’s placeholder zip. Everything else around the function (IAM, Bedrock, the model registry, Firehose, S3) carries forward unchanged. The rest of the post is software: a workspace under /tmp, file tools, the terraform_validate tool, and a caller-side retry.
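
The core of a validate tool is running the CLI and handing its output back as text. The sketch below shows one way to do that; `run_cli` and `validate_workspace` are hypothetical names, the timeout is an arbitrary assumption, and the post's own `_validate` may differ. It only assumes the terraform binary is on PATH inside the container.

```python
import subprocess
from pathlib import Path


def run_cli(command: list[str], cwd: Path, timeout: float = 120.0) -> tuple[bool, str]:
    """Run a CLI in `cwd` and return (succeeded, combined output)."""
    completed = subprocess.run(
        command,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so the model sees one stream
        text=True,
        timeout=timeout,
        check=False,  # a non-zero exit is feedback for the agent, not an exception
    )
    return completed.returncode == 0, completed.stdout


def validate_workspace(root: Path) -> tuple[bool, str]:
    """Run `terraform validate` in the workspace (assumes terraform is on PATH)."""
    # -no-color keeps ANSI escape codes out of the text handed back to the model.
    return run_cli(["terraform", "validate", "-no-color"], root)
```

Returning a `(ok, output)` pair instead of raising lets the same helper serve both the tool, which turns a failure into feedback, and a caller-side check after the run.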

The code

Prerequisites (one-time setup for the series)

Tooling and AWS access common to every post in this series.

Tooling

Terraform 1.x (install). Every post provisions infrastructure with Terraform.
uv for Python project management (install). Each post ships a runnable script you can invoke with uv run.
direnv (install) so terraform, uv run, and aws pick up AWS credentials automatically on cd. The project scaffold ships an .envrc that sources a gitignored .envrc.local.
(Optional) A coding agent such as Claude Code, Cursor, Codex, or Gemini CLI to consume the AgentPrompt blocks throughout the series. Not required (each prompt has a manual equivalent shown alongside it), but it skips the boilerplate.

Agent prompt: Check and install missing tooling

You are helping set up tooling for a tutorial project.

For each of `terraform`, `uv`, and `direnv`, run `command -v` to
check whether it is installed. If present, print the version and
continue.

For missing tools, detect the system package manager in this order:
`command -v brew`, `command -v dnf`, `command -v apt-get`. Use the
first one available:

  - Terraform: `brew tap hashicorp/tap && brew install hashicorp/tap/terraform`,
    dnf via the HashiCorp RPM repo, or apt via the HashiCorp deb repo.
  - uv: `brew install uv`, or the official installer
    `curl -LsSf https://astral.sh/uv/install.sh | sh`.
  - direnv: `brew install direnv`, `dnf install direnv`, or
    `apt-get install direnv`.

If no package manager is available or the install fails, stop and
link the manual install page so the developer can finish by hand:

  - Terraform: https://developer.hashicorp.com/terraform/install
  - uv: https://docs.astral.sh/uv/getting-started/installation/
  - direnv: https://direnv.net/docs/installation.html

After installing direnv, do not modify any shell rc files. Print the
hook line for the developer's shell (bash, zsh, or fish) and the path
to the relevant rc file, then wait for them to apply it themselves.

Report which tools were already present, which you installed, and
which need manual follow-up.

AWS access

A sandbox, test, or personal AWS account with permission to create, modify, and delete the resources discussed in each post. If you don’t have one, follow the official Create Your AWS Account walkthrough (about ten minutes; requires a credit card and a phone number for verification). Treat it as disposable - you can close it from the billing console after the series.
AWS credentials available locally via aws configure sso, aws configure, or whichever method matches your setup. You wire them into the project through .envrc.local in the next section, not your shell rc.

Anthropic First Time Use

Bedrock requires a one-time use-case form per account (or per AWS Organization management account) before Anthropic models can be invoked. Easiest path: open any Claude model in the Bedrock console playground and submit the form. Auto-subscription on first invoke can take up to 15 minutes to settle, so it is worth clearing this before post 1.

CLI alternative and verification

Programmatic equivalent (requires AWS CLI 2.27.42 or later):

aws bedrock put-use-case-for-model-access \
  --form-data "$(printf '{"companyName":"...","companyWebsite":"...","intendedUsers":"1","industryOption":"...","otherIndustryOption":"","useCases":"..."}' | base64)"

Verify:

aws bedrock get-foundation-model-availability \
  --model-id anthropic.claude-haiku-4-5-20251001-v1:0 \
  --region eu-west-1

Look for agreementAvailability.status: AVAILABLE. Expected output:

{
  "modelId": "anthropic.claude-haiku-4-5-20251001-v1",
  "agreementAvailability": { "status": "AVAILABLE" },
  "authorizationStatus": "AUTHORIZED",
  "entitlementAvailability": "AVAILABLE",
  "regionAvailability": "AVAILABLE"
}

If the form has not been submitted, only agreementAvailability.status flips to NOT_AVAILABLE. The other three fields stay green even when invocation would fail, so do not rely on them.
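
If you want to script that check, a small sketch like the one below reads only the field that matters. The function name `anthropic_form_submitted` is hypothetical, and it assumes you pass it the JSON text printed by the verify command above.

```python
import json


def anthropic_form_submitted(response_json: str) -> bool:
    """True only when agreementAvailability.status is AVAILABLE.

    The other availability fields are deliberately ignored, because they stay
    AVAILABLE even when the use-case form has not been submitted.
    """
    payload = json.loads(response_json)
    status = payload.get("agreementAvailability", {}).get("status")
    return status == "AVAILABLE"


pending = json.dumps(
    {
        "agreementAvailability": {"status": "NOT_AVAILABLE"},
        "authorizationStatus": "AUTHORIZED",
        "entitlementAvailability": "AVAILABLE",
        "regionAvailability": "AVAILABLE",
    }
)
print(anthropic_form_submitted(pending))  # False: the form is still missing
```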

The final tree. + is new in post 4, ~ extends a post 3 file, and an unmarked entry carries over unchanged. Click any changed or new file to read it; the download below fast-forwards to this state if you want to walk through the post against the finished code. We also split the old handler.py in two, because it was getting out of hand as this post’s functionality piled on: core.py holds the agent and an execute(prompt, model) that knows nothing about Lambda, and lambda_entry.py is the Lambda boundary that parses the event and calls it.

terraform-pr-agent/
  agent/
    __init__.py
    handler.py
  infra/
    placeholder/
    alerts.tf
    audit-bucket.tf
    bedrock.tf
    firehose.tf
    iam.tf
    kms.tf
    logfire.tf
    main.tf
    models.tf
  scripts/
    chat.py
    queries.sql
  tests/
    test_handler.py
  .envrc
  .envrc.local
  .gitignore
  AGENTS.md

"""Define the agent and execute one run.

Runtime-agnostic: a fresh /tmp workspace threaded through the file and validate
tools, one run id that joins the audit trace, the parked artifacts, and the
result. The telemetry pipeline lives in observability.py, model construction in
models.py, and run persistence plus the provider re-lock in runs.py; the Lambda
envelope and INIT wiring live in lambda_entry.py. This module knows nothing
about Lambda events.
"""

from __future__ import annotations

import uuid
from pathlib import Path
from tempfile import TemporaryDirectory

from pydantic import BaseModel, ConfigDict
from pydantic_ai import Agent

from agent.env import require_env
from agent.models import _build_model
from agent.runs import _persist_run, _relock_providers
from agent.tools import (
    WorkspaceDeps,
    _validate,
    delete_file,
    edit_file,
    list_files,
    read_file,
    terraform_init,
    terraform_validate,
    write_file,
)

# A soft reproducibility nudge following the standard Terraform pattern: a
# version constraint in the config, the exact version and checksums in the lock
# file. The tool-call spans in the trace are the ground truth for what the agent
# actually wrote.
_PROVIDER_PIN_RULE = (
    "When you add the AWS provider, give it a version constraint such as "
    '"~> 6.0" rather than leaving it unconstrained. terraform init records '
    "the exact resolved version and checksums in .terraform.lock.hcl, "
    "which travels with the workspace and is the reproducibility record, "
    "so the constraint does not need to be an exact pin. Pin an exact "
    "version only when the user asks for one. Example:\n"
    "\n"
    "terraform {\n"
    "  required_providers {\n"
    "    aws = {\n"
    '      source  = "hashicorp/aws"\n'
    '      version = "~> 6.0"\n'
    "    }\n"
    "  }\n"
    "}"
)

SYSTEM_PROMPT = (
    "You are the terraform-pr-agent. You operate on a Terraform workspace "
    "through file tools (list_files, read_file, write_file, edit_file, "
    "delete_file) and two terraform tools. Use the file tools to explore, "
    "write, and edit. Run terraform_init before your first validate and "
    "again whenever you add or change provider or module requirements. "
    "Call terraform_validate after you write or change files to confirm "
    "the workspace still parses; treat its output as feedback and edit "
    "until it is clean.\n\n" + _PROVIDER_PIN_RULE
)

# One Agent instance is reused across invocations: the tools reach the workspace
# through RunContext.deps, so each run_sync scopes them to a fresh WorkspaceDeps.
# The model carries no default; it is built from the registry at INVOKE and
# passed per run, so switching DEFAULT_MODEL needs no code change.
agent = Agent(
    deps_type=WorkspaceDeps,
    system_prompt=SYSTEM_PROMPT,
    tools=[
        list_files,
        read_file,
        write_file,
        edit_file,
        delete_file,
        terraform_init,
        terraform_validate,
    ],
    # Tools raise ModelRetry on failure; pydantic-ai ends the run once one tool
    # fails more than `retries` times in a row (a success resets the count). The
    # default of 1 would end the run on the second straight failing validate,
    # which is a normal part of the write-validate-edit loop, so the budget is
    # raised well past anything a converging run produces. The per-run turn cap
    # stays the runaway guard.
    retries=10,
)


# Caller-side backstop budget: how many follow-up runs to spend trying to get a
# clean validate after the agent reports done. The per-tool `retries` above
# guards the loop inside one run; this guards the run as a whole.
_MAX_VALIDATE_RETRIES = 3

_RETRY_PROMPT = (
    "terraform validate still reports errors after you finished. "
    "Fix them and validate again.\n\n{output}"
)


class ValidateDidNotConverge(RuntimeError):
    """terraform validate still failed after the caller-side retry budget."""


class RunResult(BaseModel):
    model_config = ConfigDict(frozen=True)

    run_id: str
    model: str
    output: str


def execute(prompt: str, model: str | None = None) -> RunResult:
    """Run the agent once and return the result.

    The run id is the trace's gen_ai.conversation.id and the runs/<run_id>/
    prefix, so one identifier joins trace, artifacts, and result. After the run
    the caller re-validates the workspace and feeds any failure back as a
    follow-up turn; a run that still fails after the retry budget raises, so the
    workspace ships under status error for debugging. ``model`` overrides
    DEFAULT_MODEL when given.
    """
    model_name = model or require_env("DEFAULT_MODEL")
    run_id = str(uuid.uuid4())
    with TemporaryDirectory(dir="/tmp") as workspace:
        root = Path(workspace)
        deps = WorkspaceDeps(root=root)
        try:
            built = _build_model(model_name)
            result = agent.run_sync(
                prompt,
                deps=deps,
                conversation_id=run_id,
                model=built,
                metadata={"model": model_name},
            )
            # The agent can report done while terraform validate still fails.
            # Re-validate ourselves and feed any error back as a follow-up turn,
            # reusing the run id and message history so each retry is an
            # invoke_agent span under the one invocation trace. Give up after the
            # budget and raise so the failure is honest rather than a clean run
            # over a broken workspace.
            ok, output = _validate(root)
            attempts = 0
            while not ok and attempts < _MAX_VALIDATE_RETRIES:
                attempts += 1
                result = agent.run_sync(
                    _RETRY_PROMPT.format(output=output),
                    deps=deps,
                    conversation_id=run_id,
                    model=built,
                    message_history=result.all_messages(),
                    metadata={"model": model_name},
                )
                ok, output = _validate(root)
            if not ok:
                raise ValidateDidNotConverge(output)
        except Exception as error:
            _persist_run(run_id, root, status="error", error=repr(error))
            raise
        _relock_providers(root)
        _persist_run(run_id, root, status="ok")
    return RunResult(run_id=run_id, model=model_name, output=str(result.output))
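
The caller-side retry in execute can be isolated from pydantic-ai to make its control flow easier to see. The sketch below is an illustration with stand-ins: `converge`, `fake_run`, and `fake_validate` are hypothetical, and the fake workspace simply turns clean after the second run.

```python
from collections.abc import Callable


def converge(
    run: Callable[[str], None],
    validate: Callable[[], tuple[bool, str]],
    first_prompt: str,
    retry_prompt: str,
    max_retries: int,
) -> int:
    """Run once, then re-run with validator feedback until clean or out of budget.

    Returns the number of follow-up runs spent; raises if still failing.
    """
    run(first_prompt)
    ok, output = validate()
    attempts = 0
    while not ok and attempts < max_retries:
        attempts += 1
        run(retry_prompt.format(output=output))
        ok, output = validate()
    if not ok:
        raise RuntimeError(output)
    return attempts


# Stand-in agent: it only records prompts. The fake workspace counts as clean
# once two runs have happened.
prompts: list[str] = []


def fake_run(prompt: str) -> None:
    prompts.append(prompt)


def fake_validate() -> tuple[bool, str]:
    clean = len(prompts) >= 2
    return clean, "" if clean else "Error: missing brace"


spent = converge(fake_run, fake_validate, "write main.tf", "fix this:\n{output}", 3)
```

Here `spent` ends at 1: the first validate fails, one follow-up run receives the error text, and the second validate passes.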

"""Read a required environment variable, failing loudly when it is missing."""

from __future__ import annotations

import os


def require_env(name: str) -> str:
    """Return the value of `name`, or raise if it is unset or empty.

    A missing required variable is a deployment fault, not a runtime branch, so
    we surface it the same way everywhere instead of degrading into a silent
    no-op. Empty is treated as unset: a blank value is never a real config.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"required environment variable {name} is unset")
    return value

"""The Lambda boundary: parse the event, run the agent, shape the response.

This module also owns the INIT wiring. The container CMD targets
agent.lambda_entry.handler, so a unit test that imports agent.core never
configures logfire or registers the Firehose audit processor. Everything
Lambda-specific lives here, off the core, which is why no runtime-detection
check is needed to keep it out of tests.
"""

from __future__ import annotations

from typing import NotRequired, TypedDict

import logfire

from agent import observability
from agent.core import execute


class HandlerEvent(TypedDict):
    prompt: str
    model: NotRequired[str]


class HandlerResponse(TypedDict):
    status: str
    run_id: str
    model: str
    output: str


def handler(event: HandlerEvent, context: object) -> HandlerResponse:
    """Lambda entry point: require a prompt, run the agent, wrap the result.

    ``prompt`` is required; an event without one is a caller error and fails
    fast rather than running a default. ``model`` overrides DEFAULT_MODEL when
    given. A run that does not converge raises, so the Lambda reports 5xx and
    the workspace ships under status error for debugging.
    """
    prompt = event.get("prompt")
    if not prompt:
        raise ValueError("event missing required 'prompt'")
    result = execute(prompt, event.get("model"))
    return {
        "status": "ok",
        "run_id": result.run_id,
        "model": result.model,
        "output": result.output,
    }


def bootstrap() -> None:
    """Stand up telemetry, then attach the Lambda runtime adapter.

    configure() first so the tracer provider exists when the handler is wrapped.
    instrument_aws_lambda wraps the target named by _HANDLER
    (agent.lambda_entry.handler) in place, so each invocation becomes one trace.
    """
    observability.configure()
    logfire.instrument_aws_lambda(handler)


bootstrap()

"""Memory probes for the terraform steps.

Lambda's Max Memory Used is misleading for this function. The terraform steps
download and unpack the AWS provider (~800 MB) into /tmp, and that file IO
fills the kernel page cache, which the cgroup-based billed figure counts but
which the kernel reclaims under pressure, so it is not OOM risk. ``track_memory``
tags a logfire span with three numbers so a run's trace separates real demand
from cache: the sandbox's used memory (MemTotal - MemAvailable, the
non-reclaimable memory that actually risks OOM), the reclaimable page cache
(Cached + Buffers, the bulk of the billed peak), and the peak resident size of
the largest terraform subprocess. Read together they show real demand stays
under ~1 GB while the billed peak runs to ~2 GB of reclaimable cache.
"""

from __future__ import annotations

import resource
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path

import logfire

# /proc/meminfo is always present in the Lambda sandbox; the cgroup memory
# files are not (neither the v2 /sys/fs/cgroup/memory.current nor the v1
# /sys/fs/cgroup/memory/memory.usage_in_bytes is readable there, so reading
# them silently returned None). Values are in kibibytes.
_MEMINFO = Path("/proc/meminfo")


def _memory_snapshot() -> tuple[int, int] | None:
    """(used, cache) bytes for the whole sandbox, or None when /proc/meminfo is absent.

    ``used`` is MemTotal - MemAvailable, the non-reclaimable memory in use (the
    Python runtime plus any live subprocess), which is what actually risks OOM.
    ``cache`` is Cached + Buffers, the reclaimable page cache that file IO on
    /tmp fills; Lambda's Max Memory Used counts it but real demand does not, so
    recording both shows why the billed peak overstates what the function needs.
    Absent locally (macOS) and in tests, where the caller leaves the attributes
    unset.
    """
    try:
        fields = dict(line.split(":", 1) for line in _MEMINFO.read_text().splitlines())
        used_kib = int(fields["MemTotal"].split()[0]) - int(fields["MemAvailable"].split()[0])
        cache_kib = int(fields["Cached"].split()[0]) + int(fields["Buffers"].split()[0])
    except (OSError, KeyError, ValueError, IndexError):
        return None
    return used_kib * 1024, cache_kib * 1024


@contextmanager
def track_memory(step: str) -> Iterator[None]:
    """Run a block in a logfire span tagged with its memory cost.

    On exit, records the sandbox's used memory and reclaimable page cache, and
    the peak resident size of the largest subprocess waited for so far.
    ru_maxrss is reported in kilobytes on Linux, so it is scaled to bytes; it is
    a monotonic high-water mark across all children, not a per-call delta, so
    compare it between steps to see which one grew the subprocess most. Built
    with contextmanager, so it doubles as a decorator: ``@track_memory("relock")``.
    """
    # _span_name forces the OTel span name to the interpolated value, so the
    # trace reads "memory.terraform_validate". Without it logfire keeps the
    # low-cardinality template "memory.{step}" as the name (its f-string magic
    # reconstructs the template), leaving the per-step value only in the attribute.
    with logfire.span("memory.{step}", _span_name=f"memory.{step}", step=step) as span:
        try:
            yield
        finally:
            if (snapshot := _memory_snapshot()) is not None:
                used, cache = snapshot
                span.set_attribute("mem.used_bytes", used)
                span.set_attribute("mem.cache_bytes", cache)
            child_peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
            span.set_attribute("mem.child_max_rss_bytes", child_peak_kb * 1024)
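
To make the used versus cache split concrete, the sketch below applies the same arithmetic to a sample /proc/meminfo text. The sample values are invented for illustration and are not measurements from the post; `used_and_cache_bytes` is a hypothetical standalone variant of the snapshot function that takes the text as an argument.

```python
# Invented sample values, in KiB, shaped like /proc/meminfo output.
SAMPLE_MEMINFO = """\
MemTotal:        3008000 kB
MemFree:          512000 kB
MemAvailable:    2200000 kB
Buffers:           40000 kB
Cached:          1700000 kB
"""


def used_and_cache_bytes(meminfo_text: str) -> tuple[int, int]:
    """Return (used, cache) in bytes from /proc/meminfo-style text."""
    fields: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        fields[name] = int(rest.split()[0])  # first token is the value in KiB
    used_kib = fields["MemTotal"] - fields["MemAvailable"]  # non-reclaimable
    cache_kib = fields["Cached"] + fields["Buffers"]  # reclaimable page cache
    return used_kib * 1024, cache_kib * 1024


used, cache = used_and_cache_bytes(SAMPLE_MEMINFO)
print(f"used={used / 2**20:.0f} MiB cache={cache / 2**20:.0f} MiB")
```

With these sample numbers the reclaimable cache is roughly twice the used memory, which is the kind of gap the probe is meant to expose.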

"""Build a pydantic-ai model from the SSM-backed registry, per invocation."""

from __future__ import annotations

import json
import os
from functools import cache

from httpx import AsyncClient, HTTPStatusError, Response
from pydantic_ai.models import Model
from pydantic_ai.models.bedrock import BedrockConverseModel
from pydantic_ai.models.mistral import MistralModel
from pydantic_ai.providers.mistral import MistralProvider
from pydantic_ai.retries import AsyncTenacityTransport, RetryConfig, wait_retry_after
from tenacity import retry_if_exception_type, stop_after_attempt, wait_exponential

from agent.env import require_env
from agent.ssm import fetch_parameter

# Rate limit and transient gateway errors are worth retrying; auth and bad
# request fail fast so a real problem is not retried five times.
_RETRYABLE_STATUS = frozenset({429, 502, 503, 504})


def _raise_for_retryable(response: Response) -> None:
    if response.status_code in _RETRYABLE_STATUS:
        response.raise_for_status()


def _retrying_http_client() -> AsyncClient:
    """An httpx client that retries rate-limit and transient errors.

    pydantic-ai does not retry transport errors itself, so a rate-limited
    Mistral call (the write/init/validate loop can burst past the free-tier
    per-second cap) would fail the whole run. wait_retry_after honours the
    Retry-After header Mistral sends on a 429, falling back to exponential
    backoff. Bedrock uses boto3 with its own retry config, so this is the
    Mistral client only.
    """
    transport = AsyncTenacityTransport(
        config=RetryConfig(
            retry=retry_if_exception_type(HTTPStatusError),
            wait=wait_retry_after(
                fallback_strategy=wait_exponential(multiplier=1, max=60),
                max_wait=300,
            ),
            stop=stop_after_attempt(5),
            reraise=True,
        ),
        validate_response=_raise_for_retryable,
    )
    return AsyncClient(transport=transport)


@cache
def _build_model(name: str) -> Model:
    """Build the pydantic-ai model registered under ``name``.

    The registry is an SSM String parameter (MODELS_PARAMETER): each entry names
    a provider and model id, and Bedrock entries carry the inference-profile ARN.
    Bedrock authenticates via the Lambda role; Mistral reads an API key from a
    SecureString. Memoised per name, so the lookup is one GetParameter per
    container.
    """
    registry = json.loads(fetch_parameter(require_env("MODELS_PARAMETER")))
    config = registry[name]
    provider = config["provider"]
    if provider == "bedrock":
        return BedrockConverseModel(
            config["model_id"],
            settings={"bedrock_inference_profile": config["inference_profile_arn"]},
        )
    if provider == "mistral":
        key_param = os.environ.get("MISTRAL_API_KEY_PARAMETER")
        if not key_param:
            raise RuntimeError(
                f"model {name!r} uses the Mistral API, but MISTRAL_API_KEY_PARAMETER "
                "is not set. Set MISTRAL_API_KEY and re-apply so the key is wired, or "
                "select a Bedrock model via DEFAULT_MODEL or the event's model field."
            )
        return MistralModel(
            config["model_id"],
            provider=MistralProvider(
                api_key=fetch_parameter(key_param),
                http_client=_retrying_http_client(),
            ),
        )
    raise ValueError(f"unknown provider {provider!r} for model {name!r}")
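
For orientation, a registry document that this function could read might look like the sketch below. Only the key names (provider, model_id, inference_profile_arn) come from the code above; the entry names, model ids, and ARN are placeholders, and `provider_for` is a hypothetical pre-flight check, not part of the post's code.

```python
import json

# Placeholder registry: entry names, model ids, and the ARN are invented.
REGISTRY_JSON = """
{
  "example-bedrock": {
    "provider": "bedrock",
    "model_id": "example-bedrock-model-id",
    "inference_profile_arn": "arn:aws:bedrock:eu-west-1:123456789012:inference-profile/example"
  },
  "example-mistral": {
    "provider": "mistral",
    "model_id": "example-mistral-model-id"
  }
}
"""

# Keys each provider branch reads from its registry entry.
_REQUIRED_KEYS = {
    "bedrock": {"model_id", "inference_profile_arn"},
    "mistral": {"model_id"},
}


def provider_for(registry: dict, name: str) -> str:
    """Return the provider for `name`, checking the keys that provider needs."""
    if name not in registry:
        raise KeyError(f"model {name!r} is not in the registry; known: {sorted(registry)}")
    entry = registry[name]
    provider = entry.get("provider")
    if provider not in _REQUIRED_KEYS:
        raise ValueError(f"unknown provider {provider!r} for model {name!r}")
    missing = _REQUIRED_KEYS[provider] - set(entry)
    if missing:
        raise ValueError(f"model {name!r} is missing {sorted(missing)}")
    return provider


registry = json.loads(REGISTRY_JSON)
print(provider_for(registry, "example-bedrock"))
```

Checking the shape up front would turn a malformed entry into a readable error instead of a KeyError deep inside model construction.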

"""Telemetry pipeline: structured logs, the per-trace audit copy, and EMF metrics.

``configure()`` runs at INIT (see lambda_entry.py) so the tracer provider and the
audit processor exist before instrument_aws_lambda opens the first invocation
span. The audit copy ships from inside the processor when the trace's root span
ends, so the handler needs no flush logic.
"""

from __future__ import annotations

import json
12
import os
13
import threading
14
from collections.abc import Callable, Sequence
15

16
import boto3
17
import logfire
18
import structlog
19
from google.protobuf import json_format
20
from logfire.sampling import SamplingOptions
21
from opentelemetry.context import (
22
    _SUPPRESS_INSTRUMENTATION_KEY,
23
    attach,
24
    detach,
25
    set_value,
26
)
27
from opentelemetry.exporter.otlp.proto.common._internal.trace_encoder import (
28
    encode_spans,
29
)
30
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor
31
from opentelemetry.trace.status import StatusCode
32

33
from agent.env import require_env
34
from agent.ssm import fetch_parameter
35

36
# JSON logs to stdout, which CloudWatch Logs ingests as-is. The same stream
37
# carries the EMF metric envelope (see _emit_emf), so one structured sink covers
38
# both logs and metrics.
39
structlog.configure(
40
    processors=[
41
        structlog.processors.add_log_level,
42
        structlog.processors.TimeStamper(fmt="iso"),
43
        structlog.processors.EventRenamer("message"),
44
        structlog.processors.JSONRenderer(),
45
    ],
46
    logger_factory=structlog.PrintLoggerFactory(),
47
    cache_logger_on_first_use=True,
48
)
49
log = structlog.get_logger()
50

51

52
class PerTraceAuditProcessor(SpanProcessor):
53
    """Buffer spans by trace_id, ship as one batch when the local root ends.
54

55
    The OTel SDK has no `OnTraceComplete` hook, so this implements it against the
56
    only signal available: `on_end` fires synchronously, and a span is this
57
    process's local root when it has no parent, or a remote parent (context
58
    propagated in, e.g. from API Gateway), which `SpanContext.is_remote` marks.
59
    Late children (ended on a transport thread after the root shipped) are
60
    dropped, mirroring logfire's tail sampler. See pydantic/logfire#1034.
61
    """
62

63
    def __init__(
64
        self,
65
        on_trace_complete: Callable[[Sequence[ReadableSpan]], None],
66
    ) -> None:
67
        self._on_trace_complete = on_trace_complete
68
        self._buffers: dict[int, list[ReadableSpan]] = {}
69
        self._shipped: set[int] = set()
70
        self._lock = threading.Lock()
71

72
    def on_end(self, span: ReadableSpan) -> None:
73
        if not (span.context and span.context.trace_flags.sampled):
74
            return
75
        trace_id = span.context.trace_id
76
        with self._lock:
77
            if trace_id in self._shipped:
78
                return
79
            self._buffers.setdefault(trace_id, []).append(span)
80
            if span.parent is not None and not span.parent.is_remote:
81
                return
82
            spans = self._buffers.pop(trace_id)
83
            self._shipped.add(trace_id)
84
        self._ship(spans)
85

86
    def force_flush(self, timeout_millis: int = 30000) -> bool:
87
        with self._lock:
88
            pending = list(self._buffers.values())
89
            self._shipped.update(self._buffers)
90
            self._buffers.clear()
91
        for spans in pending:
92
            self._ship(spans)
93
        return True
94

95
    def shutdown(self) -> None:
96
        self.force_flush()
97

98
    def _ship(self, spans: Sequence[ReadableSpan]) -> None:
99
        # Suppress instrumentation around the callback so an instrumented
100
        # boto3 client inside it does not emit a span that re-enters on_end.
101
        token = attach(set_value(_SUPPRESS_INSTRUMENTATION_KEY, True))
102
        try:
103
            self._on_trace_complete(spans)
104
        finally:
105
            detach(token)
106

107

108

109

110
_firehose = boto3.client("firehose")
111
_DELIVERY_STREAM = require_env("FIREHOSE_DELIVERY_STREAM")
112

113

114
def _ship_trace(spans: Sequence[ReadableSpan]) -> None:
115
    """Serialise one trace as OTLP-JSON and ship it as a single Firehose record."""
116
    payload = json_format.MessageToJson(encode_spans(spans), indent=None) + "\n"
117
    _firehose.put_record(
118
        DeliveryStreamName=_DELIVERY_STREAM,
119
        Record={"Data": payload.encode("utf-8")},
120
    )
121


def _emf_record(span: ReadableSpan) -> dict:
    """Build the EMF envelope for one agent-run span.

    pydantic-ai records gen_ai.usage.* on the agent-run span as the run total,
    so a single read is the correct total. The model dimension is the registry
    key the handler passed as run metadata, read back so a Bedrock run and a
    Mistral run land on one set of widgets.
    """
    attributes = span.attributes or {}
    model = json.loads(attributes["metadata"]).get("model", "unknown")
    errored = span.status.status_code is StatusCode.ERROR
    return {
        "_aws": {
            "Timestamp": span.end_time // 1_000_000,
            "CloudWatchMetrics": [
                {
                    "Namespace": require_env("METRICS_NAMESPACE"),
                    "Dimensions": [["Model"]],
                    "Metrics": [
                        {"Name": "InputTokens", "Unit": "Count"},
                        {"Name": "OutputTokens", "Unit": "Count"},
                        {"Name": "CacheReadTokens", "Unit": "Count"},
                        {"Name": "CacheWriteTokens", "Unit": "Count"},
                        {"Name": "Latency", "Unit": "Milliseconds"},
                        {"Name": "Invocations", "Unit": "Count"},
                        {"Name": "Errors", "Unit": "Count"},
                    ],
                }
            ],
        },
        "Model": model,
        "InputTokens": attributes.get("gen_ai.usage.input_tokens", 0),
        "OutputTokens": attributes.get("gen_ai.usage.output_tokens", 0),
        # pydantic-ai sets these only when non-zero; providers without prompt
        # caching (the Mistral API) never report them, so default to 0.
        "CacheReadTokens": attributes.get("gen_ai.usage.cache_read.input_tokens", 0),
        "CacheWriteTokens": attributes.get("gen_ai.usage.cache_creation.input_tokens", 0),
        "Latency": (span.end_time - span.start_time) / 1_000_000,
        "Invocations": 1,
        "Errors": 1 if errored else 0,
    }


def _emit_emf(spans: Sequence[ReadableSpan]) -> None:
    """Emit one EMF metric line per agent run in the trace.

    Off Bedrock there are no AWS/Bedrock metrics, so the dashboard reads these.
    The trace root is the Lambda invocation span, so metrics come off the
    agent-run spans nested under it, found by the `metadata` attribute
    pydantic-ai stamps on every run (a caller-side retry holds several, so one
    line each). CloudWatch Logs extracts the metrics from the structured line.
    """
    for span in spans:
        if span.attributes and "metadata" in span.attributes:
            log.info("trace_metrics", **_emf_record(span))


def _on_trace_complete(spans: Sequence[ReadableSpan]) -> None:
    """Ship the audit copy, then emit metrics: one hook, two sinks."""
    _ship_trace(spans)
    _emit_emf(spans)


def _logfire_token() -> str | None:
    """Logfire token from SSM, or None when the integration is not wired."""
    name = os.environ.get("LOGFIRE_TOKEN_PARAMETER")
    return fetch_parameter(name) if name else None


def configure() -> None:
    """Wire logfire at INIT: register the audit processor and instrument pydantic-ai.

    head=1.0 / tail=None because this is an audit pipeline: every trace must
    reach S3, and volume is low (one trace per invocation). Splitting the rates
    (say 1% to Logfire, 100% to S3) is possible with an extra sampler.
    include_content=True keeps the audit copy useful for forensics; flip it to
    False if prompts ever carry PII or secrets. version=5 pins the GenAI span
    schema so the audit copy stays stable across pydantic-ai releases.
    """
    if token := _logfire_token():
        os.environ["LOGFIRE_TOKEN"] = token
    logfire.configure(
        send_to_logfire="if-token-present",
        sampling=SamplingOptions(head=1.0, tail=None),
        additional_span_processors=[PerTraceAuditProcessor(_on_trace_complete)],
    )
    logfire.instrument_pydantic_ai(version=5, include_content=True)
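
The buffering rule in `on_end` can be tried without OpenTelemetry installed. What follows is a minimal sketch, not the processor from this post: `FakeSpan`, `FakeParent`, and `PerTraceBuffer` are hypothetical stand-ins, and only the rule itself (hold spans per trace, ship once the local root ends, drop anything that arrives afterwards) is taken from the listing above.

```python
from dataclasses import dataclass
from threading import Lock
from typing import Callable, Optional


@dataclass(frozen=True)
class FakeParent:
    """Hypothetical stand-in for a parent span context."""

    is_remote: bool


@dataclass(frozen=True)
class FakeSpan:
    """Hypothetical stand-in for an ended span."""

    name: str
    trace_id: int
    parent: Optional[FakeParent] = None


class PerTraceBuffer:
    """Buffer ended spans per trace and hand them over once the local root ends."""

    def __init__(self, on_trace_complete: Callable[[list], None]) -> None:
        self._on_trace_complete = on_trace_complete
        self._lock = Lock()
        self._buffers: dict = {}
        self._shipped: set = set()

    def on_end(self, span: FakeSpan) -> None:
        with self._lock:
            if span.trace_id in self._shipped:
                return  # the trace already shipped; late spans are dropped
            self._buffers.setdefault(span.trace_id, []).append(span)
            if span.parent is not None and not span.parent.is_remote:
                return  # a local child: keep buffering
            spans = self._buffers.pop(span.trace_id)
            self._shipped.add(span.trace_id)
        # Call the hook outside the lock so a slow sink does not block on_end.
        self._on_trace_complete(spans)


shipped: list = []
buffer = PerTraceBuffer(shipped.append)
buffer.on_end(FakeSpan("tool call", 1, FakeParent(is_remote=False)))
buffer.on_end(FakeSpan("agent run", 1, FakeParent(is_remote=False)))
buffer.on_end(FakeSpan("lambda invocation", 1, FakeParent(is_remote=True)))
buffer.on_end(FakeSpan("late span", 1, FakeParent(is_remote=False)))
```

A span with no parent, or with a remote parent, counts as the local root. In this sketch that is the Lambda invocation span, so its end triggers the single hand-over of all three buffered spans.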

"""Park a run's workspace in S3, and re-lock providers before it ships."""

from __future__ import annotations

import json
import subprocess
from collections.abc import Iterator
from pathlib import Path

import boto3

from agent.env import require_env
from agent.memory import track_memory
from agent.observability import log


def _persist_run(run_id: str, workspace: Path, status: str, error: str | None = None) -> None:
    """Park the workspace and a minimal result marker under runs/<run_id>/.

    result.json carries only what the audit trace does not: prompt, output, and
    messages already live in the trace under the same conversation id, so copying
    them here would create a second source of truth.
    """
    bucket = require_env("RUNS_BUCKET")
    s3 = boto3.client("s3")
    for file in _workspace_files(workspace):
        key = f"runs/{run_id}/workspace/{file.relative_to(workspace)}"
        s3.put_object(Bucket=bucket, Key=key, Body=file.read_bytes())
    result = {"status": status} | ({"error": error} if error else {})
    s3.put_object(
        Bucket=bucket,
        Key=f"runs/{run_id}/result.json",
        Body=json.dumps(result).encode(),
    )


def _workspace_files(workspace: Path) -> Iterator[Path]:
    """Every file except .terraform/, which is init scratch plus the provider
    downloaded into /tmp, gigabytes of noise per run. The top-level
    .terraform.lock.hcl is the reproducibility record and stays.
    """
    for path in sorted(workspace.rglob("*")):
        if ".terraform" in path.relative_to(workspace).parts:
            continue
        if path.is_file():
            yield path


# init inside the arm64 Lambda locks only linux_arm64, so a reviewer or CI on
# another platform hits a checksum error. Re-lock the platforms they are likely
# to run before the workspace ships. One call per platform rather than one with
# three -platform flags: `providers lock` is additive, so each call merges its
# platform and leaves the others, and a fresh process per platform keeps the
# peak to one platform's footprint. Best-effort and independent: a failure on
# one is logged and the rest still run.
_LOCK_PLATFORMS = ("linux_amd64", "linux_arm64", "darwin_arm64")


def _relock_providers(workspace: Path) -> None:
    if not (workspace / ".terraform.lock.hcl").exists():
        return
    for platform in _LOCK_PLATFORMS:
        with track_memory(f"relock.{platform}"):
            result = subprocess.run(
                ["terraform", "providers", "lock", "-no-color", f"-platform={platform}"],
                cwd=workspace,
                capture_output=True,
                text=True,
            )
        if result.returncode != 0:
            log.warning(
                "provider re-lock failed",
                platform=platform,
                stdout=result.stdout,
                stderr=result.stderr,
            )
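
The filter in `_workspace_files` matches whole path components, which is why the `.terraform/` directory is skipped while `.terraform.lock.hcl` survives. A small sketch of that behaviour against a throwaway directory; the file names and contents below are made up for illustration, and the function body is copied from the listing above under a public name.

```python
import tempfile
from collections.abc import Iterator
from pathlib import Path


def workspace_files(workspace: Path) -> Iterator[Path]:
    """Yield every file except those under a .terraform/ directory."""
    for path in sorted(workspace.rglob("*")):
        # .parts compares whole components, so ".terraform.lock.hcl" is not
        # mistaken for the ".terraform" directory.
        if ".terraform" in path.relative_to(workspace).parts:
            continue
        if path.is_file():
            yield path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical workspace contents.
    (root / "main.tf").write_text("# agent-written HCL\n")
    (root / ".terraform.lock.hcl").write_text("# provider lock file\n")
    (root / ".terraform" / "providers").mkdir(parents=True)
    (root / ".terraform" / "providers" / "plugin.bin").write_text("binary")
    kept = [str(p.relative_to(root)) for p in workspace_files(root)]

print(kept)
```

Only the two top-level files are kept; everything below `.terraform/` is skipped regardless of depth.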

"""SSM Parameter Store access, shared by the model registry and the Logfire token fetch."""

from __future__ import annotations

import boto3


def fetch_parameter(name: str) -> str:
    """Read one parameter from SSM Parameter Store.

    WithDecryption is a no-op on a plain String, so this covers String and
    SecureString alike. A direct GetParameter call, because container-image
    Lambdas cannot attach the Parameters and Secrets extension layer the zip
    package used. The client is built per call so moto can intercept it in tests.
    """
    response = boto3.client("ssm").get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]
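
Building the client on every call is what makes this function easy to fake. The sketch below assumes a hypothetical `client_factory` parameter and a hand-rolled `FakeSSM` in place of boto3 or moto; neither exists in the post's code, and the parameter name is invented. Only the `Parameter` / `Value` response lookup mirrors the listing.

```python
from typing import Callable


class FakeSSM:
    """Hypothetical in-memory stand-in for the SSM client."""

    def __init__(self, values: dict) -> None:
        self._values = values
        self.calls: list = []

    def get_parameter(self, Name: str, WithDecryption: bool) -> dict:
        self.calls.append((Name, WithDecryption))
        return {"Parameter": {"Name": Name, "Value": self._values[Name]}}


def fetch_parameter(name: str, client_factory: Callable[[], FakeSSM]) -> str:
    # The client is created per call, as in the listing, so a test can swap
    # the factory without touching module-level state.
    response = client_factory().get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]


fake = FakeSSM({"/hypothetical/logfire-token": "token-123"})
value = fetch_parameter("/hypothetical/logfire-token", lambda: fake)
```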

"""Tool surface for the terraform-pr-agent.

Each function below is registered on the Agent in handler.py and exposed
to the model as a callable tool.

All tools take ``RunContext[WorkspaceDeps]`` so the workspace root is
threaded through ``ctx.deps.root`` instead of read from a module global.
That keeps tests, evals, and multi-tenant runs from sharing state.
"""

from __future__ import annotations

import subprocess
from pathlib import Path

from pydantic import BaseModel, ConfigDict, Field
from pydantic_ai import ModelRetry, RunContext

from agent.memory import track_memory


class WorkspaceDeps(BaseModel):
    """Per-run dependencies threaded through ``RunContext``.

    ``root`` is the directory the agent is allowed to read, write, and
    validate inside. Tools resolve every path relative to it and reject
    anything that escapes the root, so the agent cannot reach outside
    the workspace via ``..`` or absolute paths.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)

    root: Path
    files_read: set[Path] = Field(default_factory=set)


def list_files(ctx: RunContext[WorkspaceDeps], path: str = ".") -> list[str]:
    """List files under ``path`` relative to the workspace root."""
    return [
        str(p.relative_to(ctx.deps.root)) for p in _resolve_absolute_folder(ctx, path).iterdir()
    ]


def read_file(ctx: RunContext[WorkspaceDeps], path: str) -> str:
    """Read the file at ``path`` and return its contents."""
    file = _resolve_absolute_file(ctx, path)
    if not file.exists():
        raise ModelRetry(f"File {path} does not exist.")
    with file.open() as f:
        ctx.deps.files_read.add(file)
        return f.read()


def write_file(ctx: RunContext[WorkspaceDeps], path: str, contents: str) -> None:
    """Create or overwrite the file at ``path`` with ``contents``."""
    file = _resolve_absolute_path(ctx, path)
    if file.exists() and file not in ctx.deps.files_read:
        raise ModelRetry(
            f"File {path} already exists. If you want to overwrite it, then delete it first."
        )
    with file.open("w") as f:
        # agent wrote it and knows the content
        ctx.deps.files_read.add(file)
        f.write(contents)


def edit_file(
    ctx: RunContext[WorkspaceDeps],
    path: str,
    old_string: str,
    new_string: str,
) -> None:
    """Replace ``old_string`` with ``new_string`` in the file at ``path``."""
    file = _resolve_absolute_file(ctx, path)
    if not file.exists():
        raise ModelRetry(f"File {path} does not exist.")
    if file not in ctx.deps.files_read:
        raise ModelRetry(f"File {path} was not read. If you want to edit it, then read it first.")
    with file.open("r+") as f:
        contents = f.read()
        if old_string not in contents:
            raise ModelRetry(f"String {old_string} not found in file {path}.")
        if contents.count(old_string) > 1:
            raise ModelRetry(f"String {old_string} found more than once in file {path}.")
        f.seek(0)
        f.write(contents.replace(old_string, new_string))
        f.truncate()


def delete_file(ctx: RunContext[WorkspaceDeps], path: str) -> None:
    """Delete the file at ``path``."""
    file = _resolve_absolute_file(ctx, path)
    if file not in ctx.deps.files_read:
        raise ModelRetry(f"File {path} was not read. If you want to delete it, then read it first.")
    file.unlink()


def terraform_init(ctx: RunContext[WorkspaceDeps]) -> str:
    """Run ``terraform init`` in the workspace.

    Required once before the first ``terraform_validate`` and again after
    provider or module requirements change.
    """
    with track_memory("terraform_init"):
        init = subprocess.run(
            ["terraform", "init", "-backend=false", "-input=false", "-no-color"],
            cwd=ctx.deps.root,
            capture_output=True,
            text=True,
        )
    if init.returncode != 0:
        raise ModelRetry(f"terraform init failed:\n{init.stdout}{init.stderr}")
    return "OK: terraform init completed."


def _validate(root: Path) -> tuple[bool, str]:
    """Run ``terraform validate`` in ``root``; return (passed, combined output).

    The agent reaches this through the tool below; the caller-side retry in
    handler.py calls it directly to re-check the workspace after the run.
    """
    with track_memory("terraform_validate"):
        result = subprocess.run(
            ["terraform", "validate", "-no-color"],
            cwd=root,
            capture_output=True,
            text=True,
        )
    return result.returncode == 0, f"{result.stdout}{result.stderr}"


def terraform_validate(ctx: RunContext[WorkspaceDeps]) -> str:
    """Run ``terraform validate`` in the workspace and return its output."""
    ok, output = _validate(ctx.deps.root)
    if ok:
        return "OK: terraform validate passed."
    raise ModelRetry(f"terraform validate failed:\n{output}")


def _resolve_absolute_folder(ctx: RunContext[WorkspaceDeps], path: str) -> Path:
    if not (absolute_path := _resolve_absolute_path(ctx, path)).is_dir():
        raise ModelRetry(f"Path {path} must be a directory.")
    return absolute_path


def _resolve_absolute_file(ctx: RunContext[WorkspaceDeps], path: str) -> Path:
    if not (absolute_path := _resolve_absolute_path(ctx, path)).is_file():
        raise ModelRetry(f"Path {path} must be a file.")
    return absolute_path


def _resolve_absolute_path(ctx: RunContext[WorkspaceDeps], path: str) -> Path:
    root = ctx.deps.root.resolve()
    absolute_path = (root / Path(path)).resolve()
    if not absolute_path.is_relative_to(root):
        raise ModelRetry(
            f"Path {path} must be relative to the workspace root, "
            f"it cannot be absolute or walk up the directory tree."
        )
    return absolute_path
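
The sandbox boundary rests entirely on the resolve-then-compare check in `_resolve_absolute_path`. Here is a standalone sketch of that check, assuming a plain `ValueError` in place of `ModelRetry` so it runs without pydantic-ai; `resolve_in_root`, `is_allowed`, and the probe paths are hypothetical names chosen for the example.

```python
import tempfile
from pathlib import Path


def resolve_in_root(root: Path, path: str) -> Path:
    """Resolve ``path`` against ``root`` and refuse anything outside it."""
    root = root.resolve()
    # Joining with an absolute path discards the root, and resolve() folds
    # away "..", so both escape routes end up outside the root.
    candidate = (root / Path(path)).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"Path {path} escapes the workspace root.")
    return candidate


def is_allowed(root: Path, path: str) -> bool:
    try:
        resolve_in_root(root, path)
    except ValueError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "workspace"
    workspace.mkdir()
    probes = ("main.tf", "modules/../main.tf", "../secrets.tf", "/etc/passwd")
    results = {probe: is_allowed(workspace, probe) for probe in probes}

print(results)
```

On a POSIX system the first two probes are accepted and the last two are rejected. A `..` that stays inside the root is harmless, because the comparison runs on the resolved path rather than on the raw string.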

locals {
  cloudwatch_region = data.aws_region.current.region
  dashboard_name    = "terraform-pr-agent"
  lambda_name       = aws_lambda_function.agent.function_name

  # The model widgets read the EMF metrics the handler emits (namespace
  # local.metrics_namespace, dimensioned by Model), not AWS/Bedrock, so a
  # Bedrock model and a Mistral-API model land in the same widgets. One line
  # per registry model, built by iterating keys(local.models).
  model_keys = keys(local.models)
}

resource "aws_cloudwatch_dashboard" "agent" {
  dashboard_name = local.dashboard_name
  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "text"
        x      = 0
        y      = 0
        width  = 24
        height = 2
        properties = {
          markdown = "## Lambda\nContainer-image function health: invocations and errors, end to end duration, cold start init duration (Lambda Insights emits `init_duration` only on a cold start), and the memory and `/tmp` footprint behind the `memory_size` and `ephemeral_storage` sizing."
        }
      },
      {
        type   = "metric"
        x      = 0
        y      = 2
        width  = 12
        height = 6
        properties = {
          title  = "Lambda invocations and errors"
          region = local.cloudwatch_region
          view   = "timeSeries"
          stat   = "Sum"
          period = 60
          metrics = [
            ["AWS/Lambda", "Invocations", "FunctionName", local.lambda_name, { label = "${local.lambda_name} / invocations" }],
            [".", "Errors", ".", ".", { label = "${local.lambda_name} / errors" }],
            [".", "Throttles", ".", ".", { label = "${local.lambda_name} / throttles" }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 2
        width  = 12
        height = 6
        properties = {
          title  = "Lambda duration (ms)"
          region = local.cloudwatch_region
          view   = "timeSeries"
          period = 60
          metrics = [
            ["AWS/Lambda", "Duration", "FunctionName", local.lambda_name, { label = "${local.lambda_name} / avg", stat = "Average" }],
            [".", ".", ".", ".", { label = "${local.lambda_name} / p99", stat = "p99" }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 0
        y      = 8
        width  = 12
        height = 6
        properties = {
          title  = "Cold start init duration (ms)"
          region = local.cloudwatch_region
          view   = "timeSeries"
          period = 60
          metrics = [
            # Insights reports init_duration only when an init phase happened,
            # so points appear only on cold starts.
            ["LambdaInsights", "init_duration", "function_name", local.lambda_name, { label = "${local.lambda_name} / init avg (ms)", stat = "Average" }],
            [".", ".", ".", ".", { label = "${local.lambda_name} / init max (ms)", stat = "Maximum" }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 8
        width  = 12
        height = 6
        properties = {
          title  = "Memory used (MB)"
          region = local.cloudwatch_region
          view   = "timeSeries"
          period = 60
          metrics = [
            # used_memory_max is the cgroup figure (Max Memory Used), which
            # counts the reclaimable /tmp page cache and so reads ~2 GB while
            # real demand is under 1 GB. Backs the memory_size comment in
            # lambda.tf.
            ["LambdaInsights", "used_memory_max", "function_name", local.lambda_name, { label = "${local.lambda_name} / memory max (MB)", stat = "Maximum" }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 0
        y      = 14
        width  = 12
        height = 6
        properties = {
          title  = "/tmp used (bytes)"
          region = local.cloudwatch_region
          view   = "timeSeries"
          period = 60
          metrics = [
            # tmp_used tracks the ~800 MB provider download into /tmp, behind
            # the 4 GB ephemeral_storage sizing in lambda.tf.
            ["LambdaInsights", "tmp_used", "function_name", local.lambda_name, { label = "${local.lambda_name} / tmp used (B)", stat = "Maximum" }],
          ]
        }
      },
      {
        type   = "text"
        x      = 0
        y      = 20
        width  = 24
        height = 2
        properties = {
          markdown = "## Model\nPer-model usage from the handler's EMF metrics (namespace `local.metrics_namespace`, dimensioned by `Model`), so Bedrock and Mistral-API models share one set of widgets: tokens, invocations and errors, latency, and cache reads and writes."
        }
      },
      {
        type   = "metric"
        x      = 0
        y      = 22
        width  = 12
        height = 6
        properties = {
          title  = "Tokens"
          region = local.cloudwatch_region
          view   = "timeSeries"
          stat   = "Sum"
          period = 60
          metrics = [
            for m in flatten([
              for key in local.model_keys : [
                { metric = "InputTokens", label = "${key} / input", key = key },
                { metric = "OutputTokens", label = "${key} / output", key = key },
              ]
            ]) : [local.metrics_namespace, m.metric, "Model", m.key, { label = m.label }]
          ]
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 22
        width  = 12
        height = 6
        properties = {
          title  = "Invocations and errors"
          region = local.cloudwatch_region
          view   = "timeSeries"
          stat   = "Sum"
          period = 60
          metrics = [
            for m in flatten([
              for key in local.model_keys : [
                { metric = "Invocations", label = "${key} / invocations", key = key },
                { metric = "Errors", label = "${key} / errors", key = key },
              ]
            ]) : [local.metrics_namespace, m.metric, "Model", m.key, { label = m.label }]
          ]
        }
      },
      {
        type   = "metric"
        x      = 0
        y      = 28
        width  = 12
        height = 6
        properties = {
          title  = "Latency (ms)"
          region = local.cloudwatch_region
          view   = "timeSeries"
          period = 60
          metrics = [
            for m in flatten([
              for key in local.model_keys : [
                { label = "${key} / avg", key = key, stat = "Average" },
                { label = "${key} / p99", key = key, stat = "p99" },
              ]
            ]) : [local.metrics_namespace, "Latency", "Model", m.key, { label = m.label, stat = m.stat }]
          ]
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 28
        width  = 12
        height = 6
        properties = {
          title  = "Cache tokens"
          region = local.cloudwatch_region
          view   = "timeSeries"
          stat   = "Sum"
          period = 60
          metrics = [
            for m in flatten([
              for key in local.model_keys : [
                { metric = "CacheReadTokens", label = "${key} / cache read", key = key },
                { metric = "CacheWriteTokens", label = "${key} / cache write", key = key },
              ]
            ]) : [local.metrics_namespace, m.metric, "Model", m.key, { label = m.label }]
          ]
        }
      },
    ]
  })
}

output "cloudwatch_dashboard_url" {
  value = format(
    "https://%s.console.aws.amazon.com/cloudwatch/home?region=%s#dashboards/dashboard/%s",
    local.cloudwatch_region,
    local.cloudwatch_region,
    aws_cloudwatch_dashboard.agent.dashboard_name,
  )
}
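
The `for` / `flatten` expressions in the model widgets are easier to follow once expanded. The sketch below mirrors the Tokens widget in Python rather than HCL, purely as an illustration of what the expression evaluates to; the two model keys and the namespace value are hypothetical placeholders, not the registry or namespace used in the post.

```python
# Hypothetical stand-ins for keys(local.models) and local.metrics_namespace.
model_keys = ["bedrock-model", "mistral-model"]
metrics_namespace = "ExampleNamespace"

# Inner loop: two entries per model key. Outer comprehension: one flat list,
# which is what flatten() produces in the HCL version.
rows = [
    [metrics_namespace, metric, "Model", key, {"label": f"{key} / {suffix}"}]
    for key in model_keys
    for metric, suffix in (("InputTokens", "input"), ("OutputTokens", "output"))
]

for row in rows:
    print(row)
```

Each registry model contributes two metric rows, so adding a model to the registry adds lines to every model widget without touching the dashboard definition.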

resource "aws_ecr_repository" "agent" {
  name = "terraform-pr-agent"

  # Deploys re-push the :latest and :placeholder tags in place, the same
  # out-of-band code-ship pattern as the zip flow this replaces. Immutable
  # tags would force a fresh tag per build.
  #trivy:ignore:avd-aws-0031
  image_tag_mutability = "MUTABLE"

  # Tutorial teardown: terraform destroy must succeed while images exist.
  force_delete = true

  image_scanning_configuration {
    scan_on_push = true
  }

  # The AWS-managed AES256 key is enough here: the image holds no secret
  # material, and a CMK adds cost and key-policy surface for nothing.
  #trivy:ignore:avd-aws-0033
  encryption_configuration {
    encryption_type = "AES256"
  }
}

# Container twin of the zip flow's archive_file placeholder: the function
# resource needs a pullable image at create time, so terraform seeds a
# minimal one. Create-only (input never changes), so scripts/build-lambda.sh
# owns every push after this.
resource "terraform_data" "placeholder_image" {
  input = aws_ecr_repository.agent.repository_url

  # Needs docker and the aws cli on the machine running apply; both are
  # already prerequisites for the series.
  provisioner "local-exec" {
    command = <<-EOT
      aws ecr get-login-password --region ${data.aws_region.current.region} |
        docker login --username AWS --password-stdin ${split("/", aws_ecr_repository.agent.repository_url)[0]}
      docker buildx build --platform linux/arm64 --provenance=false \
        -t ${aws_ecr_repository.agent.repository_url}:placeholder \
        --push ${path.module}/placeholder
    EOT
  }
}

data "aws_iam_policy_document" "lambda_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

# trivy:ignore:avd-aws-0057
# Bedrock foundation-model ARNs do not pin to the caller region (the inference
# profile fans out cross-region), and Marketplace subscription actions are
# global by design.
data "aws_iam_policy_document" "lambda_permissions" {
  # Bedrock invocation. Same shape as iam.tf's bedrock_invoke; copied
  # here so the Lambda role is self-contained and does not require
  # the user-role policy to also be attached to the Lambda role.
  statement {
    actions = [
      "bedrock:Converse",
      "bedrock:ConverseStream",
      "bedrock:InvokeModel",
      "bedrock:InvokeModelWithResponseStream",
    ]
    # Bedrock foundation-model ARNs do not pin to the caller region; the
    # inference profile fans out cross-region, so the * region segment is required.
    #trivy:ignore:avd-aws-0057
    resources = [
      aws_bedrock_inference_profile.agent.arn,
      local.system_inference_profile_arn,
      "arn:aws:bedrock:*::foundation-model/${local.bedrock_model_id}",
    ]
  }

  statement {
    actions = [
      "aws-marketplace:Subscribe",
      "aws-marketplace:Unsubscribe",
      "aws-marketplace:ViewSubscriptions",
    ]
    # Marketplace subscription actions are global by design.
    #trivy:ignore:avd-aws-0057
    resources = ["*"]
  }

  # Model registry read. A plain String parameter, so no KMS is involved.
  statement {
    actions   = ["ssm:GetParameter"]
    resources = [aws_ssm_parameter.models.arn]
  }

  # SSM SecureString read for the Logfire token, only when wired.
  dynamic "statement" {
    for_each = local.logfire_token_wired ? [1] : []
    content {
      actions   = ["ssm:GetParameter"]
      resources = [aws_ssm_parameter.logfire_token[0].arn]
    }
  }

  # SSM SecureString read for the Mistral API key, only when wired.
  dynamic "statement" {
    for_each = local.mistral_key_wired ? [1] : []
    content {
      actions   = ["ssm:GetParameter"]
      resources = [aws_ssm_parameter.mistral_api_key[0].arn]
    }
  }

  # KMS Decrypt on the AWS-managed SSM key, required to decrypt any
  # SecureString read via GetParameter. Present when either secret is wired.
  dynamic "statement" {
    for_each = local.logfire_token_wired || local.mistral_key_wired ? [1] : []
    content {
      actions = ["kms:Decrypt"]
      resources = [
        "arn:aws:kms:${data.aws_region.current.region}:${data.aws_caller_identity.current.account_id}:alias/aws/ssm",
      ]
    }
  }

  statement {
    actions = [
      "firehose:PutRecord",
      "firehose:PutRecordBatch",
    ]
    resources = [aws_kinesis_firehose_delivery_stream.audit.arn]
  }

  # Write-only: the handler parks run artifacts and never reads them back,
  # so there is no GetObject or ListBucket. Scoped to the runs/ prefix the
  # handler writes under.
  statement {
    actions   = ["s3:PutObject"]
    resources = ["${aws_s3_bucket.runs.arn}/runs/*"]
  }
}

resource "aws_iam_role" "lambda" {
  name               = "terraform-pr-agent-lambda"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume.json
}

resource "aws_iam_role_policy_attachment" "lambda_basic_execution" {
  role       = aws_iam_role.lambda.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

resource "aws_iam_role_policy" "lambda_permissions" {
  name   = "terraform-pr-agent-lambda-permissions"
  role   = aws_iam_role.lambda.id
  policy = data.aws_iam_policy_document.lambda_permissions.json
}

# The Lambda Insights extension baked into the image ships its metrics
# through its own /aws/lambda-insights log group; the AWS-managed policy
# grants exactly that write path.
resource "aws_iam_role_policy_attachment" "lambda_insights" {
  role       = aws_iam_role.lambda.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchLambdaInsightsExecutionRolePolicy"
}

resource "aws_lambda_function" "agent" {
  function_name = "terraform-pr-agent"
  role          = aws_iam_role.lambda.arn
  architectures = ["arm64"]

  # The handler entry point comes from the image's CMD; the runtime,
  # handler, and layers attributes only apply to zip packages.
  package_type = "Image"
  image_uri    = "${aws_ecr_repository.agent.repository_url}:placeholder"

  # Max Memory Used overstates what this function needs. It runs to ~2 GB on a
  # heavy run, but the track_memory spans (see agent/memory.py) show the real,
  # non-reclaimable demand stays under ~1 GB: ~315 MB resident for the Python
  # runtime plus a transient ~420 MB while terraform validate loads the
  # provider schema. The rest is reclaimable page cache from the ~800 MB
  # provider download and re-lock unpacks doing file IO on /tmp, which the
  # cgroup-based Max Memory Used figure counts but the kernel drops under
  # pressure, so it is not an OOM risk. Memory is therefore not the binding
  # constraint here. 3008 is set for the vCPU it buys, not the RAM: Lambda
  # gives one full vCPU at 1769 MB and scales CPU with memory above that
  # (3008 is ~1.7 vCPU), which speeds the run. Drop it toward ~1769 if
  # latency matters less than cost; do not raise it for memory headroom.
  memory_size = 3008

  # The per-run tool budget is the runaway guard; the timeout only has to
  # accommodate several model turns with init + validate rounds in between.
  timeout = 300

  # Two things land in /tmp: terraform init downloads the AWS provider
  # (~800 MB) into the workspace, and the post-run re-lock unpacks the
  # provider for three platforms (another ~2 GB) to write a portable lock
  # file. 4 GB covers both with headroom; the default 512 MB would not.
  ephemeral_storage {
    size = 4096
  }

159
  tracing_config {
160
    mode = "Active"
161
  }
162

163
  environment {
164
    variables = merge(
165
      {
166
        MODELS_PARAMETER         = aws_ssm_parameter.models.name
167
        DEFAULT_MODEL            = var.default_model
168
        METRICS_NAMESPACE        = local.metrics_namespace
169
        FIREHOSE_DELIVERY_STREAM = aws_kinesis_firehose_delivery_stream.audit.name
170
        RUNS_BUCKET              = aws_s3_bucket.runs.bucket
171
      },
172
      local.logfire_token_wired ? {
173
        LOGFIRE_TOKEN_PARAMETER = local.logfire_token_parameter_name
174
      } : {},
175
      local.mistral_key_wired ? {
176
        MISTRAL_API_KEY_PARAMETER = aws_ssm_parameter.mistral_api_key[0].name
177
      } : {},
178
    )
179
  }
180

181
  # Code ships out of band: scripts/build-lambda.sh pushes a new image and
182
  # calls update-function-code, so terraform must not flip the function
183
  # back to the placeholder on the next apply.
184
  lifecycle {
185
    ignore_changes = [image_uri]
186
  }
187

188
  depends_on = [
189
    aws_iam_role_policy_attachment.lambda_basic_execution,
190
    aws_iam_role_policy_attachment.lambda_insights,
191
    aws_iam_role_policy.lambda_permissions,
192
    terraform_data.placeholder_image,
193
  ]
194
}
195

196
output "lambda_function_name" {
197
  value = aws_lambda_function.agent.function_name
198
}
199

200
output "lambda_function_arn" {
201
  value = aws_lambda_function.agent.arn
202
}
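
The sizing comments in the Lambda resource come down to two small pieces of arithmetic. The sketch below works them through in Python. It assumes the CPU share scales linearly with configured memory, with one full vCPU at 1769 MB, and it uses the approximate sizes quoted in the comments rather than measurements. The function names are hypothetical and not part of the project.

```python
# Rough sizing arithmetic for the Lambda settings discussed in the comments.
# Assumption: CPU share scales linearly with configured memory, with one
# full vCPU at 1769 MB.
ONE_VCPU_MB = 1769


def approx_vcpus(memory_mb: int) -> float:
    """Approximate vCPU share bought by a given memory_size (hypothetical helper)."""
    return memory_mb / ONE_VCPU_MB


def tmp_headroom_mb(storage_mb: int, *usages_mb: int) -> int:
    """Ephemeral storage left after the listed /tmp consumers (hypothetical helper)."""
    return storage_mb - sum(usages_mb)


if __name__ == "__main__":
    print(f"{approx_vcpus(3008):.2f} vCPU at 3008 MB")
    # ~800 MB provider download plus ~2 GB (taken here as 2048 MB) for the re-lock.
    print(f"{tmp_headroom_mb(4096, 800, 2048)} MB of /tmp headroom at 4096 MB")
    print(f"{tmp_headroom_mb(512, 800, 2048)} MB at the default 512 MB")
```

A negative headroom for the 512 MB default is the point of the `ephemeral_storage` block: the provider download alone would not fit.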

locals {
  runs_bucket_name = "terraform-pr-agent-runs-${data.aws_caller_identity.current.account_id}-${data.aws_region.current.region}"
}

# Stopgap output sink until the GitHub PR flow lands in a later post: the
# agent's /tmp workspace dies with the invocation, so each run parks its
# files and a minimal result marker under runs/<run_id>/ here. Nothing in
# this bucket is a system of record (the Object Lock audit bucket is), so
# it takes the opposite posture of audit-bucket.tf: force_destroy so
# terraform destroy empties and removes the bucket, no versioning, and
# SSE-S3 instead of a KMS key that would outlive the bucket's purpose.
# Access logging would require a second bucket, which transient run
# outputs are not worth.
#trivy:ignore:avd-aws-0089
#trivy:ignore:avd-aws-0090
resource "aws_s3_bucket" "runs" {
  bucket        = local.runs_bucket_name
  force_destroy = true
}

resource "aws_s3_bucket_public_access_block" "runs" {
  bucket                  = aws_s3_bucket.runs.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Agent-written HCL, not secrets; SSE-S3 keeps the temporary bucket free
# of a CMK lifecycle.
#trivy:ignore:avd-aws-0132
resource "aws_s3_bucket_server_side_encryption_configuration" "runs" {
  bucket = aws_s3_bucket.runs.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
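
To illustrate the runs/<run_id>/ layout the bucket comment describes, here is a minimal Python sketch of how the object keys for one run could be derived from a workspace directory. It is not the project's upload code: `run_keys` is a hypothetical helper, and the `workspace/` prefix, the `result.json` marker, and the skipping of `.terraform/` are taken from what the post's tests assert.

```python
import tempfile
from pathlib import Path, PurePosixPath


def run_keys(run_id: str, workspace: Path) -> list[str]:
    """Object keys one run would park (hypothetical helper, not the post's code).

    Workspace files land under runs/<run_id>/workspace/, anything inside
    .terraform/ is skipped, and a result.json marker sits next to them.
    """
    keys = []
    for path in sorted(workspace.rglob("*")):
        rel = path.relative_to(workspace)
        # ".terraform.lock.hcl" is a different path part than ".terraform",
        # so the lock file is kept while the provider directory is dropped.
        if path.is_file() and ".terraform" not in rel.parts:
            keys.append(str(PurePosixPath("runs", run_id, "workspace", *rel.parts)))
    keys.append(f"runs/{run_id}/result.json")
    return keys


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        (ws / "main.tf").write_text('output "x" { value = "fixed" }\n')
        (ws / ".terraform" / "providers").mkdir(parents=True)
        (ws / ".terraform" / "providers" / "x").write_text("provider blob")
        for key in run_keys("run-1", ws):
            print(key)
```

Building the keys as a pure function keeps the layout testable without S3; the actual upload would then be one put per key.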

variable "alert_email" {
  description = "Email address subscribed to the agent alerts SNS topic. Set via TF_VAR_alert_email."
  type        = string
}

variable "daily_token_alarm_threshold" {
  description = "Daily combined input + output token threshold. Crossing it sends an email via SNS."
  type        = number
  default     = 1000000
}

variable "audit_retention_days" {
  type        = number
  description = "Object Lock default retention in days. Tutorial default is 7 so the bucket is easy to clean up; production audit horizons are typically years (e.g. 2555 for SOX-style controls)."
  default     = 7
}

variable "logfire_token" {
  type        = string
  description = "Logfire write token. Leave empty to skip the Logfire integration. Set via TF_VAR_logfire_token in .envrc.local."
  default     = ""
  sensitive   = true
}

variable "mistral_api_key" {
  type        = string
  description = "Mistral API key. Leave empty to skip the Mistral providers (Bedrock models still work). Set via TF_VAR_mistral_api_key in .envrc.local."
  default     = ""
  sensitive   = true
}

variable "default_model" {
  type        = string
  description = "Registry key of the model the agent runs with (see models.tf). One of: haiku, mistral-large, devstral-small."
  default     = "mistral-large"
}

#!/usr/bin/env bash
# Build and push the terraform-pr-agent container image, then point the
# Lambda at it. Terraform owns the ECR repo and the function (infra/);
# this script only ships code, the same split as the zip flow it
# replaces.
#
# Runs from anywhere; cd's to the project root (the dir holding the
# Dockerfile).
set -euo pipefail

cd "$(dirname "$0")/.."

# Make sure the lock matches pyproject.toml before the Dockerfile copies
# it into the build.
uv sync --quiet

# Terraform created the repo; asking AWS for the URI keeps the account
# id and region out of this script.
repo_uri="$(aws ecr describe-repositories \
    --repository-names terraform-pr-agent \
    --query 'repositories[0].repositoryUri' --output text)"
registry="${repo_uri%%/*}"

aws ecr get-login-password | docker login --username AWS --password-stdin "$registry"

# --provenance=false: buildx otherwise wraps the image in an OCI image
# index for the provenance attestation, which Lambda rejects.
docker buildx build --platform linux/arm64 --provenance=false \
    -t "$repo_uri:latest" --push .

# update-function-code resolves :latest to a digest at call time, so a
# re-pushed tag rolls the function forward. No --publish needed:
# invocations hit $LATEST.
aws lambda update-function-code \
    --function-name terraform-pr-agent \
    --image-uri "$repo_uri:latest" >/dev/null
aws lambda wait function-updated --function-name terraform-pr-agent

echo "deployed: $repo_uri:latest"

-- OR REPLACE so re-running this script is idempotent: a plain
-- CREATE PERSISTENT SECRET errors on the second run (the secret persists
-- in ~/.duckdb), which aborts the script before the view below is rebuilt
-- and leaves a stale `traces` view in place.
CREATE OR REPLACE PERSISTENT SECRET (
    TYPE s3,
    PROVIDER credential_chain,
    REFRESH auto
);
SET VARIABLE audit_bucket = getenv('AUDIT_BUCKET');

-- hive_partitioning = true reads year=YYYY/month=MM/day=DD/ from the
-- object path as virtual columns, so the partition predicate in a
-- query below prunes objects before any file is opened.
CREATE OR REPLACE VIEW traces AS
WITH spans AS (
    -- Flatten the OTLP-JSON envelope into one row per span, with the
    -- common span fields lifted out as named columns so downstream
    -- CTEs and ad-hoc queries can work against `name`, `trace_id`,
    -- `dur_ms`, etc. without re-doing the struct navigation each time.
    SELECT
        year, month, day,
        span.name                                                                AS name,
        lower(hex(from_base64(span.traceId::VARCHAR)))                           AS trace_id,
        lower(hex(from_base64(span.spanId::VARCHAR)))                            AS span_id,
        lower(hex(from_base64(span.parentSpanId::VARCHAR)))                      AS parent_span_id,
        make_timestamp_ns(span.startTimeUnixNano::BIGINT)                        AS started,
        make_timestamp_ns(span.endTimeUnixNano::BIGINT)                          AS ended,
        (span.endTimeUnixNano::BIGINT - span.startTimeUnixNano::BIGINT) / 1e6    AS dur_ms,
        span.status.code::VARCHAR                                                AS status_code,
        span.attributes                                                          AS attributes,
        data.filename                                                            AS source_file,
        -- Firehose names objects <stream>-<ver>-<YYYY-MM-DD-HH-MM-SS>-<uuid>.gz;
        -- stripping the trailing -<uuid>.gz collapses rows from the same flush
        -- batch onto a stable key for grouping.
        regexp_replace(
            split_part(data.filename, '/', -1),
            '-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.gz$',
            ''
        )                                                                        AS batch_key
    FROM read_ndjson(
        's3://' || getvariable('audit_bucket') || '/traces/**/*.gz',
        compression = 'gzip', hive_partitioning = true, filename = true) AS data
       , UNNEST(data.resourceSpans) AS u1(rs)
       , UNNEST(rs.scopeSpans)      AS u2(ss)
       , UNNEST(ss.spans)           AS u3(span)
),
roots AS (
    -- One row per agent run: pydantic-ai's invoke_agent span, identified by
    -- the GenAI operation rather than by being the parentless span.
    -- instrument_aws_lambda now roots each trace at the SpanKind.SERVER
    -- invocation span, so the agent run is a child of it, not the trace root.
    -- (A caller-side retry would put several invoke_agent spans under one
    -- invocation; the trace_id join below would then cross them, so that
    -- case wants each chat tied to its enclosing run instead.)
    SELECT * FROM spans
    WHERE list_filter(attributes, x -> x.key = 'gen_ai.operation.name')[1]
        .value.stringValue = 'invoke_agent'
),
chats AS (
    -- Per-trace summary of the LLM call: pulls the GenAI semantic
    -- convention attributes off the chat span and exposes each as a
    -- named column.
    SELECT
        trace_id,
        list_filter(attributes, x -> x.key = 'gen_ai.system')[1].value.stringValue                     AS gen_ai_system,
        list_filter(attributes, x -> x.key = 'gen_ai.operation.name')[1].value.stringValue             AS operation,
        list_filter(attributes, x -> x.key = 'gen_ai.request.model')[1].value.stringValue              AS request_model,
        list_filter(attributes, x -> x.key = 'gen_ai.response.model')[1].value.stringValue             AS response_model,
        list_filter(attributes, x -> x.key = 'gen_ai.usage.input_tokens')[1].value.intValue::BIGINT    AS in_tokens,
        list_filter(attributes, x -> x.key = 'gen_ai.usage.output_tokens')[1].value.intValue::BIGINT   AS out_tokens,
        list_filter(attributes, x -> x.key = 'gen_ai.response.finish_reasons')[1]
            .value.arrayValue.values[1].stringValue                                                    AS finish,
        list_filter(attributes, x -> x.key = 'gen_ai.conversation.id')[1].value.stringValue            AS conversation_id,
        list_filter(attributes, x -> x.key = 'gen_ai.agent.name')[1].value.stringValue                 AS agent_name,
        list_filter(attributes, x -> x.key = 'gen_ai.agent.call.id')[1].value.stringValue              AS agent_call_id,
        list_filter(attributes, x -> x.key = 'gen_ai.input.messages')[1].value.stringValue             AS input_messages,
        list_filter(attributes, x -> x.key = 'gen_ai.output.messages')[1].value.stringValue            AS output_messages
    FROM spans
    WHERE name LIKE 'chat %'
),
final_chats AS (
    -- The run's closing assistant message, one row per trace: the chat span
    -- that ended last. Its first text part is the summary the agent returns,
    -- so unlike the per-row assistant_response convenience column it is not
    -- knocked out by the tool-call turns a multi-turn run is mostly made of.
    SELECT trace_id, output_messages FROM (
        SELECT
            trace_id,
            list_filter(attributes, x -> x.key = 'gen_ai.output.messages')[1].value.stringValue AS output_messages,
            row_number() OVER (PARTITION BY trace_id ORDER BY ended DESC) AS rn
        FROM spans
        WHERE name LIKE 'chat %'
    )
    WHERE rn = 1
)
SELECT
    roots.started,
    roots.year, roots.month, roots.day,
    roots.trace_id,
    substr(roots.trace_id, 1, 8)                       AS trace,
    roots.batch_key,
    roots.source_file,
    roots.dur_ms,
    regexp_extract(chats.request_model, '[^./]+$')     AS model,
    chats.in_tokens,
    chats.out_tokens,
    chats.finish,
    chats.agent_name,
    chats.conversation_id,
    -- Convenience columns for the common single-turn shape: system at
    -- input[0], user at input[1], assistant at output[0]. Multi-turn
    -- runs invalidate the indices, so reach for input_messages and
    -- output_messages directly for those.
    json_extract_string(chats.input_messages,  '$[0].parts[0].content')  AS system_prompt,
    json_extract_string(chats.input_messages,  '$[1].parts[0].content')  AS user_prompt,
    json_extract_string(chats.output_messages, '$[0].parts[0].content')  AS assistant_response,
    -- One reliable assistant summary per trace (the closing turn), repeated
    -- across the trace's fanned-out rows; see the final_chats CTE.
    json_extract_string(final_chats.output_messages, '$[0].parts[0].content') AS final_response,
    chats.input_messages,
    chats.output_messages,
    -- Per the OTel spec, instrumentation libraries leave status unset
    -- on success (only application code may set it to Ok). Every OTel
    -- backend treats unset as "no error reported"; we render the same.
    CASE roots.status_code
        WHEN 'STATUS_CODE_ERROR' THEN 'err'
        ELSE 'ok'
    END                                                AS status
FROM roots
JOIN chats USING (trace_id)
LEFT JOIN final_chats USING (trace_id);
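
Two of the column expressions in the view are easy to check outside DuckDB. The Python sketch below mirrors `lower(hex(from_base64(...)))` for the span ids and the `batch_key` regular expression. The helper names, the sample trace id, and the stream name `audit` in the sample path are made up for illustration; only the regex is copied from the view.

```python
import base64
import re

# Trailing "-<uuid>.gz" on each object name, same pattern as in the view.
_UUID_SUFFIX = re.compile(
    r"-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.gz$"
)


def otlp_id_to_hex(b64_id: str) -> str:
    """Mirror lower(hex(from_base64(...))): base64 text to lowercase hex."""
    return base64.b64decode(b64_id).hex()


def batch_key(object_path: str) -> str:
    """Mirror the batch_key column: basename minus the -<uuid>.gz suffix."""
    basename = object_path.rsplit("/", 1)[-1]
    return _UUID_SUFFIX.sub("", basename)


if __name__ == "__main__":
    hex_id = "0af7651916cd43dd8448eb211c80319c"  # made-up 16-byte trace id
    encoded = base64.b64encode(bytes.fromhex(hex_id)).decode()
    print(encoded, "->", otlp_id_to_hex(encoded))
    # Hypothetical object path; the stream name "audit" is an assumption.
    print(batch_key(
        "traces/year=2026/month=06/day=25/"
        "audit-1-2026-06-25-10-15-00-123e4567-e89b-12d3-a456-426614174000.gz"
    ))
```

Objects whose names differ only in the uuid share a batch key, which is what lets the view group rows from one Firehose flush.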

"""Shared test setup.

handler.py reads FIREHOSE_DELIVERY_STREAM at import time, so a stand-in is set
before any test imports it. The model registry and default model are read at
INVOKE time; tests stub _build_model and swap in a scripted FunctionModel via
agent.override, so no real AWS or Mistral call is ever made. The S3 upload runs
against moto.
"""

import os

import logfire

# The tools and the re-lock open logfire spans via track_memory. Outside
# Lambda nothing configures logfire, so pin it to local-only here to keep
# spans as no-ops and off the network during the test run.
logfire.configure(send_to_logfire=False)

os.environ.setdefault("AWS_DEFAULT_REGION", "eu-central-1")
os.environ.setdefault("FIREHOSE_DELIVERY_STREAM", "test-stream")
os.environ.setdefault("METRICS_NAMESPACE", "TerraformPrAgent/Models")
os.environ.setdefault("DEFAULT_MODEL", "mistral-large")
os.environ.setdefault("MODELS_PARAMETER", "/terraform-pr-agent/models")

"""Core test: the real Agent, tools, and execute(), with the Bedrock model
swapped for a scripted FunctionModel so no AWS call is made. The script walks
the loop the post is about: write broken HCL, init, watch validate fail (twice,
exercising the raised retry budget), fix it, validate clean, report done. The
runs-bucket upload is exercised against moto, so the real boto3 call path runs
without touching AWS."""

from __future__ import annotations

import json
import subprocess
import uuid
from collections import deque
from types import SimpleNamespace

import agent.core as core
import agent.models as models
import agent.observability as observability
import agent.runs as runs
import boto3
import pytest
from moto import mock_aws
from opentelemetry.trace import SpanContext, TraceFlags
from opentelemetry.trace.status import StatusCode
from pydantic_ai.messages import ModelMessage, ModelResponse, TextPart, ToolCallPart
from pydantic_ai.models.bedrock import BedrockConverseModel
from pydantic_ai.models.function import AgentInfo, FunctionModel
from pydantic_ai.models.mistral import MistralModel

REGISTRY = {
    "haiku": {
        "provider": "bedrock",
        "model_id": "anthropic.claude-haiku-4-5-20251001-v1:0",
        "inference_profile_arn": "arn:aws:bedrock:eu-west-1:0:inference-profile/test",
    },
    "mistral-large": {"provider": "mistral", "model_id": "mistral-large-latest"},
}

# Captured before the autouse fixture stubs core._build_model, so the
# factory tests below can call the real implementation.
_REAL_BUILD_MODEL = core._build_model

INVALID_TF = 'output "x" { value = var.missing }\n'
VALID_TF = 'output "x" { value = "fixed" }\n'

RUNS_BUCKET = "test-runs"


def _stub_model(name: str) -> FunctionModel:
    # A throwaway model so _build_model makes no SSM call. agent.override in
    # each test supplies the model actually used; pydantic-ai still requires a
    # non-None model on the call, so this stands in for that slot.
    return FunctionModel(lambda messages, info: ModelResponse(parts=[TextPart(content="")]))


@pytest.fixture(autouse=True)
def no_observability(monkeypatch) -> None:
    # The logfire/instrumentation wiring lives in agent.lambda_entry, which the
    # tests never import, so the Firehose-backed audit processor is never
    # registered here. _build_model would read the registry from SSM; tests
    # supply the model through agent.override instead, which takes precedence
    # over the per-run model.
    monkeypatch.setattr(core, "_build_model", _stub_model)


@pytest.fixture
def runs_bucket(monkeypatch):
    # Static stand-in credentials so a hole in the moto mock can never
    # reach a real account.
    monkeypatch.setenv("AWS_ACCESS_KEY_ID", "testing")
    monkeypatch.setenv("AWS_SECRET_ACCESS_KEY", "testing")
    monkeypatch.setenv("RUNS_BUCKET", RUNS_BUCKET)
    with mock_aws():
        s3 = boto3.client("s3")
        s3.create_bucket(
            Bucket=RUNS_BUCKET,
            CreateBucketConfiguration={"LocationConstraint": "eu-central-1"},
        )
        yield s3


def scripted_model(steps: deque[list[ToolCallPart]]) -> FunctionModel:
    def call(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        if steps:
            return ModelResponse(parts=list(steps.popleft()))
        return ModelResponse(parts=[TextPart(content="workspace validated")])

    return FunctionModel(call)


def happy_path_steps() -> deque[list[ToolCallPart]]:
    return deque(
        [
            [ToolCallPart(tool_name="write_file", args={"path": "main.tf", "contents": VALID_TF})],
            [ToolCallPart(tool_name="terraform_init", args={})],
            [ToolCallPart(tool_name="terraform_validate", args={})],
        ]
    )


def uploaded_keys(s3) -> list[str]:
    objects = s3.list_objects_v2(Bucket=RUNS_BUCKET).get("Contents", [])
    return [entry["Key"] for entry in objects]


def test_execute_survives_consecutive_validate_failures(runs_bucket) -> None:
    steps = deque(
        [
            [
                ToolCallPart(
                    tool_name="write_file", args={"path": "main.tf", "contents": INVALID_TF}
                )
            ],
            [ToolCallPart(tool_name="terraform_init", args={})],
            # Two failing validates in a row: the default per-tool retry
            # budget of 1 would kill the run here; Agent(retries=10) is
            # what lets the loop continue.
            [ToolCallPart(tool_name="terraform_validate", args={})],
            [ToolCallPart(tool_name="terraform_validate", args={})],
            [
                ToolCallPart(
                    tool_name="edit_file",
                    args={
                        "path": "main.tf",
                        "old_string": "var.missing",
                        "new_string": '"fixed"',
                    },
                )
            ],
            [ToolCallPart(tool_name="terraform_validate", args={})],
        ]
    )
    with core.agent.override(model=scripted_model(steps)):
        result = core.execute("make me a bucket")

    assert result.output == "workspace validated"
    uuid.UUID(result.run_id)
    assert not steps, "the scripted run should consume every step"


def test_caller_retry_fixes_workspace_after_agent_claims_done(runs_bucket) -> None:
    # Run 1 leaves an invalid but initialized workspace and claims done while
    # validate still fails. The caller re-validates, feeds the error back, and
    # run 2 edits it clean. A TextPart step ends the current run, so the deque
    # spans two runs rather than draining in one.
    steps = deque(
        [
            [
                ToolCallPart(
                    tool_name="write_file", args={"path": "main.tf", "contents": INVALID_TF}
                )
            ],
            [ToolCallPart(tool_name="terraform_init", args={})],
            [TextPart(content="done")],
            [
                ToolCallPart(
                    tool_name="edit_file",
                    args={"path": "main.tf", "old_string": "var.missing", "new_string": '"fixed"'},
                )
            ],
            [TextPart(content="fixed it")],
        ]
    )

    def call(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        if steps:
            return ModelResponse(parts=list(steps.popleft()))
        return ModelResponse(parts=[TextPart(content="done")])

    with core.agent.override(model=FunctionModel(call)):
        result = core.execute("make me a bucket")

    assert result.output == "fixed it"
    assert not steps, "both the first run and the caller-side retry should be consumed"


def test_caller_retry_raises_when_not_converging(runs_bucket) -> None:
    # The agent sets up an invalid, initialized workspace and never fixes it.
    # After the retry budget execute raises (Lambda 5xx) and still parks the
    # broken workspace under status error for debugging.
    setup = deque(
        [
            [
                ToolCallPart(
                    tool_name="write_file", args={"path": "main.tf", "contents": INVALID_TF}
                )
            ],
            [ToolCallPart(tool_name="terraform_init", args={})],
        ]
    )

    def call(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        if setup:
            return ModelResponse(parts=list(setup.popleft()))
        return ModelResponse(parts=[TextPart(content="all done")])

    with core.agent.override(model=FunctionModel(call)):
        with pytest.raises(core.ValidateDidNotConverge):
            core.execute("make me a bucket")

    result_key = next(key for key in uploaded_keys(runs_bucket) if key.endswith("/result.json"))
    parked = json.loads(runs_bucket.get_object(Bucket=RUNS_BUCKET, Key=result_key)["Body"].read())
    assert parked["status"] == "error"
    assert "ValidateDidNotConverge" in parked["error"]


def test_success_uploads_workspace_and_minimal_result(runs_bucket) -> None:
    with core.agent.override(model=scripted_model(happy_path_steps())):
        result = core.execute("make me a bucket")

    run_id = result.run_id
    keys = uploaded_keys(runs_bucket)
    assert f"runs/{run_id}/workspace/main.tf" in keys
    assert f"runs/{run_id}/result.json" in keys
    assert not any("/.terraform/" in key for key in keys)

    body = runs_bucket.get_object(Bucket=RUNS_BUCKET, Key=f"runs/{run_id}/result.json")
    assert json.loads(body["Body"].read()) == {"status": "ok"}


def test_failure_still_uploads_with_error_status(runs_bucket) -> None:
    calls = iter(
        [[ToolCallPart(tool_name="write_file", args={"path": "main.tf", "contents": VALID_TF})]]
    )

    def call(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        step = next(calls, None)
        if step is None:
            raise RuntimeError("boom")
        return ModelResponse(parts=list(step))

    with core.agent.override(model=FunctionModel(call)):
        with pytest.raises(RuntimeError, match="boom"):
            core.execute("make me a bucket")

    keys = uploaded_keys(runs_bucket)
    result_key = next(key for key in keys if key.endswith("/result.json"))
    run_id = result_key.split("/")[1]
    assert f"runs/{run_id}/workspace/main.tf" in keys

    body = runs_bucket.get_object(Bucket=RUNS_BUCKET, Key=result_key)
    assert json.loads(body["Body"].read()) == {
        "status": "error",
        "error": "RuntimeError('boom')",
    }


def test_persist_run_requires_bucket(monkeypatch, tmp_path) -> None:
    # An unset RUNS_BUCKET means the run would never be parked. That is a
    # deployment fault, so persisting fails fast rather than dropping it.
    monkeypatch.delenv("RUNS_BUCKET", raising=False)
    with pytest.raises(RuntimeError, match="RUNS_BUCKET"):
        runs._persist_run("run-1", tmp_path, status="ok")


def test_run_id_is_the_conversation_id(runs_bucket) -> None:
    seen: list[str | None] = []
    steps = happy_path_steps()

    def call(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        seen.append(messages[0].conversation_id)
        if steps:
            return ModelResponse(parts=list(steps.popleft()))
        return ModelResponse(parts=[TextPart(content="workspace validated")])

    with core.agent.override(model=FunctionModel(call)):
        result = core.execute("make me a bucket")

    # The same id the caller gets back is stamped on every model request,
    # which is what surfaces as gen_ai.conversation.id on the trace.
    assert seen == [result.run_id] * len(seen)


def test_workspace_files_skips_terraform_dir(tmp_path) -> None:
    (tmp_path / "main.tf").write_text(VALID_TF)
    (tmp_path / ".terraform.lock.hcl").write_text("# lock\n")
    (tmp_path / ".terraform" / "providers").mkdir(parents=True)
    (tmp_path / ".terraform" / "providers" / "x").write_text("provider blob")

    files = [path.relative_to(tmp_path) for path in runs._workspace_files(tmp_path)]

    assert sorted(str(path) for path in files) == [".terraform.lock.hcl", "main.tf"]


def test_system_prompt_nudges_provider_constraint_and_lock_file() -> None:
    assert '"~> 6.0"' in core.SYSTEM_PROMPT
    assert ".terraform.lock.hcl" in core.SYSTEM_PROMPT


def test_relock_providers_skips_without_lock_file(tmp_path, monkeypatch) -> None:
    calls: list = []
    monkeypatch.setattr(runs.subprocess, "run", lambda *a, **k: calls.append(a))
    runs._relock_providers(tmp_path)
    assert calls == []


def test_relock_providers_covers_all_platforms(tmp_path, monkeypatch) -> None:
    (tmp_path / ".terraform.lock.hcl").write_text("# lock\n")
    calls: list[list[str]] = []

    def fake_run(args, **kwargs):
        calls.append(args)
        return subprocess.CompletedProcess(args, 0, "", "")

    monkeypatch.setattr(runs.subprocess, "run", fake_run)
    runs._relock_providers(tmp_path)

    # One terraform call per platform, each carrying only its own -platform flag.
    assert [args[-1] for args in calls] == [
        f"-platform={platform}" for platform in runs._LOCK_PLATFORMS
    ]
    for args in calls:
        assert args[:4] == ["terraform", "providers", "lock", "-no-color"]


def test_relock_providers_continues_after_one_platform_fails(tmp_path, monkeypatch) -> None:
    (tmp_path / ".terraform.lock.hcl").write_text("# lock\n")
    calls: list[list[str]] = []

    def fake_run(args, **kwargs):
        calls.append(args)
        returncode = 1 if f"-platform={runs._LOCK_PLATFORMS[0]}" in args else 0
        return subprocess.CompletedProcess(args, returncode, "", "boom")

    monkeypatch.setattr(runs.subprocess, "run", fake_run)
    runs._relock_providers(tmp_path)

    assert [args[-1] for args in calls] == [
        f"-platform={platform}" for platform in runs._LOCK_PLATFORMS
    ]


def _sampled_ctx(trace_id: int, span_id: int, *, is_remote: bool = False) -> SpanContext:
    return SpanContext(
        trace_id=trace_id,
        span_id=span_id,
        is_remote=is_remote,
        trace_flags=TraceFlags(TraceFlags.SAMPLED),
    )


def test_audit_processor_ships_on_no_parent() -> None:
    shipped: list = []
    proc = observability.PerTraceAuditProcessor(lambda spans: shipped.append(list(spans)))
    root = SimpleNamespace(context=_sampled_ctx(1, 1), parent=None)
    proc.on_end(root)
    assert shipped == [[root]]


def test_audit_processor_treats_remote_parent_as_local_root() -> None:
    # instrument_aws_lambda can root the trace at a span propagated in from a
    # remote parent (API Gateway / X-Ray); is_remote marks it the local root.
    shipped: list = []
    proc = observability.PerTraceAuditProcessor(lambda spans: shipped.append(list(spans)))
    span = SimpleNamespace(context=_sampled_ctx(1, 2), parent=_sampled_ctx(1, 9, is_remote=True))
    proc.on_end(span)
    assert shipped == [[span]]


def test_audit_processor_buffers_local_child_until_root() -> None:
    shipped: list = []
    proc = observability.PerTraceAuditProcessor(lambda spans: shipped.append(list(spans)))
    child = SimpleNamespace(context=_sampled_ctx(1, 2), parent=_sampled_ctx(1, 1))
    root = SimpleNamespace(context=_sampled_ctx(1, 1), parent=None)
    proc.on_end(child)
    assert shipped == []
    proc.on_end(root)
    assert shipped == [[child, root]]


def _agent_span(model: str, *, in_tokens: int = 0, out_tokens: int = 0, errored: bool = False):
    return SimpleNamespace(
        attributes={
            "metadata": json.dumps({"model": model}),
            "gen_ai.usage.input_tokens": in_tokens,
            "gen_ai.usage.output_tokens": out_tokens,
        },
        status=SimpleNamespace(status_code=StatusCode.ERROR if errored else StatusCode.UNSET),
        start_time=0,
        end_time=1_000_000,
    )


def _non_agent_span():
    return SimpleNamespace(
        attributes={"gen_ai.operation.name": "chat"},
        status=SimpleNamespace(status_code=StatusCode.UNSET),
        start_time=0,
        end_time=1_000_000,
    )


def test_emit_emf_one_line_per_agent_run(monkeypatch) -> None:
    records: list = []
    monkeypatch.setattr(
        observability, "log", SimpleNamespace(info=lambda event, **kw: records.append((event, kw)))
    )
    observability._emit_emf([_non_agent_span(), _agent_span("haiku", in_tokens=10, out_tokens=3)])
    assert len(records) == 1
    event, kw = records[0]
    assert event == "trace_metrics"
    assert (kw["Model"], kw["InputTokens"], kw["OutputTokens"], kw["Errors"]) == ("haiku", 10, 3, 0)


def test_emit_emf_emits_per_run_for_retries(monkeypatch) -> None:
    records: list = []
    monkeypatch.setattr(
        observability, "log", SimpleNamespace(info=lambda event, **kw: records.append((event, kw)))
    )
    observability._emit_emf(
        [_agent_span("haiku"), _non_agent_span(), _agent_span("mistral-large", errored=True)]
    )
    assert [kw["Model"] for _, kw in records] == ["haiku", "mistral-large"]
    assert [kw["Errors"] for _, kw in records] == [0, 1]


# These exercise the real _build_model(name) rather than the _stub_model the
# autouse fixture installs, guarding the factory's signature and provider
# branching (the execute path stubs it, so it cannot catch a signature drift).
def test_build_model_selects_bedrock(monkeypatch) -> None:
    _REAL_BUILD_MODEL.cache_clear()
    monkeypatch.setattr(models, "fetch_parameter", lambda name: json.dumps(REGISTRY))
    model = _REAL_BUILD_MODEL("haiku")
    assert isinstance(model, BedrockConverseModel)
    assert model.model_name == "anthropic.claude-haiku-4-5-20251001-v1:0"


def test_build_model_selects_mistral(monkeypatch) -> None:
    _REAL_BUILD_MODEL.cache_clear()
    monkeypatch.setattr(
        models,
        "fetch_parameter",
        lambda name: json.dumps(REGISTRY) if "models" in name else "key",
    )
    monkeypatch.setenv("MISTRAL_API_KEY_PARAMETER", "/terraform-pr-agent/mistral-api-key")
    model = _REAL_BUILD_MODEL("mistral-large")
    assert isinstance(model, MistralModel)
    assert model.model_name == "mistral-large-latest"

"""require_env: the single fail-fast read for required configuration."""

from __future__ import annotations

import pytest
from agent.env import require_env


def test_returns_value_when_set(monkeypatch) -> None:
    monkeypatch.setenv("SOME_VAR", "value")
    assert require_env("SOME_VAR") == "value"


def test_raises_when_unset(monkeypatch) -> None:
    monkeypatch.delenv("SOME_VAR", raising=False)
    with pytest.raises(RuntimeError, match="SOME_VAR"):
        require_env("SOME_VAR")


def test_treats_empty_as_unset(monkeypatch) -> None:
    monkeypatch.setenv("SOME_VAR", "")
    with pytest.raises(RuntimeError, match="SOME_VAR"):
        require_env("SOME_VAR")

"""Tests for the Lambda boundary: the INIT wiring and the event envelope.

Importing agent.lambda_entry runs the INIT wiring (configure then wrap), so the
side effects are stubbed before a fresh import; the tests assert the wiring and
the envelope, not real logfire configuration.
"""

from __future__ import annotations

import importlib
import sys

import agent.core as core
import agent.observability as observability
import logfire
import pytest


@pytest.fixture
def lambda_entry(monkeypatch):
    calls: list = []
    monkeypatch.setattr(observability, "configure", lambda: calls.append("configure"))
    monkeypatch.setattr(
        logfire, "instrument_aws_lambda", lambda h, **kw: calls.append(("instrument", h))
    )
    sys.modules.pop("agent.lambda_entry", None)
    module = importlib.import_module("agent.lambda_entry")
    module.init_calls = calls
    return module


def test_configure_runs_before_the_handler_is_wrapped(lambda_entry) -> None:
    # The container CMD targets agent.lambda_entry.handler, so the wrap must
    # land on that symbol, with configure() run first.
    assert lambda_entry.init_calls == ["configure", ("instrument", lambda_entry.handler)]


def test_handler_requires_a_prompt(lambda_entry) -> None:
    with pytest.raises(ValueError, match="prompt"):
        lambda_entry.handler({}, None)


def test_handler_runs_the_agent_and_wraps_the_result(lambda_entry, monkeypatch) -> None:
    seen: dict = {}

    def fake_execute(prompt, model=None):
        seen["prompt"] = prompt
        seen["model"] = model
        return core.RunResult(run_id="run-1", model="haiku", output="done")

    monkeypatch.setattr(lambda_entry, "execute", fake_execute)
    response = lambda_entry.handler({"prompt": "make me a bucket", "model": "haiku"}, None)

    assert seen == {"prompt": "make me a bucket", "model": "haiku"}
    assert response == {
        "status": "ok",
        "run_id": "run-1",
        "model": "haiku",
        "output": "done",
    }

"""Tests for the workspace tools: path guards, init/validate semantics,
and how ModelRetry interacts with pydantic-ai's per-tool retry budget."""

from __future__ import annotations

from pathlib import Path
from types import SimpleNamespace

import pytest
from agent.tools import (
    WorkspaceDeps,
    read_file,
    terraform_init,
    terraform_validate,
    write_file,
)
from pydantic_ai import Agent, ModelRetry
from pydantic_ai.exceptions import UnexpectedModelBehavior
from pydantic_ai.messages import ModelMessage, ModelResponse, ToolCallPart
from pydantic_ai.models.function import AgentInfo, FunctionModel

VALID_TF = 'output "ok" { value = "ok" }\n'
# Parses fine (init passes) but references an undeclared variable, so the
# failure surfaces at the validate step.
INVALID_TF = 'output "x" { value = var.missing }\n'


def ctx_for(root: Path) -> SimpleNamespace:
    # The tools only touch ctx.deps, so a namespace stands in for the
    # full RunContext.
    return SimpleNamespace(deps=WorkspaceDeps(root=root))


def test_read_missing_file_raises_model_retry(tmp_path: Path) -> None:
    with pytest.raises(ModelRetry, match="must be a file"):
        read_file(ctx_for(tmp_path), "absent.tf")


def test_path_escape_raises_model_retry(tmp_path: Path) -> None:
    with pytest.raises(ModelRetry, match="workspace root"):
        write_file(ctx_for(tmp_path), "../outside.tf", "boom")


def test_init_then_validate_passes(tmp_path: Path) -> None:
    ctx = ctx_for(tmp_path)
    (tmp_path / "main.tf").write_text(VALID_TF)
    assert terraform_init(ctx) == "OK: terraform init completed."
    assert terraform_validate(ctx) == "OK: terraform validate passed."


def test_init_failure_raises_model_retry(tmp_path: Path) -> None:
    (tmp_path / "main.tf").write_text("terraform {")
    with pytest.raises(ModelRetry, match="terraform init failed"):
        terraform_init(ctx_for(tmp_path))


def test_validate_failure_raises_model_retry(tmp_path: Path) -> None:
    ctx = ctx_for(tmp_path)
    (tmp_path / "main.tf").write_text(INVALID_TF)
    terraform_init(ctx)
    with pytest.raises(ModelRetry, match="terraform validate failed"):
        terraform_validate(ctx)


def test_requirements_added_mid_run_recover_via_init(tmp_path: Path) -> None:
    """The flow the agent is prompted to follow: adding a module after the
    first init breaks validate until terraform_init runs again."""
    ctx = ctx_for(tmp_path)
    (tmp_path / "main.tf").write_text(VALID_TF)
    terraform_init(ctx)
    assert terraform_validate(ctx) == "OK: terraform validate passed."

    mod = tmp_path / "mod"
    mod.mkdir()
    (mod / "main.tf").write_text(VALID_TF)
    (tmp_path / "uses_module.tf").write_text('module "m" { source = "./mod" }\n')

    with pytest.raises(ModelRetry, match="terraform init"):
        terraform_validate(ctx)
    terraform_init(ctx)
    assert terraform_validate(ctx) == "OK: terraform validate passed."


def _always_validate_agent(retries: int | None) -> Agent:
    def always_validate(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        return ModelResponse(parts=[ToolCallPart(tool_name="terraform_validate", args={})])

    kwargs = {} if retries is None else {"retries": retries}
    return Agent(
        FunctionModel(always_validate),
        deps_type=WorkspaceDeps,
        tools=[terraform_init, terraform_validate],
        **kwargs,
    )


def test_default_budget_kills_run_on_second_consecutive_failure(tmp_path: Path) -> None:
    """pydantic-ai counts consecutive ModelRetry failures per tool against
    `retries` (default 1) and then raises UnexpectedModelBehavior. This is
    why handler.py sets a higher budget."""
    (tmp_path / "main.tf").write_text("terraform {")
    agent = _always_validate_agent(retries=None)
    with pytest.raises(UnexpectedModelBehavior, match="exceeded max retries"):
        agent.run_sync("go", deps=WorkspaceDeps(root=tmp_path))


def test_raised_budget_tolerates_consecutive_failures(tmp_path: Path) -> None:
    (tmp_path / "main.tf").write_text("terraform {")
    agent = _always_validate_agent(retries=5)
    with pytest.raises(UnexpectedModelBehavior, match="exceeded max retries count of 5"):
        agent.run_sync("go", deps=WorkspaceDeps(root=tmp_path))

# Keep the build context to what the Dockerfile actually copies; .envrc
# files stay out because they can hold tokens.
.venv
build
.git
infra
scripts
tests
.envrc
.envrc.local
*.tar.gz

# Multi-stage build: the builder stages below carry tooling (unzip, uv,
# rpm metadata) that the function never needs at runtime. Only their
# outputs are copied into the final stage, so the shipped image stays
# lean and pulls faster on a cold start.

FROM public.ecr.aws/lambda/python:3.13 AS terraform
# Pinned + checksum-verified so the image build is reproducible and a
# tampered release archive fails the build instead of shipping.
ARG TERRAFORM_VERSION=1.15.6
RUN dnf install -y unzip && dnf clean all
RUN curl -fsSLO https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terraform_${TERRAFORM_VERSION}_linux_arm64.zip \
    && curl -fsSLO https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terraform_${TERRAFORM_VERSION}_SHA256SUMS \
    && grep " terraform_${TERRAFORM_VERSION}_linux_arm64.zip\$" terraform_${TERRAFORM_VERSION}_SHA256SUMS | sha256sum -c - \
    && unzip terraform_${TERRAFORM_VERSION}_linux_arm64.zip -d /usr/local/bin \
    && rm terraform_${TERRAFORM_VERSION}_linux_arm64.zip terraform_${TERRAFORM_VERSION}_SHA256SUMS

# Same uv flow as the zip build this replaces (see the uv AWS Lambda
# guide); the dependency set installs straight into the task root, where
# the final stage picks it up.
FROM public.ecr.aws/lambda/python:3.13 AS python
COPY --from=ghcr.io/astral-sh/uv:0.11.21 /uv /usr/local/bin/uv
WORKDIR /opt/build
COPY pyproject.toml uv.lock ./
RUN uv export --frozen --no-dev --no-editable -o requirements.txt \
    && uv pip install \
        --no-installer-metadata \
        --no-compile-bytecode \
        --target "${LAMBDA_TASK_ROOT}" \
        -r requirements.txt

# Container images cannot attach layers, so the Lambda Insights extension
# is baked into the image: pinned version, detached GPG signature checked
# against the key fingerprint published in the Lambda Insights docs, so a
# tampered rpm fails the build.
FROM public.ecr.aws/lambda/python:3.13 AS insights
ARG INSIGHTS_VERSION=1.0.660.0
ARG INSIGHTS_BASE_URL=https://lambda-insights-extension-arm64.s3-ap-northeast-1.amazonaws.com
# The downloaded key is checked against the fingerprint from the docs
# before anything trusts it; gpg runs with --batch/--no-tty/--no-autostart
# because the base image ships no gpg-agent.
RUN curl -fsSLO ${INSIGHTS_BASE_URL}/amazon_linux/lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm \
    && curl -fsSLO ${INSIGHTS_BASE_URL}/amazon_linux/lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm.sig \
    && curl -fsSLO ${INSIGHTS_BASE_URL}/lambda-insights-extension.gpg \
    && gpg --batch --no-tty --show-keys --with-colons lambda-insights-extension.gpg \
        | grep -q '^fpr:::::::::E0AFFA11FFF35BD7349EE222479C97A1848ABDC8:' \
    && gpg --batch --no-tty --no-autostart --import lambda-insights-extension.gpg \
    && gpg --batch --no-tty --no-autostart --verify \
        lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm.sig \
        lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm \
    && rpm -U lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm \
    && rm -f lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm* lambda-insights-extension.gpg

# The shipped image: only the artifacts the function uses at runtime
# cross over from the builder stages.
FROM public.ecr.aws/lambda/python:3.13
COPY --from=terraform /usr/local/bin/terraform /usr/local/bin/terraform
COPY --from=insights /opt/extensions /opt/extensions
COPY --from=insights /opt/cloudwatch /opt/cloudwatch
COPY --from=python ${LAMBDA_TASK_ROOT} ${LAMBDA_TASK_ROOT}
COPY agent ${LAMBDA_TASK_ROOT}/agent
WORKDIR ${LAMBDA_TASK_ROOT}

# terraform needs a writable HOME for incidental state; /tmp is the only
# writable path at runtime. CMD replaces the zip package's handler attribute
# and points at lambda_entry, the instrumented entry point, not the bare
# handler module.
ENV HOME=/tmp \
    TF_IN_AUTOMATION=true \
    TF_INPUT=false
CMD ["agent.lambda_entry.handler"]

[project]
name = "terraform-pr-agent"
version = "0.1.0"
description = "The terraform-pr-agent Lambda handler."
requires-python = ">=3.13"
dependencies = [
    "pydantic-ai-slim[bedrock,mistral,retries]>=1.106",
    "logfire[aws-lambda]>=4.35",
    "structlog>=24",
    "boto3>=1.35",
]

[dependency-groups]
dev = [
    "pytest>=8",
    "moto[s3]>=5",
]

[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["."]

# terraform-pr-agent

A pydantic-ai agent that writes and validates Terraform, running as a
container-image AWS Lambda. This is the worked example from the **Terraform PR
Agent** series; each post adds one capability.

Series: <https://andreaslang.dev/posts/terraform-pr-agent/>

At this checkpoint the agent gets a `/tmp` workspace, sandboxed file tools
(list, read, write, edit, delete), and a `terraform_validate` tool it can call
in a loop. It ships as a Docker-based Lambda with the terraform CLI baked in.

## Prerequisites

- [uv](https://docs.astral.sh/uv/) for Python and the single-file scripts
- Terraform 1.x
- Docker with buildx (the image is arm64; the placeholder is built during apply)
- An AWS account you are happy to create resources in. A throwaway sandbox
  sub-account is assumed; never point this at production. Credentials via
  `aws configure sso` or static keys.
- Bedrock model access enabled for the models in `infra/models.tf`, in your region
- Optional: a [Logfire](https://logfire.pydantic.dev/) token for tracing, and a
  [Mistral API key](https://console.mistral.ai/api-keys/) for the Mistral models

## Setup

A commented `.envrc.local` template ships in the scaffold (gitignored). Fill it
in and load it:

```bash
direnv allow
```

At minimum set `AWS_REGION`, your AWS credentials, and `TF_VAR_alert_email`
(AWS emails a confirmation for the alarm topic). The Logfire and Mistral keys
are optional; leave them unset to skip those integrations. A few values
(`AUDIT_BUCKET`, `AGENT_ROLE_ARN`, and friends) come from terraform outputs, so
set them and run `direnv reload` after the first apply.

If you skip the Mistral key, set `TF_VAR_default_model=haiku` (a Bedrock entry
in `infra/models.tf`); the default is `mistral-large`, which needs the key.

## Deploy

```bash
cd infra
terraform init
terraform plan
terraform apply
cd ..
./scripts/build-lambda.sh
```

`terraform apply` stands the function up on a placeholder image so every
resource is created in one pass; `build-lambda.sh` then builds the real arm64
image and points the Lambda at it. Re-run the script whenever you change
`agent/` or its dependencies.

Never blind-apply: read the plan first, and note the apply reads your
`.envrc.local` env (the alert email and the optional tokens feed `TF_VAR_*`).

## Run the agent

Invoke the function with an empty payload (it falls back to a sample prompt) or
your own:

```bash
aws lambda invoke --function-name terraform-pr-agent \
  --payload '{"prompt": "Set up a new terraform project, creating a best practice s3 bucket."}' \
  --cli-binary-format raw-in-base64-out --cli-read-timeout 0 out.json
cat out.json
```

`--cli-read-timeout 0` disables the CLI's default 60s read timeout: a
synchronous invoke runs the whole agent, which takes tens of seconds.

A `model` field in the payload overrides the `default_model` variable. Each run
writes its workspace and a `result.json` to the runs bucket under
`runs/<run_id>/`, and emits a trace (Logfire if configured, plus the S3 audit
copy). The CloudWatch dashboard `terraform-pr-agent` shows the model and Lambda
metrics.

## Tests

```bash
uv sync
uv run pytest
```

The suite is moto-backed and makes no real AWS calls.

## Layout

- `agent/` pydantic-ai handler and tools (the Lambda code)
- `infra/` Terraform for all AWS resources
- `scripts/` standalone PEP 723 scripts (`uv run scripts/<name>.py`) plus `build-lambda.sh`
- `tests/` pytest suite

See `AGENTS.md` for the conventions used when editing this project.

Fast-forward to the final code of this post

Download the cumulative checkpoint that matches the state at the end of this post. Useful for landing on the finished tree without working through every step.

mkdir -p ~/projects
cd ~/projects
curl -fsSL https://andreaslang.dev/terraform-pr-agent/terraform-pr-agent-04.tar.gz | tar xz

From zip to container image

In previous posts we used a regular zip-packaged Python Lambda. For this post we moved to a Docker based Lambda to get away from the 250 MB limit on the unzipped package. This post would still have fit under it (terraform 108 MB + site-packages 56 MB = 164 MB), but the container image ceiling is 10 GB rather than 250 MB, which buys real headroom as the image grows. A container also hands us the whole image to control, including system packages and filesystem layout, not just Python dependencies layered onto the AWS runtime. As a bonus it is a standard Docker image, so moving to something like ECS Fargate later is straightforward.

Looking at the Dockerfile you will also notice we use a multi-stage build: each builder stage produces one artifact, and the final stage copies only those artifacts in. This is an easy way of keeping build tools out of the final image and therefore reducing its size. Our final image is 739 MB in total, 169 MB above the shared 570 MB base. That sounds huge to anyone used to building small images, but the base is the AWS Lambda Python base image, which is heavily cached across the Lambda stack. So in practice only our own 169 MB on top of the base has to be pulled, which is far more reasonable.
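The size arithmetic from the last two paragraphs, written out as a small sketch. The numbers are the ones quoted above; the 10 GB ceiling is expressed as 10 * 1024 MB, which is an approximation chosen for this illustration.

```python
# Sizes in MB, taken from the figures quoted in the text.
ZIP_LIMIT_MB = 250               # unzipped limit for a zip-packaged Lambda
CONTAINER_LIMIT_MB = 10 * 1024   # ~10 GB container image ceiling (approximation)

terraform_mb = 108
site_packages_mb = 56
zip_payload_mb = terraform_mb + site_packages_mb

image_total_mb = 739
base_image_mb = 570
own_layers_mb = image_total_mb - base_image_mb

print(f"zip payload: {zip_payload_mb} MB, headroom {ZIP_LIMIT_MB - zip_payload_mb} MB")
print(f"layers above the base image: {own_layers_mb} MB")
print(f"container headroom: {CONTAINER_LIMIT_MB - image_total_mb} MB")
```

The zip package would have had under 100 MB to spare, which the AWS provider download alone would exceed if it ever had to be bundled.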

builder stage        copied into the final image
-------------        ---------------------------
terraform       -->  terraform binary
python          -->  site-packages (/var/task)
insights        -->  Lambda Insights extension
build context   -->  agent/ source

base image: public.ecr.aws/lambda/python:3.13
left in the builders, never shipped: unzip, dnf cache, uv, gpg keyring, rpm metadata

We will not go through every stage of the image, but here is the Python dependency stage as an example. It starts from the AWS Lambda Python base image, copies the uv binary out of the official uv image, copies pyproject.toml and uv.lock, exports them to a requirements.txt, and then runs uv pip install to put the dependencies into the task root.

# Same uv flow as the zip build this replaces (see the uv AWS Lambda
# guide); the dependency set installs straight into the task root, where
# the final stage picks it up.
FROM public.ecr.aws/lambda/python:3.13 AS python
COPY --from=ghcr.io/astral-sh/uv:0.11.21 /uv /usr/local/bin/uv
WORKDIR /opt/build
COPY pyproject.toml uv.lock ./
RUN uv export --frozen --no-dev --no-editable -o requirements.txt \
    && uv pip install \
        --no-installer-metadata \
        --no-compile-bytecode \
        --target "${LAMBDA_TASK_ROOT}" \
        -r requirements.txt

The Python stage itself does not ship. The final stage below just assembles the image by copying the outputs of the earlier stages into it:

- The Terraform binary
- The Lambda Insights extension (extensions and cloudwatch)
- The Python dependencies
- Our agent code

Initially I considered baking the Terraform AWS provider into the image, but managing it properly turned out to be difficult without constraining the use case further than it already is. So I took the simpler route: download the providers every time the agent calls terraform init.

# The shipped image: only the artifacts the function uses at runtime
# cross over from the builder stages.
FROM public.ecr.aws/lambda/python:3.13
COPY --from=terraform /usr/local/bin/terraform /usr/local/bin/terraform
COPY --from=insights /opt/extensions /opt/extensions
COPY --from=insights /opt/cloudwatch /opt/cloudwatch
COPY --from=python ${LAMBDA_TASK_ROOT} ${LAMBDA_TASK_ROOT}
COPY agent ${LAMBDA_TASK_ROOT}/agent
WORKDIR ${LAMBDA_TASK_ROOT}

The CMD of a Docker based Lambda names the Python handler; here it points at the handler function in the agent.lambda_entry module. While we are at it we also make terraform non-interactive and set HOME to /tmp, the only writable folder in a Lambda.

# terraform needs a writable HOME for incidental state; /tmp is the only
# writable path at runtime. CMD replaces the zip package's handler attribute
# and points at lambda_entry, the instrumented entry point, not the bare
# handler module.
ENV HOME=/tmp \
    TF_IN_AUTOMATION=true \
    TF_INPUT=false
CMD ["agent.lambda_entry.handler"]
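To make the CMD string less magic, here is a minimal sketch of how a dotted handler string such as agent.lambda_entry.handler can be turned into a callable. The function name resolve_handler is made up for this illustration, and this is not the code the AWS runtime actually runs; it only shows the "module path, then function name" split.

```python
import importlib
from typing import Any, Callable


def resolve_handler(dotted: str) -> Callable[..., Any]:
    """Split 'package.module.function' on the last dot and import the function."""
    module_name, _, attr = dotted.rpartition(".")
    if not module_name or not attr:
        raise ValueError(f"expected 'module.function', got {dotted!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Demonstrated with a stdlib target so the sketch runs anywhere; in the image
# the string would be "agent.lambda_entry.handler".
dumps = resolve_handler("json.dumps")
print(dumps({"status": "ok"}))
```

This is also why the tests earlier import agent.lambda_entry freshly: importing the module is what triggers the INIT wiring before the handler symbol is looked up.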

ECR and the placeholder image

A consequence of switching to a Docker based Lambda is that the function needs a pullable image at create time. Without a placeholder image, terraform apply would turn into a three-step process (create the ECR repo, push the image, then create the Lambda). With the placeholder in place, a single apply can create the function, and scripts/build-lambda.sh pushes the real image afterward. This keeps terraform apply self-contained, with shipping the code as a separate step.

# Container twin of the zip flow's archive_file placeholder: the function
# resource needs a pullable image at create time, so terraform seeds a
# minimal one. Create-only (input never changes), so scripts/build-lambda.sh
# owns every push after this.
resource "terraform_data" "placeholder_image" {
  input = aws_ecr_repository.agent.repository_url

  # Needs docker and the aws cli on the machine running apply; both are
  # already prerequisites for the series.
  provisioner "local-exec" {
    command = <<-EOT
      aws ecr get-login-password --region ${data.aws_region.current.region} |
        docker login --username AWS --password-stdin ${split("/", aws_ecr_repository.agent.repository_url)[0]}
      docker buildx build --platform linux/arm64 --provenance=false \
        -t ${aws_ecr_repository.agent.repository_url}:placeholder \
        --push ${path.module}/placeholder
    EOT
  }
}
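The provisioner packs two small ideas into one heredoc: the registry host is the part of the repository URL before the first slash, and the push is a single buildx invocation. A sketch of the same logic in Python, assuming a made-up account id, region, and repository name; the helper names are hypothetical and nothing here is taken from scripts/build-lambda.sh.

```python
def registry_host(repository_url: str) -> str:
    """Everything before the first '/' is the registry host docker logs in to."""
    return repository_url.split("/")[0]


def build_and_push_command(repository_url: str, tag: str, context: str) -> list[str]:
    """Mirror the buildx invocation from the provisioner as an argv list."""
    return [
        "docker", "buildx", "build",
        "--platform", "linux/arm64",
        "--provenance=false",
        "-t", f"{repository_url}:{tag}",
        "--push", context,
    ]


# Hypothetical account id, region, and repository name, for illustration only.
url = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/terraform-pr-agent"
print(registry_host(url))
print(" ".join(build_and_push_command(url, "placeholder", "infra/placeholder")))
```

Building the command as a list rather than a shell string avoids quoting problems if a path ever contains spaces.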

For the Lambda itself, package_type = "Image" and image_uri switch it to a Docker based Lambda. We also tell terraform to ignore changes to image_uri, because otherwise a re-apply would reset the function back to the placeholder image. Instead we want our agent Docker build to have control over the image URI. Further, we tweak timeout, memory size, and ephemeral storage to match the heavier resource needs now that terraform runs inside the function. Lambda layers, which we used in a previous post, are removed entirely; they do not work with a Docker based Lambda.

resource "aws_lambda_function" "agent" {
  function_name = "terraform-pr-agent"
  role          = aws_iam_role.lambda.arn
  architectures = ["arm64"]

  # The handler entry point comes from the image's CMD; the runtime,
  # handler, and layers attributes only apply to zip packages.
  package_type = "Image"
  image_uri    = "${aws_ecr_repository.agent.repository_url}:placeholder"

  # Max Memory Used overstates what this function needs. It runs to ~2 GB on a
  # heavy run, but the track_memory spans (see agent/memory.py) show the real,
  # non-reclaimable demand stays under ~1 GB: ~315 MB resident for the Python
  # runtime plus a transient ~420 MB while terraform validate loads the
  # provider schema. The rest is reclaimable page cache from the ~800 MB
  # provider download and re-lock unpacks doing file IO on /tmp, which the
  # cgroup-based billed figure counts but the kernel drops under pressure, so
  # it is not OOM risk. Memory is therefore not the binding constraint here.
  # 3008 is set for the vCPU it buys, not the RAM: above 1769 MB Lambda gives a
  # full core (3008 is ~1.7), which speeds the run. Drop it toward ~1769 if
  # latency matters less than cost; do not raise it for memory headroom.
  memory_size = 3008

  # The per-run tool budget is the runaway guard; the timeout only has to
  # accommodate several model turns with init + validate rounds in between.
  timeout = 300

  # Two things land in /tmp: terraform init downloads the AWS provider
  # (~800 MB) into the workspace, and the post-run re-lock unpacks the
  # provider for three platforms (another ~2 GB) to write a portable lock
  # file. 4 GB covers both with headroom; the default 512 MB would not.
  ephemeral_storage {
    size = 4096
  }

  tracing_config {
    mode = "Active"
  }

  environment {
    variables = merge(
      {
        MODELS_PARAMETER         = aws_ssm_parameter.models.name
        DEFAULT_MODEL            = var.default_model
        METRICS_NAMESPACE        = local.metrics_namespace
        FIREHOSE_DELIVERY_STREAM = aws_kinesis_firehose_delivery_stream.audit.name
        RUNS_BUCKET              = aws_s3_bucket.runs.bucket
      },
      local.logfire_token_wired ? {
        LOGFIRE_TOKEN_PARAMETER = local.logfire_token_parameter_name
      } : {},
      local.mistral_key_wired ? {
        MISTRAL_API_KEY_PARAMETER = aws_ssm_parameter.mistral_api_key[0].name
      } : {},
    )
  }

  # Code ships out of band: scripts/build-lambda.sh pushes a new image and
  # calls update-function-code, so terraform must not flip the function
  # back to the placeholder on the next apply.
  lifecycle {
    ignore_changes = [image_uri]
  }

  depends_on = [
    aws_iam_role_policy_attachment.lambda_basic_execution,
    aws_iam_role_policy_attachment.lambda_insights,
    aws_iam_role_policy.lambda_permissions,
    terraform_data.placeholder_image,
  ]
}
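The sizing comments in the resource boil down to two bits of arithmetic, sketched below. The figures are the ones in the comments; the assumption that CPU share scales linearly with memory_size is a simplification for this illustration, and the helper names are made up.

```python
FULL_VCPU_MB = 1769  # memory setting at which Lambda allocates one full vCPU


def approx_vcpus(memory_mb: int) -> float:
    """Assumes CPU share scales linearly with memory_size (an approximation)."""
    return memory_mb / FULL_VCPU_MB


def tmp_demand_mb(provider_download_mb: int, relock_unpack_mb: int) -> int:
    """Peak /tmp demand if both artifacts are on disk at the same time."""
    return provider_download_mb + relock_unpack_mb


print(round(approx_vcpus(3008), 1))

# ~800 MB provider download plus ~2 GB of re-lock unpacks, as in the comments.
demand = tmp_demand_mb(800, 2048)
print(demand, demand <= 4096, demand <= 512)
```

The last line prints whether the demand fits the configured 4096 MB and whether it would have fit the 512 MB default, which is the reason for the ephemeral_storage block.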

Watching cold starts

One common downside of a Docker based Lambda is that cold starts are slower than for a zip-packaged Lambda. We do not particularly care about cold starts here, but we do want to measure them and make sure they do not get out of hand due to dependency bloat. The AWS Lambda Insights extension is a good way to measure them, but because container images cannot use layers, we need to bake it into the image. The code looks scary, and I wish AWS thought about the end user the way Astral/uv do, but it is taken straight from the AWS docs.

# Container images cannot attach layers, so the Lambda Insights extension
# is baked into the image: pinned version, detached GPG signature checked
# against the key fingerprint published in the Lambda Insights docs, so a
# tampered rpm fails the build.
FROM public.ecr.aws/lambda/python:3.13 AS insights
ARG INSIGHTS_VERSION=1.0.660.0
ARG INSIGHTS_BASE_URL=https://lambda-insights-extension-arm64.s3-ap-northeast-1.amazonaws.com
# The downloaded key is checked against the fingerprint from the docs
# before anything trusts it; gpg runs with --batch/--no-tty/--no-autostart
# because the base image ships no gpg-agent.
RUN curl -fsSLO ${INSIGHTS_BASE_URL}/amazon_linux/lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm \
    && curl -fsSLO ${INSIGHTS_BASE_URL}/amazon_linux/lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm.sig \
    && curl -fsSLO ${INSIGHTS_BASE_URL}/lambda-insights-extension.gpg \
    && gpg --batch --no-tty --show-keys --with-colons lambda-insights-extension.gpg \
        | grep -q '^fpr:::::::::E0AFFA11FFF35BD7349EE222479C97A1848ABDC8:' \
    && gpg --batch --no-tty --no-autostart --import lambda-insights-extension.gpg \
    && gpg --batch --no-tty --no-autostart --verify \
        lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm.sig \
        lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm \
    && rpm -U lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm \
    && rm -f lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm* lambda-insights-extension.gpg
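The least obvious line in that stage is the grep against the gpg --with-colons output. In that format every record is one colon-separated line, and for fpr records the fingerprint sits in the tenth field, which is why the pattern has nine colons after fpr. A sketch of the same check in Python; the helper names and the sample listing are made up for illustration and are not real gpg output.

```python
EXPECTED = "E0AFFA11FFF35BD7349EE222479C97A1848ABDC8"  # fingerprint from the Dockerfile


def fingerprints(colon_listing: str) -> list[str]:
    """Return field 10 of every 'fpr' record in gpg --with-colons output."""
    found = []
    for line in colon_listing.splitlines():
        fields = line.split(":")
        if fields[0] == "fpr" and len(fields) > 9:
            found.append(fields[9])
    return found


def key_is_trusted(colon_listing: str, expected: str = EXPECTED) -> bool:
    """Same decision as the grep -q: does any fpr record carry the fingerprint?"""
    return expected in fingerprints(colon_listing)


# Synthetic listing: only the 'fpr' line matters here, and the 'pub' line is a
# shortened stand-in rather than real gpg output.
sample = "pub:-:2048:1:479C97A1848ABDC8:::::::\nfpr:::::::::" + EXPECTED + ":\n"
print(key_is_trusted(sample))
print(key_is_trusted(sample.replace("E0AF", "0000")))
```

Checking the fingerprint before importing the key is what stops a swapped key file from vouching for a swapped rpm.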

The extension needs one IAM grant to write to its /aws/lambda-insights log group, a single managed-policy attachment that sits in the file browser above.

Below you see how we wire the metrics into the dashboard. Alongside the standard AWS/Lambda invocation and duration metrics, we take the Lambda Insights metrics and plot:

- average and maximum cold start init duration in ms
- maximum function memory used
- /tmp storage space used


      {
        type   = "text"
        x      = 0
        y      = 0
        width  = 24
        height = 2
        properties = {
          markdown = "## Lambda\nContainer-image function health: invocations and errors, end to end duration, cold start init duration (Lambda Insights emits `init_duration` only on a cold start), and the memory and `/tmp` footprint behind the `memory_size` and `ephemeral_storage` sizing."
        }
      },
      {
        type   = "metric"
        x      = 0
        y      = 2
        width  = 12
        height = 6
        properties = {
          title  = "Lambda invocations and errors"
          region = local.cloudwatch_region
          view   = "timeSeries"
          stat   = "Sum"
          period = 60
          metrics = [
            ["AWS/Lambda", "Invocations", "FunctionName", local.lambda_name, { label = "${local.lambda_name} / invocations" }],
            [".", "Errors", ".", ".", { label = "${local.lambda_name} / errors" }],
            [".", "Throttles", ".", ".", { label = "${local.lambda_name} / throttles" }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 2
        width  = 12
        height = 6
        properties = {
          title  = "Lambda duration (ms)"
          region = local.cloudwatch_region
          view   = "timeSeries"
          period = 60
          metrics = [
            ["AWS/Lambda", "Duration", "FunctionName", local.lambda_name, { label = "${local.lambda_name} / avg", stat = "Average" }],
            [".", ".", ".", ".", { label = "${local.lambda_name} / p99", stat = "p99" }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 0
        y      = 8
        width  = 12
        height = 6
        properties = {
          title  = "Cold start init duration (ms)"
          region = local.cloudwatch_region
          view   = "timeSeries"
          period = 60
          metrics = [
            # Insights reports init_duration only when an init phase happened,
            # so points appear only on cold starts.
            ["LambdaInsights", "init_duration", "function_name", local.lambda_name, { label = "${local.lambda_name} / init avg (ms)", stat = "Average" }],
            [".", ".", ".", ".", { label = "${local.lambda_name} / init max (ms)", stat = "Maximum" }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 8
        width  = 12
        height = 6
        properties = {
          title  = "Memory used (MB)"
          region = local.cloudwatch_region
          view   = "timeSeries"
          period = 60
          metrics = [
            # used_memory_max is the cgroup figure (Max Memory Used), which
            # counts the reclaimable /tmp page cache and so reads ~2 GB while
            # real demand is under 1 GB. Backs the memory_size comment in
            # lambda.tf.
            ["LambdaInsights", "used_memory_max", "function_name", local.lambda_name, { label = "${local.lambda_name} / memory max (MB)", stat = "Maximum" }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 0
        y      = 14
        width  = 12
        height = 6
        properties = {
          title  = "/tmp used (bytes)"
          region = local.cloudwatch_region
          view   = "timeSeries"
          period = 60
          metrics = [
            # tmp_used tracks the ~800 MB provider download into /tmp, behind
            # the 4 GB ephemeral_storage sizing in lambda.tf.
            ["LambdaInsights", "tmp_used", "function_name", local.lambda_name, { label = "${local.lambda_name} / tmp used (B)", stat = "Maximum" }],
          ]
        }
      },

In CloudWatch the dashboard looks like this. Cold starts sit around 3s, acceptable for our use case, and the memory and /tmp metrics stay within the configured limits.

CloudWatch dashboard for terraform-pr-agent with a Lambda section (invocations and errors, duration, cold
starts, memory and /tmp) above a Model section (tokens, invocations and errors, latency, cache tokens).

Workspace setup

The agent needs some context for the run. Pydantic-ai is itself stateless: the agent only has conversation history if the previous messages are passed to it as the message_history argument, as discussed in post 1. Here we also give the agent context about the workspace it is working in, so that every path can be validated against the root (project) folder. Tool calls should normally use paths relative to that root, and we need to make sure everything ends up in this tmp folder so we can pick up the end result of the agent’s work. The files_read set is there to stop the agent from editing files it has not read.

class WorkspaceDeps(BaseModel):
    """Per-run dependencies threaded through ``RunContext``.

    ``root`` is the directory the agent is allowed to read, write, and
    validate inside. Tools resolve every path relative to it and reject
    anything that escapes the root, so the agent cannot reach outside
    the workspace via ``..`` or absolute paths.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)

    root: Path
    files_read: set[Path] = Field(default_factory=set)

Each time the LLM invokes any of our tools with a path we validate it against the workspace root. Luckily Python has the pathlib module that makes this easy.

def _resolve_absolute_path(ctx: RunContext[WorkspaceDeps], path: str) -> Path:
    root = ctx.deps.root.resolve()
    absolute_path = (root / Path(path)).resolve()
    if not absolute_path.is_relative_to(root):
        raise ModelRetry(
            f"Path {path} must be relative to the workspace root, "
            f"it cannot be absolute or walk up the directory tree."
        )
    return absolute_path
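
To see why this one check covers both absolute paths and `..` traversal, here is a minimal standalone sketch of the same idea. It is not the post's tool: resolve_inside and is_allowed are hypothetical names, and it raises ValueError instead of ModelRetry so it runs without pydantic-ai.

```python
import tempfile
from pathlib import Path


def resolve_inside(root: Path, user_path: str) -> Path:
    """Resolve user_path against root; raise ValueError if it escapes root."""
    root = root.resolve()
    # Joining an absolute path onto root discards root entirely, and
    # resolve() collapses ".." segments and follows symlinks, so one
    # containment check afterwards covers all three escape routes.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"{user_path!r} escapes the workspace root")
    return candidate


def is_allowed(root: Path, user_path: str) -> bool:
    """Convenience wrapper that turns the exception into a boolean."""
    try:
        resolve_inside(root, user_path)
    except ValueError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    checks = {
        candidate: is_allowed(workspace, candidate)
        for candidate in [
            "main.tf",
            "modules/vpc/main.tf",
            "modules/../main.tf",  # walks up, but stays inside the root
            "../outside.tf",
            "modules/../../outside.tf",
            "/etc/passwd",
        ]
    }
```

The first three candidates are accepted and the last three are rejected. A path such as modules/../main.tf passes because only the resolved location matters, not the presence of `..` in the string.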

File tools

We have list_files(path), read_file(path), write_file(path, contents), edit_file(path, old_string, new_string) and delete_file(path). We won’t go through all of them; the edit_file tool is representative. It uses _resolve_absolute_path(ctx, path) to resolve the path (always absolute in the end), which also verifies that the path is inside the workspace root. Then it validates additional constraints:

- The file exists.
- The agent has previously read or created the file.
- old_string (the string to replace) is in the file contents.
- old_string occurs only once in the file contents (prevents accidental replacements).

One decision worth calling out: we raise ModelRetry instead of returning a description of the mistake the agent made. The difference is that ModelRetry counts against the agent’s tool retry budget, while a returned string telling the agent what to do next does not. Raising therefore caps long chains of corrections that would otherwise burn tokens. If the budget runs out, we would rather fail and raise an alert in a production environment, so that an SRE or the product team can investigate and fix the cause of the agent’s confusion (a better system prompt or better tool descriptions). During initial validation it also helps you weed out models that are weak at tool calls and therefore unsuitable for the task.

def edit_file(
    ctx: RunContext[WorkspaceDeps],
    path: str,
    old_string: str,
    new_string: str,
) -> None:
    """Replace ``old_string`` with ``new_string`` in the file at ``path``."""
    file = _resolve_absolute_path(ctx, path)
    if not file.exists():
        raise ModelRetry(f"File {path} does not exist.")
    if file not in ctx.deps.files_read:
        raise ModelRetry(f"File {path} was not read. If you want to edit it, then read it first.")
    with file.open("r+") as f:
        contents = f.read()
        if old_string not in contents:
            raise ModelRetry(f"String {old_string} not found in file {path}.")
        if contents.count(old_string) > 1:
            raise ModelRetry(f"String {old_string} found more than once in file {path}.")
        f.seek(0)
        f.write(contents.replace(old_string, new_string))
        f.truncate()
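
The uniqueness rule is easiest to see in isolation. The sketch below pulls the string logic out of the tool into a hypothetical helper, replace_once, which raises ValueError where the tool raises ModelRetry. The HCL snippet is made up for illustration.

```python
def replace_once(contents: str, old: str, new: str) -> str:
    """Replace old with new only if old occurs exactly once in contents."""
    occurrences = contents.count(old)  # counts non-overlapping matches
    if occurrences == 0:
        raise ValueError("old string not found")
    if occurrences > 1:
        raise ValueError(f"old string found {occurrences} times; include more context")
    return contents.replace(old, new, 1)


hcl = (
    'resource "aws_s3_bucket" "a" {\n'
    '  bucket = "a"\n'
    "}\n"
    'resource "aws_s3_bucket" "b" {\n'
    '  bucket = "b"\n'
    "}\n"
)

# Unique match: the edit goes through.
edited = replace_once(hcl, 'bucket = "a"', 'bucket = "logs"')

# Ambiguous match: both resources start the same way, so the edit is refused.
try:
    replace_once(hcl, 'resource "aws_s3_bucket"', 'resource "aws_s3_bucket_v2"')
    ambiguous_rejected = False
except ValueError:
    ambiguous_rejected = True
```

Refusing ambiguous matches pushes the model to quote enough surrounding text to pin down one location, which is the behavior we want from a search-and-replace tool.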

The validate tool: agent-driven, any time

In this post the agent gets only two Terraform tools, terraform_init() and terraform_validate(). terraform_validate, for example, has a very simple flow:

- Use a subprocess to run terraform validate.
- Check the result:
  - If the exit code is 0 (success), return an OK message.
  - If the exit code is non-zero (failure), raise ModelRetry with stdout and stderr.

def _validate(root: Path) -> tuple[bool, str]:
    """Run ``terraform validate`` in ``root``; return (passed, combined output).

    The agent reaches this through the tool below; the caller-side retry in
    handler.py calls it directly to re-check the workspace after the run.
    """
    with track_memory("terraform_validate"):
        result = subprocess.run(
            ["terraform", "validate", "-no-color"],
            cwd=root,
            capture_output=True,
            text=True,
        )
    return result.returncode == 0, f"{result.stdout}{result.stderr}"


def terraform_validate(ctx: RunContext[WorkspaceDeps]) -> str:
    """Run ``terraform validate`` in the workspace and return its output."""
    ok, output = _validate(ctx.deps.root)
    if ok:
        return "OK: terraform validate passed."
    raise ModelRetry(f"terraform validate failed:\n{output}")

A little bonus is the track_memory context manager, which creates a new span and attaches memory information to it, so we can dive into the agent run in Logfire and see how much memory each step used.

Logfire trace UI showing memory.<step> spans from the track_memory decorator, each tagged with mem.used_bytes, mem.cache_bytes, and mem.child_max_rss_bytes attributes.
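
The body of track_memory is not shown in this post. As a rough idea of how one of these figures could be collected with the standard library alone, here is a minimal sketch that records the peak memory of child processes such as terraform. The name track_child_memory and the sink dictionary are hypothetical; the real helper writes span attributes instead. It is Unix-only, and ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.

```python
import resource
import subprocess
import sys
from collections.abc import Iterator
from contextlib import contextmanager


@contextmanager
def track_child_memory(step: str, sink: dict[str, int]) -> Iterator[None]:
    """Store the peak RSS of waited-for child processes under sink[step]."""
    try:
        yield
    finally:
        # RUSAGE_CHILDREN covers children that have terminated and been waited
        # for. ru_maxrss is a high-water mark across all of them so far, not a
        # figure for this block alone.
        usage = resource.getrusage(resource.RUSAGE_CHILDREN)
        sink[step] = usage.ru_maxrss


measurements: dict[str, int] = {}
with track_child_memory("child_python", measurements):
    # Stand-in for the terraform subprocess so the sketch runs anywhere.
    subprocess.run([sys.executable, "-c", "print('hello')"], capture_output=True, check=True)
```

Because the value is a running maximum, a small step that follows a large one reports the large one's peak. That is worth keeping in mind when reading per-step numbers collected this way.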

This is the Agent definition, one instance reused across invocations. The error budget for ModelRetry can be configured in the Agent setup.


# A soft reproducibility nudge following the standard Terraform pattern: a
# version constraint in the config, the exact version and checksums in the lock
# file. The tool-call spans in the trace are the ground truth for what the agent
# actually wrote.
_PROVIDER_PIN_RULE = (
    "When you add the AWS provider, give it a version constraint such as "
    '"~> 6.0" rather than leaving it unconstrained. terraform init records '
    "the exact resolved version and checksums in .terraform.lock.hcl, "
    "which travels with the workspace and is the reproducibility record, "
    "so the constraint does not need to be an exact pin. Pin an exact "
    "version only when the user asks for one. Example:\n"
    "\n"
    "terraform {\n"
    "  required_providers {\n"
    "    aws = {\n"
    '      source  = "hashicorp/aws"\n'
    '      version = "~> 6.0"\n'
    "    }\n"
    "  }\n"
    "}"
)

SYSTEM_PROMPT = (
    "You are the terraform-pr-agent. You operate on a Terraform workspace "
    "through file tools (list_files, read_file, write_file, edit_file, "
    "delete_file) and two terraform tools. Use the file tools to explore, "
    "write, and edit. Run terraform_init before your first validate and "
    "again whenever you add or change provider or module requirements. "
    "Call terraform_validate after you write or change files to confirm "
    "the workspace still parses; treat its output as feedback and edit "
    "until it is clean.\n\n" + _PROVIDER_PIN_RULE
)

# One Agent instance is reused across invocations: the tools reach the workspace
# through RunContext.deps, so each run_sync scopes them to a fresh WorkspaceDeps.
# The model carries no default; it is built from the registry at INVOKE and
# passed per run, so switching DEFAULT_MODEL needs no code change.
agent = Agent(
    deps_type=WorkspaceDeps,
    system_prompt=SYSTEM_PROMPT,
    tools=[
        list_files,
        read_file,
        write_file,
        edit_file,
        delete_file,
        terraform_init,
        terraform_validate,
    ],
    # Tools raise ModelRetry on failure; pydantic-ai ends the run once one tool
    # fails more than `retries` times in a row (a success resets the count). The
    # default of 1 would end the run on the second straight failing validate,
    # which is a normal part of the write-validate-edit loop, so the budget is
    # raised well past anything a converging run produces. The per-run turn cap
    # stays the runaway guard.
    retries=10,
)
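
The retry rule in that last comment is easy to misread, so here is a toy model of it. This is not pydantic-ai's implementation, only the rule as the comment states it: a tool may fail `retries` times in a row, a success resets the count, and one more consecutive failure ends the run. The function name budget_exhausted is hypothetical.

```python
def budget_exhausted(outcomes: list[bool], retries: int) -> bool:
    """Return True if a tool's call outcomes would end the run.

    outcomes holds one entry per call of a single tool: True for success,
    False for a ModelRetry.
    """
    consecutive_failures = 0
    for succeeded in outcomes:
        if succeeded:
            consecutive_failures = 0  # a success resets the count
            continue
        consecutive_failures += 1
        if consecutive_failures > retries:
            return True
    return False


# With a budget of 1, two straight failing validates end the run.
two_failures_default = budget_exhausted([False, False], retries=1)

# Fail, fix, fail again is fine even with a budget of 1.
interleaved_default = budget_exhausted([False, True, False], retries=1)

# With retries=10, the eleventh straight failure is the one that ends the run.
ten_failures = budget_exhausted([False] * 10, retries=10)
eleven_failures = budget_exhausted([False] * 11, retries=10)
```

Under this model a converging write-validate-edit loop never comes close to the limit, while a model that keeps repeating the same broken call hits it quickly.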

Caller-side retry

Working with LLMs reminds me of the German saying:

Vertrauen ist gut, Kontrolle ist besser (Trust is good, verifying is better)

Here the agent can in principle ignore every instruction and report success while nothing actually works. Hence we add deterministic checks at the end and, if they fail, feed message_history back to the agent with a new prompt. At this stage the only check is that terraform validate passes.

            # The agent can report done while terraform validate still fails.
            # Re-validate ourselves and feed any error back as a follow-up turn,
            # reusing the run id and message history so each retry is an
            # invoke_agent span under the one invocation trace. Give up after the
            # budget and raise so the failure is honest rather than a clean run
            # over a broken workspace.
            ok, output = _validate(root)
            attempts = 0
            while not ok and attempts < _MAX_VALIDATE_RETRIES:
                attempts += 1
                result = agent.run_sync(
                    _RETRY_PROMPT.format(output=output),
                    deps=deps,
                    conversation_id=run_id,
                    model=built,
                    message_history=result.all_messages(),
                    metadata={"model": model_name},
                )
                ok, output = _validate(root)
            if not ok:
                raise ValidateDidNotConverge(output)
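
The control flow of that loop can be exercised without Terraform or a model. The sketch below keeps the same shape but takes the validator and the "ask the agent to fix it" step as plain callables. All names here (converge, DidNotConverge, fake_validate, fake_fix) are hypothetical stand-ins.

```python
from collections.abc import Callable


class DidNotConverge(RuntimeError):
    """Stand-in for the post's ValidateDidNotConverge."""


def converge(
    validate: Callable[[], tuple[bool, str]],
    fix: Callable[[str], None],
    max_retries: int,
) -> int:
    """Re-check, hand each failure to fix, and return the retries used."""
    ok, output = validate()
    attempts = 0
    while not ok and attempts < max_retries:
        attempts += 1
        fix(output)  # in the real handler: agent.run_sync with the error text
        ok, output = validate()
    if not ok:
        raise DidNotConverge(output)
    return attempts


# A fake workspace that needs two fixes before it validates.
remaining_errors = ["missing provider block", "unsupported argument"]


def fake_validate() -> tuple[bool, str]:
    if remaining_errors:
        return False, remaining_errors[0]
    return True, ""


def fake_fix(error: str) -> None:
    remaining_errors.pop(0)  # pretend the agent fixed the reported error


attempts_used = converge(fake_validate, fake_fix, max_retries=3)
```

Note that the validator always runs once more after the final fix, so the last attempt is never trusted on the agent's word alone.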

Parking the output in S3

In this iteration of the code we do not yet integrate GitHub, so to be able to see the code that the agent produced, we need to store it somewhere. This code stores it in an S3 bucket so we can inspect it later. We skip the .terraform folder, which is just local init scratch.

def _persist_run(run_id: str, workspace: Path, status: str, error: str | None = None) -> None:
    """Park the workspace and a minimal result marker under runs/<run_id>/.

    result.json carries only what the audit trace does not: prompt, output, and
    messages already live in the trace under the same conversation id, so copying
    them here would create a second source of truth.
    """
    bucket = require_env("RUNS_BUCKET")
    s3 = boto3.client("s3")
    for file in _workspace_files(workspace):
        key = f"runs/{run_id}/workspace/{file.relative_to(workspace)}"
        s3.put_object(Bucket=bucket, Key=key, Body=file.read_bytes())
    result = {"status": status} | ({"error": error} if error else {})
    s3.put_object(
        Bucket=bucket,
        Key=f"runs/{run_id}/result.json",
        Body=json.dumps(result).encode(),
    )


def _workspace_files(workspace: Path) -> Iterator[Path]:
    """Every file except .terraform/, which is init scratch plus the provider
    downloaded into /tmp, gigabytes of noise per run. The top-level
    .terraform.lock.hcl is the reproducibility record and stays.
    """
    for path in sorted(workspace.rglob("*")):
        if ".terraform" in path.relative_to(workspace).parts:
            continue
        if path.is_file():
            yield path
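
The filter depends on comparing whole path components rather than prefixes, which is why .terraform.lock.hcl survives while everything under .terraform/ is dropped. A small self-contained check of that behavior follows. The function is a copy of the filter above under a hypothetical name, and the file layout is invented.

```python
import tempfile
from collections.abc import Iterator
from pathlib import Path


def workspace_files(workspace: Path) -> Iterator[Path]:
    """Yield every file whose relative path has no '.terraform' component."""
    for path in sorted(workspace.rglob("*")):
        # .parts splits on path separators, so ".terraform.lock.hcl" is one
        # component and does not equal ".terraform".
        if ".terraform" in path.relative_to(workspace).parts:
            continue
        if path.is_file():
            yield path


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "main.tf").write_text("# root config\n")
    (workspace / ".terraform.lock.hcl").write_text("# lock file\n")
    (workspace / ".terraform" / "providers").mkdir(parents=True)
    (workspace / ".terraform" / "providers" / "plugin.bin").write_bytes(b"\0")
    (workspace / "modules" / "vpc").mkdir(parents=True)
    (workspace / "modules" / "vpc" / "main.tf").write_text("# module config\n")

    kept = {p.relative_to(workspace).as_posix() for p in workspace_files(workspace)}
```

Here kept contains main.tf, .terraform.lock.hcl and modules/vpc/main.tf, and nothing from the .terraform directory.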

Tracing the whole invocation

The Lambda entry point stays thin: it requires a prompt, hands off to core.execute, and shapes the response. Its other job is the INIT wiring (configure telemetry, then wrap the handler), which has to run at import. Keeping the execution logic separate from anything Lambda-related also means we could add other entry points later and deploy this somewhere else (e.g. ECS Fargate).

"""The Lambda boundary: parse the event, run the agent, shape the response.

This module also owns the INIT wiring. The container CMD targets
agent.lambda_entry.handler, so a unit test that imports agent.core never
configures logfire or registers the Firehose audit processor. Everything
Lambda-specific lives here, off the core, which is why no runtime-detection
check is needed to keep it out of tests.
"""

from __future__ import annotations

from typing import NotRequired, TypedDict

import logfire

from agent import observability
from agent.core import execute


class HandlerEvent(TypedDict):
    prompt: str
    model: NotRequired[str]


class HandlerResponse(TypedDict):
    status: str
    run_id: str
    model: str
    output: str


def handler(event: HandlerEvent, context: object) -> HandlerResponse:
    """Lambda entry point: require a prompt, run the agent, wrap the result.

    ``prompt`` is required; an event without one is a caller error and fails
    fast rather than running a default. ``model`` overrides DEFAULT_MODEL when
    given. A run that does not converge raises, so the Lambda reports 5xx and
    the workspace ships under status error for debugging.
    """
    prompt = event.get("prompt")
    if not prompt:
        raise ValueError("event missing required 'prompt'")
    result = execute(prompt, event.get("model"))
    return {
        "status": "ok",
        "run_id": result.run_id,
        "model": result.model,
        "output": result.output,
    }


def bootstrap() -> None:
    """Stand up telemetry, then attach the Lambda runtime adapter.

    configure() first so the tracer provider exists when the handler is wrapped.
    instrument_aws_lambda wraps the target named by _HANDLER
    (agent.lambda_entry.handler) in place, so each invocation becomes one trace.
    """
    observability.configure()
    logfire.instrument_aws_lambda(handler)


bootstrap()

Changing the span hierarchy also means the example query from a previous post for reading the audit data in S3 needs to change. We already changed the audit processor to ship a span if span.parent is None or span.parent.is_remote. In the query we now select the agent invocations specifically, using gen_ai.operation.name = 'invoke_agent'. This works with or without the Lambda OTEL data being written.

roots AS (
    -- One row per agent run: pydantic-ai's invoke_agent span, identified by
    -- the GenAI operation rather than by being the parentless span.
    -- instrument_aws_lambda now roots each trace at the SpanKind.SERVER
    -- invocation span, so the agent run is a child of it, not the trace root.
    -- (A caller-side retry would put several invoke_agent spans under one
    -- invocation; the trace_id join below would then cross them, so that
    -- case wants each chat tied to its enclosing run instead.)
    SELECT * FROM spans
    WHERE list_filter(attributes, x -> x.key = 'gen_ai.operation.name')[1]
        .value.stringValue = 'invoke_agent'
),
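
For readers who prefer to poke at the exported spans in Python, the same selection can be sketched over plain dictionaries. The span layout below (an attributes list of key/value entries with a stringValue) is inferred from the query above, and the sample spans and the helper name string_attribute are made up.

```python
from __future__ import annotations


def string_attribute(span: dict, key: str) -> str | None:
    """Return the stringValue of the first attribute named key, if any."""
    for item in span.get("attributes", []):
        if item["key"] == key:
            return item["value"].get("stringValue")
    return None


# Invented sample: a Lambda invocation span with an agent run and a chat below it.
spans = [
    {"name": "lambda invocation", "attributes": []},
    {
        "name": "agent run",
        "attributes": [
            {"key": "gen_ai.operation.name", "value": {"stringValue": "invoke_agent"}}
        ],
    },
    {
        "name": "chat",
        "attributes": [
            {"key": "gen_ai.operation.name", "value": {"stringValue": "chat"}}
        ],
    },
]

agent_runs = [
    span for span in spans
    if string_attribute(span, "gen_ai.operation.name") == "invoke_agent"
]
```

Selecting by operation name rather than by "has no parent" is what keeps the result stable whether or not the Lambda invocation span is present.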

What validate catches and what it doesn’t

Currently we only run terraform validate, which checks that the HCL parses and that the configuration is internally consistent with the provider schemas. It does not catch bad naming for Terraform variables, security issues, or misconfigurations that only show up during apply. In the following post we remedy some of this by giving the agent more tools to validate against, and by enforcing extra checks such as security baselines.

End state

An agent that takes an English request and leaves a workspace in a validated state. File tools, validate tool, retry chain, and a structured self-report are all in place for posts 4-6 to build on.

Next: Conventions and policy: more tools, same feedback loop