Module 1: LLM API Mastery · Lesson 2 of 5 · 20 min

Sampling Parameters & Streaming

Temperature and top_p control the randomness of token selection — get them wrong and your agent is either erratic or uselessly rigid. Streaming turns dead air into perceived speed.

At each step of generation, the model produces a probability distribution over its entire vocabulary for the next token. Sampling parameters shape how a token gets picked from that distribution.

[Figure: next-token probabilities for "Paris", "paris", "The", "France". temperature = 0.1 (peaked, near-deterministic): 86%, 8%, 4%, 2%. temperature = 1.0 (flatter, more diverse): 44%, 22%, 18%, 16%.]
Low temperature sharpens the distribution toward the top token; high temperature flattens it, letting unlikely tokens through.
  • temperature (0–1 Anthropic, 0–2 OpenAI): divides the logits by the temperature before softmax. Near 0 → almost always the top token (near-deterministic, but not perfectly, since GPU nondeterminism and ties remain). High → more diverse, more creative, more wrong.
  • top_p (nucleus sampling): sample only from the smallest set of tokens whose cumulative probability ≥ p. top_p=0.9 ignores the long tail entirely.
  • Adjust one, not both. Both reshape the same distribution, so their effects compound and become hard to reason about together. Pick temperature as your primary dial.
  • max_tokens is a hard output cap, not a target: the model doesn't know it exists. Set it as a safety rail against runaway generation, and detect truncation by checking for stop_reason == "max_tokens" (Anthropic) or finish_reason == "length" (OpenAI).
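To make the first two bullets concrete, here is a minimal sketch of temperature scaling followed by nucleus (top_p) filtering in plain Python. The logit values and the helper names `apply_temperature` and `top_p_filter` are made up for illustration; real inference stacks do this on tensors and may order or combine the steps differently.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

tokens = ["Paris", "paris", "The", "France"]
logits = [2.0, 1.3, 1.1, 1.0]  # made-up values, not from a real model

cold = apply_temperature(logits, 0.1)   # sharply peaked on "Paris"
warm = apply_temperature(logits, 1.0)   # flatter distribution
nucleus = top_p_filter(warm, 0.6)       # only the top tokens survive

for name, dist in (("T=0.1", cold), ("T=1.0", warm)):
    print(name, {t: round(pr, 3) for t, pr in zip(tokens, dist)})
print("top_p=0.6 keeps:", [tokens[i] for i in nucleus])
```

Lowering the temperature concentrates mass on the top token, while top_p drops the tail outright and renormalizes what is left, which is why moving both dials at once is hard to interpret.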
Settings for agents
Tool-calling agents want temperature 0–0.3. You're asking for precise JSON and correct function arguments, not prose flair — determinism aids debugging and eval stability too. Reserve higher temperatures for brainstorming/creative subtasks, and set it per-call, not globally.
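One way to keep temperature per-call is a small lookup that builds request settings from the kind of subtask, paired with a truncation check on the response. This is only a sketch: the task labels, the default values, and the helper names `request_params` and `was_truncated` are hypothetical and should be tuned against your own evals.

```python
# Hypothetical task labels and temperatures, chosen for illustration only.
TEMPERATURE_BY_TASK = {
    "tool_call": 0.0,
    "summarize": 0.2,
    "brainstorm": 0.9,
}

def request_params(task, max_tokens=1024):
    """Build per-call sampling settings instead of relying on one global default."""
    return {
        "temperature": TEMPERATURE_BY_TASK.get(task, 0.2),  # conservative fallback
        "max_tokens": max_tokens,
    }

def was_truncated(stop_reason):
    """True when generation ended because it hit the output cap."""
    # Anthropic reports "max_tokens" in stop_reason; OpenAI reports "length" in finish_reason.
    return stop_reason in {"max_tokens", "length"}

print(request_params("tool_call"))
print(was_truncated("max_tokens"), was_truncated("end_turn"))
```

The returned dict can be unpacked into a request with `**request_params("tool_call")`, so each call site states which kind of work it is doing.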

Streaming

Without streaming you wait for the full completion before showing anything; for a long answer that's many seconds of dead air. With stream=True the API returns server-sent events (SSE), delivering tokens as they're generated. Time-to-first-token becomes your perceived latency, and for long responses it can be many times lower than time-to-full-response.
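The gap between time-to-first-token and total time is easy to measure around any chunk iterator. The sketch below uses a fake generator (`fake_stream`, a stand-in invented here, not an SDK object) so it runs without network access; the same `consume_with_timing` helper, also hypothetical, could wrap a real stream of text chunks.

```python
import time

def fake_stream(chunks, delay=0.01):
    """Stand-in for a streaming response: yields text chunks with a pause between them."""
    for chunk in chunks:
        time.sleep(delay)
        yield chunk

def consume_with_timing(stream):
    """Return (full_text, seconds_to_first_chunk, seconds_total) for any chunk iterator."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in stream:
        if first is None:
            first = time.perf_counter() - start  # perceived latency
        parts.append(chunk)
    total = time.perf_counter() - start
    return "".join(parts), first, total

text, ttft, total = consume_with_timing(
    fake_stream(["The ", "agent ", "calls ", "the ", "tool."])
)
print(f"first chunk after {ttft:.3f}s, full response after {total:.3f}s")
```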

[Figure: the model generates tokens ("The agent calls the search tool and waits") and the client renders them as server-sent events arrive, e.g. data: {"delta": "…"}.]
SSE delivers deltas as the model generates; the client renders incrementally.
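For intuition about what arrives on the wire, here is a rough sketch of reading SSE `data:` lines and joining the deltas. The payload shape `{"delta": ...}` mirrors the schematic in the figure and is an assumption; real providers send richer event objects, and the SDKs parse them for you. `iter_sse_data` is a hypothetical helper name.

```python
import json

def iter_sse_data(lines):
    """Yield the decoded JSON payload of each SSE event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # SSE strips one leading space after the colon
            buffer.append(value)
        elif line == "" and buffer:
            # A blank line ends the event; multiple data lines are joined by newlines.
            yield json.loads("\n".join(buffer))
            buffer = []
        # Comment lines (starting with ":") and other fields are ignored in this sketch.

sample = [
    'data: {"delta": "The"}\n', "\n",
    ": keep-alive comment\n",
    'data: {"delta": " agent"}\n', "\n",
]
text = "".join(event["delta"] for event in iter_sse_data(sample))
print(text)
```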
streaming with both SDKs
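import anthropic
from openai import OpenAI

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
openai_client = OpenAI()         # reads OPENAI_API_KEY from the environment

# Anthropic
with client.messages.stream(
    model="claude-sonnet-4-5", max_tokens=1024,
    messages=[{"role": "user", "content": "Explain SSE in one paragraph."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    final = stream.get_final_message()      # full message + usage, read while the stream is open

# OpenAI
resp = openai_client.chat.completions.create(
    model="gpt-4o", stream=True,
    stream_options={"include_usage": True},  # final chunk carries usage and has no choices
    messages=[{"role": "user", "content": "Explain SSE in one paragraph."}],
)
usage = None
for chunk in resp:
    if chunk.usage:
        usage = chunk.usage                  # token counts for the whole response
    if not chunk.choices:
        continue                             # usage-only chunk, nothing to print
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)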
Streaming complicates tool calling slightly: tool-call arguments arrive as partial JSON fragments you must accumulate until the block completes. The SDKs' helper events (content_block_stop, accumulated snapshots) handle this — use them rather than parsing fragments yourself.
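To show what those helpers are doing under the hood, here is a sketch of fragment accumulation. The event dicts are simplified and flattened for illustration (in Anthropic's actual stream the fragment sits in a nested delta object), and `collect_tool_inputs` is a hypothetical name; in production code, prefer the SDK's accumulated snapshots.

```python
import json

def collect_tool_inputs(events):
    """Accumulate partial JSON per content block; parse only when the block stops."""
    buffers = {}    # block index -> list of JSON fragments
    completed = {}  # block index -> parsed tool arguments
    for event in events:
        index = event["index"]
        if event["type"] == "content_block_start":
            buffers[index] = []
        elif event["type"] == "content_block_delta":
            buffers[index].append(event["partial_json"])
        elif event["type"] == "content_block_stop":
            raw = "".join(buffers.pop(index))
            # Fragments are not valid JSON on their own; parse the joined string.
            completed[index] = json.loads(raw) if raw else {}
    return completed

events = [
    {"type": "content_block_start", "index": 0},
    {"type": "content_block_delta", "index": 0, "partial_json": '{"query": "wea'},
    {"type": "content_block_delta", "index": 0, "partial_json": 'ther in Paris"}'},
    {"type": "content_block_stop", "index": 0},
]
args = collect_tool_inputs(events)
print(args)
```

Parsing only at the stop event matters because any individual fragment, such as `{"query": "wea`, would raise a JSON decode error.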
Key takeaways
  • Sampling picks from a probability distribution; temperature scales it, top_p truncates it. Tune one.
  • Agents doing tool calls: temperature 0–0.3.
  • max_tokens is a guardrail; detect truncation via stop_reason.
  • Streaming = SSE deltas; time-to-first-token is the UX metric that matters.
  • Always capture the final usage/message object after a stream completes.