
Commit 2c679e8

feat: add llm_time_to_first_token_seconds and llm_tokens_per_second metrics (#1440)
> [!NOTE]
> Introduce `llm_time_to_first_token_seconds` and `llm_tokens_per_second` metrics, instrument OpenAI/Anthropic streaming to emit them, and add Grafana panels/docs/queries.
>
> - **Backend Observability**:
>   - Add histograms `llm_time_to_first_token_seconds` and `llm_tokens_per_second` in `platform/backend/src/llm-metrics.ts` with init/cleanup, label handling, and buckets.
>   - New reporters: `reportTimeToFirstToken` and `reportTokensPerSecond`.
> - **LLM Proxies (Streaming)**:
>   - `routes/proxy/openai.ts` and `routes/proxy/anthropic.ts`: capture the first stream chunk to record TTFT; compute and report tokens/sec at stream end from usage and elapsed time (a hedged sketch of this pattern appears after the `llm-metrics.ts` diff below).
> - **Tests**:
>   - Extend `llm-metrics.test.ts` with cases for TTFT/tokens-per-second and initialization paths; minor auth plugin test update to mock permission success for the API-key flow.
> - **Docs**:
>   - `docs/pages/platform-observability.md`: document the new metrics and add PromQL examples; minor YAML quoting fix.
> - **Grafana Dashboard**:
>   - `platform/dev/grafana/dashboards/platform.json`: add timeseries panels for TTFT (p50/p95) and Tokens/sec (p50/p95); adjust layout/legend formatting.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 137b2da.</sup>
1 parent d910ce3 · commit 2c679e8

7 files changed: +684 additions, −59 deletions

docs/pages/platform-observability.md

Lines changed: 29 additions & 3 deletions
````diff
@@ -4,7 +4,7 @@ category: Archestra Platform
 order: 5
 ---

-<!--
+<!--
 Check ../docs_writer_prompt.md before changing this file.

 This document is human-built, shouldn't be updated with AI. Don't change anything here.
@@ -42,6 +42,8 @@ The endpoint `http://localhost:9050/metrics` exposes Prometheus-formatted metric
 - `llm_request_duration_seconds` - LLM API request duration by provider, profile_id, profile_name, and status code
 - `llm_tokens_total` - Token consumption by provider, profile_id, profile_name, and type (input/output)
 - `llm_blocked_tool_total` - Counter of tool calls blocked by tool invocation policies, grouped by provider, profile_id, and profile_name
+- `llm_time_to_first_token_seconds` - Time to first token (TTFT) for streaming requests, by provider, profile_id, profile_name, and model. Helps developers choose models with lower initial response latency.
+- `llm_tokens_per_second` - Output tokens per second throughput, by provider, profile_id, profile_name, and model. Allows comparing model response speeds for latency-sensitive applications.

 > **Note:** The `agent_id` and `agent_name` labels are deprecated and will be removed in a future release. Please migrate your dashboards and alerts to use `profile_id` and `profile_name` instead. During the transition period, both label variants are emitted.

@@ -156,9 +158,9 @@ Add the following to your `prometheus.yml`:

 ```yaml
 scrape_configs:
-  - job_name: 'archestra-backend'
+  - job_name: "archestra-backend"
     static_configs:
-      - targets: ['localhost:9050'] # Platform API base URL
+      - targets: ["localhost:9050"] # Platform API base URL
     scrape_interval: 15s
     metrics_path: /metrics
 ```
@@ -229,4 +231,28 @@ Here are some PromQL queries for Grafana charts to get you started:
   sum(rate(llm_request_duration_seconds_count{status_code!~"2.."}[5m])) by (profile_name) / sum(rate(llm_request_duration_seconds_count[5m])) by (profile_name)
   ```

+- Time to first token (TTFT) p95 by model:
+
+  ```promql
+  histogram_quantile(0.95, sum(rate(llm_time_to_first_token_seconds_bucket[5m])) by (model, le))
+  ```
+
+- Average time to first token by provider:
+
+  ```promql
+  sum(rate(llm_time_to_first_token_seconds_sum[5m])) by (provider) / sum(rate(llm_time_to_first_token_seconds_count[5m])) by (provider)
+  ```
+
+- Tokens per second throughput p50 by model:
+
+  ```promql
+  histogram_quantile(0.50, sum(rate(llm_tokens_per_second_bucket[5m])) by (model, le))
+  ```
+
+- Average tokens per second by provider and model:
+
+  ```promql
+  sum(rate(llm_tokens_per_second_sum[5m])) by (provider, model) / sum(rate(llm_tokens_per_second_count[5m])) by (provider, model)
+  ```
+
 > **Note:** The `agent_name` label in PromQL queries is deprecated. Please migrate to `profile_name` for new dashboards and alerts.
````
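The "average" PromQL examples above divide a histogram's `_sum` by its `_count`. For a quick local sanity check of the new series without running Prometheus, a small script against the documented `/metrics` endpoint can compute the same ratio. This is a hedged sketch (Node 18+ with top-level `await` assumed), not part of the commit:

```ts
// Fetch the raw Prometheus exposition text from the endpoint documented above.
const body = await (await fetch("http://localhost:9050/metrics")).text();

// Sum all samples of a given series name across labels,
// e.g. "llm_time_to_first_token_seconds_sum".
const sumSeries = (name: string): number =>
  body
    .split("\n")
    .filter((line) => line.startsWith(name))
    .reduce((total, line) => total + Number(line.trim().split(/\s+/).pop()), 0);

const ttftSum = sumSeries("llm_time_to_first_token_seconds_sum");
const ttftCount = sumSeries("llm_time_to_first_token_seconds_count");

// Mirrors the "average time to first token" query: sum / count.
console.log(
  ttftCount > 0 ? `mean TTFT: ${(ttftSum / ttftCount).toFixed(3)}s` : "no TTFT samples yet",
);
```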

platform/backend/src/auth/fastify-plugin/plugin.test.ts

Lines changed: 4 additions & 0 deletions
```diff
@@ -105,6 +105,10 @@ describe("authPlugin integration", () => {
         updatedAt: new Date(),
       } as ApiKey,
     });
+    mockHasPermission.mockResolvedValue({
+      success: true,
+      error: null,
+    });
     mockUserModel.getById.mockResolvedValue({
       id: "user1",
       name: "Test User",
```

platform/backend/src/llm-metrics.test.ts

Lines changed: 163 additions & 0 deletions
```diff
@@ -35,6 +35,8 @@ import {
   reportBlockedTools,
   reportLLMCost,
   reportLLMTokens,
+  reportTimeToFirstToken,
+  reportTokensPerSecond,
 } from "./llm-metrics";

 describe("getObservableFetch", () => {
@@ -627,3 +629,164 @@ describe("reportBlockedTools with model", () => {
     );
   });
 });
+
+describe("reportTimeToFirstToken", () => {
+  let testAgent: Agent;
+
+  beforeEach(async ({ makeAgent }) => {
+    vi.clearAllMocks();
+    testAgent = await makeAgent();
+    initializeMetrics([]);
+  });
+
+  test("records time to first token with model", () => {
+    reportTimeToFirstToken("openai", testAgent, "gpt-4", 0.5);
+
+    expect(histogramObserve).toHaveBeenCalledWith(
+      {
+        provider: "openai",
+        agent_id: testAgent.id,
+        agent_name: testAgent.name,
+        profile_id: testAgent.id,
+        profile_name: testAgent.name,
+        model: "gpt-4",
+      },
+      0.5,
+    );
+  });
+
+  test("records time to first token without model", () => {
+    reportTimeToFirstToken("anthropic", testAgent, undefined, 0.25);
+
+    expect(histogramObserve).toHaveBeenCalledWith(
+      {
+        provider: "anthropic",
+        agent_id: testAgent.id,
+        agent_name: testAgent.name,
+        profile_id: testAgent.id,
+        profile_name: testAgent.name,
+        model: "unknown",
+      },
+      0.25,
+    );
+  });
+
+  test("skips reporting for invalid TTFT value", () => {
+    reportTimeToFirstToken("openai", testAgent, "gpt-4", 0);
+    reportTimeToFirstToken("openai", testAgent, "gpt-4", -1);
+
+    expect(histogramObserve).not.toHaveBeenCalled();
+  });
+
+  test("records TTFT for different providers", () => {
+    reportTimeToFirstToken("gemini", testAgent, "gemini-pro", 0.3);
+
+    expect(histogramObserve).toHaveBeenCalledWith(
+      {
+        provider: "gemini",
+        agent_id: testAgent.id,
+        agent_name: testAgent.name,
+        profile_id: testAgent.id,
+        profile_name: testAgent.name,
+        model: "gemini-pro",
+      },
+      0.3,
+    );
+  });
+});
+
+describe("reportTokensPerSecond", () => {
+  let testAgent: Agent;
+
+  beforeEach(async ({ makeAgent }) => {
+    vi.clearAllMocks();
+    testAgent = await makeAgent();
+    initializeMetrics([]);
+  });
+
+  test("records tokens per second with model", () => {
+    // 100 tokens in 2 seconds = 50 tokens/sec
+    reportTokensPerSecond("openai", testAgent, "gpt-4", 100, 2);
+
+    expect(histogramObserve).toHaveBeenCalledWith(
+      {
+        provider: "openai",
+        agent_id: testAgent.id,
+        agent_name: testAgent.name,
+        profile_id: testAgent.id,
+        profile_name: testAgent.name,
+        model: "gpt-4",
+      },
+      50,
+    );
+  });
+
+  test("records tokens per second without model", () => {
+    // 150 tokens in 3 seconds = 50 tokens/sec
+    reportTokensPerSecond("anthropic", testAgent, undefined, 150, 3);
+
+    expect(histogramObserve).toHaveBeenCalledWith(
+      {
+        provider: "anthropic",
+        agent_id: testAgent.id,
+        agent_name: testAgent.name,
+        profile_id: testAgent.id,
+        profile_name: testAgent.name,
+        model: "unknown",
+      },
+      50,
+    );
+  });
+
+  test("skips reporting for zero output tokens", () => {
+    reportTokensPerSecond("openai", testAgent, "gpt-4", 0, 2);
+
+    expect(histogramObserve).not.toHaveBeenCalled();
+  });
+
+  test("skips reporting for zero duration", () => {
+    reportTokensPerSecond("openai", testAgent, "gpt-4", 100, 0);
+
+    expect(histogramObserve).not.toHaveBeenCalled();
+  });
+
+  test("skips reporting for negative duration", () => {
+    reportTokensPerSecond("openai", testAgent, "gpt-4", 100, -1);
+
+    expect(histogramObserve).not.toHaveBeenCalled();
+  });
+
+  test("calculates correct tokens/sec for fast response", () => {
+    // 50 tokens in 0.5 seconds = 100 tokens/sec
+    reportTokensPerSecond("gemini", testAgent, "gemini-pro", 50, 0.5);
+
+    expect(histogramObserve).toHaveBeenCalledWith(
+      {
+        provider: "gemini",
+        agent_id: testAgent.id,
+        agent_name: testAgent.name,
+        profile_id: testAgent.id,
+        profile_name: testAgent.name,
+        model: "gemini-pro",
+      },
+      100,
+    );
+  });
+
+  test("calculates correct tokens/sec for slow response", () => {
+    // 200 tokens in 10 seconds = 20 tokens/sec
+    reportTokensPerSecond("anthropic", testAgent, "claude-3", 200, 10);
+
+    expect(histogramObserve).toHaveBeenCalledWith(
+      {
+        provider: "anthropic",
+        agent_id: testAgent.id,
+        agent_name: testAgent.name,
+        profile_id: testAgent.id,
+        profile_name: testAgent.name,
+        model: "claude-3",
+      },
+      20,
+    );
+  });
+});
```
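For context, the `histogramObserve` spy asserted on above is defined earlier in `llm-metrics.test.ts` and is not part of this diff. A hypothetical wiring, purely an assumption about the existing setup and shown only to make the assertions readable, could look like:

```ts
import { vi } from "vitest";

// vi.hoisted makes the spy visible inside the hoisted vi.mock factory below.
const { histogramObserve } = vi.hoisted(() => ({ histogramObserve: vi.fn() }));

// Replace prom-client so every Histogram created by llm-metrics.ts routes its
// observations through the shared spy. The Counter/register shapes are guesses.
vi.mock("prom-client", () => ({
  default: {
    Histogram: vi.fn(() => ({ observe: histogramObserve })),
    Counter: vi.fn(() => ({ inc: vi.fn() })),
    register: { removeSingleMetric: vi.fn() },
  },
}));
```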

platform/backend/src/llm-metrics.ts

Lines changed: 91 additions & 2 deletions
```diff
@@ -25,6 +25,8 @@ let llmRequestDuration: client.Histogram<string>;
 let llmTokensCounter: client.Counter<string>;
 let llmBlockedToolCounter: client.Counter<string>;
 let llmCostTotal: client.Counter<string>;
+let llmTimeToFirstToken: client.Histogram<string>;
+let llmTokensPerSecond: client.Histogram<string>;

 // Store current label keys for comparison
 let currentLabelKeys: string[] = [];
@@ -50,7 +52,9 @@ export function initializeMetrics(labelKeys: string[]): void {
     llmRequestDuration &&
     llmTokensCounter &&
     llmBlockedToolCounter &&
-    llmCostTotal
+    llmCostTotal &&
+    llmTimeToFirstToken &&
+    llmTokensPerSecond
   ) {
     logger.info(
       "Metrics already initialized with same label keys, skipping reinitialization",
@@ -74,6 +78,12 @@ export function initializeMetrics(labelKeys: string[]): void {
     if (llmCostTotal) {
      client.register.removeSingleMetric("llm_cost_total");
     }
+    if (llmTimeToFirstToken) {
+      client.register.removeSingleMetric("llm_time_to_first_token_seconds");
+    }
+    if (llmTokensPerSecond) {
+      client.register.removeSingleMetric("llm_tokens_per_second");
+    }
   } catch (_error) {
     // Ignore errors if metrics don't exist
   }
@@ -117,8 +127,26 @@ export function initializeMetrics(labelKeys: string[]): void {
     labelNames: [...baseLabelNames, ...nextLabelKeys],
   });

+  llmTimeToFirstToken = new client.Histogram({
+    name: "llm_time_to_first_token_seconds",
+    help: "Time to first token in seconds (streaming latency)",
+    labelNames: [...baseLabelNames, ...nextLabelKeys],
+    // Buckets optimized for TTFT - typically faster than full response
+    buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10],
+  });
+
+  llmTokensPerSecond = new client.Histogram({
+    name: "llm_tokens_per_second",
+    help: "Output tokens per second throughput",
+    labelNames: [...baseLabelNames, ...nextLabelKeys],
+    // Buckets for tokens/sec throughput - typical range 10-200 tokens/sec
+    buckets: [5, 10, 25, 50, 75, 100, 150, 200, 300],
+  });
+
   logger.info(
-    `Metrics initialized with ${nextLabelKeys.length} agent label keys: ${nextLabelKeys.join(", ")}`,
+    `Metrics initialized with ${
+      nextLabelKeys.length
+    } agent label keys: ${nextLabelKeys.join(", ")}`,
   );
 }

@@ -223,6 +251,67 @@ export function reportLLMCost(
   llmCostTotal.inc(buildMetricLabels(agent, { provider }, model), cost);
 }

+/**
+ * Reports time to first token (TTFT) for streaming LLM requests.
+ * This metric helps application developers understand streaming latency
+ * and choose models with lower initial response times.
+ * @param provider The LLM provider
+ * @param agent The agent/profile making the request
+ * @param model The model name
+ * @param ttftSeconds Time to first token in seconds
+ */
+export function reportTimeToFirstToken(
+  provider: SupportedProvider,
+  agent: Agent,
+  model: string | undefined,
+  ttftSeconds: number,
+): void {
+  if (!llmTimeToFirstToken) {
+    logger.warn("LLM metrics not initialized, skipping TTFT reporting");
+    return;
+  }
+  if (ttftSeconds <= 0) {
+    logger.warn("Invalid TTFT value, must be positive");
+    return;
+  }
+  llmTimeToFirstToken.observe(
+    buildMetricLabels(agent, { provider }, model),
+    ttftSeconds,
+  );
+}
+
+/**
+ * Reports tokens per second throughput for LLM requests.
+ * This metric allows comparing model response speeds and helps
+ * developers choose models for latency-sensitive applications.
+ * @param provider The LLM provider
+ * @param agent The agent/profile making the request
+ * @param model The model name
+ * @param outputTokens Number of output tokens generated
+ * @param durationSeconds Total request duration in seconds
+ */
+export function reportTokensPerSecond(
+  provider: SupportedProvider,
+  agent: Agent,
+  model: string | undefined,
+  outputTokens: number,
+  durationSeconds: number,
+): void {
+  if (!llmTokensPerSecond) {
+    logger.warn("LLM metrics not initialized, skipping tokens/sec reporting");
+    return;
+  }
+  if (durationSeconds <= 0 || outputTokens <= 0) {
+    // Skip reporting if no output tokens or invalid duration
+    return;
+  }
+  const tokensPerSecond = outputTokens / durationSeconds;
+  llmTokensPerSecond.observe(
+    buildMetricLabels(agent, { provider }, model),
+    tokensPerSecond,
+  );
+}
+
 /**
  * Returns a fetch wrapped in observability. Use it as OpenAI or Anthropic provider custom fetch implementation.
  */
```
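The commit also instruments the streaming proxies (`routes/proxy/openai.ts`, `routes/proxy/anthropic.ts`) to call the reporters defined above, but those diffs are not shown on this page. As a hedged sketch only — the handler shape, import paths, `Agent` type source, and usage field names are illustrative assumptions, not the commit's actual proxy code — the pattern looks roughly like this:

```ts
import { performance } from "node:perf_hooks";

// Illustrative import paths; the real proxy code in this commit may differ.
import { reportTimeToFirstToken, reportTokensPerSecond } from "../../llm-metrics";
import type { Agent } from "../../types";
import type { SupportedProvider } from "../../llm-metrics";

async function observeStream(
  provider: SupportedProvider,
  agent: Agent,
  model: string | undefined,
  // Assumed chunk shape: providers typically attach usage to the final chunk(s).
  stream: AsyncIterable<{ usage?: { completion_tokens?: number } }>,
): Promise<void> {
  const startedAt = performance.now();
  let sawFirstChunk = false;
  let outputTokens = 0;

  for await (const chunk of stream) {
    if (!sawFirstChunk) {
      sawFirstChunk = true;
      // TTFT: seconds from request start until the first streamed chunk arrives.
      reportTimeToFirstToken(provider, agent, model, (performance.now() - startedAt) / 1000);
    }
    if (chunk.usage?.completion_tokens) {
      outputTokens = chunk.usage.completion_tokens;
    }
  }

  // Tokens/sec: output tokens over total elapsed stream time; the reporter
  // itself skips non-positive values, matching the guards shown above.
  reportTokensPerSecond(provider, agent, model, outputTokens, (performance.now() - startedAt) / 1000);
}
```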
