[ENHANCEMENT] - Add support for bedrock service tiers #9874

@Smartsheet-JB-Brown

Description

### Problem (one or two sentences)

## 🚀 AWS Bedrock Inference Service Tiering (Flex, Standard, Priority)

AWS Bedrock introduced new inference service tiers (Flex and Priority), alongside the existing Standard tier, allowing users to optimize generative AI workloads based on the desired balance of cost and latency.

## ⚖️ Service Tier Characteristics

| Tier Name | Performance Characteristics | Price Characteristics | Ideal Use Cases |
| :--- | :--- | :--- | :--- |
| **Priority** | Premium performance and preferential processing. Can realize up to 25% better Output Tokens Per Second (OTPS) latency compared to Standard. Requests get processing priority during high demand. | Premium price (highest cost). | Mission-critical applications, real-time user interactions, high-traffic chat assistants. |
| **Standard** | Consistent, reliable performance at regular rates. Requests are processed with reliable service but no special priority. | Standard rate (regular cost per token). | Everyday AI tasks, content generation, text analysis, routine document processing. |
| **Flex** | Designed for workloads that can tolerate longer latency. Requests receive lower priority during high demand. | Discounted price (lowest cost). | Non-time-critical applications like model evaluations, large-scale summarization, labeling/annotation, and asynchronous agentic workflows. |

## 💻 Invoking a Specific Tier via the Bedrock API

Engineers specify the desired service tier by including the `service_tier` parameter in the JSON body of model invocation API calls (e.g., `InvokeModel`, `Converse`).

Example (using the `InvokeModel` operation for the PRIORITY tier):

```json
{
    "modelId": "anthropic.claude-3-sonnet-20240229-v1:0",
    "contentType": "application/json",
    "accept": "application/json",
    "body": "{\"prompt\": \"<prompt text>\", \"max_tokens_to_sample\": 200, \"service_tier\": \"PRIORITY\"}"
}
```

* Use "service_tier": "PRIORITY" for the highest latency performance.
* Use "service_tier": "FLEX" for the lowest cost/most delayed performance.
* The Standard tier is used by default if the parameter is omitted.
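The same request can be issued programmatically. Below is a minimal sketch using the AWS SDK for JavaScript v3 (`@aws-sdk/client-bedrock-runtime`). The `service_tier` field inside the request body follows the example above as described in this issue; the exact field name and placement in the released Bedrock API may differ, so treat it as an assumption.

```typescript
import {
  BedrockRuntimeClient,
  InvokeModelCommand,
} from "@aws-sdk/client-bedrock-runtime";

async function main(): Promise<void> {
  const client = new BedrockRuntimeClient({ region: "us-east-1" });

  // InvokeModel call requesting the Priority tier. The "service_tier"
  // key in the body mirrors the issue's example and is an assumption,
  // not confirmed API shape.
  const response = await client.send(
    new InvokeModelCommand({
      modelId: "anthropic.claude-3-sonnet-20240229-v1:0",
      contentType: "application/json",
      accept: "application/json",
      body: JSON.stringify({
        prompt: "<prompt text>",
        max_tokens_to_sample: 200,
        service_tier: "PRIORITY",
      }),
    })
  );

  // The response body is returned as bytes; decode and parse it.
  const payload = JSON.parse(new TextDecoder().decode(response.body));
  console.log(payload);
}

main().catch(console.error);
```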

## 💰 Sample Model Tiering Costs (per 1 Million Tokens)

Costs vary by model, region, and input/output type. The table below provides a **directional example** of how costs compare across tiers for illustrative purposes (always check the official AWS Bedrock pricing page for current rates).

| Model (Example) | Token Type | Standard Tier (Cost per 1M Tokens) | Priority Tier (Cost per 1M Tokens) | Flex Tier (Cost per 1M Tokens) |
| :--- | :--- | :--- | :--- | :--- |
| **Anthropic Claude 3 Haiku** | Input | $0.25 | ~$0.30 - $0.35 | ~$0.20 - $0.22 |
| | Output | $1.25 | ~$1.50 - $1.75 | ~$1.00 - $1.15 |
| **Amazon Titan Text Express** | Input | $0.80 | ~$0.96 - $1.12 | ~$0.64 - $0.72 |
| | Output | $1.60 | ~$1.92 - $2.24 | ~$1.28 - $1.44 |
| **Meta Llama 3.2 Instruct (11B)** | Input | $0.35 | ~$0.42 - $0.49 | ~$0.28 - $0.32 |
| | Output | $0.35 | ~$0.42 - $0.49 | ~$0.28 - $0.32 |
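For the pricing display described below, a tier-aware cost calculation is essentially a lookup into a per-tier price table. Here is a minimal illustrative sketch in TypeScript; the prices are the directional Claude 3 Haiku figures from the table above (not authoritative AWS rates), and the helper names are hypothetical.

```typescript
type ServiceTier = "STANDARD" | "PRIORITY" | "FLEX";

// Illustrative per-1M-token prices (USD) keyed by tier -- directional
// figures from the table above, not real AWS pricing.
const haikuPricing: Record<ServiceTier, { input: number; output: number }> = {
  STANDARD: { input: 0.25, output: 1.25 },
  PRIORITY: { input: 0.33, output: 1.6 },
  FLEX: { input: 0.21, output: 1.1 },
};

// Hypothetical helper: cost of a single request at a given tier.
function requestCost(
  tier: ServiceTier,
  inputTokens: number,
  outputTokens: number
): number {
  const p = haikuPricing[tier];
  return (
    (inputTokens / 1_000_000) * p.input +
    (outputTokens / 1_000_000) * p.output
  );
}

// e.g. 10k input + 2k output tokens on the Flex tier
console.log(requestCost("FLEX", 10_000, 2_000).toFixed(6));
```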

### Context (who is affected and when)

Roo Code users of the Amazon Bedrock provider currently have no way to choose a service tier. As a result, they may be paying more than necessary, or receiving slower responses than they need with no option to select a more performant tier.

### Desired behavior (conceptual, not technical)

Roo Code's Bedrock provider for chat should support selecting a service tier on the chat settings page when a Bedrock model supports service tiers, apply that tier in calls to Bedrock, and use that tier's specific pricing in the chat's token usage and cost display.
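Conceptually the provider change is small: read the selected tier from settings, attach it to the request only when the model supports tiers, and use the matching price entry for the usage display. A hypothetical sketch follows; none of these types or fields exist in Roo Code today, and the `service_tier` request field is the issue's assumed API shape.

```typescript
// Hypothetical shapes -- not actual Roo Code types.
interface BedrockProviderSettings {
  modelId: string;
  serviceTier?: "STANDARD" | "PRIORITY" | "FLEX";
}

interface BedrockModelInfo {
  supportsServiceTiers: boolean;
  // Per-tier prices (USD per 1M tokens) used for the cost display.
  pricingByTier: Record<string, { input: number; output: number }>;
}

// Build the InvokeModel body, adding the tier only when supported;
// otherwise Bedrock's default (Standard) behavior applies.
function buildRequestBody(
  settings: BedrockProviderSettings,
  model: BedrockModelInfo,
  prompt: string
): string {
  const body: Record<string, unknown> = {
    prompt,
    max_tokens_to_sample: 200,
  };
  if (model.supportsServiceTiers && settings.serviceTier) {
    body.service_tier = settings.serviceTier; // field name per the example above (assumed)
  }
  return JSON.stringify(body);
}
```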


### Constraints / preferences (optional)

_No response_

### Request checklist

- [x] I've searched existing Issues and Discussions for duplicates
- [x] This describes a specific problem with clear context and impact

### Roo Code Task Links (optional)

_No response_

### Acceptance criteria (optional)

_No response_

### Proposed approach (optional)

_No response_

### Trade-offs / risks (optional)

_No response_
