Commit 380f0ea

SEZ9dybyte
authored and committed
[Feature][Transform] Support single/batch mode vectorization using Amazon Titan & cohere embedding model (apache#9120)
1 parent 39ed108 commit 380f0ea

13 files changed: +628 -40 lines changed

docs/en/transform-v2/embedding.md

Lines changed: 17 additions & 16 deletions
@@ -10,25 +10,26 @@ different API endpoints.
 
 ## Options
 
-| Name | Type | Required | Default Value | Description |
-|----------------------------------|--------|----------|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| model_provider | enum | yes | - | The model provider for embedding. Options may include `QIANFAN`, `OPENAI`, etc. |
-| api_key | string | yes | - | The API key required to authenticate with the embedding service. |
-| secret_key | string | yes | - | The secret key required for additional authentication with the embedding service. |
-| single_vectorized_input_number | int | no | 1 | The number of inputs vectorized in one request. Default is 1. |
-| vectorization_fields | map | yes | - | A mapping between input fields and their corresponding output vector fields. |
-| model | string | yes | - | The specific model to use for embedding (e.g: `text-embedding-3-small` for OPENAI). |
-| api_path | string | no | - | The API endpoint for the embedding service. Typically provided by the model provider. |
-| dimension | int | no | - | TThe vector dimension defaults to 2048. The Embedding-3 model supports custom vector dimensions, and it is recommended to choose dimensions of 256, 512, 1024, or 2048. |
-| oauth_path | string | no | - | The API endpoint for the oauth service. |
-| custom_config | map | no | | Custom configurations for the model. |
-| custom_response_parse | string | no | | Specifies how to parse the response from the model using JsonPath. Example: `$.choices[*].message.content`. |
-| custom_request_headers | map | no | | Custom headers for the request to the model. |
-| custom_request_body | map | no | | Custom body for the request. Supports placeholders like `${model}`, `${input}`. |
+| Name | Type | Required | Default Value | Description |
+|--------------------------------|--------|----------|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| model_provider | enum | yes | - | The model provider for embedding. Options may include `AMAZON`, `QIANFAN`, `OPENAI`, etc. |
+| api_key | string | yes | - | The API key required to authenticate with the embedding service. |
+| secret_key | string | yes | - | The secret key required for additional authentication with the embedding service. |
+| aws_region | string | no | | AWS Region. Required for use Amazon Bedrock model. |
+| single_vectorized_input_number | int | no | 1 | The number of inputs vectorized in one request. Default is 1. |
+| vectorization_fields | map | yes | - | A mapping between input fields and their corresponding output vector fields. |
+| model | string | yes | - | The specific model to use for embedding (e.g: `text-embedding-3-small` for OPENAI). |
+| api_path | string | no | - | The API endpoint for the embedding service. Typically provided by the model provider. |
+| dimension | int | no | - | TThe vector dimension defaults to 2048. The Embedding-3 model supports custom vector dimensions, and it is recommended to choose dimensions of 256, 512, 1024, or 2048. |
+| oauth_path | string | no | - | The API endpoint for the oauth service. |
+| custom_config | map | no | | Custom configurations for the model. |
+| custom_response_parse | string | no | | Specifies how to parse the response from the model using JsonPath. Example: `$.choices[*].message.content`. |
+| custom_request_headers | map | no | | Custom headers for the request to the model. |
+| custom_request_body | map | no | | Custom body for the request. Supports placeholders like `${model}`, `${input}`. |
 
 ### model_provider
 
-The providers for generating embeddings include common options such as `DOUBAO`, `QIANFAN`, and `OPENAI`. Additionally,
+The providers for generating embeddings include common options such as `AMAZON`, `DOUBAO`, `QIANFAN`, and `OPENAI`. Additionally,
 you can choose `CUSTOM` to implement requests and retrievals for custom embedding models.
 
 ### api_key
docs/zh/connector-v2/sink/Elasticsearch.md

Lines changed: 1 addition & 1 deletion
@@ -156,7 +156,7 @@ sink {
 }
 }
 ```
-Vector conversion
+Vector conversion (vector data)
 
 ```conf
 sink {

docs/zh/transform-v2/embedding.md

Lines changed: 18 additions & 17 deletions
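@@ -8,30 +8,31 @@
 
 ## Configuration Options
 
-| Name | Type | Required | Default Value | Description |
-|----------------------------------|--------|------|--------|--------------------------------------------------------------------|
-| model_provider | enum | | - | The provider of the embedding model. Options include `QIANFAN`, `OPENAI`, etc. |
-| api_key | string | | - | The API key used to authenticate with the embedding service. |
-| secret_key | string | | - | The secret key used for additional authentication. Some providers may require this key for secure API requests. |
-| single_vectorized_input_number | int | | 1 | The number of inputs vectorized in a single request. The default is 1. |
-| vectorization_fields | map | | - | A mapping between input fields and the corresponding output vector fields. |
-| model | string | | - | The specific embedding model to use. For example, if the provider is OPENAI, you can specify `text-embedding-3-small`. |
-| api_path | string | | - | The API of the embedding service. Usually provided by the model provider. |
-| dimension | int | | 2048 | The vector dimension defaults to 2048. The Embedding-3 model supports custom vector dimensions; 256, 512, 1024, or 2048 is recommended. |
-| oauth_path | string | | - | The API of the oauth service. |
-| custom_config | map | | | Custom configuration for the model. |
-| custom_response_parse | string | | | How to parse the model response using JsonPath. Example: `$.choices[*].message.content`. |
-| custom_request_headers | map | | | Custom headers for requests sent to the model. |
-| custom_request_body | map | | | Custom configuration for the request body. Supports placeholders such as `${model}`, `${input}`. |
+| Name | Type | Required | Default Value | Description |
+|--------------------------------|--------|------|--------|------------------------------------------------------------------|
+| model_provider | enum | | - | The provider of the embedding model. Options include `AMAZON`, `QIANFAN`, `OPENAI`, etc. |
+| api_key | string | | - | The API key used to authenticate with the embedding service. |
+| secret_key | string | | - | The secret key used for additional authentication. Some providers may require this key for secure API requests. |
+| aws_region | string | | | Used with Amazon Bedrock models; the region for model requests must be specified. |
+| single_vectorized_input_number | int | | 1 | The number of inputs vectorized in a single request. The default is 1. |
+| vectorization_fields | map | | - | A mapping between input fields and the corresponding output vector fields. |
+| model | string | | - | The specific embedding model to use. For example, if the provider is OPENAI, you can specify `text-embedding-3-small`. |
+| api_path | string | | - | The API of the embedding service. Usually provided by the model provider. |
+| dimension | int | | 2048 | The vector dimension defaults to 2048. The Embedding-3 model supports custom vector dimensions; 256, 512, 1024, or 2048 is recommended. |
+| oauth_path | string | | - | The API of the oauth service. |
+| custom_config | map | | | Custom configuration for the model. |
+| custom_response_parse | string | | | How to parse the model response using JsonPath. Example: `$.choices[*].message.content`. |
+| custom_request_headers | map | | | Custom headers for requests sent to the model. |
+| custom_request_body | map | | | Custom configuration for the request body. Supports placeholders such as `${model}`, `${input}`. |
 
 ### embedding_model_provider
 
-The model provider used to generate embeddings. Common options include `DOUBAO`, `QIANFAN`, `OPENAI`, etc.; you can also choose `CUSTOM` to implement requests and retrievals for custom embedding
+The model provider used to generate embeddings. Common options include `AMAZON`, `DOUBAO`, `QIANFAN`, `OPENAI`, etc.; you can also choose `CUSTOM` to implement requests and retrievals for custom embedding
 models.
 
 ### api_key
 
-The API key used to authenticate requests to the embedding service. Usually provided by the model provider when you register for their service.
+The API key used to authenticate requests to the embedding service. Usually provided by the model provider when you register for their service; for `AMAZON` models it corresponds to the IAM access key
 
 ### secret_key
 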