[Feature][Transform] Support single/batch mode vectorization using Amazon Titan & cohere embedding model #9120
init bedrock model files
init parameters and configuration
test complete
Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.
Files not reviewed (1)
- seatunnel-transforms-v2/pom.xml: Language not supported
hailin0
left a comment
Updated both docs.
Are the Amazon e2e tests missing?
Please update
Updated EmbeddingTransformFactory, added Amazon model config.
updated Amazon e2e tests in
.conditional(
        EmbeddingTransformConfig.MODEL_PROVIDER,
        ModelProvider.AMAZON,
        EmbeddingTransformConfig.API_KEY,
        EmbeddingTransformConfig.SECRET_KEY,
        EmbeddingTransformConfig.AWS_REGION,
        EmbeddingTransformConfig.MODEL,
        EmbeddingTransformConfig.DIMENSION)
Is region not here?
AWS region is a required parameter when calling the Amazon model.
Waiting for CI to pass.
hailin0
left a comment
LGTM
…azon Titan & cohere embedding model (apache#9120)

Purpose of this pull request
Does this PR introduce any user-facing change?
Description
Add support for the Amazon Titan model in the embedding model_provider configuration;
Implement batch inference in the embedding process, so that multiple rows are sent to the model API in a single batched request;
Detect whether each batch was sent successfully and apply fault tolerance when it was not.
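The batching and fault-tolerance items above can be illustrated with a minimal Java sketch. It is not the code from this PR: the class name `BatchEmbeddingSketch`, the `Function`-based model callback, and the fallback policy (retry a failed batch row by row) are all assumptions made for illustration, and the "model" here is a stand-in rather than a real Amazon Bedrock client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of batch embedding with a single-row fallback. */
final class BatchEmbeddingSketch {

    /** Splits rows into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            batches.add(new ArrayList<>(rows.subList(start, end)));
        }
        return batches;
    }

    /**
     * Embeds all texts batch by batch, preserving input order.
     * Assumed policy: if a batch call fails, each row of that batch is retried
     * on its own; a row that still fails aborts the run.
     */
    static List<float[]> embedAll(
            List<String> texts,
            int batchSize,
            Function<List<String>, List<float[]>> model) {
        List<float[]> vectors = new ArrayList<>(texts.size());
        for (List<String> batch : partition(texts, batchSize)) {
            List<float[]> result = tryEmbed(batch, model);
            if (result != null) {
                vectors.addAll(result);
                continue;
            }
            // Batch failed: fall back to single mode for the rows in this batch.
            for (String text : batch) {
                List<float[]> single = tryEmbed(Collections.singletonList(text), model);
                if (single == null) {
                    throw new IllegalStateException("Embedding failed for row: " + text);
                }
                vectors.add(single.get(0));
            }
        }
        return vectors;
    }

    /** Returns the vectors, or null if the call threw or returned the wrong count. */
    private static List<float[]> tryEmbed(
            List<String> batch, Function<List<String>, List<float[]>> model) {
        try {
            List<float[]> result = model.apply(batch);
            return (result != null && result.size() == batch.size()) ? result : null;
        } catch (RuntimeException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Stand-in model: rejects batches larger than 2 rows, otherwise returns
        // a one-dimensional "vector" holding the text length.
        Function<List<String>, List<float[]>> fakeModel = batch -> {
            if (batch.size() > 2) {
                throw new IllegalStateException("batch too large");
            }
            List<float[]> out = new ArrayList<>();
            for (String s : batch) {
                out.add(new float[] {s.length()});
            }
            return out;
        };
        List<float[]> vectors =
                embedAll(Arrays.asList("a", "bb", "ccc", "dddd"), 3, fakeModel);
        System.out.println("embedded rows: " + vectors.size());
    }
}
```

Checking that the number of returned vectors matches the batch size is one simple way to detect a partially failed batch; a production transform would likely also need retries with backoff and provider-specific error handling.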
Usage Scenario
When vectorizing text at large scale for storage in a vector database, users need the embedding step to be efficient and low-cost. For example:
User review analysis: millions or tens of millions of rows of data must be sent for vectorization in one run.
Image search: users often have hundreds of thousands or millions of images to vectorize and store in the database for subsequent approximate vector retrieval.
How was this patch tested?
Check list
New License Guide
release-note.