@SEZ9
Contributor

@SEZ9 SEZ9 commented Apr 7, 2025

Purpose of this pull request

Does this PR introduce any user-facing change?

Description
Add support for Amazon Titan models in the embedding model_provider configuration.
Implement batch inference in the embedding process, so that data is sent to the model API in batches rather than one row at a time.
Detect whether each batch request succeeded and handle failed batches with fault tolerance.
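
The batching and fault-tolerance behavior described above could look roughly like the following. This is a minimal sketch and not the code added by this PR: BatchEmbeddingSketch, embedAll, and batchCall are hypothetical names, and the fallback strategy (retrying a failed batch one text at a time) is an assumption about what "fault tolerance" might mean here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper for illustration; not a class from this PR. */
public class BatchEmbeddingSketch {

    /**
     * Embeds texts in batches of at most batchSize.
     * batchCall stands in for the model API call (assumed to return one vector per input).
     * If a batch call throws or returns the wrong number of vectors,
     * that batch is retried one text at a time.
     */
    public static List<float[]> embedAll(
            List<String> texts,
            int batchSize,
            Function<List<String>, List<float[]>> batchCall) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<float[]> result = new ArrayList<>(texts.size());
        for (int start = 0; start < texts.size(); start += batchSize) {
            int end = Math.min(start + batchSize, texts.size());
            List<String> batch = texts.subList(start, end);

            List<float[]> vectors;
            try {
                vectors = batchCall.apply(batch);
            } catch (RuntimeException e) {
                vectors = null; // treat as a failed batch and fall back below
            }

            if (vectors != null && vectors.size() == batch.size()) {
                // Batch succeeded: one vector per input text.
                result.addAll(vectors);
            } else {
                // Fallback: send each text of the failed batch individually.
                for (String text : batch) {
                    List<float[]> single = batchCall.apply(Collections.singletonList(text));
                    if (single == null || single.size() != 1) {
                        throw new IllegalStateException("Embedding failed for a single text");
                    }
                    result.add(single.get(0));
                }
            }
        }
        return result;
    }
}
```

The output order matches the input order, so each vector can be written back to the row it came from. A failure in the single-text fallback is surfaced as an exception rather than silently dropped.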
Usage Scenario
When vectorizing text at scale for storage in a vector database, users need to vectorize the data efficiently and at low cost. For example:

User review analysis: millions or tens of millions of rows need to be submitted for vectorization at one time.
Image search: users often vectorize hundreds of thousands or millions of images into the database for subsequent approximate vector retrieval.

How was this patch tested?

Check list

@hailin0 hailin0 requested a review from Copilot April 7, 2025 14:29
Contributor

Copilot AI left a comment

Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.

Files not reviewed (1)
  • seatunnel-transforms-v2/pom.xml: Language not supported

@hailin0
Member

hailin0 commented Apr 7, 2025

Member

@hailin0 hailin0 left a comment

@SEZ9 SEZ9 changed the title from "Feature][Transform] Support batch mode vectorization using Amazon Titan & cohere embedding mode" to "[Feature][Transform] Support single/batch mode vectorization using Amazon Titan & cohere embedding model" Apr 7, 2025
@SEZ9
Contributor Author

SEZ9 commented Apr 7, 2025

Updated the docs, both English and Chinese.

@corgy-w
Contributor

corgy-w commented Apr 8, 2025

Are the Amazon e2e tests missing?

@corgy-w
Contributor

corgy-w commented Apr 8, 2025

Please update EmbeddingTransformFactory config

@SEZ9
Contributor Author

SEZ9 commented Apr 8, 2025

Updated EmbeddingTransformFactory and added the Amazon model config.

@github-actions github-actions bot added the e2e label Apr 8, 2025
@SEZ9
Contributor Author

SEZ9 commented Apr 8, 2025

updated Amazon e2e tests in embedding_transform.conf

Comment on lines +51 to +58
.conditional(
EmbeddingTransformConfig.MODEL_PROVIDER,
ModelProvider.AMAZON,
EmbeddingTransformConfig.API_KEY,
EmbeddingTransformConfig.SECRET_KEY,
EmbeddingTransformConfig.AWS_REGION,
EmbeddingTransformConfig.MODEL,
EmbeddingTransformConfig.DIMENSION)
Contributor

Is the region not included here?

Contributor Author

AWS region is a required parameter when calling the Amazon model.

@SEZ9
Contributor Author

SEZ9 commented Apr 28, 2025

Hi @hailin0 @corgy-w @Hisoka-X. The Transform e2e test now passes. The earlier failure happened because the aws-sdk client in the e2e test was not shut down properly, which resulted in a timeout.
Please take a look and see whether this PR can be merged. Thanks!

@hailin0
Member

hailin0 commented Apr 28, 2025

Waiting for CI to pass.

@github-actions github-actions bot added CI&CD and removed CI&CD labels May 30, 2025
@SEZ9
Contributor Author

SEZ9 commented Jun 2, 2025

Hi @hailin0 @corgy-w @Hisoka-X. All checks have passed now. Please take a look and see whether this PR can be merged. Thanks!

Member

@hailin0 hailin0 left a comment

LGTM

@Hisoka-X Hisoka-X merged commit 37d410c into apache:dev Jun 3, 2025
7 checks passed
dybyte pushed a commit to dybyte/seatunnel that referenced this pull request Jul 23, 2025