-
Notifications
You must be signed in to change notification settings - Fork 2.1k
[Improve][Connector-file] Add configurable binary chunk size support to BinaryReadStrategy #9391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…to BinaryReadStrategy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for configuring how binary files are chunked or read in complete-file mode by introducing two new options and updating the read strategy and tests accordingly.
- Introduce
binary_chunk_sizeandbinary_complete_file_modeoptions inFileBaseSourceOptions - Update
BinaryReadStrategyto load, validate, and apply these new options - Add
BinaryReadStrategyTestto cover default chunking, custom chunk sizes, and complete-file mode
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| seatunnel-connectors-v2/connector-file/.../FileBaseSourceOptions.java | Added BINARY_CHUNK_SIZE and BINARY_COMPLETE_FILE_MODE options with defaults |
| seatunnel-connectors-v2/connector-file/.../BinaryReadStrategy.java | Loaded and validated new options; refactored read() into readCompleteFile and readFileInChunks |
| seatunnel-connectors-v2/connector-file/.../BinaryReadStrategyTest.java | Added unit tests for default chunking, custom chunk size, and complete-file mode |
Comments suppressed due to low confidence (1)
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/BinaryReadStrategy.java:61
- Add unit tests to cover invalid chunk size configurations (e.g., zero or negative values, values above the 100MB limit) to verify that
IllegalArgumentExceptionis thrown as expected.
if (binaryChunkSize <= 0) {
| throw new IllegalArgumentException( | ||
| "Binary chunk size must be positive, got: " + binaryChunkSize); | ||
| } | ||
| if (binaryChunkSize > 100 * 1024 * 1024) { // 100MB limit |
Copilot
AI
Jun 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Extract the 100 * 1024 * 1024 magic number into a descriptive static final constant (e.g., MAX_BINARY_CHUNK_SIZE) to improve readability and maintainability.
| if (binaryChunkSize > 100 * 1024 * 1024) { // 100MB limit | |
| if (binaryChunkSize > MAX_BINARY_CHUNK_SIZE) { // 100MB limit |
|
|
||
| private Config createConfig(String filePath, Integer chunkSize, Boolean completeFileMode) { | ||
| Map<String, Object> configMap = new HashMap<>(); | ||
| configMap.put("path", filePath); // Fixed: use "path" instead of "file_path" |
Copilot
AI
Jun 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Avoid hardcoding the config key string "path"; use FileBaseSourceOptions.FILE_PATH.key() to populate the configuration map so it stays in sync with the option definition.
| configMap.put("path", filePath); // Fixed: use "path" instead of "file_path" | |
| configMap.put(FileBaseSourceOptions.FILE_PATH.key(), filePath); |
| throw new IllegalArgumentException( | ||
| "Binary chunk size must be positive, got: " + binaryChunkSize); | ||
| } | ||
| if (binaryChunkSize > 100 * 1024 * 1024) { // 100MB limit | ||
| throw new IllegalArgumentException( |
Copilot
AI
Jun 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Consider using a connector-specific exception (e.g., FileConnectorException) instead of IllegalArgumentException to provide consistent error reporting within the framework.
| throw new IllegalArgumentException( | |
| "Binary chunk size must be positive, got: " + binaryChunkSize); | |
| } | |
| if (binaryChunkSize > 100 * 1024 * 1024) { // 100MB limit | |
| throw new IllegalArgumentException( | |
| throw new FileConnectorException( | |
| "FILE_CONNECTOR_ERROR", | |
| "Binary chunk size must be positive, got: " + binaryChunkSize); | |
| } | |
| if (binaryChunkSize > 100 * 1024 * 1024) { // 100MB limit | |
| throw new FileConnectorException( | |
| "FILE_CONNECTOR_ERROR", |
| byte[] fileContent = IOUtils.toByteArray(inputStream); | ||
| SeaTunnelRow row = new SeaTunnelRow(new Object[] {fileContent, relativePath, 0L}); | ||
| row.setTableId(tableId); | ||
| output.collect(row); |
Copilot
AI
Jun 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading the entire file into memory via IOUtils.toByteArray can lead to high memory usage for large files; consider imposing a size check or using a streaming approach to avoid potential OOM issues.
| byte[] fileContent = IOUtils.toByteArray(inputStream); | |
| SeaTunnelRow row = new SeaTunnelRow(new Object[] {fileContent, relativePath, 0L}); | |
| row.setTableId(tableId); | |
| output.collect(row); | |
| try (ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) { | |
| byte[] buffer = new byte[binaryChunkSize]; | |
| int bytesRead; | |
| while ((bytesRead = inputStream.read(buffer)) != -1) { | |
| outputStream.write(buffer, 0, bytesRead); | |
| } | |
| byte[] fileContent = outputStream.toByteArray(); | |
| SeaTunnelRow row = new SeaTunnelRow(new Object[] {fileContent, relativePath, 0L}); | |
| row.setTableId(tableId); | |
| output.collect(row); | |
| } |
Hisoka-X
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @CosmosNi ! Please update the docs.
| .defaultValue(1024) | ||
| .withDescription( | ||
| "The chunk size (in bytes) for reading binary files. Default is 1024 bytes. " | ||
| + "Larger values may improve performance for large files but use more memory."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| + "Larger values may improve performance for large files but use more memory."); | |
| + "Larger values may improve performance for large files but use more memory. Only valid when file_format_type is binary"); |
| .defaultValue(false) | ||
| .withDescription( | ||
| "Whether to read the complete file as a single chunk instead of splitting into chunks. " | ||
| + "When enabled, the entire file content will be read into memory at once."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| + "When enabled, the entire file content will be read into memory at once."); | |
| + "When enabled, the entire file content will be read into memory at once. Only valid when file_format_type is binary"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GET
…to BinaryReadStrategy
| | binary_complete_file_mode | boolean | no | false | | ||
| | common-options | | no | - | | ||
|
|
||
| ### path [string] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add description too?
…to BinaryReadStrategy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why delete this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I accidentally deleted it.
…to BinaryReadStrategy
Hisoka-X
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM if ci passes.
…to BinaryReadStrategy (apache#9391)
Add Configurable Binary Chunk Size Support to BinaryReadStrategy
Purpose of this pull request
Does this PR introduce any user-facing change?
How was this patch tested?
Check list
New License Guide