Skip to content

Conversation

@CosmosNi
Copy link
Contributor

@CosmosNi CosmosNi commented Jun 3, 2025

Add Configurable Binary Chunk Size Support to BinaryReadStrategy

Purpose of this pull request

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for configuring how binary files are chunked or read in complete-file mode by introducing two new options and updating the read strategy and tests accordingly.

  • Introduce binary_chunk_size and binary_complete_file_mode options in FileBaseSourceOptions
  • Update BinaryReadStrategy to load, validate, and apply these new options
  • Add BinaryReadStrategyTest to cover default chunking, custom chunk sizes, and complete-file mode

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
seatunnel-connectors-v2/connector-file/.../FileBaseSourceOptions.java Added BINARY_CHUNK_SIZE and BINARY_COMPLETE_FILE_MODE options with defaults
seatunnel-connectors-v2/connector-file/.../BinaryReadStrategy.java Loaded and validated new options; refactored read() into readCompleteFile and readFileInChunks
seatunnel-connectors-v2/connector-file/.../BinaryReadStrategyTest.java Added unit tests for default chunking, custom chunk size, and complete-file mode
Comments suppressed due to low confidence (1)

seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/BinaryReadStrategy.java:61

  • Add unit tests to cover invalid chunk size configurations (e.g., zero or negative values, values above the 100MB limit) to verify that IllegalArgumentException is thrown as expected.
if (binaryChunkSize <= 0) {

throw new IllegalArgumentException(
"Binary chunk size must be positive, got: " + binaryChunkSize);
}
if (binaryChunkSize > 100 * 1024 * 1024) { // 100MB limit
Copy link

Copilot AI Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Extract the 100 * 1024 * 1024 magic number into a descriptive static final constant (e.g., MAX_BINARY_CHUNK_SIZE) to improve readability and maintainability.

Suggested change
if (binaryChunkSize > 100 * 1024 * 1024) { // 100MB limit
if (binaryChunkSize > MAX_BINARY_CHUNK_SIZE) { // 100MB limit

Copilot uses AI. Check for mistakes.

private Config createConfig(String filePath, Integer chunkSize, Boolean completeFileMode) {
Map<String, Object> configMap = new HashMap<>();
configMap.put("path", filePath); // Fixed: use "path" instead of "file_path"
Copy link

Copilot AI Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Avoid hardcoding the config key string "path"; use FileBaseSourceOptions.FILE_PATH.key() to populate the configuration map so it stays in sync with the option definition.

Suggested change
configMap.put("path", filePath); // Fixed: use "path" instead of "file_path"
configMap.put(FileBaseSourceOptions.FILE_PATH.key(), filePath);

Copilot uses AI. Check for mistakes.
Comment on lines +62 to +66
throw new IllegalArgumentException(
"Binary chunk size must be positive, got: " + binaryChunkSize);
}
if (binaryChunkSize > 100 * 1024 * 1024) { // 100MB limit
throw new IllegalArgumentException(
Copy link

Copilot AI Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider using a connector-specific exception (e.g., FileConnectorException) instead of IllegalArgumentException to provide consistent error reporting within the framework.

Suggested change
throw new IllegalArgumentException(
"Binary chunk size must be positive, got: " + binaryChunkSize);
}
if (binaryChunkSize > 100 * 1024 * 1024) { // 100MB limit
throw new IllegalArgumentException(
throw new FileConnectorException(
"FILE_CONNECTOR_ERROR",
"Binary chunk size must be positive, got: " + binaryChunkSize);
}
if (binaryChunkSize > 100 * 1024 * 1024) { // 100MB limit
throw new FileConnectorException(
"FILE_CONNECTOR_ERROR",

Copilot uses AI. Check for mistakes.
Comment on lines +112 to +115
byte[] fileContent = IOUtils.toByteArray(inputStream);
SeaTunnelRow row = new SeaTunnelRow(new Object[] {fileContent, relativePath, 0L});
row.setTableId(tableId);
output.collect(row);
Copy link

Copilot AI Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the entire file into memory via IOUtils.toByteArray can lead to high memory usage for large files; consider imposing a size check or using a streaming approach to avoid potential OOM issues.

Suggested change
byte[] fileContent = IOUtils.toByteArray(inputStream);
SeaTunnelRow row = new SeaTunnelRow(new Object[] {fileContent, relativePath, 0L});
row.setTableId(tableId);
output.collect(row);
try (ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
byte[] buffer = new byte[binaryChunkSize];
int bytesRead;
while ((bytesRead = inputStream.read(buffer)) != -1) {
outputStream.write(buffer, 0, bytesRead);
}
byte[] fileContent = outputStream.toByteArray();
SeaTunnelRow row = new SeaTunnelRow(new Object[] {fileContent, relativePath, 0L});
row.setTableId(tableId);
output.collect(row);
}

Copilot uses AI. Check for mistakes.
Copy link
Member

@Hisoka-X Hisoka-X left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @CosmosNi ! Please update the docs.

.defaultValue(1024)
.withDescription(
"The chunk size (in bytes) for reading binary files. Default is 1024 bytes. "
+ "Larger values may improve performance for large files but use more memory.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
+ "Larger values may improve performance for large files but use more memory.");
+ "Larger values may improve performance for large files but use more memory. Only valid when file_format_type is binary");

.defaultValue(false)
.withDescription(
"Whether to read the complete file as a single chunk instead of splitting into chunks. "
+ "When enabled, the entire file content will be read into memory at once.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
+ "When enabled, the entire file content will be read into memory at once.");
+ "When enabled, the entire file content will be read into memory at once. Only valid when file_format_type is binary");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GET

| binary_complete_file_mode | boolean | no | false |
| common-options | | no | - |

### path [string]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add description too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why delete this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I accidentally deleted it.

@Hisoka-X Hisoka-X changed the title [Improve][Connector-file] Add Configurable Binary Chunk Size Support to BinaryReadStrategy [Improve][Connector-file] Add configurable binary chunk size support to BinaryReadStrategy Jun 5, 2025
Copy link
Member

@Hisoka-X Hisoka-X left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if ci passes.

@corgy-w corgy-w merged commit 38e87e7 into apache:dev Jun 5, 2025
5 checks passed
dybyte pushed a commit to dybyte/seatunnel that referenced this pull request Jul 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants