Skip to content

Conversation

@anhvdq
Copy link
Contributor

@anhvdq anhvdq commented Sep 4, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Implement Spark functions url_encode, url_decode and try_url_decode

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Sep 4, 2025
}

/// Replace b'+' with b' '
fn replace_plus(input: &[u8]) -> Cow<'_, [u8]> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a test case that covers this replacement?

/// * `Ok(String)` - The encoded string
///
fn encode(value: &str) -> Result<String> {
Ok(byte_serialize(value.as_bytes()).collect::<String>())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for encoding you don't need to do the inverse operations you do in decoding? Like the + -> ' ' manipulation? I am unfamiliar with the specific needs here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timsaucer Let me explain a bit about the specific needs here. What we are trying to archieve in encode/decode functions are to transform the string from/to application/x-www-form-urlencoded format. The url::form_urlencoded crate is already support this transformation.

  • So for the encode function, we can just simply use the byte_serialize function exported from the crates
  • For decode function, there are 2 drawbacks:
    • The crate doesn't expose decode function for public use
    • It doesn't return Error when there is invalid format, for example: %2s is an invalid percent encoded string

This is the reason why on the url_decode, I tried to do the transformation manually. It's actually the same code as the decode function in the crate except that there is a valiation step.

So to answer your question:

  • encode: The function already did the necessary manipulation
  • decode: The decode function in the crate is not exposed and there's no format validation, so we try to copy the logic of the decode function in the crate with a validation step

Let me know if you have any suggestion on this

# For more information, please see:
# https://github.com/apache/datafusion/issues/15914
query T
SELECT url_decode('https%3A%2F%2Fspark.apache.org'::string);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unfamiliar with the slt files, but if there was a way to have these do all 3 types of input (utf8, utf8view, largeutf8) that would be nice.

@anhvdq
Copy link
Contributor Author

anhvdq commented Sep 5, 2025

Thank @timsaucer for your feedback.
Let me check and update accordingly

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more test cases such as different types of inputs (as mentioned), nulls, empty strings, etc. would be beneficial 🙂

Edit: also perhaps a few roundtrip tests?

@anhvdq anhvdq force-pushed the feat_url_encode_decode branch 3 times, most recently from abfac53 to 741f9a8 Compare November 17, 2025 08:34
@Jefffrey
Copy link
Contributor

Still seeing a lack of test cases mentioned by previous review comments: nulls, empty strings, roundtrips, plus replacement

@anhvdq
Copy link
Contributor Author

anhvdq commented Nov 17, 2025

Thank @Jefffrey
For the plus replacement, the function is copied as it is from rust-url crate which is widely used. Do we still need to add test case for this case?

Let me know if you have any suggestions

@Jefffrey
Copy link
Contributor

Thank @Jefffrey For the plus replacement, the function is copied as it is from rust-url crate which is widely used. Do we still need to add test case for this case?

Let me know if you have any suggestions

Yes, we should still have a test case for this, preferably in an SLT case. Consider what happens if someone decides to refactor that code; there is currently no test case that would prevent them breaking it, I believe?

@anhvdq anhvdq force-pushed the feat_url_encode_decode branch from 741f9a8 to b60dd69 Compare December 5, 2025 15:27
@anhvdq anhvdq changed the title feat(spark): Implement Spark functions url_encode and url_decode feat(spark): Implement Spark functions url_encode, url_decode and Dec 5, 2025
@anhvdq anhvdq changed the title feat(spark): Implement Spark functions url_encode, url_decode and feat(spark): Implement Spark functions url_encode, url_decode and try_url_decode Dec 5, 2025
@anhvdq
Copy link
Contributor Author

anhvdq commented Dec 5, 2025

Hi @Jefffrey
I have pushed the new changes which include:

  • More test cases as mentioned to cover the plus replacement, roundtrips, etc
  • Add a new try_url_decode function which is basically identical to url_decode except that it returns None instead of error when the value is invalid.

Please help review and let me know your feedback
Thanks

@Jefffrey
Copy link
Contributor

Jefffrey commented Dec 5, 2025

CI failures seem unrelated 🤔

@anhvdq
Copy link
Contributor Author

anhvdq commented Dec 6, 2025

CI failures seem unrelated 🤔

Seem so, the error code is unfamiliar to me.

@Jefffrey Jefffrey added this pull request to the merge queue Dec 6, 2025
Merged via the queue into apache:main with commit 6746007 Dec 6, 2025
29 of 30 checks passed
@Jefffrey
Copy link
Contributor

Jefffrey commented Dec 6, 2025

Thanks @anhvdq & @timsaucer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

spark sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants