fix(marshal)!: compare strings by codepoint #2008
Excellent.
For this change, I do not think we can avoid the breaking change marker. That might render my argument for leaving it out of `pass-style` moot.
Let me be sure I understand: You're saying that this PR should keep the "!". Given that, we may as well keep the "!" on #2002 as well. Right?
Yes
(Email reply to the comment above, posted by Mark S. Miller on Thu, Jan 25, 2024 at 5:47 PM.)
Just noting here for curiosity. In the UTF16 portion of https://icu-project.org/docs/papers/utf16_code_point_order.html
OMG
# next release

- JavaScript's relational comparison operators like `<` compare strings by lexicographic UTF16 code unit order, which exposes an internal representational detail not relevant to the string's meaning as a Unicode string. Previously, `compareRank` and associated functions compared strings using this JavaScript-native comparison. Now `compareRank` and associated functions compare strings by lexicographic Unicode Code Point order. ***This change only affects strings containing so-called supplementary characters, i.e., those whose Unicode character code does not fit in 16 bits***.
- This release does not change the `encodePassable` encoding. But now, when we say it is order preserving, we need to be careful about which order we mean. `encodePassable` is rank-order preserving when the encoded strings are compared using `compareRank`.
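
To make the difference concrete, here is a minimal sketch. The `compareByCodePoints` function below is a hypothetical stand-in written for this illustration, not the implementation in this PR. It assumes only that iterating a JavaScript string yields whole code points rather than code units.

```javascript
// Hypothetical sketch of a code point comparison; not the PR's actual code.
const compareByCodePoints = (left, right) => {
  // String iterators step by code point, so a surrogate pair arrives as one item.
  const leftIter = left[Symbol.iterator]();
  const rightIter = right[Symbol.iterator]();
  for (;;) {
    const { value: leftChar } = leftIter.next();
    const { value: rightChar } = rightIter.next();
    if (leftChar === undefined && rightChar === undefined) return 0;
    if (leftChar === undefined) return -1; // left is a proper prefix of right
    if (rightChar === undefined) return 1; // right is a proper prefix of left
    const leftCp = leftChar.codePointAt(0);
    const rightCp = rightChar.codePointAt(0);
    if (leftCp < rightCp) return -1;
    if (leftCp > rightCp) return 1;
  }
};

const bmpChar = '\uFF5E'; // U+FF5E, one UTF-16 code unit
const supplementaryChar = '\u{1F600}'; // U+1F600, the surrogate pair D83D DE00

// Native `<` compares code units: 0xD83D < 0xFF5E.
const nativeSaysSupplementaryFirst = supplementaryChar < bmpChar;
// Code point comparison sees 0x1F600 > 0xFF5E.
const codePointResult = compareByCodePoints(supplementaryChar, bmpChar);

console.log(nativeSaysSupplementaryFirst, codePointResult); // true 1
```

The two orders disagree only when one of the strings contains a supplementary character at the first position where the strings differ, which matches the scope of the change described above.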
@gibson042 is this true? It was true for my small test case, which proves very little. Will the same property also be true for `compactOrdered`? For either, does restricting these strings to well-ordered have any effect on whether their encoding is order preserving?
It is true now, but I think that's a mistake... `recordNames` and any similar function that `.sort()`s an array of strings in marshal or a related package should probably be updated to `.sort(compareByCodePoints)` so the encoding of Copy{Record,Set,Bag,Map}s and their own comparison is consistent with that of their constituent strings.
Which unfortunately complicates adoption if we have existing use of any such strings.
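
As a rough illustration of why a bare `.sort()` matters here: with no comparator, `Array.prototype.sort` orders strings by UTF-16 code units. The sketch below contrasts that with an explicit comparator. The `compareByCodePoints` defined here is a hypothetical stand-in for the function named above, and the sample property names are made up.

```javascript
// Hypothetical comparator for illustration; the real one may differ.
const compareByCodePoints = (left, right) => {
  // Array.from on a string iterates by code point.
  const leftCps = Array.from(left, ch => ch.codePointAt(0));
  const rightCps = Array.from(right, ch => ch.codePointAt(0));
  const len = Math.min(leftCps.length, rightCps.length);
  for (let i = 0; i < len; i += 1) {
    if (leftCps[i] !== rightCps[i]) {
      return leftCps[i] < rightCps[i] ? -1 : 1;
    }
  }
  // All shared positions match, so the shorter string sorts first.
  return Math.sign(leftCps.length - rightCps.length);
};

// Made-up property names: ASCII, BMP non-ASCII, and supplementary.
const names = ['\uFF5E', '\u{1F600}', 'a'];

// Default sort: UTF-16 code unit order.
const nativeOrder = [...names].sort();
// Explicit comparator: Unicode code point order.
const codePointOrder = [...names].sort(compareByCodePoints);

console.log(nativeOrder); // 'a', then U+1F600, then U+FF5E
console.log(codePointOrder); // 'a', then U+FF5E, then U+1F600
```

Any sorted collection whose element order feeds into an encoding or a comparison would see the same divergence whenever its strings include supplementary characters.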
Good point, and bad news!
Grepping for `.sort()` specifically with nothing between the parens, I see 96 occurrences in agoric-sdk and 26 in endo. Some may not be or contain strings. But still, fixing all that do will be disruptive. And the longer we wait, the more disruptive it'll be.
I'm putting this back into Draft until we decide what our plan is. Attn @ivanlei
Is there any practical way to scan a recent snapshot of our chain and somehow see how many persistent strings
- are non-ASCII,
- are not well-formed, or
- have supplementary characters (those whose code does not fit in 16 bits)?
How hard would it be?
Attn @mhofman
NOT URGENT.
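
The per-string classification that such a scan would need can be sketched as follows. This covers only the test applied to each string. How to enumerate persistent strings from a chain snapshot is not addressed, and `classifyString` and `tallyStrings` are hypothetical names invented for this example.

```javascript
// Hypothetical helpers for illustration; not part of any existing package.
const classifyString = str => ({
  // Any code unit outside 0x00-0x7F.
  nonAscii: /[^\x00-\x7F]/.test(str),
  // With the `u` flag, a surrogate range matches only lone surrogates,
  // because a valid surrogate pair is read as a single code point.
  nonWellFormed: /[\uD800-\uDFFF]/u.test(str),
  // Code points above U+FFFF, i.e. those that do not fit in 16 bits.
  supplementary: /[\u{10000}-\u{10FFFF}]/u.test(str),
});

const tallyStrings = strings => {
  const counts = { total: 0, nonAscii: 0, nonWellFormed: 0, supplementary: 0 };
  for (const str of strings) {
    const kinds = classifyString(str);
    counts.total += 1;
    if (kinds.nonAscii) counts.nonAscii += 1;
    if (kinds.nonWellFormed) counts.nonWellFormed += 1;
    if (kinds.supplementary) counts.supplementary += 1;
  }
  return counts;
};

// Made-up sample: ASCII, Latin-1, supplementary, and a lone high surrogate.
const sample = ['hello', 'caf\u00E9', '\u{1F600}', 'bad\uD83D'];
const counts = tallyStrings(sample);
console.log(counts);
```

Only strings in the `supplementary` or `nonWellFormed` buckets can sort differently under the two orders, so those counts are the ones that bear on the migration question.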
Thanks for the detailed `NEWS.md`.
closes: #2113
refs: #2002
Description
JavaScript's relational comparison operators like `<` compare strings by lexicographic UTF16 code unit order, which exposes an internal representational detail not relevant to the string's meaning as a Unicode string. Previously, `compareRank` and associated functions compared strings using this JavaScript-native comparison. Now `compareRank` and associated functions compare strings by lexicographic Unicode Code Point order. This change only affects strings containing so-called supplementary characters, i.e., those whose Unicode character code does not fit in 16 bits.
This release does not change the `encodePassable` encoding. But now, when we say it is order preserving, we need to be careful about which order we mean. `encodePassable` is rank-order preserving when the encoded strings are compared using `compareRank`.
… `M.gte(string)`.
Security Considerations
The fact that the string ordering is closer to the Unicode semantics of the strings probably minimizes some surprises in ways that help security. OTOH, this difference from JS native string ordering probably causes other surprises that hurt security. Altogether, we do not expect much effect.
Scaling Considerations
Because the new comparison is written in JS, it will be slower than the JS native string comparison. On XS at least, we expect to have a native code point comparison function available eventually. Altogether, we do not expect much effect.
Documentation Considerations
Most developers will not care. But it needs to be explained somewhere carefully so that developers that do care can easily find out.
Testing Considerations
@gibson042 , in a later PR, could you expand the property-based-testing to generate test cases sensitive to this change?
Compatibility Considerations
… `compareRank` function. You may need to revisit any use of patterns like `M.gte(string)` expressing inequalities over strings.
Upgrade Considerations
If we currently have any persistent data, especially on chain, sorted according to JS native order (UTF16 code unit), then we cannot accept this PR until we have a plan to re-sort that data, or to somehow continue living with mis-sorted data. (Historical note: This is how Oracle came to permanently rely on UTF16 code unit order, because of the impracticality of re-sorting all that data.)
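
One way to gauge exposure is to check whether data already sorted in native order is still sorted under the new order. The sketch below assumes the keys are available as an in-memory array, which may not hold for real persistent stores. `isSortedBy`, `compareNative`, and this `compareByCodePoints` are hypothetical helpers written for the example.

```javascript
// Hypothetical helpers for illustration only.
const compareNative = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

const compareByCodePoints = (left, right) => {
  const leftCps = Array.from(left, ch => ch.codePointAt(0));
  const rightCps = Array.from(right, ch => ch.codePointAt(0));
  const len = Math.min(leftCps.length, rightCps.length);
  for (let i = 0; i < len; i += 1) {
    if (leftCps[i] !== rightCps[i]) {
      return leftCps[i] < rightCps[i] ? -1 : 1;
    }
  }
  return Math.sign(leftCps.length - rightCps.length);
};

// True when every adjacent pair is in non-descending order under `compare`.
const isSortedBy = (items, compare) =>
  items.every((item, i) => i === 0 || compare(items[i - 1], item) <= 0);

// Made-up keys, stored in native (UTF-16 code unit) order.
const asciiKeys = ['a', 'b', 'c'];
const mixedKeys = ['a', '\u{1F600}', '\uFF5E'];

const asciiStillSorted = isSortedBy(asciiKeys, compareByCodePoints);
const mixedStillSorted = isSortedBy(mixedKeys, compareByCodePoints);

console.log(asciiStillSorted, mixedStillSorted); // true false
```

Data for which this check passes everywhere needs no re-sort. A single failure means the stored order and the new comparison disagree for that collection.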
- `*BREAKING*:` in the commit message with migration instructions for any breaking change.
- `NEWS.md` for user-facing changes.