This repository has been archived by the owner on Aug 5, 2024. It is now read-only.
This is a great library! Unfortunately it has been ambiguous about what input it accepts and what it outputs. That is, while we know that it's "character based," we don't have a definition of "character." The Lua library even makes it clear that since Lua is unaware of Unicode, it will treat content "as a series of bytes, not a series of characters."
This ambiguity has caused numerous problems for folks wanting to interchange delta strings, and it gets us into tricky situations when dealing with emoji and other characters that are encoded as surrogate pairs in UTF-16.
Consider this example:
A: 🅰🅰
B: 🅰a🅰
We can all agree that what happened is that we inserted an a between the two existing 🅰 characters.
Some libraries produce this delta: =1\t+a\t=1
- Python3
- Python2 when compiled in wide mode

Most libraries produce this delta: =2\t+a\t=2
- Python2 when compiled in narrow mode
- JavaScript
- Objective-C
- Java
I didn't check the others. This seems like enough to highlight the disparity in indexing and length calculations.
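To make the disparity concrete, here is a minimal Python 3 sketch that measures the same text in each candidate unit and builds the delta for the example above. The helper names `lengths` and `insertion_delta` are made up for illustration and are not part of any diff-match-patch port; "bytes" is assumed to mean UTF-8 bytes, and the inserted text is not percent-encoded the way a real delta would be.

```python
def lengths(text: str) -> dict:
    """Measure `text` in the three candidate units."""
    return {
        # Python 3 strings are sequences of code points.
        "code_points": len(text),
        # Each UTF-16 code unit is two bytes; the BOM-less encoding avoids an extra unit.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Assumption: "bytes" means the UTF-8 encoding.
        "utf8_bytes": len(text.encode("utf-8")),
    }


def insertion_delta(before: str, inserted: str, after: str, unit: str) -> str:
    """Build the delta for inserting `inserted` between `before` and `after`."""
    keep_before = lengths(before)[unit]
    keep_after = lengths(after)[unit]
    # Simplification: real deltas percent-encode the inserted text.
    return f"={keep_before}\t+{inserted}\t={keep_after}"


for unit in ("code_points", "utf16_code_units", "utf8_bytes"):
    print(unit, repr(insertion_delta("🅰", "a", "🅰", unit)))
```

The same edit yields three different deltas depending only on the unit, which is the interchange problem described here.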
I propose a new non-breaking change to indicate what the index and length values are measured in.
In my own work in #80 I discovered that clients are fine with decoding a blank insertion group in fromDelta().
Therefore I propose that we send blank insertion groups at the front of a delta to indicate what the indexing and length values correspond to.
There are only three realistic measurement units:
- Unicode code points (probably what would have been most ideal to use from the start)
- UTF-16 code units (because most platforms and languages use this internally)
- bytes (because that's the most agnostic way of measuring this)
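A decoder that knows the unit still has to map a length in that unit back onto its own string indexing. The following is only a sketch of how that could look in Python 3, where strings are indexed by code point. The function name `to_code_point_count` and the unit labels are hypothetical, and "bytes" is again assumed to mean UTF-8.

```python
# Width of a single code point in each unit (unit labels are illustrative).
WIDTHS = {
    "code_points": lambda ch: 1,
    "utf16_code_units": lambda ch: len(ch.encode("utf-16-le")) // 2,
    "bytes": lambda ch: len(ch.encode("utf-8")),  # assumption: UTF-8
}


def to_code_point_count(text: str, start: int, length: int, unit: str) -> int:
    """Return how many code points, starting at text[start], span `length` units."""
    width = WIDTHS[unit]
    consumed = 0
    index = start
    while consumed < length:
        if index >= len(text):
            raise ValueError("length runs past the end of the text")
        consumed += width(text[index])
        index += 1
    if consumed != length:
        # For example, a UTF-16 length that ends between two surrogates.
        raise ValueError("length splits a character")
    return index - start
```

With the example text, `to_code_point_count("🅰🅰", 0, 2, "utf16_code_units")` and `to_code_point_count("🅰🅰", 0, 4, "bytes")` both resolve to one code point, while a UTF-16 length of 1 is rejected because it would cut the surrogate pair in half.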
In addition we should point out that the legacy behavior is to not report measurement units.
In my proposal we'd stick a number of empty insertion groups at the front of a delta to indicate which of those measurement units we want, in the order above: one group to indicate Unicode code points (since Unicode is the nominal way to think about text here); two groups to indicate UTF-16 code units (since these are two-byte units); three groups to indicate bytes (because I don't know what to do about Lua other than to make it obvious); and no groups to indicate an unreported measurement (identical to all existing deltas).
| Measurement units | Delta |
| --- | --- |
| Unicode code points | +\t=1\t+a\t=1 |
| UTF-16 code units | +\t+\t=2\t+a\t=2 |
| Bytes | +\t+\t+\t=4\t+a\t=4 |
| Unspecified | one of the above without the prefix |
Note that these diffs should (might?) work in all existing libraries to produce the same result as they would without the leading + groups. However, this gives us a chance to update fromDelta() to support the denoted measurement units and then we can slowly migrate the client libraries to support returning their deltas in a requested unit.
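As a rough sketch of what a unit-aware fromDelta() could do first, the Python below counts the leading empty insertion groups and strips them before the rest of the delta is processed. The names `split_unit_prefix` and `add_unit_prefix` and the unit labels are hypothetical, not an existing API in any of the ports, and rejecting more than three empty groups is my own assumption rather than part of the proposal.

```python
# Number of leading empty "+" groups mapped to the unit they announce.
UNIT_BY_PREFIX_COUNT = {
    0: "unspecified",
    1: "code_points",
    2: "utf16_code_units",
    3: "bytes",
}
PREFIX_COUNT_BY_UNIT = {unit: n for n, unit in UNIT_BY_PREFIX_COUNT.items()}


def split_unit_prefix(delta: str):
    """Return (unit, remaining delta) for a tab-separated delta string."""
    tokens = delta.split("\t")
    count = 0
    # An empty insertion group is a token consisting of "+" alone.
    while count < len(tokens) and tokens[count] == "+":
        count += 1
    if count not in UNIT_BY_PREFIX_COUNT:
        raise ValueError(f"unrecognized unit prefix: {count} empty groups")
    return UNIT_BY_PREFIX_COUNT[count], "\t".join(tokens[count:])


def add_unit_prefix(delta: str, unit: str) -> str:
    """Prepend the empty insertion groups that announce `unit`."""
    return "+\t" * PREFIX_COUNT_BY_UNIT[unit] + delta


unit, rest = split_unit_prefix("+\t+\t=2\t+a\t=2")
print(unit, repr(rest))
```

A legacy delta passes through unchanged and is reported as "unspecified", which matches the intent that existing deltas keep their current meaning.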