IndexOutOfBoundsException with Emoji in YAML #508

Closed
twonirwana opened this issue Nov 22, 2024 · 6 comments
Labels
yaml Issue related to YAML format backend

Comments

@twonirwana

twonirwana commented Nov 22, 2024

Characters that are represented by more than one byte in UTF-8 can produce java.lang.IndexOutOfBoundsException: Range [1024, 1024 + 1) out of bounds for length 1024 if they land at a position where they are split across two buffers.
This can be reproduced with the code below and jackson-dataformat-yaml-2.18.1:

    @Test
    void test() throws JsonProcessingException {
        String p = """
                   ---
                   value: "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   000000000000000000000000000000000000000000000000000000000000000000000🪺"
                """;
        ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
        mapper.readValue(p, Object.class);
    }
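
To make the "split up in two buffers" condition concrete, here is a minimal sketch that uses only the JDK. It assumes a read buffer of 1024 chars (taken from the exception message) and builds a hypothetical input whose emoji sits exactly on that boundary. `SplitPairDemo`, `sample` and `pairStraddlesBoundary` are illustrative names and not part of Jackson or SnakeYAML.

```java
import java.nio.charset.StandardCharsets;

final class SplitPairDemo {
    // Assumption: the failing reader works on 1024-char chunks, as the
    // exception message "out of bounds for length 1024" suggests.
    static final int BUFFER_SIZE = 1024;

    // 1023 filler chars followed by U+1FABA, written as its two UTF-16 code units.
    static String sample() {
        return "0".repeat(BUFFER_SIZE - 1) + "\uD83E\uDEBA";
    }

    // True if the char before the boundary is a high surrogate and the char
    // at the boundary is its low surrogate, i.e. a chunk border cuts the pair.
    static boolean pairStraddlesBoundary(String s, int boundary) {
        return s.length() > boundary
                && Character.isHighSurrogate(s.charAt(boundary - 1))
                && Character.isLowSurrogate(s.charAt(boundary));
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("emoji bytes: "
                + "\uD83E\uDEBA".getBytes(StandardCharsets.UTF_8).length);
        System.out.println("straddles:   " + pairStraddlesBoundary(s, BUFFER_SIZE));
    }
}
```

The emoji is one code point, four bytes in UTF-8, and two `char` values in a Java `String`, so a reader that hands out fixed-size `char` chunks can end a chunk between the two halves.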

cowtowncoder added the yaml label Nov 22, 2024

@cowtowncoder
Member

cowtowncoder commented Nov 22, 2024

@twonirwana Thank you for reporting this.

I wonder if this is the same as #497 (or rather has the same root cause).

@twonirwana What is the Unicode value for the Emoji here? It would be good to use a \u escape in the source, just to make sure editors won't mangle it.

@cowtowncoder
Member

cowtowncoder commented Nov 22, 2024

Ok I guess it's:

https://unicodeplus.com/U+1FABA

which I think then must become 2 char surrogate pair:

0xD83E 0xDEBA 

so something like

"00000....\\uD83E\\uDEBA"

in sources
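
For reference, the surrogate pair can be derived from the code point with the standard UTF-16 arithmetic. The sketch below is only a worked check of the values above; `SurrogatePair` and `toSurrogates` are hypothetical names, and the JDK's own `Character.toChars` does the same job.

```java
final class SurrogatePair {
    // Splits a supplementary code point (above U+FFFF) into its UTF-16
    // high and low surrogate.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1FABA);
        // prints "0xD83E 0xDEBA"
        System.out.println(String.format("0x%04X 0x%04X", (int) pair[0], (int) pair[1]));
    }
}
```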

@twonirwana
Author

twonirwana commented Nov 22, 2024

> @twonirwana Thank you for reporting this.
>
> I wonder if this is the same as #497 (or rather has the same root cause).
>
> @twonirwana What is the Unicode value for the Emoji here? It would be good to use a \u escape in the source, just to make sure editors won't mangle it.

Yes, that looks like the same problem as #497. It always happens if a character needs more than one byte, which is the case for a lot of the newer emojis and Asian characters, and it is split by the buffer. If the buffer splits the character in half, the exception is thrown. Maybe it is possible to check while filling the buffer that no multi-byte character is split.
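
A rough sketch of that idea, expressed at the `char` level: when a full chunk ends on a high surrogate, hold that char back and put it at the start of the next chunk. This is not the Jackson or SnakeYAML fix, just an illustration using the JDK only; `PairSafeChunker` and `chunks` are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class PairSafeChunker {
    // Reads everything from in as chunks of at most size chars. A chunk never
    // ends on a high surrogate unless the input itself ends there.
    static List<String> chunks(Reader in, int size) throws IOException {
        if (size < 2) throw new IllegalArgumentException("size must be at least 2");
        List<String> out = new ArrayList<>();
        char[] buf = new char[size];
        int held = -1; // high surrogate carried over from the previous chunk, or -1
        while (true) {
            int len = 0;
            if (held >= 0) {
                buf[len++] = (char) held;
                held = -1;
            }
            // Fill the buffer completely unless the input ends first.
            while (len < size) {
                int n = in.read(buf, len, size - len);
                if (n < 0) break;
                len += n;
            }
            boolean eof = len < size;
            if (len == 0) break;
            if (!eof && Character.isHighSurrogate(buf[len - 1])) {
                held = buf[len - 1]; // keep the pair together in the next chunk
                len--;
            }
            out.add(new String(buf, 0, len));
            if (eof) break;
        }
        return out;
    }

    // Convenience wrapper for in-memory input.
    static List<String> chunks(String s, int size) {
        try {
            return chunks(new StringReader(s), size);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With 1023 filler chars followed by the emoji and a chunk size of 1024, the first chunk comes out one char short and the emoji starts the second chunk intact. The cost is one extra check per filled buffer, so the input is not scanned a second time.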

The \u version of my example is:

String p = """
                   ---
                   value: "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
                   000000000000000000000000000000000000000000000000000000000000000000000\uD83E\uDEBA"
                """;

@cowtowncoder
Member

Ok yes, good to know it is very likely the same issue. Things are only tricky wrt performance; should not have to scan multiple times and so on. But mostly I need to find time to figure out a good fix now that the problem itself is understood.

@twonirwana
Author

The problem is with SnakeYAML and they will fix it with 2.4: https://bitbucket.org/snakeyaml/snakeyaml/issues/1098/openapi-file-that-crashes-snakeyaml-when . Sorry for opening an issue here.

@cowtowncoder
Member

I am actually not 100% sure it's SnakeYAML that does decoding; exception stack trace can confirm that. What I know is that yaml/src/main/java/com/fasterxml/jackson/dataformat/yaml/UTF8Reader.java has an issue that would explain this.

But I can reopen this when I find time to dig into the earlier issue (#497).
