This repository has been archived by the owner on Jan 9, 2025. It is now read-only.

Handle non-ascii characters in url #193

Open
kolesar-andras opened this issue Nov 11, 2019 · 0 comments

Zombie driver fails when a URL contains "high bytes", that is, non-ASCII characters. The following example is a valid Hungarian URL with accented characters:

https://hu.wikipedia.org/wiki/Műemlék

Desktop browsers and the Mink Goutte driver translate the high bytes correctly:

https://hu.wikipedia.org/wiki/M%C5%B1eml%C3%A9k

Zombie driver sends the string as-is to JavaScript, and then bytes above 0x7f get corrupted somewhere in Zombie:

https://hu.wikipedia.org/wiki/Mqeml\xe9k

It's a bit strange how characters are truncated:

  • letter é becomes \xe9, which is its character code in ISO-8859-1
  • letter ű becomes q because this character does not exist in that code page

Characters that don't exist in ISO-8859-1 are replaced with regular letters, for example q, so the damage is irreversible.
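One possible explanation for the q, offered here as an assumption and not confirmed anywhere in Zombie's code: each character's Unicode code point is reduced to its low 8 bits. ű is U+0171, whose low byte is 0x71 (the letter q), while é is U+00E9, which stays 0xe9. A minimal Python sketch of that hypothesis (the helper name `truncate_to_low_byte` is made up for illustration) reproduces the damaged path segment:

```python
def truncate_to_low_byte(text: str) -> str:
    """Keep only the low 8 bits of every code point (hypothetical failure mode)."""
    return "".join(chr(ord(ch) & 0xFF) for ch in text)


damaged = truncate_to_low_byte("Műemlék")

# Encoding as ISO-8859-1 (latin-1) exposes the raw byte values.
print(damaged.encode("latin-1"))
```

Under this assumption ű and q become indistinguishable after truncation, which matches the observation that the original URL cannot be recovered.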

The example shows that desktop browsers translate non-ASCII characters to percent-encoded bytes using their UTF-8 encoding:

  • letter é becomes %C3%A9
  • letter ű becomes %C5%B1

That's correct; web servers expect URLs in this form.
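The expected translation can be sketched in a few lines of Python. This is only an illustration of the UTF-8 percent-encoding step, not the driver's actual fix; the helper name `encode_url_path` is hypothetical, and the sketch encodes the path component only.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url_path(url: str) -> str:
    """Percent-encode non-ASCII characters in the path of a URL."""
    parts = urlsplit(url)
    # quote() encodes as UTF-8 by default; "/" and "%" are kept so that
    # path separators and already-encoded sequences are left untouched.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(encode_url_path("https://hu.wikipedia.org/wiki/Műemlék"))
```

Applying the same step before the URL is handed to Zombie would keep the string pure ASCII, so nothing above 0x7f would reach the part that corrupts it.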
