Firebird database with malformed strings cannot be accessed [PYFB76] #91
Comments
Commented by: @pcisar It's strange, because client charset OCTETS should work as a pass-through for all strings: b2u() returns the string (bytes in this case) unchanged when the passed charset is None (which it should be according to ibase.charset_map). But a P3 application or upper layer has to be prepared to handle bytes instead of a unicode string on output. Anyway, this is a tricky problem. I don't like the idea of hiding decode errors (especially if it's easy to use this option on a global scale), because such errors can then slip through unnoticed and cause problems elsewhere that are much harder to detect. A pass-through solution that would return a byte stream instead of unicode when an error is encountered has the same problem, as it means that users must expect various result types. But an optional connection option (disabled by default) for decoding error resolution appears to be a good enough solution, and I'll implement it for the next release. However, usability depends on how this would be exposed by other layers like sqlalchemy. As a hotfix, you can change the b2u() function in fbcore.py to replace or ignore bad characters.
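The hotfix suggested above could look roughly like the following. This is only a sketch: the name b2u_lenient and its errors parameter are hypothetical and not part of FDB, and the body is inferred from the behaviour described in this thread (return the input unchanged when the charset is None, otherwise decode), not copied from fbcore.py.

```python
def b2u_lenient(st, charset, errors="replace"):
    """Hypothetical stand-in for b2u(): decode bytes but tolerate bad input.

    `errors` is passed straight to bytes.decode(); "replace" substitutes
    U+FFFD for undecodable bytes, "ignore" drops them, "strict" raises.
    """
    if charset is None:
        # Pass-through case (e.g. OCTETS): hand the bytes back untouched.
        return st
    return st.decode(charset, errors)


# Hypothetical payload with one corrupted byte (0x90) at position 1.
value = b"A\x90BC"
replaced = b2u_lenient(value, "utf-8")            # 'A\ufffdBC'
ignored = b2u_lenient(value, "utf-8", "ignore")   # 'ABC'
untouched = b2u_lenient(value, None)              # b'A\x90BC'
```

With "replace" the damage stays visible in the result as U+FFFD, while "ignore" silently shortens the string, which is the kind of hidden error the comment above warns about.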
Commented by: Joern Ungermann (joernu76) Thanks, solving this would be marvelous, but I do not see an easy solution that covers the whole stack, either. The quick fix works as intended, but doesn't scale very well, as it breaks our automatic deployment. We really should switch to docker or something similar. If I try to use the charset=OCTETS option in sqlalchemy to bypass the decoding, I get the following error: File "/home/icg173/bin/gloripy_V01781_R7633e89/gloripy/data/database.py", line 792, in __init__ The "OCTETS" charset from "firebird+fdb://{}:{}@{}/{}?charset=OCTETS" is translated to "None" in fdb.ibase.charset_map.
Commented by: @pcisar I see, FDB clearly does not handle OCTETS properly. Using it as the client charset is a very wild idea that nobody should even try, but anyway. The problem with OCTETS handling is that it adds another level of complexity to code that is already trying to handle the P2/P3 string difference in split-brain fashion (i.e. P3 support in FDB is somewhat crude and unclean thanks to its history going back to the early P3 era). I worry that it will not get better anytime soon (it's planned as part of the transition from P2-centered to P3-centered development of FDB over the 2.x series). OK, I'll try to add optional support for a replace/ignore option to the bytes-to-unicode conversions in FDB as soon as possible. However, the conversion happens in several places in various contexts, so it may take some time, as I'd like to do it properly.
Submitted by: Joern Ungermann (joernu76)
We have a Firebird database containing a few malformed ASCII strings. These are byte streams, presumably ASCII coded, stored as is from a rather unreliable wireless transmission. Thus, there are bit flips, which cause the strings to contain some strange characters.
Using Python 2, we had no problem accessing such databases. Using Python 3, everything needs to be decoded/encoded.
Accessing the database in the default way through sqlalchemy and fdb raises a UnicodeDecodeError upon reading an entry with a malformed string:
[...]
File "/home/icg173/anaconda3/lib/python3.6/site-packages/fdb/fbcore.py", line 2659, in __xsqlda2tuple
value = b2u(value, self.__python_charset)
File "/home/icg173/anaconda3/lib/python3.6/site-packages/fdb/fbcore.py", line 480, in b2u
return st.decode(charset)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 1: invalid start byte
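The failure in the traceback can be reproduced without a database. The snippet below is a minimal sketch; the payload is a made-up example chosen so that the offending byte 0x90 sits at position 1, as in the error message above.

```python
# Hypothetical payload: ASCII text with one corrupted byte (0x90) at index 1.
raw = b"A\x90BC"

try:
    raw.decode("utf-8")  # strict decoding, as b2u() does in the traceback
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

0x90 is a UTF-8 continuation byte, so it can never start a character, which is why strict decoding rejects it.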
Playing around with the "charset" option of sqlalchemy made it possible to open some files, namely those where the malformed strings contained only values available in the given charset, but it introduced other problems. Looking into the code, there seems to be an "OCTETS" option, which supposedly passes the byte stream through, but it fails to work (the Python-translated charset is None, which is not a valid argument for encode/decode).
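Why a charset of None breaks decoding can be shown in isolation. In this sketch the dictionary is a hypothetical two-entry stand-in for fdb.ibase.charset_map, not its real contents; only the OCTETS-to-None mapping is taken from this thread.

```python
# Hypothetical excerpt standing in for fdb.ibase.charset_map.
charset_map = {"UTF8": "utf_8", "OCTETS": None}

python_charset = charset_map["OCTETS"]  # None

try:
    # bytes.decode() requires a str encoding name, so None is rejected.
    b"abc".decode(python_charset)
    error = None
except TypeError as exc:
    error = exc
    print(exc)
```

Any code path that calls decode() or encode() with the mapped charset therefore needs an explicit check for None before OCTETS can act as a pass-through.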
I'd prefer either of two options:
a) implement (optionally?) an option to influence the error behaviour of decoding (either replace or ignore would suit our use case)
b) implement a pass-through option for the byte stream
If I overlooked some option that solves my use case with the given code base, I'd be very happy to hear about it.