Extra BOM in CSV file, hledger reports an error #2189
Comments
That's very clear! Thank you. I also found:
We do want hledger to just work on real-world data where possible, so we should be permissive where it doesn't add complications. But I'm not sure we need to go as far as ignoring BOMs appearing anywhere in the input. It seems like an unusual niche case, and one that's easy to solve with preprocessing. Is it really valid for files to change encoding in the middle? I can't imagine many tools that would handle that properly.
Our BOM handling should be mentioned at https://hledger.org/dev/hledger.html#text-encoding
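As an illustration of the preprocessing mentioned above, here is a minimal Python sketch, not part of hledger. It assumes the input is valid UTF-8, and the helper name strip_interior_boms is hypothetical. It keeps a leading BOM if there is one and drops every later occurrence. Replacing at the byte level is safe for valid UTF-8 because the bytes EF BB BF can only occur there as the encoding of U+FEFF.

```python
import codecs


def strip_interior_boms(data: bytes) -> bytes:
    """Keep a leading UTF-8 BOM, if present, and drop every later one.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    """
    bom = codecs.BOM_UTF8  # the three bytes EF BB BF
    if data.startswith(bom):
        return bom + data[len(bom):].replace(bom, b"")
    return data.replace(bom, b"")


# Two BOM-prefixed fragments joined together, as `cat` would produce them.
joined = codecs.BOM_UTF8 + b"a,b\n" + codecs.BOM_UTF8 + b"c,d\n"
cleaned = strip_interior_boms(joined)
```

Running something like this over the concatenated file before handing it to hledger would leave at most one BOM, at the very start.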
Related, https://www.unicode.org/faq/utf_bom.html#BOM says: Q: What should I do with U+FEFF in the middle of a file?
BOM is a troublemaker... ;-) We use extended ASCII, and in the past banks produced CSV files in CP-1250. Some of them upgraded their software and moved to UTF-8, and I believe that is why they produce UTF-8 files with a BOM: to clearly signal that the CSV file is in UTF-8, not CP-1250. It is possible to create a file that starts with a UTF-8 BOM and has a UTF-16LE BOM in the middle: just join a UTF-8 file with a UTF-16LE file. But the result is not valid, because the BOM is just one code point (U+FEFF) encoded differently in each UTF. I thought it might be possible to start with UTF-8 and use a BOM in the middle of the file to switch the encoding to UTF-16LE, but that is not possible because the UTF-16LE BOM is an invalid byte sequence in UTF-8... Well, it could be made to work, but the software would have to inspect each decoding error and test whether the offending bytes are a BOM for another UTF variant... The good news is that UTF-16LE files are rare; UTF-8 is used in most cases.
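The point about one code point having several byte forms can be checked with a short Python sketch. It relies only on the standard library codecs; the variable and function names are hypothetical and chosen for illustration.

```python
import codecs

BOM = "\ufeff"  # the single code point U+FEFF

# The same code point encoded in three UTF variants.
encodings = {
    "utf-8": BOM.encode("utf-8"),          # EF BB BF
    "utf-16-le": BOM.encode("utf-16-le"),  # FF FE
    "utf-16-be": BOM.encode("utf-16-be"),  # FE FF
}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The UTF-16LE BOM bytes (FF FE) are not a legal UTF-8 sequence,
# so a strict UTF-8 decoder rejects them rather than switching encoding.
utf16le_bom_valid_in_utf8 = is_valid_utf8(codecs.BOM_UTF16_LE)
```

Here utf16le_bom_valid_in_utf8 ends up False, which is why a UTF-8 reader sees a decoding error rather than an encoding switch.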
So all we need to do is document our BOM requirements at https://hledger.org/dev/hledger.html#text-encoding as I've done?
What about ignoring
hledger 1.32.3, linux
I have a CSV file in UTF-8 format that starts with a BOM (<feff>).
When I join several such files into one file with cat test-bom-*.csv > test-bom.csv, the resulting file contains several BOM characters. hledger doesn't like those extra BOM characters and reports an error:
I am not sure, but I don't think it is wrong for a UTF-8 file to contain several BOMs; I tried other utilities and they did not fail with an error. In theory, the encoding of a file could change in the middle, e.g. from UTF-8 to UTF-16LE...
How to replicate. Prepare test data, several simple CSV files with BOM and without BOM:
The files test-bom.csv and test-nobom.csv look the same, but they differ in file size:
grep is "confused" by the BOM:
Create import rules. They are the same for all files: I created test-bom.csv.rules and then linked it with ln -s test-bom.csv.rules test-nobom.csv.rules and ln -s test-bom.csv.rules test-bom-1.csv.rules:
TEST
hledger can import a CSV file with a single BOM and a file without a BOM:
hledger doesn't like a file with several BOMs:
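The original commands and outputs are not preserved here, so the following Python sketch only approximates the test data described in the report. The CSV contents and the file name test-bom-2.csv are assumptions for illustration. The other file names come from the report, although in this sketch test-bom.csv is the joined file. It builds a file without a BOM, two files with a BOM, and their concatenation, then measures the size difference and counts the BOMs in the joined file.

```python
import codecs
import tempfile
from pathlib import Path


def build_test_files(directory: Path) -> dict[str, bytes]:
    """Create BOM and no-BOM CSV files and return their contents by name."""
    # Hypothetical CSV contents; the rows in the original report are unknown.
    rows = b"date,description,amount\n2024-01-01,test,1.00\n"
    nobom = directory / "test-nobom.csv"
    bom1 = directory / "test-bom-1.csv"
    bom2 = directory / "test-bom-2.csv"  # assumed second input file
    joined = directory / "test-bom.csv"

    nobom.write_bytes(rows)
    bom1.write_bytes(codecs.BOM_UTF8 + rows)
    bom2.write_bytes(codecs.BOM_UTF8 + rows)
    # Equivalent of: cat test-bom-*.csv > test-bom.csv
    joined.write_bytes(bom1.read_bytes() + bom2.read_bytes())

    return {p.name: p.read_bytes() for p in (nobom, bom1, bom2, joined)}


with tempfile.TemporaryDirectory() as tmp:
    files = build_test_files(Path(tmp))

# A UTF-8 BOM adds exactly three bytes to an otherwise identical file.
size_difference = len(files["test-bom-1.csv"]) - len(files["test-nobom.csv"])
# The concatenated file carries one BOM per input file.
bom_count = files["test-bom.csv"].count(codecs.BOM_UTF8)
```

In this sketch the size difference is the three BOM bytes, and the joined file contains one BOM per input, the second of which sits in the middle of the data.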