Strings with accented characters are being treated as arrays #18

Open
emptyflask opened this issue Aug 4, 2014 · 2 comments

@emptyflask

Using the sax-without-buffer branch.

{
  "city"=>["Montr", "éal"]
}
@ginjo
Owner

ginjo commented Aug 4, 2014

Oh, that's excellent. I've been waiting for this one to come up but couldn't reproduce the issue myself with FileMaker as the data source. In SAX parsing, it's perfectly legitimate for the parser to deliver textual data in chunks. It's even OK to send text chunks before AND after a sub-element. The receiving handler is expected to sort it all out.

So, this is considered perfectly legal:

<element-one>some text<element-two>...</element-two>some more text</element-one>

I've never worked with any data sources that produced this kind of XML, and FileMaker doesn't return any broken text like that. But it looks like Nokogiri and Libxml send the text in two separate callbacks to the SAX parser's text handler whenever special characters are encountered. I'm not sure what the rule is with special characters yet, but it shouldn't matter: the rfm parser needs to concatenate all the text chunks into a single data string.

Before the "sax-without-buffer" branch, a buffer collected all the text before sending it to the translator (which produces rfm objects). The buffer was a temporary fix for this chunky-text issue, and it added unnecessary overhead and complexity. So I yanked it. Now I just need to handle the chunky data in a different way.
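
For illustration only, here is a minimal sketch of one way a handler can join chunked text without a separate buffer stage. `ChunkJoiningHandler` is a hypothetical class, not rfm's actual code. Its callback names (`start_element`, `characters`, `end_element`) mirror those of `Nokogiri::XML::SAX::Document`, but the sketch drives them by hand so it runs without any gems. It assumes each element's text can simply be appended to a string kept on a stack of open elements.

```ruby
# Hypothetical handler, NOT rfm's implementation: an assumption-laden sketch.
# Each open element gets its own mutable string on a stack, and every text
# chunk is appended to the string of the innermost open element.
class ChunkJoiningHandler
  attr_reader :result

  def initialize
    @result = {}
    @stack = []   # one [name, text] pair per currently open element
  end

  def start_element(name, _attrs = [])
    @stack.push([name, +""])   # +"" gives an unfrozen, appendable string
  end

  # A SAX parser may call this several times for a single text node.
  def characters(chunk)
    @stack.last[1] << chunk unless @stack.empty?
  end

  def end_element(name)
    _name, text = @stack.pop
    @result[name] = text unless text.empty?
  end
end

# Simulate the callbacks reported in this issue: one text node, two chunks.
handler = ChunkJoiningHandler.new
handler.start_element("city")
handler.characters("Montr")   # first chunk
handler.characters("éal")     # second chunk, delivered separately
handler.end_element("city")
p handler.result
```

Because the text lives with the open element rather than in a global buffer, text arriving before and after a sub-element also ends up in the same string for the parent.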

Here's a parse of a small data set containing repeating fields and an accented é, run with each of the four parsing gems. Note the "memotext" field in the 2nd record of each result.

manual_fmresultset_with_repeats :nokogiri
 => [{"memotext"=>"memotest3", "stayid"=>[nil, "2nd-stay-id", "3rd-stay-id"], "recordnumber"=>"398613"},
     {"memotext"=>["memot", "\u00E9st7"], "stayid"=>nil, "recordnumber"=>"399341"}] 

manual_fmresultset_with_repeats :libxml  
 => [{"memotext"=>"memotest3", "stayid"=>[nil, "2nd-stay-id", "3rd-stay-id"], "recordnumber"=>"398613"},
     {"memotext"=>["memot", "\u00E9st7"], "stayid"=>nil, "recordnumber"=>"399341"}] 

manual_fmresultset_with_repeats :ox    
 => [{"memotext"=>"memotest3", "stayid"=>[nil, "2nd-stay-id", "3rd-stay-id"], "recordnumber"=>"398613"},
     {"memotext"=>"memot\u00E9st7", "stayid"=>nil, "recordnumber"=>"399341"}] 

manual_fmresultset_with_repeats :rexml
 => [{"memotext"=>"memotest3", "stayid"=>[nil, "2nd-stay-id", "3rd-stay-id"], "recordnumber"=>"398613"},
     {"memotext"=>"memot\u00E9st7", "stayid"=>nil, "recordnumber"=>"399341"}]

Just as you reported.

Anyway, I'm well on the way to addressing this - and the missing data element issue too.

@ginjo
Owner

ginjo commented Aug 5, 2014

Ok, this should be fixed in the sax-without-buffer and master branches now. I noted that Ox appears to drop an extra line feed from source XML text nodes in Ruby 1.8.7. This doesn't seem to be a problem in 1.9 and beyond.
