Strings with accented characters are being treated as arrays #18

emptyflask · 2014-08-04T16:24:19Z

Using the sax-without-buffer branch.

{
  "city"=>["Montr", "éal"]
}

ginjo · 2014-08-04T22:33:53Z

Oh, that's excellent. I've been waiting for this one to come up but couldn't reproduce the issue myself with Filemaker as the data source. When SAX parsing, it's perfectly legitimate for the parser to send textual data in chunks. It's even OK to send text chunks before AND after a sub-element. The receiving handler is expected to sort it all out.

So, this is considered perfectly legal:

<element-one>some text<element-two>...</element-two>some more text</element-one>

I've never worked with any data sources that produced this kind of XML, and Filemaker doesn't return any broken text like that. But it looks like Nokogiri & Libxml are sending the text in two separate callbacks to the SAX parser's text handler whenever special characters are encountered. I'm not sure what the rule is with special characters yet, but it shouldn't matter: the rfm parser needs to concatenate all the text chunks into a single data string.
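To illustrate the concatenation idea, here is a minimal sketch of a handler that joins text chunks per element. It is not rfm's actual code: the class name ChunkJoiningHandler and its result hash are hypothetical, and the callbacks are invoked by hand rather than by a real parser. The callback names (start_element, characters, end_element) simply mirror the ones Nokogiri's SAX document interface uses.

```ruby
# Hypothetical sketch, not rfm's implementation.
# Each open element gets its own text accumulator. Every chunk the parser
# delivers is appended to the innermost open element's accumulator, and the
# joined string is stored when that element closes.
class ChunkJoiningHandler
  attr_reader :result

  def initialize
    @result = {}
    @stack = []  # one mutable string per currently open element
  end

  def start_element(name, attrs = [])
    @stack.push("".dup)
  end

  # May be called several times for a single text node.
  def characters(string)
    @stack.last << string unless @stack.empty?
  end

  def end_element(name)
    text = @stack.pop
    @result[name] = text unless text.empty?
  end
end

# Simulate a parser splitting the text at the accented character.
handler = ChunkJoiningHandler.new
handler.start_element("city")
handler.characters("Montr")
handler.characters("\u00E9al")
handler.end_element("city")
p handler.result
```

Because the accumulator lives on a stack, text that arrives before and after a sub-element still ends up in the parent's string, which covers the mixed-content case shown earlier. A real handler would also need to decide how to treat whitespace-only chunks and repeated element names; this sketch ignores both.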

Before the "sax-without-buffer" branch, a buffer collected all the text before sending it to the translator (which produces rfm objects). The buffer was a temporary fix for handling this chunky issue, and it added unnecessary overhead and complexity. So I yanked it. Now I just need to handle the chunky data in a different way.

Here's a parsing of a small data set containing repeating fields and an accented é, for each of the four parsing gems. Note the "memotext" field in the 2nd record of each parsing.

manual_fmresultset_with_repeats :nokogiri
 => [{"memotext"=>"memotest3", "stayid"=>[nil, "2nd-stay-id", "3rd-stay-id"], "recordnumber"=>"398613"},
     {"memotext"=>["memot", "\u00E9st7"], "stayid"=>nil, "recordnumber"=>"399341"}] 

manual_fmresultset_with_repeats :libxml  
 => [{"memotext"=>"memotest3", "stayid"=>[nil, "2nd-stay-id", "3rd-stay-id"], "recordnumber"=>"398613"},
     {"memotext"=>["memot", "\u00E9st7"], "stayid"=>nil, "recordnumber"=>"399341"}] 

manual_fmresultset_with_repeats :ox    
 => [{"memotext"=>"memotest3", "stayid"=>[nil, "2nd-stay-id", "3rd-stay-id"], "recordnumber"=>"398613"},
     {"memotext"=>"memot\u00E9st7", "stayid"=>nil, "recordnumber"=>"399341"}] 

manual_fmresultset_with_repeats :rexml
 => [{"memotext"=>"memotest3", "stayid"=>[nil, "2nd-stay-id", "3rd-stay-id"], "recordnumber"=>"398613"},
     {"memotext"=>"memot\u00E9st7", "stayid"=>nil, "recordnumber"=>"399341"}]

Just as you reported.

Anyway, I'm well on the way to addressing this - and the missing data element issue too.

ginjo · 2014-08-05T22:46:46Z

Ok, should be fixed in the sax-without-buffer & master branches now. I noted that Ox appears to eliminate an extra line-feed from source XML text nodes in Ruby 1.8.7. This doesn't seem to be a problem in 1.9 and beyond.
