Unicode decode error #73

bbengfort · 2016-08-13T17:18:46Z

The pymongo driver is very strict: if it can't decode a MongoDB document, it raises an exception.

This is turning up in export: after 12 minutes or so, a Post with an encoding error comes up and crashes the entire export process, which is bad.
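The failure mode can be reproduced without MongoDB at all. The following is a minimal sketch using only Python's built-in UTF-8 codec; the sample bytes and the helper names `decode_strict` and `decode_lenient` are made up for illustration and are not part of pymongo or the export code.

```python
# Bytes that are not valid UTF-8: 0xED opens a three-byte sequence, but it
# is followed by a space instead of the continuation bytes it requires.
raw = b"caf\xed latte"


def decode_strict(data):
    """Decode as UTF-8 and raise on the first invalid byte."""
    return data.decode("utf-8")


def decode_lenient(data):
    """Decode as UTF-8, substituting U+FFFD for invalid bytes."""
    return data.decode("utf-8", errors="replace")


try:
    decode_strict(raw)
    strict_error = None
except UnicodeDecodeError as exc:
    # A single bad byte is enough to abort the whole decode.
    strict_error = exc

lenient_text = decode_lenient(raw)

print(strict_error)
print(repr(lenient_text))
```

Strict decoding stops at the first bad byte, which is why one bad Post can take down a long-running export. Lenient decoding keeps going at the cost of silently altering the text.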

bbengfort · 2016-08-13T17:20:25Z

yougov/mongo-connector#101

bbengfort · 2016-08-13T17:43:11Z

To fix this, I wrote out a whitelist of Post IDs:

Wrote 566142 Posts IDs in 644.759 seconds

Post.objects.count() == 566142

bbengfort · 2016-08-13T18:47:35Z

Wrote a script called blaze.py, which goes through the posts and attempts to find any that fail to decode:

100%|███████████████████████████████████████████████████████████████████████████████| 566142/566142 [02:09<00:00, 4365.45id/s]
Phase One: wrote 566142 Posts IDs in 2 minutes 9 seconds
100%|█████████████████████████████████████████████████████████████████████████████| 566142/566142 [37:09<00:00, 267.01posts/s]
Phase Two: wrote 2 Post errors in 37 minutes 9 seconds

It only came up with 2 errors:

"571a333ac1808103a0d6067c",'utf-8' codec can't decode byte 0xed in position 48824: invalid continuation byte
"57726c2ac1808103a5ed63d6",'utf-8' codec can't decode byte 0xed in position 21004: invalid continuation byte
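The source of blaze.py is not shown in this thread, so the following is only a sketch of the two-phase idea under stated assumptions: phase one collects the IDs, phase two fetches each document by ID and records any decode failure instead of crashing. The names `collect_ids`, `find_decode_errors`, `FAKE_STORE`, and `fake_fetch` are hypothetical, and an in-memory dict stands in for the Post collection. It also assumes the failure surfaces as a plain `UnicodeDecodeError`; a real driver may wrap it in its own exception type.

```python
def collect_ids(iter_ids):
    """Phase one: materialize every ID so each document can be fetched alone."""
    return list(iter_ids)


def find_decode_errors(ids, fetch):
    """Phase two: fetch each document by ID and record decode failures."""
    errors = []
    for post_id in ids:
        try:
            fetch(post_id)
        except UnicodeDecodeError as exc:
            # Keep going: one bad document should not stop the scan.
            errors.append((post_id, str(exc)))
    return errors


# Hypothetical stand-in for the database: raw bytes keyed by ID.
FAKE_STORE = {
    "a1": b"fine",
    "b2": b"bad \xed byte",  # 0xED followed by a non-continuation byte
    "c3": b"also fine",
}


def fake_fetch(post_id):
    """Simulate a strict driver that decodes on read."""
    return FAKE_STORE[post_id].decode("utf-8")


ids = collect_ids(FAKE_STORE)
errors = find_decode_errors(ids, fake_fetch)

for post_id, message in errors:
    print(f'"{post_id}",{message}')
```

Fetching by ID one document at a time is slower than a single cursor, but it isolates each failure to a known ID, which is what makes a whitelist (or a list of IDs to skip) possible.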

bbengfort added type: bug priority: high labels Aug 13, 2016
