Unicode decode error #73

Open
bbengfort opened this issue Aug 13, 2016 · 3 comments
Comments

@bbengfort
Member

The pymongo driver is strict: if it can't decode a MongoDB document, it raises an exception.

This is turning up in the export, where (after 12 minutes or so) a Post with an encoding error comes up and crashes the entire export process, which is bad.
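
The failure mode can be reproduced without MongoDB. The sketch below is a stand-in under stated assumptions, not the project's export code: `raw_posts` and `export_strict` are hypothetical names, the bytes are made up, and plain `bytes.decode` plays the role of the driver's strict decoding. It shows how one undecodable record ends a streaming loop before the later records are reached.

```python
# Hypothetical stand-in for a streaming export; not the project's actual code.
raw_posts = [
    b"first post",
    b"bad \xed\xa0\x80 post",  # made-up bytes that strict UTF-8 rejects
    b"third post",
]


def export_strict(records):
    """Yield decoded records; the first decode error ends the generator."""
    for raw in records:
        # errors="strict" is the default for bytes.decode
        yield raw.decode("utf-8")


exported = []
failure = None
try:
    for text in export_strict(raw_posts):
        exported.append(text)
except UnicodeDecodeError as exc:
    # The loop stops here; "third post" is never reached.
    failure = exc

print(exported)
print(failure)
```

Only the records before the bad one are exported, which matches the behaviour described above: everything after the failing Post is lost unless the failure is isolated per record.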

@bbengfort
Member Author

@bbengfort
Member Author

To fix this, I wrote a whitelist of Post IDs:

Wrote 566142 Posts IDs in 644.759 seconds
Post.objects.count() == 566142

@bbengfort
Member Author

Wrote a script called blaze.py, which goes through the posts and attempts to find any decoding errors:

100%|███████████████████████████████████████████████████████████████████████████████| 566142/566142 [02:09<00:00, 4365.45id/s]
Phase One: wrote 566142 Posts IDs in 2 minutes 9 seconds
100%|█████████████████████████████████████████████████████████████████████████████| 566142/566142 [37:09<00:00, 267.01posts/s]
Phase Two: wrote 2 Post errors in 37 minutes 9 seconds
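
The contents of blaze.py are not shown in this thread, so the following is only a sketch of what a two-phase scan like the one logged above could look like. It assumes that IDs can be listed without decoding the document bodies, and that each Post can then be fetched on its own so a failure is tied to a single ID. `find_decode_errors`, `fake_store`, and `fake_fetch` are hypothetical names; the in-memory dictionary stands in for the real collection.

```python
def find_decode_errors(post_ids, fetch):
    """Fetch each ID on its own and collect (id, message) for failures."""
    errors = []
    for post_id in post_ids:
        try:
            fetch(post_id)
        except UnicodeDecodeError as exc:
            # Record the failure and keep going instead of crashing.
            errors.append((post_id, str(exc)))
    return errors


# Hypothetical in-memory stand-in for the Post collection (made-up data).
fake_store = {
    "id-1": b"fine",
    "id-2": b"broken \xed\xa0\x80",
    "id-3": b"also fine",
}


def fake_fetch(post_id):
    """Stand-in for loading one Post; decodes strictly like the driver."""
    return fake_store[post_id].decode("utf-8")


# Phase one: collect the IDs only.
post_ids = list(fake_store)

# Phase two: fetch each Post by ID and log the ones that fail to decode.
errors = find_decode_errors(post_ids, fake_fetch)
for post_id, message in errors:
    print(f'"{post_id}",{message}')
```

Fetching one record per ID is slower than streaming a cursor, but it turns a crash into a list of offending IDs that can be repaired or skipped.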

The scan came up with only 2 errors:

"571a333ac1808103a0d6067c",'utf-8' codec can't decode byte 0xed in position 48824: invalid continuation byte
"57726c2ac1808103a5ed63d6",'utf-8' codec can't decode byte 0xed in position 21004: invalid continuation byte
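
Both messages point at byte 0xed. One possible cause, which this thread does not confirm, is a UTF-16 surrogate code point encoded directly as three bytes (0xED followed by a byte in the range 0xA0 to 0xBF), a sequence that strict UTF-8 forbids. The sketch below uses only the Python standard library and made-up bytes, not the data from these two posts, to show the strict failure next to two tolerant error handlers.

```python
# Made-up sample: U+D800 (a lone surrogate) encoded directly as three bytes.
sample = b"ok \xed\xa0\x80 ok"

strict_error = None
try:
    sample.decode("utf-8")  # strict is the default
except UnicodeDecodeError as exc:
    strict_error = exc

# Lossy option: undecodable bytes become U+FFFD replacement characters.
replaced = sample.decode("utf-8", errors="replace")

# Lossless option: keep the lone surrogate in the resulting str.
passed = sample.decode("utf-8", errors="surrogatepass")

print(strict_error)
# ascii() avoids encoding problems when printing unusual code points.
print(ascii(replaced))
print(ascii(passed))
```

Whether replacing or preserving the bytes is appropriate depends on what the two posts actually contain, so inspecting the bytes around the reported positions would be the first step.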
