#983 closed defect (fixed)
UnicodeDecodeError when processing a PDF file
Reported by: | Peter Kuma | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 0.9.0 |
Component: | programming | Keywords: | |
Cc: | Parent Tickets: |
Description
When submitting a PDF file with non-ASCII characters in the document info, MediaGoblin fails with a UnicodeDecodeError. For example, a file containing
Creator: Microsoft\xc2\xae Word 2013
fails with
UnicodeDecodeError('ascii', 'Creator: Microsoft\xc2\xae Word 2013\n', 25, 26, 'ordinal not in range(128)')
at media_types/pdf/processing.py:210, where the output of pdfinfo is decoded with:
lines = [l.decode() for l in lines]
The problem is that decode uses the default string encoding, which is ascii in my case. It can be fixed by specifying UTF-8 encoding explicitly:
lines = [l.decode('utf-8', 'ignore') for l in lines]
However, I am not sure whether pdfinfo always outputs the information in UTF-8 or whether the encoding depends on the document.
The decoding line was added by cda3055.
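The failure and the proposed fix can be sketched in isolation. This is a minimal illustration, not MediaGoblin's actual code: the helper names decode_lines and parse_info are hypothetical, and the second sample line is made up to show what happens when the bytes are not valid UTF-8. The ticket concerns Python 2.7, where decode() with no argument used ascii; Python 3's bytes.decode() defaults to UTF-8, so the sketch passes "ascii" explicitly to reproduce the error.

```python
# Hypothetical helper names; a sketch, not MediaGoblin's actual code.

def decode_lines(raw_lines, encoding="utf-8", errors="replace"):
    """Decode byte lines with an explicit encoding instead of the default."""
    return [line.decode(encoding, errors) for line in raw_lines]


def parse_info(lines):
    """Split 'Key: value' lines into a dict, skipping lines without a colon."""
    info = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Sample bytes: the first line is from the ticket, the second is made up
# to show what happens with bytes that are not valid UTF-8.
raw = [
    b"Creator: Microsoft\xc2\xae Word 2013\n",
    b"Producer: bad \xff byte\n",
]

# Reproduce the failure by forcing the ASCII codec, as Python 2 did by default.
try:
    decode_lines(raw, encoding="ascii", errors="strict")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

info = parse_info(decode_lines(raw))
print(ascii_failed)
print(info)
```

The sketch uses the "replace" error handler rather than the "ignore" handler suggested above. Both avoid the exception; "ignore" silently drops undecodable bytes, while "replace" substitutes U+FFFD so that a wrong encoding guess stays visible in the stored metadata. Which one suits MediaGoblin is a judgment call.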
The output of ./lazyserver.sh:
2014-10-13 09:27:09,900 DEBUG [mediagoblin.processing.task] Processing <MediaEntry 10: doc>
2014-10-13 09:27:10,025 ERROR [mediagoblin.processing.task] An unhandled exception was raised while processing <MediaEntry 10: doc>
2014-10-13 09:27:10,025 WARNING [mediagoblin.processing] No idea what happened here, but it failed: UnicodeDecodeError('ascii', 'Creator: Microsoft\xc2\xae Word 2013\n', 25, 26, 'ordinal not in range(128)')
2014-10-13 09:27:10,153 WARNING [mediagoblin.processing] No idea what happened here, but it failed: UnicodeDecodeError('ascii', 'Creator: Microsoft\xc2\xae Word 2013\n', 25, 26, 'ordinal not in range(128)')
[2014-10-13 09:27:10 +0000] [4679] [ERROR] Error handling request
Traceback (most recent call last):
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/gunicorn-19.1.1-py2.7.egg/gunicorn/workers/sync.py", line 93, in handle
    self.handle_request(listener, req, client, addr)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/gunicorn-19.1.1-py2.7.egg/gunicorn/workers/sync.py", line 134, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/mnt/data/mediagoblin/mediagoblin/app.py", line 268, in __call__
    return self.call_backend(environ, start_response)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/Werkzeug-0.9.6-py2.7.egg/werkzeug/wsgi.py", line 588, in __call__
    return self.app(environ, start_response)
  File "/mnt/data/mediagoblin/mediagoblin/app.py", line 245, in call_backend
    response = controller(request)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 46, in wrapper
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 73, in new_controller_func
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 46, in wrapper
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 73, in new_controller_func
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 102, in wrapper
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/submit/views.py", line 69, in submit_start
    urlgen=request.urlgen)
  File "/mnt/data/mediagoblin/mediagoblin/submit/lib.py", line 201, in submit_media
    run_process_media(entry, feed_url)
  File "/mnt/data/mediagoblin/mediagoblin/submit/lib.py", line 252, in run_process_media
    task_id=entry.queued_task_id)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/task.py", line 547, in apply_async
    link=link, link_error=link_error, **options)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/task.py", line 739, in apply
    request=request, propagate=throw)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/trace.py", line 354, in eager_trace_task
    uuid, args, kwargs, request)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/trace.py", line 253, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/processing/task.py", line 99, in run
    processor.process(**reprocess_info)
  File "/mnt/data/mediagoblin/mediagoblin/media_types/pdf/processing.py", line 413, in process
    self.extract_pdf_info()
  File "/mnt/data/mediagoblin/mediagoblin/media_types/pdf/processing.py", line 339, in extract_pdf_info
    pdf_info_dict = pdf_info(self.pdf_filename)
  File "/mnt/data/mediagoblin/mediagoblin/media_types/pdf/processing.py", line 210, in pdf_info
    lines = [l.decode() for l in lines]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 25: ordinal not in range(128)
Change History (3)
comment:1 by , 9 years ago
Milestone: | → 0.8.1 |
---|---|
Status: | new → review |
I encountered this issue. The attached patch fixed this on my instance, and I was able to successfully process about 20 PDF files that failed previously. Submitting patch for review for 0.8.1 milestone.
comment:3 by , 9 years ago
Milestone: | 0.8.2 → 0.9.0 |
---|
All 0.8.2 tickets are being rolled over to 0.9.0.