#983 closed defect (fixed)
UnicodeDecodeError when processing a PDF file
| Reported by: | Peter Kuma | Owned by: | |
|---|---|---|---|
| Priority: | major | Milestone: | 0.9.0 |
| Component: | programming | Keywords: | |
| Cc: | | Parent Tickets: | |
Description
When a PDF file with non-ASCII characters in its document info is submitted, MediaGoblin fails with a UnicodeDecodeError. For example, a file whose info contains
Creator: Microsoft\xc2\xae Word 2013
fails with
UnicodeDecodeError('ascii', 'Creator: Microsoft\xc2\xae Word 2013\n', 25, 26, 'ordinal not in range(128)')
at media_types/pdf/processing.py:210, where the output of pdfinfo is decoded with:
lines = [l.decode() for l in lines]
The problem is that decode(), called without arguments, uses the default string encoding, which is ASCII in my case. It can be fixed by specifying the UTF-8 encoding explicitly:
lines = [l.decode('utf-8', 'ignore') for l in lines]
However, I am not sure whether pdfinfo always outputs the information in UTF-8 or whether the encoding depends on the document.
The line was added by commit cda3055.
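The difference between the decoding options can be checked in isolation. The following is a minimal sketch, not MediaGoblin code: it is written for Python 3, where bytes.decode() defaults to UTF-8, so the ASCII codec is named explicitly to reproduce the Python 2 behaviour from the report. The Latin-1 sample line is a hypothetical input, included only to show what the 'ignore' and 'replace' error handlers do when the bytes are not valid UTF-8.

```python
# Bytes similar to the pdfinfo line from the report; \xc2\xae is the
# UTF-8 encoding of the registered sign (U+00AE).
utf8_line = b'Creator: Microsoft\xc2\xae Word 2013\n'

# Hypothetical input: the same text encoded as Latin-1 instead of UTF-8.
latin1_line = b'Creator: Microsoft\xae Word 2013\n'

# Decoding as ASCII reproduces the failure described in the report.
try:
    utf8_line.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# An explicit UTF-8 decode succeeds on well-formed UTF-8 input.
decoded = utf8_line.decode('utf-8')

# On bytes that are not valid UTF-8 the error handler decides the outcome:
# 'ignore' silently drops the bad byte, 'replace' substitutes U+FFFD.
ignored = latin1_line.decode('utf-8', 'ignore')
replaced = latin1_line.decode('utf-8', 'replace')

print(ascii_failed)
print(repr(decoded))
print(repr(ignored))
print(repr(replaced))
```

With either handler the decode no longer raises, but 'ignore' loses the character without a trace, while 'replace' leaves a visible marker that the input was not UTF-8.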
The output of ./lazyserver.sh:
2014-10-13 09:27:09,900 DEBUG [mediagoblin.processing.task] Processing <MediaEntry 10: doc>
2014-10-13 09:27:10,025 ERROR [mediagoblin.processing.task] An unhandled exception was raised while processing <MediaEntry 10: doc>
2014-10-13 09:27:10,025 WARNING [mediagoblin.processing] No idea what happened here, but it failed: UnicodeDecodeError('ascii', 'Creator: Microsoft\xc2\xae Word 2013\n', 25, 26, 'ordinal not in range(128)')
2014-10-13 09:27:10,153 WARNING [mediagoblin.processing] No idea what happened here, but it failed: UnicodeDecodeError('ascii', 'Creator: Microsoft\xc2\xae Word 2013\n', 25, 26, 'ordinal not in range(128)')
[2014-10-13 09:27:10 +0000] [4679] [ERROR] Error handling request
Traceback (most recent call last):
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/gunicorn-19.1.1-py2.7.egg/gunicorn/workers/sync.py", line 93, in handle
    self.handle_request(listener, req, client, addr)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/gunicorn-19.1.1-py2.7.egg/gunicorn/workers/sync.py", line 134, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/mnt/data/mediagoblin/mediagoblin/app.py", line 268, in __call__
    return self.call_backend(environ, start_response)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/Werkzeug-0.9.6-py2.7.egg/werkzeug/wsgi.py", line 588, in __call__
    return self.app(environ, start_response)
  File "/mnt/data/mediagoblin/mediagoblin/app.py", line 245, in call_backend
    response = controller(request)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 46, in wrapper
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 73, in new_controller_func
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 46, in wrapper
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 73, in new_controller_func
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 102, in wrapper
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/submit/views.py", line 69, in submit_start
    urlgen=request.urlgen)
  File "/mnt/data/mediagoblin/mediagoblin/submit/lib.py", line 201, in submit_media
    run_process_media(entry, feed_url)
  File "/mnt/data/mediagoblin/mediagoblin/submit/lib.py", line 252, in run_process_media
    task_id=entry.queued_task_id)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/task.py", line 547, in apply_async
    link=link, link_error=link_error, **options)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/task.py", line 739, in apply
    request=request, propagate=throw)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/trace.py", line 354, in eager_trace_task
    uuid, args, kwargs, request)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/trace.py", line 253, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/processing/task.py", line 99, in run
    processor.process(**reprocess_info)
  File "/mnt/data/mediagoblin/mediagoblin/media_types/pdf/processing.py", line 413, in process
    self.extract_pdf_info()
  File "/mnt/data/mediagoblin/mediagoblin/media_types/pdf/processing.py", line 339, in extract_pdf_info
    pdf_info_dict = pdf_info(self.pdf_filename)
  File "/mnt/data/mediagoblin/mediagoblin/media_types/pdf/processing.py", line 210, in pdf_info
    lines = [l.decode() for l in lines]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 25: ordinal not in range(128)
Change History (3)
comment:1 by , 10 years ago
| Milestone: | → 0.8.1 |
|---|---|
| Status: | new → review |
comment:3 by , 10 years ago
| Milestone: | 0.8.2 → 0.9.0 |
|---|
All 0.8.2 tickets are being rolled over to 0.9.0

I encountered this issue. The attached patch fixed this on my instance, and I was able to successfully process about 20 PDF files that failed previously. Submitting patch for review for 0.8.1 milestone.
From 4f0ea3a8d0749ccf932e56f433f5ee432005d6bd Mon Sep 17 00:00:00 2001
From: ayleph <ayleph@thisshitistemp.com>
Date: Fri, 4 Dec 2015 02:02:02 -0500
Subject: [PATCH] Fix issue 983 PDF UnicodeDecodeError

Parse PDF lines as unicode to prevent UnicodeDecodeError when a non-ASCII character is encountered.
---
 mediagoblin/media_types/pdf/processing.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mediagoblin/media_types/pdf/processing.py b/mediagoblin/media_types/pdf/processing.py
index f6d10a5..ac4bab6 100644
--- a/mediagoblin/media_types/pdf/processing.py
+++ b/mediagoblin/media_types/pdf/processing.py
@@ -207,7 +207,7 @@ def pdf_info(original):
         _log.debug('pdfinfo could not read the pdf file.')
         raise BadMediaFail()
 
-    lines = [l.decode() for l in lines]
+    lines = [l.decode('utf-8', 'replace') for l in lines]
     info_dict = dict([[part.strip() for part in l.strip().split(':', 1)]
                       for l in lines if ':' in l])
 
-- 
2.6.2
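
For anyone who wants to try the patched parsing step without a MediaGoblin instance, here is a rough standalone sketch. The function name parse_pdfinfo_output and the sample bytes are hypothetical, not taken from the MediaGoblin source, and the code is written for Python 3. It assumes, as the patch does, that the pdfinfo output is UTF-8, and it falls back to the 'replace' handler when it is not.

```python
def parse_pdfinfo_output(raw, encoding='utf-8'):
    """Turn raw pdfinfo-style output (bytes) into a dict of text fields.

    Hypothetical helper for illustration. Undecodable bytes become
    U+FFFD rather than raising UnicodeDecodeError.
    """
    info = {}
    for line in raw.splitlines():
        text = line.decode(encoding, 'replace')
        if ':' not in text:
            continue  # skip lines that are not "Key: value" pairs
        # Split on the first colon only, so values may contain colons.
        key, value = text.split(':', 1)
        info[key.strip()] = value.strip()
    return info


# Made-up sample resembling pdfinfo output, including the line from the report.
sample = (
    b'Creator:        Microsoft\xc2\xae Word 2013\n'
    b'CreationDate:   Mon Oct 13 09:00:00 2014\n'
    b'Pages:          3\n'
)
info = parse_pdfinfo_output(sample)
print(info['Creator'])
print(info['CreationDate'])
print(info['Pages'])
```

Splitting on the first colon only mirrors the split(':', 1) call in the patched code, which keeps values such as timestamps intact.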