Opened 10 years ago

Closed 8 years ago

Last modified 8 years ago

#983 closed defect (fixed)

UnicodeDecodeError when processing a PDF file

Reported by: Peter Kuma
Owned by:
Priority: major
Milestone: 0.9.0
Component: programming
Keywords:
Cc:
Parent Tickets:

Description

When submitting a PDF file with non-ASCII characters in the document info, MediaGoblin fails with a UnicodeDecodeError. For example, a file whose pdfinfo output contains

Creator: Microsoft\xc2\xae Word 2013

fails with

UnicodeDecodeError('ascii', 'Creator:        Microsoft\xc2\xae Word 2013\n', 25, 26, 'ordinal not in range(128)')

at media_types/pdf/processing.py:210, where the output of pdfinfo is decoded with:

lines = [l.decode() for l in lines]

The problem is that decode() without arguments uses the default string encoding, which is ASCII in my case. It can be fixed by specifying UTF-8 explicitly:

lines = [l.decode('utf-8', 'ignore') for l in lines]

However, I am not sure whether pdfinfo always outputs the information in UTF-8 or whether the encoding is document-dependent.
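
A minimal sketch of the failure and the proposed fix, written so that it also runs on Python 3. The byte string is the Creator line from the log output in this report; the variable names are illustrative only. Note that on Python 3 a bare decode() defaults to UTF-8, so the ASCII codec is named explicitly here to reproduce what Python 2 did by default.

```python
# Raw bytes as they appear in the error message of this report.
raw_line = b'Creator:        Microsoft\xc2\xae Word 2013\n'

# Python 2's bare l.decode() fell back to the ASCII default encoding.
# The codec is spelled out so the failure reproduces on Python 3 as well.
try:
    raw_line.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Explicit UTF-8 decoding succeeds: \xc2\xae is the UTF-8 encoding of U+00AE.
decoded = raw_line.decode('utf-8', 'ignore')

print(ascii_error)
print(decoded)
```

The 'ignore' handler only matters when the bytes are not valid UTF-8; for this particular line nothing is dropped.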

The line was added by cda3055

The output of ./lazyserver.sh:

2014-10-13 09:27:09,900 DEBUG   [mediagoblin.processing.task] Processing <MediaEntry 10: doc>
2014-10-13 09:27:10,025 ERROR   [mediagoblin.processing.task] An unhandled exception was raised while processing <MediaEntry 10: doc>
2014-10-13 09:27:10,025 WARNING [mediagoblin.processing] No idea what happened here, but it failed: UnicodeDecodeError('ascii', 'Creator:        Microsoft\xc2\xae Word 2013\n', 25, 26, 'ordinal not in range(128)')
2014-10-13 09:27:10,153 WARNING [mediagoblin.processing] No idea what happened here, but it failed: UnicodeDecodeError('ascii', 'Creator:        Microsoft\xc2\xae Word 2013\n', 25, 26, 'ordinal not in range(128)')
[2014-10-13 09:27:10 +0000] [4679] [ERROR] Error handling request
Traceback (most recent call last):
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/gunicorn-19.1.1-py2.7.egg/gunicorn/workers/sync.py", line 93, in handle
    self.handle_request(listener, req, client, addr)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/gunicorn-19.1.1-py2.7.egg/gunicorn/workers/sync.py", line 134, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/mnt/data/mediagoblin/mediagoblin/app.py", line 268, in __call__
    return self.call_backend(environ, start_response)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/Werkzeug-0.9.6-py2.7.egg/werkzeug/wsgi.py", line 588, in __call__
    return self.app(environ, start_response)
  File "/mnt/data/mediagoblin/mediagoblin/app.py", line 245, in call_backend
    response = controller(request)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 46, in wrapper
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 73, in new_controller_func
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 46, in wrapper
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 73, in new_controller_func
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/decorators.py", line 102, in wrapper
    return controller(request, *args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/submit/views.py", line 69, in submit_start
    urlgen=request.urlgen)
  File "/mnt/data/mediagoblin/mediagoblin/submit/lib.py", line 201, in submit_media
    run_process_media(entry, feed_url)
  File "/mnt/data/mediagoblin/mediagoblin/submit/lib.py", line 252, in run_process_media
    task_id=entry.queued_task_id)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/task.py", line 547, in apply_async
    link=link, link_error=link_error, **options)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/task.py", line 739, in apply
    request=request, propagate=throw)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/trace.py", line 354, in eager_trace_task
    uuid, args, kwargs, request)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/trace.py", line 253, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/mnt/data/mediagoblin/mediagoblin/processing/task.py", line 99, in run
    processor.process(**reprocess_info)
  File "/mnt/data/mediagoblin/mediagoblin/media_types/pdf/processing.py", line 413, in process
    self.extract_pdf_info()
  File "/mnt/data/mediagoblin/mediagoblin/media_types/pdf/processing.py", line 339, in extract_pdf_info
    pdf_info_dict = pdf_info(self.pdf_filename)
  File "/mnt/data/mediagoblin/mediagoblin/media_types/pdf/processing.py", line 210, in pdf_info
    lines = [l.decode() for l in lines]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 25: ordinal not in range(128)

Change History (3)

comment:1 by ayleph, 8 years ago

Milestone: 0.8.1
Status: new → review

I encountered this issue. The attached patch fixed this on my instance, and I was able to successfully process about 20 PDF files that failed previously. Submitting patch for review for 0.8.1 milestone.

From 4f0ea3a8d0749ccf932e56f433f5ee432005d6bd Mon Sep 17 00:00:00 2001
From: ayleph <ayleph@thisshitistemp.com>
Date: Fri, 4 Dec 2015 02:02:02 -0500
Subject: [PATCH] Fix issue 983 PDF UnicodeDecodeError

Parse PDF lines as unicode to prevent UnicodeDecodeError when a
non-ASCII character is encountered.
---
 mediagoblin/media_types/pdf/processing.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mediagoblin/media_types/pdf/processing.py b/mediagoblin/media_types/pdf/processing.py
index f6d10a5..ac4bab6 100644
--- a/mediagoblin/media_types/pdf/processing.py
+++ b/mediagoblin/media_types/pdf/processing.py
@@ -207,7 +207,7 @@ def pdf_info(original):
         _log.debug('pdfinfo could not read the pdf file.')
         raise BadMediaFail()
 
-    lines = [l.decode() for l in lines]
+    lines = [l.decode('utf-8', 'replace') for l in lines]
     info_dict = dict([[part.strip() for part in l.strip().split(':', 1)] 
                       for l in lines if ':' in l]) 
 
-- 
2.6.2
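
The description proposed the 'ignore' error handler, while the patch above uses 'replace'. The following sketch illustrates how the two differ when pdfinfo output is not valid UTF-8. It assumes a hypothetical helper, parse_pdfinfo_lines, that mirrors the two patched lines of pdf_info(); the helper name, the Pages line, and the Latin-1 sample are illustrative inputs, not taken from MediaGoblin or from a real document.

```python
def parse_pdfinfo_lines(lines, errors='replace'):
    """Hypothetical helper mirroring the decode-and-split step in pdf_info()."""
    decoded = [l.decode('utf-8', errors) for l in lines]
    # Split each "Key:   value" line once on the first colon.
    return dict([part.strip() for part in l.strip().split(':', 1)]
                for l in decoded if ':' in l)

# Valid UTF-8 output (Creator line from this ticket, Pages line made up).
utf8_output = [b'Creator:        Microsoft\xc2\xae Word 2013\n',
               b'Pages:          3\n']
# The same Creator value encoded as Latin-1, which is invalid as UTF-8.
latin1_output = [b'Creator:        Microsoft\xae Word 2013\n']

info = parse_pdfinfo_lines(utf8_output)
replaced = parse_pdfinfo_lines(latin1_output, errors='replace')
ignored = parse_pdfinfo_lines(latin1_output, errors='ignore')

print(info)
print(replaced)
print(ignored)
```

With 'replace', an undecodable byte becomes U+FFFD, so the damage stays visible in the stored metadata; with 'ignore', the byte is dropped silently. Either way, processing no longer aborts on such input.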

comment:2 by Christopher Allan Webber, 8 years ago

Resolution: fixed
Status: review → closed

Committed, thanks!

comment:3 by Christopher Allan Webber, 8 years ago

Milestone: 0.8.2 → 0.9.0

All 0.8.2 tickets are being rolled over to 0.9.0
