id	summary	reporter	owner	description	type	status	priority	milestone	component	resolution	keywords	cc	parents
983	UnicodeDecodeError when processing a PDF file	Peter Kuma		"When submitting a PDF file with non-ASCII characters in the document info, MediaGoblin fails with a UnicodeDecodeError. For example, a file containing

    {{{Creator: Microsoft\xc2\xae Word 2013}}}

fails with

    {{{UnicodeDecodeError('ascii', 'Creator:        Microsoft\xc2\xae Word 2013\n', 25, 26, 'ordinal not in range(128)')}}}

at `media_types/pdf/processing.py:210`, where the output of `pdfinfo` is decoded with:

    {{{lines = [l.decode() for l in lines]}}}

The problem is that `decode()` called without arguments uses the default string encoding, which is `ascii` in my case. It can be fixed by specifying UTF-8 explicitly (note that `'ignore'` silently drops any bytes that are not valid UTF-8):

    {{{lines = [l.decode('utf-8', 'ignore') for l in lines]}}}

However, I am not sure whether pdfinfo always outputs the information in UTF-8, or whether the encoding depends on the document.
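
A minimal sketch of the explicit-decoding approach, assuming pdfinfo emits UTF-8 (which I have not verified). `decode_pdfinfo_lines` is a hypothetical helper, not an existing MediaGoblin function. It uses `'replace'` rather than `'ignore'`, so undecodable bytes show up as U+FFFD instead of disappearing silently:

```python
# Hypothetical helper for illustration; not part of MediaGoblin.
def decode_pdfinfo_lines(raw_lines, encoding='utf-8'):
    # Decode each raw byte line with an explicit encoding instead of
    # relying on the interpreter default (ascii on Python 2).
    # 'replace' substitutes U+FFFD for bytes that cannot be decoded.
    return [line.decode(encoding, 'replace') for line in raw_lines]


# The line from this report: 0xc2 0xae is the UTF-8 encoding of U+00AE.
raw = [b'Creator:        Microsoft\xc2\xae Word 2013\n']
decoded = decode_pdfinfo_lines(raw)
```

If the output turns out to be document dependent, the `encoding` argument is the place to pass a different codec; the UTF-8 default here is only an assumption.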

The line was added in commit [https://gitorious.org/mediagoblin/mediagoblin/commit/cda3055bd6d1810b17a83cde991c7e059ef76657 cda3055].

The output of `./lazyserver.sh`:

{{{
2014-10-13 09:27:09,900 DEBUG   [mediagoblin.processing.task] Processing <MediaEntry 10: doc>
2014-10-13 09:27:10,025 ERROR   [mediagoblin.processing.task] An unhandled exception was raised while processing <MediaEntry 10: doc>
2014-10-13 09:27:10,025 WARNING [mediagoblin.processing] No idea what happened here, but it failed: UnicodeDecodeError('ascii', 'Creator:        Microsoft\xc2\xae Word 2013\n', 25, 26, 'ordinal not in range(128)')
2014-10-13 09:27:10,153 WARNING [mediagoblin.processing] No idea what happened here, but it failed: UnicodeDecodeError('ascii', 'Creator:        Microsoft\xc2\xae Word 2013\n', 25, 26, 'ordinal not in range(128)')
[2014-10-13 09:27:10 +0000] [4679] [ERROR] Error handling request
Traceback (most recent call last):
  File ""/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/gunicorn-19.1.1-py2.7.egg/gunicorn/workers/sync.py"", line 93, in handle
    self.handle_request(listener, req, client, addr)
  File ""/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/gunicorn-19.1.1-py2.7.egg/gunicorn/workers/sync.py"", line 134, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File ""/mnt/data/mediagoblin/mediagoblin/app.py"", line 268, in __call__
    return self.call_backend(environ, start_response)
  File ""/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/Werkzeug-0.9.6-py2.7.egg/werkzeug/wsgi.py"", line 588, in __call__
    return self.app(environ, start_response)
  File ""/mnt/data/mediagoblin/mediagoblin/app.py"", line 245, in call_backend
    response = controller(request)
  File ""/mnt/data/mediagoblin/mediagoblin/decorators.py"", line 46, in wrapper
    return controller(request, *args, **kwargs)
  File ""/mnt/data/mediagoblin/mediagoblin/decorators.py"", line 73, in new_controller_func
    return controller(request, *args, **kwargs)
  File ""/mnt/data/mediagoblin/mediagoblin/decorators.py"", line 46, in wrapper
    return controller(request, *args, **kwargs)
  File ""/mnt/data/mediagoblin/mediagoblin/decorators.py"", line 73, in new_controller_func
    return controller(request, *args, **kwargs)
  File ""/mnt/data/mediagoblin/mediagoblin/decorators.py"", line 102, in wrapper
    return controller(request, *args, **kwargs)
  File ""/mnt/data/mediagoblin/mediagoblin/submit/views.py"", line 69, in submit_start
    urlgen=request.urlgen)
  File ""/mnt/data/mediagoblin/mediagoblin/submit/lib.py"", line 201, in submit_media
    run_process_media(entry, feed_url)
  File ""/mnt/data/mediagoblin/mediagoblin/submit/lib.py"", line 252, in run_process_media
    task_id=entry.queued_task_id)
  File ""/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/task.py"", line 547, in apply_async
    link=link, link_error=link_error, **options)
  File ""/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/task.py"", line 739, in apply
    request=request, propagate=throw)
  File ""/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/trace.py"", line 354, in eager_trace_task
    uuid, args, kwargs, request)
  File ""/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/trace.py"", line 253, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File ""/mnt/data/mediagoblin/env/local/lib/python2.7/site-packages/celery-3.1.16-py2.7.egg/celery/app/trace.py"", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File ""/mnt/data/mediagoblin/mediagoblin/processing/task.py"", line 99, in run
    processor.process(**reprocess_info)
  File ""/mnt/data/mediagoblin/mediagoblin/media_types/pdf/processing.py"", line 413, in process
    self.extract_pdf_info()
  File ""/mnt/data/mediagoblin/mediagoblin/media_types/pdf/processing.py"", line 339, in extract_pdf_info
    pdf_info_dict = pdf_info(self.pdf_filename)
  File ""/mnt/data/mediagoblin/mediagoblin/media_types/pdf/processing.py"", line 210, in pdf_info
    lines = [l.decode() for l in lines]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 25: ordinal not in range(128)
}}}"	defect	closed	major	0.9.0	programming	fixed			
