Opened 11 years ago

Last modified 9 years ago

#598 accepted task

Support for UTF-8 paths

Reported by: Tiberiu C. Turbureanu
Owned by:
Priority: major
Milestone:
Component: programming
Keywords: config, path, utf8, sprint
Cc: Christopher Allan Webber, Tiberiu C. Turbureanu, joar
Parent Tickets:

Description

On a fresh GMG install, I get the following error when the install path contains UTF-8 (non-ASCII) characters:

[tct@turbureanu mediagoblin-ceata]$ pwd
/home/tct/Descărcări/mediagoblin-ceata
[tct@turbureanu mediagoblin-ceata]$ ./bin/gmg dbupdate
Traceback (most recent call last):
  File "./bin/gmg", line 8, in <module>
    load_entry_point('mediagoblin==0.3.3.dev', 'console_scripts', 'gmg')()
  File "/home/tct/Descărcări/mediagoblin-ceata/mediagoblin/gmg_commands/__init__.py", line 100, in main_cli
    args.func(args)
  File "/home/tct/Descărcări/mediagoblin-ceata/mediagoblin/gmg_commands/dbupdate.py", line 129, in dbupdate
    global_config, app_config = setup_global_and_app_config(args.conf_file)
  File "/home/tct/Descărcări/mediagoblin-ceata/mediagoblin/init/__init__.py", line 47, in setup_global_and_app_config
    global_config, validation_result = read_mediagoblin_config(config_path)
  File "/home/tct/Descărcări/mediagoblin-ceata/mediagoblin/init/config.py", line 80, in read_mediagoblin_config
    validation_result = config.validate(validator, preserve_errors=True)
  File "build/bdist.linux-x86_64/egg/configobj.py", line 2295, in validate
  File "build/bdist.linux-x86_64/egg/configobj.py", line 2250, in validate
  File "build/bdist.linux-x86_64/egg/configobj.py", line 570, in __getitem__
  File "build/bdist.linux-x86_64/egg/configobj.py", line 562, in _interpolate
  File "build/bdist.linux-x86_64/egg/configobj.py", line 365, in interpolate
  File "build/bdist.linux-x86_64/egg/configobj.py", line 352, in recursive_interpolate
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 14: ordinal not in range(128)

Change History (6)

comment:1 by Tiberiu C. Turbureanu, 11 years ago

Owner: set to Tiberiu C. Turbureanu
Status: new → accepted

comment:2 by Christopher Allan Webber, 11 years ago

Heya,

So there are two possible problems here. The first is one we can act on locally, and fixing it might be enough. The other possibility is a problem in configobj itself, which would mean looking at getting a patch applied upstream.

But let's try the local fix first. In mediagoblin/init/config.py you'll notice a function called _setup_defaults. Both in that function and in read_mediagoblin_config, which passes the path into it, we are getting file paths, probably as byte strings, and we should probably try decoding them from utf-8 (open question: what happens on operating systems with non-utf8 file paths? Not sure).

    config['DEFAULT']['here'] = os.path.dirname(config_path)

Try taking that line and doing a decode to utf-8 like:

    config['DEFAULT']['here'] = os.path.dirname(config_path).decode('utf-8')

Similarly with the first line in read_mediagoblin_config(), which is in the same module.
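As a rough illustration of why the decode matters: in Python 2, mixing a byte string with a unicode string makes Python decode the bytes implicitly with the ascii codec, which is where the error in the traceback comes from. The sketch below reproduces that decode step explicitly and then applies the utf-8 decode suggested above. The file name mediagoblin.ini and the variable names are assumptions made for the example, and the path is written as a bytes literal so the snippet also runs on Python 3.

```python
import os

# Byte-string form of the path from the report ("ă" is the bytes 0xc4 0x83
# in UTF-8). The file name "mediagoblin.ini" is assumed for illustration.
config_path = b"/home/tct/Desc\xc4\x83rc\xc4\x83ri/mediagoblin-ceata/mediagoblin.ini"

# Decoding with the ascii codec fails at the first non-ASCII byte,
# which is the same failure reported in the traceback.
try:
    config_path.decode("ascii")
    failing_offset = None
except UnicodeDecodeError as exc:
    failing_offset = exc.start  # offset of the offending byte 0xc4

# Decoding explicitly as utf-8 yields a unicode string that can be
# interpolated safely into other unicode config values.
here = os.path.dirname(config_path).decode("utf-8")
```

This only helps when the path really is utf-8 encoded; other file system encodings would need a different codec.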

Does that fix the problem? It would be good to know. If so, please submit a fix here!

comment:3 by Tiberiu C. Turbureanu, 11 years ago

Thanks for the tips, Chris. For ./bin/gmg dbupdate to work I only had to decode the absolute path in read_mediagoblin_config, because the resulting unicode string is then passed on to _setup_defaults():

    config_path = os.path.abspath(config_path).decode('utf-8')

However, starting the server fails, because the first line of ./bin/paster is a Python shebang whose path contains diacritics.

If I put the coding declaration before the shebang, the script fails to call Python from that utf-8 path:

# coding=utf8 
#!/home/tct/țărușî/mediagoblin-ceata/bin/python -x
# ===>
./bin/paster: line 4: __requires__: command not found

If I put the coding declaration after the shebang, I get a syntax error, because the shebang line itself contains diacritics:

#!/home/tct/țărușî/mediagoblin-ceata/bin/python -x
# coding=utf8 
# ===>
SyntaxError: Non-ASCII character '\xc8' in file ./bin/paster on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

It's a chicken-and-egg problem. A possible solution would be to call the local python from within the lazyserver.sh script, but I guess this would defeat the whole point of virtualenv's automatic tooling (which installed paster in the first place), and from what I remember, GNU Bash doesn't support utf-8.

I am very curious how we can solve this bug.

P.S. ConfigObj works fine, there is no need for upstream patching.

comment:4 by warp, 11 years ago

To get the encoding used by the file system, use sys.getfilesystemencoding().
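A minimal sketch of how that could be used, assuming a hypothetical helper named decode_fs_path (it is not part of MediaGoblin or ConfigObj). It decodes byte-string paths with the file system encoding and leaves unicode strings alone. The fallback to utf-8 is an assumption for the case where no encoding is reported.

```python
import sys


def decode_fs_path(path, encoding=None):
    """Return `path` as a unicode string (hypothetical helper).

    Byte strings are decoded with `encoding`, or with the file system
    encoding when none is given. Unicode input is returned unchanged.
    """
    if isinstance(path, bytes):
        # sys.getfilesystemencoding() may return None on some Python 2
        # setups, so fall back to utf-8 (an assumption, not a guarantee).
        encoding = encoding or sys.getfilesystemencoding() or "utf-8"
        return path.decode(encoding)
    return path
```

For example, decode_fs_path(os.path.abspath(config_path)) could stand in for the hard-coded .decode('utf-8') call, at the cost of depending on how the system's locale is configured.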

However, even if you know the encoding, I think you should keep filenames in bytestrings as much as possible, since you don't know which normalization the file system may apply.

For example, I believe OS X does NFD normalization. Other file systems may use NFC or not normalize at all. So if you write something to a file called "pokémon" and try to open it again, you don't know if you should read u'poke\u0301mon' or u'pok\xe9mon'.

You can see the difference in python with:

import unicodedata
unicodedata.normalize("NFC", u"pokémon")
unicodedata.normalize("NFD", u"pokémon")

Ref: http://www.unicode.org/reports/tr15/
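To make the difference concrete, here is a small sketch that uses escape sequences instead of a literal "é", so it does not depend on the source file's encoding. The variable names are just for illustration.

```python
import unicodedata

composed = u"pok\xe9mon"  # "é" as the single code point U+00E9

nfc = unicodedata.normalize("NFC", composed)
nfd = unicodedata.normalize("NFD", composed)

# NFC keeps one code point for "é"; NFD splits it into "e" followed by
# the combining acute accent U+0301, so the strings differ in length.
same_text = nfc == nfd                                 # False
lengths = (len(nfc), len(nfd))                         # (7, 8)
round_trip = unicodedata.normalize("NFC", nfd) == nfc  # True
```

Two names that look identical on screen can therefore compare unequal unless both are normalized to the same form first.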

comment:5 by Christopher Allan Webber, 11 years ago

Keywords: sprint added

comment:6 by Christopher Allan Webber, 11 years ago

Owner: Tiberiu C. Turbureanu removed

Is anyone working on this? I'm under the impression that nobody is, so I'm unclaiming the ticket.
