Opened 12 years ago
Last modified 10 years ago
#598 accepted task
Support for UTF-8 paths
Reported by: | Tiberiu C. Turbureanu | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | programming | Keywords: | config, path, utf8, sprint |
Cc: | Christopher Allan Webber, Tiberiu C. Turbureanu, joar | Parent Tickets: |
Description
On a fresh GMG install I get the following error for a UTF-8 path:
[tct@turbureanu mediagoblin-ceata]$ pwd
/home/tct/Descărcări/mediagoblin-ceata
[tct@turbureanu mediagoblin-ceata]$ ./bin/gmg dbupdate
Traceback (most recent call last):
  File "./bin/gmg", line 8, in <module>
    load_entry_point('mediagoblin==0.3.3.dev', 'console_scripts', 'gmg')()
  File "/home/tct/Descărcări/mediagoblin-ceata/mediagoblin/gmg_commands/__init__.py", line 100, in main_cli
    args.func(args)
  File "/home/tct/Descărcări/mediagoblin-ceata/mediagoblin/gmg_commands/dbupdate.py", line 129, in dbupdate
    global_config, app_config = setup_global_and_app_config(args.conf_file)
  File "/home/tct/Descărcări/mediagoblin-ceata/mediagoblin/init/__init__.py", line 47, in setup_global_and_app_config
    global_config, validation_result = read_mediagoblin_config(config_path)
  File "/home/tct/Descărcări/mediagoblin-ceata/mediagoblin/init/config.py", line 80, in read_mediagoblin_config
    validation_result = config.validate(validator, preserve_errors=True)
  File "build/bdist.linux-x86_64/egg/configobj.py", line 2295, in validate
  File "build/bdist.linux-x86_64/egg/configobj.py", line 2250, in validate
  File "build/bdist.linux-x86_64/egg/configobj.py", line 570, in __getitem__
  File "build/bdist.linux-x86_64/egg/configobj.py", line 562, in _interpolate
  File "build/bdist.linux-x86_64/egg/configobj.py", line 365, in interpolate
  File "build/bdist.linux-x86_64/egg/configobj.py", line 352, in recursive_interpolate
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 14: ordinal not in range(128)
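The error message is consistent with the raw bytes of the install path being decoded as ASCII: in UTF-8 the first "ă" in "Descărcări" starts with byte 0xc4, and it sits at offset 14 of the path. The snippet below is only a sketch that reproduces the message outside MediaGoblin. The names `raw_path` and `try_decode` are made up for illustration, and that this is the exact string ConfigObj was interpolating is an assumption.

```python
# Raw UTF-8 bytes of the directory from the report (illustrative value).
raw_path = b"/home/tct/Desc\xc4\x83rc\xc4\x83ri/mediagoblin-ceata"


def try_decode(raw, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as error:
        return str(error)


# Decoding as ASCII fails on the first non-ASCII byte of the path.
ascii_result = try_decode(raw_path, "ascii")

# Decoding as UTF-8 succeeds and yields the path as text.
utf8_result = try_decode(raw_path, "utf-8")

print(ascii_result)
print(utf8_result)
```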
Change History (6)
comment:1 by , 12 years ago
Owner: | set to |
---|---|
Status: | new → accepted |
comment:2 by , 12 years ago
comment:3 by , 12 years ago
Thanks for the tips, Chris. For ./bin/gmg dbupdate to work, I only had to decode the absolute path in read_mediagoblin_config, because the resulting unicode string is then passed on to _setup_defaults():
config_path = os.path.abspath(config_path).decode('utf-8')
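A slightly more defensive variant of that line, as a sketch only: `decode_path` is a hypothetical helper, not something in MediaGoblin, and falling back to UTF-8 when no filesystem encoding is reported is an assumption. It leaves already-decoded strings alone, so it can be applied to a path more than once.

```python
import os
import sys


def decode_path(path, encoding=None):
    """Return `path` as text, decoding it only if it is a byte string.

    `encoding` defaults to the filesystem encoding reported by Python,
    with UTF-8 as an assumed fallback.
    """
    if isinstance(path, bytes):
        encoding = encoding or sys.getfilesystemencoding() or "utf-8"
        return path.decode(encoding)
    return path


# Illustrative byte-string path; the encoding is passed explicitly here
# so the result does not depend on the local environment.
raw_config_path = os.path.abspath(b"/home/tct/Desc\xc4\x83rc\xc4\x83ri/mediagoblin.ini")
config_path = decode_path(raw_config_path, "utf-8")
```

Whether the filesystem encoding is the right default on systems with non-UTF-8 paths is still the open question raised earlier in this ticket.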
However, starting the server fails, because the first line of ./bin/paster is a Python shebang whose path contains diacritics.
With the coding declaration placed before the shebang, it fails to call Python from that UTF-8 path:
# coding=utf8
#!/home/tct/țărușî/mediagoblin-ceata/bin/python -x
# ===> ./bin/paster: line 4: __requires__: command not found
With the coding declaration placed after the shebang, I get a syntax error, because the shebang has diacritics:
#!/home/tct/țărușî/mediagoblin-ceata/bin/python -x
# coding=utf8
# ===> SyntaxError: Non-ASCII character '\xc8' in file ./bin/paster on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
It's a chicken-and-egg problem. A possible solution would be to call the local Python from within the lazyserver.sh script, but I guess this would defeat the whole idea of virtualenv's automatic tooling (which installed paster in the first place), and from what I remember, GNU Bash doesn't support UTF-8.
I am very curious how we can solve this bug.
P.S. ConfigObj works fine, there is no need for upstream patching.
comment:4 by , 12 years ago
To get the encoding used by the file system, use sys.getfilesystemencoding().
However, even if you know the encoding, I think you should keep filenames in bytestrings as much as possible -- as you don't know the normalizations the file system may apply.
For example, I believe OS X does NFD normalization. Other file systems may use NFC or not normalize at all. So if you write something to a file called "pokémon" and try to open it again, you don't know if you should read u'poke\u0301mon' or u'pok\xe9mon'.
You can see the difference in python with:
import unicodedata
unicodedata.normalize("NFC", u"pokémon")
unicodedata.normalize("NFD", u"pokémon")
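To make the consequence concrete, here is a small sketch showing that the two forms compare unequal as strings until both sides are normalized to the same form. `same_name` is a hypothetical helper for illustration, and picking NFC as the common form is an arbitrary choice.

```python
import unicodedata

composed = u"pok\xe9mon"      # NFC: "é" is one code point
decomposed = u"poke\u0301mon"  # NFD: "e" followed by a combining accent


def same_name(first, second):
    """Compare two names after bringing both to NFC."""
    return (unicodedata.normalize("NFC", first)
            == unicodedata.normalize("NFC", second))


# A plain comparison treats the two spellings as different names.
naive_match = composed == decomposed

# Comparing after normalization treats them as the same name.
normalized_match = same_name(composed, decomposed)
```

This only helps when comparing names in application code. It does not tell you which form the file system stored, which is the argument for keeping the original byte strings around.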
comment:5 by , 12 years ago
Keywords: | sprint added |
---|
comment:6 by , 12 years ago
Owner: | removed |
---|
Is anyone working on this? I'm under the impression that nobody is, so I'm unclaiming the ticket.
Heya,
So there is at least one of two problems here. The first one we can act on, and it might fix it. The other possibility is that it's a problem with configobj, which would mean we'd have to look at applying a patch upstream?
But let's try the local fix first. In mediagoblin/init/config.py you'll notice a function called _setup_defaults. Both there and in read_mediagoblin_config, which passes values into it, there are places where we get file paths, probably as byte strings, and we should probably try decoding them from UTF-8 into unicode strings (question: what happens on operating systems with non-UTF-8 file paths? not sure).
Try taking that line and doing a decode to utf-8 like:
Similarly with the first line in read_mediagoblin_config(), which is in the same module.
Does that fix the problem? It would be good to know. If so, please submit a fix here!