Difference between revisions of "Python 3 String I/O"

From Gramps

Revision as of 00:56, 15 February 2014

String Handling and I/O

Gramps must run correctly under both Python 2.7 and Python 3.x. Unlike Python 2, Python 3 treats every string (str) as a sequence of Unicode characters and automatically transcodes all text-mode I/O to and from a byte encoding. With default settings this will frequently raise UnicodeErrors, so all string I/O must be coded carefully. Moreover, the builtin open() in Python 3 is quite different from that of Python 2. Fortunately Python 2.7 provides io.open(), which behaves like the Python 3 builtin, and in Python 3 io.open() is simply an alias for the builtin open(), so code that calls io.open() runs under both. Binary data must not be read from or written to a file opened in text mode: in Python 3 this fails because of the automatic transcoding.
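The difference between text mode and binary mode can be sketched as follows. This is a minimal illustration rather than Gramps code: the scratch file, its name and the sample text are assumptions made up for the example.

```python
from __future__ import unicode_literals

import io
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")

with io.open(path, "w", encoding="utf-8") as handle:
    # Text mode accepts Unicode strings and encodes them on the way out.
    handle.write("\u00c5ngstr\u00f6m\n")
    # Raw bytes are rejected by a text-mode handle.
    try:
        handle.write(b"\x89PNG")
        bytes_rejected = False
    except TypeError:
        bytes_rejected = True

# Text mode decodes the file back into a Unicode string.
with io.open(path, "r", encoding="utf-8") as handle:
    text = handle.read()

# Binary mode returns the undecoded bytes of the same file.
with io.open(path, "rb") as handle:
    raw = handle.read()

os.remove(path)
os.rmdir(tmp_dir)
```

Because the two accented characters each occupy two bytes in UTF-8, raw is longer than text, which is a quick way to see that the two modes really do return different things.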

Several default encodings may apply to I/O functions when no encoding is specified. They depend upon the encoding of the locale used to start Python, and calling locale.setlocale() has no effect upon them. The encoding used to write a file is not recorded in the file unless you include code that does so, so it is much easier to use a single encoding for all files. To ensure maximum portability we've selected UTF-8 as that standard encoding.
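To see which defaults are in effect on a given machine, something like the following can be run under each Python version and locale. Which of these values the paragraph above has in mind is an assumption; the four shown are simply the ones the standard library exposes.

```python
import locale
import sys

# Each value may differ from the others and from one machine to the next.
defaults = {
    "sys.getdefaultencoding()": sys.getdefaultencoding(),
    "sys.getfilesystemencoding()": sys.getfilesystemencoding(),
    "locale.getpreferredencoding(False)": locale.getpreferredencoding(False),
    # stdout may have no encoding at all, for example when it is replaced.
    "sys.stdout.encoding": getattr(sys.stdout, "encoding", None),
}

for name, value in sorted(defaults.items()):
    print("%s = %s" % (name, value))
```

Comparing the output across platforms shows why relying on any of these defaults is fragile and why an explicit encoding='utf-8' is preferred.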

If a character in a string cannot be encoded in the target encoding, the I/O statement will raise a UnicodeError, because the default error handler is 'strict'. This can be avoided by passing an errors parameter with one of the relaxed values (e.g. 'backslashreplace'); otherwise the I/O should be wrapped in a try: ... except UnicodeError: block.
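Both options can be sketched side by side. The ASCII target encoding is chosen here only to force a failure, and the file name is hypothetical.

```python
from __future__ import unicode_literals

import io
import os
import tempfile

text = "caf\u00e9"  # the accented e has no ASCII representation

# Hypothetical scratch file, used only for this illustration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "ascii_only.txt")

# Option 1: keep the default 'strict' handler and catch the exception.
try:
    with io.open(path, "w", encoding="ascii") as handle:
        handle.write(text)
    strict_failed = False
except UnicodeError:
    strict_failed = True

# Option 2: a relaxed handler replaces what cannot be encoded.
with io.open(path, "w", encoding="ascii",
             errors="backslashreplace") as handle:
    handle.write(text)

with io.open(path, "r", encoding="ascii") as handle:
    written = handle.read()

os.remove(path)
os.rmdir(tmp_dir)
```

With 'backslashreplace' the unencodable character is written as an escape sequence, so nothing is lost silently, but the file no longer contains the original character. Whether that trade-off is acceptable depends on what the file is for.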

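Python 3 does not include a print statement, only a print function. Python 2.7 provides both. For compatibility always use the print function. A call with a single argument, such as print(text), behaves the same in both versions without importing anything from __future__; calls that pass several arguments or keywords such as sep=, end= or file= need from __future__ import print_function in Python 2.7.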
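A small sketch of the print function in use. The helper report() and its arguments are hypothetical. A single-argument print(x) works in both versions without any import; the __future__ import appears here only because the sketch uses sep= and file=, which the Python 2.7 print statement does not accept.

```python
from __future__ import print_function, unicode_literals

import sys


def report(name, count, stream=sys.stdout):
    """Print one 'name: count' line to the given stream (hypothetical helper)."""
    print(name, count, sep=": ", file=stream)


report("People", 42)
```

Passing the stream as a parameter keeps the helper easy to redirect, for instance to sys.stderr or to an in-memory buffer.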

Note that string literals are also Unicode in Python 3. One can (and mostly should) also make all literals Unicode in Python 2 by including the line

 from __future__ import unicode_literals

near the top of each file.
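The effect of the import can be checked with a few lines. The name text_type is an assumption introduced for the example, not something Gramps defines.

```python
from __future__ import unicode_literals

import sys

# The text type is 'str' in Python 3 and 'unicode' in Python 2.
# The right-hand branch is evaluated only under Python 2.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821

name = "Gramps"   # a Unicode string in both versions, thanks to the import
raw = b"Gramps"   # a byte string in both versions

is_text = isinstance(name, text_type)
is_bytes = isinstance(raw, bytes)
```

Byte strings that must stay bytes, such as file signatures, need an explicit b prefix once the import is in place.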

Standard

Unless you understand exactly what you're doing and have tested carefully in multiple locales with both Python 2 and Python 3, always:

  • Use io.open() instead of open().
  • Open binary files (e.g. media) in binary mode ('rb' or 'wb', not just 'r' or 'w').
  • Specify encoding='utf-8' unless writing a standard filetype which specifies some other encoding.
  • Specify relaxed Unicode error handling to io.open() or provide a UnicodeError exception handler.
  • Put from __future__ import unicode_literals near the top of each file.
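Taken together, the rules above might be wrapped in a few helpers along these lines. This is a sketch under stated assumptions, not code from Gramps: the names ENCODING, write_text, read_text and read_binary are hypothetical, and the choice of 'replace' for reading is an assumption made because 'backslashreplace' is not accepted for decoding by every Python version.

```python
from __future__ import unicode_literals

import io

ENCODING = "utf-8"  # the single encoding used for all text files


def write_text(path, text, errors="backslashreplace"):
    """Write a Unicode string as UTF-8, escaping anything unencodable."""
    with io.open(path, "w", encoding=ENCODING, errors=errors) as handle:
        handle.write(text)


def read_text(path, errors="replace"):
    """Read a UTF-8 file, substituting U+FFFD for undecodable bytes."""
    with io.open(path, "r", encoding=ENCODING, errors=errors) as handle:
        return handle.read()


def read_binary(path):
    """Read a binary file (e.g. media) as raw bytes, with no decoding."""
    with io.open(path, "rb") as handle:
        return handle.read()
```

Centralising the encoding and error handling in one place means a later change of policy touches only these helpers rather than every call to io.open().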