Expert Python Programming (Third Edition)

Strings and bytes

The topic of strings may confuse programmers who have worked only in Python 2. In Python 3, there is only one datatype capable of storing textual information: str, or simply string. It is an immutable sequence that stores Unicode code points. This is the major difference from Python 2, where str represented byte strings, something that is now handled by bytes objects (but not in exactly the same way).

Strings in Python are sequences. This single fact should be enough to include them in a section covering other container types. But they differ from other container types in one important detail: strings are strictly limited in the type of data they can store, and that is Unicode text.

bytes and its mutable alternative, bytearray, differ from str by allowing only bytes as sequence values, and bytes in Python are integers in the 0 <= x < 256 range. This may be a bit confusing at first because, when printed, they may look very similar to strings:

>>> print(bytes([102, 111, 111]))
b'foo'

The bytes and bytearray types allow you to work with raw binary data that is not necessarily textual (for example, audio/video files, images, and network packets). The true nature of these types is revealed when they are converted into other sequence types, such as list or tuple:

>>> list(b'foo bar')
[102, 111, 111, 32, 98, 97, 114]
>>> tuple(b'foo bar')
(102, 111, 111, 32, 98, 97, 114)
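The integer nature of bytes also shows up in indexing, and the difference between the immutable bytes and the mutable bytearray shows up in item assignment. The following is a minimal sketch; the variable names are only illustrative:

```python
data = b'foo bar'

# Indexing a bytes object yields an integer, not a one-byte bytes object.
first = data[0]            # 102, the code of ASCII 'f'

# Slicing, in contrast, yields another bytes object.
head = data[:3]            # b'foo'

# bytes is immutable, so item assignment raises TypeError.
try:
    data[0] = 70
    mutated = True
except TypeError:
    mutated = False

# bytearray supports in-place modification.
buffer = bytearray(data)
buffer[0] = 70             # 70 is the code of ASCII 'F'

print(first, head, mutated, buffer)
```

Running this should print 102 b'foo' False bytearray(b'Foo bar').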

A lot of Python 3 controversy was about breaking the backwards compatibility for string literals and how Python deals with Unicode. Starting from Python 3.0, every string literal without any prefix is Unicode. So, literals enclosed by single quotes ('), double quotes ("), or groups of three quotes (single or double) without any prefix represent the str data type:

>>> type("some string")
<class 'str'>

In Python 2, Unicode literals required a u prefix (like u"some string"). Python 3.0 dropped this prefix, but it has been accepted again since Python 3.3 for backwards compatibility. In Python 3, it has no effect on the resulting object.
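As a quick check, assuming Python 3.3 or newer, a prefixed and an unprefixed literal produce equal objects of the same type:

```python
plain = "some string"
prefixed = u"some string"   # the u prefix is accepted but changes nothing

same_type = type(plain) is type(prefixed) is str
same_value = plain == prefixed

print(same_type, same_value)
```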

Bytes literals were already presented in some of the previous examples, but let's explicitly present their syntax for the sake of consistency. Bytes literals are enclosed by single quotes, double quotes, or triple quotes, but must be preceded by a b or B prefix:

>>> type(b"some bytes")
<class 'bytes'>

Note that Python does not provide a syntax for bytearray literals. If you want to create a bytearray value, you need to pass a bytes literal to the bytearray() type constructor:

>>> bytearray(b'some bytes')
bytearray(b'some bytes')
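A bytes literal is not the only accepted source. As a short sketch, the bytearray() constructor also takes an iterable of integers, an integer size, or a string together with an encoding:

```python
# From an iterable of integers in the 0 <= x < 256 range.
from_ints = bytearray([102, 111, 111])

# From an integer: a zero-filled buffer of that length.
zero_filled = bytearray(3)

# From a string: an encoding has to be given explicitly.
from_text = bytearray("foo", "utf-8")

print(from_ints, zero_filled, from_text)
```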

It is important to remember that Unicode strings contain abstract text that is independent of any byte representation. They cannot be saved to disk or sent over the network without first being encoded to binary data. There are two ways to encode string objects into byte sequences:

Using the str.encode(encoding, errors) method, which encodes the string using a registered codec. The codec is specified using the encoding argument, which defaults to 'utf-8'. The second argument, errors, specifies the error handling scheme. It can be 'strict' (default), 'ignore', 'replace', 'xmlcharrefreplace', or any other registered handler (refer to the built-in codecs module documentation).
Using the bytes(source, encoding, errors) constructor, which creates a new bytes sequence. When the source is of the str type, then the encoding argument is obligatory and it does not have a default value. The usage of the encoding and errors arguments is the same as for the str.encode() method.
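Both encoding routes and a few of the error handlers can be compared side by side. The following is a minimal sketch; the sample text and variable names are chosen only for illustration:

```python
text = "caf\u00e9"  # 'café'; the escape keeps this example pure ASCII

# Both routes produce the same UTF-8 bytes.
utf8_via_method = text.encode("utf-8")
utf8_via_constructor = bytes(text, "utf-8")

# encode() falls back to UTF-8 when no encoding is given...
utf8_default = text.encode()

# ...whereas bytes() refuses a str source without an encoding.
try:
    bytes(text)
    constructor_needs_encoding = False
except TypeError:
    constructor_needs_encoding = True

# The errors argument decides what happens to unencodable characters.
try:
    text.encode("ascii")  # errors='strict' is the default
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode("ascii", "ignore")
replaced = text.encode("ascii", "replace")
xml_refs = text.encode("ascii", "xmlcharrefreplace")

print(utf8_via_method, ignored, replaced, xml_refs)
```

Here the strict ASCII encoding fails, 'ignore' drops the accented character, 'replace' substitutes a question mark, and 'xmlcharrefreplace' emits the numeric character reference &#233;.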

Binary data represented by bytes can be converted into a string in an analogous way:

Using the bytes.decode(encoding, errors) method, which decodes the bytes using the codec registered for encoding. The arguments of this method have the same meaning and defaults as the arguments of str.encode().
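Using the str(source, encoding, errors) constructor, which creates a new string instance. Unlike the bytes() constructor, str() does not raise an error when a bytes source is passed without an encoding: if neither encoding nor errors is given, it returns the informal representation of the object (for example, "b'foo'") instead of decoding it. To actually decode the bytes, pass the encoding explicitly.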
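The decoding direction can be sketched in the same way. The sample bytes below are the UTF-8 encoding of 'café', and the variable names are only illustrative. Note that calling str() on a bytes object without an encoding does not decode anything; it returns the textual representation of the object:

```python
raw = b'caf\xc3\xa9'  # UTF-8 encoding of 'café'

# Both routes decode to the same string.
via_method = raw.decode("utf-8")
via_constructor = str(raw, "utf-8")

# Decoding with the wrong codec either fails...
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# ...or silently yields the wrong text (latin-1 maps every byte to a character).
mojibake = raw.decode("latin-1")

# Error handlers work for decoding as well.
ignored = raw.decode("ascii", "ignore")

# Without an encoding, str() returns the repr-like form, not decoded text.
not_decoded = str(raw)

print(via_method, ascii_failed, ignored, not_decoded)
```

The latin-1 result illustrates why the codec used for decoding should match the one used for encoding: no exception is raised, but the two UTF-8 bytes come back as two unrelated characters.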

Naming – bytes versus byte string
Due to changes made in Python 3, some people tend to refer to bytes instances as byte strings. This is mostly for historical reasons: bytes in Python 3 is the sequence type closest to the str type from Python 2 (but not the same). Still, a bytes instance is a sequence of bytes and need not represent textual data. So, to avoid any confusion, it is advised to always refer to them as either bytes or byte sequences, despite their similarities to strings. The concept of strings is reserved for textual data in Python 3, and this is now always str.

Let's look into the implementation details of strings and bytes.