Our last two email Unicode issues are outright bugs that we must work
around today, though they will almost certainly be fixed in a future
Python release. The first breaks message text generation for all but
trivial messages: the email package today no longer supports
generation of full mail text for messages that contain any binary
parts, such as images or audio files. Without coding workarounds, only
simple emails that consist entirely of text parts can be composed and
generated in Python 3.1’s email package; any MIME-encoded binary part
causes mail text generation to fail.
This is a bit tricky to understand without poring over email’s source
code (which, thankfully, we can in the land of open source), but to
demonstrate the issue, first notice how simple text payloads are
rendered as full message text when printed, as we’ve already seen:
C:\...\PP4E\Internet\Email> python
>>> from email.message import Message             # generic message object
>>> m = Message()
>>> m['From'] = '[email protected]'
>>> m.set_payload(open('text.txt').read())        # payload is str text
>>> print(m)                                      # print uses as_string()
From: [email protected]

spam
Spam
SPAM!
As we’ve also seen, for convenience, the email package also provides
subclasses of the Message object, tailored to add message headers that
provide the extra descriptive details used by email clients to know
how to process the data:
>>> from email.mime.text import MIMEText          # Message subclass with headers
>>> text = open('text.txt').read()
>>> m = MIMEText(text)                            # payload is str text
>>> m['From'] = '[email protected]'
>>> print(m)
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: [email protected]

spam
Spam
SPAM!
This works for text, but watch what happens when we try to
render a message part with truly binary data, such as an image that
could not be decoded as Unicode text:
>>> from email.message import Message             # generic Message object
>>> m = Message()
>>> m['From'] = '[email protected]'
>>> bytes = open('monkeys.jpg', 'rb').read()      # read binary bytes (not Unicode)
>>> m.set_payload(bytes)                          # we set the payload to bytes
>>> print(m)
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\generator.py", line 155, in _handle_text
    raise TypeError('string payload expected: %s' % type(payload))
TypeError: string payload expected: <class 'bytes'>
>>> m.get_payload()[:20]
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00x\x00x\x00\x00'
The problem here is that the email package’s text generator assumes
that the message’s payload data is a Base64 (or similar) encoded str
text string by generation time, not bytes. Really, the error is
probably our fault in this case, because we set the payload to raw
bytes manually. We should use the MIMEImage MIME subclass tailored
for images; if we do, the email package internally performs Base64
MIME email encoding on the data when the message object is created.
Unfortunately, it still leaves it as bytes, not str, despite the fact
that the whole point of Base64 is to change binary data to text
(though the exact Unicode flavor this text should take may be
unclear). This leads to additional failures in Python 3.1:
>>> from email.mime.image import MIMEImage        # Message subclass with hdrs+base64
>>> bytes = open('monkeys.jpg', 'rb').read()      # read binary bytes again
>>> m = MIMEImage(bytes)                          # MIME class does Base64 on data
>>> print(m)
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\generator.py", line 155, in _handle_text
    raise TypeError('string payload expected: %s' % type(payload))
TypeError: string payload expected: <class 'bytes'>
>>> m.get_payload()[:40]                          # this is already Base64 text
b'/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIB'
>>> m.get_payload()[:40].decode('ascii')          # but it's still bytes internally!
'/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIB'
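The bytes-versus-str distinction behind this failure can be seen in isolation with the standard base64 module. The following is only a minimal sketch, not code from the email package; the sample data is the first ten bytes of the JPEG payload shown earlier, and the variable names are chosen here for illustration:

```python
import base64

# First bytes of a JPEG file, copied from the payload displayed earlier
raw = b'\xff\xd8\xff\xe0\x00\x10JFIF'

encoded = base64.b64encode(raw)       # Base64 "text", but returned as bytes in 3.X
text = encoded.decode('ascii')        # the extra step needed to obtain a str

print(type(encoded).__name__)         # bytes
print(type(text).__name__)            # str
```

Because Base64 output uses only ASCII characters, decoding it with the 'ascii' codec should always succeed.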
In other words, not only does the Python 3.1 email package not fully
support the Python 3.X Unicode/bytes dichotomy, it was actually
broken by it. Luckily, there’s a workaround for this case.
To address this specific issue, I opted to create a custom encoding
function for binary MIME attachments, and pass it in to the email
package’s MIME message object subclasses for all binary data types.
This custom function is coded in the upcoming mailtools package of
this chapter (Example 13-23). Because it is used by email to encode
from bytes to text at initialization time, it is able to decode to
ASCII text per Unicode as an extra step, after running the original
call to perform Base64 encoding and arrange content-encoding headers.
The fact that email does not do this extra Unicode decoding step
itself is a genuine bug in that package (albeit one introduced by
changes elsewhere in Python’s standard libraries), but the workaround
does its job:
# in mailtools.mailSender module ahead in this chapter...

def fix_encode_base64(msgobj):
    from email.encoders import encode_base64
    encode_base64(msgobj)                # what email does normally: leaves bytes
    bytes = msgobj.get_payload()         # bytes fails in email pkg on text gen
    text  = bytes.decode('ascii')        # decode to unicode str so text gen works
    ...line splitting logic omitted...
    msgobj.set_payload('\n'.join(lines))
>>> from email.mime.image import MIMEImage
>>> from mailtools.mailSender import fix_encode_base64     # use custom workaround
>>> bytes = open('monkeys.jpg', 'rb').read()
>>> m = MIMEImage(bytes, _encoder=fix_encode_base64)       # convert to ascii str
>>> print(m.as_string()[:500])
Content-Type: image/jpeg
MIME-Version: 1.0
Content-Transfer-Encoding: base64

/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYHBwcG
BwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcIDAwMDAwM
DAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAARCAHoAvQDASIA
AhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQA
AAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3
ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc
>>> print(m)                                               # to print the entire message: very long
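For readers who want to experiment before reaching Example 13-23, here is one way the omitted line-splitting step might be filled in. This is a hedged sketch, not the mailtools code itself: the function name fix_encode_base64_sketch, the linelen parameter, and the 76-character default (taken from MIME’s usual encoded line limit) are all assumptions made for illustration, and the sample data stands in for a real image file:

```python
from email.encoders import encode_base64
from email.mime.image import MIMEImage

def fix_encode_base64_sketch(msgobj, linelen=76):
    """Hypothetical encoder: Base64-encode, then force a str payload."""
    encode_base64(msgobj)                      # normal email encoding step
    payload = msgobj.get_payload()
    if isinstance(payload, bytes):             # 3.1 leaves bytes; later releases may not
        payload = payload.decode('ascii')      # Base64 text is always ASCII
    payload = payload.replace('\n', '')        # drop any existing line breaks
    lines = [payload[i:i + linelen]            # re-split into fixed-width lines
             for i in range(0, len(payload), linelen)]
    msgobj.set_payload('\n'.join(lines))

data = bytes(range(256)) * 2                   # stand-in for image bytes
m = MIMEImage(data, _subtype='jpeg', _encoder=fix_encode_base64_sketch)
text = m.get_payload()                         # now a str, safe for text generation
```

Passing _subtype explicitly sidesteps image-type guessing for the stand-in data. Checking for bytes before decoding lets the same function run whether or not the underlying encoder already returns str.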
Another possible workaround involves defining a custom MIMEImage
class that is like the original but does not attempt to perform
Base64 encoding on creation; that way, we could encode and translate
to str before message object creation, but still make use of the
original class’s header-generation logic. If you take this route,
though, you’ll find that it requires repeating (really, cutting and
pasting) far too much of the original logic to be reasonable, and
this repeated code would have to mirror any future email changes:
>>> from email.mime.nonmultipart import MIMENonMultipart
>>> class MyImage(MIMENonMultipart):
...     def __init__(self, imagedata, subtype):
...         MIMENonMultipart.__init__(self, 'image', subtype)
...         self.set_payload(imagedata)
...         ...repeat all the base64 logic here, with an extra ASCII Unicode decode...
...
>>> m = MyImage(text_from_bytes, 'jpeg')
Interestingly, this regression in email actually reflects an
unrelated change in Python’s base64 module made in 2007, which was
completely benign until the Python 3.X bytes/str differentiation came
online. Prior to that, the email encoder worked in Python 2.X,
because bytes was really str. In 3.X, though, because base64 returns
bytes, the normal mail encoder in email also leaves the payload as
bytes, even though it’s been encoded to Base64 text form. This in
turn breaks email text generation, because it assumes the payload is
text in this case, and requires it to be str. As is common in
large-scale software systems, the effects of some 3.X changes may
have been difficult to anticipate or accommodate in full.
By contrast, parsing binary attachments (as opposed to generating
text for them) works fine in 3.X, because the parsed message payload
is saved in message objects as a Base64-encoded str string, not
bytes, and is converted to bytes only when fetched. This bug seems
likely to also go away in a future Python and email package (perhaps
even as a simple patch in Python 3.2), but it’s more serious than the
other Unicode decoding issues described here, because it prevents
mail composition for all but trivial mails.
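The parsing side of this asymmetry can be sketched as follows. The message text here is built by hand for illustration (it is an assumption, not one of this book’s examples), and it reuses the first bytes of the JPEG data shown earlier; the parsed payload is held as Base64 str and becomes bytes only when fetched with decode=True:

```python
import base64
from email import message_from_string

raw = b'\xff\xd8\xff\xe0\x00\x10JFIF'          # sample binary data
mail_text = (
    'Content-Type: image/jpeg\n'
    'MIME-Version: 1.0\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    + base64.b64encode(raw).decode('ascii') + '\n'
)

msg = message_from_string(mail_text)           # parse full message text
stored = msg.get_payload()                     # Base64 text, kept as str
decoded = msg.get_payload(decode=True)         # converted to bytes on fetch

print(type(stored).__name__, type(decoded).__name__)
```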
The flexibility afforded by the package and the Python language
allows such a workaround to be developed external to the package,
rather than hacking the package’s code directly. With open source and
forgiving APIs, you rarely are truly
stuck.
Late-breaking news: This section’s bug is
scheduled to be fixed in Python 3.2, making our workaround here
unnecessary in this and later Python releases. This is per
communications with members of Python’s email special interest group
(on the “email-sig” mailing list).
Regrettably, this fix didn’t appear until after this chapter
and its examples had been written. I’d like to remove the workaround
and its description entirely, but this book is based on Python 3.1,
both before and after the fix was incorporated.
So that it works under Python 3.2 alpha, too, though, the
workaround code ahead was specialized just before publication to
check for bytes prior to decoding. Moreover, the workaround still
must manually split lines in Base64 data, because 3.2 still does
not.
Our final email Unicode issue
is as severe as the prior one: changes like that of the
prior section introduced yet another regression for mail composition.
In short, it’s impossible to make text message parts today without
specializing for different Unicode encodings.
Some types of text are automatically MIME-encoded for transmission.
Unfortunately, because of the str/bytes split, the MIME text message
class in email now requires different string object types for
different Unicode encodings. The net effect is that you now have to
know how the email package will process your text data when making a
text message object, or repeat most of its logic redundantly.
For example, to properly generate Unicode encoding headers and
apply required MIME encodings, here’s how we must proceed today for
common Unicode text types:
>>> m = MIMEText('abc', _charset='ascii')         # pass text for ascii
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

abc
>>> m = MIMEText('abc', _charset='latin-1')       # pass text for latin-1
>>> print(m)                                      # but not for 'latin1': ahead
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

abc
>>> m = MIMEText(b'abc', _charset='utf-8')        # pass bytes for utf8
>>> print(m)
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

YWJj
This works, but if you look closely, you’ll notice that we must pass
str to the first two, but bytes to the third. That requires that we
special-case code for Unicode types based upon the package’s internal
operation. Types other than those expected for a Unicode encoding
don’t work at all, because of newly invalid str/bytes combinations
that occur inside the email package in 3.1:
>>> m = MIMEText('abc', _charset='ascii')
>>> m = MIMEText(b'abc', _charset='ascii')        # bug: assumes 2.X str
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\encoders.py", line 60, in encode_7or8bit
    orig.encode('ascii')
AttributeError: 'bytes' object has no attribute 'encode'
>>> m = MIMEText('abc', _charset='latin-1')
>>> m = MIMEText(b'abc', _charset='latin-1')      # bug: qp uses str
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\quoprimime.py", line 176, in body_encode
    if line.endswith(CRLF):
TypeError: expected an object with the buffer interface
>>> m = MIMEText(b'abc', _charset='utf-8')
>>> m = MIMEText('abc', _charset='utf-8')         # bug: base64 uses bytes
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\base64mime.py", line 94, in body_encode
    enc = b2a_base64(s[i:i + max_unencoded]).decode("ascii")
TypeError: must be bytes or buffer, not str
Moreover, the email package is pickier about encoding name synonyms
than Python and most other tools are: “latin-1” is detected as a
quoted-printable MIME type, but “latin1” is unknown and so defaults
to Base64 MIME. In fact, this is why Base64 was used for the “latin1”
Unicode type earlier in this section, an encoding choice that is
irrelevant to any recipient that understands the “latin1” synonym,
including Python itself. Unfortunately, that means that we also need
to pass in a different string type if we use a synonym the package
doesn’t understand today:
>>> m = MIMEText('abc', _charset='latin-1')       # str for 'latin-1'
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

abc
>>> m = MIMEText('abc', _charset='latin1')
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\base64mime.py", line 94, in body_encode
    enc = b2a_base64(s[i:i + max_unencoded]).decode("ascii")
TypeError: must be bytes or buffer, not str
>>> m = MIMEText(b'abc', _charset='latin1')       # bytes for 'latin1'!
>>> print(m)
Content-Type: text/plain; charset="latin1"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

YWJj
There are ways to add aliases and new encoding types in the email
package, but they’re not supported out of the box. Programs that care
about being robust would have to cross-check the user’s spelling,
which may be valid for Python itself, against that expected by email.
This also holds true if your data is not ASCII in general: you’ll
have to first decode to text in order to use the expected “latin-1”
name because its quoted-printable MIME encoding expects str, even
though bytes are required if “latin1” triggers the default Base64
MIME:
>>> m = MIMEText(b'A\xe4B', _charset='latin1')
>>> print(m)
Content-Type: text/plain; charset="latin1"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

QeRC
>>> m = MIMEText(b'A\xe4B', _charset='latin-1')
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\quoprimime.py", line 176, in body_encode
    if line.endswith(CRLF):
TypeError: expected an object with the buffer interface
>>> m = MIMEText(b'A\xe4B'.decode('latin1'), _charset='latin-1')
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

A=E4B
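One way a program might begin such a cross-check is to ask Python itself whether two spellings name the same codec, using the standard codecs module. This is only a sketch of the idea; the helper name same_python_codec is hypothetical, and it says nothing about which spelling the email package expects:

```python
import codecs

def same_python_codec(name1, name2):
    """Hypothetical helper: True if Python maps both names to one codec."""
    return codecs.lookup(name1).name == codecs.lookup(name2).name

print(same_python_codec('latin1', 'latin-1'))     # synonyms to Python
print(same_python_codec('utf8', 'utf-8'))         # synonyms to Python
print(same_python_codec('latin1', 'utf8'))        # different codecs
```

A program could use a check like this to replace a user’s spelling with the one email recognizes before building the message object.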
In fact, the text message object doesn’t check to see that the data
you’re MIME-encoding is valid per Unicode in general: we can send
invalid UTF-8 text, but the receiver may have trouble decoding it:
>>> m = MIMEText(b'A\xe4B', _charset='utf-8')
>>> print(m)
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

QeRC
>>> b'A\xe4B'.decode('utf8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected...
>>> import base64
>>> base64.b64decode(b'QeRC')
b'A\xe4B'
>>> base64.b64decode(b'QeRC').decode('utf')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected...
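Since the package does not validate the data, a sender that cares could check it before building the message. The following is a minimal sketch of such a check; the function name is_valid_for is hypothetical, and this is not something the email package or this book’s examples provide:

```python
def is_valid_for(data, charset):
    """Hypothetical check: can these bytes be decoded per this encoding?"""
    try:
        data.decode(charset)
    except (UnicodeDecodeError, LookupError):   # bad data, or unknown encoding name
        return False
    return True

print(is_valid_for(b'A\xe4B', 'utf-8'))         # False: not valid UTF-8
print(is_valid_for(b'A\xe4B', 'latin-1'))       # True: valid Latin-1
```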
So what can we do when we need to attach text to composed messages,
given that the text’s datatype requirement is indirectly dictated by
its Unicode encoding name? The generic Message superclass doesn’t
help here directly if we specify an encoding, as it exhibits the same
encoding-specific behavior:
>>> m = Message()
>>> m.set_payload('spam', charset='us-ascii')
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

spam
>>> m = Message()
>>> m.set_payload(b'spam', charset='us-ascii')
AttributeError: 'bytes' object has no attribute 'encode'
>>> m.set_payload('spam', charset='utf-8')
TypeError: must be bytes or buffer, not str
Although we could try to work around these issues by repeating much
of the code that email runs, the redundancy would make us hopelessly
tied to its current implementation and dependent upon its future
changes. The following, for example, parrots the steps that email
runs internally to create a text message object for ASCII encoding
text; unlike the MIMEText class, this approach allows all data to be
read from files as binary byte strings, even if it’s simple ASCII:
>>> m = Message()
>>> m.add_header('Content-Type', 'text/plain')
>>> m['MIME-Version'] = '1.0'
>>> m.set_param('charset', 'us-ascii')
>>> m.add_header('Content-Transfer-Encoding', '7bit')
>>> data = b'spam'
>>> m.set_payload(data.decode('ascii'))           # data read as bytes here
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

spam
>>> print(MIMEText('spam', _charset='ascii'))     # same, but type-specific
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

spam
To do the same for other kinds of text that require MIME
encoding, just insert an extra encoding step; although we’re concerned
with text parts here, a similar imitative approach could address the
binary parts text generation bug we met earlier:
>>> m = Message()
>>> m.add_header('Content-Type', 'text/plain')
>>> m['MIME-Version'] = '1.0'
>>> m.set_param('charset', 'utf-8')
>>> m.add_header('Content-Transfer-Encoding', 'base64')
>>> data = b'spam'
>>> from binascii import b2a_base64               # add MIME encode if needed
>>> data = b2a_base64(data)                       # data read as bytes here too
>>> m.set_payload(data.decode('ascii'))
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

c3BhbQ==
>>> print(MIMEText(b'spam', _charset='utf-8'))    # same, but type-specific
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

c3BhbQ==
This works, but besides the redundancy and dependency it creates, to
use this approach broadly we’d also have to generalize to account for
all the various kinds of Unicode encodings and MIME encodings
possible, as the email package already does internally. We might also
have to support encoding name synonyms to be flexible, adding further
redundancy. In other words, this requires additional work, and in the
end, we’d still have to specialize our code for different Unicode
types.
Any way we go, some dependence on the current implementation seems
unavoidable today. It seems the best we can do here, apart from
hoping for an improved email package in a few years’ time, is to
specialize text message construction calls by Unicode type, and
assume both that encoding names match those expected by the package
and that message data is valid for the Unicode type selected. Here is
the sort of arguably magic code that the upcoming mailtools package
(again in Example 13-23) will apply to choose text types:
>>> from email.charset import Charset, BASE64, QP
>>> for e in ('us-ascii', 'latin-1', 'utf8', 'latin1', 'ascii'):
...     cset = Charset(e)
...     benc = cset.body_encoding
...     if benc in (None, QP):
...         print(e, benc, 'text')                # read/fetch data as str
...     else:
...         print(e, benc, 'binary')              # read/fetch data as bytes
...
us-ascii None text
latin-1 1 text
utf8 2 binary
latin1 2 binary
ascii None text
We’ll proceed this way in this book, with the major caveat that this
will almost certainly require changes in the future because of its
strong coupling with the current email implementation.
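The selection logic in the preceding loop could be packaged as a small helper, sketched here under the assumption that the 3.1 rule described in this section applies (str for unencoded and quoted-printable bodies, bytes for Base64 bodies). The function name payload_type_for is hypothetical and is not the name used in Example 13-23:

```python
from email.charset import Charset, QP

def payload_type_for(encoding_name):
    """Hypothetical helper: which data type to read/fetch for this encoding."""
    benc = Charset(encoding_name).body_encoding
    if benc in (None, QP):
        return 'text'          # pass str: no MIME encoding, or quoted-printable
    else:
        return 'binary'        # pass bytes: Base64 MIME encoding

for name in ('us-ascii', 'latin-1', 'utf-8'):
    print(name, payload_type_for(name))
```

A caller might use the result to choose between opening a file in text mode or binary mode before creating the message object.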
Late-breaking news: Like the prior section, it now appears that this
section’s bug will also be fixed in Python 3.2, making the workaround
here unnecessary in this and later Python releases. The nature of the
fix is unknown, though, and we still need the workaround for the
version of Python current when this chapter was written. As of just
before publication, the alpha release of 3.2 is still somewhat type
specific on this issue, but now accepts either str or bytes for text
that triggers Base64 encodings, instead of just bytes.