Authors: Mark Lutz

Tags: #COMPUTERS / Programming Languages / Python

Programming Python

Workaround: Message text generation for binary attachment payloads is broken

Our last two email Unicode issues are outright bugs which we must
work around today, though they will almost certainly be fixed in a
future Python release. The first breaks message text generation for
all but trivial messages: the email package today no longer supports
generation of full mail text for messages that contain any binary
parts, such as images or audio files. Without coding workarounds,
only simple emails that consist entirely of text parts can be
composed and generated in Python 3.1’s email package; any
MIME-encoded binary part causes mail text generation to fail.

This is a bit tricky to understand without poring over
email’s source code (which, thankfully, we
can in the land of open source), but to demonstrate the issue, first
notice how simple text payloads are rendered as full message text when
printed as we’ve already seen:

C:\...\PP4E\Internet\Email> python
>>> from email.message import Message          # generic message object
>>> m = Message()
>>> m['From'] = '[email protected]'
>>> m.set_payload(open('text.txt').read())     # payload is str text
>>> print(m)                                   # print uses as_string()
From: [email protected]

spam
Spam
SPAM!

As we’ve also seen, for convenience the email package provides
subclasses of the Message object, tailored to add message headers
that provide the extra descriptive details used by email clients to
know how to process the data:

>>> from email.mime.text import MIMEText       # Message subclass with headers
>>> text = open('text.txt').read()
>>> m = MIMEText(text)                         # payload is str text
>>> m['From'] = '[email protected]'
>>> print(m)
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: [email protected]

spam
Spam
SPAM!
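
The headers that MIMEText adds can also be inspected through the
message object rather than by printing it. The following is a minimal
sketch, assuming a current Python 3 release and a literal string in
place of the text.txt file used above; the variable names are
illustrative only.

```python
from email.mime.text import MIMEText

# Literal text stands in for the contents of text.txt (an assumption).
part = MIMEText('spam\nSpam\nSPAM!\n')

# Details recorded in the headers that the subclass adds automatically.
content_type = part.get_content_type()        # main/sub type only
charset = part.get_content_charset()          # the charset parameter
transfer = part['Content-Transfer-Encoding']  # how the body is encoded

# Walk every header as a (name, value) pair.
for name, value in part.items():
    print(name, '=>', value)
```

The same accessors work on the generic Message class, but there the
headers are present only if the program sets them itself.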

This works for text, but watch what happens when we try to
render a message part with truly binary data, such as an image that
could not be decoded as Unicode text:

>>> from email.message import Message          # generic Message object
>>> m = Message()
>>> m['From'] = '[email protected]'
>>> bytes = open('monkeys.jpg', 'rb').read()   # read binary bytes (not Unicode)
>>> m.set_payload(bytes)                       # we set the payload to bytes
>>> print(m)
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\generator.py", line 155, in _handle_text
    raise TypeError('string payload expected: %s' % type(payload))
TypeError: string payload expected: <class 'bytes'>
>>> m.get_payload()[:20]
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00x\x00x\x00\x00'

The problem here is that the email package’s text generator assumes
that the message’s payload data is a Base64 (or similar) encoded str
text string by generation time, not bytes. Really, the error is
probably our fault in this case, because we set the payload to raw
bytes manually. We should use the MIMEImage MIME subclass tailored
for images; if we do, the email package internally performs Base64
MIME email encoding on the data when the message object is created.
Unfortunately, it still leaves it as bytes, not str, despite the fact
that the whole point of Base64 is to change binary data to text
(though the exact Unicode flavor this text should take may be
unclear). This leads to additional failures in Python 3.1:

>>> from email.mime.image import MIMEImage     # Message subclass with hdrs+base64
>>> bytes = open('monkeys.jpg', 'rb').read()   # read binary bytes again
>>> m = MIMEImage(bytes)                       # MIME class does Base64 on data
>>> print(m)
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\generator.py", line 155, in _handle_text
    raise TypeError('string payload expected: %s' % type(payload))
TypeError: string payload expected: <class 'bytes'>
>>> m.get_payload()[:40]                       # this is already Base64 text
b'/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIB'
>>> m.get_payload()[:40].decode('ascii')       # but it's still bytes internally!
'/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIB'

In other words, not only does the Python 3.1 email package not fully
support the Python 3.X Unicode/bytes dichotomy, it was actually
broken by it. Luckily, there’s a workaround for this case.

To address this specific issue, I opted to create a custom encoding
function for binary MIME attachments, and pass it in to the email
package’s MIME message object subclasses for all binary data types.
This custom function is coded in the upcoming mailtools package of
this chapter (Example 13-23). Because it is used by email to encode
from bytes to text at initialization time, it is able to decode to
ASCII text per Unicode as an extra step, after running the original
call to perform Base64 encoding and arrange content-encoding headers.
The fact that email does not do this extra Unicode decoding step
itself is a genuine bug in that package (albeit one introduced by
changes elsewhere in Python standard libraries), but the workaround
does its job:

# in mailtools.mailSender module ahead in this chapter...

def fix_encode_base64(msgobj):
    from email.encoders import encode_base64
    encode_base64(msgobj)                # what email does normally: leaves bytes
    bytes = msgobj.get_payload()         # bytes fails in email pkg on text gen
    text  = bytes.decode('ascii')        # decode to unicode str so text gen works
    ...line splitting logic omitted...
    msgobj.set_payload('\n'.join(lines))

>>> from email.mime.image import MIMEImage
>>> from mailtools.mailSender import fix_encode_base64    # use custom workaround
>>> bytes = open('monkeys.jpg', 'rb').read()
>>> m = MIMEImage(bytes, _encoder=fix_encode_base64)      # convert to ascii str
>>> print(m.as_string()[:500])
Content-Type: image/jpeg
MIME-Version: 1.0
Content-Transfer-Encoding: base64

/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYHBwcG
BwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcIDAwMDAwM
DAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAARCAHoAvQDASIA
AhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQA
AAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3
ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc
>>> print(m)                             # to print the entire message: very long
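
The listing above leaves out its line-splitting step. As a rough
illustration of what a complete encoder of this kind might look like,
here is a minimal sketch; it is not the mailtools code. The name
encode_base64_as_str, the LINE_LEN constant and its value of 76, and
the stand-in byte data are all assumptions made for the example. The
bytes test lets the same function run on releases where the standard
encoder already produces str.

```python
from email.encoders import encode_base64
from email.mime.image import MIMEImage

LINE_LEN = 76  # assumed maximum width for each line of Base64 text

def encode_base64_as_str(msgobj):
    """Base64-encode the payload, then store it as line-wrapped str."""
    encode_base64(msgobj)                   # standard encoder; adds the header
    payload = msgobj.get_payload()
    if isinstance(payload, bytes):          # bytes in 3.1, per the text above
        payload = payload.decode('ascii')   # Base64 output is always ASCII
    payload = payload.replace('\n', '')     # drop any existing line breaks
    lines = [payload[i:i + LINE_LEN]
             for i in range(0, len(payload), LINE_LEN)]
    msgobj.set_payload('\n'.join(lines))

# Stand-in bytes instead of a real image file (an assumption for the demo).
data = bytes(range(256)) * 4
part = MIMEImage(data, _subtype='jpeg', _encoder=encode_base64_as_str)
text = part.as_string()                     # text generation now succeeds
print(text[:200])
```

Passing the subtype explicitly avoids relying on image-type detection
for the made-up data; with a real JPEG file that argument could be
left out, as in the session above.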

Another possible workaround involves defining a custom MIMEImage
class that is like the original but does not attempt to perform
Base64 encoding on creation; that way, we could encode and translate
to str before message object creation, but still make use of the
original class’s header-generation logic. If you take this route,
though, you’ll find that it requires repeating (really, cutting and
pasting) far too much of the original logic to be reasonable, and
this repeated code would have to mirror any future email changes:

>>> from email.mime.nonmultipart import MIMENonMultipart
>>> class MyImage(MIMENonMultipart):
...     def __init__(self, imagedata, subtype):
...         MIMENonMultipart.__init__(self, 'image', subtype)
...         self.set_payload(imagedata)
...         ...repeat all the base64 logic here, with an extra ASCII Unicode decode...
>>> m = MyImage(text_from_bytes, 'jpeg')

Interestingly, this regression in email actually reflects an
unrelated change in Python’s base64 module made in 2007, which was
completely benign until the Python 3.X bytes/str differentiation came
online. Prior to that, the email encoder worked in Python 2.X,
because bytes was really str. In 3.X, though, because base64 returns
bytes, the normal mail encoder in email also leaves the payload as
bytes, even though it’s been encoded to Base64 text form. This in
turn breaks email text generation, because it assumes the payload is
text in this case, and requires it to be str. As is common in
large-scale software systems, the effects of some 3.X changes may
have been difficult to anticipate or accommodate in full.
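
The behavior of base64 described here is easy to confirm in isolation.
This is a small sketch using only the standard base64 module; the four
sample bytes are arbitrary and the variable names are illustrative.

```python
import base64

raw = b'\xff\xd8\xff\xe0'             # a few arbitrary binary bytes
encoded = base64.b64encode(raw)       # the Base64 result is bytes in 3.X
as_text = encoded.decode('ascii')     # an extra step is needed to get str

print(type(encoded).__name__, encoded)
print(type(as_text).__name__, as_text)
```

The decode step is safe because Base64 output uses only ASCII
characters.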

By contrast, parsing binary attachments (as opposed to generating
text for them) works fine in 3.X, because the parsed message payload
is saved in message objects as a Base64-encoded str string, not
bytes, and is converted to bytes only when fetched. This bug seems
likely to also go away in a future Python and email package (perhaps
even as a simple patch in Python 3.2), but it’s more serious than the
other Unicode decoding issues described here, because it prevents
mail composition for all but trivial mails.
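
The parsing side can be sketched as follows. The message text is built
by hand here purely for illustration; its content type and the sample
bytes are assumptions, not data from this chapter's examples.

```python
import base64
from email import message_from_string

raw = b'\x00\x01binary\xff'           # made-up binary attachment data
mail_text = (
    'Content-Type: application/octet-stream\n'
    'MIME-Version: 1.0\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    + base64.b64encode(raw).decode('ascii') + '\n'
)

msg = message_from_string(mail_text)
stored = msg.get_payload()               # Base64 text, held as str
fetched = msg.get_payload(decode=True)   # converted to bytes only on request

print(type(stored).__name__, repr(stored))
print(type(fetched).__name__, repr(fetched))
```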

The flexibility afforded by the package and the Python language
allows such a workaround to be developed external to the package,
rather than hacking the package’s code directly. With open source and
forgiving APIs, you rarely are truly
stuck.

Note

Late-breaking news: This section’s bug is scheduled to be fixed in
Python 3.2, making our workaround here unnecessary in this and later
Python releases. This is per communications with members of Python’s
email special interest group (on the “email-sig” mailing list).

Regrettably, this fix didn’t appear until after this chapter
and its examples had been written. I’d like to remove the workaround
and its description entirely, but this book is based on Python 3.1,
both before and after the fix was incorporated.

So that it works under Python 3.2 alpha, too, though, the
workaround code ahead was specialized just before publication to
check for bytes prior to decoding. Moreover, the workaround still
must manually split lines in Base64 data, because 3.2 still does
not.

Workaround: Message composition for non-ASCII text parts is broken

Our final email Unicode issue is as severe as the prior one: changes
like that of the prior section introduced yet another regression for
mail composition. In short, it’s impossible to make text message
parts today without specializing for different Unicode encodings.

Some types of text are automatically MIME-encoded for transmission.
Unfortunately, because of the str/bytes split, the MIME text message
class in email now requires different string object types for
different Unicode encodings. The net effect is that you now have to
know how the email package will process your text data when making a
text message object, or repeat most of its logic redundantly.

For example, to properly generate Unicode encoding headers and
apply required MIME encodings, here’s how we must proceed today for
common Unicode text types:

>>> m = MIMEText('abc', _charset='ascii')      # pass text for ascii
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

abc
>>> m = MIMEText('abc', _charset='latin-1')    # pass text for latin-1
>>> print(m)                                   # but not for 'latin1': ahead
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

abc
>>> m = MIMEText(b'abc', _charset='utf-8')     # pass bytes for utf8
>>> print(m)
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

YWJj

This works, but if you look closely, you’ll notice that we must pass
str to the first two, but bytes to the third. That requires that we
special-case code for Unicode types based upon the package’s internal
operation. Types other than those expected for a Unicode encoding
don’t work at all, because of newly invalid str/bytes combinations
that occur inside the email package in 3.1:

>>> m = MIMEText('abc', _charset='ascii')
>>> m = MIMEText(b'abc', _charset='ascii')     # bug: assumes 2.X str
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\encoders.py", line 60, in encode_7or8bit
    orig.encode('ascii')
AttributeError: 'bytes' object has no attribute 'encode'
>>> m = MIMEText('abc', _charset='latin-1')
>>> m = MIMEText(b'abc', _charset='latin-1')   # bug: qp uses str
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\quoprimime.py", line 176, in body_encode
    if line.endswith(CRLF):
TypeError: expected an object with the buffer interface
>>> m = MIMEText(b'abc', _charset='utf-8')
>>> m = MIMEText('abc', _charset='utf-8')      # bug: base64 uses bytes
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\base64mime.py", line 94, in body_encode
    enc = b2a_base64(s[i:i + max_unencoded]).decode("ascii")
TypeError: must be bytes or buffer, not str

Moreover, the email package is pickier about encoding name synonyms
than Python and most other tools are: “latin-1” is detected as a
quoted-printable MIME type, but “latin1” is unknown and so defaults
to Base64 MIME. In fact, this is why Base64 was used for the “latin1”
Unicode type earlier in this section, an encoding choice that is
irrelevant to any recipient that understands the “latin1” synonym,
including Python itself. Unfortunately, that means that we also need
to pass in a different string type if we use a synonym the package
doesn’t understand today:

>>> m = MIMEText('abc', _charset='latin-1')    # str for 'latin-1'
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

abc
>>> m = MIMEText('abc', _charset='latin1')
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\base64mime.py", line 94, in body_encode
    enc = b2a_base64(s[i:i + max_unencoded]).decode("ascii")
TypeError: must be bytes or buffer, not str
>>> m = MIMEText(b'abc', _charset='latin1')    # bytes for 'latin1'!
>>> print(m)
Content-Type: text/plain; charset="latin1"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

YWJj

There are ways to add aliases and new encoding types in the email
package, but they’re not supported out of the box. Programs that care
about being robust would have to cross-check the user’s spelling,
which may be valid for Python itself, against that expected by email.
This also holds true if your data is not ASCII in general: you’ll
have to first decode to text in order to use the expected “latin-1”
name because its quoted-printable MIME encoding expects str, even
though bytes are required if “latin1” triggers the default Base64
MIME:

>>> m = MIMEText(b'A\xe4B', _charset='latin1')
>>> print(m)
Content-Type: text/plain; charset="latin1"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

QeRC
>>> m = MIMEText(b'A\xe4B', _charset='latin-1')
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\quoprimime.py", line 176, in body_encode
    if line.endswith(CRLF):
TypeError: expected an object with the buffer interface
>>> m = MIMEText(b'A\xe4B'.decode('latin1'), _charset='latin-1')
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

A=E4B
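
As for adding aliases, the email.charset module exposes an add_alias
function for registering extra names. The sketch below shows it
mapping “latin1” to the name the package already knows. Whether this
also changes which string type MIMEText expects under 3.1 is not
verified here, and note that the registration is global to the running
process.

```python
from email.charset import Charset, add_alias, QP, BASE64

# Before registration the name is looked up as-is; for an unrecognized
# name the package falls back to its default body encoding.
before = Charset('latin1').body_encoding

# Register the synonym; this changes module-level state in email.charset.
add_alias('latin1', 'iso-8859-1')

after = Charset('latin1')
print(before == BASE64, after.body_encoding == QP, after.input_charset)
```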

In fact, the text message object doesn’t check to see that the data
you’re MIME-encoding is valid per Unicode in general: we can send
invalid UTF-8 text, but the receiver may have trouble decoding it:

>>> m = MIMEText(b'A\xe4B', _charset='utf-8')
>>> print(m)
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

QeRC
>>> b'A\xe4B'.decode('utf8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected...
>>> import base64
>>> base64.b64decode(b'QeRC')
b'A\xe4B'
>>> base64.b64decode(b'QeRC').decode('utf')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected...
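
Since the message object does not validate its data, a program can run
the check itself before composing. Here is a minimal sketch; the
helper name is_valid_for is hypothetical and not part of email or of
this book's examples. It makes no attempt to handle unknown encoding
names, which raise LookupError instead.

```python
def is_valid_for(data, encoding):
    """Return True if the bytes decode cleanly under the named encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

sample = b'A\xe4B'                    # same bytes as in the session above
results = {name: is_valid_for(sample, name)
           for name in ('latin-1', 'utf-8')}
print(results)
```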

So what should we do when we need to attach message text to composed
messages, given that the text’s datatype requirement is indirectly
dictated by its Unicode encoding name? The generic Message superclass
doesn’t help here directly if we specify an encoding, as it exhibits
the same encoding-specific behavior:

>>> m = Message()
>>> m.set_payload('spam', charset='us-ascii')
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

spam
>>> m = Message()
>>> m.set_payload(b'spam', charset='us-ascii')
AttributeError: 'bytes' object has no attribute 'encode'
>>> m.set_payload('spam', charset='utf-8')
TypeError: must be bytes or buffer, not str

Although we could try to work around these issues by repeating much
of the code that email runs, the redundancy would make us hopelessly
tied to its current implementation and dependent upon its future
changes. The following, for example, parrots the steps that email
runs internally to create a text message object for ASCII encoding
text; unlike the MIMEText class, this approach allows all data to be
read from files as binary byte strings, even if it’s simple ASCII:

>>> m = Message()
>>> m.add_header('Content-Type', 'text/plain')
>>> m['MIME-Version'] = '1.0'
>>> m.set_param('charset', 'us-ascii')
>>> m.add_header('Content-Transfer-Encoding', '7bit')
>>> data = b'spam'
>>> m.set_payload(data.decode('ascii'))        # data read as bytes here
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

spam
>>> print(MIMEText('spam', _charset='ascii'))  # same, but type-specific
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

spam

To do the same for other kinds of text that require MIME
encoding, just insert an extra encoding step; although we’re concerned
with text parts here, a similar imitative approach could address the
binary parts text generation bug we met earlier:

>>> m = Message()
>>> m.add_header('Content-Type', 'text/plain')
>>> m['MIME-Version'] = '1.0'
>>> m.set_param('charset', 'utf-8')
>>> m.add_header('Content-Transfer-Encoding', 'base64')
>>> data = b'spam'
>>> from binascii import b2a_base64            # add MIME encode if needed
>>> data = b2a_base64(data)                    # data read as bytes here too
>>> m.set_payload(data.decode('ascii'))
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

c3BhbQ==
>>> print(MIMEText(b'spam', _charset='utf-8')) # same, but type-specific
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

c3BhbQ==

This works, but besides the redundancy and dependency it creates, to
use this approach broadly we’d also have to generalize to account for
all the various kinds of Unicode encodings and MIME encodings
possible, like the email package already does internally. We might
also have to support encoding name synonyms to be flexible, adding
further redundancy. In other words, this requires additional work,
and in the end, we’d still have to specialize our code for different
Unicode types.

Any way we go, some dependence on the current implementation seems
unavoidable today. It seems the best we can do here, apart from
hoping for an improved email package in a few years’ time, is to
specialize text message construction calls by Unicode type, and
assume both that encoding names match those expected by the package
and that message data is valid for the Unicode type selected. Here is
the sort of arguably magic code that the upcoming mailtools package
(again in Example 13-23) will apply to choose text types:

>>> from email.charset import Charset, BASE64, QP
>>> for e in ('us-ascii', 'latin-1', 'utf8', 'latin1', 'ascii'):
...     cset = Charset(e)
...     benc = cset.body_encoding
...     if benc in (None, QP):
...         print(e, benc, 'text')       # read/fetch data as str
...     else:
...         print(e, benc, 'binary')     # read/fetch data as bytes
...
us-ascii None text
latin-1 1 text
utf8 2 binary
latin1 2 binary
ascii None text
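
The same test can be wrapped in small helpers so that calling code
does not repeat it. This is a sketch only, not the mailtools code; the
function names payload_type_for and coerce_payload are hypothetical,
and the conversion step assumes that the encoding name is also one
Python's own codecs accept.

```python
from email.charset import Charset, QP

def payload_type_for(encoding):
    """Return str or bytes, following the body-encoding test shown above."""
    benc = Charset(encoding).body_encoding
    return str if benc in (None, QP) else bytes

def coerce_payload(data, encoding):
    """Convert data to the type chosen by payload_type_for, if needed."""
    wanted = payload_type_for(encoding)
    if isinstance(data, wanted):
        return data                      # already the expected type
    if wanted is str:
        return data.decode(encoding)     # bytes -> str
    return data.encode(encoding)         # str -> bytes

for name in ('us-ascii', 'latin-1', 'utf8'):
    print(name, payload_type_for(name).__name__)
```

A caller could then pass coerce_payload(data, name) as the text
argument when building a text part. The coupling to the package's
internal encoding table remains, so this only centralizes the
dependence rather than removing it.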

We’ll proceed this way in this book, with the major caveat that this
will almost certainly require changes in the future because of its
strong coupling with the current email implementation.

Note

Late-breaking news: As with the prior section, it now appears that
this section’s bug will also be fixed in Python 3.2, making the
workaround here unnecessary in this and later Python releases. The
nature of the fix is unknown, though, and we still need the
workaround for the version of Python current when this chapter was
written. As of just before publication, the alpha release of 3.2 is
still somewhat type specific on this issue, but now accepts either
str or bytes for text that triggers Base64 encodings, instead of just
bytes.
