Programming Python (122 page)

Text payload encodings: Using header information to decode

More profoundly, text in email can be even richer than implied
so far—in principle, text payloads of a single message may be encoded
in a variety of different Unicode schemes (e.g., three HTML webpage
file attachments, all in different Unicode encodings, and possibly
different than the full message text’s encoding). Although treating
such text as binary byte strings can sometimes finesse encoding
issues, saving such parts in text-mode files for opening must respect
the original encoding types. Further, any text processing performed on
such parts will be similarly type-specific.

Luckily, the email package both adds character-set headers when
generating message text and retains character-set information for
parts if it is present when parsing message text. For instance, adding
non-ASCII text attachments simply requires passing in an encoding
name; the appropriate message headers are added automatically on text
generation, and the character set is available directly via the
get_content_charset method:

>>> s = b'A\xe4B'
>>> s.decode('latin1')
'AäB'
>>> from email.message import Message
>>> m = Message()
>>> m.set_payload(b'A\xe4B', charset='latin1')    # or 'latin-1': see ahead
>>> t = m.as_string()
>>> print(t)
MIME-Version: 1.0
Content-Type: text/plain; charset="latin1"
Content-Transfer-Encoding: base64

QeRC
>>> m.get_content_charset()
'latin1'

Notice how email automatically applies Base64 MIME encoding to
non-ASCII text parts on generation, to conform to email standards. The
same is true for the more specific MIME text subclass of Message:

>>> from email.mime.text import MIMEText
>>> m = MIMEText(b'A\xe4B', _charset='latin1')
>>> t = m.as_string()
>>> print(t)
Content-Type: text/plain; charset="latin1"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

QeRC
>>> m.get_content_charset()
'latin1'

Now, if we parse this message’s text string with email, we get back a
new Message whose text payload is the Base64 MIME-encoded text used to
represent the non-ASCII Unicode string. Requesting MIME decoding for
the payload with decode=1 returns the byte string we originally
attached:

>>> from email.parser import Parser
>>> q = Parser().parsestr(t)
>>> q

>>> q.get_content_type()
'text/plain'
>>> q._payload
'QeRC\n'
>>> q.get_payload()
'QeRC\n'
>>> q.get_payload(decode=1)
b'A\xe4B'

However, running Unicode decoding on this byte string to convert to
text fails if we rely on the default encoding: when called with no
argument, the bytes decode method assumes UTF-8, and this data is not
valid UTF-8. To be more accurate, and support a wide variety of text
types, we need to use the character-set information saved by the
parser and attached to the Message object. This is especially
important if we need to save the data to a file: we either have to
store it as bytes in binary-mode files, or specify the correct (or at
least a compatible) Unicode encoding in order to use such strings for
text-mode files. Decoding manually works the same way:

>>> q.get_payload(decode=1).decode()
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected
>>> q.get_content_charset()
'latin1'
>>> q.get_payload(decode=1).decode('latin1')                   # known type
'AäB'
>>> q.get_payload(decode=1).decode(q.get_content_charset())    # allow any type
'AäB'
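
To make the file-saving choice concrete, here is a minimal sketch of
both options. It is not code from this book's examples: the message
text is retyped by hand from the session above, and the filenames and
variable names are made up for illustration.

```python
import os
import tempfile
from email.parser import Parser

# Message text retyped from the session above (an assumption of this sketch).
RAW = (
    'MIME-Version: 1.0\n'
    'Content-Type: text/plain; charset="latin1"\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    'QeRC\n'
)

msg = Parser().parsestr(RAW)
data = msg.get_payload(decode=True)      # MIME decoding: Base64 text -> bytes
charset = msg.get_content_charset()      # encoding name saved by the parser

with tempfile.TemporaryDirectory() as tmpdir:
    # Option 1: binary mode stores the bytes as is; no encoding is needed.
    bin_path = os.path.join(tmpdir, 'part.bin')      # hypothetical filename
    with open(bin_path, 'wb') as f:
        f.write(data)

    # Option 2: text mode needs Unicode decoding first, and an explicit
    # encoding on the file so the same bytes are written back out.
    txt_path = os.path.join(tmpdir, 'part.txt')      # hypothetical filename
    text = data.decode(charset)
    with open(txt_path, 'w', encoding=charset) as f:
        f.write(text)

    # Read both files back in binary mode to compare what was stored.
    with open(bin_path, 'rb') as f:
        raw_bin_file = f.read()
    with open(txt_path, 'rb') as f:
        raw_text_file = f.read()
```

In this sketch both files end up holding the same three bytes; the
text-mode route only pays off when the text must also be processed as
a str along the way.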

In fact, all the header details are available on Message objects, if
we know where to look. The character set can also be absent entirely,
in which case it’s returned as None; clients need to define policies
for such ambiguous text (they might try common types, guess, or treat
the data as a raw byte string):

>>> q['content-type']                                # mapping interface
'text/plain; charset="latin1"'
>>> q.items()
[('Content-Type', 'text/plain; charset="latin1"'), ('MIME-Version', '1.0'),
('Content-Transfer-Encoding', 'base64')]
>>> q.get_params(header='Content-Type')              # param interface
[('text/plain', ''), ('charset', 'latin1')]
>>> q.get_param('charset', header='Content-Type')
'latin1'
>>> charset = q.get_content_charset()                # might be missing
>>> if charset:
...     print(q.get_payload(decode=1).decode(charset))
...
AäB
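
One possible policy for a missing or unusable character set is
sketched below. The function name, its fallback list, and the final
replace-errors step are all assumptions made for illustration, not
something the email package prescribes; a real client might prefer to
ask the user or keep the raw bytes.

```python
def decode_text_payload(data, declared=None, fallbacks=('utf-8', 'latin-1')):
    """Hypothetical helper: decode payload bytes to str.

    Tries the declared charset first (if any), then each fallback in
    order. Returns a (text, encoding_used) pair; encoding_used is None
    if every candidate failed and a lossy ASCII decode was used.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding for this data, or an unknown encoding name.
            continue
    # Last resort: keep ASCII characters, mark everything else as U+FFFD.
    return data.decode('ascii', errors='replace'), None


declared_case = decode_text_payload(b'A\xe4B', 'latin1')
missing_case = decode_text_payload(b'A\xe4B')              # charset was None
bogus_case = decode_text_payload(b'A\xe4B', 'no-such-codec')
```

Because latin-1 maps every byte to some character, it never fails, so
with the default fallback list the lossy last step is reached only if
a caller passes a different list.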

This handles encodings for message text parts in parsed emails.
For composing new emails, we still must apply session-wide user
settings or allow the user to specify an encoding for each part
interactively. In some of this book’s email clients, payload
conversions are performed as needed—using encoding information in
message headers after parsing and provided by users during mail
composition.
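
On the composition side, a sketch of applying a user-chosen encoding
might look like the following. The helper name and its utf-8 default
are hypothetical stand-ins for a session-wide setting, and the code
passes a str rather than bytes, which is an assumption about how a
client would hold text typed by the user.

```python
from email.mime.text import MIMEText
from email.parser import Parser


def make_text_part(text, encoding='utf-8'):
    """Hypothetical helper: build a text part using the user's encoding."""
    return MIMEText(text, _subtype='plain', _charset=encoding)


part = make_text_part('A\xe4B', encoding='utf-8')
wire = part.as_string()                       # headers plus encoded body

# Parse the generated text back and decode it using only its own headers.
parsed = Parser().parsestr(wire)
roundtrip = parsed.get_payload(decode=True).decode(parsed.get_content_charset())
```

The round trip at the end relies solely on header information carried
in the message, which is the same technique used for fetched mail.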

Message header encodings: email package support

On a related note, the email package also provides support for
encoding and decoding message headers themselves (e.g., From, Subject)
per email standards when they are not simple text. Such headers are
often called Internationalized (or i18n) headers, because they support
inclusion of non-ASCII character set text in emails. This term is also
sometimes used to refer to encoded text of message payloads; unlike
message headers, though, message payload encoding is used for both
international Unicode text and truly binary data such as images (as
we’ll see in the next section).

Like mail payload parts, i18n headers are encoded specially for email,
and may also be encoded per Unicode. For instance, here’s how to
decode an encoded subject line from an arguably spammish email that
just showed up in my inbox; its =?UTF-8?Q? preamble declares that the
data following it is UTF-8 encoded Unicode text, which is also
MIME-encoded per quoted-printable for transmission in email (in short,
unlike the prior section’s part payloads, which declare their
encodings in separate header lines, headers themselves may declare
their Unicode and MIME encodings by embedding them in their own
content this way):

>>> rawheader = '=?UTF-8?Q?Introducing=20Top=20Values=3A=20A=20Special=20Selection=20of=20Great=20Money=20Savers?='
>>> from email.header import decode_header           # decode per email+MIME
>>> decode_header(rawheader)
[(b'Introducing Top Values: A Special Selection of Great Money Savers', 'utf-8')]
>>> bin, enc = decode_header(rawheader)[0]           # and decode per Unicode
>>> bin, enc
(b'Introducing Top Values: A Special Selection of Great Money Savers', 'utf-8')
>>> bin.decode(enc)
'Introducing Top Values: A Special Selection of Great Money Savers'

Subtly, the email package can return multiple parts if there are
encoded substrings in the header, and each must be decoded
individually and joined to produce decoded header text. Even more
subtly, in 3.1, this package returns all bytes when any substring (or
the entire header) is encoded but returns str for a fully unencoded
header, and unencoded substrings returned as bytes are encoded per
“raw-unicode-escape” in the package, an encoding scheme useful to
convert str to bytes when no encoding type applies:

>>> from email.header import decode_header
>>> S1 = 'Man where did you get that assistant?'
>>> S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
>>> S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='

# str: don't decode()
>>> decode_header(S1)
[('Man where did you get that assistant?', None)]

# bytes: do decode()
>>> decode_header(S2)
[(b'Man where did you get that assistant?', 'utf-8')]

# bytes: do decode() using raw-unicode-escape applied in package
>>> decode_header(S3)
[(b'Man where did you get that', None), (b'assistant?', 'utf-8')]

# join decoded parts if more than one
>>> parts = decode_header(S3)
>>> ' '.join(abytes.decode('raw-unicode-escape' if enc is None else enc)
...          for (abytes, enc) in parts)
'Man where did you get that assistant?'

We’ll use logic similar to the last step here in the mailtools package
ahead, but also retain str substrings intact without attempting to
decode.
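
As a rough sketch of that idea (not the mailtools code itself), the
following hypothetical function decodes bytes parts, passes str parts
through untouched, and joins the results. Stripping each part and
joining on a single space is a simplification assumed here so that
the result does not depend on whether the package keeps the
whitespace around unencoded substrings; it can alter spacing in
unusual headers.

```python
from email.header import decode_header


def decode_whole_header(raw, default='raw-unicode-escape'):
    """Hypothetical helper: decode a full header line to a single str."""
    pieces = []
    for value, enc in decode_header(raw):
        if isinstance(value, bytes):
            # Encoded substring, or an unencoded one returned as bytes
            # with no encoding name attached.
            value = value.decode(enc if enc is not None else default)
        # str values (fully unencoded headers) are kept as they are.
        value = value.strip()
        if value:
            pieces.append(value)
    return ' '.join(pieces)


S1 = 'Man where did you get that assistant?'
S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='
results = [decode_whole_header(s) for s in (S1, S2, S3)]
```

All three sample headers from the session above should come out as
the same text under this sketch.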

Note

Late-breaking news: As I write this in mid-2010, it seems possible
that this mixed type, nonpolymorphic, and frankly, non-Pythonic API
behavior may be addressed in a future Python release. In response to a
rant posted on the Python developers list by a book author whose work
you might be familiar with, there is presently a vigorous discussion
of the topic there. Among other ideas is a proposal for a bytes-like
type which carries with it an explicit Unicode encoding; this may make
it possible to treat some text cases in a more generic fashion. While
it’s impossible to foresee the outcome of such proposals, it’s good to
see that the issues are being actively explored. Stay tuned to this
book’s website for further developments in the Python 3.X library API
and Unicode stories.

Message address header encodings and parsing, and header creation

One wrinkle pertaining to the prior section: for message headers that
contain email addresses (e.g., From), the name component of the
name/address pair might be encoded this way as well. Because the email
package’s header parser expects encoded substrings to be followed by
whitespace or the end of string, we cannot ask it to decode a complete
address-related header; quotes around name components will fail.

To support such Internationalized address headers, we must also parse
out the first part of the email address and then decode. First of all,
we need to extract the name and address parts of an email address
using email package tools:

>>> from email.utils import parseaddr, formataddr
>>> p = parseaddr('"Smith, Bob" ')                # split into name/addr pair
>>> p                                             # unencoded addr
('Smith, Bob', '[email protected]')
>>> formataddr(p)
'"Smith, Bob" '
>>> parseaddr('Bob Smith ')                       # unquoted name part
('Bob Smith', '[email protected]')
>>> formataddr(parseaddr('Bob Smith '))
'Bob Smith '
>>> parseaddr('[email protected]')                 # simple, no name
('', '[email protected]')
>>> formataddr(parseaddr('[email protected]'))
'[email protected]'

Fields with multiple addresses (e.g., To) separate individual
addresses by commas. Since email names might embed commas, too,
blindly splitting on commas to run each through parsing won’t always
work. Instead, another utility can be used to parse each address
individually: getaddresses ignores commas in names when splitting
apart separate addresses, and parseaddr does, too, because it simply
returns the first pair in the getaddresses result (some line breaks
were added to the following for legibility):

>>> from email.utils import getaddresses
>>> multi = '"Smith, Bob" , Bob Smith , [email protected],
"Bob" '
>>> getaddresses([multi])
[('Smith, Bob', '[email protected]'), ('Bob Smith', '[email protected]'), ('', '[email protected]'),
('Bob', '[email protected]')]
>>> [formataddr(pair) for pair in getaddresses([multi])]
['"Smith, Bob" ', 'Bob Smith ', '[email protected]',
'Bob ']
>>> ', '.join([formataddr(pair) for pair in getaddresses([multi])])
'"Smith, Bob" , Bob Smith , [email protected],
Bob '
>>> getaddresses(['[email protected]'])             # handles single address cases too
[('', '[email protected]')]
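
To see the comma clash in isolation, the short sketch below compares
a blind split with getaddresses. The header string and its
example.com addresses are placeholders invented for this
illustration.

```python
from email.utils import getaddresses

# Hypothetical header: two addresses, the first with a comma in its name.
multi = '"Smith, Bob" <bob@example.com>, Bob Smith <bob2@example.com>'

# Blind split: the comma inside the quoted name is mistaken for a separator.
naive = [piece.strip() for piece in multi.split(',')]

# getaddresses: commas inside quoted names are not treated as separators.
pairs = getaddresses([multi])
```

Here the blind split yields three fragments for two addresses, while
getaddresses returns exactly two name/address pairs.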

Now, decoding email addresses is really just an extra step before and
after the normal header decoding logic we saw earlier:

>>> rawfromheader = '"=?UTF-8?Q?Walmart?=" '
>>> from email.utils import parseaddr, formataddr
>>> from email.header import decode_header
>>> name, addr = parseaddr(rawfromheader)         # split into name/addr parts
>>> name, addr
('=?UTF-8?Q?Walmart?=', '[email protected]')
>>> abytes, aenc = decode_header(name)[0]         # do email+MIME decoding
>>> abytes, aenc
(b'Walmart', 'utf-8')
>>> name = abytes.decode(aenc)                    # do Unicode decoding
>>> name
'Walmart'
>>> formataddr((name, addr))                      # put parts back together
'Walmart '

Although From headers will typically have just one address, to be
fully robust we need to apply this to every address in headers, such
as To, Cc, and Bcc. Again, the multiaddress getaddresses utility
avoids comma clashes between names and address separators; since it
also handles the single address case, it suffices for From headers as
well:

>>> rawfromheader = '"=?UTF-8?Q?Walmart?=" '
>>> rawtoheader = rawfromheader + ', ' + rawfromheader
>>> rawtoheader
'"=?UTF-8?Q?Walmart?=" , "=?UTF-8?Q?Walmart?=" [email protected]>'
>>> pairs = getaddresses([rawtoheader])
>>> pairs
[('=?UTF-8?Q?Walmart?=', '[email protected]'), ('=?UTF-8?Q?Walmart?=', 'ne
[email protected]')]
>>> addrs = []
>>> for name, addr in pairs:
...     abytes, aenc = decode_header(name)[0]      # email+MIME
...     name = abytes.decode(aenc)                 # Unicode
...     addrs.append(formataddr((name, addr)))     # one or more addrs
...
>>> ', '.join(addrs)
'Walmart , Walmart '

These tools are generally forgiving for unencoded content and return
it intact. To be robust, though, the last portion of code here should
also allow for multiple parts returned by decode_header (for encoded
substrings), None encoding values for parts (for unencoded
substrings), and str substring values instead of bytes (for fully
unencoded names).
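
A sketch that covers those three cases might look like the following.
The function names and the example.com addresses are hypothetical,
and the code returns name/address pairs rather than calling
formataddr, because newer Python releases have formataddr re-encode
non-ASCII names, which would undo the decoding when the goal is
display.

```python
from email.header import decode_header
from email.utils import getaddresses


def decode_name(name):
    """Hypothetical helper: decode the name part of one address."""
    pieces = []
    for value, enc in decode_header(name):        # may be multiple parts
        if isinstance(value, bytes):              # bytes: needs decoding
            # None encoding: unencoded substring returned as bytes
            value = value.decode(enc or 'raw-unicode-escape')
        value = value.strip()                     # str: kept intact
        if value:
            pieces.append(value)
    return ' '.join(pieces)


def decode_address_pairs(rawheader):
    """Hypothetical helper: split a From/To/Cc header, decode each name."""
    return [(decode_name(name), addr)
            for (name, addr) in getaddresses([rawheader])]


# Placeholder header text invented for this sketch.
raw = ('"=?UTF-8?Q?Walmart?=" <news@example.com>, '
       '"Smith, Bob" <bob@example.com>, '
       '=?utf-8?q?J=C3=BCrgen?= <j@example.com>, '
       'plain@example.com')
decoded = decode_address_pairs(raw)
```

The sample header mixes an encoded quoted name, an unencoded name
with a comma, an encoded non-ASCII name, and a bare address, so each
branch of the helper is exercised once.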

Decoding this way applies both MIME and Unicode decoding steps to
fetched mails. Creating properly encoded headers for inclusion in new
mails composed and sent is similarly straightforward:

>>> from email.header import make_header
>>> hdr = make_header([(b'A\xc4B\xe4C', 'latin-1')])
>>> print(hdr)
AÄBäC
>>> print(hdr.encode())
=?iso-8859-1?q?A=C4B=E4C?=
>>> decode_header(hdr.encode())
[(b'A\xc4B\xe4C', 'iso-8859-1')]

This can be applied to entire headers such as Subject, as well as the
name component of each email address in an address-related header line
such as From and To (use getaddresses to split into individual
addresses first if needed). The Header object provides an alternative
interface; both techniques handle additional details, such as line
lengths, for which we’ll defer to Python manuals:

>>> from email.header import Header
>>> h = Header(b'A\xe4B\xc4X', charset='latin-1')
>>> h.encode()
'=?iso-8859-1?q?A=E4B=C4X?='
>>>
>>> h = Header('spam', charset='ascii')           # same as Header('spam')
>>> h.encode()
'spam'
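
Putting these pieces together for an address header about to be sent,
a sketch might encode only the names that need it and leave addresses
alone. The function name, the utf-8 default, and the example.com
addresses are assumptions of this illustration, and whether a given
SMTP server accepts the result is outside its scope.

```python
from email.header import Header, decode_header
from email.utils import formataddr, getaddresses


def encode_address_header(rawheader, charset='utf-8'):
    """Hypothetical helper: encode non-ASCII names in a From/To header."""
    encoded = []
    for name, addr in getaddresses([rawheader]):
        try:
            name.encode('ascii')
        except UnicodeEncodeError:
            # Non-ASCII name: encode per email standards; address is untouched.
            name = Header(name, charset=charset).encode()
            encoded.append('%s <%s>' % (name, addr))
        else:
            # Plain ASCII name (or none): formataddr adds quotes as needed.
            encoded.append(formataddr((name, addr)))
    return ', '.join(encoded)


# Placeholder header text invented for this sketch.
wire = encode_address_header(
    'J\xfcrgen M\xfcller <j@example.com>, "Smith, Bob" <bob@example.com>')

# Round trip: split again, then MIME- and Unicode-decode the first name.
pairs = getaddresses([wire])
data, enc = decode_header(pairs[0][0])[0]
first_name = data.decode(enc)
```

The round trip at the end is only a sanity check that the encoded
form decodes back to the name we started with.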

The mailtools package ahead and its PyMailGUI client of Chapter 14
will use these interfaces to automatically decode message headers in
fetched mails per their content for display, and to encode headers
sent that are not in ASCII format. The latter also applies to the name
component of email addresses, and assumes that SMTP servers will allow
these to pass. This may encroach on some SMTP server issues which we
don’t have space to address in this book. See the Web for more on SMTP
header handling. For more on header decoding, see also file
_test-i18n-headers.py in the examples package; it decodes additional
subject and address-related headers using mailtools methods, and
displays them in a tkinter Text widget, a foretaste of how these will
be displayed in PyMailGUI.
