Authors: Mark Lutz

Programming Python

Text payload encodings: Using header information to decode

More profoundly, text in email can be even richer than implied
so far. In principle, the text payloads of a single message may be
encoded in a variety of different Unicode schemes (e.g., three HTML
webpage file attachments, all in different Unicode encodings, and
possibly different from the full message text’s encoding). Although
treating such text as binary byte strings can sometimes finesse
encoding issues, saving such parts in text-mode files for later
opening must respect their original encodings. Further, any text
processing performed on such parts will be similarly
encoding-specific.

Luckily, the email package both adds character-set headers when
generating message text and retains character-set information for
parts if it is present when parsing message text. For instance, adding
non-ASCII text attachments simply requires passing in an encoding
name: the appropriate message headers are added automatically on text
generation, and the character set is available directly via the
get_content_charset method:

>>> s = b'A\xe4B'
>>> s.decode('latin1')
'AäB'
>>> from email.message import Message
>>> m = Message()
>>> m.set_payload(b'A\xe4B', charset='latin1')      # or 'latin-1': see ahead
>>> t = m.as_string()
>>> print(t)
MIME-Version: 1.0
Content-Type: text/plain; charset="latin1"
Content-Transfer-Encoding: base64

QeRC
>>> m.get_content_charset()
'latin1'

Notice how email automatically applies Base64 MIME encoding to
non-ASCII text parts on generation, to conform to email standards. The
same is true for the more specific MIME text subclass of Message:

>>> from email.mime.text import MIMEText
>>> m = MIMEText(b'A\xe4B', _charset='latin1')
>>> t = m.as_string()
>>> print(t)
Content-Type: text/plain; charset="latin1"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

QeRC
>>> m.get_content_charset()
'latin1'

Now, if we parse this message’s text string with email, we get back
a new Message whose text payload is the Base64 MIME-encoded text used
to represent the non-ASCII Unicode string. Requesting MIME decoding
for the payload with decode=1 returns the byte string we originally
attached:

>>> from email.parser import Parser
>>> q = Parser().parsestr(t)
>>> q
<email.message.Message object at 0x...>
>>> q.get_content_type()
'text/plain'
>>> q._payload
'QeRC\n'
>>> q.get_payload()
'QeRC\n'
>>> q.get_payload(decode=1)
b'A\xe4B'

However, running Unicode decoding on this byte string to convert it
to text fails if we rely on the default encoding of the bytes decode
method (UTF-8). To be more accurate, and to support a wide variety of
text types, we need to use the character-set information saved by the
parser and attached to the Message object. This is especially
important if we need to save the data to a file: we either have to
store it as bytes in binary-mode files, or specify the correct (or at
least a compatible) Unicode encoding in order to use such strings for
text-mode files. Decoding manually works the same way:

>>> q.get_payload(decode=1).decode()
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected
>>> q.get_content_charset()
'latin1'
>>> q.get_payload(decode=1).decode('latin1')                     # known type
'AäB'
>>> q.get_payload(decode=1).decode(q.get_content_charset())      # allow any type
'AäB'
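
The full cycle shown so far (compose with a character set, generate
message text, parse it, and decode with the recorded character set)
can be condensed into a single function. The following is only a
minimal sketch, not code from this book’s examples: the name
roundtrip_text is hypothetical, it passes a str rather than bytes to
MIMEText, and it assumes a recent Python 3.X release, where the
transfer encoding chosen by the package may differ from the Base64
shown above.

```python
from email.mime.text import MIMEText
from email.parser import Parser

def roundtrip_text(text, charset):
    """Compose a text part, generate its wire form, parse it, and decode it.

    Hypothetical helper for illustration; returns (wire_text, decoded_text).
    """
    msg = MIMEText(text, _charset=charset)       # adds charset + transfer headers
    wire = msg.as_string()                       # full message text, all ASCII
    parsed = Parser().parsestr(wire)             # back to a Message object
    raw = parsed.get_payload(decode=True)        # undo MIME transfer encoding
    declared = parsed.get_content_charset()      # charset recorded in the headers
    return wire, raw.decode(declared)            # undo Unicode encoding

if __name__ == '__main__':
    wire, text = roundtrip_text('A\xe4B', 'latin-1')
    print(wire)
    print(text)
```

Because the decode step asks the parsed message for its own character
set, the same function works unchanged for other encodings such as
UTF-8.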

In fact, all the header details are available on Message objects,
if we know where to look. The character set can also be absent
entirely, in which case it’s returned as None; clients need to define
policies for such ambiguous text (they might try common types, guess,
or treat the data as a raw byte string):

>>> q['content-type']                                # mapping interface
'text/plain; charset="latin1"'
>>> q.items()
[('Content-Type', 'text/plain; charset="latin1"'), ('MIME-Version', '1.0'),
('Content-Transfer-Encoding', 'base64')]
>>> q.get_params(header='Content-Type')              # param interface
[('text/plain', ''), ('charset', 'latin1')]
>>> q.get_param('charset', header='Content-Type')
'latin1'
>>> charset = q.get_content_charset()                # might be missing
>>> if charset:
...     print(q.get_payload(decode=1).decode(charset))
...
AäB

This handles encodings for message text parts in parsed emails.
For composing new emails, we still must apply session-wide user
settings or allow the user to specify an encoding for each part
interactively. In some of this book’s email clients, payload
conversions are performed as needed, using encoding information found
in message headers after parsing, and encoding names provided by
users during mail composition.
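
One possible policy for parts whose character set is missing or
unusable is to try the declared name first and then a short list of
common encodings. The sketch below is an assumption-laden
illustration, not the logic of this book’s clients: the function name
decode_text_part and its fallbacks list are hypothetical, and the
order of the fallbacks is an arbitrary choice.

```python
from email import message_from_string

def decode_text_part(part, fallbacks=('utf-8', 'latin-1')):
    """Return a text part's payload as str, guessing when charset is absent.

    Hypothetical helper: tries the declared charset, then each fallback.
    """
    raw = part.get_payload(decode=True)          # bytes after MIME decoding
    if raw is None:                              # multipart: no direct payload
        return ''
    declared = part.get_content_charset()        # None if header has no charset
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue                             # wrong or unknown encoding name
    return raw.decode('ascii', errors='replace') # last resort: never raises

if __name__ == '__main__':
    text = ('Content-Type: text/plain\n'
            'Content-Transfer-Encoding: base64\n'
            '\n'
            'QeRC\n')
    print(decode_text_part(message_from_string(text)))
```

Here the part declares no character set, UTF-8 decoding fails on the
non-ASCII byte, and the Latin-1 fallback succeeds. A client might
instead prefer to keep such data as a raw byte string.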

Message header encodings: email package support

On a related note, the email package also provides support for
encoding and decoding message headers themselves (e.g., From, Subject)
per email standards when they are not simple text. Such headers are
often called Internationalized (or i18n) headers, because they support
inclusion of non-ASCII character set text in emails. This term is also
sometimes used to refer to encoded text of message payloads; unlike
message headers, though, message payload encoding is used for both
international Unicode text and truly binary data such as images (as
we’ll see in the next section).

Like mail payload parts, i18n headers are encoded specially for
email, and may also be encoded per Unicode. For instance, here’s how
to decode an encoded subject line from an arguably spammish email that
just showed up in my inbox. Its =?UTF-8?Q? preamble declares that the
data following it is UTF-8 encoded Unicode text, which is also
MIME-encoded per quoted-printable for transmission in email. In short,
unlike the prior section’s part payloads, which declare their
encodings in separate header lines, headers themselves may declare
their Unicode and MIME encodings by embedding them in their own
content this way:

>>> rawheader = '=?UTF-8?Q?Introducing=20Top=20Values=3A=20A=20Special=20Selection=20of=20Great=20Money=20Savers?='
>>> from email.header import decode_header     # decode per email+MIME
>>> decode_header(rawheader)
[(b'Introducing Top Values: A Special Selection of Great Money Savers', 'utf-8')]
>>> bin, enc = decode_header(rawheader)[0]     # and decode per Unicode
>>> bin, enc
(b'Introducing Top Values: A Special Selection of Great Money Savers', 'utf-8')
>>> bin.decode(enc)
'Introducing Top Values: A Special Selection of Great Money Savers'

Subtly, the email package can return multiple parts if there are
encoded substrings in the header, and each must be decoded
individually and joined to produce decoded header text. Even more
subtly, in 3.1, this package returns all bytes when any substring (or
the entire header) is encoded but returns str for a fully unencoded
header, and unencoded substrings returned as bytes are encoded per
“raw-unicode-escape” in the package, an encoding scheme useful to
convert str to bytes when no encoding type applies:

>>> from email.header import decode_header
>>> S1 = 'Man where did you get that assistant?'
>>> S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
>>> S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='

# str: don't decode()
>>> decode_header(S1)
[('Man where did you get that assistant?', None)]

# bytes: do decode()
>>> decode_header(S2)
[(b'Man where did you get that assistant?', 'utf-8')]

# bytes: do decode() using raw-unicode-escape applied in package
>>> decode_header(S3)
[(b'Man where did you get that', None), (b'assistant?', 'utf-8')]

# join decoded parts if more than one
>>> parts = decode_header(S3)
>>> ' '.join(abytes.decode('raw-unicode-escape' if enc == None else enc)
...          for (abytes, enc) in parts)
'Man where did you get that assistant?'

We’ll use logic similar to the last step here in the mailtools
package ahead, but also retain str substrings intact without
attempting to decode.
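
As a rough illustration of that idea (and not the mailtools code
itself), the sketch below decodes every part returned by
decode_header, leaves str parts untouched, and falls back on
raw-unicode-escape for bytes parts with no encoding name. The function
name decode_full_header is hypothetical. It also assumes a recent
Python 3.X release, where decode_header is expected to keep the
whitespace around unencoded parts, so the parts are joined without a
separator rather than with a space as in the 3.1 session above.

```python
from email.header import decode_header

def decode_full_header(rawheader):
    """Decode an i18n header into a single str (illustrative sketch)."""
    pieces = []
    for value, enc in decode_header(rawheader):
        if isinstance(value, bytes):             # str parts pass through as is
            try:
                value = value.decode(enc or 'raw-unicode-escape')
            except (UnicodeDecodeError, LookupError):
                # bad data or unknown encoding name: keep something readable
                value = value.decode('ascii', errors='replace')
        pieces.append(value)
    return ''.join(pieces)                       # assumes whitespace is preserved

if __name__ == '__main__':
    for raw in ('Man where did you get that assistant?',
                '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?=',
                'Man where did you get that =?UTF-8?Q?assistant=3F?='):
        print(decode_full_header(raw))
```

All three header forms from the prior session should produce the same
decoded text with this approach.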

Note

Late-breaking news: As I write this in mid-2010, it seems possible
that this mixed type, nonpolymorphic, and frankly, non-Pythonic API
behavior may be addressed in a future Python release. In response to a
rant posted on the Python developers list by a book author whose work
you might be familiar with, there is presently a vigorous discussion
of the topic there. Among other ideas is a proposal for a bytes-like
type which carries with it an explicit Unicode encoding; this may make
it possible to treat some text cases in a more generic fashion. While
it’s impossible to foresee the outcome of such proposals, it’s good to
see that the issues are being actively explored. Stay tuned to this
book’s website for further developments in the Python 3.X library API
and Unicode stories.

Message address header encodings and parsing, and header creation

One wrinkle pertaining to the prior section: for message headers
that contain email addresses (e.g., From), the name component of the
name/address pair might be encoded this way as well. Because the email
package’s header parser expects encoded substrings to be followed by
whitespace or the end of string, we cannot ask it to decode a complete
address-related header: quotes around name components will make the
decoding fail.

To support such Internationalized address headers, we must also
parse out the first part of the email address and then decode. First
of all, we need to extract the name and address parts of an email
address using email package tools:

>>> from email.utils import parseaddr, formataddr
>>> p = parseaddr('"Smith, Bob" ')                   # split into name/addr pair
>>> p                                                # unencoded addr
('Smith, Bob', '[email protected]')
>>> formataddr(p)
'"Smith, Bob" '
>>> parseaddr('Bob Smith ')                          # unquoted name part
('Bob Smith', '[email protected]')
>>> formataddr(parseaddr('Bob Smith '))
'Bob Smith '
>>> parseaddr('[email protected]')                   # simple, no name
('', '[email protected]')
>>> formataddr(parseaddr('[email protected]'))
'[email protected]'

Fields with multiple addresses (e.g., To) separate individual
addresses by commas. Since email names might embed commas, too,
blindly splitting on commas to run each through parsing won’t always
work. Instead, another utility can be used to parse each address
individually: getaddresses ignores commas in names when splitting
apart separate addresses, and parseaddr does, too, because it simply
returns the first pair in the getaddresses result (some line breaks
were added to the following for legibility):

>>> from email.utils import getaddresses
>>> multi = '"Smith, Bob" , Bob Smith , [email protected],
"Bob" '
>>> getaddresses([multi])
[('Smith, Bob', '[email protected]'), ('Bob Smith', '[email protected]'), ('', '[email protected]'),
('Bob', '[email protected]')]
>>> [formataddr(pair) for pair in getaddresses([multi])]
['"Smith, Bob" ', 'Bob Smith ', '[email protected]',
'Bob ']
>>> ', '.join([formataddr(pair) for pair in getaddresses([multi])])
'"Smith, Bob" , Bob Smith , [email protected],
Bob '
>>> getaddresses(['[email protected]'])              # handles single address cases too
[('', '[email protected]')]
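
To make the comma problem concrete, the following short sketch
compares a naive split on commas with getaddresses. The header value
and its example.com addresses are made up for illustration.

```python
from email.utils import getaddresses

# Hypothetical To header: the first name embeds a comma inside quotes
header = '"Smith, Bob" <bob@example.com>, Sue Jones <sue@example.com>'

# Naive approach: the quoted name is wrongly cut in two
naive = [piece.strip() for piece in header.split(',')]

# getaddresses honors the quotes and yields one pair per real address
pairs = getaddresses([header])

if __name__ == '__main__':
    print(len(naive), naive)
    print(len(pairs), pairs)
```

The naive split yields three fragments for what are really two
addresses, while getaddresses returns two name/address pairs.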

Now, decoding email addresses is really just an extra step
before and after the normal header decoding logic we saw
earlier:

>>> rawfromheader = '"=?UTF-8?Q?Walmart?=" '
>>> from email.utils import parseaddr, formataddr
>>> from email.header import decode_header
>>> name, addr = parseaddr(rawfromheader)            # split into name/addr parts
>>> name, addr
('=?UTF-8?Q?Walmart?=', '[email protected]')
>>> abytes, aenc = decode_header(name)[0]            # do email+MIME decoding
>>> abytes, aenc
(b'Walmart', 'utf-8')
>>> name = abytes.decode(aenc)                       # do Unicode decoding
>>> name
'Walmart'
>>> formataddr((name, addr))                         # put parts back together
'Walmart '

Although From headers will typically have just one address, to
be fully robust we need to apply this to every address in headers,
such as To, Cc, and Bcc. Again, the multiaddress getaddresses utility
avoids comma clashes between names and address separators; since it
also handles the single address case, it suffices for From headers as
well:

>>> rawfromheader = '"=?UTF-8?Q?Walmart?=" '
>>> rawtoheader = rawfromheader + ', ' + rawfromheader
>>> rawtoheader
'"=?UTF-8?Q?Walmart?=" , "=?UTF-8?Q?Walmart?=" [email protected]>'
>>> pairs = getaddresses([rawtoheader])
>>> pairs
[('=?UTF-8?Q?Walmart?=', '[email protected]'), ('=?UTF-8?Q?Walmart?=', 'ne
[email protected]')]
>>> addrs = []
>>> for name, addr in pairs:
...     abytes, aenc = decode_header(name)[0]        # email+MIME
...     name = abytes.decode(aenc)                   # Unicode
...     addrs.append(formataddr((name, addr)))       # one or more addrs
...
>>> ', '.join(addrs)
'Walmart , Walmart '

These tools are generally forgiving for unencoded content and
return it intact. To be robust, though, the last portion of code
here should also allow for multiple parts returned by decode_header
(for encoded substrings), None encoding values for parts (for
unencoded substrings), and str substring values instead of bytes (for
fully unencoded names).
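
A more defensive version of that loop might look like the sketch
below. It is an illustration under stated assumptions rather than this
book’s implementation: the names decode_name and decode_address_pairs
are hypothetical, the example.com addresses are invented, and the
parts are joined without a separator on the assumption that a recent
Python 3.X decode_header preserves whitespace. It returns decoded
name/address pairs and leaves display formatting to the caller, since
formataddr in recent releases encodes non-ASCII names again.

```python
from email.header import decode_header
from email.utils import getaddresses

def decode_name(name):
    """Decode one possibly encoded name: any number of parts, str or bytes."""
    pieces = []
    for value, enc in decode_header(name):
        if isinstance(value, bytes):                     # str: already text
            value = value.decode(enc or 'raw-unicode-escape', errors='replace')
        pieces.append(value)
    return ''.join(pieces)

def decode_address_pairs(rawheader):
    """Split an address header and decode each name (illustrative sketch)."""
    return [(decode_name(name), addr)
            for name, addr in getaddresses([rawheader])]

if __name__ == '__main__':
    raw = ('"=?UTF-8?Q?Walmart?=" <news@example.com>, '
           '"Smith, Bob" <bob@example.com>, '
           '=?iso-8859-1?q?J=FCrgen?= <j@example.com>')
    for pair in decode_address_pairs(raw):
        print(pair)
```

Encoded, unencoded, and comma-embedding names are all handled by the
same path, and a single-address header such as From needs no special
case.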

Decoding this way applies both MIME and Unicode decoding steps
to fetched mails. Creating properly encoded headers for inclusion in
new mails composed and sent is similarly straightforward:

>>> from email.header import make_header
>>> hdr = make_header([(b'A\xc4B\xe4C', 'latin-1')])
>>> print(hdr)
AÄBäC
>>> print(hdr.encode())
=?iso-8859-1?q?A=C4B=E4C?=
>>> decode_header(hdr.encode())
[(b'A\xc4B\xe4C', 'iso-8859-1')]

This can be applied to entire headers such as Subject, as well
as the name component of each email address in an address-related
header line such as From and To (use getaddresses to split into
individual addresses first if needed). The header object provides an
alternative interface; both techniques handle additional details, such
as line lengths, for which we’ll defer to Python manuals:

>>> from email.header import Header
>>> h = Header(b'A\xe4B\xc4X', charset='latin-1')
>>> h.encode()
'=?iso-8859-1?q?A=E4B=C4X?='
>>>
>>> h = Header('spam', charset='ascii')              # same as Header('spam')
>>> h.encode()
'spam'
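
Putting the composition side together, one plausible approach is to
encode a header value only when it is not plain ASCII, and to apply
the same step to the name component of an address. The sketch below is
a simplified assumption, not the mailtools code: the names
encode_header_text and encode_address are hypothetical, UTF-8 is an
arbitrary default, and the example.com addresses are invented.

```python
from email.header import Header
from email.utils import formataddr

def encode_header_text(text, charset='utf-8'):
    """Return text unchanged if ASCII, else as an encoded header string."""
    try:
        text.encode('ascii')
        return text                              # simple text: send as is
    except UnicodeEncodeError:
        return Header(text, charset).encode()    # =?charset?x?...?= form

def encode_address(name, addr, charset='utf-8'):
    """Encode only the name component, then rebuild the name/address pair."""
    return formataddr((encode_header_text(name, charset), addr))

if __name__ == '__main__':
    print(encode_header_text('spam'))
    print(encode_header_text('A\xe4B'))
    print(encode_address('J\xfcrgen', 'j@example.com'))
    print(encode_address('Smith, Bob', 'bob@example.com'))
```

Encoding the name before calling formataddr keeps the address itself
in plain text, which is the part that mail servers need to read.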

The mailtools package ahead and its PyMailGUI client of Chapter 14
will use these interfaces to automatically decode message headers in
fetched mails per their content for display, and to encode headers
sent that are not in ASCII format. The latter also applies to the
name component of email addresses, and assumes that SMTP servers will
allow these to pass. This may encroach on some SMTP server issues
which we don’t have space to address in this book. See the Web for
more on SMTP header handling. For more on header decoding, see also
file _test-i18n-headers.py in the examples package; it decodes
additional subject and address-related headers using mailtools
methods, and displays them in a tkinter Text widget, a foretaste of
how these will be displayed in PyMailGUI.
