Forum: >>> Magnum BBS <<<

What do these '=?utf-8?' sequences mean in python?

From Chris Green@21:1/5 to All on Sat May 6 14:50:40 2023

I'm having a real hard time trying to do anything to a string (?)
returned by mailbox.MaildirMessage.get().

I'm extracting the Subject: header from a message and, if I write what
it returns to a log file using the Python logging module, what I see
in the log file (when the Subject: has non-ASCII characters in it) is:-

=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=

Whatever I try I am unable to change the underscore characters in the
above string back to spaces.

So, what do those =?utf-8? and ?= sequences mean? Are they part of
the string or are they wrapped around the string on output as a way to
show that it's utf-8 encoded?

If I have the string in a variable how do I replace the underscores
with spaces? Simply doing "subject.replace('_', ' ')" doesn't work,
nothing happens at all.

All I really want to do is throw the non-ASCII characters away as the
string I'm trying to match in the subject is guaranteed to be ASCII.

--
Chris Green

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Peter Pearson@21:1/5 to Chris Green on Sat May 6 15:10:05 2023

On Sat, 6 May 2023 14:50:40 +0100, Chris Green <cl@isbd.net> wrote:
[snip]

> So, what do those =?utf-8? and ?= sequences mean? Are they part of
> the string or are they wrapped around the string on output as a way to
> show that it's utf-8 encoded?

Yes, "=?utf-8?" signals "MIME header encoding".
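
To make that concrete: the wrapper is an RFC 2047 "encoded word" of the form =?charset?encoding?text?=, and with the Q encoding an underscore stands for a space while =XX is a hex-escaped byte. Below is a rough sketch that takes the Subject from the question apart by hand. It assumes a single, well-formed, Q-encoded word, and split_encoded_word is a made-up helper name, not part of any library.

```python
import quopri

def split_encoded_word(word):
    """Split one '=?charset?encoding?text?=' token into its three fields."""
    if not (word.startswith("=?") and word.endswith("?=")):
        raise ValueError("not an encoded word")
    charset, encoding, text = word[2:-2].split("?", 2)
    return charset, encoding.upper(), text

sample = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="
charset, encoding, text = split_encoded_word(sample)

# Only the Q encoding is handled here; header=True makes quopri
# treat "_" as a space, as the header variant of the encoding requires.
raw = quopri.decodestring(text.encode("ascii"), header=True)
decoded = raw.decode(charset)
print(charset, encoding, decoded)
```

So the markers are part of the string as stored in the message, not something added on output. In practice the email package does this parsing for you.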

I've only blundered about briefly in this area, but I think you
need to make sure that all header values you work with have been
decoded into ordinary (Unicode) strings before proceeding.
Here's the code that seemed to work for me:

import email.header

def mime_decode_single(pair):
    """Decode a single (bytestring, charset) pair."""
    b, charset = pair
    # decode_header() yields str when the header has no encoded words,
    # otherwise bytes (with charset None for any unencoded parts).
    return b if isinstance(b, str) else b.decode(charset if charset else "utf-8")

def mime_decode(s):
    """Decode a MIME-header-encoded character string."""
    decoded_pairs = email.header.decode_header(s)
    return "".join(mime_decode_single(d) for d in decoded_pairs)
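
As a quick check of what email.header.decode_header() hands back, here is a small sketch using the Subject from the original question. The second, unencoded header is just an illustrative value, not one taken from a real message.

```python
import email.header

subject = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="

# Each item is a (value, charset) pair; encoded words come back as bytes.
pairs = email.header.decode_header(subject)
value, charset = pairs[0]
text = value.decode(charset)
print(text)

# A header with no encoded words comes back as a plain str with charset None.
plain_pairs = email.header.decode_header("Waterways Continental Europe")
print(plain_pairs)
```

That difference between str and bytes is why the isinstance() test is needed before calling decode().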

--
To email me, substitute nowhere->runbox, invalid->com.


From Chris Green@21:1/5 to Chris Green on Sat May 6 15:58:15 2023

Chris Green <cl@isbd.net> wrote:

> I'm having a real hard time trying to do anything to a string (?)
> returned by mailbox.MaildirMessage.get().

What a twit I am :-)

Strings are immutable, I have to do:-

newstring = oldstring.replace("_", " ")

Job done!
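
For anyone hitting the same thing, a short sketch of why the bare call appeared to do nothing: str.replace() returns a new string and leaves the original alone. It also shows that the replacement only handles the underscores; the =?utf-8?Q? wrapper and the =C3=A0 style escapes are still in the result.

```python
subject = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="

subject.replace("_", " ")              # return value discarded; subject is unchanged

newstring = subject.replace("_", " ")  # bind the returned string to a name
print(newstring)

# The underscores are gone, but the wrapper and the =XX escapes are not.
still_encoded = newstring.startswith("=?utf-8?Q?") and "=C3=A0" in newstring
print(still_encoded)
```

That is enough for matching an ASCII phrase in the subject, but the accented characters need a proper decode of the header.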

--
Chris Green

From jak@21:1/5 to All on Sat May 6 18:27:23 2023

Peter Pearson wrote:

[snip]

Hi,
You could also use make_header:

from email.header import decode_header, make_header

print(make_header(decode_header(subject)))
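
A slightly fuller sketch of the same idea, applied to the Subject from the original question. make_header() returns a Header object, so str() is needed before doing string operations on it. The encode/decode step is one possible way to throw the non-ASCII characters away, and the phrase being matched is only a stand-in for whatever ASCII text is actually wanted.

```python
from email.header import decode_header, make_header

subject = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="

header = make_header(decode_header(subject))
# make_header() returns a Header object; print() displays it via str().
text = str(header)

# Drop anything outside ASCII, as asked for in the original question.
ascii_only = text.encode("ascii", "ignore").decode("ascii")

# Stand-in for whatever ASCII text is being matched in the subject.
found = "Waterways Continental Europe" in ascii_only
print(ascii_only, found)
```

Note that dropping characters this way leaves gaps (a doubled space, "Sane" for "Saône"), which is harmless when only an ASCII phrase is being matched.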
