HTMLParser.HTMLParser().unescape() Doesn't Work

I would like to convert HTML entities back to their human-readable form, e.g. '&pound;' to '£', '&deg;' to '°' etc. I've read several posts regarding this question Conve

Solution 1:

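Apparently HTMLParser.unescape was a bit more primitive before Python 2.6: in 2.5 it only replaced the basic entities (&lt;, &gt;, &apos;, &quot; and &amp;) and left every other entity untouched.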

Python 2.5:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
'&copy;'

Python 2.6/2.7:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
u'\xa9'

See the 2.5 implementation vs the 2.6 implementation / 2.7 implementation
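
For readers on Python 3, where the HTMLParser module no longer exists under that name, the standard library offers the same conversion as html.unescape (available since Python 3.4). The following is a minimal sketch, not part of the original answer; the names samples and decoded are only illustrative:

```python
import html

# Named entities and numeric character references (decimal and hex)
samples = ['&copy;', '&pound;', '&deg;', '&#169;', '&#xa9;']

# html.unescape returns a str with each reference replaced by its character
decoded = [html.unescape(s) for s in samples]
print(decoded)
```

Unlike the Python 2.5 method shown above, this handles the full set of named entities, so no hand-written table is needed.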

Solution 2:

This site lists some solutions, here's one of them:

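from xml.sax.saxutils import escape, unescape

# Map each character to its HTML entity
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;",
    "©": "&copy;",  # etc...
}
# Invert the table: entity -> character (dict comprehensions need Python 2.7+)
html_unescape_table = {v: k for k, v in html_escape_table.items()}

def html_unescape(text):
    return unescape(text, html_unescape_table)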

Not the prettiest thing though, since you would have to list each escaped symbol manually.
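
One way around the manual listing might be to generate the table from the entity names the standard library already knows. This is a sketch under assumptions, written for Python 3 (html.entities; on Python 2 the equivalent module is htmlentitydefs and chr would be unichr). It covers only the HTML 4 named entities in name2codepoint, so &apos; and numeric references such as &#169; are not handled:

```python
from html.entities import name2codepoint
from xml.sax.saxutils import unescape

# Build "&name;" -> character for every named entity in name2codepoint.
# amp, lt and gt are skipped because saxutils.unescape already handles them,
# and it replaces &amp; last so that text is not unescaped twice.
html_unescape_table = {
    "&%s;" % name: chr(codepoint)
    for name, codepoint in name2codepoint.items()
    if name not in ("amp", "lt", "gt")
}

def html_unescape(text):
    """Replace named HTML entities in text with their characters."""
    return unescape(text, html_unescape_table)

print(html_unescape("&pound;5 at 20&deg;"))
```

Leaving &amp; to saxutils.unescape means a string such as '&amp;copy;' comes back as the literal text '&copy;' rather than being decoded a second time.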

EDIT:

How about this?

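import htmllib  # Python 2 only; removed in Python 3

def unescape(s):
    p = htmllib.HTMLParser(None)
    p.save_bgn()         # start capturing character data
    p.feed(s)
    return p.save_end()  # return the captured, entity-decoded text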
