Cleaning the Office

Microsoft Office produces heavily laden HTML when converting from Word or Excel, and it does the same when content is copied and pasted into web forms. That affects our dijit.Editor-based rich text controls: 4000 characters suddenly doesn’t seem like much space when you have:

[sourcecode language="html"]
<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><!--[if !supportLists]--><span style="font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family: Symbol">·<span style="font-size: 7pt; font-family: 'Times New Roman';"> </span></span> <!--[endif]--><b><i><u>Shall we strip out the MS Office tags?</u></i></b></p>
[/sourcecode]

The existing advice for dealing with this tends to be to pop up a dialog box, have users paste the content in there, and then sanitize it browser-side. That isn’t as neat for users as simply being able to paste, and I’m sure some of them will paste directly anyway.

So we decided to try to clean this up server-side when the content is saved, with the help of lxml. The steps so far have been surprisingly easy:

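[sourcecode language="python" collapse="false"]
from lxml import etree


def clean(html):
    """clean assumes you tidied first..."""
    root = etree.fromstring("<div>" + html + "</div>")
    # Drop Office class names (MsoNormal and friends); keep any others
    for cls in etree.XPath("//@class")(root):
        parent = cls.getparent()
        classes = [name for name in cls.split() if not name.lower().startswith("mso")]
        if not classes:
            parent.attrib.pop("class", None)
        else:
            parent.attrib["class"] = " ".join(classes)
    # Remove comments, including conditional ones like <!--[if !supportLists]-->
    for c in root.xpath("//comment()"):
        p = c.getparent()
        # lxml's remove() also drops the tail text, so move it elsewhere first
        if c.tail:
            prev = c.getprevious()
            if prev is not None:
                prev.tail = (prev.tail or "") + c.tail
            else:
                p.text = (p.text or "") + c.tail
        p.remove(c)
    # Drop mso-* declarations from inline styles
    for style in etree.XPath("//@style")(root):
        parent = style.getparent()
        style_parts = [style_part.split(":", 1) for style_part in style.split(";")]
        # TODO: use a proper parser for this
        new_style = ";".join(
            ":".join(a.strip() for a in style_part)
            for style_part in style_parts
            # skip empty parts (trailing ";") and mso-* properties, ignoring leading spaces
            if style_part[0].strip() and not style_part[0].strip().lower().startswith("mso")
        )
        if new_style != style:
            if not new_style.strip():
                parent.attrib.pop("style", None)
            else:
                parent.attrib["style"] = new_style
    return etree.tostring(root, encoding="UTF-8")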
[/sourcecode]

That was the original stab at it. We ended up using cssutils to make the style attribute parsing consistent, and the final version of the code was even simpler as a result. It produces the following from the HTML shown above:

[sourcecode language="html"]
<p>· <b><i><u>Shall we strip out the MS Office tags?</u></i></b></p>
[/sourcecode]
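As a rough illustration of why splitting a style attribute on ";" is fragile: semicolons can legitimately appear inside quoted strings and inside url(...), so a naive split can cut a declaration in half. The sketch below is not the cssutils-based code and only uses the standard library. The function names `split_declarations` and `strip_mso` are made up for this example, and it ignores backslash escapes inside strings, so treat it as a demonstration of the problem rather than a replacement for a real CSS parser.

```python
def split_declarations(style):
    """Split a style attribute on ';', ignoring semicolons inside quotes or parentheses."""
    parts, current, quote, depth = [], [], None, 0
    for ch in style:
        if quote:
            # Inside a quoted string: only the matching quote character ends it
            if ch == quote:
                quote = None
        elif ch in ('"', "'"):
            quote = ch
        elif ch == "(":
            depth += 1
        elif ch == ")" and depth:
            depth -= 1
        elif ch == ";" and depth == 0:
            # A top-level semicolon ends the current declaration
            parts.append("".join(current))
            current = []
            continue
        current.append(ch)
    parts.append("".join(current))
    return [p.strip() for p in parts if p.strip()]


def strip_mso(style):
    """Return the style string with any mso-* declarations removed."""
    kept = []
    for decl in split_declarations(style):
        name, _, value = decl.partition(":")
        if name.strip().lower().startswith("mso"):
            continue
        kept.append(name.strip() + ":" + value.strip())
    return ";".join(kept)


print(strip_mso("text-indent:-.25in;mso-list:l0 level1 lfo1"))
print(strip_mso("font-family:'a;b'; mso-bidi-font-family: Symbol"))
```

The first call keeps only the `text-indent` declaration. The second keeps the quoted font name intact, where a plain `style.split(";")` would have split it in two.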
