Cleaning the Office

Microsoft Office produces heavily laden HTML when converting from Word or Excel. It does the same when content is copied and pasted into web forms, which affects our dijit.Editor-based rich text controls: 4000 characters suddenly doesn’t seem like so much space when you have:

<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><!--[if !supportLists]--><span style="font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family: Symbol">·<span style="font-size: 7pt; font-family: 'Times New Roman';">        </span></span> <!--[endif]--><b><i><u>Shall we strip out the MS Office tags?</u></i></b></p>

The usual advice for dealing with this is to pop up a dialog box, have the user paste the content in there, and then sanitize it browser-side. This isn’t as neat for users as simply being able to paste, and I’m sure some of them will paste directly anyway.

So we decided to try to clean this up server-side when the content is saved, with the help of lxml. The steps so far have been surprisingly easy:

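from lxml import etree

def clean(html):
    """Strip MS Office classes, comments and mso-* styles.

    clean assumes you tidied first: html must already be well-formed XML.
    """
    # Wrap in a single root element so fragments with several top-level nodes parse
    root = etree.fromstring("<div>" + html + "</div>")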
    # Drop Office-specific class names (MsoNormal, MsoListParagraph..., etc.)
    for cls in etree.XPath("//@class")(root):
        parent = cls.getparent()
        classes = [c for c in cls.split() if not c.lower().startswith("mso")]
        if not classes:
            parent.attrib.pop("class", None)
        else:
            parent.attrib["class"] = " ".join(classes)
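    # Remove comments such as <!--[if !supportLists]--> and <!--[endif]-->
    for c in root.xpath('//comment()'):
        p = c.getparent()
        # remove() also discards the comment's tail text, so reattach it first
        if c.tail:
            prev = c.getprevious()
            if prev is not None:
                prev.tail = (prev.tail or "") + c.tail
            else:
                p.text = (p.text or "") + c.tail
        p.remove(c)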
    # Drop mso-* declarations from inline styles
    for style in etree.XPath("//@style")(root):
        parent = style.getparent()
        # TODO: use a proper parser for this
        style_parts = [part.split(":", 1) for part in style.split(";") if part.strip()]
        # strip() before the prefix test so "; mso-list:..." is caught too
        kept = [[a.strip() for a in part] for part in style_parts
                if not part[0].strip().lower().startswith("mso")]
        new_style = ";".join(":".join(part) for part in kept)
        if new_style != style:
            if not new_style:
                parent.attrib.pop("style", None)
            else:
                parent.attrib["style"] = new_style
    return etree.tostring(root, encoding="UTF-8")
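
The two filtering rules inside clean() can be tried on their own, without lxml. What follows is only a minimal sketch: the helper names strip_mso_classes and strip_mso_declarations are made up for illustration and are not part of the code above, and the naive split on ";" is the same shortcut the TODO warns about.

```python
def strip_mso_classes(class_attr):
    """Return the class attribute value without Office-specific (Mso*) names."""
    # Hypothetical helper: mirrors the class filter in clean()
    kept = [name for name in class_attr.split()
            if not name.lower().startswith("mso")]
    return " ".join(kept)


def strip_mso_declarations(style_attr):
    """Return the style attribute value without mso-* declarations."""
    # Hypothetical helper: mirrors the style filter in clean()
    kept = []
    for declaration in style_attr.split(";"):
        if not declaration.strip():
            continue  # skip empty pieces, e.g. after a trailing ";"
        name, sep, value = declaration.partition(":")
        if name.strip().lower().startswith("mso"):
            continue  # Office-only property, drop it
        kept.append(name.strip() + sep + value.strip())
    return ";".join(kept)


print(strip_mso_classes("MsoListParagraphCxSpFirst intro"))
# prints: intro
print(strip_mso_declarations("text-indent:-.25in;mso-list:l0 level1 lfo1"))
# prints: text-indent:-.25in
```

An empty string coming back from either helper is the signal to drop the attribute altogether, which is what the pop() calls in clean() do.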

That was the original stab at it; we ended up using cssutils to make the style attribute parsing consistent – the final version of the code ended up being even simpler as a result… and produces the following from the HTML shown above:

<p>·         <b><i><u>Shall we strip out the MS Office tags?</u></i></b></p>