Usage

Converting files using the command line interface

Using the pydocx command, you can specify the output format with the input and output files:

$ pydocx --html input.docx output.html

Converting files using the library directly

If you don’t want to mess around having to create exporters, you can use the PyDocX.to_html helper method:

from pydocx import PyDocX

# Pass in a path
html = PyDocX.to_html('file.docx')

# Pass in a file object
html = PyDocX.to_html(open('file.docx', 'rb'))

# Pass in a file-like object
from cStringIO import StringIO
buf = StringIO()
with open('file.docx') as f:
   buf.write(f.read())

html = PyDocX.to_html(buf)

Of course, you can do the same using the exporter class:

from pydocx.export import PyDocXHTMLExporter

# Pass in a path
exporter = PyDocXHTMLExporter('file.docx')
html = exporter.export()

# Pass in a file object
exporter = PyDocXHTMLExporter(open('file.docx', 'rb'))
html = exporter.export()

# Pass in a file-like object
from cStringIO import StringIO
buf = StringIO()
with open('file.docx') as f:
   buf.write(f.read())

exporter = PyDocXHTMLExporter(buf)
html = exporter.export()

Currently Supported HTML elements

  • tables
    • nested tables
    • rowspans
    • colspans
    • lists in tables
  • lists
    • list styles
    • nested lists
    • list of tables
    • list of pragraphs
  • justification
  • images
  • styles
    • bold
    • italics
    • underline
    • hyperlinks
  • headings

HTML Styles

The export class pydocx.export.PyDocXHTMLExporter relies on certain CSS classes being defined for certain behavior to occur.

Currently these include:

  • class pydocx-insert -> Turns the text green.
  • class pydocx-delete -> Turns the text red and draws a line through the text.
  • class pydocx-center -> Aligns the text to the center.
  • class pydocx-right -> Aligns the text to the right.
  • class pydocx-left -> Aligns the text to the left.
  • class pydocx-comment -> Turns the text blue.
  • class pydocx-underline -> Underlines the text.
  • class pydocx-caps -> Makes all text uppercase.
  • class pydocx-small-caps -> Makes all text uppercase, however truly lowercase letters will be small than their uppercase counterparts.
  • class pydocx-strike -> Strike a line through.
  • class pydocx-hidden -> Hide the text.
  • class pydocx-tab -> Represents a tab within the document.

Additionally, several list styles are defined based off the attribute values listed at: http://officeopenxml.com/WPnumbering-numFmt.php

  • class pydocx-list-style-type-cardinalText -> (1, 2, 3, 4, etc.)
  • class pydocx-list-style-type-decimal -> (1, 2, 3, 4, etc.)
  • class pydocx-list-style-type-decimalEnclosedCircle -> (1, 2, 3, 4, etc.)
  • class pydocx-list-style-type-decimalEnclosedFullstop -> (1, 2, 3, 4, etc.)
  • class pydocx-list-style-type-decimalEnclosedParen -> (1, 2, 3, 4, etc.)
  • class pydocx-list-style-type-decimalZero -> (01, 02, 03, etc.)
  • class pydocx-list-style-type-lowerLetter -> (a, b, c, etc.)
  • class pydocx-list-style-type-lowerRoman -> (i, ii, iii, etc.)
  • class pydocx-list-style-type-none -> List style is removed
  • class pydocx-list-style-type-ordinalText -> (1, 2, 3, 4, etc.)
  • class pydocx-list-style-type-upperLetter -> (A, B, C, etc.)
  • class pydocx-list-style-type-upperRoman -> (I, II, III, etc.)

Exceptions

There is only one custom exception (MalformedDocxException). It is raised if either the xml or zipfile libraries raise an exception.