- Internal links and anchors are now retained. Thanks, sunu! #222
- No longer error when processing margin positions with decimal points.
- Rect elements now correctly handle image data
- Textboxes can now contain tables.
- Pict elements can now contain Rect elements.
- Text colors other than black and white are no longer ignored
- Textboxes have been implemented. We no longer lose the content inside of them.
- Markup compatibility has been implemented. We always use the Fallback for AlternateContent tags.
- Fixed issue in PyDocX CLI tool and added new test cases for the same
- Simple and Complex field hyperlinks now support bookmarks / internal anchors
- Faked lists inside tables are correctly converted to real lists
- Headings inside a complex field no longer fail to ignore styles
- Fixed issue where multiple complex fields in the same paragraph would cause content to disappear.
- Added EmbeddedObject support with Shape
- Implemented complex and simple field hyperlinks.
- This includes a significant change to the API. The export methods are now all called twice, and the results of the first pass are discarded. In the first pass (`self.first_pass == True`), you can now track information that will be used to make decisions in the second pass. The notable example where this technique is used is implementing complex fields. Because the export methods are called twice, some exporter extensions that perform lossy operations on the document structure may need to skip processing during the first pass.
- The function signature of `get_hyperlink_tag` has changed. It previously accepted a `Hyperlink` instance. Now it only accepts
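The two-pass approach described above can be sketched as follows. This is a minimal illustration, not PyDocX's actual implementation: the class `TwoPassExporter`, its methods, and the bookmark-tracking logic are hypothetical, and only the `first_pass` flag comes from the notes above.

```python
class TwoPassExporter:
    """Hypothetical exporter that runs every export method twice."""

    def __init__(self, items):
        # Each item is a (kind, value) tuple, e.g. ('link', 'intro').
        self.items = items
        self.first_pass = True
        # Information gathered in the first pass for use in the second.
        self.seen_bookmarks = set()

    def export(self):
        # First pass: results are discarded, only tracked state is kept.
        self.first_pass = True
        for _ in self.export_items():
            pass
        # Second pass: results are kept.
        self.first_pass = False
        return list(self.export_items())

    def export_items(self):
        for item in self.items:
            yield self.export_item(item)

    def export_item(self, item):
        kind, value = item
        if self.first_pass:
            # Only record state; the return value is thrown away.
            if kind == 'bookmark':
                self.seen_bookmarks.add(value)
            return ''
        if kind == 'link':
            # Emit an anchor only if the target was seen in the first pass.
            if value in self.seen_bookmarks:
                return '<a href="#{0}">{0}</a>'.format(value)
            return value
        if kind == 'bookmark':
            return '<a name="{0}"></a>'.format(value)
        return value


exporter = TwoPassExporter([
    ('link', 'intro'),       # appears before its bookmark
    ('bookmark', 'intro'),
    ('link', 'missing'),     # no matching bookmark anywhere
])
result = exporter.export()
print(result)
```

In this sketch the first link appears before its bookmark, so a single pass could not know whether the target exists. The first pass collects the bookmarks, and the second pass uses them.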
- Styled whitespace is no longer ignored. Previously, this would result in certain configurations with words grouped together without spaces.
- Headings now preserve italic, webHidden and vanish styles
- Decimal font sizes are now handled properly
- Paragraphs that have numbering definitions with a level number format of None are no longer considered list items.
- Headings in lists no longer break numbering. By default, in the HTML exporter, headings in lists are represented using the “strong” tag, regardless of the heading level.
- Note: This release consists of significant changes to the internal API and is not backwards compatible with prior versions
- Fixed issue where the same image referenced multiple times would not display correctly after the first instance
- Removed the preprocessor and re-implemented the functionality into the exporter
- Re-implemented the exporter into a top-down generator algorithm
- Implemented the necessary object classes for each element type (Paragraph, Run, Text, etc)
- Implemented enumerated list detection and conversion to numbering lists
- Added support for Python 3.4
- Added support for PyPy
- No longer adding list-style-type attribute to ordered list tags. We are now using a class to indicate these.
- Faked sub/super handling is no longer handled by default. That handling is implemented in a new mixin class.
- `pydocx.openxml` have been merged into `pydocx.openxml.packaging` to better mirror the MS implementation structure.
- `pydocx.models.styles` has been moved to
- `pydocx.managers.styles` has been merged into
- `XmlCollection` field type, now used by
- Implemented several model classes for Numbering.
- Added numbering property to the numbering definitions part.
- XmlModels now define their own tags
- Simplified importing PyDocX
- Header processing now occurs in the exporter rather than the pre-processor
- PyDocXExporter.heading signature has changed from accepting heading_level which was an HTML tag to accepting heading_style_name which is the raw style name of the heading.
- The `convert_root_level_upper_roman` option has been replaced with an optional mixin
- Preprocessor no longer manages table membership. Instead, that is handled in the base iterative parser.
- `ConvertRootUpperRomanListToHeadingMixin` would fail for paragraphs that had no properties.
- Moved parsers to export module
- Renamed DocxParser to PyDocXExporter
- Renamed Docx2Html to PyDocXHTMLExporter
- Eliminated all improper usages of the find_first utility function
- Added support for NumberingDefinitionsPart to the WordprocessingDocumentFactory
- Fixed issue #116 - Don’t assume the first sz of an rPr actually is a direct child of that rPr.
- Moved CLI to __main__
- Moved tests to root-level module
- Specify charset in rendered HTML
- Added support for using defusedxml to mitigate XML vulnerabilities.
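One common way to make defusedxml optional is to try importing it and fall back to the standard library parser. The sketch below shows only that general pattern. It is an assumption about how such support could be wired in, not PyDocX's code, and the helper `parse_xml` is hypothetical.

```python
try:
    # defusedxml is a third-party package; its parser rejects
    # entity-expansion payloads and similar XML attacks by default.
    from defusedxml.ElementTree import fromstring
    USING_DEFUSEDXML = True
except ImportError:
    # Fall back to the standard library when defusedxml is not installed.
    from xml.etree.ElementTree import fromstring
    USING_DEFUSEDXML = False


def parse_xml(data):
    """Hypothetical helper: parse XML bytes with the safest parser available."""
    return fromstring(data)


root = parse_xml(b'<document><body>text</body></document>')
print(root.tag, USING_DEFUSEDXML)
```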
- Allow a file-like object to be passed into the DocxParser constructor.
- Added basic support for footnotes.
- Fixed a problem with calculating image sizes
- Run position and size are now taken into account, so that runs which fake superscript or subscript through positioning and sizing are given superscript and subscript tags.
- External images are now handled. This causes a backwards incompatible change with all handlers related to images.
- Added support for style basedOn property
- Fixed a bug in which the run paragraph mark properties were used as run properties (pPr > rPr within a style definition)
- Fixed a bug in which, when the run paragraph properties defined a global style identifier, any of those styles defined globally were ignored.
- Fixed a bug which allowed run properties to reference paragraph properties, and paragraph properties to reference run properties. Such instances are now ignored.
- We are once again supporting files that are missing images.
- Fixed a problem with list nesting. We were marking list items as the first list item in error.
- Added support for Python 3.3
- Fixed a problem with list nesting with nested sublists that have the same ilvl.
- Fixed an issue with marking runs as underline when they were not supposed to be.
- Fixed path issue on Windows for Zip archives
- Fixed attribute typo when attempting to generate an error message for a missing required resource
- CHANGELOG.md was missing from the MANIFEST in 0.3.15 which would cause the setup to fail.
- Use inline span to define styles instead of div
- Use ems for HTML widths instead of pixels
- If a property value is `off`, it is now considered disabled
- Use paths from `_rels/.rels` instead of hardcoding
- Significant performance gains for documents with a large number of table cells.
- Significant performance gains for large documents.
- Added command line support to convert from docx to either html or markdown.
- The non breaking hyphen tag was not correctly being imported. This issue has been fixed.
- Found and optimized a fairly large performance issue with tables that had large amounts of content within a single cell, which includes nested tables.
- We are now respecting the `<w:tab/>` element. We insert a space wherever one occurs.
- Each styling can have a default defined based on values in `styles.xml`. These default styles can be overwritten using the `rPr` on the actual `r` tag. The default styles defined in `styles.xml` are now respected.
- If zipfile fails to open the passed in file, we are now raising
- Some inline tags (most notably the underline tag) could have a value of `none`, which signifies that the style is disabled. A value of `none` is now correctly handled.
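The `off` and `none` handling noted in the entries above can be illustrated with a small predicate. This is a hypothetical sketch: the function name `is_style_enabled`, the constant `DISABLED_VALUES`, and the choice to treat a missing value as enabled are assumptions, not PyDocX's code.

```python
# Values treated as "disabled". 'off' and 'none' come from the notes
# above; treating a missing value as enabled is an assumption.
DISABLED_VALUES = frozenset(['off', 'none'])


def is_style_enabled(value):
    """Return False when a property value explicitly disables a style."""
    if value is None:
        # A bare tag with no value attribute is assumed to be enabled.
        return True
    return value.strip().lower() not in DISABLED_VALUES


for candidate in ('single', 'none', 'off', None):
    print(repr(candidate), is_style_enabled(candidate))
```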
- It is possible for a docx file to not contain a `numbering.xml` file but still try to use lists. Now if this happens all lists get converted to paragraphs.
- Not all docx files contain a `styles.xml` file. We are no longer assuming they do.
- It is possible for `w:t` tags to have `None`. This no longer causes an error when escaping that text.
- In the event that `cElementTree` has a problem parsing the document, a `MalformedDocxException` is raised instead of a
- Vertical merges should have a continue attribute, but sometimes they do not, and in those cases Word assumes it. The parser now handles vertical merges in which the continue attribute is missing.
- We now correctly handle documents with unicode characters in the namespace.
- In rare cases, some text would be output with a style when it should not have been. This issue has been fixed.
- Added support for several more OOXML tags including:
More details in the README.
- We switched to using `xml.etree.cElementTree`. This has resulted in a fairly significant speed increase for Python 2.6
- It is now possible to create your own preprocessor to do additional preprocessing.
- Superscripts and subscripts are now extracted correctly.
- Added a changelog
- Added the version in
- Fixed an issue with duplicating content if there was indentation or justification on a p element that had multiple t tags.