Docx Converter

FrankNight

Jul 27, 2015

GPL3

This is a DOCX to HTML conversion tool with support for custom CSS styles.

Introduction

This project started with the aim of simplifying and automating the publishing of my company's documents.

We have a lot of Word documents that should be published on the web, but they have to be converted first. Microsoft Word has a function to export as HTML, but we need to apply our own style sheets; moreover, we have different style sheets to apply depending on the topic.

With these requirements in mind, I built a small prototype that could be worth improving.

Architecture

This is only a prototype. It has limited features compared to everything you can do with Microsoft Word, but I have converted several documents with different styles and the results were good. I also use it to write documents to put in a CMS or to submit to CodeProject ;-)

The conversion process consists of scanning a Word document in reading order, extracting the style properties of each word or paragraph, finding a matching HTML tag, and then rendering the result to an HTML file.

References

Reading a docx document is not simple. It is XML, so it is readable, but the specification is really big. After studying it for a few days, though, I understood enough to achieve some good results.

First of all, the docx format is a ZIP-compressed package. You can use a custom library to open it, but you can also use the classes in the “System.IO.Packaging” namespace, which lives in the WindowsBase assembly provided with .NET Framework 3.0. This namespace is very useful, and it is what I use in this project, because it also implements all the functions needed to manage and retrieve the package “relationships”.
The package is in fact structured as a directory tree, with different folders containing different files.
Files can be related to each other, and these relationships are defined and stored in the package too.
I explain this because many of the documents I converted had embedded images and, in the package, images are parts that have a relationship with the document part.
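
As a rough illustration of how relationships lead from the package to the document part and then to its images, here is a minimal sketch using System.IO.Packaging. The module and function names are hypothetical and are not taken from the project; the two relationship type URIs are the ones used by the transitional OOXML schema, which is an assumption about the documents being read.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO.Packaging   ' needs a reference to WindowsBase.dll

Public Module PackageSketch
    ' Relationship types of the transitional OOXML schema (assumed here).
    Public Const DocumentRelType As String = _
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
    Public Const ImageRelType As String = _
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"

    ' Returns the part URIs of all embedded images related to the main document part.
    Public Function ListImageUris(ByVal pkg As Package) As List(Of String)
        Dim uris As New List(Of String)
        Dim root As New Uri("/", UriKind.Relative)

        For Each docRel As PackageRelationship In pkg.GetRelationshipsByType(DocumentRelType)
            ' Package-level relationships are resolved against the package root.
            Dim docUri As Uri = PackUriHelper.ResolvePartUri(root, docRel.TargetUri)
            If Not pkg.PartExists(docUri) Then Continue For
            Dim docPart As PackagePart = pkg.GetPart(docUri)

            For Each imgRel As PackageRelationship In docPart.GetRelationshipsByType(ImageRelType)
                ' Skip linked (external) images; only embedded ones are parts.
                If imgRel.TargetMode = TargetMode.Internal Then
                    ' Part-level relationships are resolved against the source part.
                    Dim imgUri As Uri = PackUriHelper.ResolvePartUri(docPart.Uri, imgRel.TargetUri)
                    uris.Add(imgUri.ToString())
                End If
            Next
        Next
        Return uris
    End Function
End Module
```

A caller would typically open the file with Package.Open(path, FileMode.Open, FileAccess.Read) inside a Using block and pass the package to the function; each returned URI can then be handed to GetPart and GetStream to copy the image out.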

Second, the main file I want to analyze is the one that contains the text: “/word/document.xml”. This file describes the content of the document, so you can scan the whole document element by element.

Third, quoting from the OOXML specification:

“The basis of a WordprocessingML document is its actual text contents. Those text contents can be stored in many contexts (tables, text boxes, etc.), but the most basic form of text contents in WordprocessingML is the paragraph, specified using the p element (§2.3.1.22). Within the paragraph, all rich formatting at the paragraph level is stored within the pPr element (§2.3.1.25; §2.3.1.26). [Note: Some examples of paragraph properties are alignment, border, hyphenation override, indentation, line spacing, shading, text direction, and widow/orphan control.] Within the paragraph, text is grouped into one or more runs, represented by the r element (§2.3.2.23), which define a region of text with a common set of properties.”
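
To make the p / pPr / r structure from the quote concrete, here is a minimal sketch that walks the paragraphs with LINQ to XML and collects the style id and the text of each one. It is not the project's code: the module and function names are hypothetical, and it only looks at runs that are direct children of a paragraph (runs nested in hyperlinks, for example, are ignored).

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml.Linq

Public Module ParagraphScanSketch
    ' Main WordprocessingML namespace, normally bound to the "w" prefix.
    Private ReadOnly W As XNamespace = _
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Returns one entry per paragraph: "<style id>: <text of its runs>".
    Public Function DescribeParagraphs(ByVal body As XElement) As List(Of String)
        Dim result As New List(Of String)

        For Each p As XElement In body.Descendants(W + "p")
            ' pPr/pStyle holds the paragraph style id; both are optional.
            Dim styleName As String = "(default)"
            Dim pStyle As XElement = Nothing
            Dim pPr As XElement = p.Element(W + "pPr")
            If pPr IsNot Nothing Then pStyle = pPr.Element(W + "pStyle")
            If pStyle IsNot Nothing AndAlso pStyle.Attribute(W + "val") IsNot Nothing Then
                styleName = pStyle.Attribute(W + "val").Value
            End If

            ' Each run (w:r) carries its text in one or more w:t elements.
            Dim text As String = ""
            For Each r As XElement In p.Elements(W + "r")
                For Each t As XElement In r.Elements(W + "t")
                    text &= t.Value
                Next
            Next

            result.Add(styleName & ": " & text)
        Next
        Return result
    End Function
End Module
```

The body element can come from XDocument.Load on the stream of the document part, or from XElement.Parse on a small fragment when experimenting.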

Notes

Technical choice

A paragraph can have its own style, but the runs it contains can override that style with a more specific one, so there are two levels of style. It has been useful to distinguish between an applied style (Normal, Heading, List Number…) and text modifiers (bold, underline, italic…).
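
The two levels can be sketched as a function that starts from the paragraph style and lets the run override it, while collecting the text modifiers separately. This is only an illustration under assumptions: the function name is hypothetical, the modifier names "i" and "u" are chosen for this example, and toggle elements that switch a property off (for example a w:b element with w:val="0") are not handled.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml.Linq

Public Module RunStyleSketch
    Private ReadOnly W As XNamespace = _
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Returns "<style name>|<comma-separated modifiers>" for one run.
    Public Function DescribeRun(ByVal paragraphStyle As String, ByVal run As XElement) As String
        Dim styleName As String = paragraphStyle
        Dim modifiers As New List(Of String)

        Dim rPr As XElement = run.Element(W + "rPr")
        If rPr IsNot Nothing Then
            ' Level 1: a character style on the run overrides the paragraph style.
            Dim rStyle As XElement = rPr.Element(W + "rStyle")
            If rStyle IsNot Nothing AndAlso rStyle.Attribute(W + "val") IsNot Nothing Then
                styleName = rStyle.Attribute(W + "val").Value
            End If

            ' Level 2: text modifiers are collected independently of the style.
            If rPr.Element(W + "b") IsNot Nothing Then modifiers.Add("b")
            If rPr.Element(W + "i") IsNot Nothing Then modifiers.Add("i")
            If rPr.Element(W + "u") IsNot Nothing Then modifiers.Add("u")
        End If

        Return styleName & "|" & String.Join(",", modifiers.ToArray())
    End Function
End Module
```

For a bold, italic run inside a paragraph styled Normal, this returns "Normal|b,i"; a run carrying the character style Code returns "Code|".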

About languages /localization

This program works by literally matching the style names specified in the Word document against those specified in the map file. Opening a Word document and saving it with a different language version of Word causes all style names to be translated into that language. The demo file I attached was saved with an English version of Word, so its style names are in English (Normal, Heading1, Quote, Subtitle...), but if I save the document with an Italian version, the style names are translated into Italian (Normale, Titolo1, Citazione, sottotitolo…).

The program prints to the console the names of all styles that were not matched during the conversion.

Points of interest

I created a structure that saves all text with the same format in a buffer and flushes the buffer when the format changes. The StyleClass does this work. It has two public members, StyleName and Modifiers; by setting them we can detect when the format changes. It shadows the ToString() function, returning a unique string built from the values of its properties. Each paragraph has its own StyleClass, and each sub-element has its own too. If a sub-element specifies a different style or adds a text modifier, it sets a StyleClass and, when the style changes, the buffer is flushed.

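Private Class StyleClass
    ' Name of the applied Word style (Normal, Heading1, ...).
    Public StyleName As String

    ' Text modifiers, keyed by name so that each one is stored only once.
    Private p_modifiers As Hashtable

    Public Sub New()
        p_modifiers = New Hashtable
    End Sub

    ' Adds a modifier unless it is already present.
    Public Sub AddModifier(ByVal modifier As String)
        If Not p_modifiers.ContainsKey(modifier) Then
            Me.p_modifiers.Add(modifier, modifier)
        End If
    End Sub

    Public ReadOnly Property Modifiers() As ICollection
        Get
            Return p_modifiers.Values
        End Get
    End Property

    ' Builds the key used to detect format changes:
    ' the style name, a "|" separator, then the modifier names.
    Public Shadows Function ToString() As String
        Dim tmp As String
        tmp = ""
        For Each m As String In p_modifiers.Values
            tmp &= m
        Next
        Return StyleName & "|" & tmp
    End Function

End Class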
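
The buffering idea can be sketched independently of the rest of the converter. In the minimal example below, each piece of text arrives with a format key (the kind of string that ToString() returns above); adjacent pieces with the same key are accumulated, and the buffer is flushed whenever the key changes. The module name, the function name and the "key => text" output format are hypothetical and only serve the illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Public Module BufferFlushSketch
    ' Merges adjacent pieces that share the same format key.
    ' Each input pair is: Key = format key, Value = text of the piece.
    Public Function Coalesce(ByVal pieces As IEnumerable(Of KeyValuePair(Of String, String))) As List(Of String)
        Dim output As New List(Of String)
        Dim buffer As New StringBuilder()
        Dim currentKey As String = Nothing

        For Each piece As KeyValuePair(Of String, String) In pieces
            ' A different key means the format changed: flush what was collected.
            If currentKey IsNot Nothing AndAlso piece.Key <> currentKey Then
                output.Add(currentKey & " => " & buffer.ToString())
                buffer.Length = 0
            End If
            currentKey = piece.Key
            buffer.Append(piece.Value)
        Next

        ' Flush whatever is still pending at the end of the input.
        If currentKey IsNot Nothing Then
            output.Add(currentKey & " => " & buffer.ToString())
        End If
        Return output
    End Function
End Module
```

Feeding it the pieces ("Normal|", "Hello "), ("Normal|", "world ") and ("Normal|b", "bold") produces two entries: the first two pieces merged under "Normal|", and the bold piece on its own.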

Flushing the buffer also means adding an HTML tag to the text. I created a table that contains a list of Word styles with their matching “HTML start tag” and “HTML end tag”:

Style	Start tag	End tag
Normal	<p>	</p>
Heading1	<h1>	</h1>
Code	<text style="">	</text>
…

And one for the text modifiers:

Modifier	Start tag	End tag
bold	<b>	</b>
Italic	<i>	</i>
Underline	<u>	</u>
…

I load this matching table into memory from an external XML file and then use it to render the text according to the specification.

<?xml version="1.0" encoding="utf-8"?>
<conversion_map >
  <styles>
    <style name="Normale">
      <start_ctag>[p]</start_ctag>
      <end_ctag>[/p]</end_ctag>
    </style>  
    <style name="Paragrafoelenco_l0">
      <start_ctag>[li style='list-style-type: circle; margin: 5px 0 5px 15px;']</start_ctag>
      <end_ctag>[/li]</end_ctag>
    </style>
    <style name="Code">
      <start_ctag>[pre lang="VB.NET"]</start_ctag>
      <end_ctag>[/pre]</end_ctag>
    </style>
    <style name="Grigliatabella">
      <start_ctag>[table class='feature' cellspacing='0' cellpadding='0' style='width:100%;']</start_ctag>
      <end_ctag>[/table]</end_ctag>
    </style>
    ...
  </styles>
  <modifiers>
    <modifer name="b">
      <start_ctag>[b]</start_ctag>
      <end_ctag>[/b]</end_ctag>
    </modifer>
    <modifer name="c">
      <start_ctag>[text style='color: #{0};']</start_ctag>
      <end_ctag>[/text]</end_ctag>
    </modifer>
    <modifer name="h">
      <start_ctag>[text style='background-color: {0};']</start_ctag>
      <end_ctag>[/text]</end_ctag>
    </modifer>
    ...
  </modifiers>
</conversion_map>
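
Loading such a map can be sketched with LINQ to XML as follows. This is not the project's loader: the module and function names are hypothetical, only the styles section is read, and the replacement of "[" and "]" with "<" and ">" is an assumption based on comparing the map with the tables above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml.Linq

Public Module MapLoaderSketch
    ' Reads every <style> element into a table:
    ' style name -> two-item array {start tag, end tag}.
    Public Function LoadStyles(ByVal map As XElement) As Dictionary(Of String, String())
        Dim table As New Dictionary(Of String, String())
        Dim styles As XElement = map.Element("styles")
        If styles Is Nothing Then Return table

        For Each st As XElement In styles.Elements("style")
            Dim nameAttr As XAttribute = st.Attribute("name")
            If nameAttr Is Nothing Then Continue For
            table(nameAttr.Value) = New String() { _
                ToHtml(st.Element("start_ctag")), _
                ToHtml(st.Element("end_ctag"))}
        Next
        Return table
    End Function

    ' Assumption: square brackets in the map stand for angle brackets.
    Private Function ToHtml(ByVal tag As XElement) As String
        If tag Is Nothing Then Return ""
        Return tag.Value.Replace("[", "<").Replace("]", ">")
    End Function
End Module
```

The map element can be obtained with XElement.Load on the path of the map file; the modifiers section could be read the same way by iterating over its child elements.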

The function that writes to the HTML document uses a stack to add the start tags and end tags in the right order:

Dim tmp As String 'temp buffer

'look up the paragraph style in the map table
Dim cs As ConvertionClass
cs = ht_Style(style.StyleName)
If cs Is Nothing Then
    'unknown style: report it and fall back to empty tags
    Console.WriteLine("Style not found: " & style.StyleName)
    cs = New ConvertionClass() With {.StartTag = "", .EndTag = ""}
End If

Dim cm As ConvertionClass
Dim s As New Stack

tmp = cs.StartTag

'open one tag per known modifier and remember it on the stack
For Each m In style.Modifiers
    cm = ht_Mod(m)
    If cm IsNot Nothing Then
        s.Push(cm)
        tmp &= cm.StartTag
    End If
Next

tmp &= buffer

'close the modifier tags in reverse order of opening
While s.Count > 0 AndAlso s.Peek IsNot Nothing
    cm = s.Pop
    tmp &= cm.EndTag
End While

tmp &= cs.EndTag

Return tmp
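
To see the effect of the stack in isolation, here is a small self-contained variant of the same idea. It is a sketch with hypothetical names: tag pairs are passed as two-item arrays instead of ConvertionClass instances, and nothing is looked up in the map tables.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Public Module RenderSketch
    ' Wraps the text in the style tags and, inside them, in every modifier's tags.
    ' Each tag pair is a two-item array: {start tag, end tag}.
    Public Function Render(ByVal styleTags As String(), _
                           ByVal modifierTags As IEnumerable(Of String()), _
                           ByVal text As String) As String
        Dim sb As New StringBuilder(styleTags(0))
        Dim pending As New Stack(Of String)

        For Each tags As String() In modifierTags
            sb.Append(tags(0))
            pending.Push(tags(1))   ' remember the end tag for later
        Next

        sb.Append(text)

        ' Popping reverses the order: the last tag opened is the first one closed.
        While pending.Count > 0
            sb.Append(pending.Pop())
        End While

        sb.Append(styleTags(1))
        Return sb.ToString()
    End Function
End Module
```

Rendering the text "Hello" with the style pair {"<p>", "</p>"} and the modifier pairs {"<b>", "</b>"} and {"<i>", "</i>"} gives <p><b><i>Hello</i></b></p>, so the tags are always properly nested.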

Limitations

This project can convert simple documents.

It can read and understand all styles used in the document,
it can manage a "table of contents" (which it converts into anchor names),
it can convert simple bulleted lists,
it can convert simple tables,
it can convert external hyperlinks.

The conversion style XML file must be specified as the second argument on the command line.

The result of the conversion is an HTML file saved in the relative folder “.\CDATA\”. If the document contains images, they are saved in the relative folder “.\CDATA\img\” and referenced as the source of img tags.
I've tested it with documents created by Microsoft Word 2010 and 2013.

Conclusion

This article was posted using this tool.
I wrote the article in Microsoft Word, prepared a conversion table for CodeProject, and finally copied and pasted the converted document. It works! :-P

This article shows only one technique for reading and converting a docx document. It is not exhaustive with respect to the whole OOXML specification. If you need support, please contact me at Gekoproject.com
