Show Word File in WPF

10 Sep 2013 · CPOL · 5 min read
A small WPF application that loads a DOCX file, reads it, and displays its content in WPF.

[Figure: DOCX in WPF application]

Introduction

Word 2007 documents are Office Open XML documents: a combination of XML architecture and ZIP compression that stores XML and non-XML files together in a single ZIP archive. These documents usually have the DOCX extension, but there are exceptions, such as macro-enabled documents and templates.

This article shows how to read and view a DOCX file in WPF using only .NET Framework 3.0, without any third-party code.

DOCX Overview

A DOCX file is actually a zipped group of files and folders, called a package. A package consists of package parts (files that contain any type of data, such as text, images, or binary content) and relationship files. Each package part has a unique URI name, and the relationship XML files refer to parts by these URIs.

When you open a DOCX file with a ZIP application, you can see the document structure and its package parts.

[Figure: DOCX content]
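The same package structure can also be inspected from code. The following is a minimal sketch, not part of the article's application: it builds a tiny in-memory package as a stand-in for a real DOCX file and lists its parts with System.IO.Packaging. The class name PackageSketch, both method names, and the placeholder part content are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;

static class PackageSketch
{
    const string MainDocumentRelationshipType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    // Builds a tiny in-memory package with one part and one relationship.
    // It is a stand-in for a real DOCX file, not a valid Word document.
    public static MemoryStream CreateTinyPackage()
    {
        var buffer = new MemoryStream();
        using (var package = Package.Open(buffer, FileMode.Create, FileAccess.ReadWrite))
        {
            Uri partUri = PackUriHelper.CreatePartUri(
                new Uri("word/document.xml", UriKind.Relative));
            PackagePart part = package.CreatePart(partUri, "application/xml");
            using (var writer = new StreamWriter(
                part.GetStream(FileMode.Create, FileAccess.Write)))
            {
                writer.Write("<document/>");
            }
            package.CreateRelationship(
                partUri, TargetMode.Internal, MainDocumentRelationshipType);
        }
        // Copy the bytes so the caller gets a fresh, readable stream.
        return new MemoryStream(buffer.ToArray());
    }

    // Returns "URI (content type)" for every part in the package.
    public static List<string> ListPartUris(Stream stream)
    {
        var uris = new List<string>();
        using (var package = Package.Open(stream, FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
                uris.Add(part.Uri + " (" + part.ContentType + ")");
        }
        return uris;
    }
}
```

Calling PackageSketch.ListPartUris on a stream opened from a real DOCX file should list the same parts you see in a ZIP application, identified by their part URIs.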

The main content of a DOCX file is stored in the package part document.xml, which is often located in the word directory, but does not have to be. To find the URI (location) of document.xml, we read the relationships XML file inside the _rels directory and look for a relationship of type http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument.

[Figure: DOCX content]

The document.xml file contains XML elements defined primarily in the WordprocessingML XML namespace of the Office Open XML specification. Its basic structure is a document (<document>) element that contains a body (<body>) element. The body consists of block-level elements such as paragraph (<p>) elements. A paragraph contains inline-level elements such as run (<r>) elements. A run contains the elements that carry the document's text content, such as text (<t>), break (<br>), and tab (<tab>) elements.
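To make the nesting concrete, here is a minimal sketch that walks a document.xml string with XmlReader and collects its plain text. It is a simplified illustration, not the article's DocxReader: the class name DocumentXmlSketch and the choice to render breaks and paragraph ends as newline characters are assumptions.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class DocumentXmlSketch
{
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Collects text from <w:t>, and maps <w:tab> and <w:br> to control characters.
    public static string ExtractText(string documentXml)
    {
        var text = new StringBuilder();
        bool insideText = false;

        using (var reader = XmlReader.Create(new StringReader(documentXml)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (reader.NamespaceURI != W) break;
                        if (reader.LocalName == "t") insideText = !reader.IsEmptyElement;
                        else if (reader.LocalName == "tab") text.Append('\t');
                        else if (reader.LocalName == "br") text.Append('\n');
                        break;

                    case XmlNodeType.Text:
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        // Only character data inside <w:t> is document text.
                        if (insideText) text.Append(reader.Value);
                        break;

                    case XmlNodeType.EndElement:
                        if (reader.NamespaceURI != W) break;
                        if (reader.LocalName == "t") insideText = false;
                        else if (reader.LocalName == "p") text.Append('\n');
                        break;
                }
            }
        }
        return text.ToString();
    }
}
```

Note that an empty paragraph written as a self-closing <w:p/> produces no end element, so this sketch emits no newline for it.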

Implementation

In short, to retrieve and display the text content of a DOCX file, the application uses two classes: DocxReader and its subclass DocxToFlowDocumentConverter.

DocxReader unzips the file with the help of the System.IO.Packaging namespace, finds the document.xml part through its relationship, and reads it with XmlReader.

DocxToFlowDocumentConverter converts the XML elements read by XmlReader into the corresponding WPF FlowDocument elements.

DocxReader

The DocxReader constructor first opens (unzips) the package from the DOCX file stream and then retrieves mainDocumentPart (document.xml) with the help of its PackageRelationship.

C#
protected const string MainDocumentRelationshipType = 
   "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
private readonly Package package;
private readonly PackagePart mainDocumentPart;
 
public DocxReader(Stream stream)
{
    if (stream == null)
        throw new ArgumentNullException("stream");
 
    this.package = Package.Open(stream, FileMode.Open, FileAccess.Read);
 
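    // Take the first package-level relationship of the main document type;
    // its target is the main document part (usually /word/document.xml).
    foreach (var relationship in 
       this.package.GetRelationshipsByType(MainDocumentRelationshipType))
    {
        this.mainDocumentPart = 
          package.GetPart(PackUriHelper.CreatePartUri(relationship.TargetUri));
        break;
    }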
}

After retrieving the document.xml PackagePart, we can read it with .NET’s XmlReader class, a fast, forward-only XML reader. It visits nodes in document order, which is the same order as a depth-first traversal of the XML tree.
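The visiting order can be seen with a short sketch that records each element together with its depth as XmlReader moves forward. The class name TraversalSketch and the "depth:name" output format are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class TraversalSketch
{
    // Returns one "depth:name" entry per element, in the order XmlReader visits them.
    public static List<string> VisitOrder(string xml)
    {
        var visited = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    visited.Add(reader.Depth + ":" + reader.LocalName);
            }
        }
        return visited;
    }
}
```

For a body with two paragraphs, the reader descends through the first paragraph down to its text element before it reaches the second paragraph, exactly as a depth-first traversal would.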

[Figure: DOCX elements, showing the numbered reading paths]

The first path, 1 to 4, shows the simplest route to retrieving text from a paragraph element. The second path, 5 - …, shows more complex paragraph content. Along this path, we also read paragraph properties (<pPr>) and run properties (<rPr>), which contain various formatting options.

We create a reading method for every element we wish to support along this path.

C#
protected virtual void ReadDocument(XmlReader reader)
{
    while (reader.Read())
        if (reader.NodeType == XmlNodeType.Element && reader.NamespaceURI == 
          WordprocessingMLNamespace && reader.LocalName == BodyElement)
        {
            ReadXmlSubtree(reader, this.ReadBody);
            break;
        }
}
 
private void ReadBody(XmlReader reader) {...}
private void ReadBlockLevelElement(XmlReader reader) {...}
protected virtual void ReadParagraph(XmlReader reader) {...}
private void ReadInlineLevelElement(XmlReader reader) {...}
protected virtual void ReadRun(XmlReader reader) {...}
private void ReadRunContentElement(XmlReader reader) {...}
protected virtual void ReadText(XmlReader reader) {...} 

A few things to note in the DocxReader reading methods:

  • We use XmlNameTable to store XML namespace, element, and attribute names. This makes the code cleaner and also faster: XmlReader returns atomized strings from the XmlNameTable for its LocalName and NamespaceURI properties, so we can compare these strings by object reference rather than with a more expensive value comparison. Even the ordinary string equality check benefits, because .NET tests reference equality first and falls back to value equality only when the references differ.
  • We call XmlReader.ReadSubtree when passing the XmlReader to a specific DocxReader reading method, which creates a boundary around that XML element. The reading method then has access only to that element rather than to the entire document.xml. This has some performance cost, which we traded for safer and more intuitive code.
C#
private static void ReadXmlSubtree(XmlReader reader, Action<XmlReader> action)
{
    using (var subtreeReader = reader.ReadSubtree())
    {
        // Position on the first node.
        subtreeReader.Read();

        if (action != null)
           action(subtreeReader);
    }
}  
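The XmlNameTable point from the list above can be sketched as follows. This is an illustration under stated assumptions rather than the article's code: the class name NameTableSketch and the method CountParagraphs are hypothetical. It relies on XmlReader atomizing the names it returns against the NameTable supplied through XmlReaderSettings.

```csharp
using System;
using System.IO;
using System.Xml;

static class NameTableSketch
{
    public static int CountParagraphs(string xml)
    {
        var nameTable = new NameTable();

        // Add returns the atomized instance of each string.
        string wordNamespace = nameTable.Add(
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main");
        string paragraphName = nameTable.Add("p");

        var settings = new XmlReaderSettings { NameTable = nameTable };
        int count = 0;

        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                // Reference comparisons: the reader hands back the same
                // string objects that were added to the name table.
                if (reader.NodeType == XmlNodeType.Element
                    && (object)reader.LocalName == (object)paragraphName
                    && (object)reader.NamespaceURI == (object)wordNamespace)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

The casts to object force a reference comparison. Without them, the == operator on strings falls back to a value comparison whenever the references differ.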

DocxToFlowDocumentConverter

This class inherits from DocxReader and overrides some of its reading methods to create the corresponding WPF FlowDocument elements.

For example, while reading the document element we create a new FlowDocument, while reading a paragraph element we create a new Paragraph, and while reading a run element we create a new Span.

C#
protected override void ReadDocument(XmlReader reader)
{
    this.document = new FlowDocument();
    this.document.BeginInit();
    base.ReadDocument(reader);
    this.document.EndInit();
}
 
protected override void ReadParagraph(XmlReader reader)
{
    using (this.SetCurrent(new Paragraph()))
        base.ReadParagraph(reader);
}
 
protected override void ReadRun(XmlReader reader)
{
    using (this.SetCurrent(new Span()))
        base.ReadRun(reader);
}

This class also sets some Paragraph and Span properties, which are read from the paragraph properties element <pPr> and the run properties element <rPr>. By the time XmlReader reaches these property elements, we have already created the new Paragraph or Span, so we only need to set its properties.
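As a rough illustration of reading such properties, the sketch below parses a run properties fragment and reports whether bold and italic are switched on. It is kept independent of WPF: the RunFormat enum and the RunPropertiesSketch class are hypothetical names, and how the article's converter maps the result onto Span properties is not shown here.

```csharp
using System;
using System.IO;
using System.Xml;

[Flags]
enum RunFormat { None = 0, Bold = 1, Italic = 2 }

static class RunPropertiesSketch
{
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Reads a <w:rPr> fragment and returns the toggles that are on.
    public static RunFormat Read(string runPropertiesXml)
    {
        var format = RunFormat.None;
        using (var reader = XmlReader.Create(new StringReader(runPropertiesXml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.NamespaceURI != W)
                    continue;

                if (reader.LocalName == "b" && IsOn(reader))
                    format |= RunFormat.Bold;
                else if (reader.LocalName == "i" && IsOn(reader))
                    format |= RunFormat.Italic;
            }
        }
        return format;
    }

    // A toggle such as <w:b/> is on unless its w:val attribute switches it off.
    private static bool IsOn(XmlReader reader)
    {
        string value = reader.GetAttribute("val", W);
        return value == null || value == "1" || value == "true" || value == "on";
    }
}
```

A converter could then translate these flags into the matching formatting on the element that is current at that moment.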

Because we move from a parent element (Paragraph) to its child elements (Spans) and back to the parent, we have to track the current element in the FlowDocument with a variable of type TextElement (an abstract base class of Paragraph and Span).

This is accomplished with the help of CurrentHandle and the C# using statement, which is syntactic sugar for a try-finally construct. The SetCurrent method sets the current TextElement, and the Dispose method restores the previous TextElement as the current one.

C#
private struct CurrentHandle : IDisposable
{
    private readonly DocxToFlowDocumentConverter converter;
    private readonly TextElement previous;
 
    public CurrentHandle(DocxToFlowDocumentConverter converter, TextElement current)
    {
        this.converter = converter;
        this.converter.AddChild(current);
        this.previous = this.converter.current;
        this.converter.current = current;
    }
 
    public void Dispose()
    {
        this.converter.current = this.previous;
    }
}

private IDisposable SetCurrent(TextElement current)
{
    return new CurrentHandle(this, current);
}

Using the Code

To get a FlowDocument, all we need to do is create a new DocxToFlowDocumentConverter instance from a DOCX file stream and call the Read method on that instance.

After that, we can display the flow document content in a WPF application using the FlowDocumentReader control.

C#
using (var stream = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
    var flowDocumentConverter = new DocxToFlowDocumentConverter(stream);
    flowDocumentConverter.Read();
    this.flowDocumentReader.Document = flowDocumentConverter.Document;
    this.Title = Path.GetFileName(path);
}

Conclusion

DOCX Reader is not a complete solution and is intended for simple scenarios (without tables, lists, pictures, headers/footers, styles, etc.). The application can be enhanced to read more DOCX features, but full DOCX support with all advanced features would require a lot more time and knowledge of the DOCX file format. Hopefully, this article and the accompanying application have given you some insight into the DOCX file format and can provide a basis for more complex DOCX-related applications.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer, GemBox Ltd.
Croatia

Written By
Josip Kremenic
Software Developer (Senior), GemBox Ltd
United Kingdom
