
Pdfizer, a dumb HTML to PDF converter, in C#

17 Jan 2004
This library converts simple HTML documents to PDF.

Introduction

This article presents a basic HTML to PDF converter: with this library, you can transform simple HTML pages into clean, printable PDF files.

The HTML cleaning is done with NTidy (see [1]), a .NET wrapper for the HTML Tidy library (see [2]). The PDF generation is done with iTextSharp, a PDF generation library (see [3]).

Transformation Pipe

Transforming HTML documents to PDF is a fairly complex task. Fortunately, there are powerful tools available on the web that helped me accomplish it.

Parsing HTML

The first problem to handle is that HTML is usually "dirty": its structure is often not well-formed XML, so trying to parse an HTML page with XmlDocument will usually fail.
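To make the failure concrete, here is a minimal sketch that uses only the standard System.Xml classes. The class name, the helper TryLoad and the two sample strings are made up for this illustration and are not part of Pdfizer. The first string leaves its br and p elements unclosed, as real-world HTML often does, so XmlDocument should reject it; the second is the same content written as well-formed XHTML.

```csharp
using System;
using System.Xml;

public static class DirtyHtmlDemo
{
    // Hypothetical helper: returns true when XmlDocument accepts the markup
    // as well-formed XML, false when it throws an XmlException.
    public static bool TryLoad(string markup)
    {
        XmlDocument doc = new XmlDocument();
        try
        {
            doc.LoadXml(markup);
            return true;
        }
        catch (XmlException ex)
        {
            Console.WriteLine("XmlDocument rejected the input: " + ex.Message);
            return false;
        }
    }

    public static void Main()
    {
        // Typical "dirty" HTML: <br> and <p> are never closed.
        string dirty = "<html><body><p>Hello<br>world</body></html>";
        // The same content written as well-formed XHTML.
        string clean = "<html><body><p>Hello<br/>world</p></body></html>";

        Console.WriteLine(TryLoad(dirty));
        Console.WriteLine(TryLoad(clean));
    }
}
```

Only the second call succeeds, which is why a cleaning step is needed before any DOM-based processing.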

To overcome this problem, I wrote NTidy, a .NET wrapper around HTML Tidy (see [1] and [2]). HTML Tidy is a very useful application that takes "dirty" HTML and returns it cleaned up as much as possible. The .NET wrapper exposes a DOM-like class structure so that you can use it much like XmlDocument.

Hence, with NTidy, we can safely parse HTML documents.

Creating PDF

The PDF creation is done by iTextSharp (see [3]), a .NET library hosted on SourceForge that gives you the tools to create PDF files easily. That takes care of the PDF creation problem.

Reading, Traversing

With NTidy and iTextSharp in my toolset, I could start to create the generator. The generator works like this: it first reads the input with NTidy, then traverses the DOM tree and generates the PDF fragments with iTextSharp.
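The read-then-traverse idea can be sketched without either library. The example below is an illustration under stated assumptions, not the Pdfizer implementation: it uses the standard XmlDocument as a stand-in for the NTidy DOM, assumes the input is already clean XHTML, and records each fragment as a plain string where a real generator would call into the PDF library. The names TraversalSketch, Visit and Collect are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class TraversalSketch
{
    // Walks the tree depth-first and records one fragment per non-empty
    // text node, tagged with the name of the element that contains it.
    public static void Visit(XmlNode node, List<string> fragments)
    {
        if (node.NodeType == XmlNodeType.Text)
        {
            string text = node.Value.Trim();
            if (text.Length > 0)
                fragments.Add(node.ParentNode.Name + ": " + text);
            return;
        }
        foreach (XmlNode child in node.ChildNodes)
            Visit(child, fragments);
    }

    // Stand-in for the "read" step: assumes the markup is already well-formed.
    public static List<string> Collect(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);
        List<string> fragments = new List<string>();
        Visit(doc.DocumentElement, fragments);
        return fragments;
    }

    public static void Main()
    {
        string xhtml =
            "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        foreach (string fragment in Collect(xhtml))
            Console.WriteLine(fragment);
    }
}
```

Knowing the enclosing element of each piece of text is what would let a generator pick a heading style for h1 content or a bold font for b content when it emits the corresponding PDF fragment.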

Quick Example

The library is used through the HtmlToPdfConverter class. Creating a PDF file takes the following steps, as illustrated in the example:

  1. Create a converter.
  2. Open a new PDF file using the Open method.
  3. Add a chapter.
  4. Feed HTML to the converter.
  5. If you want another chapter, go back to step 3.
  6. When finished, close the PDF file by calling Close.
C#
// create converter
HtmlToPdfConverter html2pdf = new HtmlToPdfConverter();

// open new pdf file
html2pdf.Open(@"test");
// start a chapter
html2pdf.AddChapter(@"Dummy Chapter");
string html = ...;
// convert string
html2pdf.Run(html);
// add a new chapter
html2pdf.AddChapter(@"Boost page");
// read web page
html2pdf.Run(new Uri(@"http://www.boost.org/libs/libraries.htm"));
// close and finish pdf file.
html2pdf.Close();

What to expect and not expect

Don't expect too much from this tool: it will not work with complex HTML pages, but it gives fairly good results with simple ones. In particular, tables are not yet supported.

References

  1. NTidy, a .NET wrapper around Tidy.
  2. HTML Tidy home page.
  3. iTextSharp, PDF generation tool.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.



Written By
Engineer
United States
Jonathan de Halleux is a Civil Engineer in Applied Mathematics. He finished his PhD in 2004 in the rainy country of Belgium. After 2 years in the Common Language Runtime (i.e. .NET), he is now working at Microsoft Research on Pex (http://research.microsoft.com/pex).
