Introduction
This is an HTML parser for extracting titles, text, and links from a page. It is a DLL written in C#, but it is easy to port to any other programming language, as long as you know how to fetch the HTML source of the page in that language.
Basic Idea
The idea behind this code is to walk through the HTML source character by character. When the parser meets the title tag, it stores the text that follows it in the title string. When it reaches the body tag, it accepts all text that is not script or CSS. Links are handled in the same way.
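As a rough illustration of the body-text step, here is a minimal sketch of a single left-to-right scan that keeps text and skips tags, scripts, and CSS. It is not the GetText code from the DLL: the class name BodyTextSketch and the method ExtractText are hypothetical, and the sketch assumes reasonably well-formed HTML.

```csharp
using System;
using System.Text;

public static class BodyTextSketch
{
    // Hypothetical helper, not the DLL's GetText: collects visible text from
    // the body, skipping tags and the contents of <script> and <style>.
    public static string ExtractText(string source)
    {
        StringBuilder text = new StringBuilder();
        int start = source.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        int i = start < 0 ? 0 : start;   // no body tag: scan the whole input
        while (i < source.Length)
        {
            char c = source[i];
            if (c != '<')
            {
                text.Append(c);          // ordinary character: keep it
                i++;
                continue;
            }
            // Script and style blocks are skipped up to their closing tag.
            string skipUntil = null;
            if (StartsWithAt(source, i, "<script")) skipUntil = "</script";
            else if (StartsWithAt(source, i, "<style")) skipUntil = "</style";
            if (skipUntil != null)
            {
                int close = source.IndexOf(skipUntil, i, StringComparison.OrdinalIgnoreCase);
                if (close < 0) break;    // unterminated block: stop here
                i = close;
            }
            // Skip the tag itself, up to and including '>'.
            int end = source.IndexOf('>', i);
            if (end < 0) break;
            text.Append(' ');            // keep words on either side of a tag apart
            i = end + 1;
        }
        // Collapse runs of whitespace into single spaces.
        string[] words = text.ToString().Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    // Case-insensitive check that 'value' occurs in 's' at position 'index'.
    private static bool StartsWithAt(string s, int index, string value)
    {
        return string.Compare(s, index, value, 0, value.Length, StringComparison.OrdinalIgnoreCase) == 0;
    }
}
```

The sketch does not handle HTML comments or a '>' inside an attribute value, so treat it as a starting point only.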
Brief Code Description
I made a lookup table for some special characters. For example, when the characters &lt; appear in the HTML code, they represent the < character.
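A lookup table of this kind could look like the sketch below. The class name EntityTable, the method Decode, and the choice of entries are assumptions made for illustration. The table in the DLL may cover a different set of characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EntityTable
{
    // Hypothetical table: only a handful of common entities are listed.
    private static readonly Dictionary<string, char> Entities =
        new Dictionary<string, char>
        {
            { "&lt;", '<' },
            { "&gt;", '>' },
            { "&amp;", '&' },
            { "&quot;", '"' }
        };

    // Replaces every known entity in the input with the character it stands for.
    public static string Decode(string text)
    {
        StringBuilder result = new StringBuilder();
        int i = 0;
        while (i < text.Length)
        {
            // An entity candidate runs from '&' to the next ';'.
            int end = text[i] == '&' ? text.IndexOf(';', i) : -1;
            char decoded;
            if (end > i && Entities.TryGetValue(text.Substring(i, end - i + 1), out decoded))
            {
                result.Append(decoded);
                i = end + 1;             // skip past the whole entity
            }
            else
            {
                result.Append(text[i]);  // ordinary character or unknown entity
                i++;
            }
        }
        return result.ToString();
    }
}
```

Unknown entities are copied through unchanged, so adding more entries to the dictionary is enough to extend the table.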
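public string GetTitle(string Source)
{
    const string openTag = "<title";
    int len = Source.Length;
    // Sliding window holding the last openTag.Length characters read.
    string title = new string(' ', openTag.Length);
    for (int i = 0; i < len; i++)
    {
        // Drop the oldest character and append the current one.
        title = title.Remove(0, 1) + Source[i];
        if (title.ToLowerInvariant() == openTag)
        {
            // Skip the rest of the opening tag, up to the closing '>'.
            while (i < len && Source[i] != '>')
            {
                i++;
            }
            i++;
            // Collect everything up to the next '<', which starts the closing tag.
            title = "";
            while (i < len && Source[i] != '<')
            {
                title += Source[i];
                i++;
            }
            return title.Trim();
        }
    }
    // No title tag was found.
    return "";
}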
The code for getting the text and the links is in the attached file.
Usage
To use this code, add the library to your project, then create an instance of the class: Parser.Parser inst = new Parser.Parser();
Use inst to call the functions: inst.GetTitle(page)
to get the title,
inst.GetText(page)
to get the text, and
inst.MakeLinks(page)
to build the links.
After calling MakeLinks, you will find the results in pLabel
and pLink
which hold the label that appears on the page and the link it points to.
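To show how a link and its label can be collected side by side, here is a minimal sketch that fills two parallel lists. It is not the MakeLinks code from the DLL: the names LinkSketch, CollectLinks, and ReadHref are hypothetical, and the two lists only stand in for pLink and pLabel, whose actual types are not shown in this article.

```csharp
using System;
using System.Collections.Generic;

public static class LinkSketch
{
    // Fills two parallel lists: links[n] is the href, labels[n] its anchor text.
    public static void CollectLinks(string source, List<string> links, List<string> labels)
    {
        int i = 0;
        while (i < source.Length)
        {
            int open = source.IndexOf("<a", i, StringComparison.OrdinalIgnoreCase);
            if (open < 0) break;
            i = open + 2;
            // Require whitespace after "<a" so <abbr>, <area>, ... are not matched.
            if (i >= source.Length || !char.IsWhiteSpace(source[i])) continue;
            int tagEnd = source.IndexOf('>', i);
            if (tagEnd < 0) break;
            int close = source.IndexOf("</a", tagEnd, StringComparison.OrdinalIgnoreCase);
            if (close < 0) break;
            string href = ReadHref(source.Substring(open, tagEnd - open));
            if (href != null)
            {
                links.Add(href);
                // The label is everything between the opening and closing tags.
                labels.Add(source.Substring(tagEnd + 1, close - tagEnd - 1).Trim());
            }
            i = tagEnd + 1;
        }
    }

    // Returns the href value of an opening tag, or null if it has none.
    private static string ReadHref(string tag)
    {
        int h = tag.IndexOf("href=", StringComparison.OrdinalIgnoreCase);
        if (h < 0) return null;
        int start = h + 5;
        if (start >= tag.Length) return "";
        char quote = tag[start];
        if (quote == '"' || quote == '\'')
        {
            // Quoted value: read up to the matching quote.
            int endQuote = tag.IndexOf(quote, start + 1);
            if (endQuote < 0) return tag.Substring(start + 1);
            return tag.Substring(start + 1, endQuote - start - 1);
        }
        // Unquoted value: read up to the next whitespace.
        int end = start;
        while (end < tag.Length && !char.IsWhiteSpace(tag[end])) end++;
        return tag.Substring(start, end - start);
    }
}
```

A label that contains nested tags, such as an image inside the link, is returned with its markup intact, so it may need a further pass to strip tags.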
Resources
C# DLL in .NET 2005
Contact me
If there is a problem, please contact me at ahmed_a_e2006@yahoo.com