Yesterday I was busy with HTML to PDF conversion, and for this I used the HTML Agility Pack. Everything worked great, except that IE and FF/Chrome seemed to render different HTML. So today I took some fairly straightforward HTML and pushed it through the HTML Agility Pack:
New Website Under Construction
If I use this code to loop through the child nodes:
    // Requires: using System; using System.IO; using System.Text; using HtmlAgilityPack;
    HtmlDocument doc = new HtmlDocument();
    string s;
    StringBuilder builder = new StringBuilder();

    // Read the saved page line by line into a single string.
    using (StreamReader reader = new StreamReader(@"C:\Documents and Settings\user\Desktop\fremus.net\index.htm"))
    {
        while ((s = reader.ReadLine()) != null)
        {
            builder.AppendLine(s);
        }
    }

    doc.LoadHtml(builder.ToString());
    Console.WriteLine(doc.DocumentNode.ChildNodes.Count);

    // Print the node names for the top three levels of the tree.
    foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
    {
        Console.WriteLine(node.Name);
        foreach (HtmlNode childNode in node.ChildNodes)
        {
            Console.WriteLine("\t\t" + childNode.Name);
            foreach (HtmlNode grandChildNode in childNode.ChildNodes)
            {
                Console.WriteLine("\t\t\t" + grandChildNode.Name);
            }
        }
    }
I get the following result in my command line window:
As you can see from the output, the html node has a #text node. The head node has a #text node too, and it has 9 child nodes, 5 of them #text nodes. The body node has a #text node as well, and it has 7 child nodes: four #text nodes and three div nodes. So what is this #text node? If you read this article on the W3Schools site you will see that it states:
A common error in DOM processing is to expect an element node to contain text.
However, the text of an element node is stored in a text node.
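To make the quoted point concrete in HTML Agility Pack terms, here is a minimal sketch. The markup string and the `TitleTextNodeDemo` class are made up for illustration and are not the page parsed in this post. The sketch assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class TitleTextNodeDemo
{
    // Returns the name of the first child of the first <title> element,
    // or null when there is no <title> or it has no children.
    public static string FirstChildNameOfTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode title = doc.DocumentNode.Descendants("title").FirstOrDefault();
        return title?.FirstChild?.Name;
    }

    public static void Main()
    {
        // Hypothetical sample markup (an assumption, not the original page).
        const string html =
            "<html><head><title>DOM Tutorial</title></head><body></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode title = doc.DocumentNode.Descendants("title").First();

        Console.WriteLine(title.Name);                 // name of the element node
        Console.WriteLine(title.FirstChild.Name);      // name of its child text node
        Console.WriteLine(title.FirstChild.InnerText); // the text itself lives in the child
    }
}
```

The title element does not hold the string directly. The string belongs to a child node whose name is #text, which is the same kind of node that shows up in the loop output above.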
On the same page it then gives an example using a title tag. If you search Google for “html #text node”, the second result points to an article, and from its section on nodes it seems that each #text node is a child node in its own right. The #text nodes that appear in the body node seem to correspond to the whitespace after each div, or after each element inside the body node. If I change my code slightly:
    Console.WriteLine("\t\t" + childNode.Name);
    foreach (HtmlNode grandChildNode in childNode.ChildNodes)
    {
        Console.WriteLine("\t\t\t" + grandChildNode.Name);
        Console.WriteLine("\t\t\t\t" + grandChildNode.HasChildNodes);
    }
It tells me that the divs have child nodes, but the #text nodes do not. Thus it seems that for each ‘empty space’ inside a node, meaning each run of whitespace between tags, there exists a #text node. If I amend the HTML from earlier like this:
Then the footer div will have two text nodes, and the paragraph node will have one text node. My issues yesterday had to do with the way IE rendered the HTML: when I used the HTML Agility Pack to parse it, the node counts weren’t the same. From the sample HTML I have given so far that difference is negligible, but when I went to a site like this one, saved the HTML from IE and from Chrome into separate HTML files, and ran my code against each file, I got different node counts. Here are two screenshots that illustrate this:
The first screenshot shows the HTML of the page as saved from Chrome, and the second shows it as saved from IE. Notice the extra text nodes.
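One way to check whether such a difference comes only from whitespace is to count element nodes separately from whitespace-only #text nodes. The following is a rough sketch under a few assumptions. The `NodeCountDemo` class, the `CountNodes` helper and the two inline strings are hypothetical stand-ins for the saved IE and Chrome files. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class NodeCountDemo
{
    // Counts element nodes, text nodes with visible content,
    // and text nodes that consist of whitespace only.
    public static (int Elements, int Text, int WhitespaceText) CountNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants() walks every node below the document node.
        var nodes = doc.DocumentNode.Descendants().ToList();

        int elements = nodes.Count(n => n.NodeType == HtmlNodeType.Element);
        int whitespace = nodes.Count(n => n.NodeType == HtmlNodeType.Text
                                          && string.IsNullOrWhiteSpace(n.InnerText));
        int text = nodes.Count(n => n.NodeType == HtmlNodeType.Text) - whitespace;

        return (elements, text, whitespace);
    }

    public static void Main()
    {
        // Hypothetical stand-ins for the two saved files:
        // the same markup, formatted with and without whitespace.
        const string compact = "<div><p>Hello</p></div>";
        const string spaced = "<div>\n  <p>Hello</p>\n</div>\n";

        Console.WriteLine(CountNodes(compact));
        Console.WriteLine(CountNodes(spaced));
    }
}
```

If the element counts match and only the whitespace-only count differs, the two saved files probably differ in formatting rather than in structure. For real files, the strings could be read with `File.ReadAllText` before being passed to `CountNodes`.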