Interesting code with HtmlAgilityPack

Yesterday I was busy with HTML to PDF conversion, and for this I used the HTML Agility Pack. Everything worked great, except that IE and FF/Chrome seemed to render different HTML. So today I took some fairly straightforward HTML and pushed it through the HTML Agility Pack:

New Website Under Construction

If I use this code to loop through the child nodes:

            // Requires: using System; using System.IO; using System.Text; using HtmlAgilityPack;
            HtmlDocument doc = new HtmlDocument();
            string s;
            StringBuilder builder = new StringBuilder();

            // Read the saved page line by line into a single string.
            using (StreamReader reader = new StreamReader(@"C:\Documents and Settings\user\Desktop\fremus.net\index.htm"))
            {
                while ((s = reader.ReadLine()) != null)
                {
                    builder.AppendLine(s);
                }
            }
            doc.LoadHtml(builder.ToString());

            // Print the number of top-level nodes, then walk three levels of the tree.
            Console.WriteLine(doc.DocumentNode.ChildNodes.Count);
            foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
            {
                Console.WriteLine(node.Name);
                foreach (HtmlNode childNode in node.ChildNodes)
                {
                    Console.WriteLine("\t\t" + childNode.Name);
                    foreach (HtmlNode grandChildNode in childNode.ChildNodes)
                    {
                        Console.WriteLine("\t\t\t" + grandChildNode.Name);
                    }
                }
            }

I get the following result in my command line window:
[Screenshot: command-line output]

As you can see from the output, the html node has a #text node. The head node has a #text node as well, and it has 9 child nodes, 5 of them #text nodes. The body node also has a #text node, and it has 7 child nodes: 4 #text nodes and 3 div elements. So what is this #text node? If you read this article on the W3C site, you will see that it states:

A common error in DOM processing is to expect an element node to contain text.

However, the text of an element node is stored in a text node.
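To see what that quote means in HTML Agility Pack terms, here is a minimal sketch. The class name, the method name and the h1 markup are made up for illustration and are not part of the page I parsed above. It loads a tiny fragment and asks for the name of the first child of the h1 element, which should come back as #text rather than as the text itself:

```csharp
using System;
using HtmlAgilityPack;

static class ElementTextDemo
{
    // Returns the name of the first child node of the element matched by xpath.
    public static string FirstChildName(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode element = doc.DocumentNode.SelectSingleNode(xpath);
        return element.FirstChild.Name;
    }

    static void Main()
    {
        // Made-up sample markup, used only for illustration.
        string html = "<html><body><h1>Hello</h1></body></html>";

        // The text "Hello" is not stored on the h1 node itself;
        // it sits in a child node whose name is "#text".
        Console.WriteLine(FirstChildName(html, "//h1"));
    }
}
```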

On the same page it then gives an example using a title tag. If you search Google for “html #text node”, you will see that the second result points to an article, and if you read the part on nodes, it seems that each #text node is a child node. The #text nodes that appear in the body node seem to correspond to the whitespace after each div, or after each element, inside the body node. If I change my code slightly:

                    Console.WriteLine("\t\t" + childNode.Name);
                    foreach (HtmlNode grandChildNode in childNode.ChildNodes)
                    {
                        Console.WriteLine("\t\t\t" + grandChildNode.Name);
                        Console.WriteLine("\t\t\t\t" + grandChildNode.HasChildNodes);
                    }

This tells me that the divs have child nodes, but the #text nodes do not. So it seems that for each ‘empty space’ inside a node there is a #text node. If I amend the HTML from earlier like this:

Then the footer div will have two #text nodes, and the paragraph node will have one #text node. My issues yesterday had to do with the way IE rendered the HTML: when I used the HTML Agility Pack to parse it, the node counts weren’t the same. For the sample HTML I have given so far that difference is negligible, but when I went to a site like this one, saved the HTML from IE and from Chrome into separate HTML files, and ran my code against each file, I got different node counts. Here are two screenshots that illustrate this:
[Screenshots: chrome, ie]

The first screenshot shows the HTML of the page saved from Chrome, and the second shows the HTML saved from IE. Notice the extra text nodes.
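If the extra nodes are whitespace-only #text nodes, one way to make the two counts comparable might be to leave those nodes out when counting. The sketch below is only an illustration under that assumption: NodeCounter and CountNodes are names I made up, the two HTML strings are invented samples rather than the saved pages, and string.IsNullOrWhiteSpace needs .NET 4 or later.

```csharp
using System;
using HtmlAgilityPack;

static class NodeCounter
{
    // Counts every node below 'node'. When skipWhitespaceText is true,
    // text nodes that contain only whitespace are left out of the count.
    public static int CountNodes(HtmlNode node, bool skipWhitespaceText)
    {
        int count = 0;
        foreach (HtmlNode child in node.ChildNodes)
        {
            bool isBlankText = child.NodeType == HtmlNodeType.Text
                               && string.IsNullOrWhiteSpace(child.InnerText);
            if (!(skipWhitespaceText && isBlankText))
            {
                count++;
            }
            // Text nodes have no children, so the recursion ends there.
            count += CountNodes(child, skipWhitespaceText);
        }
        return count;
    }

    static void Main()
    {
        // Two made-up snippets that differ only in the whitespace between tags.
        var compact = new HtmlDocument();
        compact.LoadHtml("<div><span>one</span><span>two</span></div>");
        var spaced = new HtmlDocument();
        spaced.LoadHtml("<div>\n  <span>one</span>\n  <span>two</span>\n</div>");

        // Raw counts: the spaced version should report more nodes.
        Console.WriteLine(CountNodes(compact.DocumentNode, false));
        Console.WriteLine(CountNodes(spaced.DocumentNode, false));

        // Counts without whitespace-only text nodes: these should match.
        Console.WriteLine(CountNodes(compact.DocumentNode, true));
        Console.WriteLine(CountNodes(spaced.DocumentNode, true));
    }
}
```

If the counts still differ with skipWhitespaceText set to true, the two browsers saved genuinely different markup, not just different whitespace.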
