Fremus.co.za

Demystifying Life and Web Development

Springbok Selector App

You all love the Springboks, right? Those men in green who do South Africa proud? Well, I guess the Kiwis or Aussies or Pommies or… err, wait, OK, some of you don't like the Springboks that much. But I love 'em, and the other night I started thinking about writing an app that would allow me to select a Springbok squad (or two or three) from the Super 14 teams that are currently playing.

So I collected some player and squad information and put it into a database. I used HtmlAgility to screen scrape the player information from www.sarugby.net, and it was quite easy to get the information and store it. You can see on this page (it might be a bit slow in coming through) that you can search for rugby players, and once you have found a player you can also get the rest of his team mates in his squad. Scary what you can do with a good HTML parser. I then took the information, created a few objects with properties and methods, and stored it. Note that my intention is not to make money! I just liked the code that went into it.

The data that I stored is available through an incomplete interface here. You can click on the squads to the right, which displays the players for that squad, and if you click on a player you see their information displayed. My next idea was to assign each player to one or more positions, create icons in the left-hand pane that represent rugby jerseys, and make the players you select “addable” to the left until you reach the maximum number of players allowed.
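A rough sketch of what the scrape-into-objects step could look like with HtmlAgilityPack. Everything specific here is an assumption for illustration: the `Player` class, the `players` table class and the three-column layout are made up, and the real pages on www.sarugby.net are almost certainly structured differently.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical shape for a scraped player (not the classes from the original app).
public class Player
{
    public string Name { get; set; }
    public string Squad { get; set; }
    public string Position { get; set; }
}

public static class PlayerScraper
{
    // Parses the data rows of an assumed <table class="players"> into Player objects.
    public static List<Player> ParsePlayers(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var players = new List<Player>();

        // Only rows that contain <td> cells, so the header row is skipped.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//table[@class='players']//tr[td]");
        if (rows == null)
        {
            return players;
        }

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("td");
            if (cells == null || cells.Count < 3)
            {
                continue; // not a row in the assumed three-column layout
            }

            players.Add(new Player
            {
                Name = HtmlEntity.DeEntitize(cells[0].InnerText).Trim(),
                Squad = HtmlEntity.DeEntitize(cells[1].InnerText).Trim(),
                Position = HtmlEntity.DeEntitize(cells[2].InnerText).Trim()
            });
        }

        return players;
    }

    public static void Main()
    {
        // Made-up sample markup standing in for a downloaded page.
        string html =
            "<table class='players'>" +
            "<tr><th>Name</th><th>Squad</th><th>Position</th></tr>" +
            "<tr><td>Player One</td><td>Squad A</td><td>Flyhalf</td></tr>" +
            "<tr><td>Player Two</td><td>Squad B</td><td>Hooker</td></tr>" +
            "</table>";

        foreach (Player p in ParsePlayers(html))
        {
            Console.WriteLine(p.Name + " (" + p.Squad + ", " + p.Position + ")");
        }
    }
}
```

Keeping the parsing in one method that takes a string means the same code works on a downloaded page or on a file saved to disk, and the resulting `Player` list can then be written to whatever database you use.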

I enjoyed doing it anyway, and I’m trying to do small projects at night to keep me motivated.

posted by fr3dr1k in C#,Springboks and have No Comments

Interesting code with HtmlAgilityPack

Yesterday I was busy with HTML to PDF conversion and for this I used the HTML Agility Pack. Everything worked great, except it seemed IE and FF/Chrome render different HTML. So today I took some fairly straightforward HTML and pushed it through HTMLAgility:






	
	




New Website Under Construction

And if I use this code to loop through the childnodes:

            // Needs: using System; using System.IO; using System.Text; using HtmlAgilityPack;
            HtmlDocument doc = new HtmlDocument();
            string s;
            StringBuilder builder = new StringBuilder();
            // Read the saved page line by line into a single string.
            using (StreamReader reader = new StreamReader(@"C:\Documents and Settings\user\Desktop\fremus.net\index.htm"))
            {
                while ((s = reader.ReadLine()) != null)
                {
                    builder.AppendLine(s);
                }
            }
            doc.LoadHtml(builder.ToString());
            // Number of top-level nodes in the document.
            Console.WriteLine(doc.DocumentNode.ChildNodes.Count);
            // Print node names three levels deep: top level, children, grandchildren.
            foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
            {
                Console.WriteLine(node.Name);
                foreach (HtmlNode childNode in node.ChildNodes)
                {
                    Console.WriteLine("\t\t" + childNode.Name);
                    foreach (HtmlNode grandChildNode in childNode.ChildNodes)
                    {
                        Console.WriteLine("\t\t\t" + grandChildNode.Name);
                    }
                }
            }

I get the following result in my command line window:
[Screenshot: cmdline]

As you can see from the output, the html node has a text node. The head node has a text node too, and 9 child nodes in total, 5 of them #text nodes. The body node has a text node as well, and it has 7 child nodes: four #text nodes and three divs. So what is this #text node? If you read this article on the W3C site you will see that it states:

A common error in DOM processing is to expect an element node to contain text.

However, the text of an element node is stored in a text node.
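A minimal sketch of that point in HtmlAgilityPack terms. The `<h1>` markup and the `TextNodeDemo` helper are made up for illustration; the calls used (`LoadHtml`, `SelectSingleNode`, `FirstChild`, `NodeType`, `InnerText`) are regular HtmlAgilityPack members.

```csharp
using System;
using HtmlAgilityPack;

public static class TextNodeDemo
{
    // Returns the first child node of the first <h1> in the markup,
    // or null if there is no <h1> or it has no children.
    public static HtmlNode FirstChildOfHeading(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode heading = doc.DocumentNode.SelectSingleNode("//h1");
        return heading == null ? null : heading.FirstChild;
    }

    public static void Main()
    {
        HtmlNode child = FirstChildOfHeading("<h1>New Website Under Construction</h1>");

        // The heading element itself holds no text; its child node does.
        Console.WriteLine(child.Name);                            // #text
        Console.WriteLine(child.NodeType == HtmlNodeType.Text);   // True
        Console.WriteLine(child.InnerText);                       // New Website Under Construction
    }
}
```

So when the loop prints `#text`, it is printing the name of one of these text nodes, not the name of an element.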

On the same page it then gives an example using a title tag. If you Google “html #text node”, you will see that the second result points to an article, and if you read the bit on nodes it seems that each #text node is a child node in its own right. The #text nodes that appear in the body node seem to correspond to the whitespace after each div, or after each element inside the body node. If I change my code slightly:

                    Console.WriteLine("\t\t" + childNode.Name);
                    foreach (HtmlNode grandChildNode in childNode.ChildNodes)
                    {
                        Console.WriteLine("\t\t\t" + grandChildNode.Name);
                        Console.WriteLine("\t\t\t\t" + grandChildNode.HasChildNodes);
                    }

It tells me that the divs have child nodes, but the #text nodes do not. Thus it seems that for each bit of ‘empty space’ (whitespace) inside a node there exists a #text node. If I amend the HTML from earlier like this:





	
	







Then the footer div will have two text nodes, and the paragraph node will have one text node. My issues yesterday had to do with the way IE rendered the HTML: when I used HTMLAgility to parse it, the node counts weren’t the same. With the sample HTML I have given so far the difference is negligible, but I found that if I went to a site like this one, saved the HTML from IE and from Chrome into separate HTML files, and ran my code against each file, I got different node counts. Here are two screenshots that illustrate this:
[Screenshots: chrome, ie]

The first screenshot is the HTML from the page saved from Chrome and the second one is from IE. Notice the extra text nodes.
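One way to make two saves of the same page comparable is to leave whitespace-only #text nodes out of the count. This is only a sketch under assumptions: the two strings below are made-up stand-ins for the saved files, `NodeCounter` is a hypothetical helper, and it only evens out differences caused by whitespace. If a browser also rewrites the markup in other ways when saving, the counts can still differ.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class NodeCounter
{
    // True for #text nodes that contain nothing but spaces, tabs or newlines.
    public static bool IsWhitespaceText(HtmlNode node)
    {
        return node.NodeType == HtmlNodeType.Text
            && string.IsNullOrWhiteSpace(node.InnerText);
    }

    // Counts every node below the document root, optionally skipping
    // whitespace-only text nodes.
    public static int CountNodes(string html, bool ignoreWhitespaceText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants()
            .Count(n => !(ignoreWhitespaceText && IsWhitespaceText(n)));
    }

    public static void Main()
    {
        // Same elements, different whitespace (hypothetical sample markup).
        string compact = "<div id='footer'><span>Hello</span><span>World</span></div>";
        string indented = "<div id='footer'>\n  <span>Hello</span>\n  <span>World</span>\n</div>";

        // Raw counts: the indented version has extra #text nodes.
        Console.WriteLine(CountNodes(compact, false));
        Console.WriteLine(CountNodes(indented, false));

        // Counts with whitespace-only text nodes left out: these two agree.
        Console.WriteLine(CountNodes(compact, true));
        Console.WriteLine(CountNodes(indented, true));
    }
}
```

For real files, the `html` argument could come from `File.ReadAllText(path)` for each saved page.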

posted by fr3dr1k in Browsers,C# and have No Comments