GotDotNet

KB002


Specifying a custom UserAgent to download and parse HTML pages from the web, and other tricks


Symptoms



The standard methods of the Acrux.Html.HtmlDocument class for loading a web page from the internet do not offer advanced control over the HTTP request, such as setting the UserAgent, handling cookies or controlling redirection. You can control these aspects by creating the HTTP request yourself and passing the downloaded HTML to the parser, as in the code below:


Source Code


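// Namespaces used by this snippet: Stream/StreamReader, HttpWebRequest/CookieContainer, Encoding
using System.IO;
using System.Net;
using System.Text;

Acrux.Html.HtmlDocument doc = new Acrux.Html.HtmlDocument();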

string url = "http://www.google.com/";
string userAgent = "My Custom User Agent";

// Create is a static method declared on WebRequest; the cast yields the HTTP-specific type
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
req.CookieContainer = new CookieContainer();
req.UserAgent = userAgent;

// In some cases the server returns code 302 with a session cookie and requests a redirect
// to a different page. If automatic redirects are allowed, the cookie may be lost between the
// redirects performed by .NET. To get around this, disallow automatic redirects and store the
// cookie yourself when a 302 is returned
req.AllowAutoRedirect = false;

HttpWebResponse resp = (HttpWebResponse)req.GetResponse();

if (resp.StatusCode == HttpStatusCode.Found)
{
    CookieCollection myCookies = resp.Cookies;

    // The Location header may contain a server-relative URL, so resolve it against the
    // original URL. An absolute Location is used as is. If the header is missing,
    // request the original URL again.
    string location = resp.Headers["Location"];
    Uri target = location != null ? new Uri(new Uri(url), location) : new Uri(url);

    // Close the first response to release its connection before sending the next request
    resp.Close();

    req = (HttpWebRequest)WebRequest.Create(target);
    req.CookieContainer = new CookieContainer();

    // Add the saved cookies from the previous response (if any)
    req.CookieContainer.Add(myCookies);

    req.UserAgent = userAgent;
    resp = (HttpWebResponse)req.GetResponse();
}

if (resp.StatusCode == HttpStatusCode.OK)
{
    using (Stream repStream = resp.GetResponseStream())
    {
        Encoding enc = Encoding.GetEncoding(resp.CharacterSet);
        StreamReader rdr = new StreamReader(repStream, enc, true);

        // NOTE: It is important to read the full stream in one go. 
        //       Buffered reading could lead to problems because of the encoding
        doc.LoadHtml(rdr.ReadToEnd());
    }
}             
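
NOTE: Encoding.GetEncoding(resp.CharacterSet) in the code above throws an exception when the server reports an empty or unrecognised character set. The helper below is a minimal sketch of one way to guard against that. CharsetHelper and ResolveEncoding are hypothetical names used for illustration only (they are not part of the Acrux parser or of .NET), and the UTF-8 fallback is an assumption that may not suit every site.

```csharp
using System;
using System.Text;

static class CharsetHelper
{
    // Hypothetical helper: maps the charset name reported by the server to an
    // Encoding. Falls back to UTF-8 (an assumed default) when the name is
    // missing, empty, or not known to the runtime.
    public static Encoding ResolveEncoding(string charset)
    {
        if (charset == null)
            return Encoding.UTF8;

        // Strip whitespace and quotes in case the value arrives quoted,
        // for example charset="utf-8"
        string name = charset.Trim().Trim('"', '\'');
        if (name.Length == 0)
            return Encoding.UTF8;

        try
        {
            return Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            // GetEncoding(string) throws ArgumentException for unknown names
            return Encoding.UTF8;
        }
    }
}
```

To use it in the code above, pass CharsetHelper.ResolveEncoding(resp.CharacterSet) to the StreamReader constructor instead of calling Encoding.GetEncoding directly. The third StreamReader argument (true) still lets a byte order mark in the stream override the chosen encoding.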
         

Applies To


This article applies to Acrux Advanced Html Parser.


© 2007-2008 Acrux Software.