KB002
Specifying a custom UserAgent when downloading and parsing HTML pages from the web, and other tricks
Symptoms
The standard methods of the Acrux.Html.HtmlDocument class for loading a web page from the internet do not expose advanced HTTP settings such as the UserAgent, cookies, or redirection. You can control these aspects of the HTTP request by issuing the request yourself and passing the downloaded HTML to the document, as in the code below:
Source Code
using System.IO;
using System.Net;
using System.Text;

Acrux.Html.HtmlDocument doc = new Acrux.Html.HtmlDocument();

string url = "http://www.google.com/";
string userAgent = "My Custom User Agent";

HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
req.CookieContainer = new CookieContainer();
req.UserAgent = userAgent;

// In some cases the server may return code 302 with a session cookie and request a redirect
// to a different page. If auto-redirect is allowed, the cookie will be lost between the redirects
// done by .NET. To get around this, disallow redirects and store the cookie in case of a 302.
req.AllowAutoRedirect = false;

HttpWebResponse resp = (HttpWebResponse)req.GetResponse();

if (resp.StatusCode == HttpStatusCode.Found)
{
    // Save what is needed from the 302 response, then release its connection
    CookieCollection myCookies = resp.Cookies;
    string location = resp.Headers["Location"];
    resp.Close();

    // WARNING: Be careful what the Location header contains. It could be a server-relative URL;
    // in such a case you will need to compose the full URL yourself and pass it to Create()
    req = (HttpWebRequest)WebRequest.Create(location != null ? location : url);
    req.CookieContainer = new CookieContainer();

    // Add the saved cookies from the previous response (if any)
    req.CookieContainer.Add(myCookies);
    req.UserAgent = userAgent;

    resp = (HttpWebResponse)req.GetResponse();
}

if (resp.StatusCode == HttpStatusCode.OK)
{
    // CharacterSet may be null or empty if the server sends no charset;
    // falling back to UTF-8 here is an assumed default, not something the server guarantees
    string charset = resp.CharacterSet;
    Encoding enc = string.IsNullOrEmpty(charset) ? Encoding.UTF8 : Encoding.GetEncoding(charset);

    using (Stream respStream = resp.GetResponseStream())
    using (StreamReader rdr = new StreamReader(respStream, enc, true))
    {
        // NOTE: It is important to read the full stream in one go.
        // Buffered reading could lead to problems because of the encoding
        doc.LoadHtml(rdr.ReadToEnd());
    }
}

resp.Close();
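The warning in the code above notes that the Location header may contain a server-relative URL. The following is a minimal sketch of one way to compose the full URL, using the standard System.Uri class. The names RedirectHelper and ResolveLocation are hypothetical, introduced only for illustration; they are not part of the Acrux library or of the original snippet.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Acrux.Html.
public static class RedirectHelper
{
    // Returns an absolute URL for a redirect target.
    // requestUrl: the absolute URL that was originally requested.
    // location:   the raw value of the Location header (may be null, relative, or absolute).
    public static string ResolveLocation(string requestUrl, string location)
    {
        // No Location header: fall back to the original URL, as the snippet above does
        if (string.IsNullOrEmpty(location))
        {
            return requestUrl;
        }

        // Uri(Uri, string) resolves a relative reference against the base URL
        // and returns an absolute location unchanged
        Uri baseUri = new Uri(requestUrl);
        Uri target = new Uri(baseUri, location);
        return target.AbsoluteUri;
    }
}
```

For example, ResolveLocation("http://www.example.com/a/page.html", "/login?id=1") returns "http://www.example.com/login?id=1", while an already absolute Location value such as "https://other.example.org/home" is returned as is. The result could then be passed to WebRequest.Create() in place of the raw header value.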
Applies To
This article applies to Acrux Advanced Html Parser.




