An Experience:
Microsoft's advice is to always use the managed HttpWebRequest and
HttpWebResponse classes rather than the unmanaged WinInet.dll. As I
understand it, WinInet.dll could stop being usable from managed code,
let alone from an .ASPX page, in any upcoming service pack, and it is
expected to be gone with Longhorn except for legacy applications. I
agreed fully with this approach to network programming until I had a
special experience with one large company's server.

We were given permission by this company to hack their website so
that an agent of theirs (who was a client of ours) could separate
sales by dealer. The company has a contract with an agent who has 100
dealers. The website handles sales for a single dealer, but has no
ability to pull an agent's dealers together while still keeping them
separate. This is the capability we needed to add. Since a sale is a
very complex relationship with the company, the agent must rely on
the progress of the sale as it moves through the website, which
checks credit and many other things and actually allows the sale to
occur. There are also chargebacks involved. These and other reasons
make the dependence on the company website crucial. A good solution
would also allow the agent to win dealers away from other agents, who
cannot pay commissions as quickly as an agent that has immediate
feedback from the company website. In fact, the agent almost tripled
their number of dealers almost immediately.

There is another reason to prefer HttpWebRequest over WinInet.dll:
Microsoft does not support the use of WinInet.dll from an .ASPX page,
for many reasons having to do with very different platforms. In the
past, while trying to automate creation of an Outlook Task from an
.ASPX page, I found their warnings to be true, but WinInet.dll does
seem to work. What happened to me was that I found a situation that
required using WinInet rather than the managed alternative. I'll show
you how the HttpWebRequest attempt got into trouble, and you might
also learn something about programming the API from .NET along the
way.
The first step was to change one line in the source of the HTML pages
(which contained a lot of JavaScript) that requested the interesting
pages we wanted to screen scrape with regular expressions (more on
this later). We changed their form tag so that it submitted to a
simple submit.aspx file with no interface (no controls) on our server
instead of the usual server. That let us list the details of each
request so we would know what to fill our HttpWebRequest object with.
We also had to examine the cookies that the big company server put on
our computers when we used their website, and I had to learn how to
install a client certificate so it would be used by the ASP.NET user
account that is configured in the machine.config file. The form tag
change is shown first, followed by the code-behind of submit.aspx
that listed what their website was sending to their server to reach
the valuable screens:
<form name="myform" method="post" action="/mysubfolder/myjavascript.jsp">
becomes
<form name="myform" method="post" action="submit.aspx">
Imports System.Collections.Specialized
Imports System.Diagnostics
Public Class submit : Inherits System.Web.UI.Page
    Private Sub Page_Load(ByVal sender As System.Object, _
            ByVal e As System.EventArgs) Handles MyBase.Load
        Dim loop1 As Integer
        Dim loop2 As Integer
        Dim arr1() As String
        Dim arr2() As String
        Dim coll As NameValueCollection
        ' Load ServerVariable collection into NameValueCollection object.
        coll = Request.ServerVariables
        ' Get names of all keys into a string array.
        arr1 = coll.AllKeys
        For loop1 = 0 To arr1.GetUpperBound(0)
            Response.Write("Key: " & arr1(loop1) & "<br>")
            ' Get all values under this key.
            arr2 = coll.GetValues(loop1)
            For loop2 = 0 To arr2.GetUpperBound(0)
                Response.Write("Value " & CStr(loop2) & ": " & _
                    arr2(loop2) & "<br>")
            Next loop2
        Next loop1
    End Sub
End Class
I began programming against this server, which I knew was more
complex than any I had programmed against before, with code like the
following. I can't show every little detail because the agent's
competitors would love to know about some of them. One thing not
shown is an important consideration for client certificates.
HttpWebRequest has a read-only ClientCertificates property that gets
the collection of client certificates associated with the request.
Just because a .NET application has added an existing certificate to
this collection does not mean that the application has permission to
access the certificate. The application must run with the same access
rights as the account that installed the certificate. X509 is an
important standard certificate type for servers, and HttpWebRequest
supports it fine. (Note: if you just want to learn about WinInet.dll
access from .NET, skip ahead now.)
Imports System.Net
Imports System.IO
' Needed for the X509Certificate class used in SignIn
Imports System.Security.Cryptography.X509Certificates
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic.ControlChars
Imports System.Web.HttpContext
Public Class BigCompanyServer
    Dim result, reqHeader, resHeader As String
    Dim mode As String = XmlSetting.Read("appsettings", "mode")
    ' testing versus real mode
    Dim domain As String = IIf(mode = "T", _
        XmlSetting.Read("appsettings", "domaint"), _
        XmlSetting.Read("appsettings", "domainp"))
    Dim loginPath As String = XmlSetting.Read("appsettings", "loginpath")
    Dim myCookies As New CookieContainer()
    Dim cookies As New CookieCollection()
    Dim cookie As New Cookie()
    Public Function SignIn() As Boolean
        Dim loginParameters As String
        If mode = "T" Then
            loginParameters = "?ACTION=LOGIN&CHPWD=&WN_VIEW_FLAG=false&USERS_COOKIE=CODE-1047055773000&USERID=NOCAL16&PASSWD=WINTER&RESERVEID=SAINT"
        Else
            loginParameters = "?ACTION=LOGIN&CHPWD=&WN_VIEW_FLAG=false&USERS_COOKIE=CODX-20030604074443&USERID=XXX29293&PASSWD=FINGERPRINT923&RESERVEID=9234"
        End If
        Dim url As String = domain + loginPath + loginParameters
        Dim uri As New Uri(url)
        Dim req As HttpWebRequest = WebRequest.Create(uri)
        Dim myCookies As New CookieContainer
        req.Method = "GET"
        req.Accept = "*/*"
        ' The next line, for example, might let a server know you are
        ' not the browser it was expecting.
        req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; " & _
            "Windows NT 5.1; Q312461; .NET CLR 1.0.3705)"
        req.Headers.Add("Accept-Language", "en-us")
        req.CookieContainer = myCookies
        Dim cert As IntPtr = _
            CType(XmlSetting.Read("parent", "childsetting"), IntPtr)
        Dim certX509 As New X509Certificate(cert)
        req.ClientCertificates.Add(certX509)
        ' This was used when I had trouble being accepted by the company server:
        'reqHeader = req.Headers.ToString + _
        '    myCookies.GetCookieHeader(req.RequestUri).ToString
        Dim res As HttpWebResponse = req.GetResponse()
        Dim success As Boolean = res.Cookies.Count > 0
        If success Then
            ' Copy the two session cookies into the container for later requests.
            Dim siteRoot As New Uri(String.Format("{0}://{1}", uri.Scheme, uri.Host))
            Dim cookieHeader1 As String = String.Format("{0}={1}", _
                "SESSIONID", res.Cookies("SESSIONID").Value)
            myCookies.SetCookies(siteRoot, cookieHeader1)
            Dim cookieHeader2 As String = String.Format("{0}={1}", _
                "HANDLEID", res.Cookies("HANDLEID").Value)
            myCookies.SetCookies(siteRoot, cookieHeader2)
        End If
        ' Note that I can get both request and response headers for
        ' comparing, as both became important at different times:
        'resHeader = res.StatusCode.ToString + ":" + res.Headers.ToString()
        Dim sr As New StreamReader(res.GetResponseStream())
        result = sr.ReadToEnd()
        sr.Close()
        Current.Session("cookies") = myCookies
        ' development test servers versus production:
        'Current.Session("machine") = whichMachine
        Return success
    End Function
    Public Function GetHtml(ByVal currentPath As String, _
            ByVal parameters As String) As String
        Dim parameters1 As String = IIf(parameters.StartsWith("?"), _
            parameters, "?" + parameters)
        Dim url As String = domain + currentPath + parameters1
        Dim uri As New Uri(url)
        Dim req As HttpWebRequest = WebRequest.Create(uri)
        req.Method = "GET"
        req.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, " & _
            "application/vnd.ms-excel, application/vnd.ms-powerpoint, " & _
            "application/msword, */*"
        req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; " & _
            "Windows NT 5.1; Q312461; .NET CLR 1.0.3705)"
        req.ContentType = "application/x-www-form-urlencoded"
        req.KeepAlive = True
        req.Headers.Add("Accept-Language", "en-us")
        Dim myCookies As CookieContainer = Current.Session("cookies")
        'Dim reqCookies As String = _
        '    myCookies.GetCookieHeader(req.RequestUri).ToString
        'reqHeader = req.Headers.ToString + reqCookies
        req.CookieContainer = myCookies
        Dim res As HttpWebResponse = req.GetResponse()
        'resHeader = res.StatusCode.ToString + ":" + res.Headers.ToString
        myCookies.Add(res.Cookies)
        Current.Session("cookies") = myCookies
        Dim sr As New StreamReader(res.GetResponseStream())
        result = sr.ReadToEnd()
        Dim Logged As Boolean = Not result.IndexOf("function invalidURL") > -1
        sr.Close()
        Return result
    End Function
    Public Function PostHtml(ByVal currentPath As String, _
            ByVal parameters As String) As String
        ' The POST body must not start with the "?" used in query strings.
        Dim parameters1 As String = IIf(parameters.StartsWith("?"), _
            parameters.Substring(1), parameters)
        Dim url As String = domain + currentPath
        Dim uri As New Uri(url)
        Dim req As HttpWebRequest = WebRequest.Create(uri)
        req.Method = "POST"
        req.ContentLength = parameters1.Length
        req.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, " & _
            "application/vnd.ms-excel, application/vnd.ms-powerpoint, " & _
            "application/msword, */*"
        req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; " & _
            "Windows NT 5.1; Q312461; .NET CLR 1.0.3705)"
        req.ContentType = "application/x-www-form-urlencoded"
        req.KeepAlive = True
        req.Headers.Add("Accept-Language", "en-us")
        Dim myCookies As CookieContainer = Current.Session("cookies")
        ' Attach the cookies before opening the request stream so that
        ' they are part of the headers sent with this request.
        req.CookieContainer = myCookies
        Dim reqCookies As String = _
            myCookies.GetCookieHeader(req.RequestUri).ToString
        Dim writer As New StreamWriter(req.GetRequestStream())
        writer.Write(parameters1)
        writer.Close()
        'reqHeader = req.Headers.ToString + reqCookies
        Dim res As HttpWebResponse = req.GetResponse()
        'resHeader = res.StatusCode.ToString + ":" + res.Headers.ToString
        Dim sr As New StreamReader(res.GetResponseStream())
        result = sr.ReadToEnd()
        sr.Close()
        Return result
    End Function
End Class
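The client certificate step that the listing above only hints at could look roughly like the following. This is a minimal sketch and not the code used on the project: the module name, the function name, the URL and the .cer path are all hypothetical. It assumes the certificate has been exported to a .cer file and that the matching private key is installed in a certificate store that the account running ASP.NET is allowed to read, which is the access rights point made earlier.

```vb
Imports System.Net
Imports System.Security.Cryptography.X509Certificates

Module ClientCertSketch
    ' Builds a request for the given URL and attaches the client
    ' certificate stored in certPath (a hypothetical exported .cer file).
    Function CreateRequestWithCertificate(ByVal url As String, _
            ByVal certPath As String) As HttpWebRequest
        Dim req As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        ' CreateFromCertFile reads only the public certificate; the
        ' private key is looked up later in the account's certificate store.
        Dim cert As X509Certificate = X509Certificate.CreateFromCertFile(certPath)
        req.ClientCertificates.Add(cert)
        Return req
    End Function
End Module
```

A call might look like CreateRequestWithCertificate("https://example.com/login", "C:\certs\agent.cer"), where both arguments are placeholders. If the running account cannot reach the private key, the server will typically refuse the connection even though the Add call itself succeeds.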
Using regular expressions to scrape, especially the MatchCollection
class to get the current value or all the lists from all the select
tags, is very efficient. Your results will be even better if you base
the search on something that is not likely to change often. For
example, search for JavaScript function names rather than display
tags that might be changed to make the page look better. After much
initial effort to get a good login, I was confident about the
application going forward, with only a little concern about screen
scraping as a stable method of receiving data.
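As an illustration of the MatchCollection approach, here is a small sketch that pulls every option value out of one named select tag. It is not the scraping code from the project: the module name, the function name, the "dealer" select name and the sample HTML are invented for the example, and the patterns assume reasonably regular markup.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SelectScraper
    ' Returns the value attribute of every <option> found inside the
    ' <select> element whose name attribute equals selectName.
    Function GetOptionValues(ByVal html As String, _
            ByVal selectName As String) As List(Of String)
        Dim values As New List(Of String)
        ' Capture everything between the named <select ...> and </select>.
        Dim selectPattern As String = _
            "<select[^>]*\bname\s*=\s*[""']?" & Regex.Escape(selectName) & _
            "(?=[""'\s>])[^>]*>(.*?)</select>"
        Dim selectMatch As Match = Regex.Match(html, selectPattern, _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        If Not selectMatch.Success Then Return values
        ' One match per <option>; group 1 holds the value attribute.
        Dim options As MatchCollection = Regex.Matches( _
            selectMatch.Groups(1).Value, _
            "<option[^>]*\bvalue\s*=\s*[""']?([^""'\s>]*)", _
            RegexOptions.IgnoreCase)
        For Each m As Match In options
            values.Add(m.Groups(1).Value)
        Next
        Return values
    End Function

    Sub Main()
        Dim html As String = "<select name=""dealer"">" & _
            "<option value=""101"">North</option>" & _
            "<option value=""102"">South</option></select>"
        ' Prints 101 and then 102, one per line.
        For Each v As String In GetOptionValues(html, "dealer")
            Console.WriteLine(v)
        Next
    End Sub
End Module
```

Anchoring on the name attribute rather than on surrounding layout tags follows the advice above: names that the page's scripts depend on tend to change less often than markup that only affects appearance.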
An Alternative:
It turned out that screen scraping never became a problem, since the
web pages were rarely changed. What did change constantly was the
server security maintained by the large company's software
development team, due to their concerns about hacking. They could
somehow regularly (at least once a week) find a way to exclude our
requests while still accepting all their browser requests. Each time,
it was very difficult for me, as a non-expert network programmer, to
solve. Microsoft will tell you over and over that anything
WinInet.dll does with a server can be exactly duplicated with the
HttpWebRequest and HttpWebResponse classes. I believe this, but I
would have to be intelligent enough to duplicate everything
programmed into WinInet, which is very good at reacting intelligently
to every communication with a server and at duplicating what the
server sends back in the browser's very next request.

Even though the company had given us permission, their own software
team had not; in fact, they were increasingly worried about hacking
attempts, which is what we looked like to them. If I had known what
they were worried about, I probably could have programmed against it
easily, but I was working blind, and we finally gave up after about
15 weeks of fixes. Company management and their software team never
could get together to help us. I next took a tried and true
third-party Visual FoxPro tool, wwipstuff.dll by Rick Strahl, and
wrapped it in a Visual FoxPro web service controlled from a VB.NET
.ASPX page. It worked immediately and ran flawlessly for two years
without any problem.
Today my goal was to make the difficult deployment of the web service
(2 hours on average) to each new server the client bought easier, and
to see if I could speed up the processing time as well. This is why I
wanted to control WinInet.dll as directly as possible from a VB.NET
.ASPX page. I figured a quick search of Google would supply me with
some code and in an hour I'd have a nice improvement available to the
client. Wrong! I found helpful articles related to FTP with WinInet,
but could not get the code sample conversions to work at all for
screen scraping. Below is the final code that works smoothly from an
.ASPX page.

Note the A at the end of InternetOpenA and InternetOpenUrlA. The A
stands for ANSI; if you replace it with W you get the Unicode
version, where one exists for a function. Note also the important
DllImport attribute before each shared function that is associated
with a DLL entry point. DllImportAttribute comes from the
System.Runtime.InteropServices namespace.

DllImport has optional properties that can be set. When character or
string data is involved as input and/or output, you would usually set
the CharSet property to CharSet.Auto. InternetOpen and
InternetOpenUrl come in A and W versions, which makes it more likely
that one must be chosen explicitly. The A entry points require
CharSet.Ansi, and a W entry point would need CharSet.Unicode to
match; I tried the W version to no avail, so the code here specifies
CharSet.Ansi. If you don't specify SetLastError:=True, you will not
be able to use the error checking method I am showing here,
Marshal.GetLastWin32Error, which is the reliable way to get the error
code from .NET. Calling the GetLastError API directly does not work
reliably from .NET, as you can read about in more detail with the top
link at the bottom of the article.

Note that the buffer argument for InternetReadFile is a 1-dimensional
Byte array. I first tried String and then StringBuilder to no avail.
The byte array requires conversion before displaying or scraping the
string.

Notice that the IntPtr data type is used for the handle that the
InternetOpen function returns. It has a ToInt32 method, allowing
compatibility with the InternetCloseHandle function as declared here.
Public Class WebForm1 : Inherits System.Web.UI.Page
    Private Sub Button1_Click(ByVal sender As System.Object, _
            ByVal e As System.EventArgs) Handles Button1.Click
        ' It's a one-liner due to the shared members of the WinInet class.
        ' Note that the second argument is the number of bytes you want
        ' to scrape.
        display.Text = WinInet.GetHtml("http://www.computer-consulting.com", 14000)
    End Sub
End Class
'*************************************************
Imports System.Text
Imports System.Runtime.InteropServices
Imports Microsoft.VisualBasic.ControlChars
Public Class WinInet
    Const INTERNET_ACCESS_TYPE_DIRECT = 1
    Const INTERNET_OPEN_TYPE_PROXY = 3
    Const INTERNET_FLAG_RELOAD = &H80000000
    Const USER_AGENT = "IE"
    Shared handle As IntPtr
    Shared session As Int32
    Shared header As String = "Accept: */*" & Cr & Cr
    Shared newBuffer() As Byte
    Shared bytesRead As Int32
    Shared size As Int32
    Shared response As Int32
    Shared context As Integer = 0
    Shared flags As Integer = 0
    Shared errorNum As Integer
    Public Shared Function GetHtml(ByVal url As String, _
            ByVal length As Int32) As String
        Dim result As String
        ' The API functions are shared members of this class, so they
        ' are called without a class prefix.
        handle = InternetOpen(USER_AGENT, INTERNET_ACCESS_TYPE_DIRECT, _
            vbNullString, vbNullString, flags)
        session = InternetOpenUrl(handle, url, header, header.Length, _
            INTERNET_FLAG_RELOAD, context)
        If session = 0 Then
            result = "Error: " & Marshal.GetLastWin32Error()
        Else
            ReDim newBuffer(length - 1)
            response = InternetReadFile(session, newBuffer, length, bytesRead)
            If response = 0 Then
                result = "Error Reading File: " & Marshal.GetLastWin32Error()
            Else
                ' Use the appropriate Encoding here to get a string from
                ' the byte array. Only the bytes actually read are decoded.
                result = Encoding.UTF8.GetString(newBuffer, 0, bytesRead)
            End If
        End If
        InternetCloseHandle(session)
        InternetCloseHandle(handle.ToInt32)
        Return result
    End Function
    <DllImport("WinInet.dll", _
        EntryPoint:="InternetOpenA", _
        CharSet:=CharSet.Ansi, ExactSpelling:=True, _
        SetLastError:=True)> _
    Public Shared Function InternetOpen( _
        ByVal agent As String, _
        ByVal accessType As Int32, _
        ByVal proxyName As String, _
        ByVal proxyBypass As String, _
        ByVal flags As Int32) As IntPtr
    End Function
    <DllImport("WinInet.dll", _
        EntryPoint:="InternetOpenUrlA", _
        CharSet:=CharSet.Ansi, ExactSpelling:=True, _
        SetLastError:=True)> _
    Public Shared Function InternetOpenUrl( _
        ByVal session As IntPtr, _
        ByVal url As String, _
        ByVal header As String, _
        ByVal headerLength As Int32, _
        ByVal flags As Int32, _
        ByVal context As Int32) As Int32
    End Function
    'InternetReadFile
    <DllImport("WinInet.dll", _
        EntryPoint:="InternetReadFile", _
        CharSet:=CharSet.Auto, SetLastError:=True)> _
    Public Shared Function InternetReadFile( _
        ByVal handle As Int32, _
        <MarshalAs(UnmanagedType.LPArray)> _
        ByVal newBuffer() As Byte, _
        ByVal bufferLength As Int32, _
        ByRef bytesRead As Int32) As Int32
    End Function
    <DllImport("WinInet.dll", _
        EntryPoint:="InternetCloseHandle", _
        CharSet:=CharSet.Ansi, ExactSpelling:=True, _
        SetLastError:=True)> _
    Public Shared Function InternetCloseHandle( _
        ByVal hInternet As Int32) As Int32
    End Function
End Class
One could treat the length input parameter of the WinInet class's
GetHtml method as a chunk size and get and append chunks until the
whole web page is scraped, but that was not my need. For some pages I
only need to scrape the first 100 bytes to get logged in, for
example. To append in chunks, use a Do...Loop that ends with a
condition like: Loop While ((bytesRead <> 0) And response)
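The chunked variant described above might be sketched as follows. This is an illustration rather than code from the project: the class name WinInetChunked and the function ReadAll are made up, the handle is declared as IntPtr instead of Int32 (my choice, so that the declaration also fits 64-bit processes), and hUrl is assumed to be an open handle obtained from InternetOpenUrl.

```vb
Imports System
Imports System.IO
Imports System.Runtime.InteropServices
Imports System.Text

Public Class WinInetChunked
    <DllImport("wininet.dll", SetLastError:=True)> _
    Private Shared Function InternetReadFile( _
        ByVal hFile As IntPtr, _
        <Out()> ByVal buffer() As Byte, _
        ByVal bytesToRead As Int32, _
        ByRef bytesRead As Int32) As Boolean
    End Function

    ' Reads from an already opened URL handle in chunkSize pieces until
    ' InternetReadFile reports success with zero bytes (end of data).
    Public Shared Function ReadAll(ByVal hUrl As IntPtr, _
            ByVal chunkSize As Int32) As String
        Dim buffer(chunkSize - 1) As Byte
        Dim page As New MemoryStream()
        Dim bytesRead As Int32 = 0
        Dim ok As Boolean
        Do
            ok = InternetReadFile(hUrl, buffer, chunkSize, bytesRead)
            ' Append only the bytes that this call actually returned.
            If ok AndAlso bytesRead > 0 Then page.Write(buffer, 0, bytesRead)
        Loop While ok AndAlso bytesRead > 0
        If Not ok Then
            Throw New IOException("InternetReadFile failed, error " & _
                Marshal.GetLastWin32Error().ToString())
        End If
        ' Decode once at the end so that a multi-byte character split
        ' across two chunks is not corrupted. UTF-8 is an assumption here.
        Return Encoding.UTF8.GetString(page.ToArray())
    End Function
End Class
```

Collecting the raw bytes in a MemoryStream and decoding once keeps the loop condition the same as the one suggested above while avoiding repeated string concatenation.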
Not only has this short code file solved the 2-hour average
deployment problem of the WinInet web service, but its speed is also
much improved over the previous WinInet.dll usage. So the code above
is very valuable to me, but let's look again at the managed version
versus the WinInet API version. The lengths of the two solutions are
similar. The managed code is much more illustrative of what is going
on, and much more fun to program. The managed code gives more control
and potentially will do anything that WinInet.dll will. The most
important thing, of course, is that the company server still cannot
tell that I am not a browser, as it could with the managed
application. I wish I could use the managed version, but sometimes
the best standards to follow are not the most practical for a
particular situation.
Good luck and send your questions to:
tvoss@computer-consulting.com
or better yet to Email:
aspnet@aspadvice.com where I watch for
questions.