New in PowerShell 3: Parse HTML without IE object (unless a local file)

Remember how in PowerShell v1 and v2 we had to create an Internet Explorer object every time we wanted to parse an HTML page? That approach works, but it has a few inconveniences, such as having to insert Start-Sleep calls here and there, because IE can be busy and will fail if you ask too much of it too quickly.

In PowerShell v3, for web pages, things become much easier. Just do:

$p = Invoke-WebRequest "https://dmitrysotnikov.wordpress.com"

And $p.ParsedHtml.body will let you iterate through all of the page's elements!
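
To make that a bit more concrete, here is a minimal sketch of walking the parsed document. It assumes Windows PowerShell 3.0 to 5.1 with Internet Explorer available on the machine; the choice of the "a" tag and the title lookup are just illustrations, not anything specific to this blog.

```powershell
# Assumes Windows PowerShell 3.0-5.1 with Internet Explorer available.
$p = Invoke-WebRequest "https://dmitrysotnikov.wordpress.com"

# List the target of every hyperlink in the parsed document
$p.ParsedHtml.getElementsByTagName("a") |
    ForEach-Object { $_.href }

# Print the page title
$p.ParsedHtml.title
```

ParsedHtml is the same kind of DOM object you used to get from the IE COM object, so the usual DOM methods such as getElementsByTagName should carry over.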

However, there is one scenario in which you have to revert to the old IE ways: local files. If the HTML file is on your local disk, $p will not have the ParsedHtml property, and you will have to use the IE COM object as you did in earlier versions of PowerShell:

$ie = New-Object -ComObject "InternetExplorer.Application"
# Crude but easy: give IE a moment to start up
Start-Sleep -Seconds 1
$ie.Navigate("D:\SavedPage.htm")
# Same trick again: give IE a moment to load the page
Start-Sleep -Seconds 1
$ParsedHtml = $ie.Document
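
If the fixed one-second sleeps feel too fragile, one possible alternative is to poll IE until it reports that the page has loaded. This is only a sketch: it relies on the Busy and ReadyState properties of the InternetExplorer.Application object, and I have not verified how it behaves under every IE version or security setting, so treat it as an assumption to test on your own machine.

```powershell
$ie = New-Object -ComObject "InternetExplorer.Application"
$ie.Navigate("D:\SavedPage.htm")

# Poll until IE reports the page as loaded; 4 is READYSTATE_COMPLETE
while ($ie.Busy -or $ie.ReadyState -ne 4) {
    Start-Sleep -Milliseconds 100
}

$ParsedHtml = $ie.Document
# ... work with $ParsedHtml here ...

# Close the hidden IE process when done
$ie.Quit()
```

The loop waits only as long as IE actually needs, and the Quit() call keeps stray iexplore.exe processes from piling up across script runs.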

Happy scripting!


  1. Invoke-WebRequest, under the hood, still requires and uses Internet Explorer. In fact, if IE is not available (like on Server Core), or you don’t have permissions (using an account like NETWORK SERVICE or LOCAL SERVICE), then you must call it with -UseBasicParsing, which returns raw HTML instead of a parsed DOM.
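
For the situation described in the comment above, a call with -UseBasicParsing might look like the sketch below. The URL is simply the one used earlier in the post; the Content and Links properties are what the basic response object exposes, while ParsedHtml is not available in this mode.

```powershell
$p = Invoke-WebRequest "https://dmitrysotnikov.wordpress.com" -UseBasicParsing

# Raw HTML as a string
$p.Content.Length

# Links are still extracted without the IE-based DOM
$p.Links | ForEach-Object { $_.href }
```

This is enough for simple tasks such as collecting links; anything that needs a real DOM still requires the IE-backed ParsedHtml or a separate HTML parser.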