Downloading Images from Webpages

During a recent training, a participant asked about an HTML parser to analyze and harvest data from web pages. Unfortunately, there is no .NET type for this, so PowerShell can't help. Or can it? Well, as it turns out, there is an HTML parser after all. Let's create code that downloads all images from any web page you want!

Accessing the IE Parser from PowerShell

A browser like Internet Explorer needs a way to analyze and display HTML code so it most certainly has a parser. To connect to the IE intrinsics, PowerShell can use a COM object named "InternetExplorer.Application". Here is how to load and display a web page:

$ie = New-Object -COMObject InternetExplorer.Application
$ie.Visible = $true
$ie.Navigate('http://www.powershell.com')
while ($ie.Busy) { Start-Sleep -Milliseconds 400 }
'Done!'

Note how I hold the script until the web page is completely loaded and displayed. That's a good idea because when you want to access the document object model (DOM), you want to be sure the document has been fully loaded.
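
The simple loop above waits forever if a page never finishes loading. Here is a minimal sketch of a more defensive wait. The function name Wait-IEPage and the 30 second default are my own assumptions, not part of the IE object model. It checks the browser's ReadyState property in addition to Busy and gives up after a timeout:

```powershell
# Hypothetical helper (name and default timeout are assumptions)
function Wait-IEPage {
    param(
        $Browser,                  # the InternetExplorer.Application COM object
        [int]$TimeoutSeconds = 30  # assumed default, adjust as needed
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    # ReadyState 4 corresponds to READYSTATE_COMPLETE
    while ($Browser.Busy -or $Browser.ReadyState -ne 4) {
        if ((Get-Date) -gt $deadline) {
            throw "Page did not finish loading within $TimeoutSeconds seconds."
        }
        Start-Sleep -Milliseconds 400
    }
}
```

You would call it right after Navigate(), for example: Wait-IEPage -Browser $ie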

Next, you can use your IE-object to access the internal HTML document model through its Document property. It provides access to all kinds of clever methods:

$ie.Document | Get-Member

Finding Images (or Links)

One of the most useful methods is called getElementsByTagName(). It returns all HTML elements with a given tag. To scrape all images from that web page, simply ask for all elements with a tag name of 'img' like this:

$ie.document.getElementsByTagName('img')

PowerShell will throw tons of img objects at you. As it turns out, the image URL is stored in the src property, so to get the URLs of all images, refine your line just a little bit:

$ie.document.getElementsByTagName('img') | Select-Object -ExpandProperty src

Likewise, to collect all links on that page, search for the tag "a" and output its href property:

$ie.document.getElementsByTagName('a') | Select-Object -ExpandProperty href
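
The raw list often contains duplicates, and some entries (an empty src, or an inline data: URI) can't be downloaded as files. As a rough sketch, a small filter like the following cleans the list up. Select-ImageUrl is a hypothetical name I made up for illustration:

```powershell
# Hypothetical helper: tidy up a list of URLs scraped from a page
function Select-ImageUrl {
    param([string[]]$Url)
    $Url |
        Where-Object { $_ -match '^https?://' } |  # keep only http and https URLs
        Sort-Object -Unique                        # sort and drop duplicates
}
```

To use it, pass in the scraped list: Select-ImageUrl ($ie.Document.getElementsByTagName('img') | Select-Object -ExpandProperty src)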

Downloading all Images on a Web Page

Our initial objective was to harvest all images from that web page. We already got the image urls. Next, I need a way to download them to my hard drive.

While there are plenty of ways to download stuff using PowerShell, one of the more clever approaches uses the "Background Intelligent Transfer Service (BITS)". That's smart because BITS is built for downloading and is very robust. It is the same technology used by Windows Update. Which raises the question of how to get to BITS.

If you are running Windows 7 or better, that's easy because it comes with a module called "BitsTransfer". It also comes with PowerShell V2 updates I believe. Simply import it and have a look at the arsenal of new cmdlets it provides:

Import-Module BitsTransfer
Get-Command -Module BitsTransfer
 
CommandType  Name                   Definition
-----------  ----                   ----------
Cmdlet       Add-BitsFile           Add-BitsFile [-BitsJob] <BitsJ...
Cmdlet       Complete-BitsTransfer  Complete-BitsTransfer [-BitsJo...
Cmdlet       Get-BitsTransfer       Get-BitsTransfer [[-Name] <Str...
Cmdlet       Remove-BitsTransfer    Remove-BitsTransfer [-BitsJob]...
Cmdlet       Resume-BitsTransfer    Resume-BitsTransfer [-BitsJob]...
Cmdlet       Set-BitsTransfer       Set-BitsTransfer [-BitsJob] <B...
Cmdlet       Start-BitsTransfer     Start-BitsTransfer [-Source] <...
Cmdlet       Suspend-BitsTransfer   Suspend-BitsTransfer [-BitsJob...
 
As it turns out, to download a file use Start-BitsTransfer and provide the URL and a filename to store the downloaded file.
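
For a single file, that boils down to one call. The wrapper below is only a sketch. Save-WebFile is a hypothetical name, and the URL and path in the usage line are placeholders:

```powershell
# Hypothetical wrapper around Start-BitsTransfer for a single file
function Save-WebFile {
    param(
        [string]$Url,   # where to download from
        [string]$Path   # full file name to store the download in
    )
    Import-Module BitsTransfer
    # runs synchronously and shows a progress bar until the file has arrived
    Start-BitsTransfer -Source $Url -Destination $Path
}
```

A call could look like this (placeholder URL): Save-WebFile 'http://www.example.com/logo.png' "$env:TEMP\logo.png"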

The Solution: Get-WebPageImages

Here is a ready-to-use function called Get-WebPageImages. It's pretty small and efficient, like most PowerShell code:

function Get-WebPageImages($url, $folder) {
    Import-Module BitsTransfer
    # create the target folder if it does not exist yet
    if (-not (Test-Path $folder)) { md $folder | Out-Null }
    $ie = New-Object -COMObject InternetExplorer.Application
    $ie.Navigate($url)
    # wait until the page is loaded
    while ($ie.Busy) { Start-Sleep -Milliseconds 400 }
    # collect the image URLs and derive a local file name for each one
    $sources = $ie.Document.getElementsByTagName('img') | Select-Object -ExpandProperty src
    $destinations = $sources | ForEach-Object { "$folder\$($_.Split('/')[-1])" }
    $displayname = $url.Split('/')[-1]
    $ie.Quit()
    Start-BitsTransfer -Source $sources -Destination $destinations -Priority High -DisplayName $displayname
}

To start downloading the images, use it like this:

Get-WebPageImages 'http://www.powershell.com' 'c:\webimages'

You will see a progress bar while the function downloads all images to your machine. Once done, all images are stored in the folder you provided.
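
One caveat: the function derives each file name with Split('/')[-1], which keeps any query string (for example "logo.png?v=2") and returns an empty name for URLs that end in a slash. A possible refinement, sketched here under the hypothetical name Get-ImageFileName, lets the .NET Uri and Path types do the work:

```powershell
# Hypothetical helper: derive a local file name from an image URL
function Get-ImageFileName {
    param([string]$Url)
    # AbsolutePath excludes the query string and fragment
    $name = [System.IO.Path]::GetFileName(([uri]$Url).AbsolutePath)
    # fall back to a generic name when the URL ends in a slash
    if (-not $name) { $name = 'image' }
    $name
}
```

Inside the function, the destination line could then read: "$folder\$(Get-ImageFileName $_)". Note that two images with the same file name would still collide, so this sketch does not handle that case.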

To download images asynchronously, add the parameter -Asynchronous (or -Async for short) to Start-BitsTransfer. Downloading asynchronously occurs transparently in the background and is no longer bound to PowerShell. You can close PowerShell and even restart your machine. The download continues silently in the background. To check download progress, use this line:

Get-BitsTransfer |
    Select-Object DisplayName,
        @{Name='Progress'; Expression={ $_.BytesTransferred * 100 / $_.BytesTotal }},
        JobState

Watch out, though: to actually receive the files downloaded asynchronously, you need to manually complete the job once all files are transferred using Complete-BitsTransfer.

Get-BitsTransfer | Where-Object { $_.JobState -eq 'transferred' } | Complete-BitsTransfer
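
If you would rather not check by hand, a small polling loop can wait for the background jobs and then complete them. This is just a sketch. Wait-BitsDownload is a hypothetical name, and I am assuming that the states Queued, Connecting and Transferring are the ones worth waiting for:

```powershell
# Hypothetical helper: wait for pending BITS jobs, then complete the finished ones
function Wait-BitsDownload {
    param([int]$PollSeconds = 2)  # assumed polling interval
    Import-Module BitsTransfer
    # job states treated as "still working" (an assumption of this sketch)
    $active = 'Queued', 'Connecting', 'Transferring'
    while (@(Get-BitsTransfer | Where-Object { $active -contains $_.JobState.ToString() }).Count -gt 0) {
        Start-Sleep -Seconds $PollSeconds
    }
    # finalize every job that has transferred all of its files
    Get-BitsTransfer | Where-Object { $_.JobState -eq 'Transferred' } | Complete-BitsTransfer
}
```

Jobs that ended in an error state are left untouched, so you can still inspect them with Get-BitsTransfer afterwards.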

Learning Points

Downloading images from a web page is fun, and so is PowerShell, but the real learning point here is how PowerShell techniques work like small building blocks and draw on all kinds of technology to solve a problem.

I used an old COM object (IE automation interface) to access and parse HTML and identify image urls. I used the pipeline to get the information in shape. I used a PowerShell extension (BitsTransfer) to actually download the images to our machine.

And guess what? That's all perfectly fine. PowerShell is pragmatic, and with PowerShell, anything can be used and tied together. That's awesome. As a side effect, you used the BitsTransfer module quite a bit. It is a great solution for downloading (and uploading) files.

Cheerio for today, and don't forget to give PowerShellPlus a try! It makes exploring and developing PowerShell code so much easier!
Thanks for your support,

Tobias
Microsoft MVP PowerShell
PowerShellPlus Architect


Posted Mar 17 2010, 04:35 AM by Tobias

Comments

nem wrote re: Downloading Images from Webpages
on 03-06-2011 8:56 PM

How can I use this approach to download a *.zip file if the download page uses https and requires a username and password?

Besides, I can't have the real download link because it's a short link.

Please help me!
