Yesterday, I stumbled upon an excellent blog post written in 2007 by MOW, a good friend and PowerShell expert. In it, MOW demonstrates how to scrape raw data from an HTML page and convert it to PowerShell objects (http://thepowershellguy.com/blogs/posh/archive/2007/02/13/hey-powershell-how-popular-is-this-baby-name.aspx).
I loved that approach so much that I played a bit with it and refined his code.
It now asks you for a decade (1880 - 2000) and then navigates to a web page at www.ssa.gov with the most popular male and female names in that decade. The script then downloads the raw HTML content and parses it using regular expressions.
The result is then converted into PowerShell objects. The resulting data can now be analyzed, filtered, sorted and exported with all the luxury PowerShell offers.
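To see the parsing step in isolation, here is a minimal sketch that runs without network access. The HTML fragment, the names and the numbers in it are invented for illustration, and the real markup on www.ssa.gov may differ; only the regular expression is taken from the script. The full script below does the same against the live page.

```powershell
# Hypothetical HTML fragment standing in for the downloaded page.
# The values are invented; the real page layout may differ.
$html = @'
<tr><td>1</td>
<td width="15%">John</td><td width="15%">1,000</td><td width="15%">5.25</td>
<td width="15%">Mary</td><td width="15%">900</td><td width="15%">4.75</td></tr>
'@

# Same pattern as the script: capture the text of every cell tagged ="15%"
$cells = @([regex]::Matches($html, '="15%">(.*?)</td>') |
    ForEach-Object { $_.Groups[1].Value })

# Every three consecutive cells form one record: name, count, percent.
# Records are assumed to alternate between male and female.
$sample = @(for ($i = 0; $i -lt $cells.Count; $i += 3) {
    New-Object PSObject -Property @{
        Name    = $cells[$i]
        Count   = [int]($cells[$i + 1] -replace ',', '')
        Percent = [double]$cells[$i + 2]
        Sex     = @('male', 'female')[($i / 3) % 2]
    }
})

$sample | Format-Table Name, Count, Percent, Sex
```

Stepping through the matches three at a time with a plain for loop is just one way to group the cells; the script reaches the same result by advancing the foreach enumerator by hand.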
# Ask for the decade; the page URL is built from it
$decade = Read-Host 'Enter decade (1880 - 2000)'

Write-Progress "Connecting Web" "www.ssa.gov"
$wc = New-Object System.Net.WebClient
$nl = $wc.DownloadString("http://www.ssa.gov/OACT/babynames/decades/names$($decade)s.html")

# Capture the text of every table cell tagged ="15%"
Write-Progress "Analyzing Data" "extracting..."
$r = [regex]'="15%">(.*?)</td>'
$m = $r.Matches($nl)

$list = @()
$sex = "male"

# Each record spans three consecutive matches: name, count, percent
foreach ($i in 0..($m.Count - 1)) {
    $record = '' | Select-Object Name, Count, Percent, Sex
    $record.Name = $m[$i].Groups[1].Value
    if (!($i % 60)) {
        Write-Progress "Finding Names ($($i/3))" $record.Name -PercentComplete ($i * 100 / $m.Count)
    }
    # Advance the loop enumerator by hand to reach the count and percent cells
    [void] $foreach.MoveNext()
    $record.Count = [int]($m[$foreach.Current].Groups[1].Value)
    [void] $foreach.MoveNext()
    $record.Percent = "{0:p4}" -f (([double]$m[$foreach.Current].Groups[1].Value) / 100)

    # The script assumes that male and female records alternate
    $record.Sex = $sex
    if ($sex -eq 'male') { $sex = 'female' } else { $sex = 'male' }
    $list += $record
}

# First five records, a separator line, then the five most frequent male names
$list | Select-Object -First 5
'#' * 40
$list | Sort-Object Count -Descending | Where-Object { $_.Sex -eq 'male' } | Select-Object -First 5
The example also demonstrates the use of Write-Progress to display status messages and progress bars. The output is the first five records for the chosen decade, followed by the five most popular male names sorted by count. Of course, you can elaborate on this.
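As one possible elaboration, the sketch below groups the records by sex and exports a filtered, sorted list to CSV. So that it runs on its own, it uses a small hand-made $list with invented values instead of the scraped data; the file name names.csv and the temp folder location are likewise assumptions.

```powershell
# Hypothetical stand-in for the $list the script builds; values are invented
$list = @(
    New-Object PSObject -Property @{ Name = 'John'; Count = 1000; Sex = 'male' }
    New-Object PSObject -Property @{ Name = 'Mary'; Count = 900;  Sex = 'female' }
    New-Object PSObject -Property @{ Name = 'Will'; Count = 800;  Sex = 'male' }
)

# Sum the counts per sex across the listed names
$totals = $list | Group-Object Sex | ForEach-Object {
    New-Object PSObject -Property @{
        Sex   = $_.Name
        Total = ($_.Group | Measure-Object Count -Sum).Sum
    }
}
$totals | Format-Table Sex, Total

# Export the female names, most popular first, to a CSV file
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'names.csv'
$list | Where-Object { $_.Sex -eq 'female' } |
    Sort-Object Count -Descending |
    Export-Csv -Path $csvPath -NoTypeInformation
```

With the real data, the same pipeline works unchanged on the $list produced by the script.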
MOW, great job, this shows how (relatively) easy it is to convert "unmanaged" raw HTML data into managed PowerShell objects.
Cheers
-Tobias
Posted Nov 26 2008, 02:58 AM by Tobias Weltner