Scraping Information from Web Pages

Regular expressions are a convenient way to identify and extract text patterns. The next code fragment defines a regular expression that matches HTML div elements whose class attribute is "post-summary", downloads the PowerShell team blog page, and returns the text inside each matching div, one entry per post summary:

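# (?s) lets . match line breaks; (.*?) lazily captures the text inside each summary div
$regex = [RegEx]'(?s)<div class="post-summary">(.*?)</div>'

$url = 'http://blogs.msdn.com/b/powershell/'
$wc = New-Object System.Net.WebClient
# Download the page's HTML source as a single string
$content = $wc.DownloadString($url)

# Emit only capture group 1 (the summary text) for each match
$regex.Matches($content) | ForEach-Object { $_.Groups[1].Value }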
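
To see what the pattern captures without depending on the live page, here is a minimal sketch that runs the same kind of regex against a small inline HTML sample. The sample markup and the variable names $sample and $summaries are made up for illustration. The (?s) option is added so that . also matches line breaks, and the tag-stripping step is just one simple way to get plain text, not necessarily what the tip originally had in mind.

```powershell
# Hypothetical sample HTML standing in for the downloaded page
$sample = @'
<div class="post-summary">First <b>summary</b></div>
<div class="other">ignored</div>
<div class="post-summary">Second
summary</div>
'@

# (?s) lets . match newlines; (.*?) is lazy, so each match stops at the first </div>
$regex = [RegEx]'(?s)<div class="post-summary">(.*?)</div>'

$summaries = $regex.Matches($sample) | ForEach-Object {
    # Group 1 holds the text between the opening and closing tag;
    # strip any nested tags and trim surrounding whitespace
    ($_.Groups[1].Value -replace '<[^>]+>', '').Trim()
}

$summaries
```

Only the captured group is written to the pipeline, so the div with class "other" is skipped. If the pattern matches nothing on a page, the last line produces no output at all.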



Posted Oct 06 2010, 08:00 AM by ps1

Comments

sudspark wrote re: Scraping Information from Web Pages
on 10-07-2010 10:32 AM

Isn't this supposed to output only the summaries? It looks like it outputs the entire page source instead.

Please advise.

Thanks,

Suds
