Splitting Text Into Words

If you need to read a file and split its content into words, there are a couple of gotchas to keep in mind. First, remember that Get-Content reads a file line by line and returns one string per line. To apply regular expressions or split operations to the entire text, first combine all lines into a single string, for example with Out-String:

$text = Get-Content k:\eula.1031.txt | Out-String

Out-String has one major disadvantage: it formats its output to a fixed maximum line width, so words may be truncated. A better approach is the Join method of the .NET String class:

$text = [String]::Join(' ', (Get-Content k:\eula.1031.txt))
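To make the difference between the line array and the joined string visible, here is a minimal sketch. The temp file name and its three lines are made up for illustration and are not part of the original example.

```powershell
# Hypothetical sample file in the temp folder; path and contents are assumptions for this sketch.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'split-words-sample.txt'
Set-Content -Path $path -Value 'first line', 'second line', 'third line'

# Get-Content returns one string per line; @() keeps it an array even for one-line files.
$lines = @(Get-Content -Path $path)

# Join glues the lines into a single string, using a space in place of each line break.
$text = [String]::Join(' ', $lines)

$lines.Count    # 3
$text           # first line second line third line

# Clean up the sample file.
Remove-Item -Path $path
```

Joining with a space keeps the last word of one line from running into the first word of the next. On PowerShell 3.0 and later, Get-Content -Raw may be a simpler option, since it returns the whole file as one string with its line breaks intact.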

Once you have the complete text stored in $text, you can then split it into words. Often, people use simple text split operations like this:

$words = $text.Split([char[]]" `t=", [StringSplitOptions]::RemoveEmptyEntries)

This uses a space, a tab, or an equals sign as word delimiters and removes empty entries. However, this approach is not very dependable because there are many more non-word characters to handle, such as punctuation. A better approach is to split with a regular expression, as in this example:

[regex]::Split($text, '[\s,\.]') |
Where-Object { $_ -like 'a*' } |
Group-Object |
Sort-Object {$_.Name.Length} -descending
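The two split styles can be compared side by side on a short string. This is only a sketch: the sample sentence is made up, and the [char[]] cast is added so that each character counts as a separate delimiter regardless of the PowerShell version.

```powershell
# Made-up sample text for illustration.
$sample = 'Alpha, beta; gamma=delta. (Epsilon)'

# Plain String.Split on space, tab and equals sign: punctuation stays glued to the words.
$simple = $sample.Split([char[]]" `t=", [StringSplitOptions]::RemoveEmptyEntries)
$simple -join '|'   # Alpha,|beta;|gamma|delta.|(Epsilon)

# Regex split on whitespace, comma or period: commas and periods are gone, but the
# result contains empty strings, and the semicolon, equals sign and parentheses remain.
$viaRegex = [regex]::Split($sample, '[\s,\.]')
($viaRegex | Where-Object { $_ }) -join '|'   # Alpha|beta;|gamma=delta|(Epsilon)
```

The empty strings come from adjacent separators such as a comma followed by a space. The Where-Object filter in the pipeline above drops them as a side effect of its -like test.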

The character class [\s,\.] treats any whitespace character, comma, or period as a word separator. Still, this approach is not perfect, because every other punctuation character would have to be listed as well. A much better approach leaves it to the regular expression engine to identify the words: use Matches() instead of Split() to match explicit instances of words (\w+) delimited by word boundaries (\b):

[regex]::Matches($text, '\b\w+\b') |
ForEach-Object { $_.Value } |
Group-Object |
Sort-Object Count -descending |
Select-Object -first 10
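A small run on a made-up sentence shows what the Matches() approach returns and where its limits are. The sample text and variable names are assumptions for this sketch.

```powershell
# Made-up sample text for illustration.
$sample = "The cat saw the dog; the dog didn't see the cat."

# Each match of \w+ is one word; whitespace and punctuation never show up in the results.
$words = [regex]::Matches($sample, '\b\w+\b') | ForEach-Object { $_.Value }
$words.Count                # 12

# \w does not match an apostrophe, so "didn't" is split into "didn" and "t".
$words -contains 'didn'     # True

# Group-Object compares case-insensitively by default, so "The" and "the" share one group.
$top = $words | Group-Object | Sort-Object Count -Descending | Select-Object -First 1
$top.Count                  # 4
```

Note that \w also matches digits and the underscore, so tokens such as 2009 or file_name count as words. If contractions matter for your text, the pattern would need to be extended accordingly.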

Posted Apr 28 2009, 08:00 AM by ps1

Copyright 2011 PowerShell.com. All rights reserved.