
Background

I have a text file containing a list of databases and entries in those databases. Example text file:

    Database 1
    1. Book about abc.
    2. Thesis about abc.
    3. Book about xyz.
    Database 2
    1. Book about xyz.
    2. Article about abc.
    Database 3
    Thesis about abc.
    Article about abc.
    Book about xyz.
    Database 4
    Number 1: Book about xyz is included.
    Number 2: Article about xyz is included.

Problem

I want to output the strings (which contain a minimum number of words) which occur most commonly. Example output:

    Name              Count
    Book about xyz    4
    Thesis about abc  2

Notes

The strings occur within lines, so this is not the same as counting the number of occurrences of a whole line. Sometimes the required string is prefixed and/or suffixed with something (e.g. "1." or "Number 1:"), and sometimes it is not.

What I've tried

I've been using PowerShell. I've tried

    Get-Content .\data.txt | Group-Object | Where-Object { $_.Count -ne 1 }

or, coming at it from the other direction,

    Get-Content .\data.txt | Select-Object -Unique

but both of these compare whole lines, and I don't see a way of getting at strings within lines. I have also looked at Select-String, but I don't know the pattern in advance, so I can't write a regex for -Pattern.
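One way to look inside lines without knowing the pattern in advance is to count every run of consecutive words. The following is only a sketch under stated assumptions: `$lines` is a hard-coded copy of the example file, `$n` is a hypothetical minimum phrase length of three words, and anything that is not a letter is treated as a separator so that prefixes such as "1." and "Number 1:" lose their digits and punctuation.

```powershell
# Sample lines copied from the example file in the question.
$lines = @(
    'Database 1', '1. Book about abc.', '2. Thesis about abc.', '3. Book about xyz.',
    'Database 2', '1. Book about xyz.', '2. Article about abc.',
    'Database 3', 'Thesis about abc.', 'Article about abc.', 'Book about xyz.',
    'Database 4', 'Number 1: Book about xyz is included.',
    'Number 2: Article about xyz is included.'
)

$n = 3   # assumed minimum number of words per phrase (hypothetical setting)

$phrases = foreach ($line in $lines) {
    # Split on anything that is not a letter and discard empty pieces.
    $words = @($line -split '[^A-Za-z]+' | Where-Object { $_ })

    # Emit every run of $n consecutive words found on this line.
    for ($i = 0; $i + $n -le $words.Count; $i++) {
        $words[$i..($i + $n - 1)] -join ' '
    }
}

# Keep phrases seen more than once, most frequent first.
$result = $phrases | Group-Object | Where-Object Count -gt 1 |
    Sort-Object -Property @{ Expression = 'Count'; Descending = $true }, Name |
    Select-Object Name, Count
$result
```

On the sample data this counts "Book about xyz" four times and "Thesis about abc" twice. It also reports other repeated fragments, such as "Article about abc" and "about xyz is", so some rule for which phrases matter is still needed.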

  • How do you decide what strings to track? Commented Apr 21, 2020 at 19:05
  • @Lee_Dailey All strings with a length of more than three which occur more than once. Or am I misunderstanding your question? Commented Apr 22, 2020 at 13:15
  • You show that the result is two items, but your data shows more 3-word sequences. So how do you decide that it should be only those two 3-word sequences and not any of the others? Commented Apr 22, 2020 at 13:51

1 Answer

Here's what I came up with in PowerShell. Let me know what you think.

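    # Read the file; Get-Content returns one string per line.
    $database = Get-Content -Path c:\temp\database.txt
    $MyArrayList = New-Object -TypeName "System.Collections.ArrayList"

    foreach ($line in $database) {
        $flag = $false
        [Int32]$OutNumber = $null

        # Skip the "Database N" headings and blank lines.
        if ($line -match "database" -or [String]::IsNullOrWhiteSpace($line)) {
            continue
        }
        else {
            # Lines such as "1. Book about abc." start with a digit: drop the "1." prefix.
            if ([Int32]::TryParse($line.Substring(0,1), [ref]$OutNumber)) {
                $tmp = $line.Substring(2).Trim()
                [void]$MyArrayList.Add($tmp)   # [void] hides the index that Add() returns
                $flag = $true
            }
            # Lines such as "Number 1: Book ...": keep only the text after the colon.
            if ($line -match 'Number') {
                $tmp = $line.Substring($line.IndexOf(":") + 1).Trim()
                [void]$MyArrayList.Add($tmp)
                $flag = $true
            }
            # Any other line is stored unchanged.
            if ($flag -eq $false) {
                [void]$MyArrayList.Add($line)
            }
        }
    }

    # Count how often each extracted entry occurs.
    $MyArrayList | Group-Object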

Here's my output

    Count Name                      Group
    ----- ----                      -----
        1 Book about abc.           {Book about abc.}
        2 Thesis about abc.         {Thesis about abc., Thesis about abc.}
        3 Book about xyz.           {Book about xyz., Book about xyz., Book about xyz.}
        2 Article about abc.        {Article about abc., Article about abc.}
        1 Book about xyz is incl... {Book about xyz is included.}
        1 Article about xyz is i... {Article about xyz is included.}
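In the output above, "Book about xyz." and "Book about xyz is included." are counted separately, so the total of 4 from the question is not reached. A possible variation, offered only as a sketch, strips the prefixes and the suffix with `-replace` before grouping. It assumes the only decorations are the ones visible in the example file (a leading "1." or "Number 1:", a trailing " is included", and a trailing period); `$lines` is a hard-coded copy of that file, and the two regular expressions are illustrative rather than general.

```powershell
# Sample lines copied from the example file in the question.
$lines = @(
    'Database 1', '1. Book about abc.', '2. Thesis about abc.', '3. Book about xyz.',
    'Database 2', '1. Book about xyz.', '2. Article about abc.',
    'Database 3', 'Thesis about abc.', 'Article about abc.', 'Book about xyz.',
    'Database 4', 'Number 1: Book about xyz is included.',
    'Number 2: Article about xyz is included.'
)

$entries = foreach ($line in $lines) {
    # Skip blank lines and the "Database N" headings.
    if ([string]::IsNullOrWhiteSpace($line) -or $line -match '^\s*Database\s+\d+\s*$') {
        continue
    }

    # Assumed prefix forms: "1." or "Number 1:".
    $text = $line -replace '^\s*(Number\s+)?\d+\s*[.:]\s*', ''
    # Assumed suffix form: " is included" with an optional period.
    $text = $text -replace '\s+is included\.?\s*$', ''
    # Remove surrounding whitespace and any trailing period.
    $text.Trim().TrimEnd('.')
}

# Count the normalized entries, most frequent first.
$counts = $entries | Group-Object -NoElement |
    Sort-Object -Property @{ Expression = 'Count'; Descending = $true }, Name |
    Select-Object Name, Count
$counts
```

With the sample data this gives "Book about xyz" a count of 4 and "Thesis about abc" a count of 2. "Article about abc" also reaches 2, which the expected output in the question does not list.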
