How to output most common strings in a text file?

Question

Background

I have a text file containing a list of databases and entries in those databases. Example text file:

    Database 1
    1. Book about abc.
    2. Thesis about abc.
    3. Book about xyz.
    Database 2
    1. Book about xyz.
    2. Article about abc.
    Database 3
    Thesis about abc.
    Article about abc.
    Book about xyz.
    Database 4
    Number 1: Book about xyz is included.
    Number 2: Article about xyz is included.

Problem

I want to output the strings (of at least a given number of words) that occur most often. Example output:

    Name              Count
    Book about xyz    4
    Thesis about abc  2

Notes

The strings occur within lines, i.e. this is not the same as counting the occurrences of whole lines. Sometimes the required string has something before and/or after it, e.g. a leading 1. or Number 1:, or a trailing "is included.", and sometimes it does not.

What I've tried

I've been using PowerShell. I've tried Get-Content .\data.txt | Group-Object | Where-Object { $_.Count -ne 1 } and, coming at it from the other direction, Get-Content .\data.txt | Select-Object -Unique, but both work on whole lines and I don't see a way of getting at strings within lines. I have also looked into Select-String, but I don't know the strings in advance, so I can't define a regex for -Pattern.

@Lee_Dailey all strings with a length of more than three which occur more than once. Or am I misunderstanding your question?
– cyuut, Commented Apr 22, 2020 at 13:15
you show that the result is two items but your data shows more 3-word sequences ... so, how do you decide that it should only be those two 3-word sequences and not any of the others?
– Lee_Dailey, Commented Apr 22, 2020 at 13:51

Rickybobby · Accepted Answer · 2020-04-21 16:28:27Z

Here's what I came up with in PowerShell. Let me know what you think.

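    $database = Get-Content -Path c:\temp\database.txt
    $MyArrayList = New-Object -TypeName "System.Collections.ArrayList"

    foreach ($line in $database) {
        $flag = $false
        [Int32]$OutNumber = $null

        # Skip the "Database N" headers and blank lines
        if ($line -match "database" -or [String]::IsNullOrWhiteSpace($line)) {
            continue
        }
        else {
            # Entries such as "1. Book about abc.": drop the leading "1."
            if ([Int32]::TryParse($line.Substring(0,1), [ref]$OutNumber)) {
                $tmp = $line.Substring(2).Trim()
                # [void] discards the index that ArrayList.Add returns
                [void]$MyArrayList.Add($tmp)
                $flag = $true
            }
            # Entries such as "Number 1: Book ...": keep the text after the colon
            if ($line -match 'Number') {
                $tmp = $line.Substring($line.IndexOf(":") + 1).Trim()
                [void]$MyArrayList.Add($tmp)
                $flag = $true
            }
            # Entries without a prefix are stored unchanged
            if ($flag -eq $false) {
                [void]$MyArrayList.Add($line)
            }
        }
    }

    $MyArrayList | Group-Object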

Here's my output

    Count Name                      Group
    ----- ----                      -----
        1 Book about abc.           {Book about abc.}
        2 Thesis about abc.         {Thesis about abc., Thesis about abc.}
        3 Book about xyz.           {Book about xyz., Book about xyz., Book about xyz.}
        2 Article about abc.        {Article about abc., Article about abc.}
        1 Book about xyz is incl... {Book about xyz is included.}
        1 Article about xyz is i... {Article about xyz is included.}
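
The grouping above still compares whole entries, so "Book about xyz is included." is counted apart from "Book about xyz." and the total is 3 rather than the 4 asked for. One possible way to count strings inside lines is to break each line into overlapping word sequences and group those instead. The following is only a minimal sketch under assumptions that the question does not fix: phrases are exactly three words long ($wordsPerPhrase), words are separated by whitespace or basic punctuation, and purely numeric tokens such as the "1" in "1." are ignored. The variable names are illustrative, and the sample data is embedded so the snippet runs on its own.

```powershell
# Sample data from the question, embedded so the sketch is self-contained.
$lines = @(
    'Database 1'
    '1. Book about abc.'
    '2. Thesis about abc.'
    '3. Book about xyz.'
    'Database 2'
    '1. Book about xyz.'
    '2. Article about abc.'
    'Database 3'
    'Thesis about abc.'
    'Article about abc.'
    'Book about xyz.'
    'Database 4'
    'Number 1: Book about xyz is included.'
    'Number 2: Article about xyz is included.'
)

# Assumption: a "string" is a run of exactly three consecutive words.
$wordsPerPhrase = 3

$phrases = foreach ($line in $lines) {
    # Split on whitespace and basic punctuation, then drop empty
    # tokens and purely numeric ones (list numbers such as "1").
    $words = @(
        $line -split '[\s.:,;]+' |
            Where-Object { $_ -ne '' -and $_ -notmatch '^\d+$' }
    )

    # Emit every run of $wordsPerPhrase consecutive words in the line.
    for ($i = 0; $i -le $words.Count - $wordsPerPhrase; $i++) {
        $words[$i..($i + $wordsPerPhrase - 1)] -join ' '
    }
}

# Keep phrases seen more than once, most frequent first.
$result = $phrases |
    Group-Object |
    Where-Object { $_.Count -gt 1 } |
    Sort-Object -Property Count -Descending |
    Select-Object -Property Name, Count

$result
```

With the sample data, "Book about xyz" comes out on top with a count of 4, and "Thesis about abc" appears with 2. Other three-word sequences such as "about xyz is" also occur twice, which is the ambiguity raised in the comments: some extra rule is still needed to decide which repeated sequences count as entries.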
