Background
I have a text file containing a list of databases and entries in those databases. Example text file:
    Database 1
    1. Book about abc.
    2. Thesis about abc.
    3. Book about xyz.
    Database 2
    1. Book about xyz.
    2. Article about abc.
    Database 3
    Thesis about abc.
    Article about abc.
    Book about xyz.
    Database 4
    Number 1: Book about xyz is included.
    Number 2: Article about xyz is included.
Problem
I want to output the strings (of at least some minimum number of words) that occur most often. Example output:
    Name              Count
    Book about xyz    4
    Thesis about abc  2
Notes
The strings occur within lines, i.e. this is not the same as counting the number of occurrences of a whole line. Sometimes the required string is preceded and/or followed by something, e.g. 1. or Number 1: before it, and sometimes not.
What I've tried
I've been using PowerShell. I've tried get-content .\data.txt | group-object | where { $_.count -ne 1 }, and coming at it from the other direction with get-content .\data.txt | select -unique, but both operate on whole lines and I don't see a way of getting at strings within lines. I have also investigated select-string, but I don't know the strings in advance, so I can't supply a regex for -Pattern.
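One possible direction, as a minimal sketch rather than a finished answer: split each line into words, build every run of N consecutive words, and count those runs across all lines. The function name Get-CommonPhrase, its parameters, the fixed run length of three words, and the choice to ignore digits and punctuation are all assumptions made for illustration.

```powershell
# Hypothetical helper: counts every run of $WordCount consecutive words per line.
function Get-CommonPhrase {
    param(
        [string[]]$Line,
        [int]$WordCount = 3   # assumed "minimum number of words"
    )

    $counts = @{}   # hashtable literal: string keys compare case-insensitively
    foreach ($text in $Line) {
        # Keep alphabetic words only, so prefixes such as "1." lose their digits.
        $words = @([regex]::Matches($text, '[A-Za-z]+') | ForEach-Object { $_.Value })
        for ($i = 0; $i -le $words.Count - $WordCount; $i++) {
            $phrase = $words[$i..($i + $WordCount - 1)] -join ' '
            $counts[$phrase] = 1 + $counts[$phrase]   # a missing key reads as $null
        }
    }

    # Report only phrases seen more than once, most frequent first.
    $counts.GetEnumerator() |
        Where-Object { $_.Value -gt 1 } |
        Sort-Object -Property Value -Descending |
        ForEach-Object { [pscustomobject]@{ Name = $_.Key; Count = $_.Value } }
}

# Sample lines copied from the example file; in practice: Get-Content .\data.txt
$sample = @(
    'Database 1', '1. Book about abc.', '2. Thesis about abc.', '3. Book about xyz.',
    'Database 2', '1. Book about xyz.', '2. Article about abc.',
    'Database 3', 'Thesis about abc.', 'Article about abc.', 'Book about xyz.',
    'Database 4', 'Number 1: Book about xyz is included.',
    'Number 2: Article about xyz is included.'
)
Get-CommonPhrase -Line $sample -WordCount 3
```

On the sample this puts Book about xyz first with a count of 4, but it also reports overlapping fragments such as about xyz is, because every three-word window is counted. It therefore still needs a rule for deciding which repeated runs are the real entries.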