Semantic Annotator is a library that generates semantic annotations based on a system of syntactic patterns. It relies on the Stanford CoreNLP library.
A tagger is a single JSON file that describes how to extract annotations from a text, using a list of rules and a list of tests that validate the tagger.
Semantic Annotator loads all the taggers contained in a given directory and validates all rules and unit tests on load. It can then generate the annotations defined by those taggers.
A tagger can import other taggers, which are executed before its own rules. However, only the generatedTagLabels flagged as exportable in the tagger itself are returned.

    {
      "importRules": [ "otherTagger1", "otherTagger2" ],
      "rules": []
    }

A collection is a single JSON file that lists other taggers.

    {
      "collection": [ "otherTagger1", "otherTagger2" ],
      "unitTests": [
        {
          "verbatim": [ "My cat is red", "Your dog is blue" ],
          "generatedTagLabels": [ "coloredPet" ]
        }
      ]
    }

A rule describes how to detect an annotation based on a syntactic pattern (or a regular expression) and/or how to apply transformations to the input text.
A list of samples is used to validate the rule.
For instance, the following rule replaces the first matching group with "SMALL".

    {
      "pattern": "(petit@ADJ|mini|minuscule) :NC",
      "substitutions": "1:SMALL",
      "samples": [ "le petit lavabo" ]
    }

This second rule generates the annotation "smallDog" if the syntactic pattern matches.

    {
      "pattern": "@SMALL chien",
      "generatedTagLabels": [ { "value": "smallDog", "exported": true } ],
      "samples": [ "ce petit chien", "le petit chien" ]
    }

A token pattern describes a single token (aka 'word'). Its syntax is:
text@type[;property=value]*
where:
- text: the exact text value or its lemma
- type: the "part of speech" label of the token, such as noun, verb or adjective. It uses the Treebank POS tag set.
- property=value: a list of morphosyntactic properties separated by semicolons.
Any part of a token pattern can be empty.
You can combine multiple patterns by separating them with the boolean operator | (or).
| token pattern | will match tokens |
|---|---|
| samples@NN;lemma=sample;nb=p | samples |
| @NN;lemma=sample;nb=p | samples |
| samples;lemma=sample;nb=p | samples |
| samples;lemma=sample | samples or sample |
| samples@V | none |
| samples@ | samples or sample |
| @ | any token |
| samples@\|example:NN | samples, sample, example or examples |
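
To make the `text@type[;property=value]*` syntax concrete, here is a minimal sketch that splits a single token pattern into its three parts. It is not the library's parser: the class name `TokenPatternSketch` and its fields are hypothetical, and the `|` operator is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: NOT the Semantic Annotator parser.
// Splits one token pattern of the form text@type[;property=value]*.
class TokenPatternSketch {
    final String text;                    // empty string means "any text"
    final String type;                    // empty string means "any part of speech"
    final Map<String, String> properties; // e.g. {lemma=sample, nb=p}

    private TokenPatternSketch(String text, String type, Map<String, String> properties) {
        this.text = text;
        this.type = type;
        this.properties = properties;
    }

    static TokenPatternSketch parse(String pattern) {
        // The first segment is text@type, the following ones are property=value.
        String[] parts = pattern.split(";", -1);
        String head = parts[0];
        int at = head.indexOf('@');
        String text = at < 0 ? head : head.substring(0, at);
        String type = at < 0 ? "" : head.substring(at + 1);

        Map<String, String> properties = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Expected property=value but got: " + parts[i]);
            }
            properties.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
        }
        return new TokenPatternSketch(text, type, properties);
    }

    public static void main(String[] args) {
        TokenPatternSketch p = parse("samples@NN;lemma=sample;nb=p");
        // prints: samples / NN / {lemma=sample, nb=p}
        System.out.println(p.text + " / " + p.type + " / " + p.properties);
    }
}
```

An empty part simply comes back as an empty string or an empty map, which mirrors the rule that any part of a token pattern can be empty.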
A syntactic pattern describes the syntactic structure of a text. It is composed of a sequence of token patterns.
You can define groups using parentheses. Groups are useful for applying substitutions or for attaching a quantifier.
Quantifiers are:
- ? : zero or one occurrence
- * : zero or more occurrences
- + : one or more occurrences
| syntactic pattern | will match text |
|---|---|
| @DT (@JJ)? dog@NN | the dog, the big dog, a small dog, etc. |
| (@DT)* dog@NN | dog, the dog, the the dog, etc. |
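
The quantifiers read like their regular-expression counterparts. The sketch below illustrates this by analogy only: it uses `java.util.regex` on hand-tagged text, not the library's matcher. The `word/TAG` encoding, the class name `QuantifierSketch`, and the third pattern (with `+`) are assumptions made for the example.

```java
import java.util.regex.Pattern;

// Illustrative analogy only: tokens are written by hand as word/TAG and
// matched with plain regular expressions, not with Semantic Annotator.
class QuantifierSketch {
    // Analogue of "@DT (@JJ)? dog@NN": determiner, optional adjective, "dog".
    static final Pattern OPTIONAL_ADJECTIVE =
            Pattern.compile("\\S+/DT (\\S+/JJ )?dog/NN");

    // Analogue of "(@DT)* dog@NN": any number of determiners, then "dog".
    static final Pattern ANY_DETERMINERS =
            Pattern.compile("(\\S+/DT )*dog/NN");

    // Hypothetical analogue of "@DT (@JJ)+ dog@NN": at least one adjective.
    static final Pattern SOME_ADJECTIVES =
            Pattern.compile("\\S+/DT (\\S+/JJ )+dog/NN");

    // True when the whole tagged text matches the pattern.
    static boolean matches(Pattern pattern, String taggedText) {
        return pattern.matcher(taggedText).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(OPTIONAL_ADJECTIVE, "the/DT big/JJ dog/NN")); // true
        System.out.println(matches(ANY_DETERMINERS, "dog/NN"));                  // true
        System.out.println(matches(SOME_ADJECTIVES, "the/DT dog/NN"));           // false
    }
}
```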
The "substitutions" member of a rule replaces a matching group with a given tag.
The syntax is: groupIndex:value
where:
- groupIndex: the index of the group to replace (starting at 1)
- value: the value that replaces the specified group
Several substitutions can be combined by separating them with commas, as in 1:HI,2:WHO.
| syntactic pattern | substitutions | text | result |
|---|---|---|---|
| @DT (dog@NN\|cat@NN) | 1:Pet | the dog | the Pet |
| (hello@\|hi) (@NNP) | 1:HI,2:WHO | Hello Bryan | HI WHO |
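
The following sketch shows one way the `groupIndex:value` list could be interpreted. It is an assumption-laden illustration, not the library's implementation: plain `java.util.regex` patterns stand in for syntactic patterns, and the class name `SubstitutionSketch` and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: regular expressions stand in for syntactic patterns.
class SubstitutionSketch {
    // Parses "1:HI,2:WHO" into {1=HI, 2=WHO}.
    static Map<Integer, String> parse(String spec) {
        Map<Integer, String> result = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            int colon = entry.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Expected groupIndex:value but got: " + entry);
            }
            int groupIndex = Integer.parseInt(entry.substring(0, colon).trim());
            result.put(groupIndex, entry.substring(colon + 1));
        }
        return result;
    }

    // Replaces each listed group of the first match with its value.
    // Assumes the groups are not nested inside each other.
    static String apply(Pattern regex, String spec, String text) {
        Matcher m = regex.matcher(text);
        if (!m.find()) {
            return text; // no match: the text is left unchanged
        }
        Map<Integer, String> substitutions = parse(spec);
        StringBuilder out = new StringBuilder(text);
        // Go from the last group to the first so earlier offsets stay valid.
        for (int g = m.groupCount(); g >= 1; g--) {
            String value = substitutions.get(g);
            if (value != null && m.start(g) >= 0) {
                out.replace(m.start(g), m.end(g), value);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern greeting = Pattern.compile("(hello|hi) (\\w+)", Pattern.CASE_INSENSITIVE);
        System.out.println(apply(greeting, "1:HI,2:WHO", "Hello Bryan")); // HI WHO
        Pattern pet = Pattern.compile("\\w+ (dog|cat)");
        System.out.println(apply(pet, "1:Pet", "the dog"));               // the Pet
    }
}
```

Replacing from the highest group index downwards keeps the character offsets of the earlier groups valid without any extra bookkeeping.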
The Semantic Annotator Console lets you test and debug your taggers. It is a command line interface that can be used alongside any text editor to validate your tagger files.
Each time you save a file, the Semantic Annotator Console validates its content and displays debug information if an error is found. Watch the video.
This video describes how to create substitutions.
This video describes how to use a shared tagger.
Each time you save a file, the Semantic Annotator Console also validates all the taggers that depend on it, to check that your modifications do not introduce regressions. Watch the video.
Rules based on regular expressions are also validated. Watch the video.
You can test your taggers easily. Watch the video.
This feature lets you run all your taggers on a large text file (a book, for instance). It is really useful for detecting invalid annotations. Watch the video.
If this is the first time you run the console:
mvn -Dmaven.test.skip=true -pl '!semantic-annotator-console-delivery' clean install
Then:
mvn exec:java -pl semantic-annotator-console
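To use the library from Java code:

    import java.util.Collection;

    import cle.nlp.annotator.SemanticAnnotator;
    import cle.nlp.tagger.Tag;
    // SupportedLanguages must also be imported; its package is not shown in the original snippet.

    public class App {
        public static void main(String[] args) {
            // Load and validate all the taggers found in the given directory.
            SemanticAnnotator annotator = new SemanticAnnotator(SupportedLanguages.FR, "/path/dir");
            // Generate the annotations defined by those taggers for the input text.
            Collection<Tag> generatedTagLabels = annotator.getTags("this is a text");
        }
    }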