This repository provides a generalized library for dealing with natural language. Specifically it:
- Wraps several Java natural language parsing libraries.
- Gives access to the data structures rendered by the parsers.
- Provides utility functions to create features.
Table of Contents:
- Features
- Obtaining
- Documentation
- Example Parse
- Setup
- Usage
- Building
- Changelog
- Citation
- References
- License
Features:
- Callable from Java
- Callable from REST
- Callable from REST in a Docker Image
- Completely customizable.
- Easily extendable.
- Combines all annotations as pure Clojure data structures.
- Provides a feature creation library.
- Stitches multiple frameworks to provide the following features:
- Tokenizing
- Grouping Tokens into Sentences
- Lemmatisation
- Part of Speech Tagging
- Stop Words (both word and lemma)
- Named Entity Recognition
- Syntactic Parse Tree
- Fast Shift Reduce Parse Tree
- Dependency Tree
- Co-reference Graph
- Sentiment Analysis
- Semantic Role Labeler
In your project.clj file, add:
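A Leiningen dependency entry would look something like the following sketch; the artifact coordinates and version shown are placeholders inferred from the project's naming, so check the repository for the actual current values:

```clojure
;; in project.clj -- the coordinates and version below are illustrative
;; placeholders; consult the repository for the real, current values
:dependencies [[org.clojure/clojure "1.8.0"]
               [com.zensols.nlp/parse "x.y.z"]]
```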
The utterance parse annotation tree definitions are given here.
An example of a full annotation parse is given here.
The NER model is included in the Stanford CoreNLP dependencies, but you still have to download the POS model. To download (or create a symbolic link if you've set the ZMODEL environment variable):
$ make model
If this doesn't work, follow the manual steps below. Otherwise, you can optionally move the model to a shared location on the file system and skip to configuring the REPL.
If the normal setup failed, you'll have to manually download the POS tagger model.
The library can be configured to use any POS model (or NER for that matter), but by default it expects the english-left3words-distsim.tagger model.
- Create a directory in which to put the model:
$ mkdir -p path-to-model/stanford/pos
- Download the english-left3words-distsim.tagger model or a similar model.
- Install the model file:
$ unzip stanford-postagger-2015-12-09.zip
$ mv stanford-postagger-2015-12-09/models/english-left3words-distsim.tagger path-to-model/stanford/pos
If you downloaded the model into any location other than the current start directory (see setup), you will have to tell the REPL where the model is kept on the file system.
Start the REPL and configure:
user> (System/setProperty "zensols.model" "path-to-model")
Note that system properties can be passed via lein to avoid having to repeat this for each REPL instance.
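One way to pass the property through lein is a `:jvm-opts` entry in project.clj (or a profile); this is a sketch, with "path-to-model" standing in for your actual model directory:

```clojure
;; in project.clj: set the model location for every JVM/REPL lein starts
;; ("path-to-model" is a placeholder for your actual model directory)
:jvm-opts ["-Dzensols.model=path-to-model"]
```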
This package supports being called from Java, from REST, and from REST in a Docker image.
See the example repo, which illustrates how to use this library and contains the code from which these examples originate. It's highly recommended to clone it and follow along as you peruse this README.
user> (require '[zensols.nlparse.parse :refer (parse)])
user> (clojure.pprint/pprint (parse "I am Paul Landes."))
=> {:text "I am Paul Landes.",
    :mentions ({:entity-type "PERSON", :token-range [2 4], :ner-tag "PERSON", :sent-index 0, :char-range [5 16], :text "Paul Landes"}),
    :sents ({:text "I am Paul Landes.", :sent-index 0,
             :parse-tree {:label "ROOT", :child ({:label "S", :child ({:label "NP", :child ({:label "PRP", :child ({:label "I", :token-index 1})})} ...
             :dependency-parse-tree ({:token-index 4, :text "Landes", :child ({:dep "nsubj", :token-index 1, :text "I"} {:dep "cop", :token-index 2, :text "am"} {:dep "compound", :token-index 3, :text "Paul"} {:dep "punct", :token-index 5, :text "."})}), ...
             :tokens ({:token-range [0 1], :ner-tag "O", :pos-tag "PRP", :lemma "I", :token-index 1, :sent-index 0, :char-range [0 1], :text "I", :srl {:id 1, :propbank nil, :head-id 2, :dependency-label "root", :heads ({:function-tag "PPT", :dependency-label "A1"})}} ...
There are utility functions to help navigate the parsed data, as it can be pretty large. For example, to find the root of the dependency tree:
user> (def panon (parse "I am Paul Landes."))
=> {:text...
user> (->> panon :sents first p/root-dependency :text)
=> "Landes"
In this case, the last name is the root of the tree and happens to be a named entity as detected by the Stanford CoreNLP NER system. Named entities are annotated at the token level, but are also included in the top-level :mentions entry with the entire set of concatenated tokens (for cases where an NER mention contains more than one token, as in this case). To get the full mention text:
user> (->> panon :sents first p/root-dependency (p/mention-for-token panon) first :text)
=> "Paul Landes"
This library was written to generate features for machine learning algorithms. There are some utility functions for doing this. Here are a couple of examples.
Get the first propbank parsed from the SRL:
user> (->> panon f/first-propbank-label)
=> "be.01"
Get stats on features:
user> (->> panon p/tokens (f/token-features panon))
=> {:utterance-length 17, :mention-count 1, :sent-count 1, :token-count 5, :token-average-length 14/5, :is-question false}
Each function X has an analog function X-feature-keys that describes the features generated and their types, which can be used directly as Weka attributes:
user> (clojure.pprint/pprint (f/token-feature-metas))
=> [[:utterance-length numeric] [:mention-count numeric] [:sent-count numeric] [:token-count numeric] [:token-average-length numeric] [:is-question boolean]]
Get the in/out-of-vocabulary ratio:
user> (->> panon p/tokens f/dictionary-features)
=> {:in-dict-ratio 4/5}
See the NLP feature library for more information on dictionary specifics.
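Since the parse annotation is a plain Clojure map, ordinary sequence functions work on it directly. The following self-contained sketch uses a hand-abbreviated literal of the example parse above (no library calls) to show one way of pulling out the named-entity tokens:

```clojure
;; an abbreviated literal of the annotation map shown in the example parse
(def panon
  {:text "I am Paul Landes."
   :mentions [{:entity-type "PERSON" :token-range [2 4] :text "Paul Landes"}]
   :sents [{:sent-index 0
            :tokens [{:text "I" :ner-tag "O" :pos-tag "PRP"}
                     {:text "am" :ner-tag "O" :pos-tag "VBP"}
                     {:text "Paul" :ner-tag "PERSON" :pos-tag "NNP"}
                     {:text "Landes" :ner-tag "PERSON" :pos-tag "NNP"}
                     {:text "." :ner-tag "O" :pos-tag "."}]}]})

;; collect the text of every token carrying a named-entity tag
(->> panon
     :sents
     (mapcat :tokens)
     (remove #(= "O" (:ner-tag %)))
     (mapv :text))
;; => ["Paul" "Landes"]
```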
You can not only configure which components the natural language processing pipeline uses, but also define and add your own plugin library. See the config namespace for more information.
For example, if all you need is tokenization and sentence chunking, create a context with the specific components and parse using the with-context macro:
(require '[zensols.nlparse.config :as conf :refer (with-context)]
         '[zensols.nlparse.parse :refer (parse)])
(let [ctx (->> (conf/create-parse-config
                :pipeline [(conf/tokenize) (conf/sentence)])
               conf/create-context)]
  (with-context ctx
    (parse "I love Clojure. I enjoy it.")))
You can also specify the configuration in the form of a string:
(let [ctx (conf/create-context "tokenize,sentence,part-of-speech")]
  (with-context ctx
    (parse "I love Clojure. I enjoy it.")))
The configuration string can also take parameters (e.g. the en parameter to the tokenizer, specifying English as the natural language):
(let [ctx (conf/create-context "tokenize(en),sentence,part-of-speech")]
  (with-context ctx
    (parse "I love Clojure. I enjoy it.")))
For an example of how to configure the pipeline, see this test case. For more information on the DSL itself, see the DSL parser.
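Because the configuration is just a comma-separated string, it can also be assembled programmatically. This is a plain-Clojure sketch; the helper name pipeline-config is made up for illustration and is not part of the library:

```clojure
(require '[clojure.string :as str])

;; build a pipeline DSL string from component names; `pipeline-config`
;; is a hypothetical helper, not a function provided by the library
(defn pipeline-config [& components]
  (str/join "," components))

(pipeline-config "tokenize(en)" "sentence" "part-of-speech")
;; => "tokenize(en),sentence,part-of-speech"
```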
If you use a particular configuration that doesn't change often, consider creating your own utility parse namespace:
(ns example.nlp.parse
  (:require [zensols.nlparse.parse :as p]
            [zensols.nlparse.config :as conf :refer (with-context)]))

(defonce ^:private parse-context-inst (atom nil))

(defn- create-context []
  (->> ["tokenize" "sentence" "part-of-speech" "morphology"
        "named-entity-recognizer" "parse-tree"]
       (clojure.string/join ",")
       conf/create-context))

(defn- context []
  (swap! parse-context-inst #(or % (create-context))))

(defn parse [utterance]
  (with-context (context)
    (p/parse utterance)))
Now in your application namespace:
(ns example.nlp.core
  (:require [example.nlp.parse :as p]))

(defn somefn []
  (p/parse "an utterance"))
The command line usage of this project has moved to the NLP server.
To build from source, do the following:
- Install Leiningen (this is just a script)
- Install GNU make
- Install Git
- Download the source:
git clone --recurse-submodules https://github.com/plandes/clj-nlp-parse && cd clj-nlp-parse
- Build the software:
make jar
- Build the distribution binaries:
make dist
Note that you can also build a single jar file with all the dependencies with: make uber
An extensive changelog is available here.
If you use this software in your research, please cite with the following BibTeX:
@misc{plandes-clj-nlp-parse,
  author = {Paul Landes},
  title = {Natural Language Parse and Feature Generation},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/plandes/clj-nlp-parse}}
}
See the general NLP feature creation library for additional references.
@phdthesis{choi2014optimization,
  title = {Optimization of natural language processing components for robustness and scalability},
  author = {Choi, Jinho D.},
  year = {2014},
  school = {University of Colorado Boulder}
}
@InProceedings{manning-EtAl:2014:P14-5,
  author = {Manning, Christopher D. and Surdeanu, Mihai and Bauer, John and Finkel, Jenny and Bethard, Steven J. and McClosky, David},
  title = {The {Stanford} {CoreNLP} Natural Language Processing Toolkit},
  booktitle = {Association for Computational Linguistics (ACL) System Demonstrations},
  year = {2014},
  pages = {55--60},
  url = {http://www.aclweb.org/anthology/P/P14/P14-5010}
}
Copyright © 2016, 2017, 2018 Paul Landes
Apache License version 2.0
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.