This is a Wordseer-specific fork of Dustin Smith's stanford-corenlp-python, a Python interface to Stanford CoreNLP. It can be used either as a Python package or run as a JSON-RPC server.
- Tested only with the current annotator configuration: not a general-purpose wrapper
- Updated to Stanford CoreNLP v3.5.2
- Added multi-threaded load balancing
- Fixed many bugs and improved performance
- Uses jsonrpclib for stability and performance
- Constants such as the Stanford CoreNLP directory can be passed as arguments
- Adjusted parameters to avoid timeouts under high load
- Fixed a problem with long text input (fix from Johannes Castner's stanford-corenlp-python)
- Packaging
Requirements:
- pexpect
- unidecode
- jsonrpclib (optionally)
To use this program, you must download and unpack the zip file containing Stanford's CoreNLP package. By default, corenlp.py looks for the Stanford CoreNLP folder as a subdirectory of the directory the script is run from.
Then, to launch a server:
python corenlp/corenlp.py
Optionally, you can specify a host or port:
python corenlp/corenlp.py -H 0.0.0.0 -p 3456
That will run a public JSON-RPC server on port 3456.
For additional concurrency, you can add a load-balancing layer on top:
python corenlp/corenlp.py --ports=8081,8082,8083,8084
You can also specify the Stanford CoreNLP directory:
python corenlp/corenlp.py -S stanford-corenlp-full-2013-06-20/
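When the CoreNLP folder name differs between machines, the value passed to -S can be located before launching the server. The following is a minimal sketch, assuming the unpacked folder keeps a name starting with stanford-corenlp-full- as in the example command; find_corenlp_dir is a hypothetical helper and not part of this package.

```python
import glob
import os


def find_corenlp_dir(base_dir="."):
    """Return the stanford-corenlp-full-* directory that sorts last by name, or None."""
    pattern = os.path.join(base_dir, "stanford-corenlp-full-*")
    # Keep directories only; the downloaded zip file may sit next to the unpacked folder.
    candidates = sorted(p for p in glob.glob(pattern) if os.path.isdir(p))
    return candidates[-1] if candidates else None


corenlp_dir = find_corenlp_dir()
if corenlp_dir is None:
    print("No CoreNLP directory found; download and unpack it first.")
else:
    # Print the launch command rather than running it.
    print("python corenlp/corenlp.py -S %s/" % corenlp_dir)
```

Because the release folders carry a date suffix, sorting by name picks the most recent one when several are present.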
Assuming the server is running on port 8080 and the CoreNLP directory is stanford-corenlp-full-2013-06-20/ in the current directory, the code in client.py shows an example parse:
import jsonrpclib
from simplejson import loads
server = jsonrpclib.Server("http://localhost:8080")
result = loads(server.parse("Hello world. It is so beautiful"))
print "Result", result
If you are using the load balancing component, then you can use the following approach:
import jsonrpclib
from simplejson import loads
server = jsonrpclib.Server("http://localhost:8080")
result = loads(server.send("Hello world. It is so beautiful"))
print "Result", server.getForKey(result['key'])
# asynchronous parsing and retrieval
sents = [
    'add in as many sentences as you want',
    'your mileage may vary'
]
for sent in sents:
    server.send(sent)
# this approach is non-blocking
print server.getCompleted()
# this approach waits for all in-progress parses to finish (i.e. blocks)
print server.getAll()
A parse returns a dictionary containing the keys 'sentences' and (when applicable) 'coref'. The key 'sentences' holds a list of dictionaries, one per sentence, each of which contains 'parsetree', 'text', 'tuples' (the dependencies), and 'words' (information about parts of speech, NER, etc.):
{u'sentences': [{u'parsetree': u'(ROOT (S (VP (NP (INTJ (UH Hello)) (NP (NN world)))) (. !)))',
                 u'text': u'Hello world!',
                 u'tuples': [[u'dep', u'world', u'Hello'],
                             [u'root', u'ROOT', u'world']],
                 u'words': [[u'Hello',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'5',
                              u'Lemma': u'hello',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'UH'}],
                            [u'world',
                             {u'CharacterOffsetBegin': u'6',
                              u'CharacterOffsetEnd': u'11',
                              u'Lemma': u'world',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'NN'}],
                            [u'!',
                             {u'CharacterOffsetBegin': u'11',
                              u'CharacterOffsetEnd': u'12',
                              u'Lemma': u'!',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'.'}]]},
                {u'parsetree': u'(ROOT (S (NP (PRP It)) (VP (VBZ is) (ADJP (RB so) (JJ beautiful))) (. .)))',
                 u'text': u'It is so beautiful.',
                 u'tuples': [[u'nsubj', u'beautiful', u'It'],
                             [u'cop', u'beautiful', u'is'],
                             [u'advmod', u'beautiful', u'so'],
                             [u'root', u'ROOT', u'beautiful']],
                 u'words': [[u'It',
                             {u'CharacterOffsetBegin': u'14',
                              u'CharacterOffsetEnd': u'16',
                              u'Lemma': u'it',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'PRP'}],
                            [u'is',
                             {u'CharacterOffsetBegin': u'17',
                              u'CharacterOffsetEnd': u'19',
                              u'Lemma': u'be',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'VBZ'}],
                            [u'so',
                             {u'CharacterOffsetBegin': u'20',
                              u'CharacterOffsetEnd': u'22',
                              u'Lemma': u'so',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'RB'}],
                            [u'beautiful',
                             {u'CharacterOffsetBegin': u'23',
                              u'CharacterOffsetEnd': u'32',
                              u'Lemma': u'beautiful',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'JJ'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'32',
                              u'CharacterOffsetEnd': u'33',
                              u'Lemma': u'.',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'.'}]]}],
 u'coref': [[[[u'It', 1, 0, 0, 1], [u'Hello world', 0, 1, 0, 2]]]]}
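The parse result can be walked with ordinary dictionary and list access. The following is a minimal sketch, assuming a result shaped like the sample output; the sample dictionary is a trimmed copy of that output (character offsets omitted), the helper names tokens_with_tags and dependencies_by_head are hypothetical and not part of this package, and the [relation, head, dependent] order of each entry in 'tuples' is read off the sample rather than taken from documentation.

```python
# Trimmed copy of the second sentence from the sample output (offsets omitted).
result = {
    u'sentences': [
        {
            u'text': u'It is so beautiful.',
            u'tuples': [
                [u'nsubj', u'beautiful', u'It'],
                [u'cop', u'beautiful', u'is'],
                [u'advmod', u'beautiful', u'so'],
                [u'root', u'ROOT', u'beautiful'],
            ],
            u'words': [
                [u'It', {u'Lemma': u'it', u'NamedEntityTag': u'O', u'PartOfSpeech': u'PRP'}],
                [u'is', {u'Lemma': u'be', u'NamedEntityTag': u'O', u'PartOfSpeech': u'VBZ'}],
                [u'so', {u'Lemma': u'so', u'NamedEntityTag': u'O', u'PartOfSpeech': u'RB'}],
                [u'beautiful', {u'Lemma': u'beautiful', u'NamedEntityTag': u'O', u'PartOfSpeech': u'JJ'}],
                [u'.', {u'Lemma': u'.', u'NamedEntityTag': u'O', u'PartOfSpeech': u'.'}],
            ],
        }
    ]
}


def tokens_with_tags(sentence):
    """Return (word, part-of-speech, lemma) triples for one sentence dictionary."""
    return [(word, info[u'PartOfSpeech'], info[u'Lemma'])
            for word, info in sentence[u'words']]


def dependencies_by_head(sentence):
    """Group dependents by head word: {head: [(relation, dependent), ...]}."""
    grouped = {}
    for relation, head, dependent in sentence[u'tuples']:
        grouped.setdefault(head, []).append((relation, dependent))
    return grouped


for sentence in result[u'sentences']:
    print(sentence[u'text'])
    print(tokens_with_tags(sentence))
    print(dependencies_by_head(sentence))
```

Each entry in 'words' is a two-element list of the token and its attribute dictionary, which is why it unpacks directly in the comprehension.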
To use the library without JSON-RPC, load the module directly:
from corenlp import StanfordCoreNLP
corenlp_dir = "stanford-corenlp-full-2013-06-20/"
corenlp = StanfordCoreNLP(corenlp_dir)  # wait a few minutes...
corenlp.raw_parse("Parse it")
If you need to parse long texts (more than 30-50 sentences), you must use the batch_parse function. It reads text files from an input directory and returns a generator of dictionaries, one per file, each holding that file's parse results:
from corenlp import batch_parse
corenlp_dir = "stanford-corenlp-full-2013-06-20/"
raw_text_directory = "sample_raw_text/"
parsed = batch_parse(raw_text_directory, corenlp_dir)  # It returns a generator object
# Materialize the generator to see the results
print list(parsed)  #=> [{'coref': ..., 'sentences': ..., 'file_name': 'new_sample.txt'}]
The function uses the XML output feature of Stanford CoreNLP, and you can get all of the information with the raw_output option. If it is true, CoreNLP's XML is returned as a dictionary without converting the format.
parsed = batch_parse(raw_text_directory, corenlp_dir, raw_output=True)
(Note: the function now requires xmltodict; install it with sudo pip install xmltodict.)
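Since batch_parse returns a generator, its results can be consumed one file at a time instead of being collected into a list first. The following is a minimal sketch, assuming each yielded dictionary carries the 'file_name' and 'sentences' keys shown in the batch_parse example; summarize_parsed and fake_parsed are hypothetical names, and fake_parsed is only a stand-in so the snippet runs without CoreNLP installed.

```python
def summarize_parsed(parsed):
    """Yield (file_name, sentence_count) for each parsed-file dictionary."""
    for doc in parsed:
        yield doc['file_name'], len(doc['sentences'])


def fake_parsed():
    """Stand-in for batch_parse(raw_text_directory, corenlp_dir); yields one file."""
    yield {'file_name': 'new_sample.txt',
           'sentences': [{'text': 'Hello world!'},
                         {'text': 'It is so beautiful.'}]}


# Each file is summarized as soon as its parse is yielded.
for file_name, count in summarize_parsed(fake_parsed()):
    print("%s: %d sentences" % (file_name, count))
```

With the real function, the call would presumably read summarize_parsed(batch_parse(raw_text_directory, corenlp_dir)); keep in mind that a generator can be iterated only once.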
- Hiroyoshi Komatsu [hiroyoshi.komat@gmail.com]
- Johannes Castner [jac2130@columbia.edu]
- Robert Elwell [robert@wikia-inc.com]
- Tristan Chong [tristan@wikia-inc.com]
- Aditi Muralidharan [aditi.shrikumar@gmail.com]
- Ian MacFarland [ianmacfarland@ischool.berkeley.edu]