|
Stanford NER - v4.2.0 - 2020-11-17 |
|
---------------------------------------------- |
|
|
|
This package provides a high-performance machine learning based named |
|
entity recognition system, including facilities to train models from |
|
supervised training data and pre-trained models for English. |
|
|
|
(c) 2002-2020. The Board of Trustees of The Leland |
|
Stanford Junior University. All Rights Reserved. |
|
|
|
Original CRF code by Jenny Finkel. |
|
Additional modules, features, internationalization, compaction, and |
|
support code by Christopher Manning, Dan Klein, Christopher Cox, Huy Nguyen |
|
Shipra Dingare, Anna Rafferty, and John Bauer. |
|
This release prepared by Jason Bolton. |
|
|
|
LICENSE |
|
|
|
The software is licensed under the full GPL v2+. Please see the file LICENCE.txt |
|
|
|
For more information, bug reports, and fixes, contact: |
|
Christopher Manning |
|
Dept of Computer Science, Gates 2A |
|
Stanford CA 94305-9020 |
|
USA |
|
java-nlp-support@lists.stanford.edu |
|
https://nlp.stanford.edu/software/CRF-NER.html |
|
|
|
CONTACT |
|
|
|
For questions about this distribution, please contact Stanford's JavaNLP group |
|
at java-nlp-user@lists.stanford.edu. We provide assistance on a best-effort |
|
basis. |
|
|
|
TUTORIAL |
|
|
|
Quickstart guidelines, primarily for end users who wish to use the included NER |
|
models, are below. For further instructions on training your own NER model, |
|
go to https://nlp.stanford.edu/software/crf-faq.html. |
|
|
|
INCLUDED SERIALIZED MODELS / TRAINING DATA |
|
|
|
The basic included serialized model is a 3 class NER tagger that can |
|
label: PERSON, ORGANIZATION, and LOCATION entities. It is included as |
|
english.all.3class.distsim.crf.ser.gz. It is trained on data from |
|
CoNLL, MUC6, MUC7, ACE, OntoNotes, and Wikipedia. |
|
Because this model is trained on both US |
|
and UK newswire, it is fairly robust across the two domains. |
|
|
|
We have also included a 4 class NER tagger trained on the CoNLL 2003 |
|
Shared Task training data that labels for PERSON, ORGANIZATION, |
|
LOCATION, and MISC. It is named |
|
english.conll.4class.distsim.crf.ser.gz . |
|
|
|
A third model is trained only on data from MUC and |
|
distinguishes between 7 different classes: |
|
english.muc.7class.distsim.crf.ser.gz. |
|
|
|
All of the serialized classifiers come in two versions, one trained to |
|
basically expected standard written English capitalization, and the other |
|
to ignore capitalization information. The case-insensitive versions |
|
of the three models available on the Stanford NER webpage. |
|
These models use a distributional similarity lexicon to improve performance |
|
(by between 1.5%-3% F-measure). The distributional similarity features |
|
make the models perform substantially better, but they require rather |
|
more memory. The distsim models are included in the release package. |
|
The nodistsim versions of the same models may be available on the |
|
Stanford NER webpage. |
|
|
|
Finally, we have models for other languages, including two German models, |
|
a Chinese model, and a Spanish model. The files for these models can be |
|
found at: |
|
|
|
http://nlp.stanford.edu/software/CRF-NER.html |
|
|
|
|
|
QUICKSTART INSTRUCTIONS |
|
|
|
This NER system requires Java 1.8 or later. |
|
|
|
Providing java is on your PATH, you should be able to run an NER GUI |
|
demonstration by just clicking. It might work to double-click on the |
|
stanford-ner.jar archive but this may well fail as the operating system |
|
does not give Java enough memory for our NER system, so it is safer to |
|
instead double click on the ner-gui.bat icon (Windows) or ner-gui.sh |
|
(Linux/Unix/MacOSX). Then, using the top option from the Classifier |
|
menu, load a CRF classifier from the classifiers directory of the |
|
distribution. You can then `either load a text file or web page from |
|
the File menu, or decide to use the default text in the window. Finally, |
|
you can now named entity tag the text by pressing the Run NER button. |
|
|
|
From a command line, you need to have java on your PATH and the |
|
stanford-ner.jar file and the lib directory in your CLASSPATH. (The way of doing this depends on |
|
your OS/shell.) The supplied ner.bat and ner.sh should work to allow |
|
you to tag a single file. For example, for Windows: |
|
|
|
ner file |
|
|
|
Or on Unix/Linux you should be able to parse the test file in the distribution |
|
directory with the command: |
|
|
|
java -mx600m -cp stanford-ner.jar:lib/* edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile sample.txt |
|
|
|
Here's an output option that will print out entities and their class to |
|
the first two columns of a tab-separated columns output file: |
|
|
|
java -mx600m -cp stanford-ner.jar:lib/* edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -outputFormat tabbedEntities -textFile sample.txt > sample.tsv |
|
|
|
When run from a jar file, you also have the option of using a serialized |
|
classifier contained in the jar file. |
|
|
|
USING FULL STANFORD CORENLP NER FUNCTIONALITY |
|
|
|
This standalone distribution also allows access to the full NER |
|
capabilities of the Stanford CoreNLP pipeline. These capabilities |
|
can be accessed via the NERClassifierCombiner class. |
|
NERClassifierCombiner allows for multiple CRFs to be used together, |
|
and has options for recognizing numeric sequence patterns and time |
|
patterns with the rule-based NER of SUTime. |
|
|
|
Suppose one combines three CRF's CRF-1,CRF-2, and CRF-3 with the |
|
NERClassifierCombiner. When the NERClassiferCombiner runs, it will |
|
first apply the NER tags of CRF-1 to the text, then it will apply |
|
CRF-2's NER tags to any tokens not tagged by CRF-1 and so on. If |
|
the option ner.combinationMode is set to NORMAL (default), any label |
|
applied by CRF-1 cannot be applied by subsequent CRF's. For instance |
|
if CRF-1 applies the LOCATION tag, no other CRF's LOCATION tag will be |
|
used. If ner.combinationMode is set to HIGH_RECALL, this limitation |
|
will be deactivated. |
|
|
|
To use NERClassifierCombiner at the command-line, the jars in lib |
|
and stanford-ner.jar must be in the CLASSPATH. Here is an example command: |
|
|
|
java -mx2g edu.stanford.nlp.ie.NERClassifierCombiner -ner.model \ |
|
classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz \ |
|
-ner.useSUTime false -textFile sample-w-time.txt |
|
|
|
Let's break this down a bit. The flag "-ner.model" should be followed by a |
|
list of CRF's to be combined by the NERClassifierCombiner. Some serialized |
|
CRF's are provided in the classifiers directory. In this example the CRF's |
|
trained on the CONLL 4 class data and the MUC 7 class data are being combined. |
|
|
|
When the flag "-ner.useSUTime" is followed by "false", SUTime is shut off. You should |
|
note that when the "false" is omitted, the text "4 days ago" suddenly is |
|
tagged with DATE. These are the kinds of phrases SUTime can identify. |
|
|
|
NERClassifierCombiner can be run on different types of input as well. Here is |
|
an example which is run on CONLL style input: |
|
|
|
java -mx2g edu.stanford.nlp.ie.NERClassifierCombiner -ner.model \ |
|
classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz \ |
|
-map word=0,answer=1 -testFile sample-conll-file.txt |
|
|
|
It is crucial to include the "-map word=0,answer=1" , which is specifying that |
|
the input test file has the words in the first column and the answer labels |
|
in the second column. |
|
|
|
It is also possible to serialize and load an NERClassifierCombiner. |
|
|
|
This command loads the three sample crfs with combinationMode=HIGH_RECALL |
|
and SUTime=false, and dumps them to a file named |
|
test_serialized_ncc.ncc.ser.gz |
|
|
|
java -mx2g edu.stanford.nlp.ie.NERClassifierCombiner -ner.model \ |
|
classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz,\ |
|
classifiers/english.all.3class.distsim.crf.ser.gz -ner.useSUTime false \ |
|
-ner.combinationMode HIGH_RECALL -serializeTo test.serialized.ncc.ncc.ser.gz |
|
|
|
An example serialized NERClassifierCombiner with these settings is supplied in |
|
the classifiers directory. Here is an example of loading that classifier and |
|
running it on the sample CONLL data: |
|
|
|
java -mx2g edu.stanford.nlp.ie.NERClassifierCombiner -loadClassifier \ |
|
classifiers/example.serialized.ncc.ncc.ser.gz -map word=0,answer=1 \ |
|
-testFile sample-conll-file.txt |
|
|
|
For a more exhaustive description of NERClassifierCombiner go to |
|
http://nlp.stanford.edu/software/ncc-faq.html |
|
|
|
PROGRAMMATIC USE |
|
|
|
The NERDemo file illustrates a couple of ways of calling the system |
|
programatically. You should get the same results from |
|
|
|
java -cp stanford-ner.jar:lib/*:. -mx300m NERDemo classifiers/english.all.3class.distsim.crf.ser.gz sample.txt |
|
|
|
as from using CRFClassifier. For more information on API calls, look in |
|
the enclosed javadoc directory: load index.html in a browser and look |
|
first at the edu.stanford.nlp.ie.crf package and CRFClassifier class. |
|
If you wish to train your own NER systems, look also at the |
|
edu.stanford.nlp.ie package NERFeatureFactory class. |
|
|
|
|
|
SERVER VERSION |
|
|
|
The NER code may also be run as a server listening on a socket: |
|
|
|
java -mx1000m -cp stanford-ner.jar:lib/* edu.stanford.nlp.ie.NERServer 1234 |
|
|
|
You can specify which model to load with flags, either one on disk: |
|
|
|
java -mx1000m -cp stanford-ner.jar:lib/* edu.stanford.nlp.ie.NERServer -loadClassifier classifiers/all.3class.crf.ser.gz 1234 |
|
|
|
Or if you have put a model inside the jar file, as a resource under, say, models: |
|
|
|
java -mx1000m -cp stanford-ner.jar:lib/* edu.stanford.nlp.ie.NERServer -loadClassifier models/all.3class.crf.ser.gz 1234 |
|
|
|
|
|
RUNNING CLASSIFIERS FROM INSIDE A JAR FILE |
|
|
|
The software can run any serialized classifier from within a jar file by |
|
following the -loadClassifier flag by some resource available within a |
|
jar file on the CLASSPATH. An end user can make |
|
their own jar files with the desired NER models contained inside. |
|
This allows single jar file deployment. |
|
|
|
|
|
PERFORMANCE GUIDELINES |
|
|
|
Performance depends on many factors. Speed and memory use depend on |
|
hardware, operating system, and JVM. Accuracy depends on the data |
|
tested on. Nevertheless, in the belief that something is better than |
|
nothing, here are some statistics from one machine on one test set, in |
|
semi-realistic conditions (where the test data is somewhat varied). |
|
|
|
ner-eng-ie.crf-3-all2006-distsim.ser.gz (older version of ner-eng-ie.crf-3-all2008-distsim.ser.gz) |
|
Memory: 320MB (on a 32 bit machine) |
|
PERSON ORGANIZATION LOCATION |
|
91.88 82.91 88.21 |
|
|
|
|
|
-------------------- |
|
CHANGES |
|
-------------------- |
|
|
|
2020-11-17 4.2.0 Update for compatibility |
|
|
|
2020-05-10 4.0.0 Update to UDv2.0 tokenization |
|
|
|
2018-10-16 3.9.2 Update for compatibility |
|
|
|
2018-02-27 3.9.1 KBP ner models for Chinese and Spanish |
|
|
|
2017-06-09 3.8.0 Updated for compatibility |
|
|
|
2016-10-31 3.7.0 Improved Chinese NER |
|
|
|
2015-12-09 3.6.0 Updated for compatibility |
|
|
|
2015-04-20 3.5.2 synch standalone and CoreNLP functionality |
|
|
|
2015-01-29 3.5.1 Substantial accuracy improvements |
|
|
|
2014-10-26 3.5.0 Upgrade to Java 1.8 |
|
|
|
2014-08-27 3.4.1 Add Spanish models |
|
|
|
2014-06-16 3.4 Fix serialization bug |
|
|
|
2014-01-04 3.3.1 Bugfix release |
|
|
|
2013-11-12 3.3.0 Update for compatibility |
|
|
|
2013-11-12 3.3.0 Update for compatibility |
|
|
|
2013-06-19 3.2.0 Improve handling of line-by-line input |
|
|
|
2013-04-04 1.2.8 nthreads option |
|
|
|
2012-11-11 1.2.7 Improved English 3 class model by including |
|
data from Wikipedia, release Chinese model |
|
|
|
2012-07-09 1.2.6 Minor bug fixes |
|
|
|
2012-05-22 1.2.5 Fix encoding issue |
|
|
|
2012-04-07 1.2.4 Caseless version of English models supported |
|
|
|
2012-01-06 1.2.3 Minor bug fixes |
|
|
|
2011-09-14 1.2.2 Improved thread safety |
|
|
|
2011-06-19 1.2.1 Models reduced in size but on average improved |
|
in accuracy (improved distsim clusters) |
|
|
|
2011-05-16 1.2 Normal download includes 3, 4, and 7 |
|
class models. Updated for compatibility |
|
with other software releases. |
|
|
|
2009-01-16 1.1.1 Minor bug and usability fixes, changed API |
|
|
|
2008-05-07 1.1 Additional feature flags, various code updates |
|
|
|
2006-09-18 1.0 Initial release |
|
|
|
|