Training an NER Model Using the Stanford CoreNLP CRF Classifier

Download the latest Stanford CoreNLP suite from https://stanfordnlp.github.io/CoreNLP/

Prepare files:

  1. Training dataset. A list of word tokens, each annotated with its named-entity class (e.g. PERS for a person entity). Example: jane-austen-emma-ch1.tsv
  2. Training properties. Training configuration properties for the CRF classifier (e.g. input file(s), output file, features, etc.). Example: jane-austen.prop
  3. Test dataset. Same format as the training dataset, but with a different list of tokens. Example: jane-austen-emma-ch2.tsv
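To make the first two files more concrete, here is a minimal Python sketch of what they can look like. The dataset is one token per line, followed by a tab and its label, with O for tokens outside any entity. The property names are taken from the example in the Stanford NER CRF FAQ, but the real jane-austen.prop may use a different feature set, and the handful of tokens below is made up for illustration.

```python
# Hypothetical hand-labelled tokens: (token, label).
# "O" marks tokens that are not part of any named entity.
SAMPLE_TOKENS = [
    ("Emma", "PERS"),
    ("Woodhouse", "PERS"),
    (",", "O"),
    ("handsome", "O"),
    (",", "O"),
    ("clever", "O"),
]

# A properties file in the style of the Stanford NER CRF FAQ example.
# Treat the feature flags as a starting point, not the article's exact file.
PROPS = """\
# training data and where to save the trained model
trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz

# column 0 holds the word, column 1 holds the correct label
map = word=0,answer=1

# features to train with
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
wordShape=chris2useLC
"""


def to_tsv(pairs):
    """Render (token, label) pairs as tab-separated lines, one token per line."""
    return "".join(f"{token}\t{label}\n" for token, label in pairs)


if __name__ == "__main__":
    print(to_tsv(SAMPLE_TOKENS), end="")
    print(PROPS, end="")
```

The test dataset uses exactly the same two-column layout, so the same `to_tsv` helper would work for it.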

If you have datasets in ENAMEX or OpenNLP format, you can use the simple Python scripts enamex2stanfordner.py or tag2stanfordner.py to convert them.
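For a rough idea of what such a conversion involves, here is a small sketch for ENAMEX-style input. It is not the enamex2stanfordner.py script itself: the function names, the regex tokenizer and the choice to pass the TYPE attribute through unchanged as the label are all assumptions made for illustration.

```python
import re

# An inline ENAMEX annotation such as <ENAMEX TYPE="PERS">Emma</ENAMEX>.
ENAMEX_RE = re.compile(r'<ENAMEX\s+TYPE="([^"]+)">(.*?)</ENAMEX>', re.DOTALL)

# Very naive tokenizer (an assumption): runs of word characters,
# or a single punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    return TOKEN_RE.findall(text)


def enamex_to_rows(text, background="O"):
    """Turn ENAMEX-tagged text into a list of (token, label) pairs."""
    rows = []
    pos = 0
    for match in ENAMEX_RE.finditer(text):
        # Plain text before the tag gets the background label.
        rows += [(tok, background) for tok in tokenize(text[pos:match.start()])]
        # Every token inside the tag gets the tag's TYPE as its label.
        rows += [(tok, match.group(1)) for tok in tokenize(match.group(2))]
        pos = match.end()
    rows += [(tok, background) for tok in tokenize(text[pos:])]
    return rows


if __name__ == "__main__":
    sample = '<ENAMEX TYPE="PERS">Emma Woodhouse</ENAMEX>, handsome'
    for token, label in enamex_to_rows(sample):
        print(f"{token}\t{label}")
```

A real converter would also need to map source labels (for example PERSON) to whatever label set you train with, and should use the same tokenization you plan to use at prediction time.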

Change into the unzipped Stanford CoreNLP directory and run this command to train the model:

java -cp "*" edu.stanford.nlp.ie.crf.CRFClassifier -prop jane-austen.prop

The output ends by naming the model file it wrote, here ner-model.ser.gz:

.... 
[main] INFO edu.stanford.nlp.ie.crf.CRFClassifier - CRFClassifier training ... done [2.1 sec]. 
[main] INFO edu.stanford.nlp.ie.crf.CRFClassifier - Serializing classifier to ner-model.ser.gz... 
[main] INFO edu.stanford.nlp.ie.crf.CRFClassifier - done.

To test the model against the test file, run this command:

java -cp "*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -testFile jane-austen-emma-ch2.tsv

The result shows the performance of the model. In this case it achieves about 82% precision, 73% recall and 77% F-measure:

... 
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - CRFClassifier tagged 1999 words in 1 documents at 6305.99 words per second. 
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Entity	P	R	F1	TP	FP	FN 
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - PERS	0.8205	0.7273	0.7711	32	7	12
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Totals	0.8205	0.7273	0.7711	32	7	12
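The P, R and F1 columns follow directly from the TP, FP and FN counts in the same row, so you can sanity-check them yourself. A minimal sketch using the standard definitions (the function name is mine, not part of CoreNLP):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Counts from the PERS row above: TP=32, FP=7, FN=12.
    p, r, f1 = precision_recall_f1(32, 7, 12)
    print(f"{p:.4f}\t{r:.4f}\t{f1:.4f}")  # 0.8205	0.7273	0.7711
```

Since PERS is the only entity class in this dataset, the Totals row is identical to the PERS row.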

You can also train the classifier for languages other than English, as long as you provide suitable training and test datasets.

Cheers! 🍻

References: https://nlp.stanford.edu/software/crf-faq.html
