There are endless vocabulary lists on the Internet, and on my blog for that matter, which a student of English can download. Finding lists of collocations for a particular subject area is a lot harder, though. The other day I came across a program, the Multilingual Corpus Toolkit (MLCT), that automatically extracts commonly occurring collocations from a file or text corpus. MLCT is not public domain and is intended for non-commercial use.
To run the program you need to have the Java Runtime Environment (JRE) installed on your computer. After downloading and unzipping the files, double-click on the file ‘run_mlct_public’ to start the application. Once it has started, select “LexTools->Collocation Parameters->Limit Number of Tokens” from the menu. Extracting collocations from a text file is then quite easy: all you have to do is select one of the options from the “LexTools->Extract n-grams” menu. You’ll then be prompted for the file you want to analyse, and the results will appear in a separate window.
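To get a feel for what an “Extract n-grams” option is doing under the hood, here is a minimal sketch in Python of the basic idea: slide a window over the tokens, count each adjacent pair, and keep the pairs that occur often enough. This is only an illustration of the general technique, not MLCT’s actual implementation, and the sample text and `min_freq` threshold are made up.

```python
from collections import Counter

def extract_bigrams(tokens, min_freq=3):
    """Count adjacent word pairs and keep those seen at least min_freq times."""
    pairs = zip(tokens, tokens[1:])          # sliding window of size 2
    counts = Counter(pairs)
    return [(pair, n) for pair, n in counts.most_common() if n >= min_freq]

# tiny made-up sample; a real run would tokenise a whole corpus file
tokens = "the quick brown fox the quick brown dog the quick cat".split()
print(extract_bigrams(tokens, min_freq=2))
# → [(('the', 'quick'), 3), (('quick', 'brown'), 2)]
```

A real tool would also normalise case and punctuation and rank the surviving pairs with an association measure rather than raw frequency, which is exactly what the NLTK approach below adds.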
For those of you who are prepared to put in a lot more effort, you can download the Natural Language Toolkit (NLTK) and read the excellent book, “Natural Language Processing with Python”. The tutorials in this book take you through collocation extraction and much, much more. To try out all the examples you will need to install Python, PyYAML and NLTK. Once you have done this, just follow the instructions in the book and try out the examples using PyScripter, which you can download here. To check everything is working, you can copy and paste the code listed below into the PyScripter editor window and select run. You should see the following output:
“[(u'Beer', u'Lahai'), (u'Lahai', u'Roi'), (u'gray', u'hairs'), (u'Most', u'High'), (u'ewe', u'lambs'), (u'many', u'colors'), (u'burnt', u'offering'), (u'Paddan', u'Aram'), (u'east', u'wind'), (u'living', u'creature')]”.
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# return the 10 n-grams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 10))