Collocation Extraction

There are endless vocabulary lists on the Internet, and on my blog for that matter, which a student of English can download. Lists of collocations for a particular subject area are much harder to find, though. The other day I came across the Multilingual Corpus Toolkit (MLCT), a program that automatically extracts commonly occurring collocations from a file or text corpus. MLCT is not public domain and is intended for non-commercial use only.

To run the program you need the Java Runtime Environment (JRE) installed on your computer. After downloading and unzipping the files, double-click the file ‘run_mlct_public’ to start the application. Once it has started, select “LexTools->Collocation Parameters->Limit Number of Tokens” from the menu. Extracting collocations from a text file is then quite easy: just choose one of the options from the “LexTools->Extract n-grams” menu. You will be prompted for the file you want to analyse, and the results will appear in a separate window.

For those of you prepared to put in rather more effort, you can download the Natural Language Toolkit (NLTK) and read the excellent book “Natural Language Processing with Python”. The tutorials in this book take you through collocation extraction and much, much more. To try out all the examples you will need to install Python, PyYAML and NLTK. Once you have done this, just follow the instructions in the book and run the examples in PyScripter, which you can download here. To check that everything is working, paste the code listed below into the PyScripter editor window and select run. You should see the following output:

[('Beer', 'Lahai'), ('Lahai', 'Roi'), ('gray', 'hairs'), ('Most', 'High'), ('ewe', 'lambs'), ('many', 'colors'), ('burnt', 'offering'), ('Paddan', 'Aram'), ('east', 'wind'), ('living', 'creature')]

Code Listing

import nltk
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# change this to read in your own data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# keep only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# print the 10 bigrams with the highest PMI score
print(finder.nbest(bigram_measures.pmi, 10))

