Spelling correction with Enchant
Replacing repeating characters is actually an extreme form of spelling correction. In this recipe, we will take on the less extreme case of correcting minor spelling issues using Enchant—a spelling correction API.
Getting ready
You will need to install Enchant and a dictionary for it to use. Enchant is an offshoot of the AbiWord open source word processor, and more information on it can be found at http://www.abisource.com/projects/enchant/.
For dictionaries, Aspell is a good open source spellchecker and dictionary that can be found at http://aspell.net/.
Finally, you will need the PyEnchant library, which can be found at the following link: http://pythonhosted.org/pyenchant/
You should be able to install it with the easy_install command that comes with Python setuptools, such as by typing sudo easy_install pyenchant on Linux or Unix. With current Python tooling, pip install pyenchant is the usual equivalent. On a Mac machine, PyEnchant may be difficult to install. If you have difficulties, consult http://pythonhosted.org/pyenchant/download.html.
How to do it...
We will create a new class called SpellingReplacer in replacers.py, and this time, the replace() method will check Enchant to see whether the word is valid. If not, we will look up the suggested alternatives and return the first suggestion, provided it is close enough to the original word according to nltk.metrics.edit_distance():
import enchant
from nltk.metrics import edit_distance

class SpellingReplacer(object):
    def __init__(self, dict_name='en', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        # Correctly spelled words are returned unchanged.
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)
        # Accept the first suggestion only if it is close enough to the input.
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word
The preceding class can be used to correct English spellings, as follows:
>>> from replacers import SpellingReplacer
>>> replacer = SpellingReplacer()
>>> replacer.replace('cookbok')
'cookbook'
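In practice you usually correct a whole list of tokens rather than a single word. The following is a minimal sketch, not part of the original recipe: correct_tokens is a hypothetical helper that works with any object exposing a replace(word) method, such as SpellingReplacer. It leaves non-alphabetic tokens alone, on the assumption that punctuation and numbers should not be spell-checked. LookupReplacer is only a stand-in so that the example runs without Enchant installed.

```python
def correct_tokens(tokens, replacer):
    """Return a new list with each alphabetic token passed through replacer.replace().

    `replacer` is any object with a replace(word) method, e.g. SpellingReplacer.
    Non-alphabetic tokens (punctuation, numbers) are copied unchanged; that
    filter is an assumption of this sketch, not something Enchant requires.
    """
    return [replacer.replace(tok) if tok.isalpha() else tok for tok in tokens]


class LookupReplacer(object):
    """Hypothetical stand-in for SpellingReplacer, backed by a fixed mapping."""

    def __init__(self, corrections):
        self.corrections = corrections

    def replace(self, word):
        # Return the known correction, or the word itself if there is none.
        return self.corrections.get(word, word)


demo = LookupReplacer({'cookbok': 'cookbook'})
print(correct_tokens(['a', 'cookbok', ',', '2', 'recipes'], demo))
```

With a real SpellingReplacer instance in place of demo, the same call would route every alphabetic token through Enchant.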
How it works...
The SpellingReplacer class starts by creating a reference to an Enchant dictionary. Then, in the replace() method, it first checks whether the given word is present in the dictionary. If it is, no spelling correction is necessary and the word is returned. If the word is not found, it looks up a list of suggestions and returns the first suggestion, as long as its edit distance is less than or equal to max_dist. The edit distance is the number of character changes necessary to transform the given word into the suggested word. The max_dist value then acts as a constraint on the Enchant suggest function to ensure that no unlikely replacement words are returned. Here is an example showing all the suggestions for languege, a misspelling of language:
>>> import enchant
>>> d = enchant.Dict('en')
>>> d.suggest('languege')
['language', 'languages', 'languor', "language's"]
Except for the correct suggestion, language, every other word is at an edit distance of two or greater from languege. You can check distances like these yourself with the following code:
>>> from nltk.metrics import edit_distance
>>> edit_distance('languege', 'language')
1
>>> edit_distance('languege', 'languor')
3
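To make the edit distance concrete, here is a minimal pure-Python sketch of the classic dynamic-programming Levenshtein distance, applied to the suggestions returned for languege. It is an illustration only: levenshtein is a hypothetical helper, not NLTK's implementation, although for plain insertions, deletions, and substitutions it should agree with nltk.metrics.edit_distance() called with its default arguments.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    # previous[j] is the distance between the first i-1 characters of a
    # and the first j characters of b (row i-1 of the usual DP table).
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into '' takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# The suggestion list is copied from the Enchant session in this recipe.
suggestions = ['language', 'languages', 'languor', "language's"]
distances = {s: levenshtein('languege', s) for s in suggestions}
for word, dist in distances.items():
    print(word, dist)
```

Running it prints each suggestion next to its distance from languege, which makes it easy to see which candidates a given max_dist would admit.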
There's more...
You can use language dictionaries other than en, such as en_GB, assuming the dictionary has already been installed. To check which other languages are available, use enchant.list_languages():
>>> enchant.list_languages()
['en', 'en_CA', 'en_GB', 'en_US']
Tip
If you try to use a dictionary that doesn't exist, you will get enchant.DictNotFoundError. You can first check whether the dictionary exists using enchant.dict_exists(), which will return True if the named dictionary exists, or False otherwise.
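The check described in this tip can be wrapped in a small helper. The sketch below is only an illustration: open_dict and its fallback and backend parameters are hypothetical names, and the enchant module is imported lazily (or passed in as backend) so that the function can be defined and exercised without Enchant installed. It relies only on enchant.dict_exists() and enchant.Dict(), both mentioned in this recipe.

```python
def open_dict(name, fallback='en', backend=None):
    """Return backend.Dict(name) if that dictionary exists, else the fallback one.

    `backend` defaults to the real enchant module; passing another object
    with dict_exists() and Dict() is a testing convenience of this sketch.
    """
    if backend is None:
        import enchant  # imported here so this module loads without Enchant
        backend = enchant
    if backend.dict_exists(name):
        return backend.Dict(name)
    if backend.dict_exists(fallback):
        return backend.Dict(fallback)
    # Neither dictionary is installed; fail with a clear message.
    raise ValueError('no dictionary found for %r or %r' % (name, fallback))
```

With Enchant installed, open_dict('en_GB') would return the British dictionary when it is available and otherwise fall back to en, instead of raising enchant.DictNotFoundError.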
The en_GB dictionary
Always ensure that you use the correct dictionary for whichever language you are performing spelling correction on. The en_US dictionary can give you different results than en_GB, such as for the word theater. The word theater is the American English spelling whereas the British English spelling is theatre:
>>> import enchant
>>> dUS = enchant.Dict('en_US')
>>> dUS.check('theater')
True
>>> dGB = enchant.Dict('en_GB')
>>> dGB.check('theater')
False
>>> from replacers import SpellingReplacer
>>> us_replacer = SpellingReplacer('en_US')
>>> us_replacer.replace('theater')
'theater'
>>> gb_replacer = SpellingReplacer('en_GB')
>>> gb_replacer.replace('theater')
'theatre'
Personal word lists
Enchant also supports personal word lists. These can be combined with an existing dictionary, allowing you to augment the dictionary with your own words. So, let's say you had a file named mywords.txt that had nltk on one line. You could then create a dictionary augmented with your personal word list as follows:
>>> d = enchant.Dict('en_US')
>>> d.check('nltk')
False
>>> d = enchant.DictWithPWL('en_US', 'mywords.txt')
>>> d.check('nltk')
True
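Because a personal word list is a plain text file with one word per line, it can also be generated from code. The following sketch shows one way to do that. write_word_list is a hypothetical helper, the second word in the list is just an example, and the commented-out lines at the end assume that Enchant and the en_US dictionary are installed.

```python
import os
import tempfile


def write_word_list(words, path):
    """Write each word on its own line, the layout described for mywords.txt."""
    with open(path, 'w', encoding='utf-8') as handle:
        for word in words:
            handle.write(word.strip() + '\n')
    return path


# Write to a temporary directory so the example does not touch real files.
pwl_path = write_word_list(['nltk', 'pyenchant'],
                           os.path.join(tempfile.mkdtemp(), 'mywords.txt'))

# With Enchant installed, the file can then be passed to DictWithPWL:
# import enchant
# d = enchant.DictWithPWL('en_US', pwl_path)
# d.check('nltk')
```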
To use an augmented dictionary with our SpellingReplacer class, we can create a subclass in replacers.py that takes an existing spelling dictionary:
class CustomSpellingReplacer(SpellingReplacer):
    def __init__(self, spell_dict, max_dist=2):
        self.spell_dict = spell_dict
        self.max_dist = max_dist
This CustomSpellingReplacer class will not replace any words that you put into mywords.txt:
>>> import enchant
>>> from replacers import CustomSpellingReplacer
>>> d = enchant.DictWithPWL('en_US', 'mywords.txt')
>>> replacer = CustomSpellingReplacer(d)
>>> replacer.replace('nltk')
'nltk'
See also
The previous recipe covered an extreme form of spelling correction by replacing repeating characters. You can also perform spelling correction by simple word replacement as discussed in the next recipe.