Introduction
Have you ever wondered how your phone automatically corrects words when you make a typo? In this post, I'll show you how to build a simple autocorrect system! It will be a basic probabilistic model that, despite its simplicity, performs quite well. Keep in mind that companies like Google and Apple have their own autocorrection systems, which can be far more advanced and use attention-based models (see "Attention Is All You Need") to correct words more accurately depending on the context.
How the Algorithm Works
- Find the misspelled word.
- Find the words that are most similar to the input.
- Select the most probable word based on a heuristic.
How do we know if a word is spelled correctly? Let’s choose the most intuitive approach – one we can quickly arrive at with a bit of reasoning. We simply check whether the entered word exists in a dictionary, which by definition is a set of valid words in a given language.
In the next step, we need to determine which words in the dictionary are the most likely alternatives to what the user intended to write. To do this, we will use the Levenshtein distance, which is the minimum number of single-character insertions, deletions, and substitutions required to transform one word into another. For example, turning "kot" into "koc" takes a single substitution, so the distance between them is 1.
Finally, we will sort the candidate words by the number of required operations (in ascending order), and among those with the fewest changes, we will select the word(s) that appear most frequently in a language corpus.
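Putting these three steps together, here is a minimal sketch of the whole pipeline (the names levenshtein_distance and word_probabilities are placeholders for the pieces we will build in the rest of the post):
def correct(word, dictionary, word_probabilities):
    # Step 1: the word is already in the dictionary, nothing to fix
    if word in dictionary:
        return word
    # Step 2: collect dictionary words within a small edit distance of the input
    candidates = [w for w in dictionary if levenshtein_distance(word, w) <= 2]
    if not candidates:
        return word
    # Step 3: among the closest candidates, pick the most frequent one in the corpus
    best = min(levenshtein_distance(word, w) for w in candidates)
    closest = [w for w in candidates if levenshtein_distance(word, w) == best]
    return max(closest, key=lambda w: word_probabilities.get(w, 0))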
Preprocessing the Data
We will need a dictionary and a list of probabilities indicating how likely each word is to occur in text. To build both, we will use a large text file containing content written in Polish.
It’s important to ensure that the words in this file are correctly spelled — otherwise, our system might learn incorrect spellings and start "correcting" valid words into invalid ones.
Next, we will clean the text by removing unwanted characters such as punctuation marks and by converting everything to lowercase. (This may cause some issues with proper nouns, which are usually capitalized, but it is a simple way to reduce the dictionary size.)
Below is the Python code that performs this preprocessing:
import string

# Load data from a text file
with open('polish_text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Remove all punctuation marks and typographic characters
text = text.translate(str.maketrans('', '', string.punctuation + "—…“”‘’„"))

# Convert all letters to lowercase
text = text.lower()

# Split text into words
words = text.split()

# Create a dictionary of unique words
unique_words = set(words)

# Count each word's occurrences
word_count = {}
for word in words:
    if word in word_count:
        word_count[word] += 1
    else:
        word_count[word] = 1

# Compute the probability of each word's occurrence
total_words = len(words)
word_probabilities = {}
for word, count in word_count.items():
    word_probabilities[word] = count / total_words

# Group words by their length
words_by_length = {}
for word in unique_words:
    length = len(word)
    if length not in words_by_length:
        words_by_length[length] = []
    words_by_length[length].append(word)
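As a quick, purely illustrative sanity check, we can print the most frequent words to make sure the probabilities look sensible — in a typical Polish corpus, short function words such as "się", "w", or "i" should dominate:
# Ten most frequent words and their probabilities (illustrative check, not part of the pipeline)
top_words = sorted(word_probabilities.items(), key=lambda item: item[1], reverse=True)[:10]
for word, probability in top_words:
    print(f"{word}: {probability:.4f}")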
Implementation
Let’s implement the Levenshtein algorithm, which calculates the minimum number of operations required to transform one word into another.
This algorithm uses dynamic programming to break down the main problem — "find the minimum number of operations to convert s1 to s2" — into smaller subproblems: specifically, "find the minimum number of operations to convert s1[:i] to s2[:j]".
In each iteration, we will use values that were previously computed and stored in a table.
In a future post, I'll go into more detail about dynamic programming, but for now, it's important to understand that it's not a programming paradigm (like object-oriented or functional programming), but rather a technique for solving problems efficiently by reusing the solutions to smaller, overlapping subproblems.
import numpy as np

def levenshtein_distance(s1, s2):
    # s1 = source, s2 = target
    n = len(s1) + 1  # add an empty character at the beginning of s1
    m = len(s2) + 1  # add an empty character at the beginning of s2
    matrix = np.zeros((n, m), dtype=int)
    # The two loops below fill the first row and column so that we don't have to check bounds in the main loop.
    # The number of operations needed to convert s1[:i] to an empty string (i.e., deletions)
    for i in range(n):
        matrix[i][0] = i
    # The number of operations needed to convert an empty string to s2[:j] (i.e., insertions)
    for j in range(m):
        matrix[0][j] = j
    # Compute the Levenshtein distance
    # In each cell we consider the match case and the three edit operations
    for i in range(1, n):
        for j in range(1, m):
            if s1[i - 1] == s2[j - 1]:
                matrix[i][j] = matrix[i - 1][j - 1]  # Characters are the same, no operation needed
            else:
                matrix[i][j] = min(
                    matrix[i - 1][j] + 1,      # Deletion
                    matrix[i][j - 1] + 1,      # Insertion
                    matrix[i - 1][j - 1] + 1,  # Substitution
                )
    return matrix[n - 1][m - 1]
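A couple of illustrative calls (example words chosen by hand, not taken from the corpus):
# "kot" -> "kto": no single edit fixes both differing positions, so the distance is 2
print(levenshtein_distance("kot", "kto"))            # 2
# "zupelnie" -> "zupełnie": a single substitution (l -> ł), so the distance is 1
print(levenshtein_distance("zupelnie", "zupełnie"))  # 1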
With this set of tools, we can implement the autocorrect system following the steps outlined at the beginning of the post.
def autocorrect(word, edit_distance=3):
    # Preprocess the input word the same way as the corpus
    word = word.strip().lower()
    word = word.translate(str.maketrans('', '', string.punctuation + "—…“”‘’„"))  # Remove punctuation and special characters
    if not word:
        return word
    # Check if there is a mistake in the word
    if word in unique_words:
        return word  # Word is correct, do nothing
    # Otherwise, correct the word
    # Consider candidate words of the same length
    # (can be adjusted to include more lengths, e.g., [len(word) - 1, len(word), len(word) + 1])
    suggested_words_length = [len(word)]
    # Find similar words
    similar_words = []
    for length in suggested_words_length:
        for candidate in words_by_length.get(length, []):
            # Check if the candidate word is similar enough and starts with the same letter
            if levenshtein_distance(word, candidate) <= edit_distance and candidate.startswith(word[0]):
                similar_words.append(candidate)
    # If no similar words are found, return the original word (it might be correct but not in the dictionary)
    if not similar_words:
        return word
    # Otherwise, select the most probable correction among the closest words
    # Sort by Levenshtein distance (smallest first)
    similar_words.sort(key=lambda x: levenshtein_distance(word, x))
    # Keep only the words with the minimum Levenshtein distance
    min_distance = levenshtein_distance(word, similar_words[0])
    closest_words = [w for w in similar_words if levenshtein_distance(word, w) == min_distance]
    # Sort by probability (highest first)
    closest_words.sort(key=lambda x: word_probabilities.get(x, 0), reverse=True)
    # Return the most probable word
    return closest_words[0]
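For example (the exact result depends on the corpus you loaded, so treat this as an illustration rather than a guaranteed output):
# A Polish word typed without diacritics
print(autocorrect("ksiaze"))  # expected: "książę", provided the word appears in the corpus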
Testing
Let's test how well our autocorrect system performs. The following words all appear in the source text:
Mały Książę miał zupełnie Dorośli są naprawdę róża zagrożona rychłym wyginięciem radziłby dać wyglądało mógłbym wąż Ziemię cię że jesteście skały się
Input data:
maly ksiaze mial zupelnie dorosli sa naprawde razy zagrozona rychlym wygninieciem radzilby dac wygladalo moglbym was ziemie cie ze jestescie skaly sie
Output data:
mały książę miał zupełnie dorośli są naprawdę razy zagrożona rychłym wyginięciem radziłby dać wyglądało mógłbym was ziemię cię ze jesteście skały się
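The output above can be reproduced with a short loop along these lines (a sketch; the original test harness isn't shown in this post):
input_data = ("maly ksiaze mial zupelnie dorosli sa naprawde razy zagrozona rychlym "
              "wygninieciem radzilby dac wygladalo moglbym was ziemie cie ze jestescie skaly sie")
corrected = [autocorrect(w) for w in input_data.split()]
print(" ".join(corrected))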
This is a satisfying initial result, but I encourage you to explore how the system performs in other cases!
You can find the full code here: Simple autocorrect.