As a data scientist exploring natural language processing, I’m always intrigued by new techniques for detecting language in text. I recently tested out some character-based n-gram models on a corpus of sentences in English, Dutch, and Igbo. Language detection is an important first step before analyzing the meaning of text, so improving these models could have far-reaching benefits.
I started with a basic approach of generating n-gram profiles for each language - essentially counting up the character sequences and using their frequencies as a “fingerprint” for that language. I then calculated similarity scores between new sentences and these profiles to determine the most likely language match. This worked reasonably well, but struggled with very short sentences that offer little linguistic context.
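The profile idea can be sketched with the classic Cavnar–Trenkle “out-of-place” measure: rank each language’s most frequent character n-grams, then score a new sentence by how far its n-gram ranks drift from each profile. The tiny corpora, function names, and `top_k` cutoff below are illustrative assumptions, not the exact code from the project:

```python
from collections import Counter

def ngram_profile(text, n=3, top_k=300):
    """Rank the most frequent character n-grams as a language 'fingerprint'."""
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top_k))}

def out_of_place_distance(query, profile, max_rank=300):
    """Sum of rank differences; n-grams missing from the profile get max_rank."""
    return sum(abs(rank - profile.get(gram, max_rank))
               for gram, rank in query.items())

# Toy training corpora, purely for illustration.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog " * 5),
    "nl": ngram_profile("de snelle bruine vos springt over de luie hond " * 5),
}

def detect(sentence):
    """Return the language whose profile is closest to the sentence."""
    query = ngram_profile(sentence)
    return min(profiles, key=lambda lang: out_of_place_distance(query, profiles[lang]))
```

With real corpora the profiles would of course be built from thousands of sentences per language, but the distance calculation stays the same.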
To address this, I implemented a more robust Bayesian model called Rekishikon that supports up to 55 languages, with training data drawn from Wikipedia. By calculating posterior probabilities for each language, it handles outliers and irregular sentences better. However, with more languages in the mix, some misclassifications crept in.
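The Bayesian approach can be illustrated with a generic naive Bayes classifier over character n-grams; this is a minimal sketch of the general technique, not Rekishikon’s actual implementation, and the corpora and smoothing choices are assumptions:

```python
import math
from collections import Counter

def train(corpus_by_lang, n=3):
    """Collect per-language character n-gram counts and the shared vocabulary size."""
    counts, vocab = {}, set()
    for lang, texts in corpus_by_lang.items():
        c = Counter()
        for t in texts:
            t = t.lower()
            c.update(t[i:i + n] for i in range(len(t) - n + 1))
        counts[lang] = c
        vocab |= set(c)
    return counts, len(vocab)

def classify(sentence, counts, vocab_size, n=3):
    """Pick the language with the highest log-posterior, assuming a uniform
    prior and add-one smoothing so unseen n-grams never zero out a language."""
    s = sentence.lower()
    grams = [s[i:i + n] for i in range(len(s) - n + 1)]

    def log_posterior(c):
        total = sum(c.values())
        return sum(math.log((c.get(g, 0) + 1) / (total + vocab_size)) for g in grams)

    return max(counts, key=lambda lang: log_posterior(counts[lang]))
```

Because scores are summed in log space, even a short sentence contributes evidence from every n-gram, which is what makes the posterior formulation more forgiving of irregular input than a hard profile match.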
As with any machine learning endeavor, data cleaning and processing was critical. Removing very short sentences with little linguistic content improved accuracy substantially, since they act as unpredictable noise during training. Expanding to further languages down the road may require even more stringent filtering.
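A filter of this kind can be as simple as a length and alphabetic-content check; the `min_chars` threshold here is an assumed value for illustration, not the one used in the project:

```python
def clean_corpus(sentences, min_chars=20):
    """Drop fragments that are too short, or that contain no letters at all,
    since they carry no usable linguistic signal for training."""
    return [
        s.strip() for s in sentences
        if len(s.strip()) >= min_chars and any(ch.isalpha() for ch in s)
    ]
```

In practice the threshold is a trade-off: too low and noise leaks in, too high and legitimately terse sentences are discarded, so it is worth tuning per corpus.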
Overall, Rekishikon achieved 97-98% accuracy on my test corpus, a solid improvement over my basic n-gram approach! But some challenges remain, particularly efficiently supporting models with 50+ languages. My next step is to experiment with modifications such as utilizing graph theory to minimize computational overhead. Language detection fascinates me because it touches on core questions of how we communicate and interpret meaning. I’m excited to keep refining my NLP techniques on this forever-intriguing task!
For more details, check out the paper.
Disclaimer: This project was completed as part of my MSc in Data Science at Lancaster University and was supervised by Dr. Paul Rayson. This blog post is LLM-generated text, based upon the hand-written report.
Credit to the original implementation on which Rekishikon is based.