As a data scientist exploring natural language processing, I’m always intrigued by new techniques for detecting language in text. I recently tested out some character-based n-gram models on a corpus of sentences in English, Dutch, and Igbo. Language detection is an important first step before analyzing the meaning of text, so improving these models could have far-reaching benefits.
I started with a basic approach of generating n-gram profiles for each language - essentially counting up the character sequences and using their frequencies as a “fingerprint” for that language.
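A minimal sketch of such a profile in Python, in the spirit of the classic Cavnar and Trenkle character n-gram method (the function names and the simple frequency-difference distance here are my own illustration, not code from the original experiment):

```python
from collections import Counter

def ngram_profile(text, n=3, top_k=300):
    """Build a character n-gram frequency profile for a text sample."""
    text = text.lower()
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    # Keep the most frequent n-grams as the language "fingerprint".
    return {gram: c / total for gram, c in counts.most_common(top_k)}

def profile_distance(p1, p2):
    """Sum of absolute frequency differences over the union of n-grams."""
    grams = set(p1) | set(p2)
    return sum(abs(p1.get(g, 0.0) - p2.get(g, 0.0)) for g in grams)
```

To classify a sentence, you would build one profile per language from training text, profile the sentence the same way, and pick the language whose profile is closest.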
LiDAR (Light Detection and Ranging) is rapidly emerging as a preferred sensing modality for autonomous vehicles and infrastructure mapping. By firing laser pulses and analyzing reflected light, LiDAR scanners produce intricate 3D point clouds depicting roadways and surroundings in unprecedented detail. As this remote sensing technology becomes more ubiquitous, the bottleneck is now interpreting massive volumes of 3D data. This is where artificial intelligence comes in.
In this post, I’ll highlight promising machine learning techniques that could soon automate detection and evaluation of key road attributes for smarter cars and safer infrastructure.
Introduction to Deep Learning
I recently explored using deep learning models for time series classification. But before getting into the details, let me provide some background on deep learning. Deep learning is a subset of machine learning that uses neural networks modeled after the human brain to learn from large amounts of data. Neural networks have an input layer, hidden layers, and an output layer. By adjusting the weights between layers during training, neural nets can recognize complex patterns and make predictions.
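The layer structure described above can be sketched as a tiny feed-forward pass in NumPy (a toy illustration with made-up sizes, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny feed-forward network: 4 inputs -> 8 hidden units -> 3 output classes.
W1 = rng.normal(size=(4, 8)) * 0.1   # input-to-hidden weights
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 3)) * 0.1   # hidden-to-output weights
b2 = np.zeros(3)

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU activation in the hidden layer
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())    # softmax turns scores into probabilities
    return exp / exp.sum()

probs = forward(rng.normal(size=4))
```

Training consists of nudging `W1`, `b1`, `W2`, and `b2` to reduce a loss on labeled examples; frameworks like PyTorch or TensorFlow automate that weight adjustment.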
Validation: Credly
Post Exam Thoughts
As a machine learning engineer, I found that passing the AWS Certified Cloud Practitioner exam taught me useful lessons about using cloud services for ML workflows. I wanted to share the key takeaways most relevant to ML engineers.
The exam covers a wide range of AWS services. For ML, access to GPU compute is critical for training models; instance types like P3 and G4 provide this capability. Services like SageMaker simplify the path from notebooks to training clusters.
As a researcher exploring new ways to analyze climate data, I recently tested out some advanced machine learning techniques on a dataset of weather measurements from Switzerland. My goal was to see if I could uncover hidden patterns and groupings within the data, without relying on human-provided labels. Here are the key things I tried:
Preparing the Data
First, I cleaned up the raw data - over 1,700 observations across 10 years - to handle issues like missing values.
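That cleaning step might look like the pandas sketch below. Note the column names and fill strategy are my own assumptions for illustration; the real Swiss weather dataset will differ:

```python
import numpy as np
import pandas as pd

# Hypothetical weather columns standing in for the real dataset.
df = pd.DataFrame({
    "temperature": [2.1, np.nan, 4.3, 3.8],
    "humidity": [80, 75, np.nan, 90],
    "station": ["ZRH", "ZRH", "GVA", None],
})

# Drop rows missing the categorical key, then fill numeric gaps with the
# column median so later unsupervised analysis isn't skewed by NaNs.
clean = df.dropna(subset=["station"]).copy()
for col in ["temperature", "humidity"]:
    clean[col] = clean[col].fillna(clean[col].median())
```

Median imputation is a deliberately simple choice here; for time series data, interpolation between neighboring observations is often a better fit.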
Efficient train scheduling has long relied on operations research and optimization algorithms. However, the increasing complexity of multi-train networks has rendered traditional mathematical programming inadequate. Recent advances in artificial intelligence and machine learning, combined with open access to comprehensive railway operations data, offer promising new techniques.
Goal: Use the feeds to create training datasets that can be fed into downstream learning algorithms.
Network Rail provides a number of operational data feeds that are available to anyone - the only requirement is registering.
Refining My Audio Noise Removal Algorithm
In my last post, I discussed developing an algorithm to eliminate background noise from an audio clip containing speech. It worked by transforming the clip into the frequency domain, zeroing out high-frequency components above a set cutoff, then converting back into the time domain. This effectively filtered out unwanted noise while retaining the speech.
The core techniques used are the Discrete Fourier Transform (DFT) and its inverse (IDFT).
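That pipeline can be sketched with NumPy's FFT routines (the function name, cutoff value, and synthetic signal below are illustrative, not the exact code from the original post):

```python
import numpy as np

def lowpass_filter(signal, sample_rate, cutoff_hz):
    """Zero out frequency components above cutoff_hz and invert the DFT."""
    spectrum = np.fft.rfft(signal)                      # DFT of a real signal
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0                   # remove high frequencies
    return np.fft.irfft(spectrum, n=len(signal))        # inverse DFT (IDFT)

# Example: a 50 Hz "speech-band" tone plus a 4000 Hz "noise" tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 50 * t)
noisy = tone + 0.5 * np.sin(2 * np.pi * 4000 * t)
filtered = lowpass_filter(noisy, sr, cutoff_hz=1000)
```

Because the noise here lives entirely above the cutoff, the filtered signal recovers the low-frequency tone almost exactly; real recordings are messier, which motivates the refinements discussed in this post.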
In my recent project, I was tasked with developing an algorithm to detect speech in an audio file and eliminate any background noise or silence. The goal was to take an audio clip containing noise, silence, and a spoken phrase, and isolate just the speech. This is useful for many speech processing applications where you only want to analyze the speech portions.
The core challenge was that simple noise filters would not work here - the noise and speech overlap in time.
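The post doesn't show its detection code at this point, but one common baseline for separating speech from silence is short-time energy thresholding. The sketch below (frame size, threshold, and function name are my own assumptions) frames the signal and gates frames by energy; it illustrates how the problem is framed, even though, as noted above, a plain energy gate struggles once noise and speech overlap:

```python
import numpy as np

def speech_frames(signal, frame_len=512, threshold_ratio=0.1):
    """Flag frames whose short-time energy exceeds a fraction of the peak.

    A crude energy gate: real recordings need smoothing and a noise-adaptive
    threshold, but this shows the frame-by-frame structure of the task.
    """
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold_ratio * energy.max()

# Synthetic clip: silence, then a loud "speech" burst, then silence again.
sig = np.concatenate([np.zeros(2048),
                      0.8 * np.sin(np.linspace(0, 200 * np.pi, 2048)),
                      np.zeros(2048)])
mask = speech_frames(sig)
```

On this toy clip the mask cleanly marks the middle frames as speech; with overlapping noise, the silent frames' energy rises toward the speech frames' energy and a fixed ratio no longer separates them.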