Climate Clustering with AutoML

December 18, 2020

Series: msc

As a researcher exploring new ways to analyze climate data, I recently tested out some advanced machine learning techniques on a dataset of weather measurements from Switzerland. My goal was to see if I could uncover hidden patterns and groupings within the data, without relying on human-provided labels. Here are the key things I tried:

Preparing the Data

First, I cleaned the raw data (over 1700 observations across 10 years) to handle issues such as missing values. I also scaled the different weather measurements (temperature, precipitation, etc.) onto a common standardized scale. This normalization matters for variance- and distance-based methods such as PCA and k-means, which would otherwise be dominated by whichever measurement has the largest numeric range.
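
A minimal sketch of this kind of preparation step, assuming median imputation and z-score standardization with scikit-learn. The post does not say which imputation or scaling method was actually used, and the synthetic columns below are hypothetical stand-ins for the real measurements.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weather table: rows are observations,
# columns are measurements on very different scales (hypothetical values).
rng = np.random.default_rng(0)
X_raw = np.column_stack([
    rng.normal(10.0, 8.0, size=200),    # temperature-like column
    rng.gamma(2.0, 20.0, size=200),     # precipitation-like column
    rng.normal(950.0, 10.0, size=200),  # pressure-like column
])
# Knock out a few precipitation readings to mimic missing values.
X_raw[rng.integers(0, 200, size=15), 1] = np.nan

# Assumption: fill gaps with the column median, then rescale every
# column to zero mean and unit variance.
X_filled = SimpleImputer(strategy="median").fit_transform(X_raw)
X_scaled = StandardScaler().fit_transform(X_filled)

print(X_scaled.mean(axis=0).round(3))
print(X_scaled.std(axis=0).round(3))
```

After scaling, every column has mean close to 0 and standard deviation close to 1, so no single measurement dominates later steps simply because of its units.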

Reducing Dimensions

Next, I used a custom algorithm based on L1-regularized PCA to reduce the number of dimensions from 14 to 6. Projecting the data into a lower-dimensional space surfaces the components that carry most of the information while filtering out noise. Compared with standard PCA, my variant aims to reduce the influence of outliers.
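
The custom algorithm itself is not reproduced in this post. As a rough illustration of the 14-to-6 reduction, the sketch below swaps in scikit-learn's `SparsePCA`, which puts an L1 penalty on the component loadings. This is a stand-in, not the method from the report, and it does not by itself provide the outlier handling described above. The synthetic data and the `alpha` value are assumptions.

```python
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 300 observations of 14 features that are really
# driven by 6 hidden factors plus a little noise (hypothetical setup).
latent = rng.normal(size=(300, 6))
mixing = rng.normal(size=(6, 14))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 14))
X_scaled = StandardScaler().fit_transform(X)

# Stand-in for the custom method: PCA-like decomposition with an L1
# penalty (alpha) on the loadings. alpha=1.0 is an arbitrary choice.
spca = SparsePCA(n_components=6, alpha=1.0, random_state=0)
X_reduced = spca.fit_transform(X_scaled)

print(X_reduced.shape)         # observations x reduced dimensions
print(spca.components_.shape)  # components x original features
```

Larger `alpha` values push more loadings to exactly zero, which trades some reconstruction accuracy for components that are easier to interpret.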

Clustering

Here’s where the unsupervised learning comes in! With no category labels to guide them, clustering algorithms such as k-means and BIRCH look for intrinsic groups in the climate data based purely on similarity. After evaluation, k-means grouped the observations into 3 clusters, while BIRCH identified 5 clusters in a subset. Visually checking the cluster distributions showed promising separation.
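
To make the clustering step concrete, here is a small sketch using scikit-learn's `KMeans` and `Birch` on synthetic 6-dimensional data. The cluster counts mirror the ones reported above, but the data, the silhouette score as the evaluation measure, and all parameter values are assumptions; the post does not state how the clusterings were evaluated.

```python
from sklearn.cluster import Birch, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the reduced climate data (6 dimensions).
X, _ = make_blobs(n_samples=500, n_features=6, centers=4, random_state=0)

# k-means with 3 clusters, as in the post.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# BIRCH asked for 5 final clusters, as in the post.
birch_labels = Birch(n_clusters=5).fit_predict(X)

# Silhouette score lies in [-1, 1]; higher means tighter,
# better-separated clusters. Used here as one possible check.
kmeans_score = silhouette_score(X, kmeans_labels)
birch_score = silhouette_score(X, birch_labels)

print(f"k-means silhouette: {kmeans_score:.3f}")
print(f"BIRCH silhouette:   {birch_score:.3f}")
```

Neither algorithm sees the true blob labels; both group points purely by how close they are in feature space.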

Validating a Classifier

As a final test, I trained a supervised classifier with the cluster assignments as proxy labels, to check that the clusters were internally consistent. My SVM model scored reasonably well, correctly classifying most observations into their assigned groups. The classifier’s confusion matrix also showed where the clusters overlap.
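
The validation idea can be sketched as follows: cluster the data, treat the cluster assignments as labels, and see how well a held-out SVM recovers them. The kernel, the train/test split, and the synthetic data are assumptions, not details from the report.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with some overlap between groups.
X, _ = make_blobs(n_samples=600, n_features=6, centers=3,
                  cluster_std=2.0, random_state=0)

# Cluster assignments act as proxy labels (no human labels involved).
proxy_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hold out a quarter of the observations to test the classifier.
X_train, X_test, y_train, y_test = train_test_split(
    X, proxy_labels, test_size=0.25, stratify=proxy_labels, random_state=0
)

# Assumption: an RBF-kernel SVM with default-style settings.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Rows are proxy labels, columns are predictions; off-diagonal
# entries show which clusters the classifier confuses.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])

print(f"accuracy: {accuracy:.3f}")
print(cm)
```

A high score here only says the clusters are learnable regions of feature space; it does not show that they correspond to meaningful climate regimes.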

Final Thoughts

In the end, the unsupervised learning pipeline showed encouraging ability to detect patterns in the climate data without human supervision. Going forward, I’m excited to refine these methods further and apply them to larger meteorological datasets. The better we understand Earth’s intricate weather machinery, the better we can model and predict its behavior.

To see more details, check out the paper

Disclaimer: This project was completed as part of my MSc in Data Science at Lancaster University. This blog post is LLM-generated text based on the hand-written report.

Disclaimer 2: This was my first introduction to ML.
