Machine Learning & Compound Hazards

I developed a Random Forest model to predict landslide events in California, using openly available rainfall and wildfire perimeter datasets. The work came out of the Stanford Wildfire Hackathon, where my teammates were Dylan Win and Shinnosuke Yagi.

Background & Model Design

Debris flows are governed by complex physical dynamics that are difficult to model numerically. Their triggering conditions, however, tend to be driven by a much smaller set of observable variables, such as rainfall intensity and land surface changes, which makes them well suited to data-driven classification. This project explored whether machine learning can improve landslide prediction in California by treating it as a compound hazard problem: we combined precipitation amount and wildfire burn history as input features to test whether burn history can serve as a proxy for key changes in surface and stability conditions. Our approach was inspired by Tsai et al. (2022), who demonstrated that a Random Forest classifier trained on hourly rainfall data alone could outperform Taiwan's existing debris flow warning system, which relies on a single time-weighted rainfall threshold.

Project poster summarizing model methods and results, presented at the Stanford Wildfire Hackathon expo 2023.

Implementation & Conclusions

The machine learning pipeline was built in Python using scikit-learn, and the geospatial analysis was performed in QGIS. We assembled a positive dataset from the NASA Global Landslide Catalog (post-2007 events), matched each entry to its nearest weather station in the California Irrigation Management Information System (CIMIS) to obtain same-day and prior-day precipitation, and overlaid wildfire history from California Open Data's fire perimeter polygons. To train the Random Forest classifier, we generated a synthetic negative dataset with varied precipitation and burn history.

Feature importance analysis showed that same-day precipitation dominated at 61%, followed by prior-day rainfall at 26%, while wildfire history contributed only about 12%. This suggests that, at least at daily temporal resolution, rainfall is the most prominent signal, despite the complex relationship between soil stability and wildfire history. A key limitation of our work was the absence of hourly rainfall data, which Tsai et al. (2022) identified as critical; we had to omit it because very few weather stations had consistent hourly records. Despite this, our model correctly flagged 64 of 74 landslide events (about 86%) during the January–March 2023 rainy season, without evidence of overfitting.

The project was a team effort: I was the project lead and managed the machine learning approach, while two other students carried out the data cleaning and map production. See our poster for more info!
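To make the modeling step described above more concrete, here is a minimal sketch of a three-feature Random Forest in scikit-learn. It is not our actual pipeline: the data are invented toy samples rather than the Global Landslide Catalog or CIMIS records, and the feature names, distributions, and hyperparameters are all assumptions chosen for illustration. It shows how feature importances and recall on landslide events can be read off a fitted model, and it compares training recall with held-out recall as one common (assumed) way to look for overfitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hypothetical feature names; the project's real column names are not given.
FEATURES = ["precip_same_day_mm", "precip_prior_day_mm", "burned_recently"]


def make_toy_dataset(n_per_class=500, seed=0):
    """Build invented data: positives are wetter and more often burned.

    These distributions are assumptions for illustration only and do not
    reproduce the project's positive or synthetic negative datasets.
    """
    rng = np.random.default_rng(seed)
    positives = np.column_stack([
        rng.gamma(shape=4.0, scale=15.0, size=n_per_class),  # same-day rain
        rng.gamma(shape=3.0, scale=10.0, size=n_per_class),  # prior-day rain
        rng.binomial(1, 0.40, size=n_per_class),             # burn flag
    ])
    negatives = np.column_stack([
        rng.gamma(shape=1.5, scale=8.0, size=n_per_class),
        rng.gamma(shape=1.5, scale=8.0, size=n_per_class),
        rng.binomial(1, 0.15, size=n_per_class),
    ])
    X = np.vstack([positives, negatives])
    y = np.concatenate([
        np.ones(n_per_class, dtype=int),   # 1 = landslide
        np.zeros(n_per_class, dtype=int),  # 0 = no landslide
    ])
    return X, y


X, y = make_toy_dataset()

# Hold out a stratified test split so recall is measured on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = dict(zip(FEATURES, model.feature_importances_))

# Recall = fraction of true landslide rows that the model flags.
train_recall = recall_score(y_train, model.predict(X_train))
test_recall = recall_score(y_test, model.predict(X_test))

for name, value in importances.items():
    print(f"{name}: {value:.2f}")
print(f"train recall: {train_recall:.2f}, held-out recall: {test_recall:.2f}")
```

A large gap between training and held-out recall would hint at overfitting. The numbers printed by this sketch come from the toy data and should not be compared with the results reported above.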
