Building a 1D CNN for Exoplanet Discovery: Lessons from 0.96 ROC-AUC

These articles are AI-generated summaries. Please check the original sources for full details.

I trained a neural network to find exoplanets. Here’s what actually worked.

Developer Gaurang built a 1D Convolutional Neural Network to classify exoplanets using phase-folded light curves from NASA’s Kepler mission. The system achieved a 0.96 ROC-AUC score by processing 400 data points representing a single orbital period.
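To make the input format concrete, here is a minimal sketch of phase folding: observation times are wrapped onto a single orbital period and averaged into 400 bins. The function name `phase_fold`, the use of NumPy, the median fill for empty bins, and the synthetic light curve are all assumptions for illustration; the original project's preprocessing may differ.

```python
import numpy as np

def phase_fold(time, flux, period, epoch, n_bins=400):
    """Fold a light curve on `period` and average it into `n_bins` phase bins."""
    time = np.asarray(time, dtype=float)
    flux = np.asarray(flux, dtype=float)
    # Phase in [-0.5, 0.5), with the dip centred at phase 0.
    phase = ((time - epoch) / period + 0.5) % 1.0 - 0.5
    # Assign each sample to one of n_bins equal-width phase bins.
    idx = np.minimum(((phase + 0.5) * n_bins).astype(int), n_bins - 1)
    sums = np.bincount(idx, weights=flux, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    # Empty bins fall back to the median flux (an arbitrary choice for this sketch).
    binned = np.full(n_bins, np.median(flux))
    filled = counts > 0
    binned[filled] = sums[filled] / counts[filled]
    return binned

# Synthetic data (not Kepler data): a 1% dip lasting 0.2 days every 3.5 days.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 90.0, 20000))
period, epoch = 3.5, 1.0
in_transit = np.abs(((t - epoch) / period + 0.5) % 1.0 - 0.5) * period < 0.1
f = 1.0 - 0.01 * in_transit + rng.normal(0.0, 0.001, t.size)
curve = phase_fold(t, f, period, epoch)
```

After folding, every orbit's dip lands near the middle of the 400-point array, so a 1D convolution sees the same shape at the same position regardless of the original period.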

Why This Matters

Real-world astronomical data is rarely balanced or clean, which poses significant challenges for standard neural network training. In this case, confirmed planets make up only 1% of the dataset, so a model can report high accuracy while being functionally useless. Handling that extreme class imbalance required weighted loss functions and rigorous exclusion of unverified labels.
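The arithmetic behind that warning can be sketched in a few lines. The toy labels below mirror the 1% positive rate; the helper `balanced_class_weights` is a hypothetical name, and the inverse-frequency formula it uses (the same one scikit-learn applies for `class_weight="balanced"`) is an assumption, since the summary does not say which weighting scheme was used.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Toy labels with a 1% positive rate: 1 = confirmed planet, 0 = not a planet.
labels = [1] * 10 + [0] * 990

# A model that always answers "not a planet" looks accurate but finds nothing.
always_negative = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(always_negative, labels)) / len(labels)
recall = sum(p == 1 and y == 1 for p, y in zip(always_negative, labels)) / sum(labels)

# Inverse-frequency weights make each missed planet cost far more in the loss.
weights = balanced_class_weights(labels)
print(f"accuracy={accuracy:.2f} recall={recall:.2f} weights={weights}")
```

With these weights, an error on a rare positive contributes about 100 times as much to the loss as an error on a negative, which removes the incentive to predict the majority class everywhere.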

Key Insights

  • Excluding ‘CANDIDATE’ labels prevented the model from learning unverified noise as truth, a critical step given NASA’s three-tier labeling system of confirmed, false positive, and candidate.
  • Class weights were essential to counteract the 1% representation of confirmed planets. Without them, the model could default to predicting ‘not a planet’ for every input and still reach a misleading 99% accuracy.
  • Phase-folded light curves of 400 data points each served as the primary input to the 1D CNN, which uses them to detect orbital brightness dips.
  • Parallel data fetching with 8 workers sped up retrieval of time-series light curves from the NASA archive.
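The label filtering and 8-worker fetching described in the list above might look roughly like the following. This is a sketch under stated assumptions: the catalog rows, the field names `target_id` and `disposition`, and `fetch_light_curve` are hypothetical stand-ins, and a thread pool is assumed because the summary does not say whether threads or processes were used.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical catalog rows; the field names are placeholders for this sketch.
catalog = [
    {"target_id": "A", "disposition": "CONFIRMED"},
    {"target_id": "B", "disposition": "FALSE POSITIVE"},
    {"target_id": "C", "disposition": "CANDIDATE"},
    {"target_id": "D", "disposition": "CONFIRMED"},
]

# Keep only rows with a settled label; CANDIDATE rows are unverified.
labelled = [row for row in catalog if row["disposition"] != "CANDIDATE"]

def fetch_light_curve(target_id):
    """Stand-in for a network request to the archive (hypothetical)."""
    return {"target_id": target_id, "flux": [1.0] * 400}

def fetch_all(target_ids, max_workers=8):
    # Threads suit I/O-bound downloads; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_light_curve, target_ids))

curves = fetch_all([row["target_id"] for row in labelled])
```

Filtering before fetching also avoids downloading light curves that would never be used for training.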

Practical Applications

  • Use Case: Streamlit-based model deployment for real-time visualization of ROC curves and confusion matrices on Kepler test data. Pitfall: Training on unverified candidate labels leads to confident but incorrect model predictions.
  • Use Case: Multi-worker data pipelines for fetching large-scale astronomical time-series data from remote archives. Pitfall: Information leakage during the train/val/test split can artificially inflate performance metrics and invalidate results.
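The summary does not say where leakage could arise. One common source, assumed here purely for illustration, is the same star contributing signals to more than one split. A minimal sketch of a grouped split assigns each hypothetical `star_id` to a single split through a deterministic hash; the function name and the 70/15/15 fractions are also assumptions.

```python
import hashlib

def assign_split(group_id, val_frac=0.15, test_frac=0.15):
    """Map a group ID to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(str(group_id).encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    u = int(digest[:8], 16) / 16**8
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"

# Hypothetical samples: two signals per star, so naive row-level shuffling
# could place the same star in both train and test.
samples = [{"star_id": i // 2, "signal": i} for i in range(200)]
splits = {"train": [], "val": [], "test": []}
for s in samples:
    splits[assign_split(s["star_id"])].append(s)

# Collect the set of stars in each split to check that none is shared.
stars = {name: {s["star_id"] for s in rows} for name, rows in splits.items()}
```

Because the assignment depends only on the group ID, rerunning the pipeline or adding new data never moves an existing star between splits.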
