Machine Learning for Fuel Efficiency Prediction: Tree-Based Model Analysis

These articles are AI-generated summaries. Please check the original sources for full details.

This article explores the application of tree-based machine learning models to predict fuel efficiency (MPG) using a dataset of vehicle characteristics. The process involves data preparation, model training, hyperparameter tuning, and evaluation, with insights into feature importance and model performance.


🧩 Data Preparation

  • Dataset Features:

    • Numerical: vehicle_weight, engine_displacement, horsepower, acceleration
    • Categorical: model_year, origin, fuel_type
  • Data Cleaning:

    • Missing values were filled with zeros to ensure consistency.
  • Train/Validation/Test Split:

    • 60%/20%/20% split with random_state=1 for reproducibility.
  • Feature Encoding:

    • DictVectorizer(sparse=True) was used to convert categorical and numerical features into a format compatible with scikit-learn models.

🌳 Decision Tree Regressor

  • Model Configuration:

    • max_depth=1 to create a simple tree for initial feature analysis.
  • Key Insight:

    • The first split was on model_year, indicating that newer vehicles have distinct fuel efficiency patterns compared to older models.
  • Purpose:

    • Demonstrates how tree-based models identify the most influential feature for splitting data.

🌲 Random Forest Regressor

  • Model Parameters:

    • n_estimators=10, random_state=1, n_jobs=-1 (to use all CPU cores).
  • Performance:

    • Validation RMSE ≈ 4.5, showing effective capture of relationships between engine specs and fuel efficiency.
  • Hyperparameter Tuning:

    • Tested n_estimators from 10 to 200 (step = 10). Performance plateaued after 80 estimators, indicating diminishing returns beyond this point.
  • Depth Tuning:

    • Compared max_depth values of 10, 15, 20, 25 with increasing n_estimators.
    • Best RMSE at max_depth=20, balancing bias and variance.
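
The tuning loop described above might look like the sketch below. It runs on synthetic data (an assumption for illustration), so the best combination it prints will not match the article's numbers. n_jobs is left at its default here to keep the example light.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: hypothetical values, not the article's dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Validation RMSE for every (max_depth, n_estimators) pair in the grid.
scores = {}
for max_depth in (10, 15, 20, 25):
    for n_estimators in range(10, 201, 10):
        rf = RandomForestRegressor(
            n_estimators=n_estimators, max_depth=max_depth, random_state=1
        )
        rf.fit(X_train, y_train)
        mse = mean_squared_error(y_val, rf.predict(X_val))
        scores[(max_depth, n_estimators)] = float(np.sqrt(mse))

best = min(scores, key=scores.get)
print(f"best max_depth={best[0]}, n_estimators={best[1]}, RMSE={scores[best]:.3f}")
```

Plotting RMSE against n_estimators for each depth is a common way to spot the plateau; rounding the RMSE before comparing avoids chasing differences that are only noise.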

πŸ” Feature Importance Analysis

  • Top Predictors:

    • engine_displacement (most important), followed by vehicle_weight and horsepower.
  • Domain Alignment:

    • Larger engines and heavier vehicles consume more fuel, aligning with real-world knowledge.
  • Method:

    • Random Forest's built-in feature importance metric was used to rank predictors.

⚡ XGBoost Experiments

  • Model Configuration:

    • Parameters (eta takes a single value per run; 0.3 and 0.1 were each tried):
      xgb_params = {
          'eta': 0.3,  # second run: 0.1
          'max_depth': 6,
          'objective': 'reg:squarederror',
          'nthread': 8,
          'seed': 1
      }
    • Trained for 100 rounds.
  • Performance:

    • eta=0.1 (the smaller learning rate) achieved the best validation RMSE, suggesting that slower learning generalized better on this dataset.
  • Key Takeaway:

    • With proper hyperparameter tuning, XGBoost can outperform simpler tree-based models.

🎯 Key Takeaways

  • Model Year Impact:

    • model_year was the first split in the depth-1 decision tree, marking it as a strong predictor of fuel efficiency.
  • Optimal Random Forest Configuration:

    • n_estimators ≈ 80 and max_depth=20 for balanced performance.
  • Top Predictor:

    • Engine displacement is the most critical factor for predicting MPG.
  • XGBoost Best Practice:

    • Use lower eta values (e.g., 0.1) for smoother convergence and better generalization.

💡 Final Thoughts

This project highlights the iterative process of model tuning, feature analysis, and interpretability in tree-based models. By comparing Decision Trees, Random Forests, and XGBoost, the author demonstrates how hyperparameters like n_estimators, max_depth, and eta affect performance. The findings align with domain knowledge, emphasizing the practical value of machine learning in real-world scenarios like automotive engineering.


Working Example (Python Code)

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction import DictVectorizer

# Sample data preparation
data = [
    {"vehicle_weight": 2500, "engine_displacement": 150, "model_year": 2000},
    {"vehicle_weight": 3000, "engine_displacement": 200, "model_year": 2010},
    # ... more data points
]
# Hypothetical MPG targets for illustration, one value per record in `data`
target = [
    30.0,
    24.0,
    # ... more target values
]

# Split data (80% train, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(
    data, target, test_size=0.2, random_state=1
)

# Vectorize features: fit on the training set only, then reuse for validation
dv = DictVectorizer(sparse=True)
X_train_vec = dv.fit_transform(X_train)
X_val_vec = dv.transform(X_val)

# Train Random Forest
rf = RandomForestRegressor(n_estimators=80, max_depth=20, random_state=1)
rf.fit(X_train_vec, y_train)

# Evaluate: take the square root of the MSE to get RMSE
# (the squared=False argument is deprecated in recent scikit-learn releases)
preds = rf.predict(X_val_vec)
rmse = mean_squared_error(y_val, preds) ** 0.5
print(f"Validation RMSE: {rmse:.2f}")

Recommendations

  • Use Cross-Validation: Always validate hyperparameters using cross-validation to avoid overfitting.
  • Monitor RMSE: Track performance metrics like RMSE during tuning to identify optimal parameters.
  • Feature Engineering: Prioritize features like engine_displacement and vehicle_weight for better model accuracy.
  • Avoid Over-Complexity: Use max_depth and n_estimators judiciously to prevent overfitting.
  • XGBoost Best Practices: Start with small eta values (e.g., 0.1) and increase tree depth gradually.
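
As a rough illustration of the cross-validation recommendation, the sketch below scores two candidate max_depth values with 5-fold cross-validation. The data is synthetic and the candidate values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: hypothetical values, not the article's dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)

cv_rmse = {}
for max_depth in (10, 20):
    rf = RandomForestRegressor(n_estimators=20, max_depth=max_depth, random_state=1)
    # scikit-learn maximizes scores, so RMSE is reported negated; flip the sign.
    neg_rmse = cross_val_score(
        rf, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[max_depth] = -neg_rmse

for max_depth, fold_rmse in cv_rmse.items():
    print(f"max_depth={max_depth}: mean RMSE={fold_rmse.mean():.3f} "
          f"(std {fold_rmse.std():.3f})")
```

Reporting the spread across folds alongside the mean helps judge whether a difference between two settings is larger than the fold-to-fold noise.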

Reference: Predicting Fuel Efficiency with Tree-Based Models
