House Prices Prediction Using Machine Learning
Overview
This project is my machine learning submission for the Kaggle competition House Prices: Advanced Regression Techniques. The objective is to predict the sale price of houses in Ames, Iowa, based on a dataset of 79 explanatory variables describing various aspects of residential homes.
The project combines feature engineering, ensemble modeling, and stacking. The final submission uses a stacked regressor ensemble, with a cross-validation RMSE (log scale) of approximately 0.123 and a test RMSE (log scale) of approximately 0.129.
Dataset
The dataset consists of two CSV files:
- train.csv: 1,460 training samples with 81 columns (including the target SalePrice).
- test.csv: 1,459 test samples with 80 columns (excluding SalePrice).
Approach
The workflow follows a structured pipeline:
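1 Data Loading and Splitting: Load the training data and split it into train/validation sets (80/20) before any preprocessing, so that imputation and encoding statistics are not computed on held-out rows (this is what prevents leakage).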
2 Exploratory Data Analysis (EDA):
- Distributions of SalePrice and LogSalePrice.
- Histograms for numeric features.
- Bar plots for categorical features.
3 Missing Value Handling:
- Numeric: Median imputation by neighborhood for LotFrontage; zero-fill for other numeric columns such as MasVnrArea.
- Categorical: Fill missing values with a “No_”-prefixed label (e.g., “No_Alley”).
4 Feature Engineering:
- Derived features: Total square footage (TotalSF), age (Age), bathrooms (Bathrooms), porch area (PorchArea).
- Interactions: Quality-Size (QualSize), Quality-Condition (QualCond).
- Encodings: Ordinal mapping for quality ratings; frequency encoding for high-cardinality nominals.
- Location: Neighborhood median price mapping; urban/suburban/rural categorization.
5 Feature Transformation:
- Log1p transformation for skewed numerics to stabilize variance.
6 Categorical Encoding:
- One-hot encoding for remaining nominals.
7 Modeling:
- Base models: RidgeCV, RandomForestRegressor, GradientBoostingRegressor.
- Ensemble: StackingRegressor with LinearRegression as meta-learner.
- Evaluation: 5-fold cross-validation RMSE and RMSE on the held-out 20% split (reported as Test RMSE), both on the log scale.
8 Submission: Predict on test set and generate CSV.
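The leakage-sensitive parts of the pipeline above (steps 1, 3 and 4) could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names, the tiny demo frame, and the formulas for TotalSF and Age are assumptions, since the list only names those features. The point it shows is that the neighborhood medians are learned from the training split and then reused unchanged on the validation split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def fit_frontage_medians(train):
    """Learn LotFrontage medians per neighborhood from the training split only."""
    by_hood = train.groupby("Neighborhood")["LotFrontage"].median()
    overall = float(train["LotFrontage"].median())  # fallback for unseen neighborhoods
    return by_hood, overall


def impute(df, by_hood, overall, none_cols=("Alley",)):
    """Apply the imputation rules using statistics fitted elsewhere."""
    out = df.copy()
    fallback = out["Neighborhood"].map(by_hood).fillna(overall)
    out["LotFrontage"] = out["LotFrontage"].fillna(fallback)
    out["MasVnrArea"] = out["MasVnrArea"].fillna(0)
    for col in none_cols:
        out[col] = out[col].fillna(f"No_{col}")  # e.g. "No_Alley"
    return out


def add_features(df):
    out = df.copy()
    # Assumed definitions: the README names these features without formulas.
    out["TotalSF"] = out["TotalBsmtSF"] + out["1stFlrSF"] + out["2ndFlrSF"]
    out["Age"] = out["YrSold"] - out["YearBuilt"]
    # Described in the results as Overall Quality x GrLivArea.
    out["QualSize"] = out["OverallQual"] * out["GrLivArea"]
    return out


# Tiny made-up frame standing in for train.csv.
demo = pd.DataFrame({
    "Neighborhood": ["NAmes"] * 5 + ["OldTown"] * 5,
    "LotFrontage": [60.0, np.nan, 70.0, 80.0, 65.0, 50.0, 55.0, np.nan, 45.0, 52.0],
    "MasVnrArea": [0.0, 120.0, np.nan, 0.0, 80.0, np.nan, 0.0, 0.0, 40.0, 0.0],
    "Alley": [None, "Grvl", None, None, "Pave", None, None, "Grvl", None, None],
    "TotalBsmtSF": [800, 900, 0, 1000, 750, 600, 650, 700, 0, 500],
    "1stFlrSF": [900, 950, 1100, 1000, 800, 700, 720, 760, 680, 640],
    "2ndFlrSF": [0, 400, 0, 500, 0, 300, 0, 350, 0, 0],
    "GrLivArea": [900, 1350, 1100, 1500, 800, 1000, 720, 1110, 680, 640],
    "OverallQual": [5, 6, 5, 7, 5, 4, 5, 6, 4, 5],
    "YearBuilt": [1960, 1975, 1958, 1990, 1962, 1925, 1930, 1940, 1920, 1935],
    "YrSold": [2008] * 10,
})

# Split first, then fit imputation statistics on the training part only.
train_raw, valid_raw = train_test_split(demo, test_size=0.2, random_state=42)
by_hood, overall = fit_frontage_medians(train_raw)
train = add_features(impute(train_raw, by_hood, overall))
valid = add_features(impute(valid_raw, by_hood, overall))
```

With the real data, `demo` would be replaced by `pd.read_csv("train.csv")`, and the same fitted medians would also be applied to test.csv.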
Visualizations include:
- Target distribution (pre/post log transform).
- Numeric histograms.
- Categorical bar charts.
- Residual plots.
- Feature importance bar charts.
Base Models Evaluation
| Model | CV RMSE (log) | Test RMSE (log) |
|---|---|---|
| Ridge 🥈 | 0.134726 | 0.134243 |
| RandomForest 🥉 | 0.139180 | 0.144454 |
| GradientBoosting 🥇 | 0.125139 | 0.134765 |
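A CV RMSE column like the one above can be produced with scikit-learn's `cross_val_score`. The sketch below is illustrative only: it uses synthetic data from `make_regression` as a stand-in for the engineered features and log target, and the hyperparameters and the choice to average per-fold RMSE are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for (features, log target).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

base_models = {
    "Ridge": RidgeCV(alphas=np.logspace(-3, 3, 13)),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_rmse = {}
for name, model in base_models.items():
    # The scorer returns negated RMSE (higher is better), so flip the sign.
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = float(-scores.mean())
    print(f"{name}: CV RMSE = {cv_rmse[name]:.4f}")
```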
Stacked Ensemble
- CV RMSE (log): 0.123196
- Test RMSE (log): 0.128989
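For reference, a stacked ensemble of this shape can be assembled with `StackingRegressor`. This is a minimal sketch on synthetic data; the estimator settings, the inner `cv=5`, and the variable names are assumptions, and only the choice of base models and the LinearRegression meta-learner comes from the description above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (features, log target).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

stack = StackingRegressor(
    estimators=[
        ("ridge", RidgeCV()),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=LinearRegression(),  # meta-learner on out-of-fold predictions
    cv=5,
)
stack.fit(X_train, y_train)

pred = stack.predict(X_valid)
rmse = float(np.sqrt(mean_squared_error(y_valid, pred)))
print(f"Held-out RMSE: {rmse:.4f}")
# One meta-learner weight per base model.
print(dict(zip(["ridge", "rf", "gb"], stack.final_estimator_.coef_)))
```

The meta-learner is trained on out-of-fold predictions from the base models, which is why its coefficients can be read as rough blending weights.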
Top Features (Across Models):
- QualSF (Overall Quality × Total SF)
- NeighborhoodPriceMedian
- QualSize (Overall Quality × GrLivArea)
Residual analysis indicates mild heteroscedasticity at higher prices, with Gradient Boosting showing the most stable predictions.
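One simple way to check for heteroscedasticity numerically, alongside a residual plot, is to compare the residual spread across bins of predicted value. The helper below is a hypothetical sketch (its name and the synthetic data are assumptions, not part of the notebook); the synthetic noise is deliberately made larger at higher predictions so the effect is visible.

```python
import numpy as np
import pandas as pd


def residual_spread_by_bin(y_true_log, y_pred_log, n_bins=4):
    """Residual mean/std per quantile bin of the predicted (log) value."""
    pred = np.asarray(y_pred_log, dtype=float)
    resid = np.asarray(y_true_log, dtype=float) - pred
    frame = pd.DataFrame({"pred": pred, "resid": resid})
    frame["bin"] = pd.qcut(frame["pred"], q=n_bins, labels=False, duplicates="drop")
    return frame.groupby("bin")["resid"].agg(["mean", "std", "count"])


# Synthetic example: noise grows with the predicted log price.
rng = np.random.default_rng(0)
pred_log = rng.uniform(11.0, 13.0, size=400)
noise_scale = 0.05 + 0.10 * (pred_log - 11.0)
true_log = pred_log + rng.normal(0.0, noise_scale)

spread = residual_spread_by_bin(true_log, pred_log)
print(spread)
```

A residual standard deviation that rises from the lowest to the highest bin is consistent with heteroscedasticity at higher prices.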
Insights
- The stacked model generalizes well, with little sign of overfitting: its CV RMSE (0.123) and test RMSE (0.129) differ by about 0.006.
- Key drivers: Property quality, size, and neighborhood prestige.
- Log transformation effectively handles right-skewed SalePrice (mean: $181,442; median: $165,000).
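The effect of the log transform on a right-skewed target can be illustrated with a short sketch. The prices here are synthetic log-normal draws (an assumption for illustration, not the Ames data); the sketch shows that the mean exceeds the median on the raw scale and that the skewness drops sharply after `log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices"; parameters are arbitrary.
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=1000)))

log_prices = np.log1p(prices)

skew_raw = float(prices.skew())
skew_log = float(log_prices.skew())

print(f"mean={prices.mean():,.0f}  median={prices.median():,.0f}")
print(f"skew before log1p: {skew_raw:.2f}")
print(f"skew after log1p:  {skew_log:.2f}")
```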
Sample submission predictions:
| Id | SalePrice |
|---|---|
| 1461 | 118,987 |
| 1462 | 158,082 |
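Because the models are evaluated on the log scale, predictions have to be mapped back to dollars before writing the submission file. The sketch below assumes the target was transformed with `log1p` (so `expm1` is the inverse); the function name is hypothetical, and the two log values are simply derived from the sample rows above rather than from a model.

```python
import numpy as np
import pandas as pd


def make_submission(ids, log_predictions):
    """Build a Kaggle-style frame, undoing the assumed log1p target transform."""
    sale_price = np.expm1(np.asarray(log_predictions, dtype=float))
    return pd.DataFrame({"Id": np.asarray(ids), "SalePrice": sale_price})


# Stand-ins for model output: log1p of the two sample predictions.
demo_ids = [1461, 1462]
demo_log_preds = np.log1p([118_987.0, 158_082.0])

submission = make_submission(demo_ids, demo_log_preds)
csv_text = submission.to_csv(index=False)  # pass a path instead to write a file
print(csv_text)
```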
Requirements
- Python 3.11+
- Key Libraries:
  - pandas==2.0.3
  - numpy==1.25.2
  - scikit-learn==1.2.2
  - scipy==1.11.2
  - matplotlib==3.9.0
  - seaborn==0.13.1
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
Contributions and issues are welcome! Please open a pull request or issue on GitHub.