For more projects visit: https://setscholars.net
# Suppress warnings in Jupyter Notebooks
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from pycaret.regression import *
# dataset name as registered in PyCaret's dataset repository
whichDataset = 'diamond'
from pycaret.datasets import get_data
dataset = get_data(whichDataset)
| | Carat Weight | Cut | Color | Clarity | Polish | Symmetry | Report | Price |
|---|---|---|---|---|---|---|---|---|
0 | 1.10 | Ideal | H | SI1 | VG | EX | GIA | 5169 |
1 | 0.83 | Ideal | H | VS1 | ID | ID | AGSL | 3470 |
2 | 0.85 | Ideal | H | SI1 | EX | EX | GIA | 3183 |
3 | 0.91 | Ideal | E | SI1 | VG | VG | GIA | 4370 |
4 | 0.83 | Ideal | G | SI1 | EX | EX | GIA | 3171 |
dataset.shape
(6000, 8)
dataset.columns.to_list()
['Carat Weight', 'Cut', 'Color', 'Clarity', 'Polish', 'Symmetry', 'Report', 'Price']
data = dataset.sample(frac=0.75, random_state=1234)
data_unseen = dataset.drop(data.index)
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (4500, 8)
Unseen Data For Predictions: (1500, 8)
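# The unseen frame simulates data that arrives after deployment and is
# never shown to setup(). Quick sanity check (our addition) that the two
# frames partition the original dataset exactly:
assert len(data) + len(data_unseen) == len(dataset)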
env_setup = setup(data = data, target = 'Price', session_id=1234)
| | Description | Value |
|---|---|---|
0 | session_id | 1234 |
1 | Target | Price |
2 | Original Data | (4500, 8) |
3 | Missing Values | False |
4 | Numeric Features | 1 |
5 | Categorical Features | 6 |
6 | Ordinal Features | False |
7 | High Cardinality Features | False |
8 | High Cardinality Method | None |
9 | Transformed Train Set | (3149, 28) |
10 | Transformed Test Set | (1351, 28) |
11 | Shuffle Train-Test | True |
12 | Stratify Train-Test | False |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | c218 |
20 | Imputation Type | simple |
21 | Iterative Imputation Iteration | None |
22 | Numeric Imputer | mean |
23 | Iterative Imputation Numeric Model | None |
24 | Categorical Imputer | constant |
25 | Iterative Imputation Categorical Model | None |
26 | Unknown Categoricals Handling | least_frequent |
27 | Normalize | False |
28 | Normalize Method | None |
29 | Transformation | False |
30 | Transformation Method | None |
31 | PCA | False |
32 | PCA Method | None |
33 | PCA Components | None |
34 | Ignore Low Variance | False |
35 | Combine Rare Levels | False |
36 | Rare Level Threshold | None |
37 | Numeric Binning | False |
38 | Remove Outliers | False |
39 | Outliers Threshold | None |
40 | Remove Multicollinearity | False |
41 | Multicollinearity Threshold | None |
42 | Remove Perfect Collinearity | True |
43 | Clustering | False |
44 | Clustering Iteration | None |
45 | Polynomial Features | False |
46 | Polynomial Degree | None |
47 | Trignometry Features | False |
48 | Polynomial Threshold | None |
49 | Group Features | False |
50 | Feature Selection | False |
51 | Feature Selection Method | classic |
52 | Features Selection Threshold | None |
53 | Feature Interaction | False |
54 | Feature Ratio | False |
55 | Interaction Threshold | None |
56 | Transform Target | False |
57 | Transform Target Method | box-cox |
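# The setup() call above relies entirely on PyCaret's inferred column types
# and default preprocessing. If inference ever misfires, the types can be
# pinned explicitly. A hedged sketch of a more controlled call; the parameter
# values here are illustrative, not the ones used in this run:
env_setup_explicit = setup(
    data = data,
    target = 'Price',
    session_id = 1234,
    categorical_features = ['Cut', 'Color', 'Clarity', 'Polish', 'Symmetry', 'Report'],
    numeric_features = ['Carat Weight'],
    normalize = False,         # set True to scale numeric features
    transform_target = False,  # set True to Box-Cox transform 'Price'
    silent = True              # skip the interactive dtype confirmation
)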
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter('ignore')
# --------------------------------------
best_model = compare_models()
# --------------------------------------
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
catboost | CatBoost Regressor | 625.0602 | 2228674.3579 | 1394.4117 | 0.9798 | 0.0669 | 0.0493 | 0.8640 |
xgboost | Extreme Gradient Boosting | 692.3342 | 2587044.0374 | 1509.6355 | 0.9761 | 0.0723 | 0.0533 | 0.3620 |
et | Extra Trees Regressor | 755.4115 | 2843385.7954 | 1585.2596 | 0.9739 | 0.0820 | 0.0603 | 0.5090 |
rf | Random Forest Regressor | 773.3028 | 3376399.9351 | 1729.5790 | 0.9683 | 0.0825 | 0.0600 | 0.4800 |
gbr | Gradient Boosting Regressor | 903.2423 | 3447752.8881 | 1781.3797 | 0.9675 | 0.1017 | 0.0772 | 0.0930 |
lightgbm | Light Gradient Boosting Machine | 788.8981 | 3848577.0710 | 1858.4638 | 0.9644 | 0.0814 | 0.0585 | 0.0430 |
dt | Decision Tree Regressor | 1011.5746 | 6267643.4285 | 2366.3179 | 0.9397 | 0.1104 | 0.0776 | 0.0130 |
ridge | Ridge Regression | 2350.8046 | 14218652.6172 | 3718.0422 | 0.8605 | 0.6302 | 0.2740 | 0.0120 |
lr | Linear Regression | 2349.4629 | 14198978.3137 | 3724.0871 | 0.8598 | 0.6629 | 0.2735 | 0.2580 |
lasso | Lasso Regression | 2347.4218 | 14247775.4243 | 3727.8517 | 0.8596 | 0.6336 | 0.2730 | 0.0130 |
llar | Lasso Least Angle Regression | 2299.9601 | 14289637.1079 | 3729.9228 | 0.8596 | 0.5935 | 0.2612 | 0.0090 |
br | Bayesian Ridge | 2351.2589 | 14287357.8848 | 3730.9265 | 0.8595 | 0.6383 | 0.2737 | 0.0140 |
huber | Huber Regressor | 1894.8182 | 18146646.3192 | 4187.2851 | 0.8236 | 0.4127 | 0.1602 | 0.0460 |
par | Passive Aggressive Regressor | 1909.1779 | 20264600.5022 | 4394.3689 | 0.8060 | 0.3784 | 0.1522 | 0.0190 |
omp | Orthogonal Matching Pursuit | 2929.2796 | 24232428.5276 | 4851.2578 | 0.7630 | 0.5051 | 0.2857 | 0.0080 |
ada | AdaBoost Regressor | 3921.5597 | 23378521.1175 | 4797.1697 | 0.7630 | 0.4635 | 0.5271 | 0.0780 |
knn | K Neighbors Regressor | 3135.7262 | 32187393.3592 | 5654.4385 | 0.6712 | 0.3876 | 0.2952 | 0.0420 |
en | Elastic Net | 5000.5618 | 56581353.2931 | 7449.4030 | 0.4405 | 0.5341 | 0.5790 | 0.0080 |
dummy | Dummy Regressor | 7207.5151 | 100232834.3497 | 9957.6151 | -0.0050 | 0.7545 | 0.8832 | 0.0060 |
lar | Least Angle Regression | 4261958.0728 | 329021617145702.2500 | 5739426.2367 | -3893985.2894 | 1.3839 | 609.7337 | 0.0120 |
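# compare_models() above cross-validated every available regressor with
# defaults and ranked by R2. The sweep can be narrowed and re-sorted; a
# minimal sketch with illustrative arguments:
best_by_mae = compare_models(
    include = ['catboost', 'xgboost', 'lightgbm', 'rf'],  # restrict candidates
    fold = 5,                                             # 5-fold CV instead of 10
    sort = 'MAE'                                          # rank by MAE instead of R2
)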
model_1 = create_model('xgboost')
| | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
0 | 722.4200 | 2479157.2015 | 1574.5340 | 0.9739 | 0.0750 | 0.0551 |
1 | 661.2163 | 2256146.4191 | 1502.0474 | 0.9778 | 0.0677 | 0.0507 |
2 | 569.9288 | 936903.7718 | 967.9379 | 0.9876 | 0.0691 | 0.0519 |
3 | 714.4825 | 3176026.8133 | 1782.1411 | 0.9604 | 0.0759 | 0.0542 |
4 | 659.0378 | 1495177.7101 | 1222.7746 | 0.9823 | 0.0707 | 0.0555 |
5 | 687.2030 | 1783504.2999 | 1335.4791 | 0.9833 | 0.0697 | 0.0521 |
6 | 670.6879 | 1419911.2308 | 1191.6003 | 0.9838 | 0.0725 | 0.0550 |
7 | 907.8694 | 9253747.6510 | 3041.9973 | 0.9408 | 0.0817 | 0.0540 |
8 | 657.9486 | 1540750.4903 | 1241.2697 | 0.9856 | 0.0685 | 0.0518 |
9 | 672.5474 | 1529114.7864 | 1236.5738 | 0.9850 | 0.0718 | 0.0529 |
Mean | 692.3342 | 2587044.0374 | 1509.6355 | 0.9761 | 0.0723 | 0.0533 |
SD | 81.8413 | 2303256.9362 | 555.0177 | 0.0140 | 0.0041 | 0.0016 |
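# create_model() uses 10-fold CV by default; the fold count can be changed,
# and extra keyword arguments are forwarded to the underlying estimator.
# A sketch with assumed values:
model_1b = create_model('xgboost', fold = 5, max_depth = 4)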
tuned_model_1 = tune_model(model_1)
| | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
0 | 836.1935 | 2464695.6517 | 1569.9349 | 0.9741 | 0.0913 | 0.0697 |
1 | 756.1560 | 1983860.3524 | 1408.4958 | 0.9805 | 0.0838 | 0.0624 |
2 | 693.4941 | 1341535.7589 | 1158.2468 | 0.9823 | 0.0871 | 0.0655 |
3 | 777.2015 | 2919270.3835 | 1708.5872 | 0.9636 | 0.0906 | 0.0673 |
4 | 740.9362 | 1871320.0573 | 1367.9620 | 0.9779 | 0.0840 | 0.0639 |
5 | 767.9895 | 1741654.3651 | 1319.7175 | 0.9837 | 0.0836 | 0.0616 |
6 | 756.0067 | 2249091.8517 | 1499.6973 | 0.9743 | 0.0832 | 0.0638 |
7 | 980.5202 | 10474813.5066 | 3236.4817 | 0.9329 | 0.1088 | 0.0639 |
8 | 697.0777 | 1720187.7301 | 1311.5593 | 0.9839 | 0.0790 | 0.0608 |
9 | 735.7028 | 2023796.2823 | 1422.6019 | 0.9802 | 0.0823 | 0.0619 |
Mean | 774.1278 | 2879022.5939 | 1600.3284 | 0.9733 | 0.0874 | 0.0641 |
SD | 78.8449 | 2565437.6120 | 563.8896 | 0.0146 | 0.0080 | 0.0026 |
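# Note that the tuned mean R2 (0.9733) is slightly below the untuned 0.9761:
# tune_model() is a random search and can land on worse hyperparameters than
# the defaults. The search can be steered with a custom grid and a different
# objective; the grid values below are ours, purely illustrative:
tuned_model_1b = tune_model(
    model_1,
    optimize = 'MAE',   # tune for MAE instead of the default R2
    n_iter = 50,        # number of random-search candidates
    custom_grid = {
        'max_depth': [3, 5, 7],
        'learning_rate': [0.05, 0.1, 0.2],
        'n_estimators': [100, 200, 300]
    }
)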
print(tuned_model_1)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.4, max_delta_step=0, max_depth=3, min_child_weight=3, missing=nan, monotone_constraints='()', n_estimators=170, n_jobs=-1, num_parallel_tree=1, objective='reg:squarederror', random_state=1234, reg_alpha=0.0005, reg_lambda=1e-07, scale_pos_weight=45.400000000000006, subsample=0.9, tree_method='auto', validate_parameters=1, verbosity=0)
plot_model(tuned_model_1, plot = 'residuals')
plot_model(tuned_model_1, plot = 'error')
plot_model(tuned_model_1, plot='feature')
plot_model(tuned_model_1, plot = 'learning')
plot_model(tuned_model_1, plot = 'vc')
#plot_model(tuned_model_1, plot = 'rfe')
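# The individual plots above are also available from a single interactive
# widget, and SHAP-based interpretation works for tree-based models such as
# XGBoost (requires the shap package to be installed):
evaluate_model(tuned_model_1)
interpret_model(tuned_model_1)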
model_2 = create_model('rf')
| | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
0 | 721.9996 | 2076744.7416 | 1441.0915 | 0.9782 | 0.0752 | 0.0578 |
1 | 811.2659 | 4979708.9204 | 2231.5261 | 0.9511 | 0.0797 | 0.0588 |
2 | 640.5996 | 1214667.6741 | 1102.1196 | 0.9840 | 0.0833 | 0.0599 |
3 | 784.4487 | 3898120.1915 | 1974.3658 | 0.9514 | 0.0881 | 0.0619 |
4 | 777.5991 | 2410188.1139 | 1552.4781 | 0.9715 | 0.0830 | 0.0642 |
5 | 707.3217 | 1763962.4942 | 1328.1425 | 0.9835 | 0.0793 | 0.0555 |
6 | 787.3827 | 2106264.6219 | 1451.2976 | 0.9759 | 0.0828 | 0.0621 |
7 | 1044.1351 | 10925307.8264 | 3305.3453 | 0.9301 | 0.0966 | 0.0640 |
8 | 667.8969 | 1380233.5850 | 1174.8334 | 0.9871 | 0.0728 | 0.0566 |
9 | 790.3785 | 3008801.1824 | 1734.5896 | 0.9705 | 0.0840 | 0.0592 |
Mean | 773.3028 | 3376399.9351 | 1729.5790 | 0.9683 | 0.0825 | 0.0600 |
SD | 105.4182 | 2746954.0314 | 620.4487 | 0.0174 | 0.0063 | 0.0028 |
tuned_model_2 = tune_model(model_2)
| | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
0 | 1832.9599 | 11490027.1949 | 3389.6943 | 0.8791 | 0.2748 | 0.2327 |
1 | 1801.1970 | 15279017.9638 | 3908.8384 | 0.8499 | 0.2737 | 0.2301 |
2 | 1774.4320 | 6472274.7053 | 2544.0666 | 0.9145 | 0.2810 | 0.2466 |
3 | 1901.2658 | 8688470.9879 | 2947.6212 | 0.8917 | 0.2938 | 0.2584 |
4 | 1853.2939 | 9128964.5327 | 3021.4176 | 0.8920 | 0.2854 | 0.2443 |
5 | 1923.0723 | 12176009.1262 | 3489.4139 | 0.8858 | 0.2304 | 0.1917 |
6 | 2021.1046 | 9481134.8776 | 3079.1452 | 0.8917 | 0.2726 | 0.2414 |
7 | 2562.6819 | 30369056.0469 | 5510.8126 | 0.8056 | 0.3165 | 0.2669 |
8 | 2158.5952 | 13132939.7464 | 3623.9398 | 0.8772 | 0.2897 | 0.2523 |
9 | 2158.0531 | 12973490.7400 | 3601.8732 | 0.8728 | 0.2942 | 0.2557 |
Mean | 1998.6656 | 12919138.5922 | 3511.6823 | 0.8760 | 0.2812 | 0.2420 |
SD | 228.6190 | 6311559.2923 | 766.3068 | 0.0283 | 0.0210 | 0.0199 |
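# Tuning clearly hurt the random forest here (mean R2 fell from 0.9683 to
# 0.8760), a known risk of a small random search. PyCaret's choose_better
# flag guards against this by returning whichever of the base and tuned
# models scores higher on the optimization metric:
tuned_model_2_safe = tune_model(model_2, choose_better = True)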
print(tuned_model_2)
RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse', max_depth=10, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0, min_impurity_split=None, min_samples_leaf=5, min_samples_split=7, min_weight_fraction_leaf=0.0, n_estimators=160, n_jobs=-1, oob_score=False, random_state=1234, verbose=0, warm_start=False)
plot_model(tuned_model_2, plot = 'residuals')
plot_model(tuned_model_2, plot = 'error')
plot_model(tuned_model_2, plot = 'feature')
plot_model(tuned_model_2, plot = 'learning')
plot_model(tuned_model_2, plot = 'vc')
#plot_model(tuned_model_2, plot = 'rfe')
predict_model(tuned_model_1);
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
0 | Extreme Gradient Boosting | 757.7599 | 2437724.3481 | 1561.3213 | 0.9771 | 0.0810 | 0.0605 |
predict_model(tuned_model_2);
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
0 | Random Forest Regressor | 1917.1226 | 11036283.1471 | 3322.0902 | 0.8965 | 0.2727 | 0.2287 |
final_model = finalize_model(tuned_model_1);
# Final model parameters for deployment
print(final_model)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.4, max_delta_step=0, max_depth=3, min_child_weight=3, missing=nan, monotone_constraints='()', n_estimators=170, n_jobs=-1, num_parallel_tree=1, objective='reg:squarederror', random_state=1234, reg_alpha=0.0005, reg_lambda=1e-07, scale_pos_weight=45.400000000000006, subsample=0.9, tree_method='auto', validate_parameters=1, verbosity=0)
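# Note: finalize_model() refits the pipeline on the full modeling set,
# training and hold-out rows alike, so the hold-out scores below are
# optimistic; the unseen-data metrics further down are the honest estimate.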
predict_model(final_model);
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
0 | Extreme Gradient Boosting | 603.2393 | 889180.0389 | 942.9634 | 0.9917 | 0.0770 | 0.0568 |
unseen_predictions = predict_model(final_model, data=data_unseen)
unseen_predictions.head()
| | Carat Weight | Cut | Color | Clarity | Polish | Symmetry | Report | Price | Label |
|---|---|---|---|---|---|---|---|---|---|
0 | 0.91 | Ideal | E | SI1 | VG | VG | GIA | 4370 | 4616.125977 |
1 | 1.50 | Fair | F | SI1 | VG | VG | GIA | 10450 | 9047.586914 |
2 | 0.91 | Ideal | D | VS2 | VG | VG | GIA | 6224 | 5786.869629 |
3 | 2.20 | Ideal | H | VS2 | EX | VG | GIA | 22241 | 21379.474609 |
4 | 1.52 | Ideal | D | VS1 | EX | EX | GIA | 17659 | 16941.912109 |
from pycaret.utils import check_metric
check_metric(unseen_predictions['Price'], unseen_predictions['Label'], metric = 'R2')
0.9782
check_metric(unseen_predictions['Price'], unseen_predictions['Label'], metric = 'MAE')
830.3992
check_metric(unseen_predictions['Price'], unseen_predictions['Label'], metric = 'MSE')
2358529.9247
check_metric(unseen_predictions['Price'], unseen_predictions['Label'], metric = 'RMSE')
1535.7506
check_metric(unseen_predictions['Price'], unseen_predictions['Label'], metric = 'MAPE')
0.0664
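# The unseen-data metrics can be cross-checked directly against scikit-learn,
# which PyCaret uses under the hood. A minimal sketch; MAPE is computed by
# hand to avoid depending on a newer scikit-learn version:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
y_true = unseen_predictions['Price']
y_pred = unseen_predictions['Label']
print('R2  :', r2_score(y_true, y_pred))
print('MAE :', mean_absolute_error(y_true, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_true, y_pred)))
print('MAPE:', np.mean(np.abs((y_true - y_pred) / y_true)))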
save_model(final_model,'Final_Model')
Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=True, features_todrop=[], id_columns=[], ml_usecase='regression', numerical_features=[], target='Price', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='not_available', fill_value_categorical=None, fill_value_numerical=None, numeric_strategy='... interaction_constraints='', learning_rate=0.4, max_delta_step=0, max_depth=3, min_child_weight=3, missing=nan, monotone_constraints='()', n_estimators=170, n_jobs=-1, num_parallel_tree=1, objective='reg:squarederror', random_state=1234, reg_alpha=0.0005, reg_lambda=1e-07, scale_pos_weight=45.400000000000006, subsample=0.9, tree_method='auto', validate_parameters=1, verbosity=0)]], verbose=False), 'Final_Model.pkl')
load_saved_model = load_model('Final_Model')
Transformation Pipeline and Model Successfully Loaded
new_prediction = predict_model(load_saved_model, data=data_unseen)
new_prediction.head(10)
| | Carat Weight | Cut | Color | Clarity | Polish | Symmetry | Report | Price | Label |
|---|---|---|---|---|---|---|---|---|---|
0 | 0.91 | Ideal | E | SI1 | VG | VG | GIA | 4370 | 4616.125977 |
1 | 1.50 | Fair | F | SI1 | VG | VG | GIA | 10450 | 9047.586914 |
2 | 0.91 | Ideal | D | VS2 | VG | VG | GIA | 6224 | 5786.869629 |
3 | 2.20 | Ideal | H | VS2 | EX | VG | GIA | 22241 | 21379.474609 |
4 | 1.52 | Ideal | D | VS1 | EX | EX | GIA | 17659 | 16941.912109 |
5 | 1.07 | Very Good | G | SI1 | G | G | GIA | 4829 | 4921.619141 |
6 | 0.92 | Ideal | G | SI1 | G | VG | GIA | 4025 | 4101.433594 |
7 | 2.01 | Very Good | I | VS1 | VG | VG | GIA | 18023 | 19085.789062 |
8 | 2.12 | Ideal | F | VS1 | EX | EX | GIA | 33667 | 32184.150391 |
9 | 0.80 | Ideal | D | VS2 | VG | EX | GIA | 3817 | 4233.546387 |
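# In deployment the loaded pipeline scores arbitrary new records the same
# way, provided the DataFrame carries the original feature columns. A
# single-row sketch; the feature values here are made up:
single_row = pd.DataFrame([{
    'Carat Weight': 1.25, 'Cut': 'Ideal', 'Color': 'G', 'Clarity': 'VS2',
    'Polish': 'EX', 'Symmetry': 'EX', 'Report': 'GIA'
}])
print(predict_model(load_saved_model, data = single_row)['Label'])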
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from pycaret.regression import *
# dataset name as registered in PyCaret's dataset repository
whichDataset = 'diamond'
from pycaret.datasets import get_data
dataset = get_data(whichDataset)
data = dataset.sample(frac=0.70, random_state=421)
data_unseen = dataset.drop(data.index)
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
| | Carat Weight | Cut | Color | Clarity | Polish | Symmetry | Report | Price |
|---|---|---|---|---|---|---|---|---|
0 | 1.10 | Ideal | H | SI1 | VG | EX | GIA | 5169 |
1 | 0.83 | Ideal | H | VS1 | ID | ID | AGSL | 3470 |
2 | 0.85 | Ideal | H | SI1 | EX | EX | GIA | 3183 |
3 | 0.91 | Ideal | E | SI1 | VG | VG | GIA | 4370 |
4 | 0.83 | Ideal | G | SI1 | EX | EX | GIA | 3171 |
Data for Modeling: (4200, 8)
Unseen Data For Predictions: (1800, 8)
env_setup = setup(data = data, target = 'Price', session_id=1234)
| | Description | Value |
|---|---|---|
0 | session_id | 1234 |
1 | Target | Price |
2 | Original Data | (4200, 8) |
3 | Missing Values | False |
4 | Numeric Features | 1 |
5 | Categorical Features | 6 |
6 | Ordinal Features | False |
7 | High Cardinality Features | False |
8 | High Cardinality Method | None |
9 | Transformed Train Set | (2939, 28) |
10 | Transformed Test Set | (1261, 28) |
11 | Shuffle Train-Test | True |
12 | Stratify Train-Test | False |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | 1b0d |
20 | Imputation Type | simple |
21 | Iterative Imputation Iteration | None |
22 | Numeric Imputer | mean |
23 | Iterative Imputation Numeric Model | None |
24 | Categorical Imputer | constant |
25 | Iterative Imputation Categorical Model | None |
26 | Unknown Categoricals Handling | least_frequent |
27 | Normalize | False |
28 | Normalize Method | None |
29 | Transformation | False |
30 | Transformation Method | None |
31 | PCA | False |
32 | PCA Method | None |
33 | PCA Components | None |
34 | Ignore Low Variance | False |
35 | Combine Rare Levels | False |
36 | Rare Level Threshold | None |
37 | Numeric Binning | False |
38 | Remove Outliers | False |
39 | Outliers Threshold | None |
40 | Remove Multicollinearity | False |
41 | Multicollinearity Threshold | None |
42 | Remove Perfect Collinearity | True |
43 | Clustering | False |
44 | Clustering Iteration | None |
45 | Polynomial Features | False |
46 | Polynomial Degree | None |
47 | Trignometry Features | False |
48 | Polynomial Threshold | None |
49 | Group Features | False |
50 | Feature Selection | False |
51 | Feature Selection Method | classic |
52 | Features Selection Threshold | None |
53 | Feature Interaction | False |
54 | Feature Ratio | False |
55 | Interaction Threshold | None |
56 | Transform Target | False |
57 | Transform Target Method | box-cox |
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter('ignore')
# compare all baseline models and select top 5
top_models = compare_models(n_select = 5)
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
catboost | CatBoost Regressor | 652.3723 | 2140917.0090 | 1434.0609 | 0.9803 | 0.0678 | 0.0495 | 0.6390 |
xgboost | Extreme Gradient Boosting | 738.7598 | 2301043.9160 | 1493.0062 | 0.9787 | 0.0750 | 0.0556 | 0.3440 |
et | Extra Trees Regressor | 757.7881 | 2384221.9741 | 1519.8308 | 0.9776 | 0.0798 | 0.0596 | 0.4640 |
rf | Random Forest Regressor | 763.7998 | 2526082.1798 | 1549.8878 | 0.9767 | 0.0807 | 0.0594 | 0.4560 |
lightgbm | Light Gradient Boosting Machine | 779.3864 | 3203799.4666 | 1738.8890 | 0.9707 | 0.0777 | 0.0571 | 0.0280 |
gbr | Gradient Boosting Regressor | 907.4681 | 3309181.5024 | 1798.9907 | 0.9692 | 0.1003 | 0.0748 | 0.0890 |
dt | Decision Tree Regressor | 997.5137 | 4161285.0494 | 2024.4496 | 0.9609 | 0.1072 | 0.0779 | 0.0150 |
ridge | Ridge Regression | 2547.4826 | 15342038.1800 | 3889.4248 | 0.8571 | 0.6450 | 0.3048 | 0.0080 |
br | Bayesian Ridge | 2548.8950 | 15342551.0528 | 3889.5542 | 0.8571 | 0.6395 | 0.3052 | 0.0090 |
llar | Lasso Least Angle Regression | 2498.0843 | 15371416.8748 | 3890.8988 | 0.8570 | 0.6532 | 0.2926 | 0.0090 |
lasso | Lasso Regression | 2548.4333 | 15364496.1155 | 3891.8053 | 0.8569 | 0.6375 | 0.3050 | 0.0110 |
huber | Huber Regressor | 2043.5376 | 20937306.6637 | 4524.2595 | 0.8072 | 0.4637 | 0.1733 | 0.0440 |
par | Passive Aggressive Regressor | 2036.4603 | 22213369.2717 | 4661.9997 | 0.7954 | 0.4258 | 0.1628 | 0.0180 |
omp | Orthogonal Matching Pursuit | 2848.1076 | 25013248.0168 | 4964.0700 | 0.7674 | 0.4348 | 0.2682 | 0.0080 |
ada | AdaBoost Regressor | 4289.2881 | 24876511.6895 | 4980.2022 | 0.7611 | 0.5204 | 0.6104 | 0.0770 |
knn | K Neighbors Regressor | 3382.3627 | 33974748.2298 | 5803.4368 | 0.6793 | 0.4009 | 0.3201 | 0.0410 |
en | Elastic Net | 5162.5092 | 59744599.7909 | 7695.6141 | 0.4426 | 0.5411 | 0.5909 | 0.0080 |
dummy | Dummy Regressor | 7466.6767 | 106924368.9269 | 10317.2567 | -0.0049 | 0.7681 | 0.9096 | 0.0070 |
lar | Least Angle Regression | 7346.3319 | 32364219472.9374 | 60433.7388 | -257.9050 | 0.6842 | 0.4518 | 0.0090 |
lr | Linear Regression | 14726.2369 | 218137914955.1214 | 151113.9618 | -1743.9216 | 0.6817 | 0.6264 | 0.0080 |
top_models
[<catboost.core.CatBoostRegressor at 0x7f344a072710>, XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.300000012, max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=-1, num_parallel_tree=1, objective='reg:squarederror', random_state=1234, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='auto', validate_parameters=1, verbosity=0), ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=1234, verbose=0, warm_start=False), RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=1234, verbose=0, warm_start=False), LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=-1, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31, objective=None, random_state=1234, reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0)]
# tune top base models
tuned_top_models = [tune_model(i) for i in top_models]
| | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
0 | 806.6623 | 3220520.8276 | 1794.5810 | 0.9705 | 0.0851 | 0.0599 |
1 | 709.1156 | 2598346.9027 | 1611.9389 | 0.9652 | 0.0802 | 0.0574 |
2 | 905.4744 | 6640590.2389 | 2576.9343 | 0.9469 | 0.0914 | 0.0597 |
3 | 753.1021 | 3071615.1212 | 1752.6024 | 0.9723 | 0.0738 | 0.0549 |
4 | 805.6034 | 2523138.6673 | 1588.4391 | 0.9756 | 0.0832 | 0.0618 |
5 | 763.4681 | 2291955.4233 | 1513.9205 | 0.9783 | 0.0815 | 0.0600 |
6 | 861.3803 | 3236581.5474 | 1799.0502 | 0.9742 | 0.0787 | 0.0592 |
7 | 826.5118 | 3021292.8581 | 1738.1867 | 0.9694 | 0.0858 | 0.0625 |
8 | 669.0010 | 1773052.7073 | 1331.5603 | 0.9816 | 0.0751 | 0.0569 |
9 | 892.5622 | 4102929.4877 | 2025.5689 | 0.9643 | 0.0850 | 0.0600 |
Mean | 799.2881 | 3248002.3781 | 1773.2782 | 0.9698 | 0.0820 | 0.0592 |
SD | 72.9304 | 1279098.5124 | 321.6936 | 0.0092 | 0.0050 | 0.0022 |
tuned_top_models
[<catboost.core.CatBoostRegressor at 0x7f344a06aa50>, XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.4, max_delta_step=0, max_depth=3, min_child_weight=3, missing=nan, monotone_constraints='()', n_estimators=170, n_jobs=-1, num_parallel_tree=1, objective='reg:squarederror', random_state=1234, reg_alpha=0.0005, reg_lambda=1e-07, scale_pos_weight=45.400000000000006, subsample=0.9, tree_method='auto', validate_parameters=1, verbosity=0), ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse', max_depth=3, max_features=1.0, max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.2, min_impurity_split=None, min_samples_leaf=5, min_samples_split=10, min_weight_fraction_leaf=0.0, n_estimators=120, n_jobs=-1, oob_score=False, random_state=1234, verbose=0, warm_start=False), RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse', max_depth=10, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0, min_impurity_split=None, min_samples_leaf=5, min_samples_split=7, min_weight_fraction_leaf=0.0, n_estimators=160, n_jobs=-1, oob_score=False, random_state=1234, verbose=0, warm_start=False), LGBMRegressor(bagging_fraction=0.9, bagging_freq=0, boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, feature_fraction=1.0, importance_type='split', learning_rate=0.3, max_depth=-1, min_child_samples=61, min_child_weight=0.001, min_split_gain=0.3, n_estimators=190, n_jobs=-1, num_leaves=20, objective=None, random_state=1234, reg_alpha=0.15, reg_lambda=0.0001, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0)]
# ensemble top tuned models
bagged_top_models = [ensemble_model(i) for i in tuned_top_models]
| | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
0 | 799.2844 | 3242401.3624 | 1800.6669 | 0.9703 | 0.0817 | 0.0579 |
1 | 633.7802 | 2362337.8416 | 1536.9899 | 0.9684 | 0.0733 | 0.0526 |
2 | 893.1270 | 7597992.4405 | 2756.4456 | 0.9392 | 0.0880 | 0.0545 |
3 | 707.2952 | 2911294.5424 | 1706.2516 | 0.9738 | 0.0675 | 0.0517 |
4 | 744.9689 | 1978323.7125 | 1406.5290 | 0.9808 | 0.0741 | 0.0572 |
5 | 766.4194 | 2325476.6611 | 1524.9514 | 0.9780 | 0.0780 | 0.0576 |
6 | 819.0654 | 3287341.7717 | 1813.1028 | 0.9738 | 0.0727 | 0.0541 |
7 | 708.5915 | 2523989.7860 | 1588.7070 | 0.9744 | 0.0719 | 0.0525 |
8 | 650.2556 | 1864797.5179 | 1365.5759 | 0.9807 | 0.0722 | 0.0528 |
9 | 899.3393 | 4200397.2505 | 2049.4871 | 0.9634 | 0.0838 | 0.0599 |
Mean | 762.2127 | 3229435.2887 | 1754.8707 | 0.9703 | 0.0763 | 0.0551 |
SD | 87.0542 | 1600811.1224 | 387.1228 | 0.0116 | 0.0061 | 0.0027 |
bagged_top_models
[BaggingRegressor(base_estimator=<catboost.core.CatBoostRegressor object at 0x7f3458c18250>, bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=1234, verbose=0, warm_start=False), BaggingRegressor(base_estimator=XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.4, max_delta_step=0, max_depth=3, min_child_weight=3, missing=nan, monotone_constraints='()', n_estimators=170, n_jobs=-1, num_parallel_tree=1, objective='reg:squarederror', random_state=1234, reg_alpha=0.0005, reg_lambda=1e-07, scale_pos_weight=45.400000000000006, subsample=0.9, tree_method='auto', validate_parameters=1, verbosity=0), bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=1234, verbose=0, warm_start=False), BaggingRegressor(base_estimator=ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse', max_depth=3, max_features=1.0, max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.2, min_impurity_split=None, min_samples_leaf=5, min_samples_split=10, min_weight_fraction_leaf=0.0, n_estimators=120, n_jobs=-1, oob_score=False, random_state=1234, verbose=0, warm_start=False), bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=1234, verbose=0, warm_start=False), BaggingRegressor(base_estimator=RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse', max_depth=10, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0, min_impurity_split=None, min_samples_leaf=5, min_samples_split=7, min_weight_fraction_leaf=0.0, n_estimators=160, n_jobs=-1, oob_score=False, random_state=1234, verbose=0, warm_start=False), bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=1234, verbose=0, warm_start=False), BaggingRegressor(base_estimator=LGBMRegressor(bagging_fraction=0.9, bagging_freq=0, boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, feature_fraction=1.0, importance_type='split', learning_rate=0.3, max_depth=-1, min_child_samples=61, min_child_weight=0.001, min_split_gain=0.3, n_estimators=190, n_jobs=-1, num_leaves=20, objective=None, random_state=1234, reg_alpha=0.15, reg_lambda=0.0001, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0), bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=1234, verbose=0, warm_start=False)]
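# Bagging is only one ensembling route. The tuned models can also be
# combined by averaging their predictions or by stacking them under a
# meta-learner; a minimal sketch using PyCaret's built-ins:
blender = blend_models(estimator_list = tuned_top_models)   # prediction averaging
stacker = stack_models(estimator_list = tuned_top_models)   # meta-model stacking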
# select the best model trained in this session for each optimization metric
best1 = automl(optimize = 'R2')
best2 = automl(optimize = 'MAE')
best3 = automl(optimize = 'MSE')
best4 = automl(optimize = 'RMSE')
best5 = automl(optimize = 'MAPE')
print(); print("Best model based on R2: "); print(best1)
print(); print("Best model based on MAE: "); print(best2)
print(); print("Best model based on MSE: "); print(best3)
print(); print("Best model based on RMSE: "); print(best4)
print(); print("Best model based on MAPE: "); print(best5)
Best model based on R2:
<catboost.core.CatBoostRegressor object at 0x7f3448518990>

Best model based on MAE:
<catboost.core.CatBoostRegressor object at 0x7f3458b28450>

Best model based on MSE:
<catboost.core.CatBoostRegressor object at 0x7f3448505510>

Best model based on RMSE:
<catboost.core.CatBoostRegressor object at 0x7f344848e510>

Best model based on MAPE:
<catboost.core.CatBoostRegressor object at 0x7f3458c18e50>
plot_model(best2, plot = 'residuals')
plot_model(best2, plot = 'error')
plot_model(best2, plot = 'learning')
save_model(best2,'Final_Model')
Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=True, features_todrop=[], id_columns=[], ml_usecase='regression', numerical_features=[], target='Price', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='not_available', fill_value_categorical=None, fill_value_numerical=None, numeric_strategy='... ('binn', 'passthrough'), ('rem_outliers', 'passthrough'), ('cluster_all', 'passthrough'), ('dummy', Dummify(target='Price')), ('fix_perfect', Remove_100(target='Price')), ('clean_names', Clean_Colum_Names()), ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'), ('dfs', 'passthrough'), ('pca', 'passthrough'), ['trained_model', <catboost.core.CatBoostRegressor object at 0x7f3458b28450>]], verbose=False), 'Final_Model.pkl')
load_saved_model = load_model('Final_Model')
new_prediction = predict_model(load_saved_model, data=data_unseen)
new_prediction.head()
Transformation Pipeline and Model Successfully Loaded
| | Carat Weight | Cut | Color | Clarity | Polish | Symmetry | Report | Price | Label |
|---|---|---|---|---|---|---|---|---|---|
0 | 0.85 | Ideal | H | SI1 | EX | EX | GIA | 3183 | 3348.100985 |
1 | 0.83 | Ideal | G | SI1 | EX | EX | GIA | 3171 | 3451.107849 |
2 | 1.53 | Ideal | E | SI1 | ID | ID | AGSL | 12791 | 12440.743537 |
3 | 1.00 | Very Good | D | SI1 | VG | G | GIA | 5747 | 5505.137881 |
4 | 0.91 | Ideal | D | VS2 | VG | VG | GIA | 6224 | 5420.743746 |
In this coding recipe, we walked through an end-to-end regression workflow in Python using PyCaret: loading the diamond dataset, setting up the experiment, comparing and tuning models, ensembling the top performers, evaluating on unseen data, and saving the final pipeline for deployment.