For more projects visit: https://setscholars.net
# Suppress warnings in Jupyter Notebooks
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from pycaret.classification import *
# provide the dataset name as shown in pycaret
whichDataset = 'electrical_grid'
from pycaret.datasets import get_data
dataset = get_data(whichDataset)
tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stabf | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.959060 | 3.079885 | 8.381025 | 9.780754 | 3.763085 | -0.782604 | -1.257395 | -1.723086 | 0.650456 | 0.859578 | 0.887445 | 0.958034 | unstable |
1 | 9.304097 | 4.902524 | 3.047541 | 1.369357 | 5.067812 | -1.940058 | -1.872742 | -1.255012 | 0.413441 | 0.862414 | 0.562139 | 0.781760 | stable |
2 | 8.971707 | 8.848428 | 3.046479 | 1.214518 | 3.405158 | -1.207456 | -1.277210 | -0.920492 | 0.163041 | 0.766689 | 0.839444 | 0.109853 | unstable |
3 | 0.716415 | 7.669600 | 4.486641 | 2.340563 | 3.963791 | -1.027473 | -1.938944 | -0.997374 | 0.446209 | 0.976744 | 0.929381 | 0.362718 | unstable |
4 | 3.134112 | 7.608772 | 4.943759 | 9.857573 | 3.525811 | -1.125531 | -1.845975 | -0.554305 | 0.797110 | 0.455450 | 0.656947 | 0.820923 | unstable |
dataset.shape
(10000, 13)
data = dataset.sample(frac=0.75, random_state=421)
data_unseen = dataset.drop(data.index)
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (7500, 13)
Unseen Data For Predictions: (2500, 13)
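The 75/25 hold-out logic above generalizes to any DataFrame: sample a fraction for modeling, then drop those rows to obtain the unseen set. A minimal sketch with a toy frame (the toy data and sizes are illustrative):

```python
import pandas as pd

# Toy stand-in for the electrical_grid dataset
toy = pd.DataFrame({"x": range(100), "y": [i % 2 for i in range(100)]})

# Sample 75% for modeling; the complement is held back for unseen predictions
model_part = toy.sample(frac=0.75, random_state=421)
unseen_part = toy.drop(model_part.index)

model_part.reset_index(inplace=True, drop=True)
unseen_part.reset_index(inplace=True, drop=True)

print(model_part.shape, unseen_part.shape)  # (75, 2) (25, 2)
```

Dropping by index before resetting it is what guarantees the two parts are disjoint.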
env_setup = setup(data = data, target = 'stabf', session_id=412)
Description | Value | |
---|---|---|
0 | session_id | 412 |
1 | Target | stabf |
2 | Target Type | Binary |
3 | Label Encoded | stable: 0, unstable: 1 |
4 | Original Data | (7500, 13) |
5 | Missing Values | False |
6 | Numeric Features | 12 |
7 | Categorical Features | 0 |
8 | Ordinal Features | False |
9 | High Cardinality Features | False |
10 | High Cardinality Method | None |
11 | Transformed Train Set | (5249, 12) |
12 | Transformed Test Set | (2251, 12) |
13 | Shuffle Train-Test | True |
14 | Stratify Train-Test | False |
15 | Fold Generator | StratifiedKFold |
16 | Fold Number | 10 |
17 | CPU Jobs | -1 |
18 | Use GPU | False |
19 | Log Experiment | False |
20 | Experiment Name | clf-default-name |
21 | USI | 23e5 |
22 | Imputation Type | simple |
23 | Iterative Imputation Iteration | None |
24 | Numeric Imputer | mean |
25 | Iterative Imputation Numeric Model | None |
26 | Categorical Imputer | constant |
27 | Iterative Imputation Categorical Model | None |
28 | Unknown Categoricals Handling | least_frequent |
29 | Normalize | False |
30 | Normalize Method | None |
31 | Transformation | False |
32 | Transformation Method | None |
33 | PCA | False |
34 | PCA Method | None |
35 | PCA Components | None |
36 | Ignore Low Variance | False |
37 | Combine Rare Levels | False |
38 | Rare Level Threshold | None |
39 | Numeric Binning | False |
40 | Remove Outliers | False |
41 | Outliers Threshold | None |
42 | Remove Multicollinearity | False |
43 | Multicollinearity Threshold | None |
44 | Remove Perfect Collinearity | True |
45 | Clustering | False |
46 | Clustering Iteration | None |
47 | Polynomial Features | False |
48 | Polynomial Degree | None |
49 | Trignometry Features | False |
50 | Polynomial Threshold | None |
51 | Group Features | False |
52 | Feature Selection | False |
53 | Feature Selection Method | classic |
54 | Features Selection Threshold | None |
55 | Feature Interaction | False |
56 | Feature Ratio | False |
57 | Interaction Threshold | None |
58 | Fix Imbalance | False |
59 | Fix Imbalance Method | SMOTE |
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter('ignore')
# --------------------------------------
best_model = compare_models()
# --------------------------------------
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) | |
---|---|---|---|---|---|---|---|---|---|
catboost | CatBoost Classifier | 0.9478 | 0.9902 | 0.9698 | 0.9494 | 0.9595 | 0.8862 | 0.8868 | 3.0390 |
xgboost | Extreme Gradient Boosting | 0.9327 | 0.9857 | 0.9540 | 0.9414 | 0.9476 | 0.8538 | 0.8541 | 0.5130 |
lightgbm | Light Gradient Boosting Machine | 0.9314 | 0.9848 | 0.9545 | 0.9389 | 0.9466 | 0.8508 | 0.8513 | 0.0900 |
gbc | Gradient Boosting Classifier | 0.9146 | 0.9752 | 0.9596 | 0.9112 | 0.9348 | 0.8116 | 0.8141 | 0.6340 |
et | Extra Trees Classifier | 0.9133 | 0.9782 | 0.9713 | 0.9006 | 0.9346 | 0.8068 | 0.8121 | 0.3510 |
rf | Random Forest Classifier | 0.9106 | 0.9746 | 0.9531 | 0.9110 | 0.9315 | 0.8033 | 0.8052 | 0.5440 |
ada | Ada Boost Classifier | 0.8564 | 0.9326 | 0.9031 | 0.8756 | 0.8890 | 0.6856 | 0.6870 | 0.1690 |
nb | Naive Bayes | 0.8421 | 0.9187 | 0.9297 | 0.8399 | 0.8824 | 0.6437 | 0.6520 | 0.0080 |
dt | Decision Tree Classifier | 0.8295 | 0.8141 | 0.8702 | 0.8633 | 0.8667 | 0.6302 | 0.6305 | 0.0240 |
lr | Logistic Regression | 0.8190 | 0.8945 | 0.8804 | 0.8429 | 0.8611 | 0.6017 | 0.6034 | 0.2320 |
ridge | Ridge Classifier | 0.8186 | 0.0000 | 0.8813 | 0.8419 | 0.8609 | 0.6005 | 0.6023 | 0.0080 |
lda | Linear Discriminant Analysis | 0.8186 | 0.8944 | 0.8762 | 0.8451 | 0.8602 | 0.6021 | 0.6034 | 0.0100 |
svm | SVM - Linear Kernel | 0.8041 | 0.0000 | 0.8690 | 0.8389 | 0.8488 | 0.5670 | 0.5838 | 0.0270 |
qda | Quadratic Discriminant Analysis | 0.7912 | 0.9490 | 0.9162 | 0.8124 | 0.8434 | 0.5122 | 0.5743 | 0.0080 |
knn | K Neighbors Classifier | 0.7773 | 0.8278 | 0.8621 | 0.8030 | 0.8314 | 0.5045 | 0.5077 | 0.0740 |
dummy | Dummy Classifier | 0.6371 | 0.5000 | 1.0000 | 0.6371 | 0.7783 | 0.0000 | 0.0000 | 0.0050 |
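The Dummy Classifier row is a useful sanity floor: it predicts the majority class ("unstable") for every sample, so its Accuracy equals the class prior and its Recall on that class is 1.0. A pure-Python sketch with hypothetical label counts:

```python
from collections import Counter

# Hypothetical label column: majority class appears 637 times out of 1000
labels = ["unstable"] * 637 + ["stable"] * 363

majority = Counter(labels).most_common(1)[0][0]
preds = [majority] * len(labels)  # dummy strategy: always predict the majority

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
recall = sum(p == y for p, y in zip(preds, labels) if y == majority) / labels.count(majority)
print(majority, accuracy, recall)  # unstable 0.637 1.0
```

Any model worth keeping should clear this floor, which every non-dummy row above does.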
xgboost = create_model('xgboost')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.9257 | 0.9858 | 0.9493 | 0.9353 | 0.9422 | 0.8382 | 0.8384 |
1 | 0.9448 | 0.9877 | 0.9642 | 0.9500 | 0.9570 | 0.8797 | 0.8799 |
2 | 0.9238 | 0.9811 | 0.9403 | 0.9403 | 0.9403 | 0.8350 | 0.8350 |
3 | 0.9295 | 0.9852 | 0.9493 | 0.9408 | 0.9450 | 0.8469 | 0.8469 |
4 | 0.9352 | 0.9857 | 0.9521 | 0.9464 | 0.9493 | 0.8598 | 0.8598 |
5 | 0.9219 | 0.9812 | 0.9491 | 0.9296 | 0.9393 | 0.8300 | 0.8303 |
6 | 0.9505 | 0.9884 | 0.9671 | 0.9556 | 0.9613 | 0.8925 | 0.8927 |
7 | 0.9371 | 0.9904 | 0.9581 | 0.9440 | 0.9510 | 0.8634 | 0.8636 |
8 | 0.9219 | 0.9870 | 0.9611 | 0.9198 | 0.9400 | 0.8284 | 0.8300 |
9 | 0.9370 | 0.9844 | 0.9491 | 0.9520 | 0.9505 | 0.8639 | 0.8639 |
Mean | 0.9327 | 0.9857 | 0.9540 | 0.9414 | 0.9476 | 0.8538 | 0.8541 |
SD | 0.0094 | 0.0028 | 0.0079 | 0.0103 | 0.0072 | 0.0206 | 0.0204 |
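The Kappa and MCC columns in the fold table are both derived from the confusion-matrix counts of each fold. A minimal sketch of the two formulas for a binary confusion matrix (the counts below are hypothetical, chosen to land near the table's values):

```python
import math

# Hypothetical binary confusion-matrix counts (positive class = 'unstable')
tp, fp, fn, tn = 640, 40, 30, 340
n = tp + fp + fn + tn

# Cohen's kappa: observed agreement corrected for chance agreement
po = (tp + tn) / n
pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
kappa = (po - pe) / (1 - pe)

# Matthews correlation coefficient
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(round(kappa, 4), round(mcc, 4))
```

Both metrics stay near each other for reasonably balanced confusion matrices, which matches the near-identical Kappa and MCC columns above.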
tuned_xgboost = tune_model(xgboost)
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.8914 | 0.9814 | 0.9821 | 0.8658 | 0.9203 | 0.7522 | 0.7670 |
1 | 0.8838 | 0.9812 | 0.9970 | 0.8477 | 0.9163 | 0.7303 | 0.7565 |
2 | 0.8895 | 0.9768 | 0.9851 | 0.8616 | 0.9192 | 0.7470 | 0.7639 |
3 | 0.8914 | 0.9779 | 0.9731 | 0.8717 | 0.9196 | 0.7540 | 0.7649 |
4 | 0.8895 | 0.9760 | 0.9790 | 0.8651 | 0.9185 | 0.7490 | 0.7629 |
5 | 0.8971 | 0.9786 | 0.9790 | 0.8743 | 0.9237 | 0.7674 | 0.7790 |
6 | 0.9029 | 0.9802 | 0.9880 | 0.8753 | 0.9283 | 0.7795 | 0.7933 |
7 | 0.9124 | 0.9800 | 0.9880 | 0.8871 | 0.9348 | 0.8023 | 0.8132 |
8 | 0.8914 | 0.9796 | 0.9850 | 0.8635 | 0.9203 | 0.7524 | 0.7686 |
9 | 0.8950 | 0.9698 | 0.9760 | 0.8740 | 0.9222 | 0.7624 | 0.7735 |
Mean | 0.8945 | 0.9782 | 0.9833 | 0.8686 | 0.9223 | 0.7596 | 0.7743 |
SD | 0.0077 | 0.0033 | 0.0066 | 0.0099 | 0.0052 | 0.0189 | 0.0162 |
print(tuned_xgboost)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.7, gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.2, max_delta_step=0, max_depth=11, min_child_weight=2, missing=nan, monotone_constraints='()', n_estimators=260, n_jobs=-1, num_parallel_tree=1, objective='binary:logistic', random_state=412, reg_alpha=0.01, reg_lambda=3, scale_pos_weight=42.7, subsample=0.2, tree_method='auto', validate_parameters=1, verbosity=0)
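By default `tune_model` samples hyperparameters from PyCaret's built-in random grid, which is how it arrived at the values printed above. A custom search space can be supplied through the `custom_grid` argument; the ranges below are illustrative, not PyCaret's defaults:

```python
# Illustrative hyperparameter grid; every value must be given as a list,
# e.g. tune_model(xgboost, custom_grid=xgb_grid)
xgb_grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [6, 8, 11],
    "n_estimators": [100, 180, 260],
    "subsample": [0.2, 0.6, 1.0],
}
```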
plot_model(tuned_xgboost, plot = 'auc')
plot_model(tuned_xgboost, plot = 'pr')
plot_model(tuned_xgboost, plot='feature')
plot_model(tuned_xgboost, plot = 'confusion_matrix')
plot_model(tuned_xgboost, plot = 'learning')
plot_model(tuned_xgboost, plot = 'threshold')
gbc = create_model('gbc')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.9143 | 0.9734 | 0.9642 | 0.9073 | 0.9349 | 0.8099 | 0.8131 |
1 | 0.9200 | 0.9792 | 0.9731 | 0.9081 | 0.9395 | 0.8219 | 0.8262 |
2 | 0.9105 | 0.9737 | 0.9463 | 0.9162 | 0.9310 | 0.8037 | 0.8046 |
3 | 0.9010 | 0.9776 | 0.9522 | 0.8986 | 0.9246 | 0.7805 | 0.7833 |
4 | 0.9238 | 0.9770 | 0.9701 | 0.9153 | 0.9419 | 0.8316 | 0.8346 |
5 | 0.9238 | 0.9744 | 0.9581 | 0.9249 | 0.9412 | 0.8332 | 0.8342 |
6 | 0.9200 | 0.9756 | 0.9581 | 0.9195 | 0.9384 | 0.8244 | 0.8258 |
7 | 0.9257 | 0.9814 | 0.9641 | 0.9226 | 0.9429 | 0.8368 | 0.8384 |
8 | 0.9048 | 0.9767 | 0.9641 | 0.8944 | 0.9280 | 0.7881 | 0.7929 |
9 | 0.9027 | 0.9627 | 0.9461 | 0.9054 | 0.9253 | 0.7858 | 0.7874 |
Mean | 0.9146 | 0.9752 | 0.9596 | 0.9112 | 0.9348 | 0.8116 | 0.8141 |
SD | 0.0089 | 0.0048 | 0.0088 | 0.0096 | 0.0067 | 0.0200 | 0.0198 |
tuned_gbc = tune_model(gbc)
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.9276 | 0.9795 | 0.9672 | 0.9231 | 0.9446 | 0.8404 | 0.8423 |
1 | 0.9276 | 0.9850 | 0.9612 | 0.9280 | 0.9443 | 0.8411 | 0.8422 |
2 | 0.9257 | 0.9757 | 0.9373 | 0.9458 | 0.9415 | 0.8397 | 0.8398 |
3 | 0.9162 | 0.9787 | 0.9373 | 0.9318 | 0.9345 | 0.8181 | 0.8182 |
4 | 0.9410 | 0.9842 | 0.9641 | 0.9443 | 0.9541 | 0.8714 | 0.8718 |
5 | 0.9143 | 0.9779 | 0.9431 | 0.9238 | 0.9333 | 0.8134 | 0.8137 |
6 | 0.9181 | 0.9812 | 0.9491 | 0.9242 | 0.9365 | 0.8213 | 0.8218 |
7 | 0.9333 | 0.9880 | 0.9641 | 0.9333 | 0.9485 | 0.8542 | 0.8551 |
8 | 0.9105 | 0.9808 | 0.9581 | 0.9065 | 0.9316 | 0.8024 | 0.8050 |
9 | 0.9065 | 0.9715 | 0.9311 | 0.9228 | 0.9270 | 0.7970 | 0.7971 |
Mean | 0.9221 | 0.9802 | 0.9513 | 0.9283 | 0.9396 | 0.8299 | 0.8307 |
SD | 0.0102 | 0.0045 | 0.0126 | 0.0108 | 0.0080 | 0.0223 | 0.0222 |
print(tuned_gbc)
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None, learning_rate=0.2, loss='deviance', max_depth=10, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.002, min_impurity_split=None, min_samples_leaf=4, min_samples_split=5, min_weight_fraction_leaf=0.0, n_estimators=110, n_iter_no_change=None, presort='deprecated', random_state=412, subsample=0.6, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False)
plot_model(tuned_gbc, plot = 'auc')
plot_model(tuned_gbc, plot = 'pr')
plot_model(tuned_gbc, plot = 'feature')
plot_model(tuned_gbc, plot = 'confusion_matrix')
plot_model(tuned_gbc, plot = 'learning')
plot_model(tuned_gbc, plot = 'threshold')
predict_model(tuned_xgboost);
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|---|
0 | Extreme Gradient Boosting | 0.8969 | 0.9790 | 0.9815 | 0.8745 | 0.9249 | 0.7623 | 0.7753 |
predict_model(tuned_gbc);
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|---|
0 | Gradient Boosting Classifier | 0.9285 | 0.9828 | 0.9588 | 0.9325 | 0.9455 | 0.8416 | 0.8423 |
final_gbc = finalize_model(tuned_gbc)
# Final model parameters for deployment
print(final_gbc)
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None, learning_rate=0.2, loss='deviance', max_depth=10, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.002, min_impurity_split=None, min_samples_leaf=4, min_samples_split=5, min_weight_fraction_leaf=0.0, n_estimators=110, n_iter_no_change=None, presort='deprecated', random_state=412, subsample=0.6, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False)
# Note: finalize_model refits the pipeline on the full modeling data (train + test),
# so the perfect scores below are in-sample, not a generalization estimate
predict_model(final_gbc);
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|---|
0 | Gradient Boosting Classifier | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
unseen_predictions = predict_model(final_gbc, data=data_unseen)
unseen_predictions.head()
tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stabf | Label | Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8.971707 | 8.848428 | 3.046479 | 1.214518 | 3.405158 | -1.207456 | -1.277210 | -0.920492 | 0.163041 | 0.766689 | 0.839444 | 0.109853 | unstable | unstable | 0.9069 |
1 | 5.930110 | 6.730873 | 6.245138 | 0.533288 | 2.327092 | -0.702501 | -1.116920 | -0.507671 | 0.239816 | 0.563110 | 0.164461 | 0.753701 | stable | stable | 0.9988 |
2 | 5.381299 | 8.014521 | 8.095174 | 6.769248 | 5.507551 | -1.972714 | -1.849333 | -1.685505 | 0.359974 | 0.173569 | 0.349144 | 0.628860 | unstable | unstable | 0.9984 |
3 | 1.616787 | 2.939228 | 0.819791 | 4.191804 | 3.752282 | -1.484885 | -1.280581 | -0.986816 | 0.899698 | 0.866546 | 0.303921 | 0.077610 | stable | stable | 0.9974 |
4 | 4.142830 | 2.439089 | 1.290456 | 9.456443 | 3.934796 | -1.469299 | -1.766941 | -0.698556 | 0.800757 | 0.840807 | 0.917833 | 0.793982 | stable | unstable | 0.7840 |
from pycaret.utils import check_metric
check_metric(unseen_predictions['stabf'], unseen_predictions['Label'], metric = 'Accuracy')
0.9372
check_metric(unseen_predictions['stabf'], unseen_predictions['Label'], metric = 'Recall')
0.9614
check_metric(unseen_predictions['stabf'], unseen_predictions['Label'], metric = 'Precision')
0.9406
check_metric(unseen_predictions['stabf'], unseen_predictions['Label'], metric = 'AUC')
0.9285
check_metric(unseen_predictions['stabf'], unseen_predictions['Label'], metric = 'F1')
0.9509
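The `check_metric` calls above compare the true `stabf` column against the predicted `Label` column. Accuracy, for instance, is just the fraction of matching rows, which can be verified directly in pandas (toy columns shown):

```python
import pandas as pd

# Toy stand-in for unseen_predictions[['stabf', 'Label']]
preds = pd.DataFrame({
    "stabf": ["unstable", "stable", "unstable", "stable", "stable"],
    "Label": ["unstable", "stable", "unstable", "stable", "unstable"],
})

accuracy = (preds["stabf"] == preds["Label"]).mean()
print(round(accuracy, 4))  # 0.8
```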
save_model(final_gbc,'Final_Model')
Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=True, features_todrop=[], id_columns=[], ml_usecase='classification', numerical_features=[], target='stabf', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='not_available', fill_value_categorical=None, fill_value_numerical=None, numeric_strate... learning_rate=0.2, loss='deviance', max_depth=10, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.002, min_impurity_split=None, min_samples_leaf=4, min_samples_split=5, min_weight_fraction_leaf=0.0, n_estimators=110, n_iter_no_change=None, presort='deprecated', random_state=412, subsample=0.6, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False)]], verbose=False), 'Final_Model.pkl')
load_saved_model = load_model('Final_Model')
Transformation Pipeline and Model Successfully Loaded
new_prediction = predict_model(load_saved_model, data=data_unseen)
new_prediction[["Label", "Score"]].head(10)
Label | Score | |
---|---|---|
0 | unstable | 0.9069 |
1 | stable | 0.9988 |
2 | unstable | 0.9984 |
3 | stable | 0.9974 |
4 | unstable | 0.7840 |
5 | stable | 0.9978 |
6 | stable | 0.9852 |
7 | stable | 0.9998 |
8 | stable | 0.9996 |
9 | unstable | 0.9891 |
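`save_model` and `load_model` round-trip the whole preprocessing pipeline plus model through `Final_Model.pkl` (the `.pkl` extension is appended for you). Under the hood this is standard Python pickling, sketched here with a stand-in object rather than the real pipeline:

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline object
pipeline_stub = {"model": "GradientBoostingClassifier", "random_state": 412}

path = os.path.join(tempfile.gettempdir(), "Final_Model_demo.pkl")
with open(path, "wb") as f:
    pickle.dump(pipeline_stub, f)   # roughly what save_model does

with open(path, "rb") as f:
    restored = pickle.load(f)       # roughly what load_model does

print(restored == pipeline_stub)  # True
os.remove(path)
```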
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from pycaret.classification import *
# provide the dataset name as shown in pycaret
whichDataset = 'electrical_grid'
from pycaret.datasets import get_data
dataset = get_data(whichDataset)
data = dataset.sample(frac=0.75, random_state=421)
data_unseen = dataset.drop(data.index)
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stabf | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.959060 | 3.079885 | 8.381025 | 9.780754 | 3.763085 | -0.782604 | -1.257395 | -1.723086 | 0.650456 | 0.859578 | 0.887445 | 0.958034 | unstable |
1 | 9.304097 | 4.902524 | 3.047541 | 1.369357 | 5.067812 | -1.940058 | -1.872742 | -1.255012 | 0.413441 | 0.862414 | 0.562139 | 0.781760 | stable |
2 | 8.971707 | 8.848428 | 3.046479 | 1.214518 | 3.405158 | -1.207456 | -1.277210 | -0.920492 | 0.163041 | 0.766689 | 0.839444 | 0.109853 | unstable |
3 | 0.716415 | 7.669600 | 4.486641 | 2.340563 | 3.963791 | -1.027473 | -1.938944 | -0.997374 | 0.446209 | 0.976744 | 0.929381 | 0.362718 | unstable |
4 | 3.134112 | 7.608772 | 4.943759 | 9.857573 | 3.525811 | -1.125531 | -1.845975 | -0.554305 | 0.797110 | 0.455450 | 0.656947 | 0.820923 | unstable |
Data for Modeling: (7500, 13)
Unseen Data For Predictions: (2500, 13)
clf = setup(data = data, target = 'stabf', session_id=412)
Description | Value | |
---|---|---|
0 | session_id | 412 |
1 | Target | stabf |
2 | Target Type | Binary |
3 | Label Encoded | stable: 0, unstable: 1 |
4 | Original Data | (7500, 13) |
5 | Missing Values | False |
6 | Numeric Features | 12 |
7 | Categorical Features | 0 |
8 | Ordinal Features | False |
9 | High Cardinality Features | False |
10 | High Cardinality Method | None |
11 | Transformed Train Set | (5249, 12) |
12 | Transformed Test Set | (2251, 12) |
13 | Shuffle Train-Test | True |
14 | Stratify Train-Test | False |
15 | Fold Generator | StratifiedKFold |
16 | Fold Number | 10 |
17 | CPU Jobs | -1 |
18 | Use GPU | False |
19 | Log Experiment | False |
20 | Experiment Name | clf-default-name |
21 | USI | 809f |
22 | Imputation Type | simple |
23 | Iterative Imputation Iteration | None |
24 | Numeric Imputer | mean |
25 | Iterative Imputation Numeric Model | None |
26 | Categorical Imputer | constant |
27 | Iterative Imputation Categorical Model | None |
28 | Unknown Categoricals Handling | least_frequent |
29 | Normalize | False |
30 | Normalize Method | None |
31 | Transformation | False |
32 | Transformation Method | None |
33 | PCA | False |
34 | PCA Method | None |
35 | PCA Components | None |
36 | Ignore Low Variance | False |
37 | Combine Rare Levels | False |
38 | Rare Level Threshold | None |
39 | Numeric Binning | False |
40 | Remove Outliers | False |
41 | Outliers Threshold | None |
42 | Remove Multicollinearity | False |
43 | Multicollinearity Threshold | None |
44 | Remove Perfect Collinearity | True |
45 | Clustering | False |
46 | Clustering Iteration | None |
47 | Polynomial Features | False |
48 | Polynomial Degree | None |
49 | Trignometry Features | False |
50 | Polynomial Threshold | None |
51 | Group Features | False |
52 | Feature Selection | False |
53 | Feature Selection Method | classic |
54 | Features Selection Threshold | None |
55 | Feature Interaction | False |
56 | Feature Ratio | False |
57 | Interaction Threshold | None |
58 | Fix Imbalance | False |
59 | Fix Imbalance Method | SMOTE |
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter('ignore')
# compare all baseline models and select top 5
top_models = compare_models(n_select = 5)
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) | |
---|---|---|---|---|---|---|---|---|---|
catboost | CatBoost Classifier | 0.9478 | 0.9902 | 0.9698 | 0.9494 | 0.9595 | 0.8862 | 0.8868 | 3.0810 |
xgboost | Extreme Gradient Boosting | 0.9327 | 0.9857 | 0.9540 | 0.9414 | 0.9476 | 0.8538 | 0.8541 | 0.5070 |
lightgbm | Light Gradient Boosting Machine | 0.9314 | 0.9848 | 0.9545 | 0.9389 | 0.9466 | 0.8508 | 0.8513 | 0.0890 |
gbc | Gradient Boosting Classifier | 0.9146 | 0.9752 | 0.9596 | 0.9112 | 0.9348 | 0.8116 | 0.8141 | 0.6330 |
et | Extra Trees Classifier | 0.9133 | 0.9782 | 0.9713 | 0.9006 | 0.9346 | 0.8068 | 0.8121 | 0.3400 |
rf | Random Forest Classifier | 0.9106 | 0.9746 | 0.9531 | 0.9110 | 0.9315 | 0.8033 | 0.8052 | 0.5390 |
ada | Ada Boost Classifier | 0.8564 | 0.9326 | 0.9031 | 0.8756 | 0.8890 | 0.6856 | 0.6870 | 0.1690 |
nb | Naive Bayes | 0.8421 | 0.9187 | 0.9297 | 0.8399 | 0.8824 | 0.6437 | 0.6520 | 0.0080 |
dt | Decision Tree Classifier | 0.8295 | 0.8141 | 0.8702 | 0.8633 | 0.8667 | 0.6302 | 0.6305 | 0.0240 |
lr | Logistic Regression | 0.8190 | 0.8945 | 0.8804 | 0.8429 | 0.8611 | 0.6017 | 0.6034 | 0.2370 |
ridge | Ridge Classifier | 0.8186 | 0.0000 | 0.8813 | 0.8419 | 0.8609 | 0.6005 | 0.6023 | 0.0070 |
lda | Linear Discriminant Analysis | 0.8186 | 0.8944 | 0.8762 | 0.8451 | 0.8602 | 0.6021 | 0.6034 | 0.0100 |
svm | SVM - Linear Kernel | 0.8041 | 0.0000 | 0.8690 | 0.8389 | 0.8488 | 0.5670 | 0.5838 | 0.0260 |
qda | Quadratic Discriminant Analysis | 0.7912 | 0.9490 | 0.9162 | 0.8124 | 0.8434 | 0.5122 | 0.5743 | 0.0080 |
knn | K Neighbors Classifier | 0.7773 | 0.8278 | 0.8621 | 0.8030 | 0.8314 | 0.5045 | 0.5077 | 0.0760 |
dummy | Dummy Classifier | 0.6371 | 0.5000 | 1.0000 | 0.6371 | 0.7783 | 0.0000 | 0.0000 | 0.0050 |
top_models
[<catboost.core.CatBoostClassifier at 0x7f7af01a5bd0>, XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.300000012, max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=-1, num_parallel_tree=1, objective='binary:logistic', random_state=412, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='auto', validate_parameters=1, verbosity=0), LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=-1, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31, objective=None, random_state=412, reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0), GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None, learning_rate=0.1, loss='deviance', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_iter_no_change=None, presort='deprecated', random_state=412, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False), ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=412, verbose=0, warm_start=False)]
# tune top base models
tuned_top_models = [tune_model(i) for i in top_models]
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.8914 | 0.9658 | 0.8866 | 0.9399 | 0.9124 | 0.7699 | 0.7722 |
1 | 0.8914 | 0.9680 | 0.9015 | 0.9264 | 0.9138 | 0.7673 | 0.7678 |
2 | 0.8800 | 0.9663 | 0.8627 | 0.9444 | 0.9017 | 0.7485 | 0.7536 |
3 | 0.8743 | 0.9599 | 0.8627 | 0.9353 | 0.8975 | 0.7356 | 0.7397 |
4 | 0.9067 | 0.9739 | 0.8922 | 0.9582 | 0.9240 | 0.8034 | 0.8069 |
5 | 0.8990 | 0.9689 | 0.8772 | 0.9607 | 0.9171 | 0.7888 | 0.7941 |
6 | 0.8838 | 0.9660 | 0.8623 | 0.9505 | 0.9042 | 0.7574 | 0.7632 |
7 | 0.8990 | 0.9737 | 0.8892 | 0.9489 | 0.9181 | 0.7869 | 0.7897 |
8 | 0.9048 | 0.9707 | 0.9192 | 0.9303 | 0.9247 | 0.7952 | 0.7953 |
9 | 0.8817 | 0.9611 | 0.8683 | 0.9416 | 0.9034 | 0.7514 | 0.7555 |
Mean | 0.8912 | 0.9674 | 0.8822 | 0.9436 | 0.9117 | 0.7704 | 0.7738 |
SD | 0.0105 | 0.0044 | 0.0181 | 0.0107 | 0.0091 | 0.0213 | 0.0207 |
tuned_top_models
[<catboost.core.CatBoostClassifier at 0x7f7af01a2690>, XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.7, gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.2, max_delta_step=0, max_depth=11, min_child_weight=2, missing=nan, monotone_constraints='()', n_estimators=260, n_jobs=-1, num_parallel_tree=1, objective='binary:logistic', random_state=412, reg_alpha=0.01, reg_lambda=3, scale_pos_weight=42.7, subsample=0.2, tree_method='auto', validate_parameters=1, verbosity=0), LGBMClassifier(bagging_fraction=0.8, bagging_freq=0, boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, feature_fraction=0.9, importance_type='split', learning_rate=0.05, max_depth=-1, min_child_samples=66, min_child_weight=0.001, min_split_gain=0, n_estimators=250, n_jobs=-1, num_leaves=20, objective=None, random_state=412, reg_alpha=0.0001, reg_lambda=0.15, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0), GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None, learning_rate=0.2, loss='deviance', max_depth=10, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.002, min_impurity_split=None, min_samples_leaf=4, min_samples_split=5, min_weight_fraction_leaf=0.0, n_estimators=110, n_iter_no_change=None, presort='deprecated', random_state=412, subsample=0.6, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False), ExtraTreesClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced', criterion='gini', max_depth=8, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.001, min_impurity_split=None, min_samples_leaf=5, min_samples_split=5, min_weight_fraction_leaf=0.0, n_estimators=270, n_jobs=-1, oob_score=False, random_state=412, verbose=0, warm_start=False)]
# ensemble top tuned models
bagged_top_models = [ensemble_model(i) for i in tuned_top_models]
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.9010 | 0.9687 | 0.9015 | 0.9408 | 0.9207 | 0.7889 | 0.7902 |
1 | 0.9010 | 0.9703 | 0.9164 | 0.9275 | 0.9219 | 0.7865 | 0.7866 |
2 | 0.8895 | 0.9696 | 0.8627 | 0.9601 | 0.9088 | 0.7697 | 0.7768 |
3 | 0.8857 | 0.9629 | 0.8776 | 0.9393 | 0.9074 | 0.7586 | 0.7616 |
4 | 0.9124 | 0.9736 | 0.9012 | 0.9586 | 0.9290 | 0.8149 | 0.8175 |
5 | 0.8990 | 0.9707 | 0.8862 | 0.9518 | 0.9178 | 0.7874 | 0.7908 |
6 | 0.8914 | 0.9690 | 0.8743 | 0.9511 | 0.9111 | 0.7723 | 0.7769 |
7 | 0.9048 | 0.9741 | 0.8982 | 0.9494 | 0.9231 | 0.7983 | 0.8004 |
8 | 0.9029 | 0.9682 | 0.9132 | 0.9327 | 0.9228 | 0.7918 | 0.7921 |
9 | 0.8779 | 0.9587 | 0.8473 | 0.9561 | 0.8984 | 0.7467 | 0.7553 |
Mean | 0.8965 | 0.9686 | 0.8879 | 0.9467 | 0.9161 | 0.7815 | 0.7848 |
SD | 0.0097 | 0.0044 | 0.0212 | 0.0106 | 0.0089 | 0.0189 | 0.0172 |
bagged_top_models
[BaggingClassifier(base_estimator=<catboost.core.CatBoostClassifier object at 0x7f7af010a110>, bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=412, verbose=0, warm_start=False), BaggingClassifier(base_estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.7, gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.2, max_delta_step=0, max_depth=11, min_child_weight=2, missing=nan, monotone_constraints='()', n_estimators=260, n_jobs=-1, num_parallel_tree=1, objective='binary:logistic', random_state=412, reg_alpha=0.01, reg_lambda=3, scale_pos_weight=42.7, subsample=0.2, tree_method='auto', validate_parameters=1, verbosity=0), bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=412, verbose=0, warm_start=False), BaggingClassifier(base_estimator=LGBMClassifier(bagging_fraction=0.8, bagging_freq=0, boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, feature_fraction=0.9, importance_type='split', learning_rate=0.05, max_depth=-1, min_child_samples=66, min_child_weight=0.001, min_split_gain=0, n_estimators=250, n_jobs=-1, num_leaves=20, objective=None, random_state=412, reg_alpha=0.0001, reg_lambda=0.15, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0), bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=412, verbose=0, warm_start=False), BaggingClassifier(base_estimator=GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None, learning_rate=0.2, loss='deviance', max_depth=10, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.002, min_impurity_split=None, min_samples_leaf=4, min_samples_split=5, min_weight_fraction_leaf=0.0, n_estimators=110, 
n_iter_no_change=None, presort='deprecated', random_state=412, subsample=0.6, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False), bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=412, verbose=0, warm_start=False), BaggingClassifier(base_estimator=ExtraTreesClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced', criterion='gini', max_depth=8, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.001, min_impurity_split=None, min_samples_leaf=5, min_samples_split=5, min_weight_fraction_leaf=0.0, n_estimators=270, n_jobs=-1, oob_score=False, random_state=412, verbose=0, warm_start=False), bootstrap=True, bootstrap_features=False, max_features=1.0, max_samples=1.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=412, verbose=0, warm_start=False)]
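As the output shows, `ensemble_model` wraps each tuned estimator in a `BaggingClassifier` with `n_estimators=10` by default: ten copies of the base model are fit on bootstrap resamples of the training data and their predictions are aggregated. The resampling step itself is simple; a stdlib sketch on toy row indices:

```python
import random

random.seed(412)
rows = list(range(8))  # toy training-row indices

# One bootstrap resample per base estimator: draw n rows with replacement
n_estimators = 10
resamples = [random.choices(rows, k=len(rows)) for _ in range(n_estimators)]

print(len(resamples), len(resamples[0]))  # 10 8
```

Each resample typically repeats some rows and omits others, which is what decorrelates the ten fitted copies.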
# select the best model of the whole session for several different metrics
# (note: optimizing Recall alone picks the Dummy Classifier, which predicts
# the majority class for every sample and so scores Recall = 1.0)
best1 = automl(optimize = 'AUC')
best2 = automl(optimize = 'Accuracy')
best3 = automl(optimize = 'Recall')
best4 = automl(optimize = 'Precision')
best5 = automl(optimize = 'F1')
print(); print("Best model based on AUC: "); print(best1)
print(); print("Best model based on Accuracy: "); print(best2)
print(); print("Best model based on Recall: "); print(best3)
print(); print("Best model based on Precision: "); print(best4)
print(); print("Best model based on F1: "); print(best5)
Best model based on AUC: <catboost.core.CatBoostClassifier object at 0x7f7ae8185cd0>
Best model based on Accuracy: <catboost.core.CatBoostClassifier object at 0x7f7ae81d9790>
Best model based on Recall: DummyClassifier(constant=None, random_state=412, strategy='prior')
Best model based on Precision: <catboost.core.CatBoostClassifier object at 0x7f7af002f990>
Best model based on F1: <catboost.core.CatBoostClassifier object at 0x7f7ae81a6e90>
plot_model(best1, plot = 'auc')
plot_model(best1, plot = 'confusion_matrix')
plot_model(best1, plot = 'learning')
save_model(best1,'Final_Model')
Transformation Pipeline and Model Successfully Saved
(Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=True, features_todrop=[], id_columns=[], ml_usecase='classification', numerical_features=[], target='stabf', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='not_available', fill_value_categorical=None, fill_value_numerical=None, numeric_strate... ('binn', 'passthrough'), ('rem_outliers', 'passthrough'), ('cluster_all', 'passthrough'), ('dummy', Dummify(target='stabf')), ('fix_perfect', Remove_100(target='stabf')), ('clean_names', Clean_Colum_Names()), ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'), ('dfs', 'passthrough'), ('pca', 'passthrough'), ['trained_model', <catboost.core.CatBoostClassifier object at 0x7f7ae8185cd0>]], verbose=False), 'Final_Model.pkl')
load_saved_model = load_model('Final_Model')
Transformation Pipeline and Model Successfully Loaded
new_prediction = predict_model(load_saved_model, data=data_unseen)
new_prediction[["Label", "Score"]].head()
Label | Score | |
---|---|---|
0 | unstable | 0.6424 |
1 | stable | 0.9968 |
2 | unstable | 0.9831 |
3 | stable | 0.9835 |
4 | unstable | 0.6787 |
In this coding recipe, we demonstrated an end-to-end classification workflow in Python using PyCaret: loading the electrical_grid dataset, comparing baseline models, tuning and ensembling the top performers, evaluating on unseen data, and saving the finalized pipeline for deployment.