ANOVA F-value For Feature Selection

If the features are categorical, calculate a chi-square statistic between each feature and the target vector. However, if the features are quantitative, compute the ANOVA F-value between each feature and the target vector. The F-value scores examine whether, when we group a numerical feature by the target vector, the means …

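The quantitative case described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `SelectKBest` with `f_classif` as the scoring function; the iris dataset and `k=2` are illustrative choices and do not come from the excerpt.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Load a dataset with quantitative features and a categorical target
# (iris is an assumption made for this sketch)
iris = load_iris()
X, y = iris.data, iris.target

# Score every feature with the ANOVA F-value against the target,
# then keep the k features with the highest scores (k=2 is arbitrary here)
selector = SelectKBest(score_func=f_classif, k=2)
X_kbest = selector.fit_transform(X, y)

print("Original number of features:", X.shape[1])
print("Reduced number of features:", X_kbest.shape[1])
print("F-values per feature:", selector.scores_)
```

For categorical features, the same selector could be given `chi2` from `sklearn.feature_selection` as its scoring function instead.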
Selecting The Best Number Of Components For TSVD

Preliminaries

    # Load libraries
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import TruncatedSVD
    from scipy.sparse import csr_matrix
    from sklearn import datasets
    import numpy as np

Load Digits Data And Make Sparse

    # Load the data
    digits = datasets.load_digits()

    # Standardize the feature matrix
    X = …

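The excerpt above is cut off before it reaches the selection step, so the following is only one plausible sketch, not necessarily the approach the post takes. It assumes the goal is to keep the smallest number of components whose cumulative explained variance reaches a target; the helper `select_n_components` and the 0.95 threshold are hypothetical names and values introduced here.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

# Load the digits data, standardize it, and store it as a sparse matrix
digits = datasets.load_digits()
X = StandardScaler().fit_transform(digits.data)
X_sparse = csr_matrix(X)

# Fit a TSVD with one fewer component than there are features
tsvd = TruncatedSVD(n_components=X_sparse.shape[1] - 1, random_state=0)
tsvd.fit(X_sparse)


def select_n_components(var_ratio, goal_var):
    """Hypothetical helper: smallest number of components whose
    cumulative explained variance ratio reaches goal_var."""
    cumulative = np.cumsum(var_ratio)
    reached = np.nonzero(cumulative >= goal_var)[0]
    # Fall back to all components if the goal is never reached
    return int(reached[0]) + 1 if reached.size else len(var_ratio)


# 0.95 is an arbitrary target chosen for illustration
n_components = select_n_components(tsvd.explained_variance_ratio_, 0.95)
print("Components needed for 95% of the variance:", n_components)
```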
Group Observations Using K-Means Clustering

Preliminaries

    # Load libraries
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    import pandas as pd

Create Data

    # Make simulated feature matrix
    X, _ = make_blobs(n_samples = 50,
                      n_features = 2,
                      centers = 3,
                      random_state = 1)

    # Create DataFrame
    df = pd.DataFrame(X, columns=['feature_1', 'feature_2'])

Train Clusterer …

Dimensionality Reduction With PCA

Preliminaries

    # Load libraries
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn import datasets

Load Data

    # Load the data
    digits = datasets.load_digits()

Standardize Feature Values

    # Standardize the feature matrix
    X = StandardScaler().fit_transform(digits.data)

Conduct Principal Component Analysis

    # Create a PCA that will retain 99% …

Dimensionality Reduction With Kernel PCA

Preliminaries

    # Load libraries
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.datasets import make_circles

Create Linearly Inseparable Data

    # Create linearly inseparable data
    X, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

Conduct Kernel PCA

    # Apply kernel PCA with radial basis function (RBF) kernel
    kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1)
    …

Dimensionality Reduction On Sparse Feature Matrix

Preliminaries

    # Load libraries
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import TruncatedSVD
    from scipy.sparse import csr_matrix
    from sklearn import datasets
    import numpy as np

Load Digits Data And Make Sparse

    # Load the data
    digits = datasets.load_digits()

    # Standardize the feature matrix
    X = StandardScaler().fit_transform(digits.data)

    # …

Select Date And Time Ranges

Preliminaries

    # Load library
    import pandas as pd

Create pandas Series Time Data

    # Create data frame
    df = pd.DataFrame()

    # Create datetimes
    df['date'] = pd.date_range('1/1/2001', periods=100000, freq='H')

Select Time Range (Method 1)

Use this method if your data frame is not indexed by time.

    # Select …

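Since the excerpt stops before the selection code, here is a hedged sketch of how a time range might be selected in both situations. The boundary timestamps are made up for illustration, and the frequency is written as `"60min"` (equivalent to hourly) because the spelling of the hour alias differs between pandas versions.

```python
import pandas as pd

# Create a data frame with an hourly datetime column
df = pd.DataFrame()
df["date"] = pd.date_range("1/1/2001", periods=100000, freq="60min")

# Hypothetical boundaries chosen for this sketch
start = pd.Timestamp("2002-01-01 01:00:00")
end = pd.Timestamp("2002-01-01 04:00:00")

# Data frame not indexed by time: build a boolean mask on the column
# (start is excluded and end is included because of the > and <= operators)
selected = df[(df["date"] > start) & (df["date"] <= end)]
print(selected)

# Data frame indexed by time: slice the DatetimeIndex with .loc
# (label slicing includes both endpoints)
indexed = df.set_index("date")
print(indexed.loc[start:end])
```

The mask form leaves the frame as it is, while setting the index once tends to be more convenient when many range selections follow.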
Rolling Time Window

Preliminaries

    import pandas as pd

Create Date Data

    time_index = pd.date_range('01/01/2010', periods=5, freq='M')
    df = pd.DataFrame(index=time_index)
    df['Stock_Price'] = [1,2,3,4,5]

Create A Rolling Time Window Of Two Rows

    df.rolling(window=2).mean()

                Stock_Price
    2010-01-31          NaN
    2010-02-28          1.5
    2010-03-31          2.5
    2010-04-30          3.5
    2010-05-31          4.5

    # Identify max value in rolling time window
    df.rolling(window=2).max()

                Stock_Price
    2010-01-31          NaN
    …

Lag A Time Feature

Preliminaries

    import pandas as pd

Create Date Data

    df = pd.DataFrame()
    df['dates'] = pd.date_range('1/1/2001', periods=5, freq='D')
    df['stock_price'] = [1.1,2.2,3.3,4.4,5.5]

Lag Time Data By One Row

    df['previous_days_stock_price'] = df['stock_price'].shift(1)
    df

           dates  stock_price  previous_days_stock_price
    0 2001-01-01          1.1                        NaN
    1 2001-01-02          2.2                        1.1
    2 2001-01-03          3.3                        2.2
    3 2001-01-04          4.4                        3.3
    4 2001-01-05          5.5  …

Handling Missing Values In Time Series

Preliminaries

    import pandas as pd
    import numpy as np

Create Date Data With Gap In Values

    time_index = pd.date_range('01/01/2010', periods=5, freq='M')
    df = pd.DataFrame(index=time_index)
    df['Sales'] = [1.0,2.0,np.nan,np.nan,5.0]

Interpolate Missing Values

    df.interpolate()

                Sales
    2010-01-31    1.0
    2010-02-28    2.0
    2010-03-31    3.0
    2010-04-30    4.0
    2010-05-31    5.0

Forward-fill Missing Values

    df.ffill()

                Sales
    2010-01-31    1.0
    …