Linear Regression
The Normal Equation gives a closed-form solution for the parameter vector that minimizes the MSE cost function:
$\hat{\theta} = (X^TX)^{-1}X^Ty$
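To see the Normal Equation in action, here is a minimal sketch that builds a hypothetical synthetic dataset and solves for $\hat{\theta}$ directly with NumPy (X_b is X with a bias column of ones prepended):
import numpy as np
X = 2 * np.random.rand(100, 1)                         # hypothetical single feature
y = 4 + 3 * X + np.random.randn(100, 1)                # hypothetical noisy target
X_b = np.c_[np.ones((100, 1)), X]                      # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y    # (X^T X)^{-1} X^T y
Scikit-Learn's LinearRegression fits the same model: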
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
The LinearRegression class is based on the scipy.linalg.lstsq() function (the name stands for “least squares”); NumPy's equivalent, np.linalg.lstsq(), can be called directly:
theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
theta_best_svd
This function computes $\hat{\theta} = X^+y$, where $X^+$ is the pseudoinverse of X.
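Since the pseudoinverse is itself computed using SVD and is available as np.linalg.pinv(), the same estimate can also be written directly (reusing X_b and y from the sketch above):
np.linalg.pinv(X_b) @ y    # X^+ y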
Computational Complexity
The SVD approach used by the LinearRegression class is about $O(n^2)$, where $n$ is the number of features.
Gradient Descent
To implement Gradient Descent, we need to compute the gradient of the cost function with respect to each model parameter $\theta_j$.
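A minimal Batch Gradient Descent sketch for Linear Regression, reusing the hypothetical X_b and y from above (the learning rate eta and the iteration count are arbitrary choices):
eta = 0.1                       # learning rate
n_iterations = 1000
m = len(X_b)                    # number of training instances
theta = np.random.randn(2, 1)   # random initialization
for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # gradient of the MSE cost
    theta = theta - eta * gradients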
Stochastic Gradient Descent
Picks a random instance in the training set at every step and computes the gradient based only on that single instance.
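A sketch of Stochastic Gradient Descent with a simple learning schedule, again on the hypothetical X_b and y (the schedule constants t0 and t1 are arbitrary):
n_epochs = 50
t0, t1 = 5, 50                  # learning-schedule hyperparameters
def learning_schedule(t):
    return t0 / (t + t1)

m = len(X_b)
theta = np.random.randn(2, 1)
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)            # pick one instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)       # gradient on that single instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients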
Mini-batch GD
Computes the gradients on small random sets of instances called mini-batches.
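A mini-batch variant of the same sketch: shuffle the training set each epoch and step through it in small batches (the batch size and learning rate are arbitrary choices here):
batch_size = 20
eta = 0.05
m = len(X_b)
theta = np.random.randn(2, 1)
for epoch in range(50):
    shuffled = np.random.permutation(m)
    X_shuffled, y_shuffled = X_b[shuffled], y[shuffled]
    for start in range(0, m, batch_size):
        xi = X_shuffled[start:start + batch_size]
        yi = y_shuffled[start:start + batch_size]
        gradients = 2 / len(xi) * xi.T @ (xi @ theta - yi)   # gradient on the mini-batch
        theta = theta - eta * gradients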
Polynomial Regression
Add powers of each feature as new features.
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_
When there are multiple features, Polynomial Regression can also find relationships between features, because PolynomialFeatures adds all combinations of features up to the given degree.
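For instance, with two features $a$ and $b$, degree=3 adds not only $a^2$, $a^3$, $b^2$, $b^3$ but also the combinations $ab$, $a^2b$, and $ab^2$. A quick check on hypothetical data:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X_two = np.arange(6).reshape(3, 2)                    # hypothetical 2-feature data
poly = PolynomialFeatures(degree=3, include_bias=False)
poly.fit_transform(X_two).shape                       # (3, 9): a, b, a^2, ab, b^2, a^3, a^2 b, a b^2, b^3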
Learning Curves
We need to know whether our model is overfitting or underfitting. We can plot the model's performance on the training set and the validation set as a function of the training set size (or the training iteration).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.legend(loc="upper right", fontsize=14)
    plt.xlabel("Training set size", fontsize=14)
    plt.ylabel("RMSE", fontsize=14)
First, we use a plain Linear Regression model.
lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)
plt.axis([0, 80, 0, 3])
plt.show()
Both curves reach a plateau; they are close together and fairly high. This is typical of an underfitting model.
Now we try a 10th-degree polynomial model.
from sklearn.pipeline import Pipeline
polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
plot_learning_curves(polynomial_regression, X, y)
plt.axis([0, 80, 0, 3])
plt.show()
The error on the training data is much lower than with the plain Linear Regression model, and there is a clear gap between the two curves: the model performs much better on the training data than on the validation data. This is the hallmark of an overfitting model.
Regularized Linear Models
Ridge Regression
The regularization term should only be added to the cost function during training. It is important to scale the data before performing Ridge Regression, as it is sensitive to the scale of the input features.
We can use a closed-form solution.
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
Or, we can use SGD.
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(penalty="l2", max_iter=1000, tol=1e-3, random_state=42)
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])
Lasso Regression
It tends to eliminate the weights of the least important features (i.e., set them to zero).
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
Elastic Net
A middle ground between Ridge and Lasso Regression. We can control the mix ratio r: when r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression.
In general, Elastic Net is preferred over Lasso because Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
from sklearn.linear_model import ElasticNet
e_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
e_net.fit(X, y)
Early Stopping
Stop training as soon as the validation error reaches a minimum.
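A basic early-stopping sketch with SGDRegressor and warm_start, keeping a copy of the best model seen on the validation set (X_train_scaled, X_val_scaled, y_train, y_val are assumed to be pre-split and pre-scaled data; the hyperparameters are arbitrary choices):
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)
minimum_val_error = float("inf")
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_scaled, y_train.ravel())         # continues where it left off
    y_val_predict = sgd_reg.predict(X_val_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_model = deepcopy(sgd_reg)                    # remember the best model so far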
Logistic Regression
Estimating Probabilities
$\sigma(t) = \frac{1}{1+\exp(-t)}$
Training and Cost Function
$c(\theta) = -\left[y\,\log(\hat{p}) + (1-y)\log(1-\hat{p})\right]$
We use the log loss here instead of the MSE because the log loss is convex for Logistic Regression, and its gradient grows with the error, so larger errors produce larger updates.
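A tiny NumPy sketch of the sigmoid and the per-instance log loss (the inputs are just illustrative values):
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def log_loss_single(p_hat, y):
    # -[y log(p_hat) + (1 - y) log(1 - p_hat)]
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

log_loss_single(sigmoid(2.0), 1)   # small loss: confident and correct
log_loss_single(sigmoid(2.0), 0)   # large loss: confident and wrong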
Decision Boundaries
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, 3:]                   # petal width only
y = (iris["target"] == 2).astype(int)     # 1 if Iris virginica, else 0
# For the contour plot below, use petal length and petal width:
X = iris["data"][:, (2, 3)]
y = (iris["target"] == 2).astype(int)
log_reg = LogisticRegression(solver="lbfgs", C=10**10, random_state=42)
log_reg.fit(X, y)
x0, x1 = np.meshgrid(
    np.linspace(2.9, 7, 500).reshape(-1, 1),
    np.linspace(0.8, 2.7, 200).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]
y_proba = log_reg.predict_proba(X_new)
plt.figure(figsize=(10, 4))
plt.plot(X[y==0, 0], X[y==0, 1], "bs")
plt.plot(X[y==1, 0], X[y==1, 1], "g^")
zz = y_proba[:, 1].reshape(x0.shape)
contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)
left_right = np.array([2.9, 7])
boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1]
plt.clabel(contour, inline=1, fontsize=12)
plt.plot(left_right, boundary, "k--", linewidth=3)
plt.text(3.5, 1.5, "Not Iris virginica", fontsize=14, color="b", ha="center")
plt.text(6.5, 2.3, "Iris virginica", fontsize=14, color="g", ha="center")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.axis([2.9, 7, 0.8, 2.7])
# save_fig("logistic_regression_contour_plot")   # notebook helper for saving figures; not defined here
plt.show()
Scikit-Learn's LogisticRegression applies an $\ell_2$ penalty by default. The hyperparameter controlling its regularization strength is not alpha but its inverse, C: the smaller C is, the stronger the regularization.
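A quick, illustrative comparison of a few C values on the same two-feature iris data from above (smaller C means stronger regularization, hence smaller coefficients):
from sklearn.linear_model import LogisticRegression

for C in (0.01, 1, 100):
    clf = LogisticRegression(C=C, random_state=42).fit(X, y)
    print(C, clf.coef_)    # coefficient magnitudes grow as C increases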
Softmax Regression
Logistic Regression can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers; this is called Softmax Regression.
Cross entropy gradient vector for class k
$\nabla_{\theta^{(k)}} J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{p}_k^{(i)} - y_k^{(i)}\right)x^{(i)}$
X = iris["data"][:, (2, 3)] # petal length, petal width
y = iris["target"]
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10, random_state=42)
softmax_reg.fit(X, y)
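The fitted model can then report a class and per-class probabilities for a new flower, e.g. one with 5 cm long, 2 cm wide petals:
softmax_reg.predict([[5, 2]])          # predicted class (2 = Iris virginica)
softmax_reg.predict_proba([[5, 2]])    # one estimated probability per class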