A simple model where x is the predictor (the input or independent variable) and y is the output (the predicted or dependent variable).
The model computes the output from the input using the equation y = mx + b.
Linear Regression
Prediction Equation: \( \hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n \)
Use case for linear regression
It is a simple model: feed the input into the equation and it returns the predicted output.
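As a rough illustration, here is a minimal sketch of fitting such a model with scikit-learn's LinearRegression; the toy data (y roughly 4 + 3x plus noise) is made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (made up for illustration): y is roughly 4 + 3x plus noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))                 # one feature, x
y = 4 + 3 * X[:, 0] + rng.normal(size=100)   # target, y

lin_reg = LinearRegression()
lin_reg.fit(X, y)

print(lin_reg.intercept_, lin_reg.coef_)     # estimates of b (theta_0) and m (theta_1)
print(lin_reg.predict([[1.5]]))              # y_hat for x = 1.5
```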
Polynomial regression
Polynomial regression is capable of finding relationships between features (interaction terms) when multiple features are used for a prediction. A simple linear regression model cannot do that, so when those relationships matter, polynomial regression is the better choice.
With PolynomialFeatures(degree=d), an array containing n features is transformed into an array containing \( \frac{(n+d)!}{d!\,n!} \) features.
Polynomial regression should be used when a straight line cannot be drawn through your data. It transforms the features so that the problem becomes more like a linear regression model.
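A short sketch of that transform-then-fit idea with scikit-learn's PolynomialFeatures; the quadratic toy data is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy quadratic data (made up): y = 0.5x^2 + x + 2 + noise
rng = np.random.default_rng(0)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)

# With n=1 feature and degree d=2, the formula (n+d)!/(d!n!) = 3 columns
# (counting the bias); include_bias=False drops the bias, leaving x and x^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly.shape)                          # (100, 2)

# An ordinary linear model fit on the transformed features
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)     # roughly 2, [1, 0.5]
```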
Gradient descent is an algorithm commonly used to find the minimum of a function. It is a first-order iterative optimization algorithm, which means it uses the gradient of the function to update its position in search space. The process is repeated until the minimum is reached or a stopping criterion is met.
To minimize the cost function
Gradient descent is a widely used algorithm in various machine learning applications, including linear regression, logistic regression, neural networks, and deep learning. It is a versatile algorithm that can be applied to a broad range of optimization problems.
There are several reasons why gradient descent is a popular choice for optimization problems.
Random initialization is a common practice in gradient descent. It involves picking an arbitrary set of starting values for the parameters of the model being optimized. This is done because if the starting point is always the same, the algorithm may converge to a local minimum instead of the global minimum of the cost function.
Batch gradient descent is a type of gradient descent that uses the entire training dataset to calculate the gradient of the cost function in each iteration. This can be computationally expensive for large datasets.
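A minimal NumPy sketch of batch gradient descent for linear regression with the MSE cost, including the random initialization mentioned above; the learning rate and iteration count are arbitrary illustrative choices.

```python
import numpy as np

# Toy data (made up): y is roughly 4 + 3x plus noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))

X_b = np.c_[np.ones((100, 1)), X]            # add a bias column of 1s
m = len(X_b)

eta = 0.1                                    # learning rate
n_iterations = 1000
theta = rng.normal(size=(2, 1))              # random initialization

for _ in range(n_iterations):
    # gradient of the MSE computed over the WHOLE training set ("batch")
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

print(theta)                                 # should end up close to [[4], [3]]
```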
Stochastic gradient descent is a type of gradient descent that uses a single training example to calculate the gradient of the cost function in each iteration. This can be more efficient than batch gradient descent for large datasets, but it can also lead to more fluctuations in the learning process.
Mini-batch gradient descent is a type of gradient descent that uses a small subset of the training dataset (mini-batch) to calculate the gradient of the cost function in each iteration. This is a compromise between batch gradient descent and stochastic gradient descent, offering a balance between computational efficiency and smoothness of the learning process.
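Building on the batch sketch above, here is a hedged sketch of mini-batch gradient descent; setting batch_size=1 turns it into stochastic gradient descent, and batch_size=m recovers batch gradient descent. All hyperparameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

eta = 0.1
n_epochs = 100
batch_size = 16                              # 1 -> stochastic GD, m -> batch GD
theta = rng.normal(size=(2, 1))              # random initialization

for _ in range(n_epochs):
    indices = rng.permutation(m)             # reshuffle the data each epoch
    for start in range(0, m, batch_size):
        batch = indices[start:start + batch_size]
        xb, yb = X_b[batch], y[batch]
        # gradient estimated from the mini-batch only
        gradients = (2 / len(batch)) * xb.T @ (xb @ theta - yb)
        theta = theta - eta * gradients

print(theta)                                 # noisier path than batch GD, but lands near [[4], [3]]
```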
| Overfitting: Variance | Underfitting: Bias |
|---|---|
| Performs well during training, but NOT on testing data | Does NOT perform well during training or testing |
| Does NOT generalize well in the real world | High bias; fails to capture the complexity of the data |
| Performs well during training, but poorly on cross-validation | Too simple, or model not trained long enough |
| Occurs when: the model is too complex | Occurs when: the model is too simple |
| How to address: regularize (L1, L2, Elastic Net) | How to address: use a more complex model or train longer |
For both Linear and Logistic Regression:
Metrics For Logistic Regression and Classification:
Why regularization: When a model performs well on the training set but poorly on the testing data, the model is overfitting. When a model is overfitting, that is when regularization is needed. Remember, overfitting means the model has high variance on the testing set.
Say for example we have very little training data, and the model learns the training instances well, but for the testing set the predictions are off and the graph of the predicted outcomes looks a little bit scattered. The testing set has some degree of variance. To address this we need to introduce some bias, and that is where regularization comes in.
By introducing some bias, the model that was overfitting can generalize to the test dataset better. How this is done depends on which regularization method you choose, since each one adds a different penalty term to the cost function.
The hyperparameter lambda controls the regularization strength (not the learning rate): the higher the lambda value, the less sensitive the model becomes to the training data. Eventually, as lambda goes higher, the slope of the regression line shrinks toward 0, and you end up with a straight horizontal line that goes through the mean of the data.
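A quick sketch of that flattening effect with scikit-learn's Ridge, where the regularization strength is called alpha rather than lambda; the data and alpha values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Tiny made-up dataset: y is roughly 2 + 5x plus noise
rng = np.random.default_rng(1)
X = rng.random((20, 1))
y = 2 + 5 * X[:, 0] + rng.normal(scale=0.5, size=20)

for alpha in [0.01, 1, 10, 1000]:            # alpha plays the role of lambda
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>6}: slope={model.coef_[0]:.3f}, intercept={model.intercept_:.3f}")

# As alpha grows, the slope shrinks toward 0 and the fitted line flattens
# out toward a horizontal line through the mean of y.
```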
Lasso Regression:
L1 will shrink the coefficients of the less important features down to 0, thus performing feature selection. This is useful if you have a lot of features and aren't sure which to select.
Cost Function: \( J(\mathbf{w}) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\mathbf{w}}(\mathbf{x}^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} |w_j| \)
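A hedged sketch of that feature-selection behavior with scikit-learn's Lasso (alpha again stands in for lambda); the data, where only 2 of 10 features are actually informative, is invented.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Made-up data: 10 features, but only the first 2 actually matter
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)
# The coefficients of the 8 uninformative features are typically driven to
# exactly 0, which is the feature-selection effect of the L1 penalty.
```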
Ridge Regression:
Cost Function: \( J_{\text{L2}}(\mathbf{w}) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\mathbf{w}}(\mathbf{x}^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n} w_j^2 \)
What L2 does is shrink the coefficients of the features that play less of a role in predicting y asymptotically close to 0. This helps deal with multicollinearity, effectively reducing the weight of unimportant features. Because L2 gets coefficients asymptotically close to 0 but never exactly 0, it will not remove any features.
The regularization term is added to the MSE, and it should only be added to the cost function during training. Lambda controls the strength of the regularization. Ridge is sensitive to the scale of the input features, so be sure to scale your data (e.g. with StandardScaler).
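A minimal sketch of Ridge with the scaling step mentioned above, using a scikit-learn Pipeline with StandardScaler; the alpha value and data are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Made-up data with two features on very different scales
rng = np.random.default_rng(3)
X = np.c_[rng.random(100), 1000 * rng.random(100)]
y = 2 * X[:, 0] + 0.005 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Scale first so the L2 penalty treats every feature fairly
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)      # coefficients on the scaled features
```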
Early stopping
Used, for example with softmax regression or any model trained iteratively with gradient descent, to prevent overfitting.
When the validation error reaches a minimum, stop training.
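A hedged sketch of that rule using scikit-learn's SGDRegressor with warm_start=True, so each call to fit() continues training for one more epoch; the data, epoch budget, and learning rate are illustrative assumptions. (SGDRegressor also has a built-in early_stopping option; the explicit loop just makes the mechanism visible.)

```python
import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

# Made-up regression data
rng = np.random.default_rng(5)
X = 3 * rng.random((200, 1))
y = 1 + 2 * X[:, 0] + rng.normal(scale=0.5, size=200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# warm_start=True and max_iter=1: every .fit() call runs one more epoch
sgd = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                   learning_rate="constant", eta0=0.005, random_state=0)

best_val_error = float("inf")
best_model = None
for epoch in range(500):
    sgd.fit(X_train, y_train)                        # continue from the previous epoch
    val_error = np.mean((sgd.predict(X_val) - y_val) ** 2)
    if val_error < best_val_error:                   # validation error still improving
        best_val_error = val_error
        best_model = deepcopy(sgd)                   # keep a copy of the model at the minimum

print(best_val_error)
```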