Machine Learning - Landscape (2)

19.10.20

Here is the second chapter of my notes on the Machine Learning Landscape, based on the book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition".

If you are interested in the first part, here is the link: https://www.franciscojavierpulido.com/2020/10/machine-learning-landscape-1.html

Note: this post is a summary and personal notes that are part of my studies, but I like to share it with the community.

Great then, let's continue with the party.

Instance-based versus Model-based learning

Concepts:

- Generalization: another way to classify ML algorithms.

To generalize to = to make good predictions for!

- Having a good performance measure on the training data is good, but insufficient; the true goal is to perform well on new instances.

Instance-based learning

The system learns the examples by heart, then generalizes to new cases by using a similarity measure to compare them to the learned examples.
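As a tiny sketch of this idea (with made-up numbers, loosely inspired by the book's GDP-vs-life-satisfaction example), an instance-based learner can be as simple as memorizing the training examples and answering with the target of the most similar one:

```python
# A minimal instance-based learner: 1-nearest-neighbor regression.
# The "model" is just the stored training examples; prediction compares
# the new case to them with a similarity measure (here, absolute distance
# on a single feature) and returns the target of the closest example.

def predict_1nn(training_examples, x_new):
    """training_examples: list of (x, y) pairs learned 'by heart'."""
    nearest_x, nearest_y = min(training_examples,
                               key=lambda pair: abs(pair[0] - x_new))
    return nearest_y

# Toy data: GDP per capita (k$) -> life satisfaction (made-up numbers)
data = [(10.0, 5.0), (30.0, 6.5), (55.0, 7.3)]
print(predict_1nn(data, 28.0))  # closest example is x=30.0, so 6.5
```

Note that nothing is "fitted" here: all the work happens at prediction time, by comparing the new instance to the memorized examples.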

Model-based learning

Building models to make good predictions.

Concepts:

- Utility function/fitness function: measures how good a model is.

- Cost function: measures how bad a model is.

For linear regression problems, people typically use a cost function that measures the distance between the linear model's predictions and the training examples; the objective is to minimize that distance.

Training the model: finding the parameter values that make the linear model fit our training data best.
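A minimal sketch of model-based learning, assuming a one-feature linear model y = theta0 + theta1 * x and toy data I made up for illustration; the closed-form least-squares solution finds the parameter values that minimize the MSE cost:

```python
# Model-based learning sketch: fit y = theta0 + theta1 * x by choosing the
# parameters that minimize the MSE cost function (ordinary least squares,
# closed form for a single feature).

def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of (x, y) divided by variance of x
    theta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
              / sum((x - mean_x) ** 2 for x in xs))
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

# Toy data lying exactly on y = 2x + 1, so the fit recovers those parameters
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = fit_linear(xs, ys)
print(theta0, theta1)  # 1.0 2.0
```

Once trained, the model is just the two numbers (theta0, theta1); predicting for a new instance no longer needs the training examples at all, unlike the instance-based learner above.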

Main challenges of Machine Learning

Two main problems: bad algorithms / bad data

- Insufficient quantity of training data

- Nonrepresentative training data: the training data must be representative of the new cases to make accurate predictions. Two ways it can fail:

a) Sampling noise: if the sample is too small.

b) Sampling bias: even very large samples can be nonrepresentative if the sampling method is flawed.

- Poor-quality data: errors, outliers, noise ...

- Irrelevant features: the training data must contain enough relevant features; coming up with a good set of features is called feature engineering:

a) feature selection

b) feature extraction

c) creating new features

- Overfitting the training data: the model performs well on the training data, but it does not generalize well.

There are two possible solutions for overfitting:

a) Simplifying the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model.

b) Regularization: constraining a model to make it simpler and reduce the risk of overfitting.

- Hyperparameter: a parameter of the learning algorithm (not of the model). For example, a hyperparameter can control the amount of regularization to apply during learning.
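To see regularization and its hyperparameter together, here is a sketch in the style of ridge regression (my own simplified one-feature version, not the book's code): a penalty alpha * theta1^2 is added to the MSE cost, so larger alpha values shrink the slope toward zero and make the model simpler:

```python
# Regularization sketch (ridge-style, one feature, intercept unpenalized):
# minimize  sum((y - theta0 - theta1*x)^2) + alpha * theta1^2.
# The closed form only changes the slope's denominator: alpha is the
# hyperparameter controlling how strongly the slope is shrunk toward 0.

def fit_ridge(xs, ys, alpha):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / (sxx + alpha)   # alpha = 0 gives plain least squares
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
print(fit_ridge(xs, ys, alpha=0.0)[1])  # slope 2.0 (unregularized)
print(fit_ridge(xs, ys, alpha=5.0)[1])  # slope 1.0 (shrunk toward 0)
```

Note that alpha is not learned from the data by the algorithm itself; we set it before training, which is exactly what makes it a hyperparameter.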

- Underfitting the training data: it occurs when the model is too simple to learn the underlying structure of the data. To fix this problem:

a) Select a more powerful model, with more parameters.

b) Feed better features to the learning algorithm (feature engineering).

c) Reduce the constraints on the model.

Testing and validating

It basically means splitting the data into two sets: the training set and the test set.

- Generalization error: the error rate on new cases. This value tells us how well our model will perform on instances it has never seen before.

If the training error is low but the generalization error is high, it means that our model is overfitting the training data.
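This diagnosis can be sketched with the memorizing 1-nearest-neighbor learner and made-up toy data: the training error is zero because the model knows the training set by heart, while the error on a held-out test set reveals the generalization gap:

```python
# Overfitting check sketch: compare the error on the training set with the
# error on a held-out test set. A memorizing model (1-nearest-neighbor)
# always gets zero training error; only the test error shows how well it
# actually generalizes.

def predict_1nn(train, x_new):
    return min(train, key=lambda pair: abs(pair[0] - x_new))[1]

def mse(train, data):
    return sum((predict_1nn(train, x) - y) ** 2 for x, y in data) / len(data)

train_set = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
test_set = [(1.4, 1.4), (2.6, 2.6)]

print(mse(train_set, train_set))  # 0.0: every training case is matched exactly
print(mse(train_set, test_set))   # > 0: the estimated generalization error
```

A large gap between the two numbers is the telltale sign of overfitting described above.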

- Hyperparameter tuning and model selection: holdout validation keeps a single validation set to compare models and hyperparameters; cross-validation goes further by using many small validation sets, evaluating each model once per validation set (trained on the rest of the data) and averaging the results.
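The cross-validation idea can be sketched like this (a simplified k-fold split by index, without shuffling; the trivial "predict the mean" model and the data are made up for illustration):

```python
# k-fold cross-validation sketch: split the data into k folds; each model
# is trained on k-1 folds and evaluated on the remaining one, and the k
# validation errors are averaged. (Fold membership here is just j % k,
# a simplification: real implementations shuffle the data first.)

def k_fold_mse(xs, ys, k, fit, predict):
    errors = []
    for i in range(k):
        # fold i is the validation set; the other folds form the training set
        train = [(x, y) for j, (x, y) in enumerate(zip(xs, ys)) if j % k != i]
        valid = [(x, y) for j, (x, y) in enumerate(zip(xs, ys)) if j % k == i]
        model = fit([x for x, _ in train], [y for _, y in train])
        errors.append(sum((predict(model, x) - y) ** 2
                          for x, y in valid) / len(valid))
    return sum(errors) / k

fit = lambda xs, ys: sum(ys) / len(ys)   # trivial model: mean of the targets
predict = lambda model, x: model
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 3.0, 4.0]
print(k_fold_mse(xs, ys, k=2, fit=fit, predict=predict))  # 2.0
```

To compare models or hyperparameter values, you would run this once per candidate and pick the one with the lowest averaged validation error, then retrain it on all the data.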

- Avoid data mismatch: hold out part of the training data in a train-dev set and evaluate the trained model on it:

a) if the model performs poorly on the train-dev set, it is overfitting the training set

b) if it performs well on the train-dev set but poorly on the validation set, the problem is a data mismatch between the training data and the validation/test data
