Training data has labels. This means that before the algorithm is asked to recognize a square (four equal sides), it has already been shown examples that were labelled square. So when it detects four equal sides in a new shape, it can identify the shape as a square, because that matches what it has seen before.
The first thing to do is always to gather the data. For example, if we are trying to use machine learning to predict housing prices, we would gather data about housing prices over the course of many years.
Once the data has been collected, it is wise to perform an exploratory data analysis (EDA) to identify trends. One common step is to find the correlation between two variables. For example, say we are given data on specific stocks. We can compute a correlation value to determine how two stocks are related. The value ranges from -1 to 1: the closer it is to 1 or -1, the stronger the relationship, and the closer it is to 0, the weaker the relationship. Negative correlation values indicate inverse relationships. Think of a stock and a put option on that stock, which tend to be inversely correlated. As the value of one increases, the value of the other tends to decrease.
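To make the correlation value concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. Pearson is assumed here because the text does not say which correlation measure is meant, and the price series are made up purely for illustration.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How the two series move together, and how much each one varies on its own.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return co_movement / sqrt(spread_x * spread_y)

# Hypothetical prices, invented for this example.
stock_a = [10.0, 11.0, 12.0, 13.0, 14.0]
stock_b = [20.0, 22.0, 24.0, 26.0, 28.0]  # rises whenever stock_a rises
put_on_a = [5.0, 4.0, 3.0, 2.0, 1.0]      # falls whenever stock_a rises

r_same = pearson_correlation(stock_a, stock_b)       # close to 1
r_opposite = pearson_correlation(stock_a, put_on_a)  # close to -1
print(r_same, r_opposite)
```

Real price data is noisy, so values will usually land somewhere between these two extremes rather than at exactly 1 or -1.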
EDA helps us identify trends. Another use of EDA is to generate visualizations of the data to get a better understanding of what is going on in the data.
Regression is when we input numerical data and get a numerical output. y = mx + b is a simple linear regression model: it takes a numerical value x as input and outputs a numerical value y, where m is the slope of the line and b is the intercept.
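As a rough sketch of how m and b could be learned from data, the snippet below fits y = mx + b with ordinary least squares. The function names and the house sizes and prices are hypothetical, chosen only to tie back to the housing example.

```python
def fit_line(xs, ys):
    """Fit y = m*x + b by ordinary least squares and return (m, b)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    if spread_x == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = co_movement / spread_x   # slope
    b = mean_y - m * mean_x      # intercept
    return m, b

def predict(m, b, x):
    """Apply the fitted line to a new input."""
    return m * x + b

# Hypothetical data: house size in square metres vs. price in thousands.
sizes = [50.0, 60.0, 70.0, 80.0]
prices = [150.0, 180.0, 210.0, 240.0]  # constructed so that price = 3 * size

m, b = fit_line(sizes, prices)
estimate = predict(m, b, 100.0)
print(m, b, estimate)
```

Because the made-up prices lie exactly on a line, the fit recovers a slope of about 3 and an intercept of about 0. With real data the line would only approximate the points.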
Supervised learning also covers classification. Going back to the square analogy I gave earlier, we give the machine information and it determines what kind of shape it is. If the machine detects three sides, it’s classified as a triangle; four equal sides, it’s a square; six sides, it’s a hexagon.
The model has been shown these shapes before, so it is able to detect them and classify new input based on what it has seen. Determining what shape it is from the information provided is the classification part.
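The shape analogy can be sketched in a few lines. This is a deliberately naive stand-in for a trained classifier: it just remembers which label went with each combination of features in the labelled examples. The feature choice (number of sides, whether all sides are equal) and all names are assumptions made for illustration.

```python
def extract_features(side_lengths):
    """Turn raw side lengths into (number of sides, whether all sides are equal)."""
    all_equal = len(set(side_lengths)) == 1
    return (len(side_lengths), all_equal)

def train(labelled_examples):
    """Remember the label attached to each feature combination seen in training."""
    seen = {}
    for side_lengths, label in labelled_examples:
        seen[extract_features(side_lengths)] = label
    return seen

def classify(model, side_lengths):
    """Return the label seen for these features, or 'unknown' if never seen."""
    return model.get(extract_features(side_lengths), "unknown")

# Hypothetical labelled training data: (side lengths, label).
training_data = [
    ([3, 4, 5], "triangle"),
    ([6, 6, 6], "triangle"),
    ([2, 2, 2, 2], "square"),
    ([1, 1, 1, 1, 1, 1], "hexagon"),
]
model = train(training_data)

print(classify(model, [5, 5, 5, 5]))  # four equal sides
print(classify(model, [2, 3, 2, 3]))  # four unequal sides, never seen in training
```

The second call shows the limit of the analogy: a shape with features the model was never shown cannot be labelled. Real classifiers generalize from similar examples instead of requiring an exact match.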
Determining whether an email is spam or not is a machine learning classification problem. We give the machine labelled training data to learn from. Spam emails usually contain specific types of words, so we can collect a large number of spam emails, along with non-spam emails for comparison, and store them in the dataset.
Once the email data has been stored, the machine can read a new email and count how often its words appear in the spam emails and how often they appear in the non-spam emails. Whichever count is higher decides whether the machine classifies the email as spam or not spam.
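Here is a minimal sketch of that word-counting idea. The emails, the tokenizer, and the tie-breaking rule (ties go to not spam) are all assumptions made for the example. Practical spam filters typically weight words by probability rather than comparing raw counts.

```python
from collections import Counter

PUNCTUATION = ".,!?:;\"'()"

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCTUATION) for w in text.lower().split())
    return [w for w in words if w]

def train(labelled_emails):
    """Count how often each word appears in spam and in non-spam emails."""
    counts = {"spam": Counter(), "not spam": Counter()}
    for text, label in labelled_emails:
        counts[label].update(tokenize(text))
    return counts

def classify(counts, text):
    """Compare how often the email's words were seen in each class."""
    words = tokenize(text)
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["not spam"][w] for w in words)
    # Assumption: a tie is treated as not spam.
    return "spam" if spam_score > ham_score else "not spam"

# Hypothetical labelled training emails.
training_emails = [
    ("Win a free prize now", "spam"),
    ("Free money, claim your prize", "spam"),
    ("Meeting notes from the team lunch", "not spam"),
    ("Are we still on for lunch tomorrow", "not spam"),
]
counts = train(training_emails)

print(classify(counts, "Claim your free prize"))
print(classify(counts, "Team lunch tomorrow?"))
```

With this tiny dataset, the first email scores only on spam words and the second only on non-spam words, so they land in opposite classes.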
Is it going to be hot or cold tomorrow? That is another example of classification: it will be one or the other. This is called binary classification, where there are two possible outcomes, hot or cold. The spam email example is also binary classification: spam or not spam.
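A binary classifier can be as simple as a single learned cutoff. The sketch below assumes the only input is a forecast temperature in degrees Celsius and places the cutoff halfway between the average hot day and the average cold day in some made-up labelled history. This is one simple way to pick a threshold, not the only one.

```python
def learn_threshold(labelled_temps):
    """Return the midpoint between the mean hot and mean cold temperature."""
    hot = [t for t, label in labelled_temps if label == "hot"]
    cold = [t for t, label in labelled_temps if label == "cold"]
    if not hot or not cold:
        raise ValueError("need at least one example of each class")
    return (sum(hot) / len(hot) + sum(cold) / len(cold)) / 2

def classify(threshold, temp_c):
    """Two possible outcomes: 'hot' at or above the threshold, otherwise 'cold'."""
    return "hot" if temp_c >= threshold else "cold"

# Hypothetical labelled history: (temperature in Celsius, label).
history = [(30.0, "hot"), (28.0, "hot"), (10.0, "cold"), (12.0, "cold")]
threshold = learn_threshold(history)

print(threshold)
print(classify(threshold, 25.0), classify(threshold, 5.0))
```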