Using Linear Regression Model
Problem Statement
Suppose the HR department of a company wants to build a model that predicts the salary of a new employee from the data the company already holds. The .csv file contains the data needed to train and test the model.
Objectives
This analysis explores the Salary dataset. The goal is to predict the salary of an employee on the basis of years of experience. Because salary is a continuous value rather than a discrete class, I have used a machine learning regression method, which fits a function that predicts a numeric output for a new input.
Dataset
The dataset is Salary_Data.csv.
It has 2 columns, “Years of Experience” and “Salary”, for 30 employees working in a company. We will train a Simple Linear Regression model to capture the relationship between the years of experience of each employee and their respective salary.
Load the dataset/Data Exploration
We will be using Jupyter Notebook to work on this dataset. We first import the necessary libraries and then load our dataset into the notebook:
Below is the code snippet for loading the dataset.
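A minimal sketch of the loading step, assuming Salary_Data.csv sits in the notebook's working directory. The helper name load_salary_data is only illustrative; pd.read_csv is the standard pandas call for reading a CSV file.

```python
import pandas as pd


def load_salary_data(csv_path="Salary_Data.csv"):
    """Read the salary CSV into a pandas DataFrame.

    The default path is an assumption: it expects the file to be in the
    current working directory.
    """
    return pd.read_csv(csv_path)
```

In a notebook cell, dataset = load_salary_data() followed by dataset.head() displays the first five rows, which is a quick way to confirm the two columns loaded as expected.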
We can find the dimensions of the dataset using the pandas DataFrame ‘shape’ attribute.
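As a small illustration of the shape attribute, the sketch below unpacks it into rows and columns. The helper name dataset_dimensions is hypothetical and exists only for this example.

```python
import pandas as pd


def dataset_dimensions(dataset):
    """Return (number_of_rows, number_of_columns) for a DataFrame."""
    n_rows, n_columns = dataset.shape  # shape is an attribute, not a method
    return n_rows, n_columns
```

For a file with 30 rows and 2 columns, as described above, this returns (30, 2).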
Split dataset into training set and test set
The data we use is usually split into training data and test data. The training set contains a known output and the model learns on this data in order to be generalized to other data later on. We will use the training dataset for training the model and then check the performance of the model on the test dataset.
For this we will use the train_test_split function from the model_selection module of the scikit-learn library.
We are providing a test_size of 0.2, which means the test set will contain 20% of the observations and the training set will contain 80% of the observations.
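The split could look roughly like the sketch below. It assumes the last column of the DataFrame is the target (salary) and everything before it is the feature (years of experience); the helper name split_dataset and the random_state=0 seed are our own choices for reproducibility, not something the dataset requires.

```python
from sklearn.model_selection import train_test_split


def split_dataset(dataset, test_size=0.2, random_state=0):
    """Split a two-column DataFrame into training and test arrays.

    Assumes the last column is the target and every column before it
    is a feature.
    """
    x = dataset.iloc[:, :-1].values  # 2-D array of features
    y = dataset.iloc[:, -1].values   # 1-D array of targets
    xtrain, xtest, ytrain, ytest = train_test_split(
        x, y, test_size=test_size, random_state=random_state
    )
    return xtrain, xtest, ytrain, ytest
```

With 30 observations and test_size=0.2, this gives 24 rows for training and 6 for testing. Keeping x two-dimensional matters later, because scikit-learn estimators expect features in the shape (n_samples, n_features).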
Plotting a scatter plot to check the relationship between the independent variable and the dependent variable
The graph below shows the relationship between the independent variable, i.e. x (years of experience), and the dependent variable, i.e. y (salary).
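One way to draw such a scatter plot with matplotlib is sketched here. The function name, title, and colour are illustrative assumptions; only the axis meanings come from the dataset description.

```python
import matplotlib.pyplot as plt


def plot_experience_vs_salary(x, y):
    """Scatter-plot salary (y) against years of experience (x)."""
    fig, ax = plt.subplots()
    ax.scatter(x, y, color="blue")
    ax.set_title("Salary vs Years of Experience")
    ax.set_xlabel("Years of Experience")
    ax.set_ylabel("Salary")
    return ax
```

If the points fall roughly along a straight line, a linear model is a reasonable first choice.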
Building the model
We will be using the LinearRegression class from the sklearn.linear_model module. First we create an object of the LinearRegression class and then call its fit method, passing xtrain and ytrain.
What is Linear Regression
Regression is a predictive modelling technique that investigates the relationship between independent and dependent variables. Linear Regression is a supervised statistical learning technique that predicts a quantitative variable by forming a linear relationship with one or more independent features. With a single feature, it expresses the relationship between the dependent and independent variables as a straight line of the form y = b0 + b1 * x, where b0 is the intercept and b1 is the slope.
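Coming back to the model-building step, here is a minimal sketch of fitting the model and reading off the line it learned. The helper names fit_salary_model and line_parameters are hypothetical; LinearRegression, fit, intercept_, and coef_ are the scikit-learn names. The intercept and slope correspond to b0 and b1 in y = b0 + b1 * x.

```python
from sklearn.linear_model import LinearRegression


def fit_salary_model(xtrain, ytrain):
    """Fit a simple linear regression and return the fitted model.

    xtrain must be 2-D with shape (n_samples, 1); ytrain is 1-D.
    """
    regressor = LinearRegression()
    regressor.fit(xtrain, ytrain)
    return regressor


def line_parameters(regressor):
    """Return (b0, b1): the intercept and slope of the fitted line."""
    return float(regressor.intercept_), float(regressor.coef_[0])
```

For this dataset, b1 can be read as the estimated change in salary for each additional year of experience.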
Visualization
Let’s visualize the test and prediction results.
First we’ll plot the actual data points of the test set, xtest and ytest, and then we’ll plot the model's predictions for the same inputs, xtest and ypred.
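The two layers of this plot can be sketched as follows, assuming ypred was obtained by calling the fitted model's predict method on xtest. The function name, colours, and labels are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_test_vs_prediction(xtest, ytest, ypred):
    """Overlay the predicted line on the actual test-set points."""
    x = np.ravel(xtest)  # flatten (n, 1) features to 1-D for plotting
    fig, ax = plt.subplots()
    ax.scatter(x, np.ravel(ytest), color="red", label="Actual (test set)")
    ax.plot(x, np.ravel(ypred), color="blue", label="Predicted")
    ax.set_title("Salary vs Years of Experience (test set)")
    ax.set_xlabel("Years of Experience")
    ax.set_ylabel("Salary")
    ax.legend()
    return ax
```

The closer the red points sit to the blue line, the better the model generalizes to data it did not see during training.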
Accuracy
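We measure the accuracy of the model using the r2 score (coefficient of determination), which compares the predictions against the actual test values. The maximum possible score is 1, and the closer the score is to 1, the better the fit.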
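To make the r2 score concrete, the sketch below computes it with scikit-learn's r2_score and also by hand as 1 - SS_res / SS_tot. The function names evaluate_r2 and r2_by_hand are hypothetical helpers for this illustration.

```python
import numpy as np
from sklearn.metrics import r2_score


def evaluate_r2(ytest, ypred):
    """Return the r2 score of the predictions on the test set."""
    return r2_score(ytest, ypred)


def r2_by_hand(ytest, ypred):
    """Compute r2 as 1 - (residual sum of squares / total sum of squares)."""
    ytest = np.asarray(ytest, dtype=float)
    ypred = np.asarray(ypred, dtype=float)
    ss_res = np.sum((ytest - ypred) ** 2)         # error left after the model
    ss_tot = np.sum((ytest - ytest.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot
```

A score of 1 means the predictions match the test values exactly, while a score of 0 means the model does no better than always predicting the mean salary.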
New Prediction
We can also make new predictions for data that do not exist in the dataset, for example for a person with 12 years of experience.
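A prediction for a single new value could be made as in the sketch below, assuming regressor is an already fitted LinearRegression model. The helper name predict_salary is hypothetical; the important detail is that predict expects a 2-D input, so the single value is wrapped as [[12]].

```python
import numpy as np


def predict_salary(regressor, years_of_experience):
    """Predict the salary for one experience value with a fitted model."""
    new_x = np.array([[years_of_experience]], dtype=float)  # shape (1, 1)
    return float(regressor.predict(new_x)[0])
```

Calling predict_salary(regressor, 12) returns the salary on the fitted line at x = 12. If 12 years lies outside the range of experience seen in training, the result is an extrapolation and should be treated with more caution.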
Downloads
Below is the GitHub link with more detailed information on the Python code and datasets.
Thank You