Employee Salary Prediction

Shubhangi
4 min readJul 4, 2021

--

Using Linear Regression Model

Problem Statement

Suppose the HR department of a company wants to make a model to predict the salary of a new employee based on the data they have on the company. The .csv file consists the data needed to train and test the model.

Objectives

This analysis aims to observe the Salary Dataset. The goal is to classify the salary of the employee on the basis of year of experience. To achieve this i have used machine learning classification methods to fit a function that can predict the discrete class of new input.

Dataset

It is Salary_Data.csv .
It has 2 columns — “Years of Experience” and “Salary” for 30 employees working in a company. we will train a Simple Linear Regression model to establish the correlation between the years of experience of each employee and their respective salary.

Load the dataset/Data Exploration

We will be using jupyter notebook to work on this dataset. We will first go with importing the necessary libraries and import our dataset to jupyter notebook:

Below is the code snippet of loading the dataset

Data of the dataset
X and Y of the dataset

We can find the dimensions of the data set using the panda dataset ‘shape’ attribute.

Split dataset into training set and test set

The data we use is usually split into training data and test data. The training set contains a known output and the model learns on this data in order to be generalized to other data later on. We will use the training dataset for training the model and then check the performance of the model on the test dataset.

For this we will use the train_test_split method from library model_selection
We are providing a test_size of 1/2 which means test set will contain 20% observations and training set will contain 80% observations

We will do this using SciKit-Learn library in Python using the train_test_split method.

training & testing
xtrain
ytrain

Plotting Scatter plot to check relationship between independent variable and dependent variable

The below graph snippet shows the relationship between the independent variable i.e x and dependent variable i.e y .

Relationship between x v/s y

Building the model

We will be using the LinearRegression class from the library sklearn.linear_model. First we create an object of the LinearRegression class and call the fit method passing the xtrain and ytrain.What is Linear Regression

Regression is the process of predictive modelling technique, which investigate’s the relationship between independent and dependent vectors. Linear Regression is a statistical supervised learning technique to predict the quantitative variable by forming a linear relationship with one or more independent features.It expresses the relation among the dependent and independent vector’s as a straight and is in the form as below.

Model building

Visualization

Let’s visualize the test and prediction results.
First we’ll plot the actual data points of training set — xtest and ytest and secondly we’ll plot testing and prediction set — xtest and ypred.

visualization

Accuracy

Predicting the accuracy using r2 score and finding the maximum accuracy

New Prediction

We can also make new predictions for data that do not exist in the dataset. Like for a person with 12 years experience

new prediction

Downloads

Below is the GitHub link for more detailed information to the python code and datasets.

Thank You

--

--