March 06, 2022

Campus Recruitment

Campus recruitment is a strategy for sourcing, engaging, and hiring young talent for internships and entry-level positions. College recruiting is typically a tactic for medium- to large-sized companies with high-volume recruiting needs, but it can range from small efforts (like working with university career centers to source potential candidates) to large-scale operations (like visiting a wide array of colleges and attending recruiting events throughout the spring and fall semesters). Campus recruitment often involves working with university career services centers and attending career fairs to meet in person with college students and recent graduates.

Our dataset revolves around the placement season of a business school in India. It covers various factors that influence whether a candidate gets hired, such as work experience and exam percentages, and it also contains the recruitment status and remuneration details.

There are three primary goals in this project: first, perform an exploratory analysis of the recruitment dataset; second, perform a visualization analysis of the dataset; and finally, predict whether a student got placed or not using classification models.

Data Fields

  • sl_no — Serial number
  • gender — Gender (Male = 'M', Female = 'F')
  • ssc_p — Secondary Education percentage (10th grade)
  • ssc_b — Board of Education (Central/Others)
  • hsc_p — Higher Secondary Education percentage (12th grade)
  • hsc_b — Board of Education (Central/Others)
  • hsc_s — Specialization in Higher Secondary Education
  • degree_p — Degree percentage
  • degree_t — Undergraduate degree type (field of degree education)
  • workex — Work experience
  • etest_p — Employability test percentage
  • specialisation — MBA specialisation
  • mba_p — MBA percentage
  • status — Placement status (Placed/Not placed)
  • salary — Salary offered

    Import Packages and Data

    import pandas as pd               # data loading and manipulation
    import numpy as np                # numerical operations
    import matplotlib.pyplot as plt   # plotting
    import seaborn as sns             # statistical visualization
    import sklearn                    # machine learning models and metrics
    %matplotlib inline
    

    For this exercise, the dataset (.csv format) is uploaded to my GitHub repository, read into the Jupyter notebook, and stored in a pandas DataFrame.
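
    A minimal sketch of the loading step, assuming a raw GitHub URL; the user, repository, and file name below are placeholders, not the actual paths.

        # Hypothetical URL -- substitute the real raw GitHub path to the CSV
        url = 'https://raw.githubusercontent.com/<user>/<repo>/main/placement_data.csv'
        df = pd.read_csv(url)
        df.head()   # quick sanity check of the first few rows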

    We have gender and educational qualification data, including all the educational performance (score) data, along with the placement status and salary details. We can expect null values in salary, since candidates who weren't placed would have no salary. Placement status is our target variable; the rest, except salary, are independent variables.

    Data Preparation and Cleaning

    We have details for 215 candidates, with mixed datatypes across the columns. As expected, there are a few missing values in the salary column, since those are the candidates who didn't get hired. The dataset contains 1 integer, 5 float, and 8 object columns.

    Checking Missing Values

    Datasets in the real world are often messy; this dataset, however, is almost clean and simple. Let's check for missing values first.
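
    A quick sketch of the checks described above:

        df.shape            # 215 candidate records
        df.info()           # datatypes and non-null counts per column
        df.isnull().sum()   # number of missing values in each column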

    There are 67 null values in our data, which correspond to 67 unhired candidates. We can't drop these rows, as they provide valuable information on why candidates failed to get hired. We also can't impute them with the mean or median: that would go against the context of this dataset and imply that unhired candidates received a salary. The best way to deal with these null values is to impute them with 0, which shows they don't have any income.

    Handling Missing Values

    First, let's focus on the missing data in the salary feature. If we drop the rows that have null values, we might throw away useful information from the dataset, so instead we impute values into the NaN records, which leads to more accurate models. Since the missing values sit in the salary feature, it is best to impute those records with 0 for unhired candidates.

    After replacing the missing values with 0, we drop the unwanted features. We drop the serial number, since the default index already serves that role, and the boards of school education, as I believe they don't matter for recruitment.
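
    A minimal sketch of these two steps, assuming the column names from the data fields above:

        # Impute missing salaries with 0 (candidates who weren't placed)
        df['salary'] = df['salary'].fillna(0)

        # Drop the serial number and the two school boards
        df = df.drop(columns=['sl_no', 'ssc_b', 'hsc_b'])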

    Outliers

    Outliers are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. Unfortunately, all analysts will confront outliers and be forced to decide what to do with them. Given the problems they can cause, you might think it's best to remove them from your data. But that's not always the case: removing outliers is legitimate only for specific reasons.

    Outliers can be very informative about the subject area and the data collection process. It's essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. Unfortunately, resisting the temptation to remove outliers inappropriately can be difficult. Outliers increase the variability in your data, which decreases statistical power; consequently, excluding outliers can cause your results to become statistically significant. In our case, let's first visualize the data and then decide what to do with the outliers.
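
    One simple way to eyeball the outliers is a boxplot of each percentage feature; a minimal sketch:

        # Boxplots of the percentage features; points beyond the whiskers are outliers
        pct_cols = ['ssc_p', 'hsc_p', 'degree_p', 'etest_p', 'mba_p']
        df[pct_cols].plot(kind='box', figsize=(10, 5))
        plt.title('Outliers in percentage features')
        plt.show()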

    As you can see, we have very few outliers in our features, with the majority of them in the HSC percentage. Let's visualize the data again after removing the outliers.

    Exploratory Data Analysis

    Count of categorical features- Count plot
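
    A sketch of the count plots, assuming the categorical column names listed in the data fields:

        # One count plot per categorical feature
        cat_cols = ['gender', 'hsc_s', 'degree_t', 'specialisation', 'workex', 'status']
        fig, axes = plt.subplots(2, 3, figsize=(15, 8))
        for ax, col in zip(axes.ravel(), cat_cols):
            sns.countplot(x=col, data=df, ax=ax)
        plt.tight_layout()
        plt.show()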

    From the plots we can see that we have more male candidates than female; most candidates took commerce in higher secondary as well as in their undergraduate degrees, with science backgrounds second in both cases; candidates from the Marketing and Finance dual specialization are the most numerous; most candidates in our dataset don't have any work experience; and most candidates in our dataset got placed in a company.

    Distribution of Salary - Placed Students

    Many candidates who got placed received a package between 2 and 4 lakhs per annum. Only one candidate got around 10 lakhs PA, and the average salary is a little more than 2 lakhs PA.

    Employability score vs Salary- Joint plot
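
    A minimal sketch of the joint plot between employability test score and salary:

        # Joint plot: scatter plus marginal distributions
        sns.jointplot(x='etest_p', y='salary', data=df)
        plt.show()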

    Most of the candidates who scored around 60 percent got a decent package of around 3 lakhs PA. Not many candidates received a salary of more than 4 lakhs PA, and the dense band at the bottom shows the candidates who were not placed (salary of 0).

    Distribution of all percentages

    All the distributions are approximately normal except for the salary feature. Most of the candidates' educational performances fall between 60 and 80%. The salary distribution has outliers, with a few candidates receiving salaries of 7.5L and 10L PA.

    Work experience Vs Placement Status

    Nearly 66.2% of the candidates have no work experience at all, and candidates without work experience were hired more often than those with experience. This suggests that work experience doesn't influence a candidate's chances in this recruitment process.

    MBA marks vs Placement Status- Does your academic score influence?

    Comparatively, there is only a slight difference in percentage scores between the two groups, but placed candidates still have the upper hand in numbers, as you can see in the swarm plot. So, per the plot, MBA percentage does influence placement status.

    Does MBA percentage and Employability score correlate?

    There is no clear relationship between MBA percentage and the employability test score. Many of the candidates without work experience were not placed, while most of the candidates who performed well on both scores got placed.

    Is there any gender bias while offering remuneration?

    The top salaries were given to male candidates. The average salary offered was also higher for males, and more male candidates were placed compared to female candidates.

    Correlation between academic percentages
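
    A sketch of the correlation heatmap over the numeric features:

        # Pairwise correlations between the percentage features and salary
        num_cols = ['ssc_p', 'hsc_p', 'degree_p', 'etest_p', 'mba_p', 'salary']
        sns.heatmap(df[num_cols].corr(), annot=True, cmap='coolwarm')
        plt.show()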

    Candidates who did well academically performed well throughout school, undergrad, the MBA, and even the employability test. However, these percentages don't have any influence on salary.

    Distribution of our data

    Candidates with high scores in higher secondary and undergrad got placed, and whoever got high scores in school got placed as well. Comparing the numbers, placed candidates also tend to have better MBA and employability test percentages.

    Data Preprocessing

    Now let's introduce our data to the models. Before jumping into creating models, we have to prepare the dataset. We don't have to perform imputation, as there are no longer any missing values, but we do have categorical variables that need to be encoded.

    Label Encoding

    We use the label encoder function for the categories that have only two classes.
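
    A minimal sketch, assuming gender, work experience, and placement status are the two-class columns:

        from sklearn.preprocessing import LabelEncoder

        # Map each binary category to 0/1
        le = LabelEncoder()
        for col in ['gender', 'workex', 'status']:
            df[col] = le.fit_transform(df[col])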

    One hot encoding

    We use the pandas get_dummies function for the categories that have more than two classes.
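
    A sketch of the one-hot step; dropping the first level avoids redundant columns, though that choice is mine, not necessarily the author's:

        # One-hot encode the multi-class categories
        df = pd.get_dummies(df, columns=['hsc_s', 'degree_t', 'specialisation'],
                            drop_first=True)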

    Assigning the target(y) and predictor variable(X)

    Our target is to find whether the candidate is placed or not. We use all the remaining features except 'salary', as it won't contribute to prediction: in a real-world scenario, students get a salary only after they get placed, so we can't use a future feature to predict something that happens in the present.
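
    A minimal sketch:

        # Target: placement status; predictors: everything except status and salary
        X = df.drop(columns=['status', 'salary'])
        y = df['status']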

    Train and Test Split (80:20)

    Next, we split the dataset into training and test sets in an 80:20 ratio.
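
    A sketch of the split; the random_state is an arbitrary choice for reproducibility:

        from sklearn.model_selection import train_test_split

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42)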

    Machine Learning Models

    Now let's feed our data to the models. Objective: predict whether a student got placed or not.

    Logistic Regression

    Let's fit a logistic regression model and check its accuracy.
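
    A minimal sketch of the fit, with max_iter raised as a precaution so the solver converges:

        from sklearn.linear_model import LogisticRegression

        logreg = LogisticRegression(max_iter=1000)
        logreg.fit(X_train, y_train)
        y_pred = logreg.predict(X_test)
        print('Accuracy:', logreg.score(X_test, y_test))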

    Confusion Matrix and Classification Report
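
    A sketch of how both are produced from the predictions above:

        from sklearn.metrics import confusion_matrix, classification_report

        print(confusion_matrix(y_test, y_pred))
        print(classification_report(y_test, y_pred))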

    The confusion matrix tells us that we have 9 + 26 correct predictions and 1 + 6 incorrect predictions. The classification report reveals a precision of 84%; precision measures the classifier's ability not to label as positive an instance that is actually negative. It is important to consider the precision value because when you are hiring, you want to avoid Type I errors at all costs; they are culture killers. In hiring, a false positive is when you THINK an employee is a good fit, but in actuality they're not.

    ROC Curve

    Let's check the performance of our model through the ROC curve.
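
    A minimal sketch of the curve, built from the positive-class probabilities:

        from sklearn.metrics import roc_curve, roc_auc_score

        y_prob = logreg.predict_proba(X_test)[:, 1]   # probability of 'placed'
        fpr, tpr, thresholds = roc_curve(y_test, y_prob)
        plt.plot(fpr, tpr, label='AUC = %.2f' % roc_auc_score(y_test, y_prob))
        plt.plot([0, 1], [0, 1], linestyle='--')      # chance line
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.legend()
        plt.show()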

    From the ROC curve we can infer that our logistic model classifies placed students correctly rather than producing false positives. The closer the ROC curve (blue) lies to the top-left corner, the better the model. We can choose 0.8 or 0.9 as the threshold value, which can reap us true positive results.

    Random Forest

    Let's use a random forest to build an aggregation of trees and produce more accurate results. We get an accuracy of 76%.
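
    A sketch of the forest; 100 trees and the fixed random_state are my assumptions:

        from sklearn.ensemble import RandomForestClassifier

        rf = RandomForestClassifier(n_estimators=100, random_state=42)
        rf.fit(X_train, y_train)
        print('Accuracy:', rf.score(X_test, y_test))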

    Looking at Feature Importance

    Let's see which features influence the decision the most, so that we can cut off the least important ones to make our model more accurate.
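
    A minimal sketch using the forest's built-in impurity-based importances:

        # Rank features from least to most important
        importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
        importances.plot(kind='barh', figsize=(8, 6))
        plt.title('Feature importances')
        plt.show()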

    As we can see, the school and undergrad specialisations have little influence on the classification. But it is really weird to see ssc_p as the most influential feature.

    Pruning out less important feature

    Let's cut off the less important features and check the model accuracy.
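
    A sketch of the pruning step; dropping the three weakest features is an illustrative choice, not necessarily the author's exact cut:

        # Drop the least important features and refit
        low_importance = importances.index[:3]
        rf_pruned = RandomForestClassifier(n_estimators=100, random_state=42)
        rf_pruned.fit(X_train.drop(columns=low_importance), y_train)
        print('Accuracy:', rf_pruned.score(X_test.drop(columns=low_importance), y_test))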

    Great. Now we have an accuracy of 81%, and the ROC value of 75% indicates that the model classifies better without making many false positive predictions.

    K Nearest Neighbours

    Let's try out a lazy supervised classification algorithm: our beloved KNN.
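
    A sketch of the K sweep; plotting the test error rate means the troughs mark promising K values:

        from sklearn.neighbors import KNeighborsClassifier

        # Try K from 1 to 20 and record the test error rate
        k_range = range(1, 21)
        error_rate = []
        for k in k_range:
            knn = KNeighborsClassifier(n_neighbors=k)
            knn.fit(X_train, y_train)
            error_rate.append(1 - knn.score(X_test, y_test))

        plt.plot(k_range, error_rate, marker='o')
        plt.xlabel('K')
        plt.ylabel('Error rate')
        plt.show()

        knn = KNeighborsClassifier(n_neighbors=5)   # the first trough
        knn.fit(X_train, y_train)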

    There are a lot of ups and downs in our graph. If we pick a value between 10 and 15, we may end up with a poorly fitted model, so let's stick with the first trough: our K value is 5.

    Our model has correctly classified 86% of the 'Not placed' category and 74% of the 'Placed' category. In numbers, that is 26 + 6 correct classifications and 1 + 9 false negatives and false positives. We should consider precision as our metric, because committing a false positive is very costly in recruitment.

    Naive Bayes Classifier with Cross Validation

    Let's use a Naive Bayes model for our dataset. Since our outcome feature is binary (1 = placed, 0 = not placed), we can go with the Bernoulli Naive Bayes algorithm, and we'll measure performance with cross-validation.
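
    A minimal sketch; the precision scoring follows the report below, but the 10-fold split is my assumption:

        from sklearn.naive_bayes import BernoulliNB
        from sklearn.model_selection import cross_val_score

        nb = BernoulliNB()
        cv_precision = cross_val_score(nb, X, y, cv=10, scoring='precision')
        print('CV precision:', cv_precision.mean())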

    Our cross validation precision is approximately 73.5%

    Support Vector Machine

    Let's use SVM to classify our output feature
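
    A sketch with scikit-learn's SVC and its default RBF kernel, which is an assumption on my part:

        from sklearn.svm import SVC
        from sklearn.metrics import classification_report

        svm = SVC()
        svm.fit(X_train, y_train)
        print(classification_report(y_test, svm.predict(X_test)))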

    We got 82% and 81% precision for the two classes: 9 + 25 correctly classified and 2 + 6 wrongly classified (false negatives and false positives).

    XGBoost

    Let's try out the state-of-the-art ensemble model XGBoost. We use RMSE as the metric for model performance.
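
    A sketch of the fit; computing RMSE on the 0/1 predictions mirrors the metric quoted below:

        import numpy as np
        from xgboost import XGBClassifier
        from sklearn.metrics import mean_squared_error

        xgb_clf = XGBClassifier(eval_metric='rmse')
        xgb_clf.fit(X_train, y_train)
        preds = xgb_clf.predict(X_test)
        print('RMSE:', np.sqrt(mean_squared_error(y_test, preds)))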

    The error value of our model is just 0.577. Now let's use cross-validation and try to reduce it further.

    XGBoost with Cross Validation

    In this step we use a DMatrix to convert our dataset into XGBoost's internal matrix format and produce the cross-validation output as a DataFrame.
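
    A minimal sketch of the cross-validated run; the booster parameters, fold count, and round count are illustrative assumptions:

        import xgboost as xgb

        dmatrix = xgb.DMatrix(data=X, label=y)   # XGBoost's optimized data structure
        params = {'objective': 'binary:logistic', 'max_depth': 4}
        cv_results = xgb.cv(params=params, dtrain=dmatrix, nfold=5,
                            num_boost_round=50, metrics='rmse',
                            as_pandas=True, seed=42)
        print(cv_results['test-rmse-mean'].tail(1))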

    We have reduced our model error to 0.41.

    Summary

    From the analysis of the Campus Recruitment dataset, here are my conclusions:

  • Educational percentages are highly influential in whether a candidate gets placed.
  • Past work experience doesn't have much influence on final placements in the master's program.
  • There is no apparent gender discrimination in hiring, but higher packages were given to male candidates.
  • Academic percentages have no relation to the salary package.