IBM/Coursera Machine Learning Professional Certificate:
Module 1: Exploratory Data Analysis Project
Over the course of two weeks, I completed the Exploratory Data Analysis final project for the IBM Machine Learning Professional Certificate. For context, the GitHub repository (hyperlinked) contains a PDF report, the working notebook with the code and visuals, and the CSV file. The CSV file comes from the following Kaggle dataset.
The original dataset contained 1,470 observations, 34 features, and a target, ‘Attrition’. The dataset was clean and contained no NULL values. A data dictionary was also provided to give context for individual labels: Education, Environment_Satisfaction, Job_Involvement, Job_Satisfaction, Performance_Rating, Relationship_Satisfaction, Work_Life_Balance, Job_Level, and Stock_Option_Level are on a scale from 1–5, Distance_From_Home is measured in kilometers, and Percent_Salary_Hike is the percent increase in salary compared to the previous year.
The initial plan for data exploration
- Check for missing data (NULLS).
- Review columns to identify what is needed or not.
- Review column names and understand what each label means.
- Make plots for initial insights.
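A minimal sketch of the first two checks in this plan, assuming pandas. The tiny DataFrame is a made-up stand-in for the Kaggle CSV (the values are not from the real data), and flagging single-value columns is only one possible heuristic for deciding what is not needed.

```python
import pandas as pd

# Made-up stand-in for the Kaggle CSV; values are for illustration only.
df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
    "DistanceFromHome": [1, 8, 2, 3],
    "EmployeeCount": [1, 1, 1, 1],
})

# Count missing values per column; all zeros means there are no NULLs.
null_counts = df.isnull().sum()
print(null_counts)

# Columns with a single unique value cannot help explain the target.
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
print(constant_cols)
```

With the real file, the DataFrame would come from `pd.read_csv` on the downloaded CSV instead of the inline dictionary.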
Actions taken for data cleaning and feature engineering
- Clean column names.
- Review target variable against other features (‘Attrition’ vs x) and create plots against ‘Attrition’.
- Remove columns: ‘EmployeeCount’, ‘Over18’, ‘StandardHours’, ‘EmployeeNumber’
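The cleaning steps above can be sketched as follows, assuming pandas. The helper `add_underscores` is a hypothetical name, and the regular expression is only a guess at how the CamelCase names in the CSV could be turned into the underscore style used in this post; the sample rows are made up.

```python
import re
import pandas as pd

# Made-up rows using the original CamelCase column names.
df = pd.DataFrame({
    "DistanceFromHome": [1, 8],
    "PercentSalaryHike": [11, 23],
    "EmployeeCount": [1, 1],
    "Over18": ["Y", "Y"],
    "StandardHours": [80, 80],
    "EmployeeNumber": [1, 2],
    "Attrition": ["Yes", "No"],
})

# Drop the four columns listed above.
drop_cols = ["EmployeeCount", "Over18", "StandardHours", "EmployeeNumber"]
df = df.drop(columns=drop_cols)

def add_underscores(name):
    """Insert '_' wherever a lowercase letter or digit meets an uppercase letter."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)

# Apply the renaming function to every remaining column.
df = df.rename(columns=add_underscores)
print(list(df.columns))
```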
Key Findings and Insights
- ‘Attrition’ was at 66.67% for 19-year-olds; the rate then drops sharply until employees reach the age of 58, where it jumps to 35.71%.
- When ‘Distance_From_Home’ was 25 kilometers, the ‘Attrition’ rate was 42.85%.
- When ‘Percent_Salary_Hike’ was 24 (24% increase from previous year) ‘Attrition’ rate was 28.57%.
- When ‘Total_Working_Years’ was 0, the ‘Attrition’ rate was 45.45%; at 1 it was 49.38%; and at 40 it was 100%, likely due to retirement.
- When ‘Training_Time_Last_Year’ was 0 the ‘Attrition’ rate was 27.78%.
- When ‘Year_At_Company’ was 0, the ‘Attrition’ rate was 36.36%; at 1 it was 34.50%.
- Interestingly, employees who worked ‘Overtime’ had an ‘Attrition’ rate of 30.52%, compared to 10.43% for those who did not.
- Employees in the ‘Travel_Frequently’ category had an ‘Attrition’ rate of 24.90%.
- Not surprisingly, the ‘Job_Role’ with the highest ‘Attrition’ rate was Sales Rep, at 39.75%.
- Employees with a ‘Marital_Status’ of single had an ‘Attrition’ rate of 25.53%.
- ‘Education_Field’ of Human Resources has an ‘Attrition’ rate of 25.92%.
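Rates like the ones above can be computed with a group-by. The sketch below assumes pandas and a target coded as "Yes"/"No"; the function name `attrition_rate_by` and the sample rows are hypothetical, not taken from the notebook. It also reports the group size, since a rate for a value that only a handful of employees share (such as a single age) rests on very few observations.

```python
import pandas as pd

# Made-up sample; the real notebook would use the full cleaned dataset.
df = pd.DataFrame({
    "Overtime": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
    "Attrition": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "No"],
})

def attrition_rate_by(frame, feature):
    """Return the attrition rate (in percent) and group size per feature value."""
    # 1.0 where the employee left, 0.0 otherwise, so the mean is the rate.
    left = frame["Attrition"].eq("Yes").astype(float)
    summary = left.groupby(frame[feature]).agg(rate_pct="mean", n="size")
    summary["rate_pct"] = (summary["rate_pct"] * 100).round(2)
    return summary.sort_values("rate_pct", ascending=False)

summary = attrition_rate_by(df, "Overtime")
print(summary)
```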
Formulating at least 3 hypotheses about this data
1) - H0: µ Education_Attrition == µ Education_Not_Attrition
   - Ha: µ Education_Attrition != µ Education_Not_Attrition
2) - H0: µ Age_Attrition == µ Age_Not_Attrition
   - Ha: µ Age_Attrition != µ Age_Not_Attrition
3) - H0: µ Job_Satisfaction_Attrition == µ Job_Satisfaction_Not_Attrition
   - Ha: µ Job_Satisfaction_Attrition != µ Job_Satisfaction_Not_Attrition
Conducting a formal significance test for one of the hypotheses and discussing the results
The Kruskal-Wallis H-test tests the null hypothesis that the population medians of all of the groups are equal. It is a non-parametric version of ANOVA. The test works on 2 or more independent samples, which may have different sizes. Note that rejecting the null hypothesis does not indicate which of the groups differs. Post hoc comparisons between groups are required to determine which groups are different.
import scipy.stats as ss  # assumed: the ss alias refers to scipy.stats
ss.kruskal(attrition_df['Education'], not_attrition_df['Education'])
KruskalResult(statistic=1.3527640913093548, pvalue=0.2447954753326153)
pvalue > 0.05
Therefore, we fail to reject the null hypothesis: there appears to be no statistically significant relationship between Attrition and Education.
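The same test could be repeated for the other hypotheses. The sketch below is one way to wrap it, assuming scipy and pandas, a target coded as "Yes"/"No", and a 0.05 significance level; the function name `kruskal_by_attrition` and the ages are made up for illustration and are not results from the real dataset.

```python
import pandas as pd
import scipy.stats as ss

# Made-up ages, chosen only so that the two groups clearly differ.
df = pd.DataFrame({
    "Attrition": ["Yes"] * 5 + ["No"] * 5,
    "Age": [22, 25, 28, 30, 31, 35, 38, 41, 45, 50],
})

def kruskal_by_attrition(frame, feature, alpha=0.05):
    """Run the Kruskal-Wallis H-test on one feature, split by Attrition."""
    attrition_df = frame[frame["Attrition"] == "Yes"]
    not_attrition_df = frame[frame["Attrition"] == "No"]
    result = ss.kruskal(attrition_df[feature], not_attrition_df[feature])
    # Reject H0 only when the p-value falls below the chosen alpha.
    decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
    return result, decision

result, decision = kruskal_by_attrition(df, "Age")
print(result.statistic, result.pvalue, decision)
```

Wrapping the test in a function keeps the split into the two groups and the decision rule identical across features, so the results for different hypotheses stay comparable.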
Next steps in analyzing this data
- Continue testing hypotheses to home in on the true indicators of Attrition, along with their magnitude.
- Create models to uncover which features contribute most to Attrition.
A paragraph that summarizes the quality of this data set and a request for additional data if needed
Once again, this dataset was incredibly clean, with no nulls or unusual values, and a data dictionary was provided to give context for each individual column. Additional data that would provide more insight is the reason for Attrition. Did the individual seek further education, did they pursue a job within a completely different industry, or did they make a lateral move?