7.7. SF Salaries Project Exercise - Questions#
We will be using the SF Salaries Dataset from Kaggle to practice data analysis.
Use the dataset provided by the instructor.
import pandas as pd
Read Salaries.csv
into a DataFrame variable and call it sal.
sal = pd.read_csv('../../data/Salaries.csv')
### don't worry about the warning below.
Check the head of the DataFrame.
sal.head()
Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | NATHANIEL FORD | GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY | 167411.18 | 0.00 | 400184.25 | NaN | 567595.43 | 567595.43 | 2011 | NaN | San Francisco | NaN |
1 | 2 | GARY JIMENEZ | CAPTAIN III (POLICE DEPARTMENT) | 155966.02 | 245131.88 | 137811.38 | NaN | 538909.28 | 538909.28 | 2011 | NaN | San Francisco | NaN |
2 | 3 | ALBERT PARDINI | CAPTAIN III (POLICE DEPARTMENT) | 212739.13 | 106088.18 | 16452.60 | NaN | 335279.91 | 335279.91 | 2011 | NaN | San Francisco | NaN |
3 | 4 | CHRISTOPHER CHONG | WIRE ROPE CABLE MAINTENANCE MECHANIC | 77916.00 | 56120.71 | 198306.90 | NaN | 332343.61 | 332343.61 | 2011 | NaN | San Francisco | NaN |
4 | 5 | PATRICK GARDNER | DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) | 134401.60 | 9737.00 | 182234.59 | NaN | 326373.19 | 326373.19 | 2011 | NaN | San Francisco | NaN |
Use the .info()
method to find out how many entries there are.
sal.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 148654 non-null int64
1 EmployeeName 148654 non-null object
2 JobTitle 148654 non-null object
3 BasePay 148045 non-null float64
4 OvertimePay 148650 non-null float64
5 OtherPay 148650 non-null float64
6 Benefits 112491 non-null float64
7 TotalPay 148654 non-null float64
8 TotalPayBenefits 148654 non-null float64
9 Year 148654 non-null int64
10 Notes 0 non-null float64
11 Agency 148654 non-null object
12 Status 0 non-null float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB
Q01. What is the average BasePay?
Q02. What is the highest amount of OvertimePay in the dataset?
Q03 What is the job title of JOSEPH DRISCOLL?
Note: Use all caps; otherwise, you may get an answer that doesn’t match up (there is also a lowercase Joseph Driscoll).
Q04 How much does JOSEPH DRISCOLL make (including benefits)?
Q05. What is the name of the highest paid person (including benefits)?
### or
### sal.loc[sal['TotalPayBenefits'].idxmax()]
Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | NATHANIEL FORD | GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY | 167411.18 | 0.0 | 400184.25 | NaN | 567595.43 | 567595.43 | 2011 | NaN | San Francisco | NaN |
Q06. What is the name of the lowest-paid person (including benefits)? Do you notice something strange about how much he or she is paid?
### or
### sal.loc[sal['TotalPayBenefits'].idxmax()]['EmployeeName']
### IT'S NEGATIVE!! VERY STRANGE
Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
148653 | 148654 | Joe Lopez | Counselor, Log Cabin Ranch | 0.0 | 0.0 | -618.13 | 0.0 | -618.13 | -618.13 | 2014 | NaN | San Francisco | NaN |
Q07. What was the average (mean) BasePay of all employees per year (2011-2014)?
You may need to drop some string columns here (drop([‘EmployeeName’, ‘JobTitle’, ‘Agency’], axis=1)).
Q08. How many unique job titles are there?
Q09. What are the top 5 most common jobs?
(Check the Pandas website . Hint: use value_counts()
)
Q10. How many Job Titles were represented by only one person in 2013? (e.g., Job Titles with only one occurrence in 2013)