SF Salaries Project Exercise - Questions

7.7. SF Salaries Project Exercise - Questions#

We will be using the SF Salaries Dataset from Kaggle to practice data analysis.

Use the dataset provided by the instructor.

import pandas as pd

Read Salaries.csv into a DataFrame variable and call it sal.

sal = pd.read_csv('../../data/Salaries.csv')

### don't worry about the warning below.

Check the head of the DataFrame.

sal.head()
Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.00 400184.25 NaN 567595.43 567595.43 2011 NaN San Francisco NaN
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 NaN 538909.28 538909.28 2011 NaN San Francisco NaN
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.60 NaN 335279.91 335279.91 2011 NaN San Francisco NaN
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.00 56120.71 198306.90 NaN 332343.61 332343.61 2011 NaN San Francisco NaN
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.60 9737.00 182234.59 NaN 326373.19 326373.19 2011 NaN San Francisco NaN

Use the .info() method to find out how many entries there are.

sal.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Id                148654 non-null  int64  
 1   EmployeeName      148654 non-null  object 
 2   JobTitle          148654 non-null  object 
 3   BasePay           148045 non-null  float64
 4   OvertimePay       148650 non-null  float64
 5   OtherPay          148650 non-null  float64
 6   Benefits          112491 non-null  float64
 7   TotalPay          148654 non-null  float64
 8   TotalPayBenefits  148654 non-null  float64
 9   Year              148654 non-null  int64  
 10  Notes             0 non-null       float64
 11  Agency            148654 non-null  object 
 12  Status            0 non-null       float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB

Q01. What is the average BasePay?

Q02. What is the highest amount of OvertimePay in the dataset?

Q03 What is the job title of JOSEPH DRISCOLL?

Note: Use all caps; otherwise, you may get an answer that doesn’t match up (there is also a lowercase Joseph Driscoll).

Q04 How much does JOSEPH DRISCOLL make (including benefits)?

Q05. What is the name of the highest paid person (including benefits)?

### or
### sal.loc[sal['TotalPayBenefits'].idxmax()]
Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.0 400184.25 NaN 567595.43 567595.43 2011 NaN San Francisco NaN

Q06. What is the name of the lowest-paid person (including benefits)? Do you notice something strange about how much he or she is paid?

### or
### sal.loc[sal['TotalPayBenefits'].idxmax()]['EmployeeName']

### IT'S NEGATIVE!! VERY STRANGE
Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
148653 148654 Joe Lopez Counselor, Log Cabin Ranch 0.0 0.0 -618.13 0.0 -618.13 -618.13 2014 NaN San Francisco NaN

Q07. What was the average (mean) BasePay of all employees per year (2011-2014)?

You may need to drop some string columns here (drop([‘EmployeeName’, ‘JobTitle’, ‘Agency’], axis=1)).

Q08. How many unique job titles are there?

Q09. What are the top 5 most common jobs?

(Check the Pandas website . Hint: use value_counts())

Q10. How many Job Titles were represented by only one person in 2013? (e.g., Job Titles with only one occurrence in 2013)