Data analysis and visualization project using survey data to examine the gender pay gap.
Code on GitHub.
Gender Pay Gap Analysis
Lauren Renaud
December 15, 2015
Project Scope:
Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, profession, criminal history, marriage status, etc.)?
Data Summary
To analyze this question, we’ll use the National Longitudinal Survey of Youth, 1997 cohort data set. This dataset comprises about 9000 youth who were first interviewed in 1997 and then re-interviewed in the following years. It is a longitudinal study of respondents’ transition from their teenage to adult years. This gives us an opportunity to look at how income and gender intersect with other factors, particularly from the teenage years.
Note: The most recent survey data is from 2011. Any references to “last year” refer to the year prior to the survey, 2010.
First we should know the count of respondents, broken down by gender. We have 4385 females and 4599 males in the survey.
Next we’ll look at the mean income from last year, broken down by gender.
Gender | Income |
---|---|
female | 29997.82 |
male | 37911.57 |
We can see here already that the average income from last year for women is lower than it is for men by $7913.75; put another way, women are making 79.13% of what men make, on average. We’ll go into more detail about the statistical significance of the difference later.
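The two summary figures quoted here follow directly from the table of means. As a minimal sketch (the variable names are illustrative and not taken from the original analysis code), the arithmetic is:

```python
# Mean income last year by gender, copied from the table above.
mean_income = {"female": 29997.82, "male": 37911.57}

# Absolute gap in dollars.
abs_gap = mean_income["male"] - mean_income["female"]

# Women's mean income expressed as a percentage of men's mean income.
pct_of_male = 100 * mean_income["female"] / mean_income["male"]

print(f"absolute gap: ${abs_gap:.2f}")
print(f"women's mean income as a share of men's: {pct_of_male:.2f}%")
```

The same two calculations produce the Abs Diff and % Diff columns in the tables later in the report.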
If we look at boxplots of income broken down by gender, we see that the interquartile range for women is lower than that of men, in addition to the lower mean. The outliers for the top earning women catch up with the outliers for the top earning men.
Note: At this point in the data summary we are excluding the top coded values. The rationale and further analysis regarding top coded values will be explained later in the report.
Now that we’ve looked at the data and observed a difference, we can run a t-test to find the statistical significance of this difference.
##
## Welch Two Sample t-test
##
## data: survey$income.lastyr by survey$gender
## t = -11.4488, df = 5229.984, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
## -9694.893 -6132.607
## sample estimates:
## mean in group female mean in group male
## 29997.82 37911.57
At a 99% confidence level, we find a p-value that is effectively 0 (less than 2.2e-16), indicating that the difference between mean male and female income is very unlikely to be attributable to random chance.
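The t-test output labels its interval as a 99 percent confidence interval. One hedged way to sanity-check that label using only the printed numbers is to recover the implied standard error from the t statistic and compare the interval's half-width against normal quantiles. This is a sketch that assumes the normal approximation is adequate at roughly 5230 degrees of freedom; it is not part of the original analysis.

```python
from statistics import NormalDist

# Values copied from the Welch t-test output above.
t_stat = -11.4488
ci_low, ci_high = -9694.893, -6132.607

# The interval is symmetric around the estimated difference in means.
center = (ci_low + ci_high) / 2
half_width = (ci_high - ci_low) / 2

# Standard error implied by the estimate and the t statistic.
std_err = center / t_stat

# Critical multiplier implied by the interval, compared with normal
# quantiles (an approximation to the t quantiles at large df).
multiplier = half_width / std_err
z_95 = NormalDist().inv_cdf(0.975)
z_99 = NormalDist().inv_cdf(0.995)

print(f"estimated difference: {center:.2f}")
print(f"implied multiplier: {multiplier:.3f}")
print(f"normal quantiles: 95% -> {z_95:.3f}, 99% -> {z_99:.3f}")
```

The implied multiplier sits next to the 99% quantile rather than the 95% one, which agrees with the label in the output.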
So, to begin, we can say that yes, there is a significant difference in income between men and women. We will now consider the impact of other factors.
Race
The mean income from last year, broken down by gender and race, gives us a starting point for exploring other factors that may contribute to the wage gap. This table displays average income by gender and race, followed by the absolute difference and then women’s mean income as a percentage of men’s for each race in the survey. We can see that the gender wage gap exists for all racial categories in the survey, though by varying amounts. Looking at the boxplots, there appears to be less of a difference between the income means by gender for black respondents than for other races. There’s a large difference in the means for mixed race people, but there are also only 83 respondents coded as mixed race, making up only 0.92% of the survey. This small sample size makes it difficult to draw inferences about this group.
Race | Female | Male | Abs Diff | % Diff |
---|---|---|---|---|
hispanic | 26314.59 | 34099.99 | 7785.391 | 77.17 |
mixed | 30814.29 | 38714.29 | 7900.000 | 79.59 |
white | 30928.24 | 36671.09 | 5742.849 | 84.34 |
black | 25493.25 | 28109.43 | 2616.181 | 90.69 |
Industry
Now we can also look at mean income by gender and industry. Again we see women making less, on average, than men across most of the categories.
The mean income for women is actually greater than that for men in `acs special codes`. As with the breakdowns by race, though, it is worth noting that `acs special codes` accounts for only 10 respondents, so a single outlier or a small number of outliers may be skewing this figure.
Industry | Female | Male | Abs Diff | % Diff |
---|---|---|---|---|
mining | 29000.00 | 51600.00 | 22600.000 | 56.20 |
agr forest fish | 21946.15 | 38904.76 | 16958.608 | 56.41 |
active military | 30000.00 | 52684.21 | 22684.211 | 56.94 |
utilities | 34725.56 | 51880.95 | 17155.397 | 66.93 |
construction | 27704.76 | 34723.43 | 7018.664 | 79.79 |
retail trade | 23575.64 | 29240.15 | 5664.508 | 80.63 |
other public services | 23123.93 | 28496.95 | 5373.027 | 81.15 |
public admin | 39580.64 | 47602.89 | 8022.247 | 83.15 |
transport warehouse | 30202.03 | 36137.34 | 5935.314 | 83.58 |
entertain accom food | 20168.33 | 23842.41 | 3674.081 | 84.59 |
fin insure real estate | 35392.19 | 41608.47 | 6216.283 | 85.06 |
manufacturing | 33301.85 | 37805.65 | 4503.798 | 88.09 |
edu health social | 30128.16 | 33902.60 | 3774.441 | 88.87 |
professional | 30122.45 | 32840.23 | 2717.785 | 91.72 |
wholesale trade | 32056.84 | 34919.99 | 2863.149 | 91.80 |
info comm | 37027.68 | 38044.50 | 1016.823 | 97.33 |
acs special codes | 36500.00 | 21333.33 | -15166.667 | 171.09 |
If we look at boxplots of the distribution of income by gender and industry, we can make some other important observations. The sample of female active duty military appears to be very small, and the data confirms it is just 1 respondent. The lower quartile for men in `mining` and `utilities` is above the upper quartile for women in those industries, while for `agr forest fish` the quartiles don’t even overlap. The differences seem smaller for `professional`, `wholesale trade`, `entertain accom food`, and `edu health social`.
Methodology
Missing Values
When bringing in and initially coding the data, I excluded missing values from numeric variables. While we could possibly make some assumptions about someone who, for example, did not know their income from last year, it is very difficult to act on those assumptions when computing numeric values.
Unfortunately, we were missing last year’s income data for 40.98% of respondents. While that is a large share of our dataset, it still leaves 5302 respondents, which is a large sample size. The same goes for `industry`, where we were missing 31.37% but still have 6166 answers to analyze.
This does introduce a limit into the data, but for the most part the number of missing values was not too great.
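The missing-value percentages can be reproduced from the counts given in the report. The sketch below assumes the total number of respondents is the sum of the female and male counts from the data summary (4385 + 4599); the function name is illustrative.

```python
# Total respondents, assumed to be females + males from the data summary.
TOTAL_RESPONDENTS = 4385 + 4599


def pct_missing(n_present, total=TOTAL_RESPONDENTS):
    """Return the percentage of respondents with no usable value."""
    return 100 * (total - n_present) / total


# Counts of usable answers quoted in the text.
income_missing = pct_missing(5302)
industry_missing = pct_missing(6166)

print(f"income missing: {income_missing:.2f}%")
print(f"industry missing: {industry_missing:.2f}%")
```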
For categorical values, responses like `valid skip` and `non-interview` were coded into the analysis as `NA`, while in most cases `refusal` and `don't know` were kept as their own categories. The `refusal` and `don't know` values were ignored for variables where they made up a small share of responses, but were analyzed further where they made up a more significant proportion of responses.
Topcoded Values
For the most part, I removed topcoded values.
One instance where it made a difference was in looking at average income of men and women by industry. The averages by industry were displayed above in the data summary. The table below displays the industry, then mean female and male salary and the absolute difference and percent difference for means, all excluding topcoded values, followed by the absolute and percent differences if you include the topcoded values. The final column finds the difference in percentage points between the means that included the topcoded values and those that did not. This table is sorted by the final column.
Industry | Female | Male | Excld Diff | Excld % Diff | W Diff | W % Diff | Differences |
---|---|---|---|---|---|---|---|
info comm | 37027.68 | 38044.50 | 1016.823 | 97.33 | -3979.351 | 106.22 | -8.89 |
active military | 30000.00 | 52684.21 | 22684.211 | 56.94 | 21433.626 | 61.83 | -4.89 |
utilities | 34725.56 | 51880.95 | 17155.397 | 66.93 | 11326.222 | 70.98 | -4.05 |
fin insure real estate | 35392.19 | 41608.47 | 6216.283 | 85.06 | 5987.251 | 85.22 | -0.16 |
agr forest fish | 21946.15 | 38904.76 | 16958.608 | 56.41 | 16958.608 | 56.41 | 0.00 |
acs special codes | 36500.00 | 21333.33 | -15166.667 | 171.09 | -15166.667 | 171.09 | 0.00 |
wholesale trade | 32056.84 | 34919.99 | 2863.149 | 91.80 | 3723.681 | 91.23 | 0.57 |
professional | 30122.45 | 32840.23 | 2717.785 | 91.72 | 3534.962 | 90.79 | 0.93 |
transport warehouse | 30202.03 | 36137.34 | 5935.314 | 83.58 | 6830.363 | 82.44 | 1.14 |
entertain accom food | 20168.33 | 23842.41 | 3674.081 | 84.59 | 6803.854 | 82.92 | 1.67 |
manufacturing | 33301.85 | 37805.65 | 4503.798 | 88.09 | 4229.323 | 86.13 | 1.96 |
edu health social | 30128.16 | 33902.60 | 3774.441 | 88.87 | 4674.081 | 86.82 | 2.05 |
other public services | 23123.93 | 28496.95 | 5373.027 | 81.15 | 6256.523 | 78.71 | 2.44 |
public admin | 39580.64 | 47602.89 | 8022.247 | 83.15 | 9830.407 | 80.53 | 2.62 |
retail trade | 23575.64 | 29240.15 | 5664.508 | 80.63 | 6005.371 | 77.06 | 3.57 |
mining | 29000.00 | 51600.00 | 22600.000 | 56.20 | 27350.100 | 52.31 | 3.89 |
construction | 27704.76 | 34723.43 | 7018.664 | 79.79 | 14555.482 | 71.76 | 8.03 |
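The final column of the table above is the percentage with topcoded values excluded minus the percentage with them included, in percentage points. A minimal sketch for three of the rows (the tuple layout and function name are illustrative, not from the original code):

```python
# (industry, women's % of men's mean excluding topcoded values,
#  women's % of men's mean including topcoded values), from the table above.
rows = [
    ("info comm", 97.33, 106.22),
    ("construction", 79.79, 71.76),
    ("mining", 56.20, 52.31),
]


def pct_point_change(excl_pct, incl_pct):
    """Percentage-point change caused by including topcoded values."""
    return round(excl_pct - incl_pct, 2)


changes = {name: pct_point_change(excl, incl) for name, excl, incl in rows}

for name, change in changes.items():
    print(f"{name}: {change:+.2f} percentage points")
```

A negative value means women's share of men's mean income rises once topcoded values are included; a positive value means it falls.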
Focusing only on the `Differences` column, we can see that some industries change substantially depending on whether the topcoded values are included: `info comm` and `construction` in particular, at -8.89 and 8.03 percentage points respectively, followed by `active military`, `utilities`, and `mining`. The next question is how much these changes matter to our analysis.
Row | Industry | Count | % Respondents |
---|---|---|---|
r1 | construction | 430 | 4.79 |
r2 | info comm | 149 | 1.66 |
r3 | mining | 45 | 0.5 |
r4 | utilities | 38 | 0.42 |
Construction workers make up 4.79% of the respondents, which is a fair amount, while each of the other industries accounts for only a small portion of our sample.
Unexpected Variables That Had No Connection & Other Relationships
I had expected the gender income gap to vary with drug use, but it did not differ much.
I also thought there might be a difference in income by gender based on household income growing up: that wealthier households would set men up to be wealthier to a greater extent than women. However, it appears that greater household income as a teenager means greater income as an adult, while the difference by gender stays about steady, as seen in the graph below. In order to do this analysis I had to exclude some low, negative household income values that I think may have been erroneously entered.
Based on my finding that there is a relationship between weight and the income gap (more below), I suspected there might be a relationship between how respondents evaluated their own weight and income. However, we don’t see much difference across different answers to this question, other than for men who consider themselves very underweight compared to women in the same category.
Chosen Analysis
I then chose to look at three variables: `race`, `weight` and `marital status`. As we saw earlier in the data summary, there are observable differences in the pay gap broken down by race, and we can use more precise methods to explore this further. Weight is a great variable in this dataset because of very low missingness. Marital status is interesting because the different levels of the variable affect predicted incomes differently, in some cases increasing the gap and in some cases decreasing it, as we’ll see further on in the analysis.
Findings
To begin, we must know if our data is nearly normal, first looking at the dataset with the topcoded values included:
and then with the topcoded values removed:
Including the topcoded values gives us a strange anomaly at the top of the normal Q-Q plot, so we cannot use that version of the data for our analysis at this point. While the values excluding the topcoded values are not perfectly normal, they’re close enough that we can run our analyses.
For starters, I ran a t-test to determine if the difference in average income between men and women is statistically significant, and with a p-value of effectively zero, we can say that yes, it is. Based on the coefficients from a linear model, we can predict that, all other variables held constant, a man will make $5714.78 more per year than a woman.
Race
We can use linear regression to explore how race and gender together relate to income.
Term | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 24079.427 | 799.810 | 30.106 | 0.000 |
racehispanic | 3133.466 | 1037.915 | 3.019 | 0.003 |
racemixed | 11125.043 | 3587.688 | 3.101 | 0.002 |
racewhite | 9660.096 | 863.439 | 11.188 | 0.000 |
gendermale | 7589.201 | 690.730 | 10.987 | 0.000 |
Looking at these numbers, the intercept, $24079.43, is the predicted income for our baseline group, which is black females. We can use this model to predict income based on gender and race. Females of other races get the `intercept` plus the coefficient for their race, and males get the `intercept` plus the coefficient for their race and the coefficient for `gendermale`. For example, we predict that a black male makes the baseline plus the `gendermale` coefficient, which is 7589.2, so we predict that a black male makes $31668.63. Our model predicts that a white female makes the `intercept` plus the coefficient for white, or $33739.52.
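The prediction rule described above can be written out as a short sketch. The coefficients are copied from the regression table; the dictionary and function names are illustrative and are not the original analysis code.

```python
# Coefficients copied from the regression table above.
# Baseline group (no race or gender term added): black, female.
COEF = {
    "(Intercept)": 24079.427,
    "racehispanic": 3133.466,
    "racemixed": 11125.043,
    "racewhite": 9660.096,
    "gendermale": 7589.201,
}


def predict_income(race, gender):
    """Predicted income from the additive race + gender model.

    Levels without a coefficient (the baseline "black" and "female")
    contribute nothing beyond the intercept.
    """
    total = COEF["(Intercept)"]
    total += COEF.get(f"race{race}", 0.0)
    total += COEF.get(f"gender{gender}", 0.0)
    return total


print(f"black male:   ${predict_income('black', 'male'):.2f}")
print(f"white female: ${predict_income('white', 'female'):.2f}")
```

Because this model has no interaction term, the predicted gender gap is the same `gendermale` amount for every race.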
We can look at t-tests of the income differences subsetted by race to explore this difference further. We find p-values of 0.00178 for black respondents, approximately 0 for white respondents, approximately 0 for hispanic respondents, and 0.05089 for mixed race respondents. The higher p-value for the mixed race category is consistent with our initial concern that, while graphically there appears to be a very large difference in these means, the small sample size may be skewing that result. We also see a slightly higher p-value for the black pay gap than for white or hispanic respondents, consistent with our earlier observation that the gap for black respondents is smaller.
Weight
It appears that as men weigh more, we can predict that they make slightly more money, while we predict that heavier women make less, as we can see in this linear regression.
The data here has been subsetted to eliminate some extremely low and extremely high values that I am assuming were entry errors.
The findings appear to be similar for the weights that were reported in 2004, so it is not worth doing a separate analysis of those values.
Based on looking at the regression model, I hypothesized that weight is a factor that impacts the difference in income between men and women. I then used a linear model to find a more precise prediction.
Term | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 35433.771 | 1471.582 | 24.079 | 0.000 |
gendermale | -4368.771 | 2248.270 | -1.943 | 0.052 |
weight.2011 | -40.810 | 8.559 | -4.768 | 0.000 |
gendermale:weight.2011 | 57.768 | 11.931 | 4.842 | 0.000 |
Unexpectedly, this model produces a negative coefficient for `gendermale`. Note that in a model with an interaction term this coefficient is the predicted male-female difference at a weight of zero pounds, which lies far outside the observed data. There were also a small number of very small values that skewed the data, including some negative values, and we will ignore these outliers. Either way, for every pound heavier that a man is, we can predict he will make $16.96 more (the `weight.2011` coefficient plus the `gendermale:weight.2011` coefficient), but for every pound heavier that a woman is, we predict she’ll make $40.81 less (just the `weight.2011` coefficient).
To give an example, if we look at a 180lb man and a 180lb woman and hold all other variables constant, we predict he’ll make the `intercept` plus the coefficient for `gendermale` + (180lb) * (`weight.2011` + `gendermale:weight.2011`), or $34117.59. We predict a woman of the same weight will make the `intercept` + 180lb * `weight.2011`, or $28088.05.
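The worked example above can be reproduced from the coefficient table. This is a sketch with illustrative names; because the table rounds the coefficients to three decimals, the results agree with the figures in the text only to within a few cents.

```python
# Coefficients copied from the gender * weight regression table above.
INTERCEPT = 35433.771
GENDER_MALE = -4368.771
WEIGHT = -40.810               # weight.2011
MALE_X_WEIGHT = 57.768         # gendermale:weight.2011


def predict_income(gender, weight_lb):
    """Predicted income from the model with a gender-by-weight interaction."""
    male = 1 if gender == "male" else 0
    return (INTERCEPT
            + male * GENDER_MALE
            + weight_lb * WEIGHT
            + male * weight_lb * MALE_X_WEIGHT)


def slope_per_pound(gender):
    """Predicted change in income for each additional pound."""
    return WEIGHT + (MALE_X_WEIGHT if gender == "male" else 0.0)


# Weight at which the predicted male and female incomes are equal.
# Below this weight the model predicts men earn less than women, which is
# why the gendermale coefficient (the gap at 0 lb) comes out negative.
break_even_lb = -GENDER_MALE / MALE_X_WEIGHT

print(f"180lb man:   ${predict_income('male', 180):.2f}")
print(f"180lb woman: ${predict_income('female', 180):.2f}")
print(f"slope, men:   {slope_per_pound('male'):.2f} per lb")
print(f"slope, women: {slope_per_pound('female'):.2f} per lb")
print(f"break-even weight: {break_even_lb:.1f} lb")
```

The break-even weight falls well below typical adult weights, so over the realistic range of the data the model still predicts higher incomes for men.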
Lastly, we should test whether our model using weight and gender is significantly more predictive than the model using gender alone. Running an ANOVA comparing the two models, we find a p-value of effectively 0, so we can say that yes, it is a significantly better model.
To look at the data even more precisely, let’s explore non-linear options for models.
We can see that while the relationship between weight and income last year stays fairly linear for female respondents, there is more variability in the curve of the relationship for men. An area for further research may be to normalize this data against average weights for men and women and then look at the relationship between income last year and weight relative to the average weight for each gender.
Marital Status
Next we’ll look at the impact of marital status on income. First, though, we should check the sample size in each category before moving further in our analysis.
Row | Marital Status | Count | % Respondents |
---|---|---|---|
r1 | never married | 4148 | 46.17 |
r2 | married | 2648 | 29.47 |
r3 | NA | 1561 | 17.38 |
r4 | divorced | 479 | 5.33 |
r5 | separated | 114 | 1.27 |
r6 | invalid skip | 23 | 0.26 |
r7 | widowed | 11 | 0.12 |
We can see that we have a very small sample of `widowed` respondents: 11 in the table above, and only 7 on closer inspection of the data. Based on this sample size, we cannot make many inferences about widowed people from this data. There is also a small, though much larger, number of `separated` individuals. We should keep these sizes in mind as we move through the analysis. We are missing values for 17.63% of respondents (`NA` plus `invalid skip`, or 1584 of 8984), but that is not so high that we should not continue with the analysis.
So now let’s look at income by gender and marital status.
From this plot we can see that there appears to be the smallest difference between men and women who have never been married and the greatest difference between those who are separated, and a somewhat similar gender wage gap between those who are married or divorced.
Term | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 25937.810 | 1352.605 | 19.176 | 0.000 |
gendermale | 7676.526 | 2119.418 | 3.622 | 0.000 |
marital.statusinvalid skip | -2437.810 | 6813.560 | -0.358 | 0.721 |
marital.statusmarried | 5029.178 | 1480.575 | 3.397 | 0.001 |
marital.statusnever married | 1500.417 | 1454.444 | 1.032 | 0.302 |
marital.statusseparated | -3753.834 | 3213.073 | -1.168 | 0.243 |
marital.statuswidowed | -12826.810 | 9540.423 | -1.344 | 0.179 |
gendermale:marital.statusinvalid skip | -5176.526 | 15082.017 | -0.343 | 0.731 |
gendermale:marital.statusmarried | 2375.623 | 2281.890 | 1.041 | 0.298 |
gendermale:marital.statusnever married | -5144.971 | 2238.898 | -2.298 | 0.022 |
gendermale:marital.statusseparated | 2114.563 | 4949.205 | 0.427 | 0.669 |
gendermale:marital.statuswidowed | 8545.808 | 14580.885 | 0.586 | 0.558 |
The baseline for this model is `divorced` and `female`. We can see from these coefficients that we predict that, compared to a divorced woman, women who are married make $5029.18 more, never married women make $1500.42 more, and separated women make $3753.83 less.
On the other hand, comparing men with women of the same marital status, we predict that married men make $10052.15 more than married women, never married men make $2531.55 more than never married women, and separated men make $9791.09 more than separated women.
Holding all other variables constant, if we compared a married man and a married woman, we’d predict that he’d make: `intercept` + `gendermale` + `marital.statusmarried` + `gendermale:marital.statusmarried` = $41019.14, and predict that she’d make: `intercept` + `marital.statusmarried` = $30966.99.
With this difference of $10052.15, married women make 75.49% of married men’s predicted income, a lower share (that is, a wider gap) than the overall figure of 79.13%.
Holding all other variables constant, if we compared a never married man and a never married woman, we’d predict that he’d make: `intercept` + `gendermale` + `marital.statusnever married` + `gendermale:marital.statusnever married` = $29969.78, and predict that she’d make: `intercept` + `marital.statusnever married` = $27438.23.
With this difference of $2531.55, never married women make 91.55% of never married men’s predicted income, a much higher share (that is, a narrower gap) than the overall figure of 79.13%.
Since we found two different amounts in comparison to our initial findings on the wage gap, it seems reasonable that marital status has a further impact on that gap: depending on a woman’s marital status, she may make even less relative to a similar man, or may come closer to parity. This makes the variable an important predictor.
As we did with the `weight` variable, let’s check that our `marital status` analysis gives a closer prediction than the general model. Running an ANOVA comparing the two models, we find a p-value of effectively 0, so we can say that yes, it is a significantly better model.
## Analysis of Variance Table
##
## Model 1: income.lastyr ~ gender
## Model 2: income.lastyr ~ gender * marital.status
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5180 1930540846927
## 2 5170 1844451702305 10 86089144623 24.131 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
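The F statistic in this table can be recomputed from the two residual sums of squares and their degrees of freedom. This is a sketch of the standard nested-model F test using only the printed values; the variable names are illustrative.

```python
# Values copied from the ANOVA table above.
rss_1, res_df_1 = 1930540846927, 5180   # Model 1: income.lastyr ~ gender
rss_2, res_df_2 = 1844451702305, 5170   # Model 2: gender * marital.status

# Extra parameters in the larger model and the reduction in RSS they buy.
df_diff = res_df_1 - res_df_2
ss_diff = rss_1 - rss_2

# F = (reduction in RSS per extra parameter) / (residual variance of Model 2)
f_stat = (ss_diff / df_diff) / (rss_2 / res_df_2)

print(f"Df: {df_diff}")
print(f"Sum of Sq: {ss_diff}")
print(f"F: {f_stat:.3f}")
```

The printed RSS values are rounded to whole numbers, so the recomputed Sum of Sq can differ from the table by 1 in the last digit.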
This analysis was done excluding the topcoded values, so let’s see if it’s different with the topcoded values included.
Term | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 26550.383 | 1786.252 | 14.864 | 0.000 |
gendermale | 9524.997 | 2784.868 | 3.420 | 0.001 |
marital.statusinvalid skip | -3050.383 | 9020.130 | -0.338 | 0.735 |
marital.statusmarried | 5802.569 | 1954.120 | 2.969 | 0.003 |
marital.statusnever married | 2481.191 | 1919.652 | 1.293 | 0.196 |
marital.statusseparated | -4366.406 | 4252.130 | -1.027 | 0.305 |
marital.statuswidowed | -13439.383 | 12630.709 | -1.064 | 0.287 |
gendermale:marital.statusinvalid skip | -7024.997 | 19965.364 | -0.352 | 0.725 |
gendermale:marital.statusmarried | 3800.484 | 2995.649 | 1.269 | 0.205 |
gendermale:marital.statusnever married | -5609.696 | 2941.258 | -1.907 | 0.057 |
gendermale:marital.statusseparated | 266.091 | 6543.610 | 0.041 | 0.968 |
gendermale:marital.statuswidowed | 6697.336 | 19301.772 | 0.347 | 0.729 |
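These estimates look broadly similar to what we saw in our original analysis (although the `gendermale:marital.statusnever married` interaction is no longer significant at the 5% level, with a p-value of 0.057), so we can continue to exclude the topcoded values.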
Discussion
It appears that both `weight` and `marital status` are statistically significant in contributing to the wage gap between men and women.
The weight variable is interesting. It is consistent with what I and others suspect and what has been reported in some studies: women are penalized for their appearance to a greater extent than men are, and in particular for gaining weight. I was surprised, though, that a person’s self-assessment of being overweight, underweight, or average did not seem to have a relationship to income.
It’s also interesting that when comparing men and women who had never been married, the wage gap narrowed considerably. While this data only looked at a small range of birth years, and therefore a small range of ages, it might be worth further investigating whether other factors, such as age, contribute to this difference.
While this analysis is fairly accurate based on the data presented, and it was run on a fairly large dataset, this is still only one dataset. In order to feel more confident in the findings I would want to run similar tests using other large, longitudinal datasets to see if the same findings stand.