UK Road Safety Analysis

Project Introduction

This project analyzes the circumstances of personal injury road accidents in Great Britain in 2012, covering nearly 150,000 incidents.

The statistics relate only to personal injury accidents on public roads that are reported to the police and subsequently recorded using the STATS19 accident reporting form. Damage-only accidents with no human casualties, and accidents on private roads or in car parks, are not included in these data.

Figures for deaths refer to persons killed immediately or who died within 30 days of the accident. This is the usual international definition, adopted by the Vienna Convention in 1968.

Methodology

  • Feature Engineering: create new variables, including morning, afternoon, evening, and night (day part); commute versus non-commute; weekend versus weekday; high versus low severity; and fatal versus non-fatal severity (see the sketch after this list).
  • Find the most significant variables using Forward Stepwise Subset Selection and the lasso.
  • Use decision trees and random forests to identify the combinations of variables that drive severity, in order to make recommendations.
  • Use GIS mapping to identify geographic areas of higher concern
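A minimal sketch of how these derived variables could be constructed is shown below. It assumes the cleaned STATS19 extract has been read into a data frame called road with columns hour (hour of day, 0 to 23), day.of.week, and severity (coded 1 = fatal, 2 = severe, 3 = slight); the column names and the exact time boundaries are illustrative assumptions, not the report's actual code.

library(dplyr)

# Illustrative feature engineering; column names and cut-offs are assumptions.
road <- road %>%
  mutate(
    # Part of day derived from the hour of the accident
    day.part = case_when(
      hour >= 6  & hour < 12 ~ "morning",
      hour >= 12 & hour < 18 ~ "afternoon",
      hour >= 18 & hour < 22 ~ "evening",
      TRUE                   ~ "night"
    ),
    # Commute vs non-commute windows
    commute  = ifelse(hour %in% c(7:9, 16:18), "commute", "noncommute"),
    # Weekend vs weekday
    day.type = ifelse(day.of.week %in% c("Saturday", "Sunday"),
                      "weekend", "weekday"),
    # Severity recodes: 1 = fatal, 2 = severe, 3 = slight
    fatal    = factor(ifelse(severity == 1, "fatal", "nonfatal")),
    high.sev = factor(ifelse(severity <= 2, "high", "low"))
  )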

Preliminary Visualization of the Data

We can see that day of the week and/or weekend versus weekday may be significant.

Similarly, we see a difference in time of day and commute times between fatal and non-fatal crashes.

There does not appear to be much variation in either the count of crashes or the fatal versus non-fatal split across months or seasons.

Here it appears that speed limit and urban versus rural setting make a difference.

Weather and road surface condition do not seem to vary much between fatal and non-fatal crashes.

The charts below explore the data generally, without comparing fatal and non-fatal accidents.

  • We see that most accidents involve a small number of casualties.
  • One-car and two-car accidents appear nearly equally likely.
  • There are a large number of highway authorities and local districts, and crashes are not evenly distributed among them. However, exploring these in greater detail is outside the scope of this analysis.
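As a sketch of how charts like these could be produced, take the day-of-week comparison mentioned above; this assumes the road data frame and the engineered fatal column from the methodology, and is illustrative rather than the report's actual plotting code.

library(ggplot2)

# Accident counts by day of week, with fatal and non-fatal crashes in
# separate panels (free y-scales, since fatal counts are far smaller).
ggplot(road, aes(x = day.of.week)) +
  geom_bar() +
  facet_wrap(~ fatal, scales = "free_y") +
  labs(x = "Day of week", y = "Number of accidents")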

Significant Variables

Forward Stepwise Selection

This plot shows the trade-off in Forward Stepwise Model Selection between model size and R-squared, that is, between model complexity and how much of the variation in severity each model explains.

From this plot we can determine that the first 10 models give a good trade-off between model size (and therefore complexity) and fit, before the improvement in R-squared plateaus.

However, the R-squared at 10 variables, and at the plateau, is only around 0.20, which means the selected variables explain only a modest share of the variation in severity.
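A sketch of how this selection could be run with the leaps package, assuming a numeric severity response and the engineered predictors in road; the exact formula and nvmax used for the report may differ.

library(leaps)

# Forward stepwise selection over the candidate predictors.
fwd.fit <- regsubsets(severity ~ x1st.rd.class + road.type + speed.lim +
                        junct.detail + light + surface.cond + urban +
                        day.part + day.type,
                      data = road, nvmax = 20, method = "forward")
fwd.sum <- summary(fwd.fit)

# R-squared against model size, to find where the improvement plateaus.
plot(fwd.sum$rsq, type = "b",
     xlab = "Number of variables", ylab = "R-squared")

# Variables entering each of the first 10 models.
lapply(1:10, function(k) names(coef(fwd.fit, id = k)))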

First 10 Forward Stepwise Models

## [[1]]
## [1] "(Intercept)" "urbanurban" 
## 
## [[2]]
## [1] "(Intercept)"       "junct.detailround" "urbanurban"       
## 
## [[3]]
## [1] "(Intercept)"       "junct.detailround" "urbanurban"       
## [4] "day.partnight"    
## 
## [[4]]
## [1] "(Intercept)"           "x1st.rd.classmotorway" "junct.detailround"    
## [4] "urbanurban"            "day.partnight"        
## 
## [[5]]
## [1] "(Intercept)"           "x1st.rd.classmotorway" "speed.lim"            
## [4] "junct.detailround"     "urbanurban"            "day.partnight"        
## 
## [[6]]
## [1] "(Intercept)"           "x1st.rd.classmotorway" "road.typesingle"      
## [4] "speed.lim"             "junct.detailround"     "urbanurban"           
## [7] "day.partnight"        
## 
## [[7]]
## [1] "(Intercept)"           "x1st.rd.classmotorway" "road.typesingle"      
## [4] "speed.lim"             "junct.detailnone"      "junct.detailround"    
## [7] "urbanurban"            "day.partnight"        
## 
## [[8]]
## [1] "(Intercept)"           "x1st.rd.classmotorway" "road.typesingle"      
## [4] "speed.lim"             "junct.detailnone"      "junct.detailround"    
## [7] "surface.condwet"       "urbanurban"            "day.partnight"        
## 
## [[9]]
##  [1] "(Intercept)"           "x1st.rd.classmotorway"
##  [3] "road.typesingle"       "speed.lim"            
##  [5] "junct.detailnone"      "junct.detailround"    
##  [7] "light.L"               "surface.condwet"      
##  [9] "urbanurban"            "day.partnight"        
## 
## [[10]]
##  [1] "(Intercept)"           "x1st.rd.classmotorway"
##  [3] "road.typesingle"       "speed.lim"            
##  [5] "junct.detailnone"      "junct.detailround"    
##  [7] "light.L"               "surface.condwet"      
##  [9] "urbanurban"            "day.partnight"        
## [11] "day.typeweekday"

Lasso

We then used the lasso to identify significant variables in a more nuanced way than forward stepwise selection allows.

Significant variables from Lasso

##  [1] "(Intercept)"           "lat"                  
##  [3] "num.vehicles"          "num.casualty"         
##  [5] "x1st.rd.classmotorway" "road.typesingle"      
##  [7] "speed.lim"             "junct.detailnone"     
##  [9] "junct.detailround"     "x2nd.rd.classnojct"   
## [11] "light.L"               "weatherrain"          
## [13] "urbanurban"            "day.partnight"        
## [15] "day.typeweekday"

We determined the cut-off for the lasso using 10-fold cross-validation, selecting the simplest model within one standard error of the minimum cross-validation error in the cross-validation plot below; this gave 15 variables. The second plot displays how many variables enter the model as model complexity increases.
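A sketch of the lasso fit with glmnet, again assuming the road data frame with a numeric severity response; cv.glmnet performs the 10-fold cross-validation and lambda.1se implements the one-standard-error rule described above.

library(glmnet)

# Design matrix (factors expanded to dummy variables) and response.
x <- model.matrix(severity ~ ., data = road)[, -1]
y <- road$severity

# 10-fold cross-validated lasso (alpha = 1).
cv.fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
plot(cv.fit)  # cross-validation error against log(lambda)

# Coefficients at the one-standard-error lambda; the non-zero entries are
# the selected variables.
sel <- coef(cv.fit, s = "lambda.1se")
rownames(sel)[as.vector(sel != 0)]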

Classification and Prediction

Trees

We first used trees to group the accidents, upsampling the fatal observations to make the data more balanced. This initially produced very shallow trees.

Note that, when reading these trees, severity is coded in the data as 1 = fatal, 2 = severe, and 3 = slight.

So a value of 2.5 on a terminal node, for example, means that grouping of accidents falls between severe and slight. On these trees, the more severe and fatal groupings are to the left, with slight crashes to the right.

We plotted the cross-validation errors to determine a cut-off point for pruning the trees.

We then used that cut-off to produce pruned trees that were more useful.
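A sketch of the tree fitting and pruning with rpart, assuming an upsampled data frame road.up in which the fatal observations have been replicated; the formula and the cp value chosen from the plot are illustrative.

library(rpart)

# Regression tree on severity (1 = fatal, 2 = severe, 3 = slight),
# grown deep and then pruned using the cross-validation (cp) plot.
tree.fit <- rpart(severity ~ speed.lim + num.vehicles + num.casualty +
                    road.type + junct.detail + light + surface.cond +
                    urban + day.part + day.type,
                  data = road.up, method = "anova",
                  control = rpart.control(cp = 0.001))

plotcp(tree.fit)  # cross-validation error against complexity parameter

# Prune at a cp value read off the plot (value here is illustrative).
tree.pruned <- prune(tree.fit, cp = 0.005)
print(tree.pruned)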

## n= 126285 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 126285 55124.100 2.658115  
##    2) speed.lim>=45 35298 23342.330 2.450592  
##      4) num.casualty>=2.5 4899  3977.529 2.114717  
##        8) road.type=oneway,single 3257  2615.622 1.979429 *
##        9) road.type=dual,roundabout,slip,unknown 1642  1184.049 2.383069 *
##      5) num.casualty< 2.5 30399 18723.070 2.504721  
##       10) light=no lights 5951  4786.765 2.260628 *
##       11) light=dark lit unknown,day,lights lit,lights unlit 24448 13495.430 2.564136  
##         22) junct.detail=none,other,private,T 19890 11906.390 2.514379  
##           44) surface.cond=wet 12221  8159.169 2.431061 *
##           45) surface.cond=dry,flood,ice,oil,snow 7669  3527.190 2.647151 *
##         23) junct.detail=4+,cross,miniround,round,slip 4558  1324.920 2.781264 *
##    3) speed.lim< 45 90987 29671.910 2.738622  
##      6) num.vehicles< 1.5 28668 14095.080 2.566276  
##       12) num.casualty>=1.5 2780  2009.935 2.284532 *
##       13) num.casualty< 1.5 25888 11840.770 2.596531 *
##      7) num.vehicles>=1.5 62319 14333.580 2.817905 *

We see that the tree splits on number of casualties (num.casualty) and number of vehicles (num.vehicles). Logically, the number of casualties and the number of vehicles are outcomes of the severity of an accident rather than inputs to it. So we reran the model with those two variables removed, again using the CP plot to determine where to prune the tree.
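Refitting without those two outcome-driven variables is a small change to the same sketch:

# Refit without num.casualty and num.vehicles, then re-check the cp plot
# and prune as before.
tree.fit2 <- rpart(severity ~ speed.lim + road.type + junct.detail + light +
                     surface.cond + urban + day.part + day.type,
                   data = road.up, method = "anova",
                   control = rpart.control(cp = 0.001))
plotcp(tree.fit2)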

Random Forests

Note: in order to get the randomForest model to run, we had to severely limit our input variables, so this must be kept in mind when interpreting the model.

We used random forests to find significant variables, again upsampling the fatal observations to make the data more balanced.
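A sketch of the random forest fit with the randomForest package; the reduced predictor set shown here is illustrative of the kind of limitation described above, not the exact set used.

library(randomForest)

set.seed(1)
rf.fit <- randomForest(severity ~ speed.lim + weather + season + day.type +
                         day.part + hour + lat + long,
                       data = road.up, ntree = 500, importance = TRUE)

# Permutation importance (%IncMSE): how much prediction error grows when a
# variable's values are randomly shuffled.
importance(rf.fit)
varImpPlot(rf.fit)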

Policy Recommendations

Predictive Responses after Crashes

We recall that lasso found the following variables to be significant:

##  [1] "(Intercept)"           "lat"                  
##  [3] "num.vehicles"          "num.casualty"         
##  [5] "x1st.rd.classmotorway" "road.typesingle"      
##  [7] "speed.lim"             "junct.detailcross"    
##  [9] "junct.detailnone"      "junct.detailround"    
## [11] "junct.ctrlsignal"      "x2nd.rd.classnojct"   
## [13] "ped.cross.physpelican" "light.L"              
## [15] "light^4"               "weatherrain"          
## [17] "surface.condflood"     "surface.condwet"      
## [19] "urbanurban"            "day.partnight"        
## [21] "day.typeweekday"

It seems that a number of these strong predictors (light.L or lights lit, weatherrain or rain, surface.condwet or wet road, day.partnight or night, and day.typeweekday or weekday) are outside the control of police and road engineers for prevention, but they can be used to predict that incidents occurring under these conditions are likely to be of greater severity.

The %IncMSE plot from the random forest tells us how much the prediction error would increase if the values of a variable were randomly permuted, so the higher the value, the more important that variable is in the model. It tells us that speed limit is very important, followed to a lesser degree by weather, season, weekend versus weekday (day.type), and morning, afternoon, evening or night (day.part). We can also see by following the tree that hour of day is an early split, with early hours, before 6 am, leading to higher-severity crashes.

Aside from speed limit, these time-based factors and the weather cannot be controlled by planners. Instead, authorities responding to crashes under these circumstances can expect higher levels of severity and respond accordingly. Alternatively, they can preventatively deploy more officers or traffic-calming measures on the roads when these conditions are present.

Using the trees we built, together with further development of the randomForest model (with better variables included), could produce a scoring system. Such a system could be given to first responders and dispatchers to automatically predict the severity of a crash from a number of easily obtainable factors that we have found to be significant predictors, such as road surface condition, weather, and time of day. Authorities could then decide what types of emergency vehicles to send, and how many, even if the severity of the accident was not reported when the incident was called in.
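A sketch of how such a scoring call might look for a dispatcher-facing tool, assuming the rf.fit model above; the incident values and field names are made up for illustration, and in practice the factor columns would need the same levels as the training data.

# Hypothetical incoming incident described by easily obtainable fields.
new.incident <- data.frame(
  speed.lim = 60, weather = "rain", season = "winter",
  day.type = "weekday", day.part = "night", hour = 23,
  lat = 53.2, long = -1.4
)

# Predicted severity on the 1 (fatal) to 3 (slight) scale; a lower score
# would trigger a larger emergency response.
predict(rf.fit, newdata = new.incident)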

Preventing Crashes

Given that speed limit is also a significant factor, and that many of the other predictors are out of planners' control, authorities could implement dynamic speed limits that force drivers to slow down under these conditions.

Again we can review the variables found significant by the lasso:

##  [1] "(Intercept)"           "lat"                  
##  [3] "num.vehicles"          "num.casualty"         
##  [5] "x1st.rd.classmotorway" "road.typesingle"      
##  [7] "speed.lim"             "junct.detailnone"     
##  [9] "junct.detailround"     "x2nd.rd.classnojct"   
## [11] "light.L"               "weatherrain"          
## [13] "urbanurban"            "day.partnight"        
## [15] "day.typeweekday"

The variables that include junct.detail refer to the physical infrastructure at the junctions where crashes occurred. Given that places with controlled junctions have lower-severity incidents, road engineers could implement controlled intersections in more locations, particularly in areas of higher crash density.

District Specific Highlights

A severity-prediction score of the kind described above could also be used for predictive, district-specific deployment of police resources.

Random forests found latitude and longitude, essentially location, to be important predictors. This tells us that the authorities in certain locations should consider special outreach programs to try to prevent crashes and to educate drivers about risks.

Geographic incidence of crashes, normalized by population per Ward, with all crashes on the left and fatalities only on the right:

Crashes by Ward, normalized by Population. All on left, Fatalities only on right.

We can see a difference. Some districts in the north, namely West Lindsey, East Lindsey and Boston, have much higher rates of fatalities than the rest of Great Britain, and higher than their own rates of less severe crashes. There is also a small district in the center of the map, North Warwickshire, with a much higher rate of both fatal and non-fatal crashes.

Below we can see the geographic incidence of crashes, normalized by population density per Ward. This adjusts for density of traffic: if the same population is spread over a much larger area, normalizing by population density gives us different information than simply normalizing by the number of people without considering area.

Crashes by Ward, normalized by Population Density. All on left, Fatalities only on right.

Here Powys in the west and Northumberland in the north stand out. These districts have a high rate of crashes given their smaller, more spread-out populations. Authorities and decision makers in these areas could push outreach programs to drivers to try to prevent crashes.
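A sketch of the two normalizations behind these maps, assuming a ward-level summary table wards with columns crashes, fatal.crashes, population, and area.km2 (names illustrative), joined from the accident data and census figures:

library(dplyr)

# Normalize by population (per-capita rates) and by population density
# (population per unit area), matching the two sets of maps above.
wards <- wards %>%
  mutate(
    pop.density       = population / area.km2,
    all.per.pop       = crashes / population,
    fatal.per.pop     = fatal.crashes / population,
    all.per.density   = crashes / pop.density,
    fatal.per.density = fatal.crashes / pop.density
  )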

Further Study

We recommend further study in implementing random forests to develop a prediction score that would allow first responders to more efficiently deploy limited resources.

These predictors can also be analyzed more specifically for different geographies, which would allow different interventions in different districts based on their needs. Some of these differences may be specific to political boundaries, and therefore to particular policy decisions that have been made, while others may stem more generally from rural versus urban settings, other road conditions, or regional cultural differences not tied to political boundaries.

Other Variables to Consider
  • (Estimated) speed of vehicle(s) at time of incident
  • Distracted driving factors such as mobile phone use
  • Driver intoxication
  • Mountains, fields, or other road areas
  • Presence of bicyclists on the road