DATA for A Twenty-Third
May 15th, 2012
This post relates to the class Data, taught by Mark Hansen in Spring 2012. It uses code and advice received from him.
A Twenty-Third is a data-based installation dedicated to the memory of cyclists who have died riding the streets of NYC. It recreates the tension of riding and the disparity experienced in an accident.
More information about the installation here
A Twenty-Third used as its source a massive, not very well organized dataset called CrashStat. It was published by Transportation Alternatives, who obtained the data via multiple FOIL (New York State Freedom of Information Law) requests to the New York State Department of Transportation (NYS DOT), and then organized, unified and published it as a single set covering 1995 to 2009.
I managed to get from them a not-yet-processed dataset containing the raw data from 2010, so I also had to unify both datasets at some point.
53399 obs. of 61 variables
dim(bikes_2010)
3347 obs. of 59 variables
My installation was only related to cyclist accidents, so the first step was to filter out the accidents that did not have a cyclist involved.
When I started analysing the first dataset, I ran into several problems: the dates had to be converted, the codes weren’t numerical, and there were a lot of missing cells.
I converted the time and date data to time objects:
bikes$date = strptime(bikes$ACCD_DTE, "%m/%d/%Y")
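A minimal sketch of the same conversion on made-up rows, to show where the missing values come from. The ACCD_DTE and ACCD_TME column names are from the dataset, but the "HH:MM" time format and the sample values are assumptions for illustration, not taken from CrashStat.

```r
# Toy rows; the "HH:MM" format of ACCD_TME is an assumption.
toy <- data.frame(
  ACCD_DTE = c("05/15/2009", "12/01/2009", "07/04/2009"),
  ACCD_TME = c("08:30", NA, "17:45"),
  stringsAsFactors = FALSE
)

# Date only: midnight of the accident day, in a fixed time zone.
toy$date <- as.POSIXct(strptime(toy$ACCD_DTE, "%m/%d/%Y", tz = "UTC"))

# Date plus time of day; rows without a time fail to parse and become NA.
toy$time <- as.POSIXct(
  strptime(paste(toy$ACCD_DTE, toy$ACCD_TME), "%m/%d/%Y %H:%M", tz = "UTC")
)

print(toy[, c("date", "time")])

# Number of rows that still need a time of day.
missing_times <- sum(is.na(toy$time))
print(missing_times)
```

Fixing the time zone explicitly keeps the result independent of the machine the script runs on.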
Then, because many rows included only a date and no time, I assigned those rows a random time of day so that all the time data could live in a single column.
a <- which(is.na(bikes$time))
bikes$time[is.na(bikes$time)] = bikes$date[is.na(bikes$time)] + sample(86400, size=length(a))
Secondly, I needed to classify the accidents according to their seriousness. One of the columns included information about the type and extent of the injuries, and I recoded it into a simpler classification: 1 being a non-serious accident, 2 a serious accident and 3 a fatality.
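One possible way to do this kind of recoding is a lookup table. This is only a sketch: the letter codes below are hypothetical placeholders, not the actual values found in the EXT_OF_INJURIES column, and the helper name classify_seriousness is made up for the example.

```r
# Hypothetical injury codes, for illustration only:
# "K" = killed, "A" = severe injury, "B" and "C" = minor injury.
seriousness_lookup <- c(K = 3L, A = 2L, B = 1L, C = 1L)

# Map each code to its seriousness level; unknown or missing codes become NA.
classify_seriousness <- function(injury_code) {
  unname(seriousness_lookup[injury_code])
}

toy <- data.frame(
  EXT_OF_INJURIES = c("A", "K", "C", "A", NA),
  stringsAsFactors = FALSE
)
toy$seriousness <- classify_seriousness(toy$EXT_OF_INJURIES)

# Count accidents per seriousness level, keeping the unclassified ones visible.
print(table(toy$seriousness, useNA = "ifany"))
```

A lookup table keeps the mapping in one place, so it is easy to adjust once the real codes are known.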
Going through the data I was able to make some preliminary observations:


seriousness      2      3
accidents    22139    279
There are no records of non-serious accidents. They were either filtered out of the original database, not recorded, or simply not reported.
The overall number of serious accidents decreased from 1995 to 2007 and has increased slightly since then. This could be related to the growth of the biking population and the addition of bike lanes to the system. It would be interesting to match data on how the biking population has evolved with this dataset.
There is a clear increase in accidents every summer, and the number drops to near zero in the winter.
The number of fatalities doesn’t follow a visible pattern. One can see peaks in 1999, 2005 and 2007, but these could be interpreted as noise given the small number of fatalities (~13-28 per year).
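Yearly and seasonal patterns like these can be checked by tabulating the accidents by year and by month. The sketch below uses randomly generated dates rather than the CrashStat data, so its counts mean nothing; only the date and seriousness column names follow the post.

```r
set.seed(1)

# Synthetic accident timestamps spread over 2005-2007 (not CrashStat data).
n <- 500
toy <- data.frame(
  date = as.POSIXct("2005-01-01", tz = "UTC") + floor(runif(n, 0, 3 * 365 * 86400)),
  seriousness = sample(c(2, 3), n, replace = TRUE, prob = c(0.98, 0.02))
)

toy$year  <- format(toy$date, "%Y")
toy$month <- format(toy$date, "%m")

# Rows are years, columns are seriousness levels.
by_year <- table(toy$year, toy$seriousness)

# Accidents per calendar month, pooled over all years, to look for seasonality.
by_month <- table(toy$month)

print(by_year)
print(by_month)
```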

I tested the correlation between every column I could in the dataset, but I couldn’t find anything very surprising:
[1] "OBJECTID" "CASE_NUM" "CASE_YR"
[4] "REF_MRKR" "ACCD_DTE" "ROAD_SYS"
[7] "NUM_OF_FATALITIES" "NUM_OF_INJURIES" "REPORTABLE"
[10] "POLICE_DEPT" "INTERSECT_NUM" "MUNI"
[13] "PRECINCT" "NUM_OF_VEH" "ACCD_TYP"
[16] "LOCN" "TRAF_CNTL" "LIGHT_COND"
[19] "WEATHER" "ROAD_CHAR" "ROAD_SURF_COND"
[22] "COLLISION_TYP" "PED_LOC" "PED_ACTN"
[25] "EXT_OF_INJURIES" "REGN_CNTY_CDE" "LOW_NODE"
[28] "HIGH_NODE" "ACCD_TME" "RPT_AGCY"
[31] "DMV_ACCD_CLSF" "ERR_CDE" "COMM_VEH_ACC_IND"
[34] "INTERSECT_IND" "UTM_NORTHING" "UTM_EASTING"
[37] "GEO_SEGMENT_ID" "GEO_NODE_ID" "GEO_NODE_DISTANCE"
[40] "GEO_NODE_DIRECTION" "GEO_LCODE" "HIGHWAY_IND"
[43] "CASE_NUM_YR" "X_COORD" "Y_COORD"
[46] "BoroName" "BoroCD" "StSenDist"
[49] "SchoolDist" "CounDist" "CongDist"
[52] "ElectDist" "PjAreaName" "Precinct_1"
[55] "GEOID10" "NAME10" "DPHO"
[58] "AssemDist" "time" "date"
[61] "seriousness"
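For categorical columns such as LIGHT_COND, one common way to look for a relationship with seriousness is a cross-tabulation followed by a chi-squared test. This is a sketch of that general technique on synthetic data, not necessarily the test used for this post, and the level labels are hypothetical rather than CrashStat codes.

```r
set.seed(42)
n <- 1000

# Synthetic data; the LIGHT_COND labels are made up for the example.
toy <- data.frame(
  LIGHT_COND = sample(c("daylight", "dark, lit", "dark, unlit"), n,
                      replace = TRUE, prob = c(0.70, 0.25, 0.05)),
  seriousness = sample(c(2, 3), n, replace = TRUE, prob = c(0.95, 0.05))
)

# Counts of each seriousness level under each light condition.
tab <- table(toy$LIGHT_COND, toy$seriousness)

# Row-wise shares: how accidents split by seriousness within each condition.
shares <- prop.table(tab, margin = 1)
print(shares)

# Chi-squared test of independence. Warnings are suppressed because the
# approximation is known to be rough when some cells have small counts.
result <- suppressWarnings(chisq.test(tab))
print(result$p.value)
```

Because the two columns here are generated independently, a large p-value is the expected outcome; with real data, a small one would only suggest an association worth a closer look.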
Most of the accidents happen in biking-favorable conditions: sunny days, daylight, well-lit streets. Subsets would have to be examined to find patterns within the less common scenarios.
Finally, as the installation was going to run at a sped-up rate, I had to study the time differences between consecutive accidents and, more importantly, between consecutive fatalities, in order to determine the actual speed and the considerations I’d have to take into account.


Time differences in hours
         0%          5%         10%         15%         20%         25%
   1.183333   18.145750   37.251667   68.850000   93.000000  113.000000
        30%         35%         40%         45%         50%         55%
 149.000000  194.571667  233.093333  282.300000  314.450000  378.915833
        60%         65%         70%         75%         80%         85%
 416.900000  488.195833  559.200000  642.145833  760.080000  922.090000
        90%         95%        100%
1140.920000 1515.385000 4096.050000
In the case of the fatalities, about 15% of them occur within 3 days of each other. The installation ran at a rate of one day per minute, so in that many cases the reenactment had to be delayed. Next time, I would like to explore a different approach that resolves this conflict better while maintaining an interesting experience for the viewer. The data could be replayed at a slower rate, or these cases could be reenacted together.
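A quantile table like the one above can be produced with diff and quantile. The sketch below runs on synthetic timestamps, so its numbers will not match the real ones; the 72-hour threshold and the one-day-per-minute rate are the ones mentioned in the post.

```r
set.seed(7)

# Synthetic fatality timestamps over roughly 16 years (not CrashStat data).
start <- as.POSIXct("1995-01-01", tz = "UTC")
fatal_times <- sort(start + runif(280, 0, 16 * 365 * 86400))

# Gaps between consecutive fatalities, in hours.
gaps_hours <- as.numeric(diff(fatal_times), units = "hours")

# Quantiles in 5% steps, the same layout as the table in the post.
print(quantile(gaps_hours, probs = seq(0, 1, by = 0.05)))

# Share of fatalities that follow the previous one by less than 3 days.
close_share <- mean(gaps_hours < 72)
print(close_share)

# At one day per minute, each gap lasts this many minutes in the installation.
playback_minutes <- gaps_hours / 24
print(summary(playback_minutes))
```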
Finally, I used R to export a subset of the data, including only the information I needed for the installation. This operation was fairly easy using subset, rbind and write.table.
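# assumed: bikes.a is the same subset, taken from the 1995-2009 data
bikes.a <- subset(bikes, seriousness >= 2, select = c(seriousness, date, time))
bikes.b <- subset(bikes2010, seriousness >= 2, select = c(seriousness, date, time))
bikes <- rbind(bikes.a, bikes.b)
# store each timestamp as seconds since 1970-01-01
bikes$timestring = as.numeric(bikes$time)
write.table(subset(bikes, select = c(seriousness, timestring)), '~/bikes.csv', quote=FALSE, sep=',', row.names=FALSE)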
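Since rbind needs both data frames to have the same columns, a small end-to-end check can be useful when the two sources differ. This sketch uses toy data frames and a temporary file instead of ~/bikes.csv; keeping only the shared columns with intersect is one possible approach, not necessarily the one used here.

```r
# Toy stand-ins for the two datasets, with columns in a different order
# and one extra column in the first.
a <- data.frame(
  seriousness = c(2, 3),
  time = as.POSIXct(c("2009-06-01 08:30:00", "2009-07-04 17:45:00"), tz = "UTC"),
  BoroName = c("Brooklyn", "Queens"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  time = as.POSIXct("2010-03-12 12:00:00", tz = "UTC"),
  seriousness = 2
)

# Keep only the columns both data frames share, in the same order.
common <- intersect(names(a), names(b))
both <- rbind(a[, common], b[, common])

# Seconds since 1970-01-01, which is easy to read from other programs.
both$timestring <- as.numeric(both$time)

# Write to a temporary file and read it back to confirm the layout.
out <- tempfile(fileext = ".csv")
write.table(both[, c("seriousness", "timestring")], out,
            quote = FALSE, sep = ",", row.names = FALSE)
back <- read.csv(out)
print(back)
```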
There are lots of other things I’d love to do with the data, but my lack of R expertise made it very difficult. I wanted to apply more of the concepts we learned in the Data class, and check them with the many other columns I just overlooked.
R has proven very useful. As with any other language, one has to go through a learning curve to become fluent, something I underestimated and didn’t have enough time for. I hope I can devote more time to it soon.
Something else I hope to do in the near future is to use some of the concepts we learned to improve another project, BKME, which is basically a platform that aggregates user-generated data on cars illegally parked in bike lanes. In order to make that data useful and create change, we need to be able to analyse it to find clusters and patterns.





