This is a companion post to my report on the Point Pleasant Borough First Aid Squad (PBFAS), a volunteer Emergency Medical Service. It would be helpful to view that report prior to this post.
In this post I will outline the steps I took to write that report, including details of all my analysis. The target audience for the final report is the general audience so I omitted information which would have only served to confuse the reader.
Many of the plots included in this report are not polished since they were preliminary charts used for me to better understand the data I was looking at. All the final products are included in the report linked above.
The data was obtained from an application used by PBFAS called "FireStation". PBFAS started using this software in 2014 to track member response to calls. Conveniently for me, this application stores its data in a Microsoft Access Database. As an aside, I think this is an extremely poor choice from a security perspective, since user accounts are stored in this database, the passwords were encrypted, but someone malicious could easily have obtained this key. I don't have much experience with Microsoft Access but, SQL is SQL, and I had no trouble throwing some inner joins together and getting out the data I wanted:
To get number of responders I had to select the count from a table which kept the user ids of the individuals who came to a call, I was trying to use an outer join and it wasn't working. I spent 15 minutes trying to fix my query but then realized that it would be way easier to just join the select statements together, which I did, and had immediate success.
Once I had my data, I talked to Chief and Lieutenant to see what they were interested in seeing. They wanted too see what days/times are busier than others, when there were more/less responders, how long calls are, use of the Emergency Services Unit (ESU), and Mutual Aid requests.
I pulled the data into R and cleaned it up. I added calculated columns to the data frame for time elapsed (as a float in hours), the day of the week of the call, the month of the call, and the dispatch time of the call as a float from 0-24. While there is a time data type, I preferred to have the float, it made my life a lot easier, especially since the original data set stored times as integers (i.e. 08:45 was 845). I also went and removed outliers by call length and by number of responders. For call length outliers, I would guess most of them are as a result of someone mis-typing a time or date and a call ending up as 25 hours instead of 1 hour. I also dropped outliers for the number of responders, occasionally there are calls during a meeting or drill, where almost all members are in attendance, and thus all get added as a responder to the call, although, they all would not have responded otherwise. Because of the nature of these outliers, and the goals of the analysis, I deemed it was appropriate to remove them. I created a calculated column for duty night, where if the call was between 6AM and 11PM it was assigned 0, and if it was between 11PM and 6AM it was assigned the day of the week which the duty night started (i.e. 4AM on a Monday would be assigned a value of 1 for Sunday night).
First, I looked into the busiest times of day, and I did a kernel density estimate. It was easy to do so I just called the stat_density function in ggplot2 and that's it. It is important to note this does not give the probability density estimate of getting a call at a certain time (or rather within a range of times since it is a continuous random variable), it gives the probability that given there is a call, the probability that it occurred at a certain time. This does, however, give us a visual representation of the busiest times of the day.
Next, I built two new tables, one for duty night calls, the other for calls that occur during the day both with a column n which gave the number of calls during that time. I ran the summary statistics on number of calls and found the following:
Day Calls: min = 1, median = 3, mean = 3.3, max = 12
Night Calls: min = 1, median = 1, mean = 1.3, max = 8
I looked at this and realized that the database doesn't have records when there are no calls, so I was missing a critical piece of information, days and nights with zero calls (min should be 0 for both tables). It was an easy fix to use the pad function from the padr library to create rows for the missing dates. I just had to replace the NA values with zeros and my summary statistics were as follows:
Day Calls: min= 0, median = 3, mean = 3.14, max = 12
Night Calls: min = 0, median = 0, mean = 0.58, max = 8
This data made much more sense.
I did an ANOVA on the day calls using the day of the week as the factor, and found a P value of 0.195. This matched my expectation looking at the plot I visually didn't see what looked like any significant difference.
The ANOVA on the duty nights however was 1.07e-07, so at least one of the groups was different, so it was a matter of finding out which ones. I used the Tukey HSD test to determine which groups are different. It was at this point that I determined that I had to come up with a better way to visualize the Tukey HSD so I came up with a different plot. You can visit my previous blog post for more information.
I did the same for months of the year. The P value for the ANOVA was 3.97e-09 so I did another Tukey HSD and found the summer months were busier.
I wanted to look at the distribution of calls, and I hypothesized that the number of calls in a day would follow the Poisson distribution. When I plotted a histogram of number of calls per day it looked fairly Poisson, but then when I plotted the expected values of Poisson based on the mean (being the unbiased estimator) I found that it didn't really match up, and the chi-sq test for goodness-of-fit I found a p-value of 0.2344 so it is not a good fit.
I thought, well since months of the year are significantly different then maybe I should consider each month separately, I did that, and found that no Chi Squared values were less than the critical values, so still, the Poisson distribution is not a good fit.
For all months, the Chi-Sq test statistic was higher than the critical value so we reject the hypothesis that the probabilities are completely specified by a Poisson distribution with rate = mean.
The next thing I looked at was the response to calls. The following were the summary statistics for responders on all calls:
min = 1, 1st quartile = 4, median = 6, mean = 6.06, 3rd quartile = 8, max = 14
I looked at the response by hour, and did the Tukey HSD and had the following plot, although in this case my plot misses some significant differences, it gives me a good picture of which hours have more responders than others.
1500-2100 have the most responders and they have significantly more responders than the hours with the least responders 0600-1300. Empirically we knew this already, that the morning has less people available than the afternoon-evening, and of course 6am have the least responders, it is right after duty night ends, and its early in the morning, and the response increases in order of hour up until 1400 when it becomes more of a wash.
This right here is the power of this function, I can immediately just plot in the data and I can get a good idea of which hours of the day get the most responders and which hours get the least and if there is a significant difference between them, and about what the average number of responders per hour is. Instantly with one chart I get lots of insights.
I looked at the response by month and found no significant differences, same as with responses by call type. It appears that there are significantly more for ESU call outs, but since the sample size is smaller the significant difference wasn't found. I did find a significant difference in response by years.
I wanted to see if there was any correlation between call length and number of responders, and no, there is not, the R-squared value is near zero, and the coefficient for Responders is insignificant. There really aren't any patterns observable when you look at the 2-d histogram of Responders and call length. It looks like what you would get if you did a joint distribution between the distribution of call lengths and the distribution of number of responders. They seem to be independent. An ANOVA on the call length vs number of responders (with responders as the factor) had a p-value of 0.711.
I also looked at the length of calls over time, and there hasn't been any changes, this plot is where you can see that call lengths often get bumped to an hour if they are few minutes short in order to get the extra call point.
At this point I returned back to looking at call frequency and looked to see if some holidays are busier than others. I went and got all the dates of holidays over the past 8 years and the ANOVA had a p-value of 0.457, there are no significant differences between number of calls on each holiday.
The rest of the analysis you can find in the main report. After I completed this, I selected which graphs I wanted to put in there based on what I felt would be the most beneficial for a general audience. I spent a decent amount of time messing with styles and colors. I have never claimed to be a designer by any means, but I am satisfied with the visual quality of the graphs I included. Conveniently, with the plotly library for R, I was able to export the plotly json object of my ggplot charts, and then include them as interactive plots on the final report. While my ggplots look good, nothing beats the user experience afforded by an interactive plot.
As a volunteer EMT for PBFAS I thought the results of this report were interesting and I enjoyed looking at this data. I look forward to more opportunities to look at other interesting data sets and I hope to continue to improve my data analysis skills.