Asian Journal of Empirical Research | Vol 7, Issue 1, January 2017

Identifying automatic vehicle location (AVL) data completeness issues in a rural transit authority system

Roger A. Solano^* , Matthew J. Hart and Dong P. Nguyenr

Slippery Rock University, Slippery Rock, PA 16057, USA ^*Corresponding author's email address: roger.solano@sru.edu

ABSTRACT

We analyzed stop-level AVL data from a rural transit system to assess data completeness and identify systematic data capture failures. Systematic data loss could compromise the validity of further analyses of the data, such as schedule adherence or run time performance. We audited the data to identify missing values and possible data recording errors. The frequency of missing values was analyzed as a function of trip start time, stop number, day of the week, and last reported seconds late. We also performed an outlier and extreme value analysis as a function of missing records per trip. We conclude that there are systematic data capture errors in the system that need to be addressed before further studies, such as run time analysis, can be performed. Given the widespread adoption of AVL systems by rural transit systems, we recommend that detailed data completeness analysis become routine before the data are used for other studies.

Keywords: Automatic vehicle location, public transit, buses, data quality, rural transit

ARTICLE HISTORY: Received: 08-Feb-2017, Accepted: 08-Mar-2017, Online available: 21-Mar-2017

Contribution/ Originality

This is, to our knowledge, the first paper to review data from a rural transit system and offer relatively easy and straightforward tools for data completeness analysis. The methodology could be replicated by other transit systems with limited data analysis resources. This research would be of interest to transit systems that use AVL technology, particularly rural transit systems, as well as to AVL hardware and software providers.

1. INTRODUCTION

In recent years, there has been an increasing adoption of automatic vehicle location (AVL) technology by transit agencies nationwide (El-Geneidy et al., 2011; Furth, 2006; Radin, 2005). This includes rural transit agencies that may lack the resources and tools required to analyze the extremely large data sets these systems create.

Current AVL systems match location data with route and schedule information in real time (Furth et al., 2003). However, when using an AVL system, missing data are inevitable due to communication faults (Hounsell et al., 2012; McLeod, 2007). Possible sources of unreliability include satellite unavailability, partial or total signal blocking, or other temporary failures (Moreira-Matias et al., 2015). A common problem is data capture at the beginning or end of the route, when the bus is in the terminal (Furth et al., 2004; Hammerle et al., 2005). Some transit providers, such as King County Metro, report data recovery rates of 80% from AVL (Furth, 2006). When an entire bus fleet is equipped with AVL, data recovery rates are not of importance unless there is systematic data loss (Furth, 2006). Saavedra et al. (2011) propose an automatic data validation methodology for archived AVL data. They flag data as suspect when physical constraints are violated (a negative travel time, for example) or when outliers cannot be explained by the trip pattern.

2. DATA

We analyzed data from a system that covers 25 square miles and served a population of 31,084 in 2010. It serves 218,278 passengers annually, including 45,605 senior passengers. The system has 4 full-time employees, 6 part-time employees, and a fleet of 6 buses, and it installed an AVL solution in January 2010. We focused our study on route one (Figure 1). The route has three patterns during weekdays that cover different numbers of stops: weekday, covering 23 stops with 9 trips; night 1, covering 26 stops with 3 trips; and night 2, covering 23 stops with one trip. We limit our analysis to weekdays, with patterns on weekdays and night for a total of 12 trips per day. The route is set as a loop, beginning and ending at the same stop.

Figure 1: Map of Route one (Source: Transit system website)

Communication between the bus and the AVL system takes place over a cellular connection. The bus sends an AVL record every 90 seconds. It also sends a stop report, including schedule adherence, each time it departs a trigger box. A trigger box is a set of two latitude values and two longitude values that define a box around a stop. Once the bus GPS position enters and exits the box, the stop is recorded as serviced. If the trigger box is not entered, the system will not generate a stop report.
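The trigger-box rule described above can be illustrated with a short sketch. It is not the vendor's implementation: the class and function names and the coordinates are hypothetical, and the only behavior taken from the text is that a stop report is generated once the bus position has entered and then left the box.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class TriggerBox:
    """Rectangle around a stop, defined by two latitudes and two longitudes."""
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.lat_min <= lat <= self.lat_max
                and self.lon_min <= lon <= self.lon_max)


def stop_triggered(box: TriggerBox, fixes: Iterable[Tuple[float, float]]) -> bool:
    """Return True once a sequence of GPS fixes enters and then exits the box."""
    entered = False
    for lat, lon in fixes:
        if box.contains(lat, lon):
            entered = True
        elif entered:
            return True  # the bus was inside and has now left the box
    return False


# Hypothetical coordinates, for illustration only.
box = TriggerBox(lat_min=41.060, lat_max=41.062, lon_min=-80.058, lon_max=-80.056)
passes_through = [(41.059, -80.057), (41.061, -80.057), (41.063, -80.057)]
drives_around = [(41.059, -80.060), (41.061, -80.060), (41.063, -80.060)]

print(stop_triggered(box, passes_through))
print(stop_triggered(box, drives_around))
```

A bus that follows a path outside the box never triggers the stop, so every scheduled record for that stop would appear as missing.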

We received AVL data for the year 2013 from the transit authority, as a single table with 86,546 records. The data cover all the trips made on the route during 2013. The fields in the data are described in Table 1.

Table 1: Fields in the data set

Field name	Description
stop_name	The name of the stop along the route
Scheduled Depart Time	Scheduled departure time from the stop
Actual Departure Time	Actual departure time from the stop
Service Date	Date of the trip
Seconds Late	Difference in seconds between Actual Departure Time and Scheduled Depart Time
Scheduled Offset	Difference between the trip start time and the scheduled depart time. Used to determine the relative time order relationship from the first stop within the pattern
Vehicle_Id	String identifying the bus covering the route
Stop Report Time	Time this AVL information was sent by the vehicle
route_id	String that identifies the route
Trip_Id	String that identifies the trip. There are several trips on a day. Each trip has a different start time.
Direction	Indicates the trip direction: Inbound, outbound, or loop. The route in this study is a loop.
Trip_Start_Time	Time that the bus is scheduled to leave the first stop in this trip
Driver_Last_Name	Driver's last name
Driver_First_Name	Driver's first name
service_level	String identifying the service level. In our study, all the trips are on weekdays
Route_Label	String identifying the route
Stop_Dwell_Time	Time that the bus is stopped at a stop. Difference between Actual departure time and arrival time
Arival_Time	Time that bus arrives at the stop
Layover	Layover time before departure from the stop. In our data, it is zero for all the stops.
pattern_name	Identifies the pattern. Each pattern has different stops. Our study has three patterns: weekday, night 1, and night 2
Trip_Label	String that identifies the trip. Similar to Trip_Id
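As a worked example of the two derived fields in Table 1, Seconds Late and Stop_Dwell_Time can be computed as simple time differences. The sketch below assumes times stored as HH:MM:SS strings on the same service date; the storage format, the helper name, and the sample values are assumptions, not details from the data set.

```python
from datetime import datetime


def seconds_between(earlier: str, later: str, fmt: str = "%H:%M:%S") -> int:
    """Signed number of seconds from `earlier` to `later` on the same date."""
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return int(delta.total_seconds())


# Hypothetical stop record.
scheduled_depart = "17:30:00"
arrival = "17:33:10"
actual_depart = "17:33:40"

# Seconds Late: Actual Departure Time minus Scheduled Depart Time.
seconds_late = seconds_between(scheduled_depart, actual_depart)
# Stop_Dwell_Time: Actual Departure Time minus arrival time.
stop_dwell = seconds_between(arrival, actual_depart)

print(seconds_late, stop_dwell)
```

With these definitions, a negative Seconds Late means an early departure, while a negative dwell time would mean the bus departed before it arrived, which is physically impossible.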

2.1. Data completeness

We audited the data to identify missing values and possible data recording errors. We identified 16,123 records where the AVL data were missing (ActualDepartureTime, SecondsLate, Vehicle_Id, StopReportTime, Driver_Last_Name, Driver_First_Name, and Arival_Time recorded as NULL).

Data from the first and last stops were unreliable and were eliminated from the study. This is a common problem with AVL systems. We identified that all records from stop 20 were missing. There were 480 stop records with a negative Stop_Dwell_Time. Most are duplicated entries or are followed by a missing record. Since a negative dwell time is impossible, these records were eliminated. Fifteen records were duplicates and were also eliminated.
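An audit of this kind could be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: records are represented as dictionaries, the dictionary keys mirror the field names quoted above, a duplicate is assumed to be a repeated (Service Date, Trip_Id, stop_name) combination, and the helper `make` and the sample records are hypothetical.

```python
AVL_FIELDS = ("ActualDepartureTime", "SecondsLate", "Vehicle_Id", "StopReportTime",
              "Driver_Last_Name", "Driver_First_Name", "Arival_Time")


def audit(records):
    """Split stop records into missing, negative-dwell, duplicate, and valid."""
    result = {"missing": [], "negative_dwell": [], "duplicate": [], "valid": []}
    seen = set()
    for rec in records:
        key = (rec["Service Date"], rec["Trip_Id"], rec["stop_name"])
        dwell = rec.get("Stop_Dwell_Time")
        if all(rec.get(field) is None for field in AVL_FIELDS):
            result["missing"].append(rec)      # every AVL field is NULL
        elif dwell is not None and dwell < 0:
            result["negative_dwell"].append(rec)  # physically impossible
        elif key in seen:
            result["duplicate"].append(rec)
        else:
            seen.add(key)
            result["valid"].append(rec)
    return result


def make(stop, dwell=0, missing=False):
    """Build a hypothetical stop record for the example."""
    rec = {"Service Date": "2013-01-04", "Trip_Id": "T1", "stop_name": stop,
           "Stop_Dwell_Time": None if missing else dwell}
    for field in AVL_FIELDS:
        rec[field] = None if missing else "recorded"
    return rec


sample = [make("Stop 2", 30), make("Stop 3", missing=True),
          make("Stop 4", -15), make("Stop 2", 30)]
report = audit(sample)
print({name: len(rows) for name, rows in report.items()})
```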

A graph of the percentage of missing values by trip start time (Figure 2) shows a strong correlation: the later the trip start time, the higher the percentage of missing values. A large number of missing values occur in the 5:30 PM trip. An analysis of missing values by stop number (Figure 3) indicates an increase in the number of missing values at stops 5, 6, 8, 21, and 23. Stops 8, 21, and 23 are serviced in trips that depart after 5:30 PM and have a larger percentage of missing values, as shown in Figure 4.

Figure 2: Percentage of missing values by trip start time

Figure 3: Percentage of missing values by stop number

Figure 4: Percent of missing values by trip start time and stop number
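The percentages plotted in Figures 2 and 3 amount to a grouped count of missing records. A minimal sketch is shown below, assuming each record already carries a boolean `missing` flag; the flag, the key names, and the sample records are hypothetical.

```python
from collections import defaultdict


def percent_missing_by(records, key):
    """Percentage of records flagged missing within each value of `key`."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        total[rec[key]] += 1
        if rec["missing"]:
            missing[rec[key]] += 1
    return {value: 100.0 * missing[value] / total[value] for value in sorted(total)}


# Hypothetical records, for illustration only.
records = [
    {"trip_start_time": "07:30", "stop_number": 5, "missing": False},
    {"trip_start_time": "07:30", "stop_number": 6, "missing": False},
    {"trip_start_time": "07:30", "stop_number": 8, "missing": False},
    {"trip_start_time": "07:30", "stop_number": 21, "missing": True},
    {"trip_start_time": "17:30", "stop_number": 5, "missing": True},
    {"trip_start_time": "17:30", "stop_number": 6, "missing": True},
    {"trip_start_time": "17:30", "stop_number": 8, "missing": True},
    {"trip_start_time": "17:30", "stop_number": 21, "missing": False},
]

by_start = percent_missing_by(records, "trip_start_time")
by_stop = percent_missing_by(records, "stop_number")
print(by_start)
print(by_stop)
```

The same function serves for any single grouping field, so the views by trip start time and by stop number differ only in the key passed in.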

When analyzing the occurrence of missing values by day of the week, we observe an increase in the number of occurrences on Fridays, followed by Mondays (Figure 5). By cross-tabulating by day of the week and stop number (Figure 6), and by day of the week and trip start time (Figure 7), we identify that data capture is completely unreliable on Fridays after 5:30 PM.

Figure 5: Percentage of missing values by day of the week

Figure 6: Percentage of missing values by day of the week and stop number

Figure 7: Percentage of missing values by day of the week and trip start time
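A cross-tabulation such as the one behind Figure 7 can be sketched by deriving the day of the week from the service date and counting missing records in each cell. The key names, the ISO date format, and the sample records below are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def crosstab_percent_missing(records):
    """Percent missing for each (day of week, trip start time) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [missing, total]
    for rec in records:
        day = DAY_NAMES[date.fromisoformat(rec["service_date"]).weekday()]
        cell = counts[(day, rec["trip_start_time"])]
        cell[0] += int(rec["missing"])
        cell[1] += 1
    return {cell: 100.0 * m / n for cell, (m, n) in counts.items()}


# Hypothetical records: 2013-01-04 was a Friday, 2013-01-02 a Wednesday.
records = [
    {"service_date": "2013-01-04", "trip_start_time": "17:30", "missing": True},
    {"service_date": "2013-01-04", "trip_start_time": "17:30", "missing": True},
    {"service_date": "2013-01-04", "trip_start_time": "07:30", "missing": False},
    {"service_date": "2013-01-02", "trip_start_time": "17:30", "missing": True},
    {"service_date": "2013-01-02", "trip_start_time": "17:30", "missing": False},
]

table = crosstab_percent_missing(records)
for cell in sorted(table):
    print(cell, table[cell])
```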

To determine whether there was a relationship between seconds late (deviation from the scheduled departure time) and the occurrence of missing values, we created a variable that keeps the last recorded seconds late before a missing value or a series of missing values occurred. We created a histogram of the frequency of missing values by last recorded seconds late (Figure 8) and observed an increase in the number of missing values when the buses were about 20 minutes late.

Figure 8: Histogram of missing values by last recorded seconds late
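The "last recorded seconds late" variable can be illustrated with a short sketch. It assumes a trip is a list of seconds-late values in stop order, with None marking a missing stop record, and it emits one value per run of missing records; whether the original analysis counted each run or each missing record is not stated, so that choice, the function name, and the 300-second bin width are assumptions.

```python
from collections import Counter


def last_seconds_late_before_gaps(trip):
    """For each run of missing stops, return the last value recorded before it.

    A run at the start of the trip has no earlier value and is skipped.
    """
    result = []
    last_valid = None
    in_gap = False
    for value in trip:
        if value is None:
            if not in_gap and last_valid is not None:
                result.append(last_valid)  # first missing record of a new run
            in_gap = True
        else:
            last_valid = value
            in_gap = False
    return result


# Hypothetical trip: two runs of missing records.
trip = [30, 120, None, None, 1180, None, 90]
before_gaps = last_seconds_late_before_gaps(trip)

# Histogram with 300-second (5-minute) bins, keyed by the lower bin edge.
histogram = Counter(value // 300 * 300 for value in before_gaps)
print(before_gaps, dict(histogram))
```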

2.2. Outlier analysis

The variable of interest, seconds late, measures the deviation of the departure time from the schedule. It shows large variation and a significant number of outliers and extreme values (Figure 9). We wanted to measure how incomplete data per trip affect the occurrence of extreme values. After eliminating records from the first and twentieth stops, we consider the data capture for a trip complete if it reports all the corresponding stop-level records. The number varies per pattern: 21 records for weekday, 24 records for night 1, and 21 records for night 2. About 60% of the trips have no missing values (Table 2).

Figure 9: Seconds late by stop number for the weekday pattern

Table 2: Trips with no missing values

Pattern	Total	Trips with no missing values	Percent
Route 1 Weekday	2277	1555	68%
Route 1 Night 1	759	334	44%
Route 1 Night 2	253	72	28%
Total trips	3289	1961	60%
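The completeness rule and the percentages in Table 2 can be checked with a few lines of arithmetic. The counts below are copied from Table 2 and the expected record counts from the text; the dictionary and function names are illustrative only.

```python
# Expected stop-level records per trip after dropping the first and twentieth stops.
EXPECTED_RECORDS = {"weekday": 21, "night 1": 24, "night 2": 21}


def is_complete(pattern, valid_records):
    """A trip is complete when it reports every expected stop-level record."""
    return valid_records >= EXPECTED_RECORDS[pattern]


# (total trips, trips with no missing values) from Table 2.
TABLE_2 = {"weekday": (2277, 1555), "night 1": (759, 334), "night 2": (253, 72)}

percent_complete = {pattern: round(100 * complete / total)
                    for pattern, (total, complete) in TABLE_2.items()}
total_trips = sum(total for total, _ in TABLE_2.values())
total_complete = sum(complete for _, complete in TABLE_2.values())
overall = round(100 * total_complete / total_trips)

print(percent_complete, total_trips, total_complete, overall)
```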

We selected the five highest and five lowest values of SecondsLate for each Trip_Id and graphed them against the number of valid records per trip (Figure 10). We then selected the five highest and five lowest values of SecondsLate for each stop and graphed them against the number of valid records per trip (Figure 11). We identified that negative seconds late (the lowest extreme values) occur when there is a high number of missing values per trip (ten or fewer valid records per trip). We then graphed all stop records with negative SecondsLate against the number of valid records per trip and identified the same pattern: when there are ten or fewer valid records per trip, the frequency and magnitude of negative SecondsLate increase (Figure 12). We recommend eliminating trips with ten or fewer valid stop-level records (130 trips) from further analysis.

Figure 10: Seconds late outliers (Five highest and lowest by trip start time) by number of valid records per trip

Figure 11: Seconds late outliers (Five highest and lowest by stop number) by number of valid records per trip

Figure 12: Seconds late (all stop records with negative seconds late) by number of valid records per trip
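The selection of extreme values and the ten-record cutoff could be sketched as follows. The trip identifiers and values are hypothetical, None marks a missing stop record, and the function names are illustrative; only the rule of five highest and five lowest values and the threshold of ten valid records come from the text.

```python
import heapq


def extremes(values, n=5):
    """Return the n lowest and the n highest values of a list."""
    return heapq.nsmallest(n, values), heapq.nlargest(n, values)


def split_by_valid_count(trips, threshold=10):
    """Separate trips with more than `threshold` valid records from the rest."""
    keep, drop = {}, {}
    for trip_id, seconds_late in trips.items():
        valid = [value for value in seconds_late if value is not None]
        target = keep if len(valid) > threshold else drop
        target[trip_id] = valid
    return keep, drop


# Hypothetical trips: A is fully recorded, B has only three valid records.
trips = {
    "A": list(range(-60, 600, 30)),
    "B": [-900, None, None, -1200, 40] + [None] * 16,
}

keep, drop = split_by_valid_count(trips)
lowest, highest = extremes(keep["A"])
print(sorted(keep), sorted(drop))
print(lowest, highest)
```

Trips that land in `drop` correspond to those recommended for elimination, since their few remaining records are where the suspect negative values concentrate.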

3. CONCLUSIONS AND RECOMMENDATIONS

The objective of our research was to perform a data completeness analysis in preparation for further studies such as run time analysis. The concern is that errors in data capture or archiving could lead to wrong conclusions, particularly when systematic data capture errors are present. We conclude that systematic data capture errors are present:

Data capture at the terminal (beginning and end) of the trip is unreliable.
Stop number 20 is misconfigured and is not recording any values.
Stops number 5 and 6 have a high frequency of missing values.
There is a correlation between trip start time and the occurrence of missing values: as the start time increases, the occurrence of missing values increases.
Mondays and Fridays present a higher frequency of missing values, particularly for the 5:30 PM trip, where 94% of values are missing on Fridays. Late Friday trips have the highest percentage of missing values: between 86% and 94%.
The system seems to malfunction when the buses are around 20 minutes late and does not seem to record values beyond 1,500 seconds (25 minutes) late.
The occurrence of suspect outliers with negative seconds late increases when there are ten or fewer valid stop-level records per trip.

We recommend adjusting the configuration of stops 5, 6, and 20 so that they record stop-level data. We also recommend studying the data capture errors identified and reducing them before run time analyses are performed. Problems with data capture at the beginning and end of the trip are known problems with AVL systems, particularly on routes configured as a loop. We recommend using the arrival at stop 2 and the departure from stop 26 as proxies for the trip start and end.

After discussing our results with management and the AVL system contractor the following changes were introduced:

The trigger box for stop 20 was relocated. It should start generating stop-level reports.
It was identified that the buses drove a different route than expected and were not covering stops 5 and 6. A service change has been introduced, and the stops should now be serviced properly, generating stop-level records.
It was identified that on Fridays at 5:30 PM the bus was being assigned in the system to a different logical route than the one being covered. This problem has been addressed, and the bus should start generating stop-level records on Fridays after 5:30 PM.
It was identified that, with the software currently installed, the system malfunctions when the buses are late by a significant amount of time. The bus loses schedule adherence and fails to be properly assigned to the next trip. A firmware upgrade on the vehicle has been recommended to address the issue.

We recommend that a new data completeness analysis be performed to quantify the effectiveness of the proposed changes.

Given the widespread adoption of AVL systems by rural transit systems, we recommend that detailed data completeness analysis become routine before the data are used for other studies, such as run time analysis.

Funding: This study received no specific financial support.

Competing Interests: The authors declared that they have no conflict of interests.

Contributors/Acknowledgement: All authors participated equally in designing and estimation of current research.

Views and opinions expressed in this study are the views and opinions of the authors, Asian Journal of Empirical Research shall not be responsible or answerable for any loss, damage or liability etc. caused in relation to/arising out of the use of the content.

References

El-Geneidy, A. M., Horning, J., & Krizek, K. J. (2011). Analyzing transit service reliability using detailed data from automatic vehicular locator systems. Journal of Advanced Transportation, 45(1), 66-79.

Furth, P. (2006). TCRP report 113: Using archived AVL-APC data to improve transit performance and management. (P. G. Furth, B. Hemily, T. H. J. Muller, J. G. Strathman, & T. R. Board, Eds.). Washington, DC: The National Academies Press. Retrieved from http://onlinepubs.trb.org/onlinepubs/tcrp/tcrp_rpt_113.pdf.

Furth, P. G., Hemily, B. J., Muller, T. H. J., & Strathman, J. G. (2003). Uses of archived AVL-APC data to improve transit performance and management: Review and potential. Washington, DC: Transportation Research Board.

Furth, P., Muller, T., Strathman, J., & Hemily, B. (2004). Designing automated vehicle location systems for archived data analysis. Transportation Research Record: Journal of the Transportation Research Board, 1887, 62-70.

Hammerle, M., Haynes, M., & McNeil, S. (2005). Use of automatic vehicle location and passenger count data to evaluate bus operations. Transportation Research Record: Journal of the Transportation Research Board, 1903(1), 27-34.

Hounsell, N. B., Shrestha, B. P., & Wong, A. (2012). Data management and applications in a world-leading bus fleet. Transportation Research Part C: Emerging Technologies, 22, 76-87.

McLeod, F. (2007). Estimating bus passenger waiting times from incomplete bus arrivals data. Journal of the Operational Research Society, 58(11), 1518-1525.

Moreira-Matias, L., Mendes-Moreira, J., de Sousa, J. F., & Gama, J. (2015). Improving mass transit operations by using AVL-based systems: A survey. IEEE Transactions on Intelligent Transportation Systems, 16(4), 1636-1653.

Radin, S. (2005). Advanced public transportation systems deployment in the United States: Year 2004 update.

Saavedra, M., Hellinga, B., & Casello, J. (2011). Automated quality assurance methodology for archived transit data from automatic vehicle location and passenger counting systems. Transportation Research Record: Journal of the Transportation Research Board, 2256(1), 130 - 141. Retrieved from http://trb.metapress.com/index/K077X31W67032848.pdf.
