Discipline: Analytics

Improved Variable and Value Ranking Techniques for Mining Categorical Traffic Accident Data

  • December 1st, 2005

This paper reviews the use of two new metrics for assessing the significance of attributes in a database when two subsets of the data are compared. Traditional statistical techniques are useful, and the sample size in public safety databases usually allows the normal approximation to the binomial distribution to be used in comparing proportions. For example, comparing the proportion of alcohol-related crashes that occur on Saturdays with the corresponding proportion of non-alcohol-related crashes would show the former to be very significantly higher. However, the new metrics go a step further: they give the user a clear, intuitive grasp of exactly how much more is occurring, not in terms of proportions but in terms of the number of crashes (for the traffic safety example). The metric is called Maximum Gain, and it directly measures the number of crashes over and above what would typically be expected. This gives the user a clear indication of the potential gain from applying a countermeasure related to the attribute (e.g., selective enforcement on Saturdays). It is not realistic to think that this gain would include all of the crashes for the attribute value; rather, it is realistic to view the maximum gain as the total over-represented amount.
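The contrast between the proportion test and Maximum Gain can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names and the crash counts are hypothetical, and the expected count is assumed to be the subset total multiplied by the comparison group's proportion.

```python
import math

def two_proportion_z(count_a, total_a, count_b, total_b):
    """z statistic for two proportions, using the normal approximation."""
    p_a = count_a / total_a
    p_b = count_b / total_b
    pooled = (count_a + count_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / std_err

def max_gain(count_subset, total_subset, count_control, total_control):
    """Crashes in the subset beyond the number the control proportion predicts.

    Assumed formula: observed count minus (subset total x control proportion).
    """
    expected = total_subset * count_control / total_control
    return count_subset - expected

# Hypothetical counts: Saturday crashes among alcohol-related crashes
# (2,500 of 10,000) versus non-alcohol-related crashes (15,000 of 100,000).
z = two_proportion_z(2500, 10000, 15000, 100000)
gain = max_gain(2500, 10000, 15000, 100000)
print(f"z = {z:.1f}, maximum gain = {gain:.0f} crashes")
```

With these made-up numbers the z statistic only says that the difference is significant, while the maximum gain of 1,000 states how many Saturday crashes exceed the expected 1,500.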

Strategies to Improve Variable Selection Performance

  • June 1st, 2005

This paper compares the “row major order” data structure used in standard relational databases against the transposed “column major order” data structure used by CARE. Both data structures are described in detail, as are the various filtering methods that can be employed. The performance tradeoffs between the two data structures demonstrated a clear advantage of column major order over the traditional storage approach.
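The two layouts can be illustrated with a small in-memory sketch. The field names and records below are hypothetical, and Python lists stand in for on-disk storage, so this shows only the shape of the data, not the performance difference the paper measured.

```python
from collections import Counter

# Row major: one tuple per crash record (hypothetical fields and values).
field_names = ("day", "alcohol", "locale")
rows = [
    ("Saturday", "Yes", "Rural"),
    ("Monday", "No", "Urban"),
    ("Saturday", "No", "Urban"),
    ("Friday", "Yes", "Rural"),
]

# Column major: one list per attribute, obtained by transposing the rows.
columns = {name: list(values) for name, values in zip(field_names, zip(*rows))}

def count_row_major(records, field_index):
    # Every complete record is visited just to read one of its fields.
    return Counter(record[field_index] for record in records)

def count_column_major(cols, name):
    # Only the one attribute of interest is read.
    return Counter(cols[name])

print(count_row_major(rows, 0))
print(count_column_major(columns, "day"))
```

Both functions return the same frequency table; the difference is how much data must be touched to produce it.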

Utilizing Commodity Hardware and Software to Distribute a Real-World Application: Maximizing Reuse While Improving Performance

  • June 1st, 2005

This research examined the current use of commodity computing hardware, which is motivated by a dramatic increase in the performance-to-price ratio. It evaluated the performance of a statistical analysis application on a ten-node, off-the-shelf computing cluster. The study had two parts: (1) examining various network topologies, and (2) minimizing the software modifications required to distribute the application. The general conclusion was that when reuse of existing code is feasible, performance can be dramatically increased by the combined use of parallel computing and commodity components.

Variable Selection and Ranking for Analyzing Automobile Traffic Accident Data

  • April 1st, 2005

This paper explores a data mining process in which the original dataset is first transformed through a variable subset selection process, followed by the application of a machine learning algorithm. A variable ranking technique, called the Sum of Maximum Gain Ratio (SMGR), is applied. This technique computes a score based on the over-representation of attribute values. Essentially, SMGR is the ratio of the number of cases that could potentially be reduced by an effective countermeasure to the total number of cases associated with the over-represented values. SMGR was shown empirically to provide results comparable to those of alternative techniques, with significantly better runtime performance.
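One possible reading of the SMGR description is sketched below. The abstract does not give the exact formula, so everything here is an assumption: the function name, the use of per-value counts for a subset and a comparison group, and the choice to sum the gains of the over-represented values and divide by the subset cases in those values.

```python
def smgr(subset_counts, control_counts):
    """Assumed SMGR: summed maximum gain / subset cases in over-represented values.

    Both arguments map each attribute value to a case count (hypothetical input).
    """
    total_subset = sum(subset_counts.values())
    total_control = sum(control_counts.values())
    gain_sum = 0.0
    over_represented_cases = 0
    for value, observed in subset_counts.items():
        # Cases expected if the subset followed the control distribution.
        expected = total_subset * control_counts.get(value, 0) / total_control
        gain = observed - expected
        if gain > 0:  # only over-represented values contribute
            gain_sum += gain
            over_represented_cases += observed
    if over_represented_cases == 0:
        return 0.0
    return gain_sum / over_represented_cases

# Hypothetical counts for one variable (day of week, grouped).
subset = {"Saturday": 50, "Sunday": 30, "Weekday": 20}
control = {"Saturday": 200, "Sunday": 200, "Weekday": 600}
print(smgr(subset, control))
```

Under this reading, variables would be ranked by computing the score for each one and sorting in descending order; a variable whose subset distribution matches the comparison group scores zero.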

Disk Storage to Support Statistical Analysis Operations

  • January 1st, 2005

Database techniques generally require reading complete rows of data (traditionally called “records”) in order to get at a single attribute of interest. Further, if filtering is required (not all records are of interest), an additional computational step is needed on each record to determine whether it qualifies. Transposing the data enables this to be accomplished with a single read operation followed by a single filter-pointer operation, producing essentially instantaneous results. This method has proven successful in producing results in real time when applied to millions of records.
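The read-then-filter idea can be sketched as follows. The abstract does not define a “filter-pointer” operation; this sketch assumes it amounts to a precomputed list of qualifying record positions that is applied to a column after the column has been read. The column names, values, and helper functions are hypothetical.

```python
# Transposed (column major) storage: one list per attribute.
columns = {
    "day": ["Saturday", "Monday", "Saturday", "Friday", "Sunday"],
    "alcohol": ["Yes", "No", "No", "Yes", "Yes"],
    "severity": ["Injury", "Property", "Property", "Fatal", "Injury"],
}

def build_filter(column, predicate):
    """Return the positions of the records whose value satisfies the predicate."""
    return [i for i, value in enumerate(column) if predicate(value)]

def apply_filter(column, selected):
    """Pick the selected positions out of a single attribute column."""
    return [column[i] for i in selected]

# Build the filter once, then reuse it for any attribute of interest.
alcohol_rows = build_filter(columns["alcohol"], lambda v: v == "Yes")
print(apply_filter(columns["day"], alcohol_rows))
print(apply_filter(columns["severity"], alcohol_rows))
```

Because the filter is a list of positions and not a per-record test, it is computed once and reused for every attribute, and no complete record is ever assembled.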

CARE: An Automobile Crash Data Analysis Tool

  • June 1st, 2003

This paper presents an early (2003) review of CARE that was published in IEEE Computer, the flagship publication of the IEEE Computer Society. The major points made in the paper include:

  • The causes of CARE’s early success were twofold: (1) its simplicity of use, enabling safety practitioners with basic computer literacy skills to obtain information from it easily with a minimum of training; and (2) its efficiency, providing virtually instantaneous presentation of results for even the largest of databases (several hundred thousand records).
  • CARE had been implemented in a number of states.
  • CARE had received the 1995 NHTSA Administrator’s Award for innovation.
  • CARE was also being applied outside highway safety: databases were being mined at the Federal Aviation Administration (FAA) and NASA.