This report has been created in the framework of a student group project and the Georgia Institute of Technology does not officially sanction its content.
Executive Summary
Every year, Walmart is accused of increasing crime in the areas where it builds Walmart Supercenters. Yet research and data analyses largely contradict these claims: other factors, such as unemployment rates and population, appear to contribute more to rising crime rates than the construction of Walmart stores does.
Minitab was used to run a linear regression with the inputs population, demographics, high school graduation rates, unemployment rates, median household income, and the number of Walmart Supercenters, and with the output property crime rate. The overall purpose of the analysis was to reveal relationships, but not causation, between these inputs and property crime rates. Ultimately, the analysis reveals no definite relationship between the number of Walmart Supercenters and property crime from 1999 to 2010 in DeKalb and Gwinnett Counties.

Background
One of the most significant issues concerning Walmart in communities is whether a relationship exists between the presence of a Walmart and local crime rates. In past years, these concerns have been voiced by local residents and public law enforcement. Justice Starcher, a member of the West Virginia Supreme Court, states that “a quick review of reported cases reveals that Walmart parking lots are a virtual magnet for crime,” and Chief John Slauch of the West Sadsbury Township Police Department “saw a significant increase in crime and incident calls for service from the date Walmart opened.”
These accusations have resulted in the common belief that Walmart stores directly and indirectly increase crime. A research and analysis plan was developed to determine whether or not any relationship exists between Walmart Supercenters and property crime rates. In order to compare various non-violent crime rates, the initial plan was to run 3 different analyses with 3 different outputs: property crime rates, larceny from a vehicle, and larceny from a non-vehicle.
These crime categories were chosen after consulting Harvaran Ghai, a student employee of the Georgia Tech Police Department who works on crime data analysis. They were suggested because they were the top 3 non-violent crimes on campus. However, as the project progressed and the full amount of work it entailed became clear, time allowed only for analysis of the property crime rate, a category of crime that includes, among other crimes, burglary, larceny, theft, motor vehicle theft, arson, shoplifting, and vandalism. After the problem was defined, input data was collected.
Using the sources listed in “Appendix I: Works Cited,” we obtained 7 inputs: population, demographics (the African American and Caucasian populations as separate inputs), high school graduation rates, unemployment rates, median household income, and the number of Walmart Supercenters in the given counties at each point in time. In order to get enough observations to run the regression, monthly data was collected for the period 1999-2010. Monthly unemployment rates were easily obtainable; however, all other inputs were available only as yearly data, which was manipulated in order to create monthly data points.

Analysis
The following observations were recorded for each iteration of the data analysis so that iterations could be compared and improvement tracked: R², the p-values of the individual variables, and the 4 residual plots. Unusual observations (e.g. outliers) were removed from the data after each iteration. After careful consideration of these observations, the following tests were performed: the Durbin-Watson test, the partial F-test, and the variance inflation factor test.
Each iteration eliminated the data points that were listed as “unusual observations,” that is, any data point with a large standardized residual. After 5 iterations, the analysis showed improved residual plots. Randomness in the versus-fits and versus-order plots indicates that the linear regression model is appropriate for the data, while a straight line in the normal probability plot and a bell-shaped histogram both indicate that the residuals are approximately normally distributed.
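The pruning step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure Minitab runs internally: it uses a single predictor for brevity (the report used several), the function names are hypothetical, and the cutoff of 2 for a "large" standardized residual is an assumption.

```python
import math

def standardized_residuals(x, y):
    """Fit y = b0 + b1*x by least squares and return standardized residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))
    if s == 0:
        return [0.0] * n  # perfect fit: nothing to flag
    out = []
    for xi, e in zip(x, resid):
        leverage = 1 / n + (xi - mean_x) ** 2 / sxx
        out.append(e / (s * math.sqrt(1 - leverage)))
    return out

def drop_unusual(x, y, cutoff=2.0):
    """One pruning pass: keep only points with |standardized residual| <= cutoff.

    The cutoff of 2.0 is an assumed threshold for illustration.
    """
    std_resid = standardized_residuals(x, y)
    keep = [i for i, r in enumerate(std_resid) if abs(r) <= cutoff]
    return [x[i] for i in keep], [y[i] for i in keep]

# Made-up data: a clean line with one point pushed far off it.
x = list(range(10))
y = [1 + 2 * xi for xi in x]
y[5] += 10
x_kept, y_kept = drop_unusual(x, y)
```

Each call is one iteration: refit, flag, drop. Repeating the call on its own output mirrors the repeated passes described in the report, and the residual plots would be re-inspected after every pass to decide when to stop.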
Because of the way the monthly data was constructed, absolute randomness could not be obtained; however, 5 iterations were judged sufficient because the sixth iteration showed a decrease in the quality of the residual plots. The first test performed was the p-value test of the individual variables. A p-value is the probability, ranging from 0 to 1, of obtaining a test statistic at least as extreme as the one actually observed, assuming the null hypothesis (here, that the variable's coefficient is zero) is true.
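As a rough illustration of how a coefficient's p-value arises, the sketch below takes the HS Graduation Rates row from the Minitab output in Appendix II, forms the t statistic as coefficient divided by standard error, and converts it to a two-sided p-value. It assumes a normal approximation to the t distribution, which is reasonable with 184 residual degrees of freedom but is not exactly what Minitab computes; the function name is hypothetical.

```python
import math

def two_sided_p_normal(t_stat):
    """Two-sided p-value for a test statistic, using a normal approximation."""
    return math.erfc(abs(t_stat) / math.sqrt(2))

# Coefficient and standard error copied from Appendix II (HS Graduation Rates).
coef = -24.627
se_coef = 5.756

t_stat = coef / se_coef          # matches the T column, about -4.28
p_value = two_sided_p_normal(t_stat)

significant = p_value < 0.05     # 0.05 is the significance level chosen in the report
```

The resulting p-value is far below 0.0005, which is why Minitab prints it as 0.000.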
The only input whose p-value was not less than 0.05, the chosen significance level, was the “Number of Walmarts” variable; that is, the number of Walmarts has no statistically significant effect on the output, property crime rate, in this model. The R² of the analysis, or the coefficient of determination, measures the proportion of the variation in the output that the model explains. R² values range from 0 to 100% (or 0 to 1), and the analysis has an R² of 97.9%, which indicates a close fit to the observed data. The Durbin-Watson test was used to check for autocorrelation of the residuals in the regression. Autocorrelation is the relationship between values separated from each other by a given time lag.
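To make the definition concrete, here is a minimal sketch that computes the lag-1 autocorrelation of a residual series together with the Durbin-Watson statistic. The two residual series are made up for illustration and the function names are hypothetical.

```python
def lag1_autocorrelation(resid):
    """Correlation between each residual and the one immediately before it."""
    n = len(resid)
    mean = sum(resid) / n
    num = sum((resid[t] - mean) * (resid[t - 1] - mean) for t in range(1, n))
    den = sum((e - mean) ** 2 for e in resid)
    return num / den

def durbin_watson(resid):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

# Made-up residuals that drift slowly (neighbours resemble each other).
smooth = [1, 2, 3, 2, 1, -1, -2, -3, -2, -1]
# Made-up residuals that flip sign every step.
alternating = [1, -1, 1, -1, 1, -1]

dw_smooth = durbin_watson(smooth)
dw_alternating = durbin_watson(alternating)
r1_smooth = lag1_autocorrelation(smooth)
```

The statistic lies between 0 and 4. For long series it is roughly 2(1 - r), where r is the lag-1 autocorrelation, so values near 2 suggest uncorrelated errors, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.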
The test states that if the statistic is lower than the lower critical value dL,α/2, here 1.57, then the null hypothesis that the error terms are not correlated is rejected. Since the Durbin-Watson statistic is 0.342531, well below 1.57, the null hypothesis is rejected and positive autocorrelation exists. The variance inflation factor (VIF) test checks for multicollinearity, which occurs when 2 or more predictor variables are highly correlated. The VIF test indicates that multicollinearity exists among the variables because the average VIF is 387.9, which is substantially larger than 1. The next test performed was the partial F-test.
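The partial F statistic compares the error sum of squares of a reduced model (one or more predictors dropped) against that of the full model. The sketch below shows the arithmetic only; all numbers in it are hypothetical and are not taken from the report, and the function and parameter names are illustrative.

```python
def partial_f(sse_reduced, sse_full, num_dropped, df_error_full):
    """F statistic for testing whether the dropped predictors add explanatory power.

    sse_reduced   -- error sum of squares of the model without the predictors
    sse_full      -- error sum of squares of the model with all predictors
    num_dropped   -- how many predictors were removed
    df_error_full -- residual degrees of freedom of the full model
    """
    extra_ss = sse_reduced - sse_full
    return (extra_ss / num_dropped) / (sse_full / df_error_full)

# Hypothetical sums of squares, chosen only to illustrate the calculation.
f_stat = partial_f(sse_reduced=101.0, sse_full=100.0,
                   num_dropped=1, df_error_full=50)

# Approximate 5% critical value of F with (1, 50) degrees of freedom,
# as read from a standard F table.
f_critical = 4.03

# If f_stat is below the critical value, dropping the predictor does not
# significantly worsen the fit.
variable_matters = f_stat > f_critical
```

In this made-up case the statistic is 0.5, well below the critical value, so the dropped variable would be judged not to contribute significantly.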
This test is used to assess the relevance, or value to the model, of variables within the regression analysis. The partial F-test shows that, with the given model and inputs, eliminating the “Number of Walmarts” variable does not significantly change how well the model explains the property crime rate. It is important to note that if the inputs, data points, time frame, etc. changed, the partial F-test for the “Number of Walmarts” variable might show different results. In conclusion, with the given analysis, there is no clear relationship between the number of Walmarts and the property crime rates in DeKalb and Gwinnett Counties from 1999 to 2010.

Recommendations & Future Outlook
The following recommendations offer guidance to anyone seeking to extend this project, including better time frames to choose and improved methods of data collection and regression analysis:
1. Locate multiple counties with and without a Walmart, as well as with varying demographics. A before-and-after view of Walmart within a 5-year span is most indicative of its actual effects on a specified locale.
2. Monthly crime data, as well as monthly data on population and demographics, is ideal to avoid most autocorrelation. The availability of this data should be the primary factor in choosing which location to analyze.
3. Look for variables that vary greatly but still have an assumed effect on crime (e.g. demographics, socioeconomics, education, areas surrounding the specific location, and economic and social issues of the area).
4. The output variable depends on the project's focus with respect to crime. The crime should be chosen according to location, common crimes in that area, and accusations made about specific crimes related to the placement of a Walmart Supercenter; the more specific the crime, the more telling the analysis will be.
5. Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. The method used should be a time-domain method, which includes autocorrelation and cross-correlation analysis.
6. SAS is the most comprehensive and detailed program that can be used for time series regression. For projects on a budget, RATS is cheaper and more easily accessible. RATS has many of the same capabilities as SAS in both time series analysis and other advanced statistical methods; the two differ more in the details than in capabilities.
Several issues, such as correlation and true randomness, are beyond the scope of a regression analysis of this nature. Crime moves in patterns based on many different variables, and it is impossible to account for all the multicollinearity that arises simply from the human factor. In any project similar to this one, the more detailed the variables and analysis are, the better the results and the more meaningful the relationships that can be uncovered.
Appendix I: Works Cited
"Census Bureau Homepage." United States Census Bureau. U.S. Government, n.d. Web. 14 Jun 2012. <http://www.census.gov/>.
"Crime and Wal-Mart: 'Is Wal-Mart Safe?'" UFCW Local 770. WakeUpWalMart.com, 2006. Web. 14 Jun 2012. <http://www.ufcw770.org/>.
"Georgia Crime Rates 1960-2010." FBI Uniform Crime Report. disastercenter.com, 2010. Web. 14 Jun 2012. <http://www.disastercenter.com/crime/gacrime.htm>.
Hennagir, Tim. "Citizen group speaks out against Walmart project." ABC Newspapers.com. N.p., 2011. Web. 14 Jun 2012. <http://abcnewspapers.com/2011/11/22/citizen-groupspeaks-out-against-walmart-project/>.
"Wal-Mart the Crime Magnet:." ReclaimDemocracy.org. Associated Press, 2004. Web. 14 Jun 2012. <http://reclaimdemocracy.org/walmart/crime_increase.html>.
Appendix II: Minitab Output
Regression Analysis: Property Cri versus Median House, Unemployment, ...
The regression equation is Property Crime rate = 7429 - 0.0628 Median Household Income - 103 Unemployment Rates + 0.0213 Population - 24.6 HS Graduation Rates - 0.0136 African American - 0.0267 White
191 cases used, 1 cases contain missing values
Predictor                      Coef   SE Coef       T      P      VIF
Constant                       7429      1024    7.26  0.000
Median Household Income   -0.062750  0.008820   -7.11  0.000   20.088
Unemployment Rates          -103.35     18.82   -5.49  0.000    7.882
Population                 0.021318  0.001832   11.64  0.000   57.582
HS Graduation Rates         -24.627     5.756   -4.28  0.000   11.192
African American          -0.013569  0.003132   -4.33  0.000  758.576
White                     -0.026705  0.003903   -6.84  0.000  805.856
S = 180.636 R-Sq = 97.9% R-Sq(adj) = 97.8%
Analysis of Variance
Source           DF         SS        MS        F      P
Regression        6  277468049  46244675  1417.26  0.000
Residual Error  184    6003832     32630
Total           190  283471881
Source                   DF     Seq SS
Median Household Income   1  179704167
Unemployment Rates        1    4293815
Population                1   67155915
HS Graduation Rates       1     672273
African American          1   24114102
White                     1    1527776
     Median Household    Property
Obs            Income  Crime rate     Fit  SE Fit  Residual  St Resid
 26             57805      2855.2  2477.2    25.3     378.0     2.11R
120             47508      6178.2  5746.2    35.7     431.9     2.44R
121             47449      6118.9  5695.2    35.2     423.7     2.39R
149             45226      5916.4  5511.3    28.9     405.2     2.27R
150             45110      5843.1  5453.3    27.5     389.7     2.18R
154             45057      4787.8  5267.4    29.8    -479.6    -2.69R
164             51311      4578.0  5091.7    28.0    -513.7    -2.88R
165             51422      4669.0  5106.9    31.8    -437.9    -2.46R
169             52984      4505.1  4968.5    37.7    -463.4    -2.62R
R denotes an observation with a large standardized residual.
Durbin-Watson statistic = 0.332698
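Several of the summary figures in the output above follow directly from the Analysis of Variance table and the VIF column. The sketch below recomputes them as a cross-check. It relies on the standard textbook definitions of these quantities, and the variable names are illustrative; the input numbers are copied from the output.

```python
import math

# Sums of squares and degrees of freedom from the Analysis of Variance table.
ss_regression = 277468049
ss_error = 6003832
ss_total = 283471881
df_regression = 6
df_error = 184
df_total = 190

# Share of total variation explained by the model.
r_sq = ss_regression / ss_total
# Same idea, penalized for the number of predictors.
r_sq_adj = 1 - (ss_error / df_error) / (ss_total / df_total)
# S is the square root of the mean squared error.
mse = ss_error / df_error
s = math.sqrt(mse)
# Overall F statistic: mean square regression over mean square error.
f_stat = (ss_regression / df_regression) / mse

# VIF_j = 1 / (1 - R_j^2), so each VIF implies how well predictor j
# is explained by the other predictors.
vif = {
    "Median Household Income": 20.088,
    "Unemployment Rates": 7.882,
    "Population": 57.582,
    "HS Graduation Rates": 11.192,
    "African American": 758.576,
    "White": 805.856,
}
implied_r_sq = {name: 1 - 1 / value for name, value in vif.items()}
```

The recomputed values agree with the reported R-Sq, R-Sq(adj), S, and F to the precision shown. The implied R² for the African American and White variables exceeds 0.99, meaning each is almost entirely predictable from the other inputs, which is what the very large VIF values express.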
Residual Plots for Property Crime rate