Introduction
In the Era of Big Data, Ask Bigger Questions!
What’s the Machine Learning Dilemma?
When modeling with statistical data mining and machine learning methods, models are usually more accurate with more data used for training.
However, as more and more data becomes available, it becomes even harder to build reliable models.
Why? There are too many channels for input signal selection, and it is hard to meet the real-time response requirement.
Solutions
High Performance Computers
Cloud Computing / Grid Computing
Amazon Web Services
Google DFS
Yahoo, Apache, Hortonworks, Cloudera
Distributed Database
Distributed Data Mining (Mahout)
In-Memory Analytics (H2O, Spark)
Redshift – a hosted MPP database offered by Amazon.com, based on the ParAccel data warehouse. We tested Redshift on HDDs.
Hive – a Hadoop-based data warehousing system. (v0.12)
Shark – a Hive-compatible SQL engine which runs on top of the Spark computing framework. (v0.8.1)
Impala – a Hive-compatible* SQL engine with its own MPP-like execution engine. (v1.2.3)
Stinger/Tez – Tez is a next-generation Hadoop execution engine currently in development. (v0.2.0)
Benchmark: scan query
    SELECT pageURL, pageRank
    FROM rankings
    WHERE pageRank > X
Benchmark: aggregation query
    SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue)
    FROM uservisits
    GROUP BY SUBSTR(sourceIP, 1, X)
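As a rough illustration of how a Hadoop-style engine might evaluate an aggregation like this one, the sketch below splits it into map, shuffle, and reduce phases in plain Python. The sample rows and the function names (map_phase, shuffle_phase, reduce_phase, aggregate) are made up for illustration; this is not how Hive or any of the benchmarked systems is actually implemented.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the uservisits table:
# (sourceIP, adRevenue). The values are invented for illustration.
USERVISITS = [
    ("10.0.0.1", 1.5),
    ("10.0.0.2", 2.0),
    ("10.1.0.1", 4.0),
    ("192.168.1.1", 0.5),
]


def map_phase(rows, x):
    """Emit (key, value) pairs; the key mirrors SUBSTR(sourceIP, 1, X)."""
    for source_ip, ad_revenue in rows:
        yield source_ip[:x], ad_revenue


def shuffle_phase(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each group, mirroring SUM(adRevenue)."""
    return {key: sum(values) for key, values in groups.items()}


def aggregate(rows, x):
    """Run the three phases in sequence for a prefix length x."""
    return reduce_phase(shuffle_phase(map_phase(rows, x)))


if __name__ == "__main__":
    print(aggregate(USERVISITS, 4))
```

A smaller X yields fewer, larger groups, so the parameter controls how much data has to be moved in the shuffle step relative to the size of the final result.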
Benchmark: join query
    SELECT sourceIP, totalRevenue, avgPageRank
    FROM (SELECT sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue
          FROM Rankings AS R, UserVisits AS UV
          WHERE R.pageURL = UV.destURL
            AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
          GROUP BY UV.sourceIP)
    ORDER BY totalRevenue DESC
    LIMIT 1
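To make the behaviour of the join query concrete, here is a small sketch that runs it against toy data with Python's built-in sqlite3 module. SQLite is not one of the benchmarked systems, and the table contents, the column types, and the helper top_source are assumptions chosen only so the query can be tried on a laptop.

```python
import sqlite3

# In-memory database with tiny, invented stand-ins for the benchmark tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER);
CREATE TABLE uservisits (sourceIP TEXT, destURL TEXT,
                         visitDate TEXT, adRevenue REAL);
INSERT INTO rankings VALUES ('a.com', 10), ('b.com', 30);
INSERT INTO uservisits VALUES
  ('1.1.1.1', 'a.com', '1985-03-01', 2.0),
  ('1.1.1.1', 'b.com', '1990-06-15', 3.0),
  ('2.2.2.2', 'b.com', '1995-01-01', 4.0),
  ('2.2.2.2', 'a.com', '2005-01-01', 9.0);
""")

# Same shape as the benchmark join query; the upper date bound X is a parameter.
JOIN_QUERY = """
SELECT sourceIP, totalRevenue, avgPageRank
FROM (SELECT UV.sourceIP AS sourceIP,
             AVG(R.pageRank) AS avgPageRank,
             SUM(UV.adRevenue) AS totalRevenue
      FROM rankings AS R, uservisits AS UV
      WHERE R.pageURL = UV.destURL
        AND UV.visitDate BETWEEN date('1980-01-01') AND date(?)
      GROUP BY UV.sourceIP)
ORDER BY totalRevenue DESC
LIMIT 1
"""


def top_source(x):
    """Return (sourceIP, totalRevenue, avgPageRank) for the top-earning IP."""
    return conn.execute(JOIN_QUERY, (x,)).fetchone()


if __name__ == "__main__":
    print(top_source("2000-01-01"))
    print(top_source("2010-01-01"))
```

Widening the date range through X pulls more uservisits rows into the join, which is how the benchmark varies the amount of data the query has to process.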
Conclusion
The advantage of Hadoop is not significant for queries on millions of records.
All platforms were highly optimized by professionals.
Costs were not taken into consideration.
The overhead of map-reduce cannot be optimized to meet the real-time running requirements
Choice of Mahout: plan to abandon the map-reduce based engine and make use of a distributed in-memory analytics engine.
Map Reduce: an example