Introduction
In the Era of Big Data, Ask Bigger Questions!
The Trends
What’s the Machine Learning Dilemma?
When modeling with statistical data mining and machine learning methods, models are usually more accurate when trained on more data.
However, as more and more data becomes available, it becomes even harder to build reliable models.
Why?
Too many channels for input signal selection
Hard to meet the real-time response requirement
Solutions
High Performance Computers
Cloud Computing / Grid Computing: Amazon Web Services, Microsoft Azure
Hadoop: HDFS
Too expensive
Google DFS; Yahoo, Apache, Hortonworks, Cloudera
Distributed Database
Distributed Data Mining (Mahout)
In-Memory Analytics (H2O, Spark)
Benchmark: Platforms
Redshift - a hosted MPP database offered by Amazon.com, based on the ParAccel data warehouse. We tested Redshift on HDDs.
Hive - a Hadoop-based data warehousing system. (v0.12)
Shark - a Hive-compatible SQL engine which runs on top of the Spark computing framework. (v0.8.1)
Impala - a Hive-compatible* SQL engine with its own MPP-like execution engine. (v1.2.3)
Stinger/Tez - Tez is a next generation Hadoop execution engine currently in development. (v0.2.0)
Benchmark: scan query
    SELECT pageURL, pageRank
    FROM rankings
    WHERE pageRank > X
Benchmark: aggregation query
    SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue)
    FROM uservisits
    GROUP BY SUBSTR(sourceIP, 1, X)
Benchmark: join query
    SELECT sourceIP, totalRevenue, avgPageRank
    FROM
      (SELECT sourceIP,
              AVG(pageRank) as avgPageRank,
              SUM(adRevenue) as totalRevenue
       FROM Rankings AS R, UserVisits AS UV
       WHERE R.pageURL = UV.destURL
         AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
       GROUP BY UV.sourceIP)
    ORDER BY totalRevenue DESC LIMIT 1
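To get a feel for what the three benchmark queries compute, here is a minimal sketch that runs them against a tiny in-memory SQLite database. SQLite is only a stand-in and is not one of the benchmarked platforms; the sample rows, the function name run_benchmark_queries, and the default values chosen for X are all made up for illustration.

```python
import sqlite3


def run_benchmark_queries(x_rank=10, x_prefix=7, x_date="2000-01-01"):
    """Run the scan, aggregation, and join queries on toy data (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER);
        CREATE TABLE uservisits (
            sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO rankings VALUES (?, ?)",
        [("a.com", 5), ("b.com", 50), ("c.com", 20)],
    )
    conn.executemany(
        "INSERT INTO uservisits VALUES (?, ?, ?, ?)",
        [
            ("10.0.0.1", "a.com", "1990-05-01", 1.0),
            ("10.0.0.1", "b.com", "1995-06-01", 2.0),
            ("10.0.0.2", "b.com", "1999-01-01", 4.0),
            ("10.0.1.9", "c.com", "2005-01-01", 8.0),
        ],
    )

    # Scan query: filter one table on a threshold X.
    scan = conn.execute(
        "SELECT pageURL, pageRank FROM rankings "
        "WHERE pageRank > ? ORDER BY pageURL",
        (x_rank,),
    ).fetchall()

    # Aggregation query: sum ad revenue per source-IP prefix of length X.
    aggregation = conn.execute(
        "SELECT SUBSTR(sourceIP, 1, ?), SUM(adRevenue) FROM uservisits "
        "GROUP BY SUBSTR(sourceIP, 1, ?) ORDER BY 1",
        (x_prefix, x_prefix),
    ).fetchall()

    # Join query: top-revenue source IP among visits before date X.
    # The subquery alias "t" is added here for portability.
    join = conn.execute(
        """
        SELECT sourceIP, totalRevenue, avgPageRank
        FROM (SELECT UV.sourceIP AS sourceIP,
                     AVG(R.pageRank) AS avgPageRank,
                     SUM(UV.adRevenue) AS totalRevenue
              FROM rankings AS R, uservisits AS UV
              WHERE R.pageURL = UV.destURL
                AND UV.visitDate BETWEEN date('1980-01-01') AND date(?)
              GROUP BY UV.sourceIP) AS t
        ORDER BY totalRevenue DESC LIMIT 1
        """,
        (x_date,),
    ).fetchall()

    conn.close()
    return scan, aggregation, join


if __name__ == "__main__":
    for name, rows in zip(("scan", "aggregation", "join"), run_benchmark_queries()):
        print(name, rows)
```

On a single machine with a handful of rows all three queries return instantly; the benchmark is about how the same query shapes behave as the tables grow and the work has to be spread across a cluster.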
Conclusion
The advantage of Hadoop is not significant for queries on millions of records
All platforms were highly optimized by professionals
Costs were not taken into consideration
The overhead of map-reduce cannot be optimized enough to meet real-time running requirements
Choice of Mahout: plans to abandon its map-reduce based engine and make use of a distributed in-memory analytics engine
Map Reduce: an example
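The example from the original slide is not reproduced in this text, so the following is a stand-in: a minimal single-process sketch of the map-reduce pattern using word count, the customary illustration. The function names (map_phase, shuffle_phase, reduce_phase, word_count) are chosen here for clarity and do not come from Hadoop or any other framework.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big questions"]))
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network and through disk, which is where much of the fixed overhead mentioned in the conclusion comes from.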
Thank you!
Q&A?