Machine Learning Dilemma in the Era of Big Data and Cloud Computing

IntroductionIn the Era of Big Data, Ask Bigger Questions!

The Trends

The Trends

What’s the Machine Learning Dilemma

When modeling with statistical data mining and machine learning methods, models are usually more accurate with more data used for training.

However, when more and more data available, it becomes even harder to build reliable models.

Why? too many channels for input signal selection hard to meet the real-time response requirement

SolutionsHigh Performance Computers

Cloud Computing / Grid ComputingAmazon Web Services

Microsoft Azure

Hadoop HDFS

Too expensive

Google DFSYahoo, Apache, Hortonworks, Cloudera

Distributed DatabaseDistributed Data Mining (Mahout)In-Memory Analytics (H2O, Spark)

Benchmark: Platforms

Redshift – a hosted MPP database offered by Amazon.com

based on the ParAccel data warehouse. We testedRedshift on HDDs.Hive – a Hadoop-based data warehousing system. (v0.12) Shark – a Hive-compatible SQL engine which runs on top of the Spark computing framework. (v0.8.1) Impala – a Hive-compatible* SQL engine with its own MPP-like execution engine. (v1.2.3) Stinger/Tez – Tez is a next generation Hadoop executionengine currently in development (v0.2.0)

Benchmark: scan querySELECT pageURL, pageRank FROM rankings WHERE pageRank > X

Benchmark: aggregation querySELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X)

Benchmark: join query SELECT sourceIP, totalRevenue, avgPageRank FROM (SELECT sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date(`1980-01-01′) AND Date(`X’) GROUP BY UV.sourceIP) ORDER BY totalRevenue DESC LIMIT 1

Conclusion The advantage of Hadoop is not significant with queries on millions of records all platforms are highly optimized with professionals cost are not taken into consideration

 The overhead of map-reduce cannot be optimized to meet the real-time running requirements

 Choice of Mahout: plan to abandon map-reduce based engine going to make use of distributed in-memory analytics engine

Map Reduce: an example

Thank you!

Q&A?