Machine Learning Dilemma in the Era of Big Data and Cloud Computing

Introduction

In the Era of Big Data, Ask Bigger Questions!

The Trends

What’s the Machine Learning Dilemma?

When modeling with statistical data mining and machine learning methods, models are usually more accurate when more data is used for training.

However, as more and more data becomes available, it also becomes harder to build reliable models.

Why?
• Too many channels for input signal selection
• Hard to meet the real-time response requirement
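To give a rough sense of why more input channels make selection harder, the sketch below counts candidate input subsets. It assumes that "input signal selection" means choosing a subset of the available channels (feature selection), which is an interpretation of the slide, and the function name `candidate_subsets` is hypothetical.

```python
def candidate_subsets(n_channels: int) -> int:
    """Number of non-empty subsets of n_channels input channels.

    Assumption: each channel is either included or excluded, so an
    exhaustive search would have to consider 2**n - 1 combinations.
    """
    return 2 ** n_channels - 1


# The search space doubles with every additional channel.
for n in (10, 20, 40):
    print(n, "channels:", candidate_subsets(n), "candidate subsets")
```

Exhaustive search is only one possible selection strategy, but the doubling per channel suggests why selection quickly stops being tractable as channels are added.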

Solutions

High Performance Computers

Cloud Computing / Grid Computing: Amazon Web Services, Microsoft Azure

Hadoop HDFS

Too expensive

Google DFS Yahoo, Apache, Hortonworks, Cloudera

Distributed Database

Distributed Data Mining (Mahout)

In-Memory Analytics (H2O, Spark)

Benchmark: Platforms

• Redshift - a hosted MPP database offered by Amazon.com, based on the ParAccel data warehouse. We tested Redshift on HDDs.
• Hive - a Hadoop-based data warehousing system. (v0.12)
• Shark - a Hive-compatible SQL engine which runs on top of the Spark computing framework. (v0.8.1)
• Impala - a Hive-compatible* SQL engine with its own MPP-like execution engine. (v1.2.3)
• Stinger/Tez - Tez is a next-generation Hadoop execution engine currently in development. (v0.2.0)

Benchmark: scan query

SELECT pageURL, pageRank
FROM rankings
WHERE pageRank > X

Benchmark: aggregation query

SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)

Benchmark: join query

SELECT sourceIP, totalRevenue, avgPageRank
FROM (SELECT sourceIP,
             AVG(pageRank) AS avgPageRank,
             SUM(adRevenue) AS totalRevenue
      FROM Rankings AS R, UserVisits AS UV
      WHERE R.pageURL = UV.destURL
        AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
      GROUP BY UV.sourceIP)
ORDER BY totalRevenue DESC
LIMIT 1
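To see what the three benchmark queries compute, they can be run on a toy dataset. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine, not any of the benchmarked platforms. All rows, the column types, and the values chosen for X are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the benchmark tables.
# Schema types and every row below are hypothetical illustration data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER);
    CREATE TABLE uservisits (sourceIP TEXT, destURL TEXT,
                             visitDate TEXT, adRevenue REAL);
""")
conn.executemany("INSERT INTO rankings VALUES (?, ?)", [
    ("a.example", 10), ("b.example", 50), ("c.example", 90),
])
conn.executemany("INSERT INTO uservisits VALUES (?, ?, ?, ?)", [
    ("10.0.0.1", "a.example", "1980-02-01", 1.0),
    ("10.0.0.2", "b.example", "1980-03-01", 2.0),
    ("10.1.0.1", "c.example", "1980-04-01", 4.0),
    ("10.1.0.1", "b.example", "1990-01-01", 8.0),
])

# Scan query, with X = 40.
scan = conn.execute(
    "SELECT pageURL, pageRank FROM rankings WHERE pageRank > :x",
    {"x": 40},
).fetchall()

# Aggregation query, with X = 4 (group by the first 4 characters of the IP).
agg = conn.execute(
    "SELECT SUBSTR(sourceIP, 1, :x), SUM(adRevenue) FROM uservisits "
    "GROUP BY SUBSTR(sourceIP, 1, :x)",
    {"x": 4},
).fetchall()

# Join query, with X = '1985-01-01' as the upper date bound.
top = conn.execute(
    "SELECT sourceIP, totalRevenue, avgPageRank FROM "
    "(SELECT sourceIP, AVG(pageRank) AS avgPageRank, "
    "        SUM(adRevenue) AS totalRevenue "
    " FROM Rankings AS R, UserVisits AS UV "
    " WHERE R.pageURL = UV.destURL "
    "   AND UV.visitDate BETWEEN Date('1980-01-01') AND Date(:x) "
    " GROUP BY UV.sourceIP) "
    "ORDER BY totalRevenue DESC LIMIT 1",
    {"x": "1985-01-01"},
).fetchall()

print(scan)
print(agg)
print(top)
```

On this toy data the join query returns the single source IP with the highest total ad revenue inside the date window, together with the average rank of the pages it visited.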

Conclusion

• The advantage of Hadoop is not significant for queries on millions of records
  - all platforms are highly optimized by professionals
  - costs are not taken into consideration

• The overhead of map-reduce cannot be optimized enough to meet real-time response requirements

• Choice of Mahout: plans to abandon its map-reduce based engine and move to a distributed in-memory analytics engine

Map Reduce: an example
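Word counting is a common way to illustrate the map-reduce pattern. The sketch below is a minimal single-process version in plain Python, showing only the map, shuffle, and reduce steps. It is not Hadoop's API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the input sentences are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big questions", "Big models need data"])
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network and disk, which is one source of the overhead noted in the conclusion.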

Thank you!