Frank Kane's Taming Big Data with Apache Spark and Python

What is Spark?

According to Apache, Spark is a fast and general engine for large-scale data processing. That's actually a really good summary of what it's all about. If you have a really massive dataset that can represent anything - weblogs, genomics data, you name it - Spark can slice and dice that data for you. It can distribute the processing among a huge cluster of computers, taking a data analysis problem that's just too big to run on one machine and dividing and conquering it by splitting it up among multiple machines.
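
To make that idea a little more concrete, here is a minimal sketch in PySpark of what "divide and conquer" looks like in practice. It assumes you have PySpark installed and runs on your local machine; the file name weblogs.txt is just a placeholder for whatever large dataset you might have. Spark splits the input into partitions, processes them in parallel, and combines the partial results into one answer:

    # A minimal sketch of distributed processing with PySpark.
    # Assumes PySpark is installed; "weblogs.txt" is a placeholder dataset.
    from pyspark.sql import SparkSession

    # "local[*]" uses every core on this machine; on a real cluster, the
    # same script would simply point at the cluster's master instead.
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("WhatIsSpark") \
        .getOrCreate()

    # Spark splits the file into partitions, counts lines in parallel,
    # and merges the partial counts into a single result.
    lines = spark.sparkContext.textFile("weblogs.txt")
    print(lines.count())

    spark.stop()

The key point is that the script itself doesn't change as the data grows: the same code that counts lines on your laptop can count lines across a whole cluster, because Spark handles the splitting and the combining for you.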