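Recently, the well-known big data expert Bernard Marr analyzed the similarities and differences between Spark and Hadoop in an article.
Distributed storage is the foundation of many of today's big data projects: it can store petabyte-scale datasets on the hard drives of an almost unlimited number of ordinary computers, and it scales well, since drives only need to be added as the dataset grows.
For example, Cloudera offers both Spark and Hadoop services and advises each customer on the option best suited to their needs.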
Spark or Hadoop: Which Is the Best Big Data Framework?
One question I get asked a lot by my clients is: Should we go for Hadoop or Spark as our big data framework? Spark has overtaken Hadoop as the most active open source Big Data project. While they are not directly comparable products, they both have many of the same uses.
To shed some light on the issue of “Spark vs. Hadoop,” I thought an article explaining the essential differences and similarities of each might be useful. As always, I have tried to keep it accessible to anyone, including those without a background in computer science.
Hadoop and Spark are both Big Data frameworks: they provide some of the most popular tools used to carry out common Big Data-related tasks.
Hadoop was, for many years, the leading open source Big Data framework, but recently the newer and more advanced Spark has become the more popular of the two Apache Software Foundation tools.
However, they do not perform exactly the same tasks, and they are not mutually exclusive, as they are able to work together. Although Spark is reported to work up to 100 times faster than Hadoop in certain circumstances, it does not provide its own distributed storage system.
Distributed storage is fundamental to many of today’s Big Data projects, as it allows vast multi-petabyte datasets to be stored across an almost infinite number of everyday computer hard drives, rather than involving hugely costly custom machinery which would hold it all on one device. These systems are scalable, meaning that more drives can be added to the network as the dataset grows in size.
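To make the idea of spreading a dataset across many drives concrete, here is a minimal sketch in plain Python. It is not how HDFS or any real system places data; the function names, the node names, the tiny block size, and the round-robin rule with two replicas are all assumptions chosen for illustration.

```python
from collections import defaultdict


def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replicas=2):
    """Assign each block to `replicas` distinct nodes, round-robin.

    Hypothetical placement rule for illustration only.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = defaultdict(list)
    for block_id in range(num_blocks):
        for r in range(replicas):
            node = nodes[(block_id + r) % len(nodes)]
            placement[node].append(block_id)
    return dict(placement)


# A stand-in "dataset" of 100 bytes, cut into 16-byte blocks.
data = bytes(range(100))
blocks = split_into_blocks(data, block_size=16)

# The same blocks placed on a three-node and then a four-node cluster.
small_cluster = place_blocks(len(blocks), ["node-a", "node-b", "node-c"])
larger_cluster = place_blocks(len(blocks), ["node-a", "node-b", "node-c", "node-d"])

for name, cluster in (("3 nodes", small_cluster), ("4 nodes", larger_cluster)):
    busiest = max(len(block_ids) for block_ids in cluster.values())
    print(name, "-> busiest node holds", busiest, "blocks")
```

Adding the fourth node lowers the number of blocks the busiest machine has to hold, which is the scalability property described above in miniature.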
As I mentioned, Spark does not include its own system for organizing files in a distributed way (the file system), so it requires one provided by a third party. For this reason, many Big Data projects involve installing Spark on top of Hadoop, where Spark’s advanced analytics applications can make use of data stored using the Hadoop Distributed File System (HDFS).
What really gives Spark the edge over Hadoop is speed. Spark handles most of its operations “in memory”, copying the data from distributed physical storage into far faster RAM. This reduces the amount of time-consuming writing and reading to and from slow, clunky mechanical hard drives that needs to be done under Hadoop’s MapReduce system.
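The difference between the two approaches can be sketched in a few lines of plain Python. This is a toy comparison under stated assumptions, not Spark or MapReduce code: the function names are hypothetical, the "dataset" is a short list of numbers, and a temporary directory stands in for the hard drives.

```python
import json
import os
import tempfile


def run_with_disk_round_trips(records, steps):
    """Apply each step, then write the result to disk and read it back."""
    round_trips = 0
    with tempfile.TemporaryDirectory() as tmp:
        for i, step in enumerate(steps):
            records = [step(r) for r in records]
            path = os.path.join(tmp, f"stage_{i}.json")
            with open(path, "w") as f:
                json.dump(records, f)      # persist the intermediate result
            with open(path) as f:
                records = json.load(f)     # the next stage reads it back
            round_trips += 1
    return records, round_trips


def run_in_memory(records, steps):
    """Apply each step and keep the intermediate result in RAM."""
    for step in steps:
        records = [step(r) for r in records]
    return records, 0


# Three illustrative processing steps applied to a tiny dataset.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
records = list(range(5))

disk_result, disk_trips = run_with_disk_round_trips(records, steps)
memory_result, memory_trips = run_in_memory(records, steps)
print(disk_result == memory_result, disk_trips, memory_trips)
```

Both functions produce the same answer; the only difference is that the first one pays for one write and one read per step, and on real hardware with real data volumes that cost dominates the running time.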
MapReduce writes all of the data back to the physical storage medium after each operation. This was originally done to ensure a full recovery could be made in case something goes wrong, as data held electronically in RAM is more volatile than that stored magnetically on disks. However, Spark arranges data in what are known as Resilient Distributed Datasets, which can be recovered following failure.
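One way such recovery can work is for each dataset to remember how it was derived, so that a lost piece can be rebuilt by replaying the same steps. The sketch below illustrates that idea in plain Python. `ToyDataset` and its methods are hypothetical names, not Spark's API, and the toy caches every partition automatically, whereas Spark only keeps data in memory when asked to.

```python
class ToyDataset:
    """A toy partitioned dataset that remembers how it was derived."""

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self._source = source_partitions   # stands in for durable storage
        self._parent = parent              # the dataset this one came from
        self._fn = fn                      # the step applied to the parent
        self._cache = {}                   # partition index -> rows "in RAM"
        self.computations = 0              # how many partitions were computed

    def map(self, fn):
        return ToyDataset(parent=self, fn=fn)

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def partition(self, i):
        if i in self._cache:
            return self._cache[i]
        if self._parent is None:
            rows = list(self._source[i])
        else:
            # Rebuild from the parent by replaying the recorded step.
            rows = [self._fn(x) for x in self._parent.partition(i)]
        self.computations += 1
        self._cache[i] = rows
        return rows

    def lose_partition(self, i):
        """Simulate a machine failure that wipes one in-memory partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [row for i in range(self.num_partitions())
                for row in self.partition(i)]


base = ToyDataset(source_partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
shifted = doubled.map(lambda x: x + 1)

first = shifted.collect()

# A simulated failure wipes partition 1 of both derived datasets.
doubled.lose_partition(1)
shifted.lose_partition(1)

second = shifted.collect()
print(first == second, shifted.computations)
```

After the simulated failure only the missing partition is recomputed; the surviving partition is served from memory, and the final result is unchanged.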
Spark’s functionality for handling advanced data processing tasks such as real-time stream processing and machine learning is way ahead of what is possible with Hadoop alone. This, along with the gain in speed provided by in-memory operations, is the real reason, in my opinion, for its growth in popularity. Real-time processing means that data can be fed into an analytical application the moment it is captured, and insights immediately fed back to the user through a dashboard, to allow action to be taken. This sort of processing is increasingly being used in all sorts of Big Data applications, for example recommendation engines used by retailers, or monitoring the performance of industrial machinery in the manufacturing industry.
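As a rough illustration of processing data "the moment it is captured", the plain-Python sketch below handles machine temperature readings one at a time and raises an alert as soon as a rolling average crosses a limit. It does not use Spark; the `monitor` function, the window of three readings, the threshold of 80.0, and the sample temperatures are all made up for the example.

```python
from collections import deque


def monitor(readings, window=3, threshold=80.0):
    """Yield (reading, rolling_mean, alert) as each reading arrives."""
    recent = deque(maxlen=window)   # only the last `window` readings are kept
    for value in readings:
        recent.append(value)
        mean = sum(recent) / len(recent)
        yield value, mean, mean > threshold


# Illustrative sensor values, consumed one by one as if they were streaming in.
temperatures = [70.0, 75.0, 78.0, 90.0, 95.0, 60.0]

results = []
for value, mean, alert in monitor(temperatures):
    results.append((value, mean, alert))
    if alert:
        print(f"ALERT: reading {value}, rolling mean {mean:.2f}")

alerts = [value for value, mean, alert in results if alert]
```

Because `monitor` is a generator, each result is available before the next reading is consumed, which is the property a dashboard needs in order to show insights immediately.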
Machine learning, which means creating algorithms that can “think” for themselves, allowing them to improve and “learn” through a process of statistical modelling and simulation until an ideal solution to a proposed problem is found, is an area of analytics which is well suited to the Spark platform, thanks to its speed and ability to handle streaming data. This sort of technology lies at the heart of the latest advanced manufacturing systems used in industry, which can predict when parts will go wrong and when to order replacements, and will also lie at the heart of the driverless cars and ships of the near future. Spark includes its own machine learning libraries, called MLlib, whereas Hadoop systems must be interfaced with a third-party machine learning library, for example Apache Mahout.
The reality is, although the existence of the two Big Data frameworks is often pitched as a battle for dominance, that isn’t really the case. There is some crossover of function, but both are non-commercial products so it isn’t really “competition” as such, and the corporate entities which do make money from providing support and installation of these free-to-use systems will often offer both services, allowing the buyer to pick and choose which functionality they require from each framework.
Many of the big vendors (e.g. Cloudera) now offer Spark as well as Hadoop, so will be in a good position to advise companies on which they will find most suitable, on a job-by-job basis. For example, if your Big Data simply consists of a huge amount of very structured data (e.g. customer names and addresses) you may have no need for the advanced streaming analytics and machine learning functionality provided by Spark. This means you would be wasting time, and probably money, having it installed as a separate layer over your Hadoop storage. Spark, although developing very quickly, is still in its infancy, and the security and support infrastructure is not as advanced.
The increasing amount of Spark activity taking place (when compared to Hadoop activity) in the open source community is, in my opinion, a further sign that everyday business users are finding increasingly innovative uses for their stored data. The open source principle is a great thing, in many ways, and one of them is how it enables seemingly similar products to exist alongside each other: vendors can sell both (or rather, provide installation and support services for both, based on what their customers actually need in order to extract maximum value from their data).