Spark vs. Hadoop: Which Is Better?

Published 2020-05-08

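Spark has overtaken Hadoop as the most active open source big data project, but enterprises should not favor one over the other on that basis alone when choosing a big data framework.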

Well-known big data expert Bernard Marr recently analyzed the similarities and differences between Spark and Hadoop in an article.

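Hadoop and Spark are both big data frameworks, and both provide tools for common big data tasks. Strictly speaking, however, they do not perform the same tasks, and they are not mutually exclusive.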

Although Spark is reported to run up to 100 times faster than Hadoop in certain circumstances, it has no distributed storage system of its own.

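Distributed storage underpins many of today's big data projects. It can store petabyte-scale datasets across the hard drives of an almost unlimited number of ordinary computers, and it scales well: drives only need to be added as the dataset grows.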
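To make the scaling idea concrete, here is a minimal sketch of spreading data blocks over ordinary machines. It is a toy model, not HDFS's actual placement policy: the names `place_blocks` and `blocks_per_node`, the node names, the round-robin strategy, and the replication factor of 3 are all assumptions chosen for illustration.

```python
from collections import defaultdict


def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical helper for illustration only; real systems use more
    elaborate placement rules.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive nodes (wrapping around) hold the copies of this block.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def blocks_per_node(placement):
    """Count how many block replicas each node ends up storing."""
    counts = defaultdict(int)
    for replicas in placement.values():
        for node in replicas:
            counts[node] += 1
    return dict(counts)


# Same 12-block dataset on 4 machines, then on 6 machines.
small = blocks_per_node(place_blocks(12, ["n1", "n2", "n3", "n4"]))
large = blocks_per_node(place_blocks(12, ["n1", "n2", "n3", "n4", "n5", "n6"]))
print(small)  # every node stores 9 replicas
print(large)  # every node stores 6 replicas
```

In this sketch, adding two machines lowers the load on every node from 9 replicas to 6 without changing how the dataset is addressed, which is the sense in which such storage scales by adding drives.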

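Spark therefore needs a third-party distributed storage system. For this reason, many big data projects install Spark on top of Hadoop, so that Spark's advanced analytics applications can use data stored in HDFS.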

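Spark's real advantage over Hadoop is speed. Spark performs most of its operations in memory, whereas Hadoop's MapReduce system writes all data back to physical storage after every operation. MapReduce does this so that a full recovery is possible if something goes wrong, but Spark's Resilient Distributed Datasets (RDDs) can also be recovered after a failure.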
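The recovery idea can be illustrated with a small, self-contained sketch. This is not Spark's API: `ToyDataset`, `lose_partition`, and the other names are hypothetical, and the root data kept in a plain list stands in for durable storage such as HDFS. The sketch assumes intermediate results are held in memory and that each dataset remembers the transformation that produced it, so a lost piece can be rebuilt by re-running that transformation on its parent.

```python
class ToyDataset:
    """A partitioned dataset that remembers how it was derived (toy model)."""

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self._source = source_partitions  # set only on the root dataset
        self._parent = parent             # dataset this one was derived from
        self._fn = fn                     # transformation applied to the parent
        self._cache = {}                  # partition index -> in-memory result

    def map(self, fn):
        """Record a transformation; nothing is computed yet."""
        return ToyDataset(parent=self, fn=fn)

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def compute(self, i):
        """Return partition i, rebuilding it from the parent if not cached."""
        if i in self._cache:
            return self._cache[i]
        if self._parent is None:
            result = list(self._source[i])  # read from "durable" storage
        else:
            result = [self._fn(x) for x in self._parent.compute(i)]
        self._cache[i] = result
        return result

    def lose_partition(self, i):
        """Simulate a failure that wipes one in-memory partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


root = ToyDataset(source_partitions=[[1, 2], [3, 4]])
squared = root.map(lambda x: x * x)
print(squared.collect())    # [1, 4, 9, 16]
squared.lose_partition(1)   # the in-memory copy of partition 1 is gone
print(squared.collect())    # [1, 4, 9, 16], partition 1 was recomputed
```

Only the lost partition is recomputed; partition 0 is served from memory both times. Nothing in this toy is written to disk between steps, which is the contrast with the MapReduce behavior described above.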

In addition, Spark outperforms Hadoop in advanced data processing, such as real-time stream processing and machine learning.

In Bernard's view, this, together with its speed advantage, is the real reason for Spark's growing popularity.

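Real-time processing means that data can be fed to an analytical application the moment it is captured, with feedback returned immediately.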

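This kind of processing is used in a growing range of big data applications, for example recommendation engines used by retailers and performance monitoring of industrial machinery in manufacturing.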
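As a rough illustration of the machinery-monitoring case, the following sketch handles each sensor reading as it arrives and returns feedback immediately, rather than waiting for a complete batch. It uses only the Python standard library and is not a Spark Streaming program; the function name `monitor`, the window size, and the alert threshold are hypothetical choices.

```python
from collections import deque


def monitor(readings, window=5, threshold=1.5):
    """Yield (reading, alert) for each reading as soon as it arrives.

    `alert` is True when the reading exceeds `threshold` times the mean of
    the previous `window` readings. No alert is raised until the window
    has filled. Illustrative rule only.
    """
    recent = deque(maxlen=window)  # sliding window of the latest readings
    for value in readings:
        full = len(recent) == window
        alert = full and value > threshold * (sum(recent) / window)
        yield value, alert
        recent.append(value)


# Any iterable works as the stream, including a generator fed by a live source.
stream = [10, 10, 10, 10, 10, 11, 30, 10]
for value, alert in monitor(stream):
    if alert:
        print("unusual reading:", value)  # prints once, for the value 30
```

Because `monitor` is a generator, each result is available before the next reading is consumed, which mirrors the "feedback the moment data is captured" behavior described above.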

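Spark's speed and stream-processing capability also make it well suited to machine learning algorithms, which learn and improve on their own until they find an ideal solution to a problem.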

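This technology is at the core of the most advanced manufacturing systems (for example, predicting when parts will fail) and of driverless cars.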

Spark has its own machine learning library, MLlib, whereas Hadoop systems must rely on a third-party machine learning library such as Apache Mahout.

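In reality, although Spark and Hadoop overlap somewhat in function, neither is a commercial product and they do not truly compete. Companies that make money by supporting these free systems often offer services for both.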

Cloudera, for example, provides both Spark and Hadoop services and recommends whichever best fits a customer's needs.

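Bernard believes that although Spark is developing rapidly, it is still in its infancy, and its security and technical support infrastructure is not yet mature. In his view, the rise in Spark activity in the open source community shows that business users are looking for innovative uses for their stored data.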

The original English text follows:

Spark Or Hadoop -- Which Is The Best Big Data Framework?

——Bernard Marr

One question I get asked a lot by my clients is: Should we go for Hadoop or Spark as our big data framework? Spark has overtaken Hadoop as the most active open source Big Data project. While they are not directly comparable products, they both have many of the same uses.

To shed some light onto the issue of “Spark vs. Hadoop,” I thought an article explaining the essential differences and similarities of each might be useful. As always, I have tried to keep it accessible to anyone, including those without a background in computer science.

Hadoop and Spark are both Big Data frameworks – they provide some of the most popular tools used to carry out common Big Data-related tasks.

Hadoop, for many years, was the leading open source Big Data framework but recently the newer and more advanced Spark has become the more popular of the two Apache Software Foundation tools.

However they do not perform exactly the same tasks, and they are not mutually exclusive, as they are able to work together. Although Spark is reported to work up to 100 times faster than Hadoop in certain circumstances, it does not provide its own distributed storage system.

Distributed storage is fundamental to many of today’s Big Data projects as it allows vast multi-petabyte datasets to be stored across an almost infinite number of everyday computer hard drives, rather than involving hugely costly custom machinery which would hold it all on one device. These systems are scalable, meaning that more drives can be added to the network as the dataset grows in size.

As I mentioned, Spark does not include its own system for organizing files in a distributed way (the file system) so it requires one provided by a third-party. For this reason many Big Data projects involve installing Spark on top of Hadoop, where Spark’s advanced analytics applications can make use of data stored using the Hadoop Distributed File System (HDFS).

What really gives Spark the edge over Hadoop is speed. Spark handles most of its operations “in memory” – copying them from the distributed physical storage into far faster logical RAM memory. This reduces the amount of time consuming writing and reading to and from slow, clunky mechanical hard drives that needs to be done under Hadoop’s MapReduce system.

MapReduce writes all of the data back to the physical storage medium after each operation. This was originally done to ensure a full recovery could be made in case something goes wrong – as data held electronically in RAM is more volatile than that stored magnetically on disks. However Spark arranges data in what are known as Resilient Distributed Datasets, which can be recovered following failure.

Spark’s functionality for handling advanced data processing tasks such as real time stream processing and machine learning is way ahead of what is possible with Hadoop alone. This, along with the gain in speed provided by in-memory operations, is the real reason, in my opinion, for its growth in popularity. Real-time processing means that data can be fed into an analytical application the moment it is captured, and insights immediately fed back to the user through a dashboard, to allow action to be taken. This sort of processing is increasingly being used in all sorts of Big Data applications, for example recommendation engines used by retailers, or monitoring the performance of industrial machinery in the manufacturing industry.

Machine learning – creating algorithms which can “think” for themselves, allowing them to improve and “learn” through a process of statistical modelling and simulation, until an ideal solution to a proposed problem is found – is an area of analytics which is well suited to the Spark platform, thanks to its speed and ability to handle streaming data. This sort of technology lies at the heart of the latest advanced manufacturing systems used in industry which can predict when parts will go wrong and when to order replacements, and will also lie at the heart of the driverless cars and ships of the near future. Spark includes its own machine learning libraries, called MLlib, whereas Hadoop systems must be interfaced with a third-party machine learning library, for example Apache Mahout.

The reality is, although the existence of the two Big Data frameworks is often pitched as a battle for dominance, that isn’t really the case. There is some crossover of function, but both are non-commercial products so it isn’t really “competition” as such, and the corporate entities which do make money from providing support and installation of these free-to-use systems will often offer both services, allowing the buyer to pick and choose which functionality they require from each framework.

Many of the big vendors (e.g. Cloudera) now offer Spark as well as Hadoop, so will be in a good position to advise companies on which they will find most suitable, on a job-by-job basis. For example, if your Big Data simply consists of a huge amount of very structured data (e.g. customer names and addresses) you may have no need for the advanced streaming analytics and machine learning functionality provided by Spark. This means you would be wasting time, and probably money, having it installed as a separate layer over your Hadoop storage. Spark, although developing very quickly, is still in its infancy, and the security and support infrastructure is not as advanced.

The increasing amount of Spark activity taking place (when compared to Hadoop activity) in the open source community is, in my opinion, a further sign that everyday business users are finding increasingly innovative uses for their stored data. The open source principle is a great thing, in many ways, and one of them is how it enables seemingly similar products to exist alongside each other – vendors can sell both (or rather, provide installation and support services for both, based on what their customers actually need in order to extract maximum value from their data).
