
Hadoop Notes

Hadoop Handbook

Intro to Hadoop and MapReduce

Lesson 1 Notes

Introduction

Hi! Welcome to Fundamentals of Hadoop and MapReduce. My name's Sarah
Sproehnle, and I'm the Vice President of Educational Services at Cloudera, a
company which helps develop, support, and manage Hadoop.

And I'm Ian Wrigley, Cloudera's Senior Curriculum Manager. Between us, Sarah
and I have been responsible for bringing Hadoop training to over 20,000 people,
and we're excited to reach a much bigger audience here at Udacity. During this
course we're going to discuss what big data is, what Hadoop is, why it's useful,
and how to write MapReduce code.

By the end of the course, you'll be able to describe the kinds of problems Hadoop addresses,
and you'll have written MapReduce programs to efficiently analyze very large Web server log
files. In fact, you'll have had hands-on experience running a Hadoop job by the end of lesson two.

So, let's start. In this lesson, we're going to define 'big data', the sort of problems it introduces,
and how to address those problems.

Sources of Data

Organizations have been generating data since way back, but as time goes on, more and more
data is being generated. IBM estimates that as much as 90% of the data in the world today
has been created in the last two years alone.

Just as a simple example, think about your cell phone. Whenever it's
turned on, it's connecting to cell towers to get reception. As you move
around, it will connect to different towers, and at different signal
strengths depending on how far away from them you are. All of that
connection data is collected by the phone company, and it's logged.
Copyright 2014 Udacity, Inc. All Rights Reserved.

They can use it to find dead spots in their coverage, to work out which towers are the busiest
and need increased capacity... they can even trace you if you make an emergency call but don't
give your exact location. That's an enormous amount of data right there.

Another example is when you visit a Web site like Amazon or Netflix. Everything you do there
is logged: what pages you viewed, what products you looked at, how long you spent on each
page... even things like what Web browser you were using and what sort of computer you were
connecting from. Again, huge amounts of data.

And that's just in the corporate world. In medicine, for example, each X-ray creates huge
amounts of potentially incredibly valuable information, and comparing large numbers of them can
help us to detect similarities in tumors.

This increase in the amount of data we're generating opens up huge possibilities. But it comes
with problems too. We have to store all that data, and we have to be able to process it in a
sensible amount of time.

Quiz: What is a Big Data problem?


This course is about Hadoop, and how it helps to deal with Big Data. But not everything is
actually a big data problem. There are lots of cases where you can use traditional systems to
store, manage, and process your data. So the first thing you need to do is decide if what you
have really does fall under the heading of big data in the first place. And to make that call, we
have to create some kind of definition for what big data is.

Let's start with a quick question. Which of these would you consider to be big data? You are not
going to be graded on this answer, but give it your best guess.

[] order details for a purchase at a store
[] all orders across hundreds of branches nationwide
[] information about a person's stock portfolio
[] all stock transactions made on the New York Stock Exchange during the year

Answer:
For most people, the answers are going to be 2 and 4. A list of purchases at a single store is
almost certainly small enough to be easily handled by a traditional relational database system
or even just a spreadsheet. Orders from hundreds of stores nationwide, though, could start to
overwhelm traditional systems. Likewise, information about a single person's stock portfolio is a
small and easily managed chunk of data. But data on trades across the entire NYSE for a year
will run into tens or hundreds of terabytes, and that's where traditional systems really do start to
struggle.

Definition of Big Data


There's no one definition for big data; it's a very subjective term. Most people would consider a
data set of terabytes or more to be big data, but there are certainly people using Hadoop with
great success on smaller chunks of data than that. One reasonable definition is that it's data
which can't comfortably be processed on a single machine.

Quiz: Challenges

But Big Data is about more than just the size of the data. What additional problems can you see
in this field?

[] most data is worthless and it's hard to find the useful parts
[] it's hard to gather data
[] data is created very fast
[] data from different sources is in different formats

Answer:
Two real challenges with big data are that it is created very fast, and that it comes from
different sources which may use a variety of formats. In my experience, most data is not worthless
but actually does have a lot of value.

The 3 Vs of Big Data:


When you read or talk about Big Data, you'll often hear people refer to the three Vs. Volume
refers to the size of data that you're dealing with, Variety refers to the fact that the data is often
coming from lots of different sources and in many different formats, and Velocity refers to the
speed at which the data is being generated, and the speed at which it needs to be made
available for processing. So let's look in more detail at each of them.

Volume
The price to store data has dropped incredibly over the last 60 years. In 1980, the cost per
gigabyte was several hundred thousand dollars. In 2013, it's well under 10 cents.


Although it's worth saying that if you actually want to store the data reliably, you're going to end
up paying rather more than that: probably several dollars per gigabyte, maybe even more.

That's particularly the case with more traditional data storage devices such as storage area
networks, or SANs, which can be extremely expensive. The high cost of reliable storage puts a
cap on the amount of data companies can practically store. At some point, they'd say, "OK, it's
too expensive to store all that data that I'm not doing anything with. Let's just store the critical
stuff: my actual sales, for example, rather than all that stuff about how long people spent on
each page of my Web site." But it turns out, as we'll see, that the data they're currently throwing
away can be incredibly useful. What we need is a cheaper way to store it reliably.

And of course storing the data is only one part of the equation: you also need to be able to read
and process it efficiently. Storing a terabyte of data on a SAN isn't so hard, but streaming the
data from the SAN across the network to some central processor can take a long time, and
processing it can be extremely slow.
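
To get a feel for "a long time", here is a rough back-of-the-envelope calculation. The 100 MB/s sustained throughput is purely an assumed figure for illustration, not a number from the lesson, and the 100-machine comparison is a hypothetical scenario in which each machine reads only its own share of the data.

```python
def transfer_hours(data_bytes: float, bytes_per_second: float) -> float:
    """Hours needed to move data_bytes at a sustained bytes_per_second."""
    return data_bytes / bytes_per_second / 3600

TERABYTE = 10**12            # one decimal terabyte, in bytes
ASSUMED_RATE = 100 * 10**6   # assumed 100 MB/s sustained throughput (illustrative only)

# One central processor pulling the whole terabyte across the network.
single_reader = transfer_hours(TERABYTE, ASSUMED_RATE)

# Hypothetical alternative: the same data spread evenly over 100 machines,
# each reading its own share in parallel at the same assumed rate.
hundred_readers = transfer_hours(TERABYTE / 100, ASSUMED_RATE)

print(f"one reader: {single_reader:.2f} hours")
print(f"100 parallel readers: {hundred_readers * 60:.2f} minutes")
```

Under these assumptions, a single reader needs hours just to get the data in front of the processor, while many readers working on local shares need minutes. Real numbers depend entirely on the hardware and network involved.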

Quiz: Volume

Which of the following data do you think is worth storing and analyzing?

[] transactions (financial, government related)
[] logs (records of activity, location)
[] business data (product catalogs, prices, customers)
[] user data (images, documents, video)
[] sensor data (temperature, pollution)
[] medical data (x-rays, brain activity records)
[] social (email, Twitter, etc.)

Answer:
And the answer is that all of these can provide useful information. But in order to store it, you'll
need a way to scale your storage capacity up to massive volume. Hadoop, which stores data in
a distributed way across multiple machines, does that. You'll see just how in the next lesson.

Variety

The second V is data variety. For a long time, people have used databases to store and
process their data: either smaller databases like MySQL, or big data warehouses based on
software from companies like Oracle and IBM. But for a data warehouse to effectively process
information, all that information has to fit nicely into a predefined set of tables. The problem is
that these days, lots of the data you want to store is what we tend to call unstructured data, or
semi-structured data. Sarah can give us some examples.

By unstructured, we mean the data arrives in lots of different formats. For example, a bank
might have a list of your credit card and account transactions, but they may also have scans of
your checks, records of your interactions with customer service representatives on the Web and
over the phone, perhaps even recordings of those phone calls. All of that data is in a variety of
different formats, and it can be hard to store and reconcile it all using traditional systems.


And this also ties back to volume. You want to store that data in its original format so you're not
throwing any information away. That way you can then process the data later in different ways
you might not even have thought of originally.

For instance, if we just transcribe call center conversations into text, we have what people said
to our customer service representatives. But if we have the actual recordings, then later on we
might develop software which can interpret the tone of voice the customer uses, and that might
lead to a very different interpretation of the data. And the nice thing about Hadoop is that it
doesn't care what format the data comes in. Unlike a traditional database, you can just store the
data in its raw format, and manipulate and reformat it later.

Quiz: Data Variety

Sometimes the most unlikely data can be extremely useful and lead to savings due to better
planning. For example, a conventional system for coordinating logistics might send the
truck closest to the warehouse to pick up the package. However, it might be that the closest
truck is not the best solution: perhaps there are traffic jams, or the most direct route is on small
roads that would take longer to drive. Maybe the truck doesn't have enough free space for the
new load. So what kind of data would be helpful in making a better plan that could save money
and time for the company?

[] Current GPS location from all trucks
[] Current itineraries for all trucks
[] Current traffic speed in related areas as reported by services such as Waze
[] Current load of trucks by volume and weight
[] Fuel efficiency of the different vehicles

Answer:
And again, all of these answers are correct. You can save a lot of money, and time, by making
better decisions, driven by more varied data. The world we live in is extremely complex, and
there are a lot of variables to consider that you can tweak to get large benefits.

Velocity


Velocity, the third V, is about the speed at which the data arrives, ready to be processed. We
need to be able to accept and store that data even when it's coming in at a rate of terabytes or
more a day, which is often the case. If we can't store it as it arrives, we'll end up discarding
some of it, and that's what we absolutely want to avoid.
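
As a quick sanity check on what "terabytes a day" means for a storage system, the sketch below converts a daily volume into the write rate that has to be sustained around the clock. The daily volumes used are illustrative assumptions, not figures from the lesson.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def sustained_rate_mb_per_s(terabytes_per_day: float) -> float:
    """Average write rate in MB/s needed to keep up with a daily data volume."""
    bytes_per_day = terabytes_per_day * 10**12   # decimal terabytes
    return bytes_per_day / SECONDS_PER_DAY / 10**6

# Assumed example volumes, chosen only to show the scale involved.
for tb in (1, 10, 100):
    print(f"{tb} TB/day needs about {sustained_rate_mb_per_s(tb):.1f} MB/s, nonstop")
```

Because this is an average, any bursts in arrival rate push the peak requirement higher still.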

What problems can we solve?

Think about an e-commerce Web site. If we know what products you've looked at in the past, we
could recommend similar products the next time you visit our site. If you spent five minutes
looking at a particular item, we could maybe send you an email informing you when that item is
on sale. If we know that you typically browse our site using a first-generation iPad, we could
suggest the latest model.

This is very different from what we could do before, when we only stored records of actual
purchases. If we can store and process all of our Web server log files, along with the purchase
data that's in our traditional data warehouse, we can give the customer a much better shopping
experience, which should directly translate into bigger profits.

Yet another example is a movie site like Netflix. Based on what they know about your viewing
habits, they can recommend movies to you. As you can see here, because of what Ian's
rated highly before, the movie on the left is recommended for him, and they can even predict
what rating he'll give the movie.

History of solving data problems

So there are plenty of things we can do with big data. But first we have to solve a couple of
problems. We need to be able to store the data in a cost-effective way, and we need to be able
to process it efficiently. And it turns out that these are not easy problems to solve when we're
talking about massive amounts of data. Fortunately, though, some extremely smart people at
Google were working on them in the late 1990s and released the results of their work as
research papers in 2003 and 2004. Let's see what Doug Cutting, one of the founders of Hadoop,
has to say.

DOUG CUTTING about History of Hadoop:

So, let me tell you how Hadoop came to be. About ten years ago, in around
2003, I was working on an Open Source web search engine called Nutch, and
we knew it needed to be something very scalable, because the Web was, you
know, billions of pages, terabytes, petabytes of data that we needed to be able
to process, and we set about doing the best job we could, and it was tough. We
got things up and running on four or five machines, not very well, and around
that time Google published some papers about how they were doing things internally.
They published a paper about their distributed file system, GFS, and one about their
processing framework, MapReduce. So my partner in this project at the time, Mike
Cafarella, and I set about trying to reimplement these in Open Source, so that more
people could use them than just folks at Google. It took us a couple of years, and we had
Nutch up and running on, instead of four or five machines, 20 to 40 machines. It wasn't
perfect, it wasn't totally reliable, but it worked. And we realized that to get it to the point
where it scaled to thousands of machines, and was as bulletproof as it needed to be,
would take more than just the two of us, working part time.

Around that time, Yahoo approached me and said they were interested in investing in
this. So I went to work for Yahoo in January of 2006. The first thing we did there was
take the parts of Nutch that were a distributed computing platform and put them into a
separate project, a new project christened Hadoop. Over the next couple of years, with
Yahoo's help, and the help of others, we took Hadoop and really got it to the point where
it did scale to petabytes, running on thousands of processors, and doing so quite
reliably.

It spread to lots of companies, mostly in the Internet sector, and became quite a
success. After that, we started to see a bunch of other projects grow up around it.
And Hadoop's grown to be the kernel of what is pretty much an operating system for big
data. We've got tools that allow you to more easily do MapReduce programming, so
you can develop using SQL or a data flow language called Pig. And
we've also got the beginnings of higher-level tools. We've got interactive SQL with
Impala. We've got Search. And so we're really seeing this develop into a general-purpose
platform for data processing that scales much better, and that is much more
flexible, than anything else that's out there.

That's the story of the genesis of Hadoop: it's based on work done by the folks at Google, and it's
grown from small beginnings to the point now where hundreds of people contribute to the
project, and where it's being used by thousands and thousands of companies worldwide. The
Hadoop logo is actually a little yellow elephant, but do you know where the name came from?
There's a funny story attached to that. Here's Doug again.

DOUG about Name of Hadoop

So the name Hadoop comes from my son's toy elephant. When he was about
two, a friend gave him a little stuffed elephant which he played with
incessantly. And we overheard him calling it something, this strange word that
he invented, and he said Hadoop. So I immediately wrote it down, because I was
in the software business, and we're always looking for good names. And this
one came with a mascot, even. And a few years later, when I needed a project
name, I pulled it out. Now, I wrote it down as HADOOP, and figured that everyone
would say Hadoop. Now it turns out everyone says Hadoop instead, but I persist in saying
Hadoop. Now my son, of course, is 13, and expects royalties for the name. He wants
more credit. He also accuses me of stealing the toy. At some point, he was using it in
some kind of rocket ship experiment, and I had to rescue it. And now it lives in my sock
drawer, for safety.

Hadoop Cluster
The core Hadoop project consists of a way to store data, known as the Hadoop
Distributed File System, or HDFS, and a way to process the data, called
MapReduce. The key concept is that we split the data up and store it across a
collection of machines, known as a cluster. Then, when we want to process the data,
we process it where it's actually stored. Rather than retrieving the data from a
central server, it's already on the cluster, and we can process it in place. You can add
more machines to the cluster (make the cluster bigger) as the amount of data you're
storing grows and, indeed, many people start with just a few machines and add more as
they're needed. The machines in the cluster don't need to be particularly high-end:
although most clusters are built using rack-mount servers, they are typically mid-range
servers rather than top-of-the-range equipment.
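
The idea of splitting data across a cluster and processing it in place can be sketched in a few lines of plain Python. This is only a toy model: the node names, the block size, and the round-robin placement are all assumptions made for illustration, and none of it reflects how HDFS actually places or replicates blocks.

```python
from collections import defaultdict

def split_into_blocks(records, block_size):
    """Cut the record list into consecutive blocks of at most block_size records."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]

def place_blocks(blocks, node_names):
    """Assign blocks to nodes round-robin (a simplification, not HDFS's real policy)."""
    cluster = defaultdict(list)
    for index, block in enumerate(blocks):
        cluster[node_names[index % len(node_names)]].append(block)
    return cluster

def process_in_place(cluster, work):
    """Run work() on each node's own blocks and return one partial result per node."""
    return {node: sum(work(block) for block in blocks)
            for node, blocks in cluster.items()}

# Ten made-up log lines, stored three to a block on a hypothetical three-node cluster.
records = [f"log line {n}" for n in range(10)]
cluster = place_blocks(split_into_blocks(records, block_size=3),
                       ["node1", "node2", "node3"])

# Each node counts only the records it holds; only the small partial
# counts are brought together, never the raw data itself.
partial_counts = process_in_place(cluster, work=len)
total = sum(partial_counts.values())
print(partial_counts, total)
```

Adding a fourth name to the node list is all it takes to "grow" this toy cluster, which mirrors the point above about adding machines as the data grows.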

Hadoop Ecosystem

Core Hadoop consists of HDFS and MapReduce.


But since the project was first started, an awful lot of other software has grown up around it. And
that's what we call the Hadoop Ecosystem. Some of the software is intended to make it easy to
load data into the Hadoop cluster, while lots of it is designed to make Hadoop easier to use. For
example, as you'll see in the next lesson, writing MapReduce code isn't completely simple. You
need to know a programming language like Java, or Python, or Ruby, or Perl. But there are lots
of folks out there who aren't programmers but who can write SQL queries to access data in a
traditional relational database like SQL Server. And of course a lot of business intelligence tools
also want to hook into Hadoop.

For that reason, other open source projects have been created to make it easier for people to
query their data without knowing how to code. Two key ones are Hive and Pig. Instead of having
to write Mappers and Reducers, in Hive you just write statements which look very much like
standard SQL. The Hive interpreter turns that SQL into MapReduce code, which it then runs on
the cluster. An alternative is Pig, which allows you to write code to analyze your data in a fairly
simple scripting language rather than MapReduce; again, the code is turned into actual Java
MapReduce and run on the cluster.
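
To make the relationship between a SQL-style query and MapReduce a little more concrete, consider a query along the lines of SELECT page, COUNT(*) FROM hits GROUP BY page. The sketch below shows, in plain Python, the kind of mapper and reducer logic such a query corresponds to. The hits table, the two-field log format, and the run_job helper are all hypothetical, and this is not the code Hive actually generates.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit (page, 1) for one log line of the assumed form '<ip> <page>'."""
    _ip, page = line.split()
    yield page, 1

def reducer(page, counts):
    """Sum all the counts seen for one page."""
    return page, sum(counts)

def run_job(lines):
    """Tiny single-process stand-in for a MapReduce job."""
    pairs = [pair for line in lines for pair in mapper(line)]
    pairs.sort(key=itemgetter(0))   # stand-in for the shuffle-and-sort between phases
    return [reducer(page, (count for _, count in group))
            for page, group in groupby(pairs, key=itemgetter(0))]

# Made-up sample data in the assumed '<ip> <page>' format.
hits = ["10.0.0.1 /home", "10.0.0.2 /cart", "10.0.0.1 /home", "10.0.0.3 /home"]
page_counts = run_job(hits)
print(page_counts)
```

The GROUP BY column becomes the key the mapper emits, and the aggregate (here COUNT) becomes the work done in the reducer. Writing one line of SQL is clearly less effort than writing and wiring up the functions by hand.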

Hive and Pig are great, but they're still running MapReduce jobs, which means they will take a
reasonable amount of time, especially when running on really large amounts of data. So another
open source project called Impala was developed, which again allows you to query your data
using SQL, but which directly accesses that data rather than accessing it via MapReduce.
Impala is optimized for low-latency queries (in other words, Impala queries run very quickly,
typically many times faster than Hive queries), while Hive is optimized for long-running batch
processing jobs.

Another project used by many people is Sqoop. That takes data from a traditional relational
database server such as Microsoft SQL Server and puts it in HDFS as delimited files so it can
be processed along with the other data on the cluster. Then there's Flume, which ingests data
as it's generated by external systems. HBase is a real-time database built on top of HDFS. Hue
is a graphical front end to the cluster. Oozie is a workflow management tool. Mahout is a
machine learning library.


In fact, there are so many different ecosystem projects that making them all talk to each other,
and work well with each other, can be tricky. To make installing and maintaining a cluster easier,
Cloudera, the company we work for, has put together a distribution of Hadoop called CDH. This
takes all the key ecosystem projects, along with Hadoop itself, and packages them together so
that installation is a really simple process. And the components are all tested together, so you
can be sure that there are no incompatibilities between them. Of course it's completely free and
open source, just like Hadoop itself. You could install everything from scratch yourself, but it's far
easier to use CDH, and that's certainly what we'd recommend. In the next lesson, in fact, you'll
be downloading and running a virtual machine which has CDH installed.

Conclusion
So in this lesson you learned what big data is, and how Hadoop can help with big data
problems. In the next lesson, we'll take a deeper look at the two key parts of Hadoop: that's
HDFS, the Hadoop Distributed File System, and MapReduce, the way you can process that
data.
