WORKING WITH BIG DATA
BY: GURUABIRAMI.D, M.SC IT, DEPARTMENT OF CS & IT, NADAR SARASWATHI COLLEGE OF ARTS AND SCIENCE, THENI.
STRATEGY 1: SAMPLE AND MODEL
• IN THIS STRATEGY, THE DATA IS DOWNSAMPLED TO A SIZE THAT CAN BE HANDLED IN ITS ENTIRETY, AND A MODEL IS CREATED ON THE SAMPLE. DOWNSAMPLING TO THOUSANDS, OR EVEN HUNDREDS OF THOUSANDS, OF DATA POINTS CAN MAKE MODEL RUNTIMES FEASIBLE WHILE ALSO MAINTAINING STATISTICAL VALIDITY.
• IF MAINTAINING CLASS BALANCE IS NECESSARY (OR ONE CLASS NEEDS TO BE OVER/UNDER-SAMPLED), IT’S REASONABLY SIMPLE TO STRATIFY THE DATA SET DURING SAMPLING.
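A minimal sketch of stratified downsampling, written in Python for illustration even though the slides refer to R. The function name `stratified_sample`, the 1% fraction, and the two-class toy data are all assumptions made for this example; they are not from the slides. The idea is to draw the same fraction from every class so that the sample keeps the class balance of the full data set.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, fraction, seed=0):
    """Draw the same fraction from every class so class balance is preserved."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    sample = []
    for members in by_class.values():
        # Keep at least one row per class, even for very rare classes.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 90,000 rows of class "a" and 10,000 rows of class "b".
rows = [("a", i) for i in range(90_000)] + [("b", i) for i in range(10_000)]
sample = stratified_sample(rows, label_of=lambda r: r[0], fraction=0.01)

# Count how many sampled rows fall in each class.
counts = defaultdict(int)
for label, _ in sample:
    counts[label] += 1
```

To over- or under-sample one class instead, a per-class fraction could be passed in place of the single `fraction` value.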
ADVANTAGES
• SPEED: RELATIVE TO WORKING ON YOUR ENTIRE DATA SET, WORKING ON JUST A SAMPLE CAN DRASTICALLY DECREASE RUN TIMES AND INCREASE ITERATION SPEED.
• PROTOTYPING: EVEN IF YOU’LL EVENTUALLY HAVE TO RUN YOUR MODEL ON THE ENTIRE DATA SET, THIS CAN BE A GOOD WAY TO REFINE HYPERPARAMETERS AND DO FEATURE ENGINEERING FOR YOUR MODEL.
• PACKAGES: SINCE YOU’RE WORKING ON A NORMAL IN-MEMORY DATA SET, YOU CAN USE ALL YOUR R PACKAGES.
DISADVANTAGES
• SAMPLING: DOWNSAMPLING ISN’T TERRIBLY DIFFICULT, BUT DOES NEED TO BE DONE WITH CARE TO ENSURE THAT THE SAMPLE IS VALID AND THAT YOU’VE PULLED ENOUGH POINTS FROM THE ORIGINAL DATA SET.
• SCALING: IF YOU’RE USING SAMPLE AND MODEL TO PROTOTYPE SOMETHING THAT WILL LATER BE RUN ON THE FULL DATA SET, YOU’LL NEED TO HAVE A STRATEGY (SUCH AS PUSHING COMPUTE TO THE DATA) FOR SCALING YOUR PROTOTYPE VERSION BACK TO THE FULL DATA SET.
• TOTALS: BUSINESS INTELLIGENCE (BI) TASKS FREQUENTLY ANSWER QUESTIONS ABOUT TOTALS, LIKE THE COUNT OF ALL SALES IN A MONTH. ONE OF THE OTHER STRATEGIES IS USUALLY A BETTER FIT IN THIS CASE.
STRATEGY 2: CHUNK AND PULL
• IN THIS STRATEGY, THE DATA IS CHUNKED INTO SEPARABLE UNITS AND EACH CHUNK IS PULLED SEPARATELY AND OPERATED ON SERIALLY, IN PARALLEL, OR AFTER RECOMBINING. THIS STRATEGY IS CONCEPTUALLY SIMILAR TO THE MAPREDUCE ALGORITHM.
• DEPENDING ON THE TASK AT HAND, THE CHUNKS MIGHT BE TIME PERIODS, GEOGRAPHIC UNITS, OR LOGICAL UNITS SUCH AS SEPARATE BUSINESSES, DEPARTMENTS, PRODUCTS, OR CUSTOMER SEGMENTS.
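A minimal sketch of chunk and pull, using Python's built-in `sqlite3` module as a stand-in for whatever database holds the data (the slides assume R). The `sales` table, its columns, and the helper `pull_chunk` are hypothetical names invented for this example. Each month is treated as one chunk, pulled on its own, and processed serially.

```python
import sqlite3

# In-memory database standing in for a real data store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2024-01", 10.0),
        ("2024-01", 20.0),
        ("2024-02", 5.0),
        ("2024-03", 7.5),
        ("2024-03", 2.5),
    ],
)


def pull_chunk(conn, month):
    """Pull only the rows belonging to one chunk (here, one month)."""
    cur = conn.execute("SELECT amount FROM sales WHERE month = ?", (month,))
    return [amount for (amount,) in cur]


# Discover the chunk keys, then pull and process one chunk at a time.
months = [m for (m,) in conn.execute("SELECT DISTINCT month FROM sales ORDER BY month")]
chunk_means = {}
for month in months:
    amounts = pull_chunk(conn, month)  # only this chunk is held in memory
    chunk_means[month] = sum(amounts) / len(amounts)
```

Only one chunk is in memory at any moment, which is what keeps this approach workable when the full table would not fit.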
ADVANTAGES
• FULL DATA SET: THE ENTIRE DATA SET GETS USED.
• PARALLELIZATION: IF THE CHUNKS ARE RUN SEPARATELY, THE PROBLEM IS EASY TO TREAT AS EMBARRASSINGLY PARALLEL AND MAKE USE OF PARALLELIZATION TO SPEED UP RUNTIMES.
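The parallelization point can be sketched as follows, again in Python for illustration. The chunk names, values, and the `summarise` helper are hypothetical. Because each chunk is summarized independently of the others, the work can be handed to a pool of workers and the per-chunk results combined afterwards. A thread pool is used here to keep the sketch simple; a process pool may suit CPU-bound work better.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical chunks that have already been pulled, keyed by region.
chunks = {"north": [3, 5, 7], "south": [10, 20], "west": [1, 1, 1, 1]}


def summarise(item):
    """Summarize one chunk; needs nothing from any other chunk."""
    name, values = item
    return name, {"n": len(values), "total": sum(values)}


# Each chunk is an independent task, so the tasks can run side by side.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_chunk = dict(pool.map(summarise, chunks.items()))

# Combine the per-chunk results into one overall figure.
grand_total = sum(s["total"] for s in per_chunk.values())
```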
DISADVANTAGES
• NEED CHUNKS: YOUR DATA NEEDS TO HAVE SEPARABLE CHUNKS FOR CHUNK AND PULL TO BE APPROPRIATE.
• PULL ALL DATA: YOU EVENTUALLY HAVE TO PULL IN ALL THE DATA, WHICH MAY STILL BE VERY TIME- AND MEMORY-INTENSIVE.
• STALE DATA: THE DATA MAY REQUIRE PERIODIC REFRESHES FROM THE DATABASE TO STAY UP-TO-DATE, SINCE YOU’RE SAVING A VERSION ON YOUR LOCAL MACHINE.
STRATEGY 3: PUSH COMPUTE TO DATA
• IN THIS STRATEGY, THE DATA IS COMPRESSED (SUMMARIZED OR FILTERED) IN THE DATABASE, AND ONLY THE COMPRESSED DATA SET IS MOVED OUT OF THE DATABASE INTO R. IT IS OFTEN POSSIBLE TO OBTAIN SIGNIFICANT SPEEDUPS SIMPLY BY DOING SUMMARIZATION OR FILTERING IN THE DATABASE BEFORE PULLING THE DATA INTO R.
• SOMETIMES, MORE COMPLEX OPERATIONS ARE ALSO POSSIBLE, INCLUDING COMPUTING HISTOGRAMS AND RASTER MAPS WITH DBPLOT, BUILDING A MODEL WITH MODELDB, AND GENERATING PREDICTIONS FROM MACHINE LEARNING MODELS WITH TIDYPREDICT.
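A minimal sketch of pushing compute to the data, using Python's built-in `sqlite3` module in place of a production database and of R. The `sales` table and its contents are hypothetical. The filtering and summarization run inside the database, so only one small summary row per region crosses over to the client, not the raw rows.

```python
import sqlite3

# In-memory database standing in for a real data store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("east", 10.0),
        ("east", 15.0),
        ("west", 4.0),
        ("west", -1.0),  # a refund, filtered out by the query below
        ("west", 6.0),
    ],
)

# The WHERE filter and the GROUP BY summary are executed by the database;
# the client receives only the aggregated rows.
summary = conn.execute(
    """
    SELECT region, COUNT(*) AS n, SUM(amount) AS total
    FROM sales
    WHERE amount > 0
    GROUP BY region
    ORDER BY region
    """
).fetchall()
```

Five raw rows are reduced to two summary rows here; on a large table the same query shape is what cuts transfer time.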
ADVANTAGES
• USE THE DATABASE: TAKES ADVANTAGE OF WHAT DATABASES ARE OFTEN BEST AT: QUICKLY SUMMARIZING AND FILTERING DATA BASED ON A QUERY.
• MORE INFO, LESS TRANSFER: BY COMPRESSING BEFORE PULLING DATA BACK TO R, THE ENTIRE DATA SET GETS USED, BUT TRANSFER TIMES ARE FAR SHORTER THAN WHEN MOVING THE ENTIRE DATA SET.
DISADVANTAGES
• DATABASE OPERATIONS: DEPENDING ON WHAT DATABASE YOU’RE USING, SOME OPERATIONS MIGHT NOT BE SUPPORTED.
• DATABASE SPEED: IN SOME CONTEXTS, THE LIMITING FACTOR FOR DATA ANALYSIS IS THE SPEED OF THE DATABASE ITSELF, AND SO PUSHING MORE WORK ONTO THE DATABASE IS THE LAST THING ANALYSTS WANT TO DO.
THANK YOU …

Big Data Analytics
