Unit-1
Big data • Big data is a collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques such as the RDBMS. • Data science involves using methods to analyze massive amounts of data and extract the knowledge they contain. • Big data is to data science what crude oil is to an oil refinery. • Both data science and big data evolved from statistics and traditional data management.
• The characteristics of big data are often referred to as the three Vs: ■ Volume—How much data is there? ■ Variety—How diverse are the different types of data? ■ Velocity—At what speed is new data generated?
Facets of data • The main categories of data are these: • Structured • Unstructured • Natural language • Machine-generated • Graph-based • Audio, video, and images • Streaming
Structured data • Structured data depends on a data model and resides in a fixed field within a record. • SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases.
Unstructured data • Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying. • One example: a regular email. • Natural language is a special type of unstructured data; processing it requires knowledge of specific data science techniques and linguistics. • Techniques include entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don’t generalize well to other domains.
• Machine-generated data • Machine-generated data is automatically created by a computer, process, application, or other machine without human intervention.
• Graph-based or network data • A graph is a mathematical structure that models pair-wise relationships between objects. • Graph or network data focuses on the relationship or adjacency of objects. • Graph structures use nodes, edges, and properties to represent and store graph data. • A natural fit is social networks: the graph structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
• E.g.: your network on LinkedIn • your follower list on Twitter • your “friends” on Facebook
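To make the graph idea concrete, here is a minimal sketch using the networkx library (the library choice and the sample names are assumptions, not part of these notes):

    import networkx as nx

    # Build a tiny "friends" graph; the people are hypothetical.
    g = nx.Graph()
    g.add_edges_from([("Ann", "Bob"), ("Bob", "Carol"),
                      ("Ann", "Dave"), ("Dave", "Carol")])

    # Shortest path between two people (fewest hops).
    print(nx.shortest_path(g, "Ann", "Carol"))  # e.g. ['Ann', 'Bob', 'Carol']

    # A simple influence metric: degree centrality.
    print(nx.degree_centrality(g))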
• Audio, image, and video: data types that pose specific challenges for a data scientist. • Streaming data: data that flows into the system when an event happens, instead of being loaded into a data store in a batch.
The data science process
Step 1: Setting the research goal • Defining research goals and creating a project charter means understanding the what, the why, and the how of your project. • What does the company expect you to do? • Why does management place such a value on your research? • Is it part of a bigger strategic picture or a “lone wolf” project originating from an opportunity someone detected? • This information is then best placed in a project charter.
• Spend time understanding the goals and context of your research. • Create a project charter. • A project charter requires teamwork, and your input covers at least the following: • A clear research goal • The project mission and context • How you’re going to perform your analysis • What resources you expect to use • Proof that it’s an achievable project, or proofs of concept • Deliverables and a measure of success • A timeline
Step 2: Retrieving data
• Start with data stored within the company: databases, data marts, data warehouses, and data lakes. • The primary goal of a database is data storage, while a data warehouse is designed for reading and analyzing that data. • A data mart is a subset of the data warehouse geared toward serving a specific business unit. • While data warehouses and data marts are home to preprocessed data, a data lake contains data in its natural or raw format.
• Don’t be afraid to shop around: useful data may also be available outside your organization. • Do data quality checks now to prevent problems later.
Step 3: Cleansing, integrating, and transforming data
Cleansing data • Data cleansing is a subprocess of the data science process that focuses on removing errors in your data, so the data becomes a true and consistent representation of the processes it originates from. • Two types of errors: • Interpretation errors, such as a person’s age being greater than 300 years. • Inconsistencies between data sources or against your company’s standardized values.
• DATA ENTRY ERRORS • Humans are only human: they make typos or lose their concentration for a second and introduce an error into the chain. • Errors originating from machines are transmission errors or bugs in the extract, transform, and load (ETL) phase.
• Most errors of this type are easy to fix with simple assignment statements and if-then-else rules: • if x == "Godo": x = "Good" • if x == "Bade": x = "Bad" • REDUNDANT WHITESPACE • Whitespace causes mismatches of keys, such as "FR " versus "FR". • For instance, in Python you can use the strip() function to remove leading and trailing spaces.
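A minimal sketch of both fixes in Python (the typo values and the helper name clean_value are hypothetical):

    # Map known typos to their corrected values; unknown values pass through.
    corrections = {"Godo": "Good", "Bade": "Bad"}

    def clean_value(x):
        x = x.strip()                 # drop redundant leading/trailing whitespace
        return corrections.get(x, x)  # fix known data entry errors

    print(clean_value("FR "))   # 'FR'
    print(clean_value("Godo"))  # 'Good'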
FIXING CAPITAL LETTER MISMATCHES • Fix these by applying a function that returns both strings in lowercase, such as .lower() in Python. • "Brazil".lower() == "brazil".lower() should result in True. • IMPOSSIBLE VALUES AND SANITY CHECKS • check = 0 <= age <= 120
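A short sketch of both checks (the sample values are hypothetical; the 0–120 bounds follow the note above):

    # Case-insensitive comparison via .lower().
    print("Brazil".lower() == "brazil".lower())  # True

    # Sanity check: flag impossible values instead of silently keeping them.
    for age in [25, 300, 41]:
        check = 0 <= age <= 120
        if not check:
            print(f"Impossible age detected: {age}")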
OUTLIERS • An outlier is an observation that seems to be distant from other observations, or, more specifically, one observation that follows a different logic or generative process than the other observations.
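One conventional way to flag such observations is Tukey’s fences; this sketch (the 1.5 × IQR threshold and the sample data are assumptions, not from these notes) marks points far outside the middle 50% of the data:

    import numpy as np

    data = np.array([10.2, 9.8, 10.5, 9.9, 58.0, 10.1])  # hypothetical measurements

    # Flag points more than 1.5 * IQR outside the interquartile range.
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
    print(outliers)  # [58.]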
DEALING WITH MISSING VALUES • Common techniques include omitting the observations, setting the value to null, imputing a static value such as 0 or the mean, and modeling the missing value from the other variables; each trades lost information against added assumptions.
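A minimal pandas sketch of two of these options (column names and values are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 41], "city": ["Paris", "Lyon", None]})

    dropped = df.dropna()                          # omit rows with missing values
    imputed = df.fillna({"age": df["age"].mean(),  # impute the column mean
                         "city": "Unknown"})       # impute a static value
    print(imputed)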
• DEVIATIONS FROM A CODE BOOK • A code book is a description of your data, a form of metadata. • It contains things such as the number of variables per observation, the number of observations, and what each encoding within a variable means. • (For instance, “0” equals “negative” and “5” stands for “very positive”.) • DIFFERENT UNITS OF MEASUREMENT • DIFFERENT LEVELS OF AGGREGATION • Correct errors as early as possible.
Combining data from different data sources • THE DIFFERENT WAYS OF COMBINING DATA • JOINING TABLES: enrich the information of one observation in one table with the information found in another table, matched on a common key.
• APPENDING TABLES: add the observations of one table to those of another table with the same structure (see the sketch below).
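A minimal pandas sketch of both operations (the table contents and the key name client_id are hypothetical):

    import pandas as pd

    clients = pd.DataFrame({"client_id": [1, 2], "name": ["Ann", "Bob"]})
    orders = pd.DataFrame({"client_id": [1, 2], "amount": [100, 250]})
    more_clients = pd.DataFrame({"client_id": [3], "name": ["Carol"]})

    joined = clients.merge(orders, on="client_id")  # joining tables on a key
    appended = pd.concat([clients, more_clients])   # appending rows
    print(joined)
    print(appended)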
USING VIEWS TO SIMULATE DATA JOINS AND APPENDS • A view combines the underlying tables virtually at query time, so the joined or appended data doesn’t have to be duplicated on disk.
Transforming data • Transform data so it takes a suitable form for data modeling. • For example, a relationship of the form y = ae^(bx) becomes linear when you take the logarithm: ln(y) = ln(a) + bx.
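A quick numerical illustration of this transform (the values of a and b are hypothetical):

    import numpy as np

    a, b = 2.0, 0.5
    x = np.arange(1, 6)
    y = a * np.exp(b * x)  # the nonlinear relationship y = a * e^(bx)

    # After the log transform the relationship is linear: ln(y) = ln(a) + b*x.
    slope, intercept = np.polyfit(x, np.log(y), deg=1)
    print(slope, intercept)  # ~0.5 and ~0.693, i.e. b and ln(a)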
REDUCING THE NUMBER OF VARIABLES • Sometimes you have too many variables and need to reduce the number, because extra variables don’t add new information to the model. • The Euclidean distance between two points (x1, y1) and (x2, y2) in a two-dimensional plane is √((x1 − x2)² + (y1 − y2)²).
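The distance is straightforward to compute; a minimal sketch (the points are hypothetical):

    import math

    # Euclidean distance between two points in a two-dimensional plane.
    def euclidean(p, q):
        return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

    print(euclidean((0, 0), (3, 4)))  # 5.0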
TURNING VARIABLES INTO DUMMIES • Dummy variables can only take two values: true (1) or false (0). • They indicate the presence or absence of a categorical effect that may explain the observation, so a categorical variable is split into one column per category.
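A minimal pandas sketch (the column name and categories are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"weekday": ["Mon", "Tue", "Mon"]})

    # One column per category, holding 1 (true) or 0 (false).
    dummies = pd.get_dummies(df["weekday"], dtype=int)
    print(dummies)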
Step 4: Exploratory data analysis • Use graphical techniques to gain an understanding of your data and the interactions between variables.
Simple graph
Combined graphs • Combine simple graphs into a Pareto diagram, or 80-20 diagram: a sorted bar chart overlaid with a line showing the cumulative percentage.
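A minimal matplotlib sketch of a Pareto diagram (the categories and counts are hypothetical):

    import matplotlib.pyplot as plt

    categories = ["A", "B", "C", "D"]
    counts = [45, 30, 15, 10]  # already sorted in descending order
    total = sum(counts)
    cumulative = [sum(counts[:i + 1]) / total * 100 for i in range(len(counts))]

    fig, ax = plt.subplots()
    ax.bar(categories, counts)  # sorted bar chart of the raw counts
    ax2 = ax.twinx()            # second y-axis for the cumulative line
    ax2.plot(categories, cumulative, marker="o", color="tab:red")
    ax2.set_ylabel("Cumulative %")
    plt.show()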
Brushing and linking • With brushing and linking you combine and link different graphs and tables (or views) so changes in one graph are automatically transferred to the other graphs.
• In a histogram a variable is cut into discrete categories and the number of occurrences in each category is summed up.
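A minimal sketch (the simulated data and bin count are assumptions):

    import matplotlib.pyplot as plt
    import numpy as np

    values = np.random.normal(loc=50, scale=10, size=1000)  # hypothetical variable

    # Cut the variable into 20 bins and count the occurrences per bin.
    plt.hist(values, bins=20)
    plt.xlabel("Value")
    plt.ylabel("Occurrences")
    plt.show()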
Step 5: Build the models
• Build models with the goal of making better predictions, classifying objects, or gaining an understanding of the system.
• Building a model is an iterative process. • Most models consist of the following main steps: • 1. Selection of a modeling technique and variables to enter into the model • 2. Execution of the model • 3. Diagnosis and model comparison
Model and variable selection • Select the variables you want to include in your model and a modeling technique. • Must the model be moved to a production environment and, if so, would it be easy to implement? • How difficult is the maintenance on the model: how long will it remain relevant if left untouched? • Does the model need to be easy to explain?
Model execution • Once you’ve chosen a model, you need to implement it in code. • Most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn.
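As a minimal sketch, fitting a linear regression with Scikit-learn (the toy data is hypothetical):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1], [2], [3], [4]])  # predictor variable
    y = np.array([3, 5, 7, 9])          # target generated by y = 2x + 1

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)  # ~[2.] 1.0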
Model diagnostics and model comparison • You’ll be building multiple models from which you then choose the best one based on multiple criteria. • A holdout sample is a part of the data you leave out of the model building so it can be used to evaluate the model afterward: you use a fraction of the data to estimate the model, and the other part, the holdout sample, is kept out of the equation.
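A minimal sketch of evaluating on a holdout sample with Scikit-learn (the 25% split and the toy data are assumptions):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(-1, 1)          # hypothetical predictor
    y = 2 * X.ravel() + np.random.randn(20)   # hypothetical noisy target

    # Keep 25% of the data out of model building as the holdout sample.
    X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.25)

    model = LinearRegression().fit(X_train, y_train)
    print(model.score(X_hold, y_hold))        # R^2 on the holdout sample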
Step 6: Presenting findings and building applications on top of them • Once you’ve built a well-performing model, you’re ready to present your findings to the world. • Sometimes you’ll also want to automate the execution of your models.