Data Science Interview Questions
Can you solve?
You have two buckets, one of 3 liters and the other of 5 liters. You are expected to measure exactly 4 liters. How will you complete the task?
Note: There is no third bucket.
Step 1: Fill the 5-liter bucket and pour it into the 3-liter bucket until the latter is full. You are left with 2 liters in the 5-liter bucket.
Step 2: Empty the 3-liter bucket and pour the contents of the 5-liter bucket into it. The 3-liter bucket now holds 2 liters.
Step 3: Fill the 5-liter bucket again and pour water into the 3-liter bucket (which already holds 2 liters from step 2) until it is full. This takes 1 liter, so you now have 4 liters in the 5-liter bucket.
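The three steps can be replayed in a few lines of Python. This is only an illustrative sketch: the helper `pour` and the variable names are hypothetical and do not come from the slides.

```python
def pour(src, dst, dst_capacity):
    """Pour from src into dst until src is empty or dst is full."""
    moved = min(src, dst_capacity - dst)
    return src - moved, dst + moved

SMALL_CAPACITY, LARGE_CAPACITY = 3, 5
small, large = 0, 0

# Step 1: fill the 5-liter bucket and top up the 3-liter bucket from it.
large = LARGE_CAPACITY
large, small = pour(large, small, SMALL_CAPACITY)   # large = 2, small = 3

# Step 2: empty the 3-liter bucket and move the remaining 2 liters into it.
small = 0
large, small = pour(large, small, SMALL_CAPACITY)   # large = 0, small = 2

# Step 3: refill the 5-liter bucket and top up the 3-liter bucket again.
large = LARGE_CAPACITY
large, small = pour(large, small, SMALL_CAPACITY)   # large = 4, small = 3

print(large)
```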
1. List the differences between supervised and unsupervised learning
Supervised Learning
• Uses known and labeled data as input: both an input and an output must be given to the model for it to be trained
• Has a feedback mechanism
• Most commonly used algorithms: decision tree, logistic regression, support vector machine
Unsupervised Learning
• Uses unlabeled data as input
• Has no feedback mechanism
• Most commonly used algorithms: k-means clustering, hierarchical clustering, apriori algorithm
2. How is logistic regression done?
Logistic regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features) by estimating probabilities using its underlying logistic (sigmoid) function.
Pipeline: inputs (X1, X2, X3, X4) → linear model → sigmoid function (probabilities between 0 and 1, for example 0.5, 0.8, 0.9, 0.1) → threshold classifier (0 or 1)
Linear model:
$$y = m x + c$$
Sigmoid function:
$$p = \frac{1}{1 + e^{-y}}$$
Equivalently, in log-odds form:
$$\ln\left(\frac{p}{1 - p}\right) = m x + c$$
[Figure: marks (0 to 100) against number of hours studied, fitted with a straight line; pass (0 or 1) against number of hours studied, fitted with a sigmoid curve]
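The linear model, sigmoid and threshold stages can be sketched as follows. The slope `m`, intercept `c` and the 0.5 threshold are made-up values for illustration only; in practice they would be estimated from training data.

```python
import math

def sigmoid(y):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y))

def predict_pass(hours, m=1.5, c=-6.0, threshold=0.5):
    """Return (probability, class) for a given number of hours studied.

    m, c and threshold are hypothetical values, not fitted parameters.
    """
    y = m * hours + c              # linear model
    p = sigmoid(y)                 # probability of passing
    return p, int(p >= threshold)  # threshold classifier: 1 = pass, 0 = fail

for hours in (2.0, 4.0, 6.0):
    p, label = predict_pass(hours)
    print(hours, round(p, 3), label)
```

Taking ln(p / (1 - p)) of any returned probability recovers the linear score m*x + c, which is the log-odds form given above.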
3. Explain the steps in making a decision tree
1. Take the entire dataset as input
2. Calculate the entropy of the target variable as well as of the predictor attributes
3. Calculate the information gain of all attributes
4. Choose the attribute with the highest information gain as the root node
5. Repeat the same process on every branch until the decision node of each branch is finalized
For example, suppose you want to build a decision tree to decide whether to accept or decline a job offer.
[Figure: decision tree with the decision nodes "Salary > $50,000", "Commute > 1 hour" and "Offers incentives", and the leaves "Accept offer" and "Decline offer"]
Tip: You should know the formulae for entropy and information gain!
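For reference, the entropy of a set of labels is H = -Σ p_i log2(p_i), where p_i is the proportion of class i, and the information gain of an attribute is the parent entropy minus the size-weighted entropy of the subsets produced by splitting on that attribute. A minimal sketch of both follows; the tiny job-offer dataset and the attribute names are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Parent entropy minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: four job offers and the decision taken for each.
rows = [
    {"high_salary": True,  "long_commute": False},
    {"high_salary": True,  "long_commute": True},
    {"high_salary": False, "long_commute": False},
    {"high_salary": False, "long_commute": True},
]
labels = ["accept", "decline", "decline", "decline"]

gains = {a: information_gain(rows, labels, a) for a in rows[0]}
print(gains)
```

The attribute with the largest gain would be chosen as the root node, and the same calculation would then be repeated inside each branch.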
4. How do you build a random forest model?
1. Randomly select k features from the total m features, where k << m
2. Among the k features, calculate the node d using the best split point
3. Split the node into daughter nodes using the best split
4. Repeat steps 2 and 3 until the leaf nodes are finalized
5. Build the forest by repeating steps 1 to 4 n times to create n trees
5. How can you avoid overfitting of your model?
There are three main methods to avoid overfitting:
1. Keep the model simple: take into account fewer variables, thereby removing some of the noise in the training data
2. Use cross-validation techniques such as k-fold cross-validation
3. Use regularization techniques such as LASSO that penalize certain model parameters if they are likely to cause overfitting
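To make the k-fold idea concrete, here is a dependency-free sketch of how the sample indices can be divided so that every sample is used for validation exactly once. The function name `k_fold_indices` is hypothetical, and the sketch deliberately skips shuffling, which is usually applied first in practice.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

A model would be fitted on each training split and scored on the matching validation split; a large gap between training and validation scores is a sign of overfitting.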
Can you solve?
There are 9 balls, one of which is heavier than the rest; the other 8 all weigh the same. What is the minimum number of weighings needed to find the heavier ball?
You will need to perform 2 weighings:
Step 1: Place three balls on each side of the balance and keep the remaining three aside.
Scenario (a): The two sides balance. The heavier ball is among the three that were set aside.
Scenario (b): The two sides do not balance. The heavier ball is among the three on the side that goes down.
Step 2: From the group of three identified in step 1, place one ball on each side. If they balance, the ball left out is the heavier one; otherwise, the balance shows which one is heavier.
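The two-weighing strategy can be checked exhaustively with a short simulation. The function names and the weights (1 for a normal ball, 2 for the heavy one) are assumptions made for this sketch.

```python
def weigh(left, right, weights):
    """Return 1 if the left pan is heavier, -1 if the right is, 0 if balanced."""
    l = sum(weights[i] for i in left)
    r = sum(weights[i] for i in right)
    return (l > r) - (l < r)

def find_heavy(weights):
    """Locate the single heavier ball among 9 using exactly two weighings."""
    groups = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    # Weighing 1: three balls against three, three set aside.
    result = weigh(groups[0], groups[1], weights)
    suspects = groups[0] if result == 1 else groups[1] if result == -1 else groups[2]
    # Weighing 2: one suspect against another, one set aside.
    a, b, c = suspects
    result = weigh([a], [b], weights)
    return a if result == 1 else b if result == -1 else c

for heavy in range(9):
    weights = [1] * 9
    weights[heavy] = 2
    print(heavy, find_heavy(weights))
```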
6. Differentiate between univariate, bivariate and multivariate analysis
Univariate
This type of data contains only one variable. The purpose of univariate analysis is to describe the data and find patterns that exist within it.
Example: height of students
Height (in cm): 164, 167.3, 170, 174.2, 178, 180
The patterns can be studied by drawing conclusions using the mean, median, mode, dispersion or range, minimum, maximum, etc.
Bivariate
This type of data involves two different variables. The analysis deals with causes and relationships and is done to find out the relationship between the two variables.
Example: temperature and ice cream sales in the summer season
Temperature (in Celsius)    Sales
20                          2000
25                          2100
26                          2300
28                          2400
30                          2600
35                          3100
Here, the table shows that sales rise as the temperature rises, i.e. the two variables are positively related.
Multivariate
When the data involves three or more variables, it is categorized as multivariate. It is similar to bivariate but contains more than one dependent variable.
Example: data for house price prediction
No. of rooms    Floor    Sqft. area    Price
2               0        900           40,00,000
3               2        1100          60,00,000
3.5             5        1500          90,00,000
4               3        2100          1,20,00,000
The patterns can be studied by drawing conclusions using the mean, median, mode, dispersion or range, minimum, maximum, etc.
7. What are the feature selection methods to select the right variables?
There are two main methods for feature selection:
Filter Methods
• Linear Discriminant Analysis
• ANOVA
• Chi-Square
Wrapper Methods
• Forward Selection
• Backward Selection
• Recursive Feature Elimination
8. In your choice of language: write a program that prints the numbers from 1 to 50. But for multiples of three print "Fizz" instead of the number, and for multiples of five print "Buzz". For numbers which are multiples of both three and five print "FizzBuzz".
[Program code and output screenshots omitted]
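One possible Python solution is sketched below. It is not the code from the original slide, which was not preserved; the function name `fizzbuzz` is chosen here for illustration.

```python
def fizzbuzz(n):
    """Return the FizzBuzz string for a single number."""
    if n % 15 == 0:        # multiple of both three and five
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

for i in range(1, 51):
    print(fizzbuzz(i))
```

The check for 15 comes first so that multiples of both numbers are not caught by the "Fizz" or "Buzz" branches.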
9. You are given a dataset consisting of variables having more than 30% missing values. How will you deal with them?
Ways to handle missing data values:
• If the dataset is huge, we can simply remove the rows with missing data values. It is the quickest way, i.e. we use the rest of the data to predict the values
• We can substitute missing values with the mean of the rest of the data using a pandas DataFrame in Python, i.e. mean = df.mean() followed by df = df.fillna(mean)
10. For the given points, how will you calculate the Euclidean distance in Python?
Given points:
plot1 = [1, 3]
plot2 = [2, 5]
The Euclidean distance can be calculated as:
from math import sqrt
euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
Can you solve?
What is the angle between the hour and minute hands of a clock when the time is half past six?
Note: A clock is a complete circle of 360°.
In 1 hour, the hour hand covers 360/12 = 30°
In 1 minute, the minute hand covers 360/60 = 6°
• The minute hand has travelled for 30 minutes, so it has covered 30 × 6 = 180°
• The hour hand has travelled for 6.5 hours, so it has covered 6.5 × 30 = 195°
• The difference between the two gives the angle between the hands: 195° − 180° = 15°
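The same reasoning generalizes to any time of day. A small sketch, with a hypothetical function name `clock_angle`, that returns the smaller of the two angles between the hands:

```python
def clock_angle(hour, minute):
    """Smaller angle in degrees between the hour and minute hands."""
    minute_angle = minute * 6.0                        # 6 degrees per minute
    hour_angle = (hour % 12 + minute / 60.0) * 30.0    # 30 degrees per hour
    diff = abs(hour_angle - minute_angle)
    return min(diff, 360.0 - diff)

print(clock_angle(6, 30))
```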
11. Explain dimensionality reduction and list its benefits
Dimensionality reduction refers to the process of converting a dataset with a large number of dimensions into data with fewer dimensions (fields) that conveys similar information concisely.
• It helps compress the data and reduce the storage space
• It reduces computation time, as fewer dimensions lead to less computing
• It removes redundant features; for example, there is no point in storing a value in two different units (meters and inches)
12. How will you calculate eigenvalues and eigenvectors of a 3 by 3 matrix?
Consider the matrix
$$A = \begin{bmatrix} -2 & -4 & 2 \\ -2 & 1 & 2 \\ 4 & 2 & 5 \end{bmatrix}$$
Characteristic equation:
$$\det(A - \lambda I) = 0$$
Expanding the determinant:
$$(-2-\lambda)\left[(1-\lambda)(5-\lambda) - 2 \times 2\right] + 4\left[(-2)(5-\lambda) - 4 \times 2\right] + 2\left[(-2) \times 2 - 4(1-\lambda)\right] = 0$$
$$-\lambda^3 + 4\lambda^2 + 27\lambda - 90 = 0, \quad \text{i.e.} \quad \lambda^3 - 4\lambda^2 - 27\lambda + 90 = 0$$
By hit and trial, for $\lambda = 3$:
$$3^3 - 4 \times 3^2 - 27 \times 3 + 90 = 0$$
Hence $(\lambda - 3)$ is a factor:
$$\lambda^3 - 4\lambda^2 - 27\lambda + 90 = (\lambda - 3)(\lambda^2 - \lambda - 30) = (\lambda - 3)(\lambda + 5)(\lambda - 6)$$
So, the eigenvalues are 3, -5 and 6.
Calculate the eigenvector for $\lambda = 3$. Writing the eigenvector as $(X, Y, Z)$ and setting $X = 1$, the first two rows of $(A - 3I)v = 0$ give:
$$-5 - 4Y + 2Z = 0$$
$$-2 - 2Y + 2Z = 0$$
Subtracting the two equations:
$$3 + 2Y = 0, \quad Y = -\tfrac{3}{2}$$
Substituting back into the second equation:
$$Z = -\tfrac{1}{2}$$
Similarly, we can calculate the eigenvectors for -5 and 6.
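The hand calculation can be double-checked with exact arithmetic. The sketch below uses only the standard library; the helper names `char_poly` and `mat_vec` are hypothetical. It confirms that 3, -5 and 6 are roots of the characteristic polynomial and that (1, -3/2, -1/2) satisfies A v = 3 v.

```python
from fractions import Fraction

A = [[-2, -4, 2],
     [-2,  1, 2],
     [ 4,  2, 5]]

def char_poly(lam):
    """Characteristic polynomial lambda^3 - 4*lambda^2 - 27*lambda + 90."""
    return lam**3 - 4 * lam**2 - 27 * lam + 90

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

eigenvalues = [3, -5, 6]
v3 = [Fraction(1), Fraction(-3, 2), Fraction(-1, 2)]   # eigenvector for lambda = 3

print([char_poly(lam) for lam in eigenvalues])          # all zero for true roots
print(mat_vec(A, v3) == [3 * x for x in v3])            # True if A v = 3 v
```

For larger matrices a numerical routine would normally be used instead of factoring the polynomial by hand.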
13. How should you maintain your deployed model?
Monitor: Constant monitoring of all the models is needed to determine their performance accuracy
Evaluate: Evaluation metrics of the current model are calculated to determine whether a new algorithm is needed
Compare: The new models are compared against each other to determine which model performs best
Rebuild: The best-performing model is rebuilt on the current state of the data
14. What are recommender systems?
A recommender system predicts the "rating" or "preference" a user would give to a product.
Collaborative Filtering
Example: Last.fm recommends tracks that are often played by other users with similar interests
Content-based Filtering
Example: Pandora uses the properties of a song to recommend music with similar properties
15. How do you find RMSE and MSE in a linear regression model?
RMSE and MSE are two of the most common measures of accuracy for a linear regression model.
RMSE indicates the Root Mean Square Error
MSE indicates the Mean Square Error
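By the standard definitions, MSE is the average of the squared differences between predicted and actual values, MSE = (1/N) Σ (predicted_i − actual_i)², and RMSE is its square root. A minimal sketch follows; the function names and sample numbers are illustrative and not taken from the slides.

```python
import math

def mse(actual, predicted):
    """Mean Square Error: average of the squared prediction errors."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

actual = [3.0, 5.0, 7.0, 9.0]        # hypothetical observed values
predicted = [2.0, 5.0, 8.0, 11.0]    # hypothetical model predictions
print(mse(actual, predicted), rmse(actual, predicted))
```

RMSE is expressed in the same units as the target variable, which often makes it easier to interpret than MSE.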
Can you solve?
If it rains on Saturday with probability 0.6, and it rains on Sunday with probability 0.2, what is the probability that it rains this weekend?
Assuming the two days are independent, the probability that it rains on at least one of them is:
Total probability − (Probability that it will not rain on Saturday) × (Probability that it will not rain on Sunday)
1 − (1 − 0.6)(1 − 0.2) = 1 − 0.32 = 0.68
16. How can you select k for k-means?
We use the "elbow method" to select k for k-means.
• The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k, where k is the number of clusters, and to compute the WSS for each value
• The within sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid
• Plotting WSS against the number of clusters, the elbow point, after which WSS stops decreasing sharply, is chosen as k
[Figure: WSS plotted against the number of clusters, with the elbow point marked]
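The WSS definition translates directly into code. The sketch below only computes WSS for cluster assignments that are already given; it does not run k-means itself, and the four sample points are invented for illustration.

```python
def wss(clusters):
    """Within sum of squares for a list of clusters of equal-length point tuples."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dim))
    return total

points = [(0, 0), (0, 2), (10, 0), (10, 2)]   # hypothetical data

one_cluster = [points]                         # k = 1
two_clusters = [points[:2], points[2:]]        # k = 2
print(wss(one_cluster), wss(two_clusters))
```

With an actual k-means run, this value would be recorded for each k and plotted to locate the elbow.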
17. What is the significance of p-value?
• p-value typically ≤ 0.05: indicates strong evidence against the null hypothesis, so you reject the null hypothesis
• p-value typically > 0.05: indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis
• p-value at the cutoff of 0.05: considered to be marginal (could go either way)
18. How can outlier values be treated?
You can drop outliers in the following cases:
1. If the outlier is a garbage value. Example: height of adult = abc ft. This cannot be true, as height cannot be a string value. In this case, the outlier can be removed
2. If the outliers have extreme values. For example, if all the data points are clustered between 0 and 10 but one point lies at 100, then we can remove this point
If you cannot drop outliers, you can try the following:
1. Try a different model. Data detected as outliers by a linear model can be fit by a non-linear model, so be sure you are choosing the right model
2. Try normalizing the data. This way, the extreme data points are pulled to a similar range
3. Use algorithms that are less affected by outliers, for example random forest
[Figure: actual values plotted against predicted values]
19. How can you say that a time series is stationary?
We can say that a time series is stationary when the mean and variance of the series are constant with time.
[Figure: four panels. Stationary: mean is constant with time. Non-stationary: mean is increasing with time. Stationary: variance is constant with time. Non-stationary: variance is changing with time]
20. How can you calculate accuracy using a confusion matrix?
Total = 650
               Actual p               Actual n
Predicted P    262 (True Positive)    15 (False Positive)
Predicted N    26 (False Negative)    347 (True Negative)
Accuracy = (True Positive + True Negative) / Total Observations
= (262 + 347) / 650
= 609 / 650
≈ 0.937
21. Write the equation and calculate precision and recall rate
Using the same confusion matrix (Total = 650; True Positive = 262, False Positive = 15, False Negative = 26, True Negative = 347):
Precision = (True Positive) / (True Positive + False Positive)
Recall Rate = (True Positive) / (True Positive + False Negative)
Precision = 262 / 277 ≈ 0.946
Recall = 262 / 288 ≈ 0.910
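The three metrics can be recomputed from the four cells of the confusion matrix. The variable names below are chosen for this sketch; the counts are the ones from the matrix above.

```python
# Cells of the confusion matrix (predicted vs. actual).
tp, fp = 262, 15    # predicted positive: actually positive / actually negative
fn, tn = 26, 347    # predicted negative: actually positive / actually negative

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)    # of all predicted positives, how many were right
recall = tp / (tp + fn)       # of all actual positives, how many were found

print(total, round(accuracy, 3), round(precision, 3), round(recall, 3))
```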
Can you solve?
If a drawer contains 12 red socks, 16 blue socks, and 20 white socks, how many must you pull out to be sure of having a matching pair?
The answer is 4. An example:
• First pick is white
• Second pick is red
• Third pick is blue, so no pairs yet
• Fourth pick is 100% guaranteed to make a pair, because it is either white, blue or red
So, four picks guarantee a pair. If there were four colors, the answer would be 5, and so on.
22. 'People who bought this, also bought…' recommendations seen on Amazon are a result of which algorithm?
The recommendation engine is built using collaborative filtering.
• Collaborative filtering exploits the behavior of other users and their purchase history in terms of ratings, selection, etc.
• It makes predictions on what might interest a person based on the preferences of many other users
• In this algorithm, features of the items are not known
For example, suppose x number of people buy a new phone and also buy a tempered glass with it. The next time a person buys a phone, they will be recommended to buy a tempered glass along with it.
23. Write a SQL query to list all orders with customer information
Order table columns: OrderId, CustomerId, OrderNumber, TotalAmount
Customer table columns: Id, FirstName, LastName, City, Country
-- ORDER is a reserved word in SQL, so the table name is quoted
SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
FROM "Order"
JOIN Customer
ON "Order".CustomerId = Customer.Id
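To try the join without a database server, it can be run against an in-memory SQLite database from Python. The table layout follows the two tables described above; the column types and sample rows are made up for illustration, and the table name is quoted because ORDER is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (
    Id INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, City TEXT, Country TEXT
);
CREATE TABLE "Order" (
    OrderId INTEGER PRIMARY KEY,
    CustomerId INTEGER REFERENCES Customer(Id),
    OrderNumber TEXT,
    TotalAmount REAL
);
-- Hypothetical sample rows
INSERT INTO Customer VALUES (1, 'Ada', 'Lovelace', 'London', 'UK');
INSERT INTO Customer VALUES (2, 'Alan', 'Turing', 'Manchester', 'UK');
INSERT INTO "Order" VALUES (10, 1, 'A-100', 250.0);
INSERT INTO "Order" VALUES (11, 2, 'A-101', 99.5);
""")

rows = conn.execute("""
SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
FROM "Order"
JOIN Customer ON "Order".CustomerId = Customer.Id
ORDER BY OrderNumber
""").fetchall()

for row in rows:
    print(row)
```

An inner join like this one lists only orders that have a matching customer; orders with an unknown CustomerId would be dropped.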
24. You are given a dataset on cancer detection. You have built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it?
Cancer detection results in imbalanced data.
In an imbalanced dataset, accuracy should not be used as a measure of performance, because it is important to focus on the remaining 4%, which are the people who were wrongly diagnosed. Wrong diagnosis is of major concern because there can be people who have cancer but were not predicted so.
Hence, in order to evaluate model performance, we should use sensitivity (true positive rate), specificity (true negative rate) and the F measure to determine the class-wise performance of the classifier.
25. Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?
• K-means clustering
• Linear regression
• K-NN
• Decision trees
Answer: K-NN. It can fill in a missing value from the most similar rows, for example using the most frequent value among the neighbours for a categorical variable and their average for a continuous one.
Can you solve?
Given a box of matches and two ropes, not necessarily identical, measure a period of 45 minutes.
Note: The ropes are not uniform in nature, and each rope takes exactly 60 minutes to burn out completely.
We have two ropes, A and B.
• Light A from both ends and B from one end
• When A has finished burning, we know that 30 minutes have elapsed and B has 30 minutes remaining
• Now light the other end of B as well, so that the remaining part of B burns out in 15 minutes
• Thus, we have 30 + 15 = 45 minutes
26. Below are the 8 actual values of the target variable in the train file: [0,0,0,1,1,1,1,1]. What is the entropy of the target variable?
• -(5/8 log(5/8) + 3/8 log(3/8))
• 5/8 log(5/8) + 3/8 log(3/8)
• 3/8 log(5/8) + 5/8 log(3/8)
• 5/8 log(3/8) – 3/8 log(5/8)
Hint: entropy is the negative sum, over the classes, of each class proportion times the log of that proportion.
Answer: the target has five 1s and three 0s, so the entropy is -(5/8 log(5/8) + 3/8 log(3/8)).
27. We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this use case?
Choose the right algorithm:
• Logistic regression
• Linear regression
• K-means clustering
• Apriori algorithm
Answer: Logistic regression, because the outcome is binary and the model has to output a probability.
28. After studying the behavior of a population, you have identified four specific individual types who are valuable to your study. You would like to find all users who are most similar to each individual type. Which algorithm is most appropriate for this study?
Choose the right algorithm:
• K-means clustering
• Linear regression
• Association rules
• Decision trees
Answer: K-means clustering, since the task is to group users around four types, which corresponds to clustering with k = 4.
29. You have run the association rules algorithm on your dataset, and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true?
Choose the right answer:
• {banana, apple, grape, orange} must be a frequent itemset
• {banana, apple} => {orange} must be a relevant rule
• {grape} => {banana, apple} must be a relevant rule
• {grape, apple} must be a frequent itemset
Answer: {grape, apple} must be a frequent itemset. A relevant rule is derived from a frequent itemset (here {banana, apple, grape} or {apple, orange, grape}), and every subset of a frequent itemset is itself frequent.
30. Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to visitors to your website has any impact on their purchase decision. Which analysis method should you use?
Choose the right analysis method:
• One-way ANOVA
• K-means clustering
• Association rules
• Student T-test
Answer: One-way ANOVA, because there are three groups to compare (first coupon, second coupon, no coupon), whereas a t-test compares only two.
Data Science Interview Questions | Data Science Interview Questions And Answers | Simplilearn


  • 1.
  • 2.
    What do youunderstand from Measures and Dimensions? Each field from the data source is automatically assigned a datatype (such as string, integer) and a role (dimension or measure) Aggregation applied on measures is ‘Sum’ by default but you can always change the default aggregation in the settings Can you solve? You have two buckets - one of 3 liters and other of 5 liters.You are expected to measure exactly 4 liters. How will you complete the task? Note:There is no third bucket
  • 3.
    Step 1: Fillin 5 liter bucket and empty it in the 3 liter bucket. You are left with 2 liter in the 5 liter bucket Step 2: Empty the 3 liter bucket and pour the contents of 5 liter bucket in it. So 3 liter bucket now has 2 liters Step 3: Fill the 5 liter bucket again and pour the water in 3 liter bucket (already has 2 liters of water from step 2) You now have 4 liters in the 5 liter bucket 53
  • 4.
    What are thedatatypes supported in Tableau?1 List the differences between supervised and unsupervised learning01
  • 5.
    1 List thedifferences between supervised and unsupervised learning Requires both an input and an output to be given to the model for it to be trained. • Uses known and labeled data as input • Uses unlabeled data as input • Most commonly used unsupervised learning algorithms are k-means clustering, hierarchical clustering, apriori algorithm • Most commonly used supervised learning algorithms are decision tree, logistic regression, support vector machine • Supervised learning has a feedback mechanism • Unsupervised learning has no feedback mechanism Supervised Learning Unsupervised Learning
  • 6.
    What are thedatatypes supported in Tableau?1 How is logistic regression done?02
  • 7.
    2 How islogistic regression done? Logistic Regression measures the relationship between the dependent variable (our label, what we want to predict) and the one or more independent variables (our features), by estimating probabilities using it’s underlying logistic function (sigmoid) X1 X2 X3 X4 0.5 0.8 0.9 0.1 0.9 0.1 0 or 1 Inputs Probabilities Values close to 0 and 1 Linear Model Sigmoid Function Threshold Classifier
  • 8.
    2 0 100 1 0 Sigmoid Curve Sigmoid Function y= m*x + c p = 1 1 + ⅇ − y p ln ( 1-p ) = m*x + c No. of hours studied No. of hours studied Marks Pass How is logistic regression done?
  • 9.
    What are thedatatypes supported in Tableau?1 Explain the steps in making a decision tree 03
  • 10.
    3 Explain thesteps in making a decision tree Take the entire dataset as input Calculate entropy of target variable as well as predictor attributes Calculate information gain of all attributes Choose the attribute with highest information gain as the root node Repeat the same process on every branch till the decision node of each branch is finalized
  • 11.
    3 Explain thesteps in making a decision tree NoYes Yes Salary > $50,000 No Commute > 1 hour YesNo Decline Offer Play Decline OfferOffers Incentives Decline OfferAccept Offer Tip: You should know the formulae for entropy and information gain! For example, if you want to build a decision tree to decide whether we should accept or decline a job offer
  • 12.
    What are thedatatypes supported in Tableau?1 How do you build a random forest model? 04
  • 13.
    4 How doyou build a random forest model? Randomly select “k” features from total “m” features Where k << m Among the “k” features, calculate the node “d” using the best split point Split the node into daughter nodes using the best split Repeat steps 2 and 3 steps until leaf nodes are finalized Build forest by repeating steps 1 to 4 for “n” number times to create “n” number of trees
  • 14.
    What are thedatatypes supported in Tableau?1 How can you avoid overfitting of your model? 05
  • 15.
    5 How canyou avoid overfitting of your model? There are three main methods to avoid overfitting: Keep the model simple: take into account fewer variables, thereby removing some of the noise in the training data Use cross-validation techniques such as k-folds cross-validation Use regularization techniques such as LASSO that penalize certain model parameters if they’re likely to cause overfitting
  • 16.
    What do youunderstand from Measures and Dimensions? Each field from the data source is automatically assigned a datatype (such as string, integer) and a role (dimension or measure) Aggregation applied on measures is ‘Sum’ by default but you can always change the default aggregation in the settings There are 9 balls out of which one ball is heavy in weight and rest are of the same weight. In how many minimum Weightings will you find the heavier ball? Can you solve?
  • 17.
    What do youunderstand from Measures and Dimensions? Each field from the data source is automatically assigned a datatype (such as string, integer) and a role (dimension or measure) Aggregation applied on measures is ‘Sum’ by default but you can always change the default aggregation in the settings You will need to perform 2 weightings: Step 1: Place three balls on each side Scenario(a): Balance out Out of the remaining three balls from step 1, take two balls and place one ball on each side – if they balance out then the left out ball will be the heavier ball. Otherwise, you will see it in the balance. Scenario(b): Not balanced out If the balls in step 1 do not balance out, then take those three balls and reproduce step 2 to find out the heavier ball.
  • 18.
    What are thedatatypes supported in Tableau?1 Differentiate between univariate, bivariate and multivariate analysis 06
  • 19.
    6 Differentiate betweenunivariate, bivariate and multivariate analysis This type of data contains only one variable. The purpose of univariate analysis is to describe the data and find patterns that exist within it Example: height of students The patterns can be studied by drawing conclusions using mean, median and mode, dispersion or range, minimum, maximum etc Height (in cm) 164 167.3 170 174.2 178 180
  • 20.
    6 Differentiate betweenunivariate, bivariate and multivariate analysis This type of data involves two different variables. The analysis of this type of data deals with causes and relationships and the analysis is done to find out the relationship among the two variables Example: temperature and ice cream sales in summer season Here, the relationship is visible from the table that temperature and sales are directly proportional to each other Temperature (in Celsius) Sales 20 2000 25 2100 26 2300 28 2400 30 2600 35 3100
  • 21.
    6 Differentiate betweenunivariate, bivariate and multivariate analysis When the data involves three or more variables, it is categorized under multivariate. It is similar to bivariate but contains more than one dependent variable Example: data for house price prediction The patterns can be studied by drawing conclusions using mean, median and mode, dispersion or range, minimum, maximum etc No. of rooms Floor Sqft. Area Price 2 0 900 40,00,00 3 2 1100 60,00,000 3.5 5 1500 90,00,000 4 3 2100 1,20,00,000
  • 22.
    Q7. What are the feature selection methods to select the right variables?
  • 23.
    Q7. What are the feature selection methods to select the right variables? There are two main methods for feature selection. Filter methods: Linear Discriminant Analysis, ANOVA, Chi-Square. Wrapper methods: Forward Selection, Backward Selection, Recursive Feature Elimination.
  • 24.
    Q8. In your choice of language: write a program that prints the numbers from 1 to 50. For multiples of three print “Fizz” instead of the number, and for multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.
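One possible answer in Python is sketched below. The helper name `fizzbuzz` is an assumption; any equivalent structure would do.

```python
def fizzbuzz(n):
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:  # multiple of both three and five
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


for number in range(1, 51):
    print(fizzbuzz(number))
```

The check for 15 comes first because a multiple of both three and five would otherwise be caught by the "Fizz" branch.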
  • 25.
  • 26.
  • 27.
    Q9. You are given a dataset consisting of variables having more than 30% missing values. How will you deal with them?
  • 28.
    Q9. You are given a dataset consisting of variables having more than 30% missing values. How will you deal with them? Ways to handle missing data values: If the dataset is huge, we can simply remove the rows with missing data values. It is the quickest way, i.e. we use the rest of the data to predict the values. Alternatively, we can substitute missing values with the mean of the rest of the data using a pandas DataFrame in Python, i.e. for numeric columns, mean = df.mean() followed by df = df.fillna(mean).
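Both strategies can be sketched without any third-party library. The example below uses plain Python lists with `None` marking a missing value; the function names and the sample data are assumptions made for illustration, and the pandas calls mentioned on the slide would play the same role on a DataFrame.

```python
def drop_incomplete_rows(rows):
    """Strategy 1: keep only the rows that have no missing (None) values."""
    return [row for row in rows if None not in row]


def impute_mean(values):
    """Strategy 2: replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:  # nothing to compute a mean from
        return list(values)
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical sample data.
rows = [(20, 50000), (None, 42000), (30, None), (40, 61000)]
ages = [20, None, 30, 40, None]

print(drop_incomplete_rows(rows))
print(impute_mean(ages))
```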
  • 29.
    Q10. For the given points, how will you calculate the Euclidean distance in Python?
  • 30.
    Q10. For the given points, how will you calculate the Euclidean distance in Python? Given points: plot1 = [1, 3], plot2 = [2, 5]. With sqrt imported from the math module: euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
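A runnable version of the expression above is sketched here, generalized to points of any dimension. The function name `euclidean_distance` is an assumption for illustration.

```python
from math import sqrt

plot1 = [1, 3]
plot2 = [2, 5]


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


distance = euclidean_distance(plot1, plot2)
print(distance)  # square root of (1 + 4)
```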
  • 31.
    Can you solve? What is the angle between the hour and minute hands of a clock when the time is half past six?
  • 32.
    Solution: Note: A clock is a complete circle of 360 degrees. In 1 hour, the hour hand covers 360/12 = 30°. In 1 minute, the minute hand covers 360/60 = 6°. • The minute hand has travelled for 30 minutes, so it has covered 30 × 6 = 180°. • The hour hand has travelled for 6.5 hours, so it has covered 6.5 × 30 = 195°. • The difference between the two gives the angle between the two hands. Thus, the required angle = 195° - 180° = 15°.
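The same reasoning can be turned into a small function that works for any time. This is a sketch; the name `clock_angle` is an assumption, and it returns the smaller of the two angles between the hands.

```python
def clock_angle(hour, minute):
    """Smaller angle in degrees between the hour and minute hands."""
    minute_angle = minute * 6                     # 360 / 60 = 6 degrees per minute
    hour_angle = (hour % 12) * 30 + minute * 0.5  # 30 degrees per hour plus drift within the hour
    difference = abs(hour_angle - minute_angle)
    return min(difference, 360 - difference)


print(clock_angle(6, 30))  # half past six
```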
  • 33.
    Q11. Explain dimensionality reduction, and list its benefits.
  • 34.
    Q11. Explain dimensionality reduction, and list its benefits. Dimensionality reduction refers to the process of converting a dataset with a large number of dimensions into data with fewer dimensions (fields) that conveys similar information concisely. Benefits: It helps in compressing data and reducing storage space. It reduces computation time, as fewer dimensions lead to less computing. It removes redundant features; for example, there is no point in storing a value in two different units (meters and inches).
  • 35.
    Q12. How will you calculate eigenvalues and eigenvectors of a 3 by 3 matrix?
  • 36.
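    Q12. How will you calculate eigenvalues and eigenvectors of a 3 by 3 matrix? Given $A = \begin{pmatrix} -2 & -4 & 2 \\ -2 & 1 & 2 \\ 4 & 2 & 5 \end{pmatrix}$. Characteristic equation: $\det(A - \lambda I) = 0$. Expanding the determinant: $(-2-\lambda)\,[(1-\lambda)(5-\lambda) - 2 \times 2] + 4\,[(-2)(5-\lambda) - 4 \times 2] + 2\,[(-2) \times 2 - 4(1-\lambda)] = 0$, which gives $-\lambda^3 + 4\lambda^2 + 27\lambda - 90 = 0$, i.e. $\lambda^3 - 4\lambda^2 - 27\lambda + 90 = 0$.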
  • 37.
    Q12. How will you calculate eigenvalues and eigenvectors of a 3 by 3 matrix? By hit and trial: $3^3 - 4 \times 3^2 - 27 \times 3 + 90 = 0$. Hence $(\lambda - 3)$ is a factor: $\lambda^3 - 4\lambda^2 - 27\lambda + 90 = (\lambda - 3)(\lambda^2 - \lambda - 30) = (\lambda - 3)(\lambda + 5)(\lambda - 6)$. So, the eigenvalues are 3, -5, 6. Calculate the eigenvector for $\lambda = 3$: for $X = 1$, $-5 - 4Y + 2Z = 0$ and $-2 - 2Y + 2Z = 0$.
  • 38.
    Q12. How will you calculate eigenvalues and eigenvectors of a 3 by 3 matrix? Subtracting the two equations: $3 + 2Y = 0$, so $Y = -\frac{3}{2}$. Substituting back into the second equation: $Z = -\frac{1}{2}$. Similarly, we can calculate the eigenvectors for -5 and 6.
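The hand calculation can be checked numerically with a few lines of plain Python. The sketch below verifies that 3, -5 and 6 are roots of the characteristic polynomial and that (1, -3/2, -1/2) satisfies A v = 3 v; the helper names are assumptions for illustration.

```python
# Matrix from the slide.
A = [
    [-2, -4, 2],
    [-2, 1, 2],
    [4, 2, 5],
]


def char_poly(lam):
    """Characteristic polynomial: lambda^3 - 4*lambda^2 - 27*lambda + 90."""
    return lam ** 3 - 4 * lam ** 2 - 27 * lam + 90


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


eigenvalues = [3, -5, 6]
v = [1, -1.5, -0.5]  # eigenvector found above for lambda = 3

for lam in eigenvalues:
    print(lam, char_poly(lam))  # each eigenvalue should give 0

print(mat_vec(A, v), [3 * x for x in v])  # the two lists should match
```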
  • 39.
    Q13. How should you maintain your deployed model?
  • 40.
    Q13. How should you maintain your deployed model? Monitor: constant monitoring of all models is needed to determine their performance accuracy. Evaluate: evaluation metrics of the current model are calculated to determine whether a new algorithm is needed. Compare: the new models are compared against each other to determine which one performs best. Rebuild: the best-performing model is rebuilt on the current state of the data.
  • 41.
    Q14. What are recommender systems?
  • 42.
    Q14. What are recommender systems? A recommender system predicts the "rating" or "preference" a user would give to a product. Collaborative Filtering: for example, Last.fm recommends tracks that are often played by other users with similar interests. Content-based Filtering: for example, Pandora uses the properties of a song to recommend music with similar properties.
  • 43.
    Q15. How to find RMSE and MSE in a linear regression model?
  • 44.
    Q15. How to find RMSE and MSE in a linear regression model? RMSE and MSE are two of the most common measures of accuracy for a linear regression. MSE is the Mean Square Error: the average of the squared differences between actual and predicted values. RMSE is the Root Mean Square Error: the square root of the MSE.
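In symbols, for N observations, MSE = (1/N) Σ (actual_i - predicted_i)² and RMSE = √MSE. The sketch below computes both in plain Python; the function names and the sample values are assumptions chosen for illustration.

```python
import math


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean squared error."""
    return math.sqrt(mse(actual, predicted))


# Hypothetical actual values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

print("MSE :", mse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

RMSE is expressed in the same units as the target variable, which often makes it easier to interpret than MSE.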
  • 45.
    Can you solve? If it rains on Saturday with probability 0.6, and it rains on Sunday with probability 0.2, what is the probability that it rains this weekend?
  • 46.
    Solution: Assuming rain on the two days is independent, the answer is Total probability - (Probability that it will not rain on Saturday) × (Probability that it will not rain on Sunday) = 1 - (1 - 0.6)(1 - 0.2) = 0.68
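The complement rule used above extends to any number of days. A minimal sketch, assuming the events are independent (the function name `prob_at_least_one` is illustrative):

```python
def prob_at_least_one(probabilities):
    """Probability that at least one of several independent events occurs."""
    p_none = 1.0
    for p in probabilities:
        p_none *= 1 - p  # probability that this event does not occur
    return 1 - p_none


p_weekend_rain = prob_at_least_one([0.6, 0.2])
print(round(p_weekend_rain, 2))
```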
  • 47.
    Q16. How can you select k for k-means?
  • 48.
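    Q16. How can you select k for k-means? We use the “Elbow Method” to select k for k-means. • The idea of the elbow method is to run k-means clustering on the dataset for a range of values of ‘k’, where ‘k’ is the number of clusters, and to compute the within sum of squares for each. • Within sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid. • Plotting WSS against the number of clusters, the elbow point, after which WSS stops dropping sharply, is taken as k. [Figure: WSS versus number of clusters, with the elbow point marked]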
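The procedure can be sketched with a very small one-dimensional k-means written in plain Python. Everything here is an assumption made for illustration: the sample data, the function names, and the deterministic seeding, which real libraries replace with more robust initialization.

```python
def kmeans_1d(points, k, iterations=20):
    """Tiny 1-D k-means (Lloyd's algorithm) with deterministic seeding."""
    ordered = sorted(points)
    n = len(ordered)
    # Seed centroids at evenly spaced positions in the sorted data.
    centroids = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean; keep it if the cluster is empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


def wss(centroids, clusters):
    """Within sum of squares: squared distance of each member to its centroid."""
    return sum(
        (p - c) ** 2 for c, members in zip(centroids, clusters) for p in members
    )


# Hypothetical data with three visible groups.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
wss_by_k = {k: wss(*kmeans_1d(data, k)) for k in range(1, 5)}
for k, value in wss_by_k.items():
    print(k, round(value, 2))
```

On this data WSS falls steeply up to k = 3 and barely changes afterwards, so the elbow sits at three clusters.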
  • 49.
    Q17. What is the significance of p-value?
  • 50.
    Q17. What is the significance of p-value? p-value typically ≤ 0.05: indicates strong evidence against the null hypothesis, so you reject the null hypothesis. p-value typically > 0.05: indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis. p-value at the cutoff of 0.05: considered to be marginal (could go either way).
  • 51.
    Q18. How can outlier values be treated?
  • 52.
    Q18. How can outlier values be treated? 1. You can drop an outlier if it is a garbage value. Example: height of adult = abc ft. This cannot be true, as height cannot be a string value, so in this case the outlier can be removed. 2. If the outliers have extreme values, they can be removed. For example, if all the data points are clustered between 0 and 10 but one point lies at 100, then we can remove this point. [Figure: actual values versus predicted values]
  • 53.
    Q18. How can outlier values be treated? If you cannot drop outliers, you can try the following: 1. Try a different model. Data detected as outliers by a linear model can be fit by a non-linear model, so be sure you are choosing the right model. 2. Try normalizing the data. This way the extreme data points are pulled into a similar range. 3. Use algorithms which are less affected by outliers, for example random forest. [Figure: actual values versus predicted values]
  • 54.
    Q19. How can you say that a time series data is stationary?
  • 55.
    Q19. How can you say that a time series data is stationary? We can say that a time series is stationary when the variance and mean of the series are constant with time. [Figure, first pair of plots: stationary (mean is constant with time) versus non-stationary (mean is increasing with time)] [Figure, second pair of plots: stationary (variance is constant with time) versus non-stationary (variance is changing with time)]
  • 56.
    Q20. How can you calculate accuracy using a confusion matrix?
  • 57.
    Q20. How can you calculate accuracy using a confusion matrix? Confusion matrix (total = 650): predicted P and actual p = 262 (True Positive); predicted P and actual n = 15 (False Positive); predicted N and actual p = 26 (False Negative); predicted N and actual n = 347 (True Negative). Accuracy = (True Positive + True Negative) / Total Observations = (262 + 347) / 650 = 609 / 650 ≈ 0.94
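The accuracy calculation can be reproduced directly from the four cells of the matrix. A minimal sketch, with variable and function names chosen for illustration:

```python
# Cells of the confusion matrix from the slide.
true_positive = 262
false_positive = 15
false_negative = 26
true_negative = 347


def accuracy(tp, fp, fn, tn):
    """Share of all observations that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


acc = accuracy(true_positive, false_positive, false_negative, true_negative)
print(round(acc, 4))
```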
  • 58.
    Q21. Write the equation and calculate precision and recall rate
  • 59.
    Q21. Write the equation and calculate precision and recall rate. Confusion matrix (total = 650): predicted P and actual p = 262 (True Positive); predicted P and actual n = 15 (False Positive); predicted N and actual p = 26 (False Negative); predicted N and actual n = 347 (True Negative). Precision = (True Positive) / (True Positive + False Positive) = 262 / 277 ≈ 0.95. Recall Rate = (True Positive) / (True Positive + False Negative) = 262 / 288 ≈ 0.91
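Both ratios can be computed from the same confusion-matrix cells. A short sketch, with function names chosen for illustration:

```python
def precision(tp, fp):
    """Of everything predicted positive, the share that is actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything actually positive, the share that was predicted positive."""
    return tp / (tp + fn)


# Cells from the slide: 262 true positives, 15 false positives, 26 false negatives.
p = precision(262, 15)
r = recall(262, 26)
print(round(p, 4), round(r, 4))
```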
  • 60.
    Can you solve? If a drawer contains 12 red socks, 16 blue socks, and 20 white socks, how many must you pull out to be sure of having a matching pair?
  • 61.
    Solution: The answer is 4. An example of the worst case: the first pick is white, the second pick is red and the third pick is blue, so there are no pairs yet. The fourth pick is 100% guaranteed to make a pair, because it is either white, blue or red. So, four picks guarantee a pair. If there were four colors, the answer would be 5, and so on.
  • 62.
    Q22. ‘People who bought this, also bought…’ recommendations seen on Amazon is a result of which algorithm?
  • 63.
    Q22. ‘People who bought this, also bought…’ recommendations seen on Amazon is a result of which algorithm? The recommendation engine uses Collaborative Filtering. Collaborative filtering exploits the behavior of other users and their purchase history in terms of ratings, selection, etc. It makes predictions on what might interest a person based on the preferences of many other users. In this algorithm, features of the items are not known.
  • 64.
    Q22. ‘People who bought this, also bought…’ recommendations seen on Amazon is a result of which algorithm? For example, suppose x people buy a new phone and also buy a tempered glass with it. The next time a person buys a phone, they will be recommended a tempered glass along with it.
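The phone and tempered glass example can be sketched as simple co-purchase counting. This is a deliberately simplified illustration of the idea, not how any production recommender works; the order data and the function name `also_bought` are assumptions.

```python
from collections import Counter


def also_bought(orders, item, top_n=3):
    """Items most often bought together with `item`, based only on other users' baskets."""
    co_counts = Counter()
    for basket in orders:
        if item in basket:
            co_counts.update(other for other in set(basket) if other != item)
    return [name for name, _ in co_counts.most_common(top_n)]


# Hypothetical purchase history: each set is one customer's order.
orders = [
    {"phone", "tempered glass"},
    {"phone", "tempered glass", "case"},
    {"phone", "tempered glass"},
    {"laptop", "mouse"},
]

print(also_bought(orders, "phone", top_n=1))
```

Note that the function never looks at item features, only at what other customers bought, which is the defining trait of collaborative filtering described above.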
  • 65.
    Q23. Write a SQL query to list all orders with customer information
  • 66.
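    Q23. Write a SQL query to list all orders with customer information. Order table columns: OrderId, CustomerId, OrderNumber, TotalAmount. Customer table columns: Id, FirstName, LastName, City, Country. Query (ORDER is a reserved word in SQL, so the table name is quoted here; the quoting syntax depends on the database): SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country FROM "Order" JOIN Customer ON "Order".CustomerId = Customer.Id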
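The query can be tried end to end with Python's built-in sqlite3 module. The schema below follows the column names on the slide, while the column types and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema mirroring the two tables on the slide; types and sample rows are made up.
conn.executescript(
    """
    CREATE TABLE Customer (
        Id INTEGER PRIMARY KEY,
        FirstName TEXT, LastName TEXT, City TEXT, Country TEXT
    );
    CREATE TABLE "Order" (
        OrderId INTEGER PRIMARY KEY,
        CustomerId INTEGER REFERENCES Customer(Id),
        OrderNumber TEXT,
        TotalAmount REAL
    );
    INSERT INTO Customer VALUES (1, 'Ada', 'Lovelace', 'London', 'UK');
    INSERT INTO "Order" VALUES (10, 1, 'A-001', 99.5);
    """
)

# "Order" is quoted because ORDER is a reserved word.
rows = conn.execute(
    """
    SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
    FROM "Order"
    JOIN Customer ON "Order".CustomerId = Customer.Id
    """
).fetchall()

for row in rows:
    print(row)
```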
  • 67.
    Q24. You are given a dataset on cancer detection. You’ve built a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?
  • 68.
    Q24. You are given a dataset on cancer detection. You’ve built a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it? Cancer detection results in IMBALANCED DATA. In an imbalanced dataset, accuracy should not be used as a measure of performance, because it is important to focus on the remaining 4%, which are the people who were wrongly diagnosed. Wrong diagnosis is of major concern because there can be people who have cancer but were not predicted to have it.
  • 69.
    Q24 (continued). Cancer detection results in IMBALANCED DATA. Hence, in order to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate) and the F measure to determine the class-wise performance of the classifier.
  • 70.
    Q25. Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?
  • 71.
    Q25. Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables? Options: K-means clustering; Linear regression; K-NN; Decision trees
  • 72.
    Q25. Answer: K-NN. A missing value can be imputed from the nearest neighbours of the record, using their most frequent category for a categorical variable or their mean for a continuous variable.
  • 73.
    Can you solve? Given a box of matches and two ropes, not necessarily identical, measure a period of 45 minutes. Note: The ropes are not uniform in nature, and each rope takes exactly 60 minutes to burn out completely.
  • 74.
    Solution: We have two ropes, A and B. • Light A from both ends and B from one end. • When A has finished burning, we know that 30 minutes have elapsed and B has 30 minutes remaining. • Now light the other end of B as well, so that the remaining part of B burns out in 15 minutes. • Thus, we have 30 + 15 = 45 minutes.
  • 75.
    Q26. Below are the 8 actual values of the target variable in the train file: [0,0,0,1,1,1,1,1]. What is the entropy of the target variable?
  • 76.
    Q26. What is the entropy of the target variable [0,0,0,1,1,1,1,1]? Options: -(5/8 log(5/8) + 3/8 log(3/8)); 5/8 log(5/8) + 3/8 log(3/8); 3/8 log(5/8) + 5/8 log(3/8); 5/8 log(3/8) - 3/8 log(5/8)
  • 77.
    Q26. Answer: -(5/8 log(5/8) + 3/8 log(3/8)). Hint: entropy is the negative sum of p log(p) over the classes; here five of the eight values are 1 (p = 5/8) and three are 0 (p = 3/8).
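The entropy of the target variable can be evaluated numerically. The slide does not state the base of the logarithm; the sketch below assumes base 2, which gives the result in bits, and the function name `entropy` is illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels, using log base 2 (an assumption)."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


target = [0, 0, 0, 1, 1, 1, 1, 1]
h = entropy(target)
print(round(h, 4))
```

A perfectly balanced binary target would give an entropy of 1 bit, so a value slightly below 1 reflects the mild 5-to-3 imbalance.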
  • 78.
    Q27. We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this use case?
  • 79.
    Q27. Choose the right algorithm. Options: Logistic regression; Linear regression; K-means clustering; Apriori algorithm
  • 80.
    Q27. Answer: Logistic regression, because the outcome is the probability of a binary event.
  • 81.
    Q28. After studying the behavior of a population, you have identified four specific individual types who are valuable to your study. You would like to find all users who are most similar to each individual type. Which algorithm is most appropriate for this study?
  • 82.
    Q28. Choose the right algorithm. Options: K-means clustering; Linear regression; Association rules; Decision trees
  • 83.
    Q28. Answer: K-means clustering, with k = 4, so that each user is grouped with the individual type it is most similar to.
  • 84.
    Q29. You have run the association rules algorithm on your dataset, and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true?
  • 85.
    Q29. Choose the right answer. Options: {banana, apple, grape, orange} must be a frequent itemset; {banana, apple} => {orange} must be a relevant rule; {grape} => {banana, apple} must be a relevant rule; {grape, apple} must be a frequent itemset
  • 86.
    Q29. Answer: {grape, apple} must be a frequent itemset, because it is a subset of the frequent itemsets behind both rules.
  • 87.
    Q30. Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to visitors to your website has any impact on their purchase decision. Which analysis method should you use?
  • 88.
    Q30. Choose the right analysis method. Options: One-way ANOVA; K-means clustering; Association rules; Student T-test
  • 89.
    Q30. Answer: One-way ANOVA, because three groups are compared (two coupon types and no coupon).

Editor's Notes

  • #4 Note: We have to measure 4 liters in the 5 liter bucket only, also please mention that there are no measurements given on the bucket.
  • #8 Note: Please mention the significance of the sigmoid function and the threshold classifier. As we are moving from linear to logistic, we can also talk about the difference between linear and logistic regression in a line or two. These probabilities must then be transformed into binary values in order to actually make a prediction. This is the task of the logistic function, also called the sigmoid function. The sigmoid function is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits. These values between 0 and 1 will then be transformed into either 0 or 1 using a threshold classifier.
  • #9 Note: This is an example, which will determine whether a student will pass or fail, the factor which will help us determine is the “no of hours studied”. So, no of hours studied by a student is directly proportional to whether he will pass or fail.
  • #12 Example is “to build a decision tree to determine whether you should accept a job offer or not”
  • #16 Explain overfitting
  • #24 LDA: Linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable. ANOVA: ANOVA stands for analysis of variance. It is similar to LDA except that it operates on one or more categorical independent features and one continuous dependent feature. It provides a statistical test of whether the means of several groups are equal. Chi-Square: a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution. Forward Selection: an iterative method in which we start with no features in the model. In each iteration, we add the feature which best improves the model, until adding a new variable no longer improves its performance. Backward Elimination: we start with all the features and, at each iteration, remove the least significant feature, which improves the performance of the model. We repeat this until no improvement is observed on removal of features. Recursive Feature Elimination: a greedy optimization algorithm which aims to find the best-performing feature subset. It repeatedly creates models and sets aside the best or the worst performing feature at each iteration. It constructs the next model with the remaining features until all the features are exhausted. It then ranks the features based on the order of their elimination.
  • #33 Please explain the concept of angles in the clock so that the viewer can answer similar questions in the interview
  • #45 Please explain these terms and why are they significant for measuring accuracy
  • #51 Please explain the concept of null hypothesis and alternative hypothesis
  • #60 Please explain the significance of precision and recall rate
  • #64 In this, we can explain briefly about how recommender systems work
  • #76 Note: We can talk about entropy and how it affects the decision tree