
DannyLGZ/ml-lab

 
 


Machine Learning Lab

Part I: Preprocessing and EDA

  • The data comes from a global e-retailer and covers orders from 2012 to 2015. Import the Orders dataset and do some basic EDA.
  • Problems 1 to 3 focus on data cleaning and data visualization. You can use any packages you are familiar with to produce plots, and you should provide brief interpretations of your findings.

Problem 1: Dataset Import & Cleaning

Check the "Profit" and "Sales" columns in the dataset and convert both to a numeric type.
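
A minimal pandas sketch of this conversion is shown below. It assumes the raw columns are strings carrying a currency symbol and thousands separators; the toy rows and the helper name `to_numeric_currency` are illustrative and not taken from the lab data.

```python
import pandas as pd

# Toy rows standing in for the real Orders file. The "$" and "," formatting
# is an assumption about how the raw columns look.
orders = pd.DataFrame({
    "Sales": ["$1,221.30", "$45.00", "$309.12"],
    "Profit": ["$62.15", "-$5.50", "$1,004.00"],
})

def to_numeric_currency(col: pd.Series) -> pd.Series:
    """Strip "$" and "," characters, then parse the remainder as a float."""
    cleaned = col.astype(str).str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")

for name in ["Sales", "Profit"]:
    orders[name] = to_numeric_currency(orders[name])

print(orders.dtypes)
```

After the conversion it is worth checking `orders[["Sales", "Profit"]].isna().sum()` to see whether any value failed to parse.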

Problem 2: Inventory Management

  • Retailers that depend on seasonal shoppers have a particularly challenging job when it comes to inventory management. Your manager is making plans for next year's inventory.

  • He wants you to answer the following questions:

    1. Is there any seasonal trend of inventory in the company?
    2. Is the seasonal trend the same for different categories?
  • Hint: Each order has an attribute called Quantity that indicates the number of units of a product in the order. If an order contains more than one product, there will be multiple observations of the same order.
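
One way to approach both questions is to sum Quantity by calendar month, first overall and then per category. The sketch below uses made-up rows; the `Category` column name, the derived `Month` column, and the date format are assumptions about the Orders file.

```python
import pandas as pd

# Toy rows; the real Orders file has many more columns.
orders = pd.DataFrame({
    "Order.Date": ["1/15/12", "1/20/12", "6/3/12", "11/28/12", "11/30/12"],
    "Category": ["Furniture", "Technology", "Furniture", "Technology", "Furniture"],
    "Quantity": [2, 1, 4, 6, 3],
})

# Parse the dates (format assumed) and extract the calendar month.
orders["Order.Date"] = pd.to_datetime(orders["Order.Date"], format="%m/%d/%y")
orders["Month"] = orders["Order.Date"].dt.month

# Question 1: total units ordered per month.
monthly_total = orders.groupby("Month")["Quantity"].sum()

# Question 2: the same totals split by category, one column per category.
monthly_by_category = (
    orders.groupby(["Month", "Category"])["Quantity"].sum().unstack(fill_value=0)
)

print(monthly_total)
print(monthly_by_category)
```

Calling `.plot()` on either result gives a quick line chart for comparing the seasonal shape across categories.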

Problem 3: Why did customers make returns?

  • Your manager asks you for a brief report (plots and interpretations) on returned orders.

    1. How much profit did we lose due to returns each year?

    2. How many customers returned orders more than once? More than 5 times?

    3. Which regions are more likely to return orders?

    4. Which categories (sub-categories) of products are more likely to be returned?

  • Hint: Merge the Returns dataframe with the Orders dataframe using Order.ID.
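
The merge and the first three questions can be sketched as follows. The rows are invented for illustration, and the `Customer.ID` and `Region` column names are assumptions about the Orders file. Treating the profit of a returned order as "lost" is one possible reading of question 1.

```python
import pandas as pd

# Toy data: order A1 has two product lines, so it appears twice.
orders = pd.DataFrame({
    "Order.ID": ["A1", "A1", "B2", "C3", "D4", "E5"],
    "Order.Date": pd.to_datetime([
        "2012-03-01", "2012-03-01", "2012-07-10",
        "2013-01-05", "2013-02-11", "2013-09-09",
    ]),
    "Customer.ID": ["c1", "c1", "c2", "c1", "c3", "c2"],
    "Region": ["East", "East", "West", "East", "West", "West"],
    "Profit": [10.0, 5.0, 20.0, 8.0, 12.0, 30.0],
})
returns = pd.DataFrame({"Order.ID": ["A1", "C3", "E5"]})

# Left merge on Order.ID; the indicator column marks rows found in returns.
merged = orders.merge(
    returns[["Order.ID"]].drop_duplicates(),
    on="Order.ID", how="left", indicator=True,
)
merged["Returned"] = merged["_merge"] == "both"
merged = merged.drop(columns="_merge")

returned = merged[merged["Returned"]]

# 1. Profit tied to returned orders, per year.
lost_profit_by_year = returned.groupby(returned["Order.Date"].dt.year)["Profit"].sum()

# 2. Number of distinct returned orders per customer.
returns_per_customer = returned.groupby("Customer.ID")["Order.ID"].nunique()
n_repeat = int((returns_per_customer > 1).sum())

# 3. Share of orders returned per region (one row per order).
return_rate_by_region = (
    merged.drop_duplicates("Order.ID").groupby("Region")["Returned"].mean()
)

print(lost_profit_by_year, n_repeat, return_rate_by_region, sep="\n")
```

Question 4 follows the same pattern as question 3, grouping by the category or sub-category column instead of the region.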

Part II: Machine Learning and Business Use Case

Now your manager has a basic understanding of why customers returned orders. Next, he wants you to use machine learning to predict which orders are most likely to be returned. In this part, you will generate several features based on our previous findings and your manager's requirements.

Problem 4: Feature Engineering

Step 1: Create the dependent variable

  • First of all, we need to generate a categorical variable which indicates whether an order has been returned or not.
  • Hint: the returned orders’ IDs are contained in the dataset “returns”
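
A compact way to build this label is a membership test against the returned order IDs. The sketch below uses toy rows, and the column name `Returned` is a hypothetical choice.

```python
import pandas as pd

# Toy data; order A1 spans two rows because it contains two products.
orders = pd.DataFrame({"Order.ID": ["A1", "A1", "B2", "C3"]})
returns = pd.DataFrame({"Order.ID": ["A1", "C3"]})

# 1 if the order's ID appears in the returns dataset, 0 otherwise.
orders["Returned"] = orders["Order.ID"].isin(returns["Order.ID"]).astype(int)

print(orders)
```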

Step 2:

  • Your manager believes that how long it took the order to ship would affect whether the customer would return it or not.
  • He wants you to generate a feature which can measure how long it takes the company to process each order.
  • Hint: Process.Time = Ship.Date - Order.Date
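
The hint translates almost directly into pandas once both columns are parsed as dates. This sketch assumes ISO-formatted date strings in the toy rows; the real file may need an explicit `format` argument.

```python
import pandas as pd

# Toy rows with assumed ISO date strings.
orders = pd.DataFrame({
    "Order.ID": ["A1", "B2", "C3"],
    "Order.Date": ["2012-03-01", "2012-07-10", "2013-01-05"],
    "Ship.Date": ["2012-03-04", "2012-07-10", "2013-01-12"],
})

for col in ["Order.Date", "Ship.Date"]:
    orders[col] = pd.to_datetime(orders[col])

# Subtracting two datetime columns gives a timedelta; keep whole days.
orders["Process.Time"] = (orders["Ship.Date"] - orders["Order.Date"]).dt.days

print(orders[["Order.ID", "Process.Time"]])
```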

Step 3:

  • If a product has been returned before, it may be returned again.
  • Let us generate a feature that indicates how many times the product has been returned before.
  • If it never got returned, we just impute using 0.
  • Hint: Group by different Product.ID
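
Following the hint literally, the count can be built by grouping returned rows by Product.ID and mapping the result back onto every row. The feature name `Return.Times` and the `Returned` label column are hypothetical names, and the rows are toy data.

```python
import pandas as pd

# Toy rows; Returned is the 0/1 label from Step 1.
orders = pd.DataFrame({
    "Product.ID": ["p1", "p1", "p2", "p3", "p1"],
    "Returned": [1, 0, 0, 1, 1],
})

# Number of returned rows per product, across the whole dataset.
return_counts = orders.loc[orders["Returned"] == 1].groupby("Product.ID").size()

# Map the counts back; products that were never returned get 0.
orders["Return.Times"] = (
    orders["Product.ID"].map(return_counts).fillna(0).astype(int)
)

print(orders)
```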

Problem 5: Fitting Models

  • You can use any binary classification method you have learned so far.
  • Use 80/20 training and test splits to build your model.
  • Double check the column types before you fit the model.
  • Only include useful features, i.e., all the IDs should be excluded from your training set.
  • Note that fewer than 5% of the orders have been returned, so you should consider using the createDataPartition function from the caret package (R) or StratifiedKFold from sklearn (Python) when running cross-validation.
  • Do not forget to call set.seed() (or fix random_state in sklearn) before the split to make your results reproducible.
  • Note: We are not looking for the best tuned model in the lab so don't spend too much time on grid search. Focus on model evaluation and the business use case of each model.
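
The stratified split and cross-validation can be sketched in scikit-learn as below. The data is synthetic with about 5% positives and no real signal, the feature names reuse the hypothetical ones from Problem 4, and logistic regression with `class_weight="balanced"` is just one possible classifier, not a recommendation from the lab.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the engineered features (no real signal).
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "Process.Time": rng.integers(0, 8, size=n),
    "Return.Times": rng.integers(0, 4, size=n),
})
y = pd.Series(np.zeros(n, dtype=int), name="Returned")
y.iloc[:50] = 1  # 5% of the rows are labelled as returned

# 80/20 split; stratify=y keeps the class ratio equal in both parts,
# and random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" reweights the rare positive class.
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Stratified 5-fold cross-validation on the training part, scored by recall.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="recall")

model.fit(X_train, y_train)
print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```

Without stratification, a random 20% sample could by chance contain very few returned orders, which makes any test metric on the positive class unstable.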

Problem 6: Evaluating Models

  • What is the best metric to evaluate your model? Is accuracy good for this case?
  • Now you have multiple models, which one would you pick?
  • Can you get any clues from the confusion matrix? What do precision and recall mean in this case? Which one do you care about the most? How will your model help the manager make decisions?
  • Note: The last question is open-ended. Your answer could be completely different depending on your understanding of this business problem.
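
To make the accuracy question concrete, the sketch below compares a baseline that never predicts a return with a hypothetical model on made-up labels. The helper `evaluate` is illustrative; `sklearn.metrics` provides equivalent functions.

```python
def evaluate(y_true, y_pred):
    """Return confusion-matrix counts plus accuracy, precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Define precision and recall as 0.0 when their denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision, "recall": recall}

# Made-up labels: 2 returned orders out of 20.
y_true = [1, 1] + [0] * 18

# Baseline: predict "not returned" for every order.
baseline = evaluate(y_true, [0] * 20)

# Hypothetical model: catches one return, raises two false alarms.
model_pred = [1, 0] + [1, 1] + [0] * 16
model = evaluate(y_true, model_pred)

print(baseline)
print(model)
```

Here the baseline has the higher accuracy (0.90 versus 0.85) but a recall of 0, so it never flags a single returned order. In this setting, precision is the share of flagged orders that are actually returned, and recall is the share of returned orders that the model flags.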

Problem 7: Feature Engineering Revisit

  • Is there anything wrong with the new feature we generated? How should we fix it?
  • Hint: For the real test set, we do not know whether an order will get returned or not.
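
One possible reading of the hint is that the return count must not use test-set labels. A sketch of that fix, with toy rows and the hypothetical feature name `Return.Times`, computes the counts from the training rows only and maps them onto the test rows.

```python
import pandas as pd

# Toy split; the test rows carry no Returned label, as in real use.
train = pd.DataFrame({
    "Product.ID": ["p1", "p1", "p2", "p3"],
    "Returned": [1, 1, 0, 1],
})
test = pd.DataFrame({"Product.ID": ["p1", "p2", "p4"]})

# Count returns per product using training labels only.
train_counts = train.groupby("Product.ID")["Returned"].sum()

# Apply the training-derived counts to the test rows; products that are
# unseen in training, or never returned there, get 0.
test["Return.Times"] = (
    test["Product.ID"].map(train_counts).fillna(0).astype(int)
)

print(test)
```

If the same counts were mapped onto the training rows, each row's own label would still contribute to its feature, so a stricter variant would count only returns from orders placed earlier than the current one.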
