- The data comes from a global e-retailer company, including orders from 2012 to 2015. Import the Orders dataset and do some basic EDA.
- Problems 1 to 3 focus mainly on data cleaning and data visualization. You can use any packages you are familiar with to create plots, and you should provide brief interpretations of your findings.
Check "Profit" and "Sales" in the dataset and convert these two columns to numeric type.
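As a rough illustration of the conversion step, the sketch below assumes the two columns were read in as strings with a currency symbol and thousands separators (for example `$1,234.56`). That format is an assumption, so inspect the real values first. The helper name `to_numeric_currency` and the toy rows are made up for the example.

```python
import pandas as pd

# Toy stand-in for the Orders data; the "$1,234.56" string format is an assumption.
orders = pd.DataFrame({
    "Sales": ["$1,234.56", "$20.00", "$7.50"],
    "Profit": ["$100.10", "-$5.25", "$0.00"],
})

def to_numeric_currency(col: pd.Series) -> pd.Series:
    """Strip '$' and ',' then convert to float; unparseable values become NaN."""
    cleaned = col.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

for name in ["Sales", "Profit"]:
    orders[name] = to_numeric_currency(orders[name])

print(orders.dtypes)
```

Using `errors="coerce"` turns anything that still fails to parse into `NaN`, which makes leftover formatting problems easy to count with `isna().sum()`.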
- Retailers that depend on seasonal shoppers have a particularly challenging job when it comes to inventory management. Your manager is making plans for next year's inventory.
- He wants you to answer the following questions:
- Is there any seasonal trend of inventory in the company?
- Is the seasonal trend the same for different categories?
- Hint: Each order has an attribute called `Quantity` that indicates the number of products in the order. If an order contains more than one product, there will be multiple observations of the same order.
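One possible way to look for a seasonal pattern is to sum `Quantity` by calendar month, overall and per category. The sketch below uses a tiny made-up table. The column name `Category` and the derived `Month` column are assumptions, and `Order.Date` is assumed to be parseable by `pd.to_datetime`.

```python
import pandas as pd

# Toy rows standing in for Orders; values are invented for illustration.
orders = pd.DataFrame({
    "Order.Date": ["2012-01-15", "2012-01-20", "2012-06-03",
                   "2013-06-09", "2013-12-01", "2013-12-24"],
    "Category": ["Furniture", "Technology", "Furniture",
                 "Technology", "Furniture", "Technology"],
    "Quantity": [2, 1, 5, 3, 7, 4],
})
orders["Order.Date"] = pd.to_datetime(orders["Order.Date"])
orders["Month"] = orders["Order.Date"].dt.month

# Total units ordered per calendar month, pooled over all years.
monthly_total = orders.groupby("Month")["Quantity"].sum()

# Same totals split by category: one column per category.
monthly_by_category = (
    orders.groupby(["Month", "Category"])["Quantity"].sum().unstack(fill_value=0)
)

print(monthly_total)
print(monthly_by_category)
```

Calling `monthly_by_category.plot()` (with matplotlib installed) would draw one line per category, which makes it easier to compare their seasonal shapes.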
- Your manager requires you to give a brief report (plots + interpretations) on returned orders.
- How much profit did we lose due to returns each year?
- How many customers returned more than once? More than 5 times?
- Which regions are more likely to return orders?
- Which categories (sub-categories) of products are more likely to be returned?
- Hint: Merge the Returns dataframe with the Orders dataframe using `Order.ID`.
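The merge hint could be applied roughly as follows. This is a sketch on invented rows: the column name `Customer.ID` and the derived `Year` column are assumptions. Treating the summed `Profit` of returned order lines as "profit lost" is one possible reading of the question, not the only one.

```python
import pandas as pd

# Toy Orders and Returns tables; all values are made up.
orders = pd.DataFrame({
    "Order.ID": ["A1", "A1", "B2", "C3", "D4"],
    "Order.Date": pd.to_datetime(
        ["2012-03-01", "2012-03-01", "2012-07-10", "2013-02-05", "2013-09-30"]
    ),
    "Customer.ID": ["c1", "c1", "c2", "c1", "c3"],
    "Profit": [10.0, 5.0, 20.0, 8.0, 30.0],
})
returns = pd.DataFrame({"Order.ID": ["A1", "C3"]})
orders["Year"] = orders["Order.Date"].dt.year

# Inner join keeps only order lines whose Order.ID appears in Returns.
# De-duplicating Returns first avoids multiplying rows during the merge.
returned = orders.merge(returns.drop_duplicates("Order.ID"), on="Order.ID", how="inner")

# Profit attached to returned order lines, per order year.
lost_by_year = returned.groupby("Year")["Profit"].sum()

# Number of distinct returned orders per customer.
returns_per_customer = returned.groupby("Customer.ID")["Order.ID"].nunique()
n_more_than_once = int((returns_per_customer > 1).sum())
n_more_than_five = int((returns_per_customer > 5).sum())

print(lost_by_year)
print(n_more_than_once, n_more_than_five)
```

The same `returned` table can be grouped by region or category, and dividing by the corresponding counts in `orders` gives return rates rather than raw counts.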
Now your manager has a basic understanding of why customers returned orders. Next, he wants you to use machine learning to predict which orders are most likely to be returned. In this part, you will generate several features based on our previous findings and your manager's requirements.
- First of all, we need to generate a categorical variable which indicates whether an order has been returned or not.
- Hint: the returned orders’ IDs are contained in the dataset “returns”
- Your manager believes that how long it took the order to ship would affect whether the customer would return it or not.
- He wants you to generate a feature which can measure how long it takes the company to process each order.
- Hint: Process.Time = Ship.Date - Order.Date
- If a product has been returned before, it may be returned again.
- Let us generate a feature that indicates how many times the product has been returned before.
- If it was never returned, we just impute 0.
- Hint: Group by different Product.ID
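The three features described above could be built along these lines. This is a minimal sketch on invented rows. The output column names `Returned` and `Returned.Times` are hypothetical, while `Process.Time`, `Ship.Date`, `Order.Date`, `Order.ID` and `Product.ID` come from the hints.

```python
import pandas as pd

# Toy Orders and Returns tables; all values are made up.
orders = pd.DataFrame({
    "Order.ID": ["A1", "A1", "B2", "C3", "D4"],
    "Product.ID": ["p1", "p2", "p1", "p1", "p3"],
    "Order.Date": pd.to_datetime(
        ["2012-03-01", "2012-03-01", "2012-07-10", "2013-02-05", "2013-09-30"]
    ),
    "Ship.Date": pd.to_datetime(
        ["2012-03-03", "2012-03-03", "2012-07-10", "2013-02-12", "2013-10-02"]
    ),
})
returns = pd.DataFrame({"Order.ID": ["A1", "C3"]})

# 1) Target: 1 if the order's ID appears in Returns, else 0.
orders["Returned"] = orders["Order.ID"].isin(returns["Order.ID"]).astype(int)

# 2) Process.Time = Ship.Date - Order.Date, expressed in whole days.
orders["Process.Time"] = (orders["Ship.Date"] - orders["Order.Date"]).dt.days

# 3) Returned order lines per Product.ID, counted over the whole table;
#    products with no match are filled with 0.
returned_counts = orders.groupby("Product.ID")["Returned"].sum()
orders["Returned.Times"] = (
    orders["Product.ID"].map(returned_counts).fillna(0).astype(int)
)

print(orders[["Order.ID", "Product.ID", "Returned", "Process.Time", "Returned.Times"]])
```

Note that step 3 counts returns over every row, including the row being described, so each row's own label contributes to its feature.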
- You can use any binary classification method you have learned so far.
- Use 80/20 training and test splits to build your model.
- Double check the column types before you fit the model.
- Only include useful features, i.e. all the `ID`s should be excluded from your training set.
- Note that fewer than 5% of the orders have been returned, so you should consider using the `createDataPartition` function from the `caret` package (R) or `StratifiedKFold` from sklearn (Python) when running cross-validation.
- Do not forget to `set.seed()` before the split to make your results reproducible.
- Note: We are not looking for the best tuned model in this lab, so don't spend too much time on grid search. Focus on model evaluation and the business use case of each model.
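In Python, a stratified 80/20 split followed by stratified cross-validation might look like the sketch below. The data is synthetic with a 4% positive rate, and the feature columns are placeholders. `random_state` plays the role that `set.seed()` plays in R.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in: 1000 order lines, 40 of them (4%) returned.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "Process.Time": rng.integers(0, 8, size=n),
    "Quantity": rng.integers(1, 10, size=n),
})
y = pd.Series(np.r_[np.ones(40, dtype=int), np.zeros(n - 40, dtype=int)], name="Returned")

# 80/20 split that preserves the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified folds on the training part only; the test part stays untouched.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = [
    float(y_train.iloc[val_idx].mean()) for _, val_idx in skf.split(X_train, y_train)
]

print(len(X_train), len(X_test), int(y_test.sum()))
print(fold_rates)
```

Without stratification, a random split of such imbalanced data can leave a fold with very few returned orders, which makes recall estimates on that fold unstable.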
- What is the best metric to evaluate your model? Is accuracy good for this case?
- Now you have multiple models, which one would you pick?
- Can you get any clues from the confusion matrix? What is the meaning of precision and recall in this case? Which one do you care about the most? How will your model help the manager make decisions?
- Note: The last question is open-ended. Your answer could be completely different depending on your understanding of this business problem.
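To make the metric questions concrete, the sketch below compares a trivial "nothing is returned" baseline with a hypothetical model on hand-made labels (100 orders, 4 returned). The predictions are invented purely to show how the numbers are computed. They say nothing about how a real model would perform on this dataset.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# 100 orders: the first 4 were returned, the remaining 96 were not.
y_true = np.array([1] * 4 + [0] * 96)

# Baseline that never predicts a return.
y_baseline = np.zeros(100, dtype=int)

# Hypothetical model: catches 3 of the 4 returns and raises 6 false alarms.
y_model = np.array([1, 1, 1, 0] + [1] * 6 + [0] * 90)

baseline_acc = accuracy_score(y_true, y_baseline)
baseline_recall = recall_score(y_true, y_baseline)

model_acc = accuracy_score(y_true, y_model)
tn, fp, fn, tp = confusion_matrix(y_true, y_model).ravel()
precision = precision_score(y_true, y_model)  # tp / (tp + fp)
recall = recall_score(y_true, y_model)        # tp / (tp + fn)

print(f"baseline: accuracy={baseline_acc:.2f}, recall={baseline_recall:.2f}")
print(f"model:    accuracy={model_acc:.2f}, precision={precision:.2f}, recall={recall:.2f}")
print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
```

In this toy case the baseline has the higher accuracy yet finds no returns at all. Here precision is the share of flagged orders that really were returned, and recall is the share of returned orders that were flagged.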
- Is there anything wrong with the new feature we generated? How should we fix it?
- Hint: For the real test set, we do not know whether an order will be returned or not.
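One possible direction, offered only as a sketch, is to compute the per-product return counts from training rows alone and then look them up for test rows, so that no test label is ever used. The column name `Returned.Times` is hypothetical. Subtracting each training row's own label is an extra assumption, made here so that a row does not see its own outcome.

```python
import pandas as pd

# Toy training rows (labels known) and test rows (labels unknown).
train = pd.DataFrame({
    "Product.ID": ["p1", "p1", "p2", "p3"],
    "Returned": [1, 0, 1, 0],
})
test = pd.DataFrame({"Product.ID": ["p1", "p2", "p4"]})

# Return counts per product, learned from the training rows only.
train_counts = train.groupby("Product.ID")["Returned"].sum()

# Test rows: look up the training counts; unseen products get 0.
test["Returned.Times"] = test["Product.ID"].map(train_counts).fillna(0).astype(int)

# Training rows: remove the row's own label from its product's count.
train["Returned.Times"] = train["Product.ID"].map(train_counts) - train["Returned"]

print(train)
print(test)
```

Another option would be to count only returns that happened before each order's date, which needs the order dates in addition to the columns shown here.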