Machine Learning Lab

Part I: Preprocessing and EDA

The data come from a global e-retailer and cover orders placed from 2012 to 2015. Import the Orders dataset and perform some basic EDA.
Problems 1 to 3 focus on data cleaning and data visualization. You may use any packages you are familiar with to produce plots, and you should provide brief interpretations of your findings.

Problem 1: Dataset Import & Cleaning

Check the "Profit" and "Sales" columns in the dataset and convert both to a numeric type.
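A minimal sketch of the conversion in pandas. It assumes the two columns were read in as strings containing a currency symbol and thousands separators (for example "$1,200.50"); that format is an assumption, so inspect the raw values first. The helper name `to_numeric_currency` and the toy rows are made up for illustration.

```python
import pandas as pd

def to_numeric_currency(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' characters, then convert the values to floats."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns anything still unparseable into NaN for later inspection
    return pd.to_numeric(cleaned, errors="coerce")

# Toy rows standing in for the Orders dataset (assumed string format)
orders = pd.DataFrame({
    "Sales": ["$1,200.50", "$35.00"],
    "Profit": ["$-12.30", "$4.10"],
})

for col in ["Sales", "Profit"]:
    orders[col] = to_numeric_currency(orders[col])

print(orders.dtypes)
```

After the conversion, counting the NaN values with `orders[["Sales", "Profit"]].isna().sum()` is a quick way to see whether any rows failed to parse.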

Problem 2: Inventory Management

Retailers that depend on seasonal shoppers have a particularly challenging job when it comes to inventory management. Your manager is making plans for next year's inventory.
He wants you to answer the following questions:
1. Is there a seasonal trend in the company's inventory?
2. Is the seasonal trend the same for different categories?
Hint: Each order has an attribute called Quantity that indicates the number of units of a product in the order. If an order contains more than one product, the same order appears in multiple observations.
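One possible way to approach both questions is to sum Quantity per calendar month, overall and per category. The sketch below uses toy rows; the column names `Order.Date` and `Category` are assumed to match the Orders dataset, and `Month` is a helper column introduced here.

```python
import pandas as pd

# Toy rows standing in for the Orders dataset (column names are assumptions)
orders = pd.DataFrame({
    "Order.Date": ["2012-01-15", "2012-01-20", "2012-02-03", "2012-02-10"],
    "Category": ["Furniture", "Technology", "Furniture", "Furniture"],
    "Quantity": [2, 1, 3, 4],
})
orders["Order.Date"] = pd.to_datetime(orders["Order.Date"])
orders["Month"] = orders["Order.Date"].dt.month

# Question 1: total units ordered per calendar month
monthly_total = orders.groupby("Month")["Quantity"].sum()

# Question 2: the same totals, split by category (one column per category)
monthly_by_category = orders.pivot_table(
    index="Month", columns="Category", values="Quantity",
    aggfunc="sum", fill_value=0,
)

print(monthly_total)
print(monthly_by_category)
```

Calling `.plot()` on either result (with matplotlib installed) gives a line chart in which seasonal peaks, if any, should be visible.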

Problem 3: Why did customers make returns?

Your manager has asked you for a brief report (plots + interpretations) on returned orders.
1. How much profit did we lose due to returns each year?
2. How many customers returned more than once? More than 5 times?
3. Which regions are more likely to return orders?
4. Which categories (sub-categories) of products are more likely to be returned?
Hint: Merge the Returns dataframe with the Orders dataframe using Order.ID.
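The merge in the hint, and the first two questions, could be sketched as follows. The toy data and the `Customer.ID` and `Returned` column names are assumptions; check the actual column names in both files, since the key may be spelled differently in the Returns file.

```python
import pandas as pd

# Toy stand-ins for the Orders and Returns dataframes (column names assumed)
orders = pd.DataFrame({
    "Order.ID": ["A1", "A1", "B2", "C3", "D4"],
    "Order.Date": pd.to_datetime(
        ["2012-03-01", "2012-03-01", "2013-06-10", "2013-07-04", "2013-08-09"]
    ),
    "Customer.ID": ["c1", "c1", "c1", "c2", "c3"],
    "Profit": [10.0, 5.0, 20.0, 7.0, 3.0],
})
returns = pd.DataFrame({
    "Order.ID": ["A1", "B2", "C3"],
    "Returned": ["Yes", "Yes", "Yes"],
})

# Inner merge keeps only order lines whose Order.ID appears in Returns;
# dropping duplicate IDs first avoids multiplying rows in the merge
returned = orders.merge(
    returns.drop_duplicates("Order.ID"), on="Order.ID", how="inner"
)

# Question 1: profit on returned orders, summed per order year
profit_lost = returned.groupby(returned["Order.Date"].dt.year)["Profit"].sum()

# Question 2: distinct returned orders per customer
returns_per_customer = returned.groupby("Customer.ID")["Order.ID"].nunique()
n_more_than_once = int((returns_per_customer > 1).sum())

print(profit_lost)
print(n_more_than_once)
```

The same `returned` frame can be grouped by region or by category to approach questions 3 and 4. Dividing by the corresponding counts in `orders` turns the raw counts into return rates.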

Part II: Machine Learning and Business Use Case

Now your manager has a basic understanding of why customers returned orders. Next, he wants you to use machine learning to predict which orders are most likely to be returned. In this part, you will generate several features based on our previous findings and your manager's requirements.

Problem 4: Feature Engineering

Step 1: Create the dependent variable

First of all, we need to generate a categorical variable that indicates whether an order has been returned.
Hint: the IDs of the returned orders are contained in the “returns” dataset.
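A minimal sketch of the label, assuming both dataframes share an `Order.ID` column. The column name `Returned` is chosen here for illustration.

```python
import pandas as pd

# Toy stand-ins for the two datasets
orders = pd.DataFrame({"Order.ID": ["A1", "A1", "B2", "C3"]})
returns = pd.DataFrame({"Order.ID": ["A1", "C3"]})

# 1 if the order's ID appears in the returns dataset, otherwise 0
orders["Returned"] = orders["Order.ID"].isin(returns["Order.ID"]).astype(int)

print(orders)
```

Every line of a returned order receives the label 1, because the returns data identifies orders rather than individual products.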

Step 2:

Your manager believes that how long it took the order to ship would affect whether the customer would return it or not.
He wants you to generate a feature which can measure how long it takes the company to process each order.
Hint: Process.Time = Ship.Date - Order.Date
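The hint translates almost directly into pandas once both columns are parsed as dates. The sketch below uses toy rows and assumes the dates parse with `pd.to_datetime`; the real file may need an explicit `format` argument.

```python
import pandas as pd

# Toy rows standing in for the Orders dataset
orders = pd.DataFrame({
    "Order.Date": ["2014-11-11", "2014-02-05"],
    "Ship.Date": ["2014-11-13", "2014-02-10"],
})
for col in ["Order.Date", "Ship.Date"]:
    orders[col] = pd.to_datetime(orders[col])

# Difference of two datetime columns is a timedelta; keep whole days
orders["Process.Time"] = (orders["Ship.Date"] - orders["Order.Date"]).dt.days

print(orders)
```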

Step 3:

If a product has been returned before, it may be returned again.
Let us generate a feature that indicates how many times the product has been returned before.
If it has never been returned, impute 0.
Hint: Group by different Product.ID
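One way to follow the hint is to count returned rows per Product.ID and join the counts back onto the orders. This sketch assumes a 0/1 `Returned` column like the one from Step 1; `Return.Count` is a name made up for the new feature.

```python
import pandas as pd

# Toy rows; Returned is the 0/1 label from Step 1
orders = pd.DataFrame({
    "Product.ID": ["P1", "P1", "P2", "P3", "P1"],
    "Returned": [1, 1, 0, 0, 0],
})

# Count returned rows per product
counts = (
    orders[orders["Returned"] == 1]
    .groupby("Product.ID")
    .size()
    .rename("Return.Count")
    .reset_index()
)

# Left join keeps every order line; products never returned get NaN, imputed as 0
orders = orders.merge(counts, on="Product.ID", how="left")
orders["Return.Count"] = orders["Return.Count"].fillna(0).astype(int)

print(orders)
```

Note that this counts returns over the whole dataset rather than strictly earlier ones, which is relevant to Problem 7.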

Problem 5: Fitting Models

You can use any binary classification method you have learned so far.
Use 80/20 training and test splits to build your model.
Double check the column types before you fit the model.
Only include useful features, i.e. all the IDs should be excluded from your training set.
Note that fewer than 5% of the orders have been returned, so you should consider stratified sampling: the createDataPartition function from the caret package for the split, or StratifiedKFold from sklearn when running cross-validation.
Do not forget to call set.seed() before the split to make your results reproducible.
Note: We are not looking for the best tuned model in the lab so don't spend too much time on grid search. Focus on model evaluation and the business use case of each model.
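For the sklearn route, a stratified 80/20 split followed by stratified cross-validation might look like the sketch below. The data is synthetic (random features, 5% positives), and the logistic regression with `class_weight="balanced"` is only one illustrative choice of classifier. `random_state` plays the role of `set.seed()` here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in: 1000 orders, 3 numeric features, 5% returned
rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 3))
y = np.zeros(n, dtype=int)
y[:50] = 1

# stratify=y keeps the share of returned orders equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Each fold keeps roughly the same class ratio as the training set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(class_weight="balanced", max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="recall")

print(y_train.mean(), y_test.mean())
print(scores)
```

Because the features here are random noise, the scores carry no meaning; the point is only the mechanics of the split.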

Problem 6: Evaluating Models

What is the best metric to evaluate your model? Is accuracy a good choice in this case?
Now you have multiple models, which one would you pick?
Can you get any clues from the confusion matrix? What do precision and recall mean in this case? Which one do you care about most? How will your model help the manager make decisions?
Note: The last question is open-ended. Your answer could be completely different depending on your understanding of this business problem.
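A small worked example may help with the accuracy question. The labels below are invented: 100 orders of which 5 were returned, a baseline that predicts "not returned" for everything, and a hypothetical model that flags 8 orders and catches 3 of the 5 returns.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score
)

# 95 kept orders followed by 5 returned orders (invented data)
y_true = [0] * 95 + [1] * 5

# Baseline: never predict a return
y_baseline = [0] * 100

# Hypothetical model: 5 false alarms, 3 returns caught, 2 returns missed
y_model = [0] * 90 + [1] * 5 + [1] * 3 + [0] * 2

baseline_accuracy = accuracy_score(y_true, y_baseline)
baseline_recall = recall_score(y_true, y_baseline)

model_accuracy = accuracy_score(y_true, y_model)
model_precision = precision_score(y_true, y_model)  # flagged orders that were returned
model_recall = recall_score(y_true, y_model)        # returned orders that were flagged
tn, fp, fn, tp = confusion_matrix(y_true, y_model).ravel()

print(baseline_accuracy, baseline_recall)
print(model_accuracy, model_precision, model_recall)
print(tn, fp, fn, tp)
```

In this example the baseline has the higher accuracy but finds no returns at all, while the model trades some accuracy for a recall of 0.6.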

Problem 7: Feature Engineering Revisit

Is there anything wrong with the new feature we generated? How should we fix it?
Hint: For the real test set, we do not know whether an order will be returned.
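One possible fix, sketched under assumptions, is to compute the return counts from the training rows only and then look them up for the test rows, so that no test label is ever used. The toy frames and the `Return.Count` name are hypothetical.

```python
import pandas as pd

# Toy training rows (labels known) and test rows (labels unknown)
train = pd.DataFrame({
    "Product.ID": ["P1", "P1", "P2"],
    "Returned": [1, 1, 0],
})
test = pd.DataFrame({"Product.ID": ["P1", "P3"]})

# Return counts come from the training set alone
train_counts = train.groupby("Product.ID")["Returned"].sum()

# Products unseen in training (or never returned) get 0
test["Return.Count"] = (
    test["Product.ID"].map(train_counts).fillna(0).astype(int)
)

print(test)
```

For the training rows themselves, a stricter variant would count only returns from earlier orders, so that a row's own outcome does not feed into its feature.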
