This project helps classify retail products into categories. Although the categories in this example are structured in a hierarchy, to keep things simple I treated all subcategories as top-level. The main packages used in this project are sklearn, nltk and dataset.
You can read the post explaining this project here.
You will need Python 3 or later to use this project.
Now, you need the text-classification-python project files in your workspace:
$ git clone https://github.com/joaorafaelm/text-classification-python
$ cd text-classification-python
You should already be familiar with virtualenv at this stage, so simply create one for the project:
$ virtualenv venv
$ source venv/bin/activate
The dependencies are listed in requirements.txt. To install them, simply type:
$ pip install -r requirements.txt
To run the scraper you will need a CSV file of ASINs (Amazon's product identifiers). Search the web for one, and then run:
$ python amazon_scrape.py
All data will be saved into a SQLite database (file database.db), in the table products.
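To check what the scraper stored, the table can be inspected with Python's built-in sqlite3 module. The following is a minimal sketch: it relies only on the file database.db and the table products named above. The helper name summarize_table is hypothetical, and the columns it reports depend on what the scraper actually writes.

```python
import os
import sqlite3


def summarize_table(db_path, table="products"):
    """Return (row_count, column_names) for a table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        row_count = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        # PRAGMA table_info yields one row per column; index 1 is the name.
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info({quoted})")]
        return row_count, columns
    finally:
        conn.close()


if __name__ == "__main__":
    # Only inspect the file if the scraper has already created it.
    if os.path.exists("database.db"):
        count, columns = summarize_table("database.db")
        print(f"{count} rows, columns: {columns}")
```

An empty or very small row count usually means the scraper did not get through the ASIN list.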
To export the table with datafreeze, run:
$ datafreeze .datafreeze.yaml
This will create a JSON file under the directory dumps/.
Next, prepare the data:
$ python data_prep.py
The script will create a new file called products.json at the root of the project and print out the category tree structure. Change the values of the variables default_depth, min_samples and domain if you need more data.
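To illustrate why these variables change how much data you end up with, here is a sketch of one plausible reading of default_depth and min_samples: category paths are cut to a fixed depth, and categories with too few products are dropped. This is an assumption for illustration, not the code in data_prep.py. The function select_categories and the (title, category_path) layout are hypothetical, and domain is not modelled here.

```python
from collections import Counter


def select_categories(products, depth=2, min_samples=2):
    """Truncate category paths to `depth` levels and keep well-populated ones.

    `products` is a list of (title, category_path) pairs, where
    category_path is a list such as ["Electronics", "Audio", "Headphones"].
    This layout is assumed for illustration only.
    """
    truncated = [(title, " > ".join(path[:depth])) for title, path in products]
    counts = Counter(label for _, label in truncated)
    # Drop categories that have fewer than `min_samples` products.
    return [(t, label) for t, label in truncated if counts[label] >= min_samples]


sample = [
    ("Wireless earbuds", ["Electronics", "Audio", "Headphones"]),
    ("Studio monitor", ["Electronics", "Audio", "Speakers"]),
    ("Paperback novel", ["Books", "Fiction"]),
]
print(select_categories(sample, depth=2, min_samples=2))
```

Under this reading, a shallower depth merges sibling subcategories into one label, and a lower min_samples keeps sparse categories that would otherwise be discarded.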
Finally, run the classifier:
$ python classify.py
It will print out the accuracy for each category, along with the confusion matrix.
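For readers unfamiliar with this output, the sketch below shows how a confusion matrix and a per-category accuracy can be derived from true and predicted labels. It is written in plain Python and is not the code in classify.py (sklearn provides sklearn.metrics.confusion_matrix for the same purpose). The function names and the toy labels are hypothetical, and "accuracy of a category" is assumed to mean the fraction of that category's products that were predicted correctly.

```python
def build_confusion_matrix(y_true, y_pred, labels):
    """Rows are true categories, columns are predicted categories."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def per_category_accuracy(matrix, labels):
    """Fraction of each true category's samples that were predicted correctly."""
    result = {}
    for i, label in enumerate(labels):
        total = sum(matrix[i])
        result[label] = matrix[i][i] / total if total else 0.0
    return result


labels = ["Audio", "Books"]
y_true = ["Audio", "Audio", "Books", "Books"]
y_pred = ["Audio", "Books", "Books", "Books"]

matrix = build_confusion_matrix(y_true, y_pred, labels)
print(matrix)
print(per_category_accuracy(matrix, labels))
```

Off-diagonal cells show which categories get mistaken for each other, which is often more useful than a single accuracy figure when deciding where more data is needed.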