# Chapter 4: Building a "fake news" classifier

## CountVectorizer for text classification

100xp

It's time to begin building your text classifier! The data has been loaded into a DataFrame called df. Explore it in the IPython Shell to investigate which columns you can use; the .head() method is particularly informative.

In this exercise, you'll use pandas alongside scikit-learn to create a sparse text vectorizer you can use to train and test a simple supervised model. To begin, you'll set up a CountVectorizer and investigate some of its features.

### Instructions

- Import CountVectorizer from sklearn.feature_extraction.text and train_test_split from sklearn.model_selection.
- Create a Series y to use for the labels by assigning the .label attribute of df to y.
- Using df["text"] (features) and y (labels), create training and test sets with train_test_split(). Use a test_size of 0.33 and a random_state of 53.
- Create a CountVectorizer object called count_vectorizer, specifying the keyword argument stop_words="english" so that stop words are removed.
- Fit and transform the training data X_train using the .fit_transform() method. Do the same with the test data X_test, except using the .transform() method.
- Print the first 10 features of count_vectorizer using its .get_feature_names() method.
## TfidfVectorizer for text classification

100xp

Similar to the sparse CountVectorizer created in the previous exercise, you'll work on creating tf-idf vectors for your documents. You'll set up a TfidfVectorizer and investigate some of its features.

In this exercise, you'll use pandas and sklearn along with the same X_train, y_train, X_test, and y_test DataFrames and Series you created in the last exercise.

### Instructions

- Import TfidfVectorizer from sklearn.feature_extraction.text.
- Create a TfidfVectorizer object called tfidf_vectorizer, specifying the keyword arguments stop_words="english" and max_df=0.7.
- Fit and transform the training data.
- Transform the test data.
- Print the first 10 features of tfidf_vectorizer.
- Print the first 5 vectors of the tf-idf training data by slicing the .A (dense array) attribute of tfidf_train.
## Inspecting Vectors

100xp

To get a better idea of how the vectors work, you'll investigate them by converting them into pandas DataFrames.

Here, you'll use the same data structures you created in the previous two exercises (count_train, count_vectorizer, tfidf_train, tfidf_vectorizer) as well as pandas, which is imported as pd.

### Instructions

- Create the DataFrames count_df and tfidf_df with pd.DataFrame(), specifying the values as the first argument and the columns (or features) as the second argument.
  - The values can be accessed using the .A attribute of, respectively, count_train and tfidf_train.
  - The columns can be accessed using the .get_feature_names() methods of count_vectorizer and tfidf_vectorizer.
- Print the head of each DataFrame to investigate their structure.
- Test whether the column names are the same for each DataFrame by creating a new object called difference holding the columns that count_df has but tfidf_df does not. Columns can be accessed using the .columns attribute of a DataFrame; subtract the set of tfidf_df.columns from the set of count_df.columns.
- Test whether the two DataFrames are equivalent by calling the .equals() method on count_df with tfidf_df as the argument.
38+
39+
Text classification models
40+
50xp
41+
Which of the below is the most reasonable model to use when training a new supervised model using text vector data?
42+
Possible Answers
43+
Random Forests
44+
45+
46+
Naive Bayes
47+
48+
49+
Linear Regression
50+
51+
52+
Deep Learning
## Training and testing the "fake news" model with CountVectorizer

100xp

Now it's your turn to train the "fake news" model using the features you identified and extracted. In this first exercise you'll train and test a Naive Bayes model using the CountVectorizer data.

The training and test sets have been created, and count_vectorizer, count_train, and count_test have been computed.

### Instructions

- Import the metrics module from sklearn and MultinomialNB from sklearn.naive_bayes.
- Instantiate a MultinomialNB classifier called nb_classifier.
- Fit the classifier to the training data.
- Compute the predicted tags for the test data.
- Calculate and print the accuracy score of the classifier.
- Compute the confusion matrix. To make it easier to read, specify the keyword argument labels=['FAKE', 'REAL'].
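End to end, the Naive Bayes run might look like this sketch, with hypothetical toy data standing in for the course's articles and the names following the exercise:

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# hypothetical stand-in for the course's df of labelled articles
df = pd.DataFrame({
    "text": ["the president spoke today", "aliens built the pyramids",
             "markets rose on earnings", "miracle cure doctors hate",
             "senate passed the budget", "secret moon base revealed"],
    "label": ["REAL", "FAKE", "REAL", "FAKE", "REAL", "FAKE"],
})
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df.label, test_size=0.33, random_state=53)

count_vectorizer = CountVectorizer(stop_words="english")
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

# fit on the training vectors, predict tags for the test vectors
nb_classifier = MultinomialNB()
nb_classifier.fit(count_train, y_train)
pred = nb_classifier.predict(count_test)

score = metrics.accuracy_score(y_test, pred)
cm = metrics.confusion_matrix(y_test, pred, labels=["FAKE", "REAL"])
print(score)
print(cm)
```

With labels=["FAKE", "REAL"], the first row of the matrix counts true FAKE articles and the second true REAL articles, which makes the printout easier to read.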
## Training and testing the "fake news" model with TfidfVectorizer

100xp

Now that you have evaluated the model using the CountVectorizer, you'll do the same using the TfidfVectorizer with a Naive Bayes model.

The training and test sets have been created, and tfidf_vectorizer, tfidf_train, and tfidf_test have been computed. Additionally, MultinomialNB and metrics have been imported from, respectively, sklearn.naive_bayes and sklearn.

### Instructions

- Instantiate a MultinomialNB classifier called nb_classifier.
- Fit the classifier to the training data.
- Compute the predicted tags for the test data.
- Calculate and print the accuracy score of the classifier.
- Compute the confusion matrix. As in the previous exercise, specify the keyword argument labels=['FAKE', 'REAL'] so that the resulting confusion matrix is easier to read.
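The tf-idf run differs only in which matrices are fed to the classifier. A compact sketch under the same toy-data assumption (the course's precomputed tfidf_train/tfidf_test are not available here):

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical stand-ins for the precomputed course data
train_docs = ["the president spoke today", "aliens built the pyramids",
              "markets rose on earnings", "miracle cure doctors hate"]
y_train = ["REAL", "FAKE", "REAL", "FAKE"]
test_docs = ["president markets today", "aliens miracle pyramids"]
y_test = ["REAL", "FAKE"]

tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(train_docs)
tfidf_test = tfidf_vectorizer.transform(test_docs)

# same classifier as before, now fed tf-idf weights instead of counts
nb_classifier = MultinomialNB()
nb_classifier.fit(tfidf_train, y_train)
pred = nb_classifier.predict(tfidf_test)

print(metrics.accuracy_score(y_test, pred))
print(metrics.confusion_matrix(y_test, pred, labels=["FAKE", "REAL"]))
```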
## Improving your model

100xp

Your job in this exercise is to test a few different alpha levels with the tf-idf vectors to determine whether there is a better-performing combination.

The training and test sets have been created, and tfidf_vectorizer, tfidf_train, and tfidf_test have been computed.

### Instructions

- Create a list of alphas to try using np.arange(). Values should range from 0 to 1 with steps of 0.1.
- Create a function train_and_predict() that takes one argument, alpha, and does the following:
  - Instantiate a MultinomialNB classifier with alpha=alpha.
  - Fit it to the training data.
  - Compute predictions on the test data.
  - Compute and return the accuracy score.
- Using a for loop, print each alpha, its score, and a newline in between, using your train_and_predict() function to compute the score. Does the score change along with the alpha? What is the best alpha?
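A sketch of the alpha sweep, with toy corpora standing in for the course's tfidf_train/tfidf_test. Note that alpha=0 disables Laplace smoothing entirely; scikit-learn may warn about (or, in older versions, clip) such small values.

```python
import numpy as np
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical stand-ins for the precomputed course data
train_docs = ["the president spoke today", "aliens built the pyramids",
              "markets rose on earnings", "miracle cure doctors hate"]
train_labels = ["REAL", "FAKE", "REAL", "FAKE"]
test_docs = ["president markets today", "aliens miracle pyramids"]
test_labels = ["REAL", "FAKE"]

tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(train_docs)
tfidf_test = tfidf_vectorizer.transform(test_docs)

def train_and_predict(alpha):
    """Fit a Naive Bayes model with the given smoothing and score it."""
    nb_classifier = MultinomialNB(alpha=alpha)
    nb_classifier.fit(tfidf_train, train_labels)
    pred = nb_classifier.predict(tfidf_test)
    return metrics.accuracy_score(test_labels, pred)

alphas = np.arange(0, 1, 0.1)  # 0.0, 0.1, ..., 0.9
for alpha in alphas:
    print("Alpha:", alpha)
    print("Score:", train_and_predict(alpha))
    print()
```

On this toy corpus the scores won't be meaningful, but on the real data the sweep reveals how sensitive the model is to the smoothing parameter.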
## Inspecting your model

100xp

Now that you have built a "fake news" classifier, you'll investigate what it has learned. You can map the important vector weights back to actual words using some simple inspection techniques.

Your well-performing tf-idf Naive Bayes classifier is available as nb_classifier, and the vectorizer as tfidf_vectorizer.

### Instructions

- Save the class labels as class_labels by accessing the .classes_ attribute of nb_classifier.
- Extract the features using the .get_feature_names() method of tfidf_vectorizer.
- Create a zipped array of the classifier coefficients and the feature names, sorted by the coefficients. To do this, first use zip() with the arguments nb_classifier.coef_[0] and feature_names, then apply sorted() to the result.
- Print the top 20 weighted features for the first label of class_labels.
- Print the bottom 20 weighted features for the second label of class_labels.