# Chapter 4: Building a "fake news" classifier

## CountVectorizer for text classification
100xp
It's time to begin building your text classifier! The data has been loaded into a DataFrame called df. Explore it in the IPython Shell to investigate what columns you can use. The .head() method is particularly informative.
In this exercise, you'll use pandas alongside scikit-learn to create a sparse text vectorizer you can use to train and test a simple supervised model. To begin, you'll set up a CountVectorizer and investigate some of its features.
### Instructions
Import CountVectorizer from sklearn.feature_extraction.text and train_test_split from sklearn.model_selection.
Create a Series y to use for the labels by assigning the .label attribute of df to y.
Using df["text"] (features) and y (labels), create training and test sets using train_test_split(). Use a test_size of 0.33 and a random_state of 53.
Create a CountVectorizer object called count_vectorizer. Ensure you specify the keyword argument stop_words="english" so that stop words are removed.
Fit and transform the training data X_train using the .fit_transform() method. Do the same with the test data X_test, except using the .transform() method.
Print the first 10 features of the count_vectorizer using its .get_feature_names() method.

## TfidfVectorizer for text classification
100xp
Similar to the sparse CountVectorizer created in the previous exercise, you'll work on creating tf-idf vectors for your documents. You'll set up a TfidfVectorizer and investigate some of its features.
In this exercise, you'll use pandas and sklearn along with the same X_train, y_train and X_test, y_test DataFrames and Series you created in the last exercise.
### Instructions
Import TfidfVectorizer from sklearn.feature_extraction.text.
Create a TfidfVectorizer object called tfidf_vectorizer. When doing so, specify the keyword arguments stop_words="english" and max_df=0.7.
Fit and transform the training data.
Transform the test data.
Print the first 10 features of tfidf_vectorizer.
Print the first 5 vectors of the tfidf training data using slicing on the .A (or array) attribute of tfidf_train.

## Inspecting Vectors
100xp
To get a better idea of how the vectors work, you'll investigate them by converting them into pandas DataFrames.
Here, you'll use the same data structures you created in the previous two exercises (count_train, count_vectorizer, tfidf_train, tfidf_vectorizer) as well as pandas, which is imported as pd.
### Instructions
Create the DataFrames count_df and tfidf_df by using pd.DataFrame() and specifying the values as the first argument and the columns (or features) as the second argument.
The values can be accessed by using the .A attribute of, respectively, count_train and tfidf_train.
The columns can be accessed using the .get_feature_names() methods of count_vectorizer and tfidf_vectorizer.
Print the head of each DataFrame to investigate their structure.
Test whether the column names are the same for each DataFrame: create a new object called difference containing the columns that count_df has but tfidf_df does not. Columns can be accessed using the .columns attribute of a DataFrame; subtract the set of tfidf_df.columns from the set of count_df.columns.
Test if the two DataFrames are equivalent by using the .equals() method on count_df with tfidf_df as the argument.

## Text classification models
50xp
Which of the following is the most reasonable model to use when training a new supervised model on text vector data?
### Possible Answers
Random Forests
Naive Bayes
Linear Regression
Deep Learning

## Training and testing the "fake news" model with CountVectorizer
100xp
Now it's your turn to train the "fake news" model using the features you identified and extracted. In this first exercise you'll train and test a Naive Bayes model using the CountVectorizer data.
The training and test sets have been created, and count_vectorizer, count_train, and count_test have been computed.
### Instructions
Import the metrics module from sklearn and MultinomialNB from sklearn.naive_bayes.
Instantiate a MultinomialNB classifier called nb_classifier.
Fit the classifier to the training data.
Compute the predicted tags for the test data.
Calculate and print the accuracy score of the classifier.
Compute the confusion matrix. To make it easier to read, specify the keyword argument labels=['FAKE', 'REAL'].
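A runnable sketch of these steps, using the same hypothetical toy df as a stand-in for the course data (real scores will of course differ on a six-row sample):

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical six-row stand-in for the course's df
df = pd.DataFrame({
    "text": [
        "The senate passed the budget bill on Tuesday.",
        "Aliens secretly control the world banking system.",
        "Scientists published a peer reviewed climate study.",
        "Miracle pill cures disease overnight, doctors stunned.",
        "The election results were certified by officials.",
        "Celebrity clone spotted at secret government base.",
    ],
    "label": ["REAL", "FAKE", "REAL", "FAKE", "REAL", "FAKE"],
})
y = df.label
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], y, test_size=0.33, random_state=53)
count_vectorizer = CountVectorizer(stop_words="english")
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

# Multinomial Naive Bayes suits discrete word-count features
nb_classifier = MultinomialNB()
nb_classifier.fit(count_train, y_train)

# Predicted labels for the held-out articles
pred = nb_classifier.predict(count_test)

score = metrics.accuracy_score(y_test, pred)
print(score)

# Rows/columns ordered FAKE then REAL for readability
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)
```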

## Training and testing the "fake news" model with TfidfVectorizer
100xp
Now that you have evaluated the model using the CountVectorizer, you'll do the same using the TfidfVectorizer with a Naive Bayes model.
The training and test sets have been created, and tfidf_vectorizer, tfidf_train, and tfidf_test have been computed. Additionally, MultinomialNB and metrics have been imported from, respectively, sklearn.naive_bayes and sklearn.
### Instructions
Instantiate a MultinomialNB classifier called nb_classifier.
Fit the classifier to the training data.
Compute the predicted tags for the test data.
Calculate and print the accuracy score of the classifier.
Compute the confusion matrix. As in the previous exercise, specify the keyword argument labels=['FAKE', 'REAL'] so that the resulting confusion matrix is easier to read.
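The tf-idf variant follows the same pattern; only the vectorizer changes. Again the toy df is a hypothetical stand-in for the course data:

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical six-row stand-in for the course's df
df = pd.DataFrame({
    "text": [
        "The senate passed the budget bill on Tuesday.",
        "Aliens secretly control the world banking system.",
        "Scientists published a peer reviewed climate study.",
        "Miracle pill cures disease overnight, doctors stunned.",
        "The election results were certified by officials.",
        "Celebrity clone spotted at secret government base.",
    ],
    "label": ["REAL", "FAKE", "REAL", "FAKE", "REAL", "FAKE"],
})
y = df.label
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], y, test_size=0.33, random_state=53)
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

# Same Naive Bayes model, now fed tf-idf weights instead of raw counts
nb_classifier = MultinomialNB()
nb_classifier.fit(tfidf_train, y_train)
pred = nb_classifier.predict(tfidf_test)

score = metrics.accuracy_score(y_test, pred)
print(score)

cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)
```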

## Improving your model
100xp
Your job in this exercise is to test a few different alpha values using the tf-idf vectors to determine whether there is a better-performing combination.
The training and test sets have been created, and tfidf_vectorizer, tfidf_train, and tfidf_test have been computed.
### Instructions
Create a list of alphas to try using np.arange(). Values should range from 0 to 1 with steps of 0.1.
Create a function train_and_predict() that takes in one argument: alpha. The function should:
Instantiate a MultinomialNB classifier with alpha=alpha.
Fit it to the training data.
Compute predictions on the test data.
Compute and return the accuracy score.
Using a for loop, print each alpha and its score, followed by a blank line. Use your train_and_predict() function to compute the score. Does the score change along with the alpha? What is the best alpha?
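The sweep above can be sketched like this, with the toy df again standing in for the course data; note that alpha=0 disables smoothing entirely, and scikit-learn may emit a warning for it:

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical six-row stand-in for the course's df
df = pd.DataFrame({
    "text": [
        "The senate passed the budget bill on Tuesday.",
        "Aliens secretly control the world banking system.",
        "Scientists published a peer reviewed climate study.",
        "Miracle pill cures disease overnight, doctors stunned.",
        "The election results were certified by officials.",
        "Celebrity clone spotted at secret government base.",
    ],
    "label": ["REAL", "FAKE", "REAL", "FAKE", "REAL", "FAKE"],
})
y = df.label
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], y, test_size=0.33, random_state=53)
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

def train_and_predict(alpha):
    """Fit a MultinomialNB with the given smoothing, return test accuracy."""
    nb = MultinomialNB(alpha=alpha)
    nb.fit(tfidf_train, y_train)
    pred = nb.predict(tfidf_test)
    return metrics.accuracy_score(y_test, pred)

# Try smoothing values from 0 to 0.9 in steps of 0.1
for alpha in np.arange(0, 1, 0.1):
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()
```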

## Inspecting your model
100xp
Now that you have built a "fake news" classifier, you'll investigate what it has learned. You can map the important vector weights back to actual words using some simple inspection techniques.
You have your well-performing tf-idf Naive Bayes classifier available as nb_classifier, and the vectors as tfidf_vectorizer.
### Instructions
Save the class labels as class_labels by accessing the .classes_ attribute of nb_classifier.
Extract the features using the .get_feature_names() method of tfidf_vectorizer.
Create a zipped array of the classifier coefficients with the feature names and sort them by the coefficients. To do this, first use zip() with the arguments nb_classifier.coef_[0] and feature_names. Then, use sorted() on this.
Print the top 20 weighted features for the first label of class_labels.
Print the bottom 20 weighted features for the second label of class_labels.