# Assignment 2: Learning Machine Learning
In this assignment, we will explore the field of computational linguistics, also known as **Natural Language Processing**. The goal of this assignment is to have you become familiar with reading and writing files and with working with third-party packages. We'll explore these Python ideas through the lens of Data Science and Machine Learning.
## Part 0: Setup
Skim through the assignment and install the relevant packages through pip (e.g. [Sci-Kit Learn](https://github.com/scikit-learn/scikit-learn) and [NumPy](https://github.com/numpy/numpy)). Next, download the homework datasets [here](https://github.com/CIS192/homework/raw/master/assignment2/data.zip) (or from the GitHub repository). Finally, download the skeleton code, as well as the report template, from the [assignment's GitHub repository](https://github.com/CIS192/homework/tree/master/assignment2).
## Part 1: NLP Basics
For the first part of the homework, you will implement a few basic NLP tasks in `part1.py`, including raw text analysis with CSV, text tokenizing, and scoring word importance with TF-IDF. The data file `raven.txt` is located in the `data.zip` archive, so make sure to unzip it to `/data`! The remaining dataset files will be used for Part 2, so be sure to keep those handy.
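To make the TF-IDF idea concrete, here is a minimal sketch of one common variant (raw term frequency times log inverse document frequency). The function name and API are illustrative assumptions, not the exact stubs in `part1.py`:

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score each term of each document, where a document is a list of tokens.

    Sketch only: raw term frequency times log inverse document frequency.
    """
    n_docs = len(documents)
    # Document frequency: how many documents contain each term at least once?
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)  # raw term counts within this document
        scores.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                       for term, count in tf.items()})
    return scores

docs = [["the", "raven", "spoke"], ["the", "door", "creaked"]]
print(tf_idf(docs))  # "the" scores 0.0 because it appears in every document
```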
**TODO:** Implement the incomplete stubs in `part1.py`.
## Part 2: Classification with Sci-Kit Learn
> Adapted from CIS 530 - Computational Linguistics
### Preamble
The second part of the homework is a longer project: building a text classifier. Now that we have seen tokenizing, text cleaning, and word importance with TF-IDF, let's train a text classifier that can label a word as simple (e.g. *easy*, *act*, *blue*) or complex (e.g. *ostentatious*, *esoteric*, *aberration*). This is an important step in the larger NLP task of simplifying texts to make them more readable.
In the provided code template, which contains both helper functions and unimplemented stubs, you will need to:
0. Look at the dataset! Examining the information it conveys will help you better understand the task.
1. Implement the machine learning evaluation metric we discussed in class (accuracy).
2. Perform data pre-processing for our dataset. You will need to parse the provided pre-labeled data into training/test sets, and implement a simple baseline model.
3. Use the Sci-Kit Learn package to train machine learning models which classify words as simple or complex.
We have provided the dataset of labelled words, split between training/test sets, in `.txt` format. We have also provided frequencies of unigrams (a unigram is a contiguous sequence of 1 item from a given sample of text) taken from the Google N-Gram counts dataset.
Be sure to install `numpy` and `sklearn` before starting.
### Section 0: Data (0 points)
We have provided the function `load_file` that takes in the file name `data_file` of one of the datasets and returns the words and labels of that dataset. The second provided helper function `load_ngram_counts` loads Google N-Gram counts from our provided file `ngram_counts.txt` as a dictionary of word frequencies.
**TODO:** Inspect these functions, print out what they return, and make sure you understand what they're providing before moving on.
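For example, a quick inspection script might look like this (the training file name is a placeholder, and the exact return shapes are assumptions based on the descriptions above):

```python
# Hypothetical inspection script; swap in the real file names from /data.
words, labels = load_file("data/training_data.txt")
print(words[:5], labels[:5])  # a list of words and a parallel list of labels

counts = load_ngram_counts("data/ngram_counts.txt")
print(counts.get("the"))  # frequency of a very common word
```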
### Section 1: Evaluation (5 points)
We will be implementing **accuracy**, a standard evaluation metric that we discussed in class. We will use this function later in the assignment, so be sure it works before moving on.
**TODO:** Implement `get_accuracy`, which should return a value between 0 and 1 corresponding to the fraction of predictions that match the true labels.
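One possible sketch, assuming `get_accuracy` receives two parallel lists of labels (the exact signature in the skeleton may differ):

```python
def get_accuracy(y_pred, y_true):
    """Return the fraction of predictions that equal the true labels."""
    assert len(y_pred) == len(y_true), "prediction/label lists must be parallel"
    correct = sum(1 for pred, true in zip(y_pred, y_true) if pred == true)
    return correct / len(y_true)

print(get_accuracy([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5 (2 of 4 predictions match)
```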
### Section 2: Baseline Models (20 points)
In the following functions, you will implement three baseline models. Recall that baseline models serve as benchmarks against which to compare our own machine learning models.
1. The first baseline model `all_complex` classifies ALL words as complex (think back to the coin-flipping example from class).
2. The second baseline model `word_length_threshold` uses word length thresholding: if a word is longer than the given threshold, we consider it complex; otherwise, we consider it simple.
3. The third baseline model `word_frequency_threshold` is similar to the second, but we will use frequencies from the Google N-Gram counts dataset as the metric to threshold against.
**TODO:** Implement the three baseline models, and report their accuracies (using the function you implemented earlier).
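For illustration, the first two baselines could look roughly like this. The signatures, the 1 = complex / 0 = simple label convention, and the default threshold are all assumptions; follow whatever conventions the skeleton uses, and experiment to find a good threshold value:

```python
def all_complex(words):
    """Baseline 1: label every word as complex (1)."""
    return [1] * len(words)

def word_length_threshold(words, threshold=7):
    """Baseline 2: a word longer than `threshold` characters is complex (1)."""
    return [1 if len(word) > threshold else 0 for word in words]

print(word_length_threshold(["blue", "ostentatious"]))  # [0, 1]
```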
For our machine learning classifiers, we will use the built-in Naive Bayes and Logistic Regression models from Sci-Kit Learn.
For features, you'll want to use both word length and word frequency. However, feel free to use any other features that you want! Be sure to document any extra features in `REPORT.md`. Extra credit is available for the inclusion of any other interesting features!
You can import the relevant models from Sci-Kit Learn using:
```python
from sklearn.naive_bayes import GaussianNB
```
for Naive Bayes and
```python
from sklearn.linear_model import LogisticRegression
```

for Logistic Regression.
**TODO:** Implement `logistic_regression` and `naive_bayes`, where you will train the machine learning models and report their accuracies on the testing data.
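Putting the pieces together, training and evaluating one model might look roughly like this. The `featurize` helper and the placeholder data are illustrative assumptions; in your solution, build the features from the words and labels returned by `load_file` and the frequencies from `load_ngram_counts`, and use your `get_accuracy` from Section 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def featurize(words, counts):
    """One row per word; columns = [word length, N-Gram frequency]."""
    return np.array([[len(w), counts.get(w, 0)] for w in words], dtype=float)

# Placeholder data so the sketch runs; load the real datasets instead.
ngram_counts = {"blue": 9_000_000, "esoteric": 40_000}
train_words, train_labels = ["blue", "esoteric", "act", "aberration"], [0, 1, 0, 1]
test_words, test_labels = ["easy", "ostentatious"], [0, 1]

clf = GaussianNB()  # swap in LogisticRegression() to compare the two models
clf.fit(featurize(train_words, ngram_counts), train_labels)
preds = clf.predict(featurize(test_words, ngram_counts))
print(get_accuracy(list(preds), test_labels))  # your Section 1 function
```

Since raw N-Gram counts span many orders of magnitude, you may find that scaling or log-transforming the frequency column improves results.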
### Section 4: Report (5 points)
**TODO:** Complete `REPORT.md` with information about your implementations and your accuracies for each section. Write a few comments comparing the performance of your Naive Bayes classifier and your Logistic Regression classifier. Include any details about extra credit you've completed.
Be sure to complete the report in [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) format. Remember to indicate which member worked on which sections for full credit.
## Submission
Submit all your code, along with any extra data, to Gradescope. If you have a partner, YOU MUST MARK THEM AS A COLLABORATOR ON GRADESCOPE. If you fail to do this, you may get a 0 on this assignment.
## Attributions
This homework was adapted from [CIS 530: Computational Linguistics at the University of Pennsylvania](https://computational-linguistics-class.org/) by Arun Kirubarajan and Kevin Sun.