Commit c540bec

Authored by Arun Kirubarajan
Merge pull request #4 from CIS192/fix-links
fix links in readme
2 parents a9dfc14 + d8ec279 commit c540bec

2 files changed (+24, -12 lines)

assignment2/README.md

Lines changed: 24 additions & 12 deletions
# Assignment 2: Learning Machine Learning

In this assignment, we will be exploring the field of computational linguistics, also known as **Natural Language Processing**. The goal of this assignment is to have you become familiar with reading from and writing to files, and with working with third-party packages. We'll explore these Python ideas through the lens of Data Science and Machine Learning.

## Part 0: Setup

Skim through the assignment and install the relevant packages through pip (e.g. [Sci-Kit Learn](https://github.com/scikit-learn/scikit-learn) and [NumPy](https://github.com/numpy/numpy)). Next, download the homework datasets [here](https://github.com/CIS192/homework/raw/master/assignment2/data.zip) (or from the GitHub repository). Finally, download the skeleton code, as well as the report template, from the [assignment's GitHub repository](https://github.com/CIS192/homework/tree/master/assignment2).
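The install step is a one-liner (assuming a standard pip setup; note that Sci-Kit Learn's package name on PyPI is `scikit-learn`):

```shell
# Install the third-party packages used in this assignment.
pip install scikit-learn numpy
```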

## Part 1: NLP Basics

For the first part of the homework you will implement a couple of basic NLP tasks in `part1.py`, including raw text analysis with CSV, text tokenization, and word importance with a score called TF-IDF. The data file `raven.txt` is located in `data.zip`, so make sure to unzip it to `/data`! The remaining dataset files will be used for Part 2, so be sure to keep those handy.

**TODO:** Implement the incomplete stubs in `part1.py`.
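The `part1.py` stubs aren't reproduced here, but the two core ideas can be sketched from scratch. The function names and the tokenization rule below are illustrative assumptions, not the skeleton's actual API:

```python
import math
import re

def tokenize(text):
    # Lowercase and split on runs of non-alphabetic characters.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def tf_idf(docs):
    # docs: list of token lists; returns one {term: score} dict per document.
    n = len(docs)
    df = {}  # document frequency: how many documents contain each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        tf = {}  # raw term counts within this document
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        scores.append({term: (count / len(doc)) * math.log(n / df[term])
                       for term, count in tf.items()})
    return scores

docs = [tokenize("The raven flew."), tokenize("The cat sat.")]
scores = tf_idf(docs)
# "the" appears in every document, so its idf (and hence tf-idf) is 0.
```

A word that appears in every document scores 0, while a word unique to one document gets the largest boost, which is exactly the "word importance" intuition behind TF-IDF.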

## Part 2: Classification with Sci-Kit Learn

> Adapted from CIS 530 - Computational Linguistics

### Preamble

The second part of the homework is a longer project: building a text classifier. Now that we have seen tokenization, text cleaning, and word importance with TF-IDF, let's train a text classifier that can label a word as simple (e.g. *easy*, *act*, *blue*) or complex (e.g. *ostentatious*, *esoteric*, *aberration*). This is an important step in the larger NLP task of simplifying texts to make them more readable.

In the provided code template, with its provided helpers and unimplemented functions, you will need to:

0. Look at the dataset! Try to understand the information that is conveyed to better understand the task.
1. Implement the machine learning evaluation metric we discussed in class (accuracy).
2. Perform data pre-processing for our dataset. You will need to parse the provided pre-labeled data into training/test sets, and implement a simple baseline model.
3. Use the Sci-Kit Learn package to train machine learning models which classify words as simple or complex.

We have provided the dataset of labelled words, split between training/test sets, in `.txt` format. Some notes on the dataset:

We have also provided frequencies (a contiguous sequence of 1 item from a given […]

Be sure to install `numpy` and `sklearn` before starting.

### Section 0: Data (0 points)

We have provided the function `load_file`, which takes in the file name `data_file` of one of the datasets and returns the words and labels of that dataset. The second provided helper function, `load_ngram_counts`, loads Google N-Gram counts from our provided file `ngram_counts.txt` as a dictionary of word frequencies.

**TODO:** Inspect these functions, print out what they return, and make sure you understand what they're providing before moving on.
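For intuition, `load_file` plausibly looks something like the sketch below. The tab-separated "word, then 0/1 label" format is an assumption for illustration, so inspect the real files in `data.zip` before relying on it:

```python
def load_file(data_file):
    # Assumed format: one "word<TAB>label" pair per line,
    # where the label is 0 (simple) or 1 (complex).
    words, labels = [], []
    with open(data_file) as f:
        for line in f:
            word, label = line.strip().split("\t")
            words.append(word)
            labels.append(int(label))
    return words, labels

# Tiny demonstration file (the real datasets ship in data.zip).
with open("sample.txt", "w") as f:
    f.write("easy\t0\nesoteric\t1\n")

words, labels = load_file("sample.txt")
# words == ["easy", "esoteric"], labels == [0, 1]
```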

### Section 1: Evaluation (5 points)

We will be implementing **accuracy**, a standard evaluation metric that we discussed in class. We will use this function later in the assignment, so be sure that it works before moving on.

**TODO:** Implement `get_accuracy`, which should return a value between 0 and 1 corresponding to the fraction of predictions that match the true labels.
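A minimal sketch of the metric, assuming predictions and labels arrive as equal-length lists (the argument names are illustrative):

```python
def get_accuracy(y_pred, y_true):
    # Fraction of predictions matching the true labels; always in [0, 1].
    correct = sum(p == t for p, t in zip(y_pred, y_true))
    return correct / len(y_true)

get_accuracy([1, 1, 0, 1], [1, 0, 0, 1])  # → 0.75
```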

### Section 2: Baseline Models (20 points)

In the following functions, you will implement 3 baseline models. Recall that baseline models are used to benchmark our own machine learning models against.

1. The first baseline model, `all_complex`, classifies ALL words as complex (think back to the coin-flipping example from class).
2. The second baseline model, `word_length_threshold`, uses word length thresholding: if a word is longer than the given threshold, we consider it to be complex, and otherwise simple.
3. The third baseline model, `word_frequency_threshold`, is similar to the second, but uses frequencies from the Google N-Gram counts dataset as the metric to threshold against.

**TODO:** Implement the three baseline models, and report their accuracies (using the function you implemented earlier).
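The three baselines above can be sketched as follows. The default thresholds (7 characters, 50 million occurrences) are illustrative assumptions; part of the assignment is tuning them against your accuracy function:

```python
def all_complex(words):
    # Baseline 1: predict complex (1) for every word.
    return [1] * len(words)

def word_length_threshold(words, threshold=7):
    # Baseline 2: words longer than the threshold count as complex.
    return [1 if len(w) > threshold else 0 for w in words]

def word_frequency_threshold(words, counts, threshold=50_000_000):
    # Baseline 3: words below the N-Gram frequency threshold are rare,
    # and rare words are treated as complex.
    return [1 if counts.get(w, 0) < threshold else 0 for w in words]

word_length_threshold(["easy", "ostentatious"])  # → [0, 1]
```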

For our machine learning classifiers, we will use the built-in Naive Bayes and Logistic Regression models from Sci-Kit Learn.

For features, you'll want to use both word length and word frequency. However, feel free to use any other features that you want! Be sure to document any extra features in `REPORT.md`. Extra credit is available for the inclusion of any other interesting features!

You can import the relevant models from Sci-Kit Learn using:

```python
from sklearn.naive_bayes import GaussianNB
```

for Naive Bayes and

```python
from sklearn.linear_model import LogisticRegression
```

for Logistic Regression. You will also need:

```python
import numpy as np
```

**TODO:** Implement `logistic_regression` and `naive_bayes`, where you will train the machine learning models and report their accuracies on the testing data.
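Putting the pieces together, a training sketch might look like the following. The `featurize` helper, toy frequency counts, and word lists are assumptions for illustration; the `fit`/`predict` calls are Sci-Kit Learn's real estimator API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(words, counts):
    # Two features per word: character length and N-Gram frequency.
    return np.array([[len(w), counts.get(w, 0.0)] for w in words], dtype=float)

# Toy frequency counts standing in for load_ngram_counts output.
counts = {"easy": 100.0, "act": 90.0, "esoteric": 2.0, "aberration": 1.0}
X_train = featurize(["easy", "act", "esoteric", "aberration"], counts)
y_train = np.array([0, 0, 1, 1])  # 0 = simple, 1 = complex

clf = LogisticRegression().fit(X_train, y_train)
preds = clf.predict(featurize(["blue", "ostentatious"], counts))  # one 0/1 label per word
```

Swapping `LogisticRegression()` for `GaussianNB()` gives the Naive Bayes variant with the same `fit`/`predict` calls; compare both with your `get_accuracy` on the test set.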

### Section 4: Report (5 points)

**TODO:** Complete `REPORT.md` with information about your implementations and your accuracies for each section. Write a few comments comparing the performance of your Naive Bayes classifier and your Logistic Regression classifier. Include any details about extra credit you've completed.

Be sure to complete the report in [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) format. Remember to indicate which member worked on which sections for full credit.

## Submission

Submit all your code and any extra data to Gradescope. If you have a partner, YOU MUST MARK THEM AS A COLLABORATOR ON GRADESCOPE. If you fail to do this, you may get a 0 on this assignment.

## Attributions

This homework was adapted from [CIS 530: Computational Linguistics at the University of Pennsylvania](https://computational-linguistics-class.org/) by Arun Kirubarajan and Kevin Sun.

assignment2/data.zip: -2.68 KB (binary file not shown)
