# Assignment 2: Learning Machine Learning

## Part 1: Data Wrangling with Pandas

### Section 0: Pandas?

One of Python's biggest assets is its ability to let us do data analysis quickly and effortlessly, in a reproducible way. Since data analysis is a prevalent part of virtually any field in both academia and industry, it's worth having good data skills in your toolbelt.

The best way to do quick data analysis in Python is to use [Pandas](https://pandas.pydata.org), an open source data analysis library used by virtually every data scientist. In particular, we will be looking at a dataset of [different types of wines](https://archive.ics.uci.edu/ml/datasets/wine) and their chemical makeup. Cheers!

### Section 1: Scavenger Hunt

In this scavenger hunt, you will be using Pandas to determine the answers to the following questions about the provided Wine Dataset:

0. What are the dimensions of the dataframe? That is, how many rows and columns are there?
Give your answer as a `(rows, columns)` tuple.

1. What is the average `alcohol` content over all wines?

2. What is the standard deviation of the `magnesium` content?

3. What is the mean alcohol content, grouped by `target`?
Give your answer in terms of a list (e.g. `[class_0_alc, class_1_alc, class_2_alc]`).

4. What is the minimum `proline` content? What is the maximum? Give your answer
as a list in the form `[min, max]`.

5. How many wine samples in the dataset have a `malic_acid` content over 2.5?

6. What is the index of the wine with the lowest `flavanoid` content?

7. How many unique `hue` values are there in the dataset?

**TODO:** Implement the `scavenger_hunt()` function by returning a dictionary mapping each question number (as an integer) to the correct answer.

*Hint*: You can read the file into a `DataFrame` by using `pd.read_csv('wine.csv')`.
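
To get a feel for the Pandas API, here is a minimal sketch of how a few of the answers could be computed. It assumes the CSV uses the column names referenced in the questions above; treat it as a starting point, not a complete solution.

```python
import pandas as pd

# Load the dataset (assumes wine.csv is in the working directory).
df = pd.read_csv('wine.csv')

# Question 0: dimensions as a (rows, columns) tuple.
dimensions = df.shape

# Question 1: average alcohol content over all wines.
mean_alcohol = df['alcohol'].mean()

# Question 3: mean alcohol content grouped by target, as a list.
alcohol_by_target = df.groupby('target')['alcohol'].mean().tolist()
```

The remaining questions follow the same pattern with other `DataFrame` and `Series` methods (e.g. `std()`, `min()`, `max()`).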

## Part 2: Generating Text With N-Gram Language Models

### Section 0: What is a language model?

In this assignment, we will be building a **language model**, which is how we can use machine learning to generate text (e.g. chatbots, summarization, translation). In particular, we will be training an **n-gram** model, which is a relatively simple but extremely powerful model. Next week, we'll work with the state-of-the-art in NLP, which includes neural networks!

In short, an n-gram language model works by treating language as a sequence of *overlapping* word tuples of size $n$. For example, the sentence "I love CIS 192" would be represented as unigrams ($n = 1$) as:

```python
[("I",), ("love",), ("CIS",), ("192",)]  # note the trailing commas: ("I") is just a string, while ("I",) is a one-element tuple
```

and as bigrams ($n = 2$) as:

```python
[("I", "love"), ("love", "CIS"), ("CIS", "192")]
```

Language models work by estimating the **probability of the next word** in a sequence given the previous words, and then sampling from that distribution. If you're familiar with the **Markov Property**, we will be relying on it as a powerful simplifying assumption to keep the generation process tractable. Powerful language models (such as [GPT-3](https://www.technologyreview.com/2020/07/20/1005454/openai-machine-learning-language-generator-gpt-3-nlp/) and the human brain) take the entire history of a sentence into account. However, for this assignment, we will only consider a single n-gram of history to keep the math simple.
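
Concretely, the Markov assumption replaces the full conditioning history with only the most recent context. In standard n-gram notation, this is the approximation

$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$

In our implementation, the "context" will be a single previous n-gram rather than individual words, but the idea is the same: everything older than the most recent context is ignored.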

At a high level, in this assignment, we will build an estimator for the probability of a given word (or n-gram), given a context (n-gram), and then a way to repeatedly select new words until we have a full body of text.

We've provided some stubs to make the implementation process more clear. Feel free to add helper functions, but as usual, don't change the type signatures of the functions! You can get an idea of how the functions should work together by checking out the code under `if __name__ == "__main__"`.

### Section 1: Reading Data

The first thing you want to do is load the data from a given text file into memory. We'll start by just getting a list of the words and chunking them into grams later.

**TODO:** Implement the `get_words()` function, which takes in the file path of a plain-text `.txt` file as a string. The function should return a list of words (as strings), in the order that they appear in the text file.

*Hint:* make use of the `split()` function.
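
As a rough sketch (assuming simple whitespace tokenization is acceptable), the function might look like this:

```python
def get_words(file_path: str) -> list:
    """Return the words in the file, in the order they appear."""
    with open(file_path, 'r') as f:
        text = f.read()
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty strings.
    return text.split()
```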

### Section 2: Transforming Data

Getting data into a usable format is often the most important part of the machine learning process. For an n-gram model, this means taking our list of words and creating our list of n-grams, given the value of $n$.

**TODO:** Implement `get_ngrams()`, which takes a list of words and the size of the grams and returns a **list of tuples**, where each tuple is a gram.
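
One way to build the overlapping windows shown in Section 0 (a sketch, not the only valid approach) is a list comprehension over starting indices:

```python
def get_ngrams(words: list, n: int) -> list:
    """Return the list of overlapping n-gram tuples for a list of words."""
    # Each window starts at index i and spans n consecutive words.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```

For example, `get_ngrams(["I", "love", "CIS", "192"], 2)` returns `[("I", "love"), ("love", "CIS"), ("CIS", "192")]`, matching the bigram example above.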

### Section 3: Computing the Distribution of Words

The most important part of the n-gram model is our estimate of the distribution of our words. That is, we want to map a particular context n-gram to possible next n-grams and their frequencies. We'll represent this distribution by using a double dictionary, where the key of the outer dictionary is the context n-gram, and the inner dictionary maps the target n-gram to its frequency.

For example, our `counts` dictionary might look something like:

```python
counts = get_counts(n_grams)
print(counts[('I', 'am')])

>>> {('very', 'cool'): 50, ('kinda', 'lame'): 20, ('already', 'asleep'): 10 ... }
```

We also want to make sure that our model *generalizes* a bit better than just the raw frequencies. So we'll also add 1 to each possible frequency so that our model has a bit more creativity. If you're interested, this is called [smoothing](https://en.wikipedia.org/wiki/Language_model#n-gram).

*Hint:* We can do this by initializing a default dictionary, which maps to a default dictionary with the default integer value of 1:

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(lambda: 1))
```

**TODO:** Implement the `get_counts()` function, which takes in the list of n-grams and returns the frequency distribution of the n-grams.
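
Putting the hint together with the double-dictionary structure, a sketch of the counting loop might look like the following. Note one assumption: based on the `('I', 'am')` example above, each context n-gram is paired with the n-gram that starts right after it (`n_grams[i]` with `n_grams[i + n]`), so context and target do not overlap.

```python
from collections import defaultdict

def get_counts(n_grams: list) -> dict:
    """Map each context n-gram to a dict of possible next n-grams and their frequencies."""
    n = len(n_grams[0])
    # Every (context, target) pair starts from a baseline of 1 (the smoothing described above).
    counts = defaultdict(lambda: defaultdict(lambda: 1))
    # Pair each n-gram with the non-overlapping n-gram that begins right after it.
    for i in range(len(n_grams) - n):
        context, target = n_grams[i], n_grams[i + n]
        counts[context][target] += 1
    return counts
```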

### Section 4: Generating Words By Sampling

Now, it's time for the moment of truth: using our distribution of frequencies to select the next word given some context.

One way to randomly select a next word would be to pick a uniformly random index between 0 and `len(n_grams) - 1` and take that n-gram. However, that totally disregards our actual empirical estimate of the frequencies (i.e. the counts from `get_counts`). A better way would be to randomly select a word in accordance with the distribution we computed earlier.

We can do this by using `np.random.choice`, where we provide the length of a list to sample from and a list of corresponding **probabilities**, which will return the index of the selection:

```python
word = words[np.random.choice(len(words), p=probabilities)]
```

**TODO:** Implement the `generate_gram()` function, which takes in the distribution `counts` as well as the context n-gram, and returns a selected n-gram tuple according to the distribution for the given context.

*Hint:* Make sure you understand which dictionary in `counts` to use as the distribution. You will also need to normalize the raw frequencies into probabilities for `np.random.choice` by dividing each entry by the sum of the frequencies in the distribution.
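
As a sketch of how these pieces fit together (the variable names here are illustrative, and this assumes the given context actually appears in `counts`):

```python
import numpy as np

def generate_gram(counts: dict, context: tuple) -> tuple:
    """Sample the next n-gram for the given context from the counts distribution."""
    # The inner dictionary for this context maps candidate n-grams to frequencies.
    candidates = list(counts[context].keys())
    frequencies = np.array([counts[context][gram] for gram in candidates], dtype=float)
    # Normalize raw frequencies into probabilities, then sample an index.
    probabilities = frequencies / frequencies.sum()
    index = np.random.choice(len(candidates), p=probabilities)
    return candidates[index]
```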

There's a lot more nuance to generating text than just random sampling, but this suffices for now. If you want to learn more about how to generate text, check out [this blog post I wrote](https://kirubarajan.com/blog/decoding) on the subject.

### Section 5: Generating Entire Sentences

For the final part, we will put everything together. In order to generate an entire sentence or paragraph, we can repeatedly call `generate_gram()`, feeding in the previous prediction as the next context. This is known as **auto-regressive** generation.

**TODO:** Implement the `generate_sentence` function, which takes in the distribution `counts` as well as a context n-gram (and an optional parameter for the number of n-grams to generate) and returns a list of tuples.

*Hint:* We've provided a `stringify()` helper function to help you visualize what your generated sentence looks like in regular string format. Feel free to use it to help appreciate your work!
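
One possible shape for the auto-regressive loop is sketched below. The parameter name `length` and its default value are assumptions for illustration; match whatever the provided stub specifies.

```python
def generate_sentence(counts: dict, context: tuple, length: int = 10) -> list:
    """Repeatedly sample n-grams, feeding each prediction back in as the next context."""
    sentence = [context]
    for _ in range(length):
        next_gram = generate_gram(counts, context)
        sentence.append(next_gram)
        context = next_gram  # auto-regressive: the prediction becomes the new context
    return sentence
```

You can then pass the resulting list of tuples to `stringify()` to see the generated text as a regular string.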

## Conclusion