# Data
This directory contains some simple tools for analyzing and transforming the HN dump data.

## Steps
### Raw Data
Starting point is this awesome Hacker News data dump: https://archive.org/details/14566367HackerNewsCommentsAndStoriesArchivedByGreyPanthersHacker

### Extract
Here, we extract only the top-level comments from the raw HN data dump and convert them to
a simple TSV format for the processing steps that follow.
For now, we ignore comments that are replies, since they would require additional modelling.
```
./extract.py < 14m_hn_comments_sorted.json > top_level_hn_comments.tsv
```
The script also converts the comment text from HTML to Markdown using [html2text](https://pypi.org/project/html2text/).
Note that the entries in the JSON seem to come from different sources, with multiple formats.
For example, some entries use double newlines to represent paragraphs, while others use the HTML `<p>` tag.
`extract.py` tries to normalize the data a bit, but some inconsistencies will likely remain.
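
For reference, the extraction essentially keeps a story-id to title map and emits one TSV row per comment whose parent is a known story. Below is a hypothetical, simplified sketch of that logic, not the actual `extract.py`; in particular, the column order (story id, comment id, title, comment) and the tab/newline escaping are assumptions on my part:
```
#!/usr/bin/env python3
# Hypothetical sketch of the extraction step -- the real extract.py also tracks
# the ignored/invalid/deleted counters and copes with the differing source formats.
import sys
import json
import html2text

converter = html2text.HTML2Text()
converter.body_width = 0                    # don't hard-wrap the Markdown output

titles = {}                                 # story id -> title

for line in sys.stdin:
    body = json.loads(line).get("body", {})
    if body.get("deleted"):
        continue                            # skip deleted stories/comments
    if body.get("type") == "story" and "title" in body:
        titles[body["id"]] = body["title"]
    elif body.get("type") == "comment" and body.get("parent") in titles:
        # a top-level comment: its parent is a story we have already seen
        text = converter.handle(body.get("text", "")).strip()
        text = text.replace("\t", " ").replace("\n", "\\n")   # one TSV row per pair
        print("\t".join([str(body["parent"]), str(body["id"]),
                         titles[body["parent"]], text]))
```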

I get 3331156 extracted title-comment pairs, with the following statistics printed by `extract.py`:
```
stories: 2461338
comments: 11629633 (4.72 per title)
top-level: 3331156 (28.6437%)
ignored rows: 0.1507%
invalid rows: 0.2189%
deleted rows: 2.8940%
```

Some title-comment pairs may appear multiple times, so let's deduplicate:
```
sort -u -t$'\t' -k 3,3 -k 4,4 top_level_hn_comments.tsv > top_level_hn_comments.dedupe.tsv
```
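Here, `-t$'\t'` makes the tab character the field separator, and `sort -u` keeps only the first of any rows that compare equal on the title (column 3) and comment (column 4) keys.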
Indeed, it looks like a few (8999) title-comment pairs are duplicates in my case:
```
$ wc -l top_level_hn_comments.tsv top_level_hn_comments.dedupe.tsv
 3331156 top_level_hn_comments.tsv
 3322157 top_level_hn_comments.dedupe.tsv
```

### Split
Split the data into train, test and dev. This is just so that we can see how the model performs
on unseen data during training (dev) and after training (test).

We have to be a bit careful here so that we don't get the same title in both train and dev/test.
Plain line-based sampling of the TSV would scatter a title's comments across the splits, so I've
written a simple script that samples whole title groups.
Sort by title, then sample into train/dev/test, allocating 0.1% for dev and test data each:
```
sort -t$'\t' -k3,3 top_level_hn_comments.dedupe.tsv > top_level_hn_comments.dedupe.sorted-by-title.tsv
./sample_train_dev_test.py --train data.train.tsv \
    --dev data.dev.tsv 0.1 \
    --test data.test.tsv 0.1 \
    < top_level_hn_comments.dedupe.sorted-by-title.tsv
```
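
Conceptually, the script only has to group adjacent rows that share a title (which is why the input is sorted by title first) and send each whole group to train, dev or test. A minimal sketch of that idea follows; it is not the actual `sample_train_dev_test.py` (the real script takes the output paths and percentages as arguments, and its assignment logic may differ):
```
#!/usr/bin/env python3
# Minimal sketch of title-grouped splitting; output file names and the 0.1%
# dev/test fractions are hardcoded here for brevity.
import sys
import random
from itertools import groupby

def title(line):
    return line.split("\t")[2]              # column 3 holds the title

with open("data.train.tsv", "w") as train, \
     open("data.dev.tsv", "w") as dev, \
     open("data.test.tsv", "w") as test:
    # the input is sorted by title, so all rows for one title are adjacent
    for _, rows in groupby(sys.stdin, key=title):
        r = random.random()
        out = dev if r < 0.001 else test if r < 0.002 else train
        out.writelines(rows)
```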
Just to be sure, let's double check that we have no title overlap:
```
$ wc -l data.{train,dev,test}.tsv
 3315886 data.train.tsv
 3312 data.dev.tsv
 2959 data.test.tsv
 3322157 total

$ cut -f3 top_level_hn_comments.dedupe.sorted-by-title.tsv | sort -u | wc -l
595625

$ for i in {train,test,dev}; do cut -f3 data.$i.tsv | sort -u | wc -l; done
594479
559
587

$ expr 594479 + 559 + 587
595625
```
Phew, looks like the titles have been distributed without overlap. We can also see that we have
about 600K unique titles in training, with more than 5 comments per title on average. Let's hope
that will be enough data!

### Tokenize
Next, we normalize the data further. First, we note that a large number of comments contain links.
The conversion to Markdown leaves links in several different forms, which `normalize_links.sh`
tries to reduce to plain-text URLs. Then, we tokenize the titles and comments and split the TSV
into separate, line-aligned files for titles and comments.
We also lowercase the titles here, since they are only used as input and the casing probably
doesn't carry much useful signal for this task.
```
./preprocess_tsv.sh data.train
./preprocess_tsv.sh data.dev
./preprocess_tsv.sh data.test
```
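
For illustration, the link normalization boils down to rewriting Markdown link syntax to bare URLs. A rough, regex-based sketch of the idea (the actual `normalize_links.sh` may cover more cases and use different rules):
```
import re

# Hypothetical reduction of Markdown links to plain URLs; the real script may
# also deal with reference-style links and other variants.
def normalize_links(text):
    text = re.sub(r"\[([^\]]*)\]\((\S+?)\)", r"\2", text)   # [text](url) -> url
    text = re.sub(r"<(https?://[^>\s]+)>", r"\1", text)     # <url> -> url
    return text

print(normalize_links("see [the docs](https://example.com/docs) or <https://example.com>"))
```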
Sanity check that everything is still aligned:
```
$ for i in {train,dev,test}; do wc -l data.$i.tsv data.$i.pp.comments data.$i.pp.titles | grep -v total; done
 3315886 data.train.tsv
 3315886 data.train.pp.comments
 3315886 data.train.pp.titles
 3312 data.dev.tsv
 3312 data.dev.pp.comments
 3312 data.dev.pp.titles
 2959 data.test.tsv
 2959 data.test.pp.comments
 2959 data.test.pp.titles
```

### Learn BPE
Take some subset of the training data for learning BPE (for segmenting the text into subword units):
```
cat <(shuf data.train.pp.comments | head -n 500000) \
    <(shuf data.train.pp.titles | head -n 500000) \
    > bpetrain
```

Use [subword-nmt](https://github.com/rsennrich/subword-nmt.git) to learn BPE segmentation:
```
subword-nmt learn-bpe -s 24000 < bpetrain > bpecodes
```
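Here, `-s 24000` is the number of BPE merge operations to learn, which roughly determines the size of the resulting subword vocabulary.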

### Apply BPE
Use the codes we just learned to segment the train, dev and test data:
```
for i in {train,test,dev}; do
    for j in {comments,titles}; do
        subword-nmt apply-bpe --codes bpecodes < data.$i.pp.$j > data.$i.bpe.$j
    done
done
```
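After this step, rare words are split into subword pieces joined by subword-nmt's default `@@ ` separator (a word might come out as something like `token@@ ization`), while frequent words stay intact.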

### Training the model
See [../train](../train).

## Appendix
### Comment Lengths
Unfortunately, HN comments will often go on and on. Presumably, the model will not be able to generate
coherent comments of such length, especially with the relatively small amount of training data we have.
A question then becomes whether we should filter long comments from the training data, or even split up
long comments into multiple training examples (for example at the paragraph level, since HN users care
so much about structuring their comments nicely).

Let's first see what we have in terms of words per comment...
```
./length-distr.awk \
    < data.train.pp.comments \
    | gnuplot length-distr.plot -e "set ylabel 'p(length)'; plot '-' t 'length distribution' w l ls 1" \
    > length-distr.data.train.pp.comments.svg

./length-distr.awk \
    < data.train.pp.comments \
    | gnuplot length-distr.plot -e "set ylabel 'p(<= length)'; plot '-' u 1:(cumsum(\$2)) t 'cumulative length distribution' w l ls 2" \
    > length-distr-cumulative.data.train.pp.comments.svg
```
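
The awk script is not reproduced here, but the counting it presumably does is simple: count tokens per comment and normalize the counts into a distribution. A rough Python equivalent of that step (an assumption about what `length-distr.awk` emits):
```
# Hypothetical equivalent of length-distr.awk: print "length probability" pairs
# suitable for piping into gnuplot.
import sys
from collections import Counter

counts = Counter(len(line.split()) for line in sys.stdin)
total = sum(counts.values())
for length in sorted(counts):
    print(length, counts[length] / total)
```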
| 151 | +  |
| 152 | + |
| 153 | +There are many long comments, but it's not as bad as I thought: more than 50% of all top-level |
| 154 | +comments have fewer than 50 tokens. |
| 155 | + |
| 156 | +What's the average number of paragraphs per comment? (This is starting to feel more and more like |
| 157 | +some kind of Jupyter notebook) |
```
./paragraph-distr.awk \
    < data.train.pp.comments \
    | gnuplot length-distr.plot -e "set ylabel 'avg. numbers of paragraphs'; plot '-' t 'paragraphs' w l ls 1" \
    > paragraph-distr.data.train.pp.comments.svg
```

![](paragraph-distr.data.train.pp.comments.svg)

How neat.

### Format of the Raw HN Data Dump
A brief look into the format of the raw HN data dump.

Each line is one JSON object. Each object has an ID, by which the lines are sorted.
This is the first line, representing a story, pretty-printed with `head -n1 14m_hn_comments_sorted.json | jq`:
```
{
  "body": {
    "kids": [
      487171,
      15,
      234509,
      454410,
      82729
    ],
    "descendants": 15,
    "url": "http://ycombinator.com",
    "title": "Y Combinator",
    "by": "pg",
    "score": 61,
    "time": 1160418111,
    "type": "story",
    "id": 1
  },
  "source": "firebase",
  "id": 1,
  "retrieved_at_ts": 1435938464
}
```

This is a comment:
```
{
  "body": {
    "kids": [
      455092
    ],
    "parent": 534,
    "text": "which ones are you thinking about? ",
    "id": 586,
    "time": 1172193356,
    "type": "comment",
    "by": "gustaf"
  },
  "source": "firebase",
  "id": 586,
  "retrieved_at_ts": 1435974128
}
```

As noted in `extract.py`, some entries deviate from this layout.