# Data
This directory contains some simple tools for analyzing and transforming the HN dump data.

## Steps

### Raw Data

The starting point is this awesome Hacker News data dump: https://archive.org/details/14566367HackerNewsCommentsAndStoriesArchivedByGreyPanthersHacker

### Extract

Here, we extract only the top-level comments from the raw HN data dump and convert them to a
simple TSV format for the processing steps that follow.
For now, we ignore comments that are replies, since they would require additional modelling.
```
./extract.py < 14m_hn_comments_sorted.json > top_level_hn_comments.tsv
```
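
Each row of the resulting TSV has four tab-separated columns (ID, time, story title, comment text; see the docstring in `extract.py`). For illustration only, a made-up row might look like this, with newlines in the comment encoded as `<NL>` by the script:
```
12345	1203456789	Show HN: My Weekend Project	Nice work! <NL> <NL> Have you considered open-sourcing it?
```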
The script also converts from HTML to Markdown using [html2text](https://pypi.org/project/html2text/).
Note that the entries in the JSON seem to come from different sources, with multiple formats.
For example, some entries use double newlines to represent paragraphs, while others use the HTML `<p>` tag.
`extract.py` tries to normalize the data a bit, but it is likely that some inconsistencies will remain.

I get 3331156 extracted title-comment pairs, with the following statistics printed by `extract.py`:
```
stories:       2461338
comments:      11629633 (4.72 per title)
top-level:     3331156 (28.6437%)
ignored rows:  0.1507%
invalid rows:  0.2189%
deleted rows:  2.8940%
```

Some of the title-comment pairs may be contained multiple times, so let's deduplicate on the title and text columns (fields 3 and 4):
```
sort -u -t$'\t' -k 3,3 -k 4,4 top_level_hn_comments.tsv > top_level_hn_comments.dedupe.tsv
```
Indeed, it looks like a few (8999) title-comment pairs are duplicates in my case:
```
$ wc -l top_level_hn_comments.tsv top_level_hn_comments.dedupe.tsv
  3331156 top_level_hn_comments.tsv
  3322157 top_level_hn_comments.dedupe.tsv
```

### Split

Split the data into train, dev and test sets. This is just so that we can see how the model performs
on unseen data during training (dev) and after training (test).

We have to be a bit careful here so that we don't get the same title in both train and dev/test.
The TSV format isn't very well suited for this, so I've written a stupid script for sampling.
Sort by title, then sample into train/dev/test, allocating 0.1% each for the dev and test data:
```
sort -t$'\t' -k3,3 top_level_hn_comments.dedupe.tsv > top_level_hn_comments.dedupe.sorted-by-title.tsv
./sample_train_dev_test.py --train data.train.tsv \
                           --dev data.dev.tsv 0.1 \
                           --test data.test.tsv 0.1 \
                           < top_level_hn_comments.dedupe.sorted-by-title.tsv
```
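
To make the no-overlap idea concrete, here is a minimal hypothetical sketch (not the actual `sample_train_dev_test.py`) of how title-grouped sampling can work on a title-sorted stream, assuming the `0.1` arguments above mean 0.1%:
```
#!/usr/bin/env python3
# Hypothetical sketch of title-grouped sampling, NOT the actual sample_train_dev_test.py.
# Assumes the TSV on stdin is sorted by the title column (column 3), so all comments
# for a given title arrive consecutively and land in the same split.
import random
import sys

def sample_splits(f_in, f_train, f_dev, f_test, p_dev=0.001, p_test=0.001):
    current_title = None
    current_out = f_train
    for line in f_in:
        title = line.split('\t')[2]
        if title != current_title:
            # New title group: decide once where the whole group goes
            current_title = title
            r = random.random()
            if r < p_dev:
                current_out = f_dev
            elif r < p_dev + p_test:
                current_out = f_test
            else:
                current_out = f_train
        current_out.write(line)

with open('data.train.tsv', 'w') as f_train, \
     open('data.dev.tsv', 'w') as f_dev, \
     open('data.test.tsv', 'w') as f_test:
    sample_splits(sys.stdin, f_train, f_dev, f_test)
```
Because the split decision is made once per title group, all comments for a title end up in exactly one of the three files, which is what the overlap check below verifies.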
Just to be sure, let's double-check that we have no title overlap:
```
$ wc -l data.{train,dev,test}.tsv
  3315886 data.train.tsv
     3312 data.dev.tsv
     2959 data.test.tsv
  3322157 total

$ cut -f3 top_level_hn_comments.dedupe.sorted-by-title.tsv | sort -u | wc -l
595625

$ for i in {train,test,dev}; do cut -f3 data.$i.tsv | sort -u | wc -l; done
594479
559
587

$ expr 594479 + 559 + 587
595625
```
Phew, looks like the titles have been distributed without overlap. We can also see that we have
about 600K unique titles in training, with more than 5 comments each on average. Let's hope that will
be enough data!
79+
### Tokenize
80+
Next, we normalize the data further. First, we note that a large number of comments contain links.
81+
As a result of the conversion to Markdown, there are different ways of specifying links,
82+
which `normalize_links.sh` tries to reduce just to plain-text URLs. Then, we tokenize the
83+
titles and comments and split from TSV into separate files for parallel line-aligned titles/comments.
84+
We also lowercase titles here, since they are only seen as an input and we think there is not much to
85+
be gained from this signal for this task.
86+
```
87+
./preprocess_tsv.sh data.train
88+
./preprocess_tsv.sh data.dev
89+
./preprocess_tsv.sh data.test
90+
```
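
The preprocessing scripts are not reproduced in this README. As a rough illustration of just the link handling (hypothetical code, not the actual `normalize_links.sh`), reducing Markdown-style links to bare URLs could look like this:
```
import re

# Hypothetical illustration of the link normalization described above,
# not the actual normalize_links.sh: keep only the plain URL.
def normalize_links(line):
    line = re.sub(r'\[[^\]]*\]\((https?://[^)\s]+)\)', r'\1', line)  # [text](url) -> url
    line = re.sub(r'<(https?://[^>\s]+)>', r'\1', line)              # <url> -> url
    return line

print(normalize_links('see [this post](http://example.com/x) and <http://example.com/y>'))
# -> see http://example.com/x and http://example.com/y
```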
Sanity check that everything is still aligned:
```
$ for i in {train,dev,test}; do wc -l data.$i.tsv data.$i.pp.comments data.$i.pp.titles | grep -v total; done
  3315886 data.train.tsv
  3315886 data.train.pp.comments
  3315886 data.train.pp.titles
     3312 data.dev.tsv
     3312 data.dev.pp.comments
     3312 data.dev.pp.titles
     2959 data.test.tsv
     2959 data.test.pp.comments
     2959 data.test.pp.titles
```

### Learn BPE
Take some subset of the training data for learning BPE (for segmenting the text into subword units):
```
cat <(shuf data.train.pp.comments | head -n 500000) \
    <(shuf data.train.pp.titles | head -n 500000) \
    > bpetrain
```

Use [subword-nmt](https://github.com/rsennrich/subword-nmt.git) to learn BPE segmentation:
```
subword-nmt learn-bpe -s 24000 < bpetrain > bpecodes
```

### Apply BPE
Take the codes we just learned and use them to segment the train, dev and test data:
```
for i in {train,test,dev}; do
  for j in {comments,titles}; do
    subword-nmt apply-bpe --codes bpecodes < data.$i.pp.$j > data.$i.bpe.$j
  done
done
```
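
For illustration, `apply-bpe` marks subword units that continue a word with the default `@@` separator. A made-up example of what a tokenized, lowercased, BPE-segmented title might look like (actual splits depend on the learned codes):
```
show hn : i built a ku@@ ber@@ net@@ es dash@@ board in a week@@ end
```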

### Training the Model

See [../train](../train).

## Appendix

### Comment Lengths

Unfortunately, HN comments often go on and on. Presumably, the model will not be able to generate
coherent comments of such length, especially with the relatively small amount of training data we have.
The question then becomes whether we should filter long comments from the training data, or even split
long comments into multiple training examples (for example at the paragraph level, since HN users care
so much about structuring their comments nicely).

Let's first see what we have in terms of words per comment...
```
./length-distr.awk \
  < data.train.pp.comments \
  | gnuplot length-distr.plot -e "set ylabel 'p(length)'; plot '-' t 'length distribution' w l ls 1" \
  > length-distr.data.train.pp.comments.svg

./length-distr.awk \
  < data.train.pp.comments \
  | gnuplot length-distr.plot -e "set ylabel 'p(<= length)'; plot '-' u 1:(cumsum(\$2)) t 'cumulative length distribution' w l ls 2" \
  > length-distr-cumulative.data.train.pp.comments.svg
```
![comment length distribution](length-distr.data.train.pp.comments.svg) ![cumulative comment length distribution](length-distr-cumulative.data.train.pp.comments.svg)
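
The awk script itself is not shown here, but a hypothetical Python equivalent of the `length p(length)` pairs it pipes into gnuplot would be:
```
#!/usr/bin/env python3
# Hypothetical Python equivalent of length-distr.awk (the awk script is not shown):
# print "length p(length)" pairs, i.e. the distribution of tokens per comment.
import sys
from collections import Counter

counts = Counter(len(line.split()) for line in sys.stdin)
total = sum(counts.values())
for length in sorted(counts):
    print(length, counts[length] / total)
```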

There are many long comments, but it's not as bad as I thought: more than 50% of all top-level
comments have fewer than 50 tokens.

What's the average number of paragraphs per comment? (This is starting to feel more and more like
some kind of Jupyter notebook.)
```
./paragraph-distr.awk \
  < data.train.pp.comments \
  | gnuplot length-distr.plot -e "set ylabel 'avg. numbers of paragraphs'; plot '-' t 'paragraphs' w l ls 1" \
  > paragraph-distr.data.train.pp.comments.svg
```
![avg. number of paragraphs](paragraph-distr.data.train.pp.comments.svg)

How neat.

### Format of the Raw HN Data Dump

A brief look into the format of the raw HN data dump.

Each line is one JSON object. Each object has an ID, by which the lines are sorted.
This is the first line, representing a story, pretty-printed with `head -n1 14m_hn_comments_sorted.json | jq`:
```
{
  "body": {
    "kids": [
      487171,
      15,
      234509,
      454410,
      82729
    ],
    "descendants": 15,
    "url": "http://ycombinator.com",
    "title": "Y Combinator",
    "by": "pg",
    "score": 61,
    "time": 1160418111,
    "type": "story",
    "id": 1
  },
  "source": "firebase",
  "id": 1,
  "retrieved_at_ts": 1435938464
}
```

This is a comment:
```
{
  "body": {
    "kids": [
      455092
    ],
    "parent": 534,
    "text": "which ones are you thinking about? ",
    "id": 586,
    "time": 1172193356,
    "type": "comment",
    "by": "gustaf"
  },
  "source": "firebase",
  "id": 586,
  "retrieved_at_ts": 1435974128
}
```

As noted in the comments in `extract.py`, some entries deviate from this layout.

The extraction script itself, `extract.py`:

#!/usr/bin/env python3
"""
Extract only the top-level comments from a HN data dump.

The dump is read from STDIN in the JSON format (see README.md).
Extracted comments are written to STDOUT in TSV, with the following columns:
1. ID
2. Time
3. Story Title
4. Text

Statistics are written on STDERR.
"""

import argparse
import sys
import json
import html2text

h2t = html2text.HTML2Text()
h2t.body_width = 0  # don't automatically wrap when converting to Markdown

def normalize_text(text):
    # The data dump seems to contain carriage returns frequently
    text = text.replace('\r', '')

    # In some examples, it seems that double newline is used for paragraphs,
    # other examples use <p>. For html2text, we need the latter.
    text = text.replace('\n\n', '<p>')

    # We should be fine with ignoring the remaining single newlines
    text = text.replace('\n', ' ')

    # HTML -> Markdown
    text = h2t.handle(text)

    # There are some trailing newlines in the Markdown
    text = text.strip()
    text = text.replace('\r', '')

    # Finally, convert whitespace so that we can give line-by-line tab-separated output
    text = text.replace('\t', ' ')
    text = text.replace('<NL>', ' ')  # these are texts written by programmers,
                                      # but let's not bother with this special case
    text = text.replace('\n', ' <NL> ')

    return text

class Converter(object):
    def __init__(self):
        # We remember a mapping from object IDs to the titles of stories.
        # This way we can tell if a comment is top-level by checking if its parent is a story.
        # This works in a single pass since the input lines are sorted by ID.
        self.story_titles = {}

        # Let's keep some stats
        self.n_total = 0
        self.n_comments = 0
        self.n_top_level_comments = 0
        self.n_unexpected_format = 0
        self.n_ignored = 0
        self.n_deleted = 0

    def _process_object(self, body, f_out):
        object_type = body['type']

        if object_type == 'story':
            title = body['title'].strip()
            title = title.replace('\r', '')
            title = title.replace('\n', ' ')
            title = title.replace('\t', ' ')

            if len(title) == 0:
                self.n_ignored += 1
                return

            self.story_titles[body['id']] = title
        elif object_type == 'comment':
            story_title = self.story_titles.get(body['parent'])

            if story_title is not None:
                # Yay, got a top-level comment!

                text = normalize_text(body['text'])
                if len(text) == 0:
                    self.n_ignored += 1
                    return

                f_out.write(str(body['id']))
                f_out.write('\t')
                f_out.write(str(body['time']))
                f_out.write('\t')
                f_out.write(story_title)
                f_out.write('\t')
                f_out.write(text)
                f_out.write('\n')

                self.n_top_level_comments += 1

            self.n_comments += 1

        else:
            # Probably object_type == 'job'
            self.n_ignored += 1

    def process_object(self, obj, f_out):
        try:
            self.n_total += 1

            body = obj['body']

            if body.get('deleted') == True:
                self.n_deleted += 1
                return

            # Some of the titles contain "algolia" as well as "site" fields,
            # in which the actual "body" is stored
            algolia = body.get('algolia')
            if algolia is not None:
                body = algolia

            site = body.get('site')
            if site is not None:
                body = site

            # Those titles that have their body in "site" don't always have the
            # "id", let's copy it over if possible
            if 'id' not in body and 'id' in obj:
                body['id'] = obj['id']

            self._process_object(body, f_out)
        except KeyError:
            # Not sure why this happens, but a few lines in the input seem to be missing fields
            self.n_unexpected_format += 1

    def write_stats(self, f_out):
        f_out.write('stories:\t{}\n'.format(len(self.story_titles)))
        if len(self.story_titles) > 0:
            f_out.write('comments:\t{} ({:.2f} per title)\n'.format(self.n_comments, self.n_comments / float(len(self.story_titles))))
        if self.n_comments > 0:
            f_out.write('top-level:\t{} ({:.4f}%)\n'.format(self.n_top_level_comments, self.n_top_level_comments / float(self.n_comments) * 100.0))

        f_out.write('ignored rows:\t{:.4f}%\n'.format(self.n_ignored / float(self.n_total) * 100.0))
        f_out.write('invalid rows:\t{:.4f}%\n'.format(self.n_unexpected_format / float(self.n_total) * 100.0))
        f_out.write('deleted rows:\t{:.4f}%\n'.format(self.n_deleted / float(self.n_total) * 100.0))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    args = parser.parse_args()  # no options besides --help for now

    converter = Converter()

    for n, line in enumerate(sys.stdin):
        converter.process_object(json.loads(line), f_out=sys.stdout)
        if (n + 1) % 500_000 == 0:
            # Report progress and intermediate statistics every 500k input lines
            sys.stderr.write('[{:.1f}M]\n'.format((n + 1) / float(1_000_000)))
            converter.write_stats(sys.stderr)

    converter.write_stats(f_out=sys.stderr)
