Commit 3a580ed

The only valid 2-col splitter is: ' ' (space)

- helps if the words in 'words.txt' contain special Unicode whitespace characters, which otherwise lead to having >2 columns,
- makes the code a little more robust to 'dirty' data prep.
1 parent 9b23b17 commit 3a580ed

1 file changed: +3 -2 lines changed

scripts/rnnlm/get_unigram_probs.py

Lines changed: 3 additions & 2 deletions
@@ -130,8 +130,9 @@ def get_counts(data_sources, data_weights, vocab):
 
         with open(counts_file, 'r', encoding="utf-8") as f:
             for line in f:
-                fields = line.split()
-                assert len(fields) == 2
+                fields = line.split(' ')
+                if len(fields) != 2: print("Warning, should be 2 cols:", fields, file=sys.stderr);
+                assert(len(fields) == 2)
                 word = fields[0]
                 count = fields[1]
                 if word not in vocab:
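
The behavior difference this commit relies on can be reproduced in isolation. The sketch below is illustrative only: the helper name `split_counts_line` and the sample line are assumptions, not code from the repository. It shows that `str.split()` with no argument splits on any Unicode whitespace (including a no-break space inside a word), while `split(' ')` splits only on the ASCII space. Note that `split(' ')` also leaves the trailing newline on the last field, which the sketch strips explicitly.

```python
import sys


def split_counts_line(line):
    """Split a 'word count' line on a single ASCII space only.

    Hypothetical helper for illustration; it is not part of
    scripts/rnnlm/get_unigram_probs.py.
    """
    # Strip the line terminator first, because split(' ') keeps it on the last field.
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 2:
        print("Warning, should be 2 cols:", fields, file=sys.stderr)
        raise ValueError("expected exactly 2 space-separated columns")
    return fields[0], fields[1]


# A "dirty" word that contains a no-break space (U+00A0), followed by its count.
dirty = "foo\u00a0bar 3\n"

print(dirty.split())             # ['foo', 'bar', '3']  -> 3 columns
print(dirty.split(" "))          # ['foo\xa0bar', '3\n'] -> 2 columns
print(split_counts_line(dirty))  # ('foo\xa0bar', '3')
```

One trade-off of splitting on a literal space: lines with tabs or repeated spaces between the columns no longer yield exactly two fields (for example, `"a  3".split(" ")` gives an empty middle field), so they are reported instead of being silently accepted.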
