Commit b401578

remove infrequent words
1 parent 3c6353b commit b401578

27 files changed: 2328 additions & 1 deletion

remove_infrequent_words10/log.log

Lines changed: 544 additions & 0 deletions
Large diffs are not rendered by default.

remove_infrequent_words10/run.sh

Lines changed: 1 addition & 1 deletion
@@ -3,4 +3,4 @@
 MIN_WORD_COUNT=500
 OUT_DIR=`pwd`
 
-MIN_WORD_COUNT="${MIN_WORD_COUNT}" OUTPUT="hdfs://dco-node121.dco.ethz.ch:54310/cw-combined-pruned-${MIN_WORD_COUNT}" /root/spark/bin/spark-submit --num-executors 20 --class RemoveInfrequentWordsApp ${OUT_DIR}/run.jar > ${OUT_DIR}/log.log 2>&1
+MIN_WORD_COUNT="${MIN_WORD_COUNT}" OUTPUT="hdfs://dco-node121.dco.ethz.ch:54310/cw-combined-pruned-${MIN_WORD_COUNT}" /root/spark/bin/spark-submit --total-executor-cores 20 --class RemoveInfrequentWordsApp ${OUT_DIR}/run.jar > ${OUT_DIR}/log.log 2>&1
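
The hunk above swaps `--num-executors` for `--total-executor-cores`. The commit does not state why; one plausible reading is that `spark-submit` documents `--num-executors` as a YARN option and `--total-executor-cores` as a standalone/Mesos option, so the flag has to match the cluster manager. The sketch below illustrates that mapping. The helper name `executor_flag` and the example master URL are hypothetical and do not appear in the commit, and the two flags count different things (executors versus cores), so reusing the value 20 is only for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of the commit): pick the resource flag
# that matches the cluster manager named in a Spark master URL.
executor_flag() {
  local master="$1" count="$2"
  case "$master" in
    yarn*)               echo "--num-executors ${count}" ;;        # YARN
    spark://*|mesos://*) echo "--total-executor-cores ${count}" ;; # standalone / Mesos
    *)                   echo "" ;;                                # e.g. local[*]: no flag
  esac
}

# Example with an assumed standalone master URL.
executor_flag "spark://example-master:7077" 20
```

With a `spark://` master URL the helper prints the flag that the new version of `run.sh` uses.
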
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+repo
+run.jar
+

remove_infrequent_words11/build.sh

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+#! /bin/bash
+
+rm -f run.jar
+
+cd repo
+git pull
+
+cd src/remove_infrequent_words
+sbt assembly
+
+cd ../../..
+cp repo/src/remove_infrequent_words/target/scala-2.10/RemoveInfrequentWordsApp-assembly-1.0.jar run.jar
+
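
As committed, `build.sh` runs every step unconditionally, so a failed `git pull` or `sbt assembly` would still be followed by the `cp`. A minimal fail-fast variant might look like the sketch below. It is an illustration, not the author's script: the function name `build` is hypothetical, while the paths and commands are copied from the diff above.

```shell
#!/bin/bash
# Sketch of a fail-fast variant of build.sh: each step runs only if the
# previous one succeeded, so a failed build cannot leave a stale run.jar.
build() {
  local jar="repo/src/remove_infrequent_words/target/scala-2.10/RemoveInfrequentWordsApp-assembly-1.0.jar"
  rm -f run.jar &&
    # The subshell keeps the directory changes local, so no "cd ../../.." is needed.
    ( cd repo && git pull && cd src/remove_infrequent_words && sbt assembly ) &&
    cp "$jar" run.jar
}
```

Calling `build || exit 1` from the job directory would stop at the first failing step and report a non-zero status.
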

remove_infrequent_words11/log.log

Lines changed: 155 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+# Simple word count on the cleaned dataset
+
+## Version
+
+https://github.com/lukaselmer/ethz-web-scale-data-mining-project/tree/8dbe8eb1f4e885cc212e0161492a6a5a7a696d05/src/remove_infrequent_words
+
+## Usage
+
+<param1>=<value1> <param2>=<value2> ... mahout-submit ...
+Options
+MIN_WORD_COUNT: number, how many times a word has to occur at least to be kept
+[optional] MAX_WORD_COUNT: default: Int.MaxInt
+[optional] OUTPUT: default: hdfs://dco-node121.dco.ethz.ch:54310/cw-combined-pruned
+[optional] INPUT_COMBINED: default: hdfs://dco-node121.dco.ethz.ch:54310/cw-combined
+[optional] INPUT_WORDCOUNT: default: hdfs://dco-node121.dco.ethz.ch:54310/cw-wordcount/wordcounts.txt
+
+
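
The README above describes the job's options as environment variables with defaults. Its usage line names `mahout-submit`, whereas the `run.sh` files in this commit invoke `spark-submit`. The sketch below shows how a wrapper could resolve those options in the shell before launching. The function name `resolve_options` is hypothetical, and the application presumably applies the same defaults itself; `MAX_WORD_COUNT` is deliberately left out so that the application's own default stays in effect.

```shell
#!/bin/bash
# Sketch: resolve the README's options, failing early when the one
# required value is missing. Defaults are copied from the README.
resolve_options() {
  if [ -z "${MIN_WORD_COUNT:-}" ]; then
    echo "MIN_WORD_COUNT is required" >&2
    return 1
  fi
  local hdfs="hdfs://dco-node121.dco.ethz.ch:54310"
  echo "MIN_WORD_COUNT=${MIN_WORD_COUNT}"
  echo "OUTPUT=${OUTPUT:-${hdfs}/cw-combined-pruned}"
  echo "INPUT_COMBINED=${INPUT_COMBINED:-${hdfs}/cw-combined}"
  echo "INPUT_WORDCOUNT=${INPUT_WORDCOUNT:-${hdfs}/cw-wordcount/wordcounts.txt}"
}
```

For example, `MIN_WORD_COUNT=1000 resolve_options` prints the four resolved settings, one per line, and an unset `MIN_WORD_COUNT` yields an error message and a non-zero status.
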

remove_infrequent_words11/run.sh

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+MIN_WORD_COUNT=1000
+OUT_DIR=`pwd`
+
+MIN_WORD_COUNT="${MIN_WORD_COUNT}" OUTPUT="hdfs://dco-node121.dco.ethz.ch:54310/cw-combined-pruned-${MIN_WORD_COUNT}" /root/spark/bin/spark-submit --total-executor-cores 20 --class RemoveInfrequentWordsApp ${OUT_DIR}/run.jar > ${OUT_DIR}/log.log 2>&1
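
The `run.sh` files in this commit differ only in the threshold (500 in `remove_infrequent_words10`, 1000 in `remove_infrequent_words11`), and the HDFS output path is derived from that value. A dry-run helper makes the derivation visible without submitting anything. The function name `submit_cmd` and the example directory `/tmp/job` are hypothetical; the paths, class name, and flags are taken from the diff above.

```shell
#!/bin/bash
# Sketch: print (do not run) the submit command for a given threshold.
submit_cmd() {
  local min_count="$1" out_dir="${2:-$(pwd)}"
  local output="hdfs://dco-node121.dco.ethz.ch:54310/cw-combined-pruned-${min_count}"
  printf 'MIN_WORD_COUNT=%s OUTPUT=%s /root/spark/bin/spark-submit --total-executor-cores 20 --class RemoveInfrequentWordsApp %s\n' \
    "$min_count" "$output" "${out_dir}/run.jar"
}

# Example: show the command for the threshold used in remove_infrequent_words11.
submit_cmd 1000 /tmp/job
```

Printing the command first is a cheap way to check the output path before a long-running job overwrites or collides with an earlier result.
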
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+repo
+run.jar
+

remove_infrequent_words12/build.sh

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+#! /bin/bash
+
+rm -f run.jar
+
+cd repo
+git pull
+
+cd src/remove_infrequent_words
+sbt assembly
+
+cd ../../..
+cp repo/src/remove_infrequent_words/target/scala-2.10/RemoveInfrequentWordsApp-assembly-1.0.jar run.jar
+

remove_infrequent_words12/log.log

Lines changed: 237 additions & 0 deletions
Large diffs are not rendered by default.
