GitHub - mremad/SpokenInputTopicDetection

--------------------------------------------------
| INCLUDED PROJECTS |
--------------------------------------------------

*punctuation_detector
A keras implementation of a period punctuation detector. This system inserts periods into a stream of unstructured text.

*topic_based_segmentation
A keras implementation of a system that divides a stream of sentences into segments such that adjacent segments differ in topic.

*segment_classification
A kNN implementation (based on Word Embeddings and TFIDF features) that classifies a given segment into one of the available topics, given the topics' summaries.

--------------------------------------------------
| DATASET USED |
--------------------------------------------------

TDT-2 Corpus

--------------------------------------------------
| DEPENDENCIES |
--------------------------------------------------

keras 1.2.0
gensim word2vec
numpy
nltk

--------------------------------------------------
| CORPUS |
--------------------------------------------------

For the experiments, a TDTCorpus pickled object (*.pckl) is loaded that has two attributes: "text_corpus_bnds" and "sent_boundaries". Any object having these two attributes can work as input.

Explaining "text_corpus_bnds":
A dict() object whose keys are the ids of the documents and whose values are the lists of tokens (words) of each document. A token (word) that marks the end of a sentence carries the marker "<bnd>".

EXAMPLE:
text_corpus_bnds = {
    "document_id_1": ["This", "is", "a", "sentence.<bnd>", "this", "is", "another", "sentence<bnd>"],
    "document_id_2": ["This", "is", "A", "non-pre-processed", "sentence.<bnd>", "this", "is", "another", "sentence<bnd>"],
    "document_id_3": ["This", "is", "a", "sentence.<bnd>", "this", "is", "another", "sentence<bnd>"]
}

Explaining "sent_boundaries":
A dict() object whose keys are the ids of the documents and whose values are the lists of sentence-based TOPIC boundaries. The i-th entry in a list indicates whether there is a topic change after the i-th sentence of the document (1) or not (0).

EXAMPLE:
sent_boundaries = {
    "document_id_1": [0,0,0,1,0,0,0,0,0,1],
    #document id 1 has two story segments: sentences 1 to 4 and sentences 5 to 10
    "document_id_2": [0,1,0,0,1],
    #document id 2 has two story segments: sentences 1 to 2 and sentences 3 to 5
    "document_id_3": [0,0,1,0,0,1,1]
    #document id 3 has three story segments: sentences 1 to 3, sentences 4 to 6,
    #and a segment containing only sentence 7
}

--------------------------------------------------
| VOCAB |
--------------------------------------------------

Another input used in the experiments is the vocabulary of the training set. A vocab is a dict stored in a pickle object (*.pckl). The dict() has the words as keys, and each value is a unique id for that word. Word ids should start from 1. A special word "<bnd>" is added to the dict and has the id len(vocab) + 1. This <bnd> word is reserved for unknown words. Words should be lowercased.

EXAMPLE:
vocab = {
    "word"  : 1,
    "usa"   : 2,
    "<bnd>" : 3
}
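
The corpus and vocab formats described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code shipped in the project folders: the helper names (split_sentences, segments_from_boundaries, build_vocab) are hypothetical, a SimpleNamespace stands in for the TDTCorpus object, and the vocab builder assumes that the "<bnd>" marker is stripped from a token before it is lowercased (the README does not say how punctuation is handled, so it is left untouched here).

```python
from types import SimpleNamespace

BND = "<bnd>"  # end-of-sentence marker described in the CORPUS section


def split_sentences(tokens):
    """Group a flat token list into sentences, cutting after each <bnd> token."""
    sentences, current = [], []
    for token in tokens:
        if token.endswith(BND):
            current.append(token[: -len(BND)])  # drop the marker itself
            sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:  # assumption: trailing tokens without a marker form a sentence
        sentences.append(current)
    return sentences


def segments_from_boundaries(flags):
    """Turn 0/1 topic-change flags into 1-based (first, last) sentence ranges."""
    segments, start = [], 1
    for i, flag in enumerate(flags, start=1):
        if flag == 1:  # topic changes after sentence i
            segments.append((start, i))
            start = i + 1
    if start <= len(flags):  # sentences after the last flagged boundary
        segments.append((start, len(flags)))
    return segments


def build_vocab(text_corpus_bnds):
    """Assign ids from 1 to lowercased words; "<bnd>" gets len(vocab) + 1."""
    vocab = {}
    for doc_id in sorted(text_corpus_bnds):
        for token in text_corpus_bnds[doc_id]:
            word = token.replace(BND, "").lower()  # assumption: strip marker
            if word and word not in vocab:
                vocab[word] = len(vocab) + 1
    vocab[BND] = len(vocab) + 1  # reserved for unknown words
    return vocab


# Any object with these two attributes matches the described input format.
corpus = SimpleNamespace(
    text_corpus_bnds={
        "document_id_1": ["This", "is", "a", "sentence.<bnd>",
                          "this", "is", "another", "sentence<bnd>"],
    },
    sent_boundaries={"document_id_1": [0, 1]},
)

sentences = split_sentences(corpus.text_corpus_bnds["document_id_1"])
segments = segments_from_boundaries(corpus.sent_boundaries["document_id_1"])
vocab = build_vocab(corpus.text_corpus_bnds)
print(sentences)
print(segments)
print(vocab)
```

Applied to the sent_boundaries example above, segments_from_boundaries([0,0,1,0,0,1,1]) yields the three segments listed for document id 3: sentences 1 to 3, sentences 4 to 6, and sentence 7 alone.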
