NLTK Alberts Pumpurs
90% of the world's data was generated over the last two years
A common Internet user creates:
Visual: Instagram, Flickr, Vscocam, Facebook, Tumblr
Textual: Blogger, Twitter, Facebook, emails, customer reviews
Detecting hidden signals
The world is full of unstructured, text-rich data: everything from emails to customer tweets. The information buried in all that text holds the potential to deliver valuable business insights.
Text analytics is the practice of using technology to gather, store and mine textual information for hidden signals that can be used to inform smarter business decisions.
An explosion of unstructured data
Many types of organizations are experiencing explosive growth in their unstructured enterprise data. At the same time, they have access to external sources of data such as social media, blogs, and mobile data.
Until now, much of this information passed through the organization virtually unanalyzed. Today, new tools for handling large amounts of complex data make it easier to squeeze value from such unlikely sources.
Text Processing use cases
Sentiment analysis, spam filtering, text categorization, topic detection, keyword frequency, plagiarism detection, document similarity, phrase extraction
Natural Language Toolkit (NLTK): a leading platform for building Python programs to work with human language data
NLTK Features
Sentence and word tokenization, text classification, corpora, parsing, clustering, part-of-speech tagging, text stemming, and much more
Sentence tokenization
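A minimal sketch of sentence tokenization, assuming NLTK is installed. It uses an untrained `PunktSentenceTokenizer`, which needs no downloaded model. The usual entry point, `nltk.sent_tokenize`, relies on a pre-trained Punkt model that has to be fetched with `nltk.download` first. The sample text is invented for illustration, and an untrained tokenizer will also split after abbreviations such as "Mr.".

```python
from nltk.tokenize import PunktSentenceTokenizer

# Invented sample text (not from the slides).
text = ("NLTK is a Python library. It works with human language data. "
        "Text analytics looks for hidden signals.")

# Untrained Punkt tokenizer: no abbreviation list, no model download needed.
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)
```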
Word tokenization
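A small sketch of word tokenization, assuming NLTK is installed. `TreebankWordTokenizer` is used here because it works on a single sentence without any downloaded data; `nltk.word_tokenize` wraps a similar tokenizer but first splits sentences with a pre-trained Punkt model. The sample sentence is invented.

```python
from nltk.tokenize import TreebankWordTokenizer

# Invented sample sentence (not from the slides).
sentence = "Customers don't read long emails."

# Penn Treebank conventions: contractions are split ("do" + "n't")
# and the sentence-final period becomes its own token.
tokens = TreebankWordTokenizer().tokenize(sentence)
print(tokens)
```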
Part of speech tagging
Part of speech tagging explanation
CC   Coordinating conjunction
CD   Cardinal number
DT   Determiner
EX   Existential "there"
FW   Foreign word
IN   Preposition or subordinating conjunction
JJ   Adjective
JJR  Adjective, comparative
JJS  Adjective, superlative
LS   List item marker
MD   Modal
NN   Noun, singular or mass
NNS  Noun, plural
NNP  Proper noun, singular
nltk.help.upenn_tagset()  # lists all tags
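To show what tagged output looks like, here is a rough sketch using NLTK's `RegexpTagger` with a handful of hand-written rules. This is an assumption made for illustration: the usual call is `nltk.pos_tag`, which uses a pre-trained model that must be downloaded first. The rules and the sample sentence are invented and cover only a few of the tags listed above.

```python
from nltk.tag import RegexpTagger

# Hand-written, deliberately tiny rule set (illustrative only).
# Rules are tried in order; the first matching pattern wins.
patterns = [
    (r"^-?\d+(\.\d+)?$", "CD"),            # cardinal numbers
    (r"^(the|a|an)$", "DT"),               # determiners
    (r"^(and|or|but)$", "CC"),             # coordinating conjunctions
    (r"^(in|on|of|at|by|with)$", "IN"),    # prepositions
    (r"^(can|could|will|would|should|may|might|must)$", "MD"),  # modals
    (r".*est$", "JJS"),                    # superlative adjectives
    (r".*s$", "NNS"),                      # plural nouns
    (r".*", "NN"),                         # fallback: singular noun
]

tagger = RegexpTagger(patterns)
tagged = tagger.tag("the 2 largest phones and a tablet".split())
print(tagged)
```

A rule-based tagger like this is far less accurate than a trained one; it only illustrates the (word, tag) pairs a tagger returns.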
Chunking and NER
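A short sketch of chunking, assuming NLTK is installed. It groups an already tagged sentence into noun phrases with `RegexpParser`, which needs no downloaded data. The sentence and the grammar are illustrative. Named entity recognition with `nltk.ne_chunk` works on the same kind of tagged input but depends on pre-trained models, so it is not run here.

```python
from nltk.chunk import RegexpParser

# A sentence that is already POS-tagged (illustrative input).
tagged = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# NP chunk: optional determiner, any number of adjectives, then a noun.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(tagged)

# Collect the words of every NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```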
Text classification algorithms in NLTK: Naive Bayes, Maximum Entropy, Decision Tree
Text classification
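A minimal sketch of text classification with NLTK's `NaiveBayesClassifier`. The training sentences, the labels `pos`/`neg`, and the helper `word_features` are all invented for illustration; a real classifier needs far more data.

```python
from nltk.classify import NaiveBayesClassifier

def word_features(text):
    """Bag-of-words features: each lowercase word maps to True (hypothetical helper)."""
    return {word: True for word in text.lower().split()}

# Tiny invented training set.
training = [
    ("great product love it", "pos"),
    ("excellent service very happy", "pos"),
    ("love this great phone", "pos"),
    ("terrible product hate it", "neg"),
    ("awful service very unhappy", "neg"),
    ("hate this terrible phone", "neg"),
]
train_set = [(word_features(text), label) for text, label in training]

classifier = NaiveBayesClassifier.train(train_set)

prediction_pos = classifier.classify(word_features("great service love it"))
prediction_neg = classifier.classify(word_features("terrible phone hate it"))
print(prediction_pos, prediction_neg)
```

The same feature dictionaries can be passed to NLTK's other classifiers, so swapping the algorithm mostly means changing the `train` call.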
Sentiment analysis https://github.com/pumpurs/SentimentWordsLV/
Document similarity detection
Tf-idf stands for term frequency-inverse document frequency. The tf-idf weight is often used in information retrieval and text mining: it is a statistical measure of how important a word is to a document in a collection or corpus.
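One possible way to turn tf-idf weights into a document similarity score, sketched with NLTK's `TextCollection` and a hand-written cosine similarity. The three documents and the helper names `tfidf_vector` and `cosine` are invented for illustration; `TextCollection` computes tf as the term count divided by document length and idf as the log of (number of documents / documents containing the term).

```python
import math
from nltk.text import TextCollection

# Invented, pre-tokenized documents.
docs = {
    "review_a": "the battery life of this phone is great".split(),
    "review_b": "this phone has great battery life".split(),
    "recipe": "mix the flour and sugar then bake the cake".split(),
}

collection = TextCollection(list(docs.values()))
vocabulary = sorted({token for tokens in docs.values() for token in tokens})

def tfidf_vector(tokens):
    """One tf-idf weight per vocabulary term (hypothetical helper)."""
    return [collection.tf_idf(term, tokens) for term in vocabulary]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = {name: tfidf_vector(tokens) for name, tokens in docs.items()}
sim_reviews = cosine(vectors["review_a"], vectors["review_b"])
sim_recipe = cosine(vectors["review_a"], vectors["recipe"])
print(round(sim_reviews, 3), round(sim_recipe, 3))
```

The two phone reviews share several weighted terms and score higher against each other than either does against the recipe.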
Similarity and concordance
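A brief sketch of concordance and similarity lookups on an `nltk.Text` object, assuming a reasonably recent NLTK (the `concordance_list` method is not in very old releases). The token list is invented. `concordance_list` returns every occurrence of a word with its context, and `similar` prints words that occur in the same contexts.

```python
from nltk.text import Text

# Invented, pre-tokenized sample.
tokens = ("the phone is great and the camera is poor "
          "and the battery is great").split()
text = Text(tokens)

# Every occurrence of "great", with surrounding context.
hits = text.concordance_list("great", width=40)
offsets = [hit.offset for hit in hits]
for hit in hits:
    print(hit.line)

# Prints words that share a context with "great".
text.similar("great")
```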
Dispersion Plot
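A dispersion plot shows where chosen words occur along the length of a text. `Text.dispersion_plot` draws it but requires matplotlib, so this sketch only computes the underlying word offsets with `ConcordanceIndex` and prints a crude text version. The token list and target words are invented.

```python
from nltk.text import ConcordanceIndex

# Invented, pre-tokenized sample.
tokens = ("the battery is weak but the price is fair "
          "and the battery charges fast").split()

index = ConcordanceIndex(tokens, key=lambda s: s.lower())

# Token positions of each target word: the data a dispersion plot draws.
targets = ["price", "battery"]
dispersion = {word: index.offsets(word) for word in targets}

# Crude text rendering: one row per word, "|" marks an occurrence.
for word, positions in dispersion.items():
    row = "".join("|" if i in positions else "." for i in range(len(tokens)))
    print(f"{word:>8} {row}")
```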
“Market and product research”
“Social CMS”
1.97 billion social network users
“Customer profiling / analytics”
70% of marketers used Facebook to gain
6.7 million people blog on blogging sites
pumpurs.alberts@gmail.com Big Data, Startups, Text Analysis, Internet of Things, Web Development

Python NLTK