Introduction to Natural Language Processing (NLP)

Natural Language Processing (NLP): Introduction
Presented by Venkatesh Murugadas


Problem of Natural Language
“Human language is highly ambiguous … It is also ever changing and evolving. People are great at producing language and understanding language, and are capable of expressing, perceiving, and interpreting very elaborate and nuanced meanings. At the same time, while we humans are great users of language, we are also very poor at formally understanding and describing the rules that govern language.”
- Page 1, Neural Network Methods in Natural Language Processing, 2017.
Source: https://machinelearningmastery.com/natural-language-processing/

“It is hard from the standpoint of the child, who must spend many years acquiring a language … it is hard for the adult language learner, it is hard for the scientist who attempts to model the relevant phenomena, and it is hard for the engineer who attempts to build systems that deal with natural language input or output. These tasks are so hard that Turing could rightly make fluent conversation in natural language the centrepiece of his test for intelligence.”
- Page 248, Mathematical Linguistics, 2010.
Source: https://machinelearningmastery.com/natural-language-processing/

Computational Linguistics
Linguistics is the scientific study of language, including its grammar, semantics, and phonetics. Computational linguistics is the modern study of linguistics using the tools of computer science. Yesterday’s linguist may be today’s computational linguist, as the use of computational tools and thinking has overtaken most fields of study.
Source: https://machinelearningmastery.com/natural-language-processing/

Statistical NLP
Statistical NLP aims to do statistical inference for the field of natural language. Statistical inference in general consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inference about this distribution.
- Page 191, Foundations of Statistical Natural Language Processing, 1999.
Source: https://machinelearningmastery.com/natural-language-processing/

Natural language processing (NLP) is a collective term referring to automatic computational processing of human languages. This includes both algorithms that take human-produced text as input, and algorithms that produce natural looking text as outputs.
- Page xvii, Neural Network Methods in Natural Language Processing, 2017.
Source: https://machinelearningmastery.com/natural-language-processing/
Natural language processing is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data.

We will take Natural Language Processing, or NLP for short, in a wide sense to cover any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves “understanding” complete human utterances, at least to the extent of being able to give useful responses to them.
- Page ix, Natural Language Processing with Python, 2009.
Source: https://machinelearningmastery.com/natural-language-processing/
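The simple end of that range, counting word frequencies, can be sketched in a few lines. This is only an illustration: the two sample texts and the helper names `word_frequencies` and `relative_frequency` are made up here, and a word is assumed to be any run of letters or apostrophes.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase words; a word is assumed to be a run of letters or apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def relative_frequency(counts, word):
    """Share of all counted words that are `word` (0.0 for an empty text)."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


# Two invented samples standing in for texts by different writers.
sample_a = "The cat sat on the mat. The cat slept."
sample_b = "A dog barked at a cat, and a cat ran."

freq_a = word_frequencies(sample_a)
freq_b = word_frequencies(sample_b)

# Comparing how often each sample uses a given function word.
print(relative_frequency(freq_a, "the"), relative_frequency(freq_b, "the"))
```

Relative rather than raw counts are compared so that texts of different lengths stay comparable.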

Areas of NLP
• Natural Language Understanding
• Natural Language Search
• Natural Language Generation
• Natural Language Interface

Applications of NLP
1. Text classification and Categorisation
2. Named Entity Recognition
3. Conversational AI
4. Paraphrase detection
5. Language generation and Multi-document Summarisation
6. Machine Translation
7. Speech recognition
8. Spell Checking

Corpus
• “A corpus is a large body of natural language text used for accumulating statistics on natural language text. The plural is corpora. Corpora often include extra information such as a tag for each word indicating its part-of-speech, and perhaps the parse tree for each sentence.”
Source: https://www.quora.com/In-NLP-what-is-the-difference-between-a-Lexicon-and-a-Corpus

NLP Pipeline
• Word Tokenisation
• Sentence Segmentation
• Parts of Speech Tagging
• Dependency Parsing
• Named Entity Recognition
• Relation Extraction

Word Tokenisation
Token: “A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.” E.g. “To sleep perhaps to dream.”
Type: “A type is the class of all tokens containing the same character sequence.”
Term: “A term is a (perhaps normalized) type that is included in the dictionary.”
Text normalisation: “Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.”
Source: nlp.anirbansaha.com
Tokenisation identifies the basic units to be processed. A tokenizer must often be customised to the data in question.
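The difference between tokens, types, and terms can be made concrete with the example sentence from the slide. This is a rough sketch: the `\w+` pattern and the use of lowercasing as the only normalisation step are assumptions chosen for brevity.

```python
import re

text = "To sleep perhaps to dream."

# Tokens: every occurrence of a word-like character sequence, in order.
tokens = re.findall(r"\w+", text)

# Types: distinct character sequences, so "To" and "to" are still different.
types = set(tokens)

# Terms: normalised types; lowercasing stands in for normalisation here.
terms = {t.lower() for t in types}

print(len(tokens), len(types), len(terms))
```

After normalisation "To" and "to" collapse into a single term, which is why the term count is lower than the type count.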

How is tokenisation done?
• NLTK (Natural Language Toolkit) and spaCy tokenizers use regular expressions (regex) to create tokens from running text.
• NLTK tokenizers: Penn Treebank Tokenizer, WordPunct Tokenizer, Tweet Tokenizer, MWETokenizer (Multi-Word Expression Tokenizer)
• Tokenisation is language dependent.
• Languages written without spaces between words, such as Chinese and Japanese, require a technique called word segmentation.

Problems
• Hyphenated words: co-operative, self-esteem
• URLs: “https://www.google.com/”
• Phone numbers: (541) 754-3010
• Compound nouns (names, places): New York
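One way to handle the problem cases listed above is to order the regex alternatives from most specific to most general, and to merge known multi-word names afterwards. The sketch below uses only Python's `re` module; it is not the NLTK or spaCy implementation, and the pattern, `tokenize`, and `merge_multiword` are hypothetical names written for this example.

```python
import re

# Alternatives are tried left to right, so specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+                # URLs
  | \(\d{3}\)\s?\d{3}-\d{4}     # US-style phone numbers such as (541) 754-3010
  | \w+(?:-\w+)+                # hyphenated words such as self-esteem
  | \w+                         # plain words
  | [^\w\s]                     # any single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens found by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


def merge_multiword(tokens, expressions):
    """Join known multi-word expressions, given as tuples such as ('New', 'York')."""
    merged, i = [], 0
    while i < len(tokens):
        for expr in expressions:
            if tuple(tokens[i:i + len(expr)]) == expr:
                merged.append(" ".join(expr))
                i += len(expr)
                break
        else:  # no expression starts at position i
            merged.append(tokens[i])
            i += 1
    return merged


sentence = "Call (541) 754-3010 about self-esteem, see https://www.google.com/ now."
print(tokenize(sentence))
print(merge_multiword(tokenize("I love New York."), [("New", "York")]))
```

The URL rule is deliberately crude: it would also swallow punctuation that directly follows a URL, which is one reason a tokenizer has to be tuned to the data at hand.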

Sentence Segmentation
• Sentence segmentation splits running text by detecting sentence boundaries.
• It is also called sentence boundary detection.
• NLTK provides the class PunktSentenceTokenizer, a widely used sentence tokenizer.

Punkt Architecture
Source: Unsupervised Multilingual Sentence Boundary Detection (Tibor Kiss, Jan Strunk)
Type-based classification (initials, ordinal numbers, texts)
1. Strong collocation
2. Internal periods
3. Penalty
Token-based classification
1. Orthographic heuristic (word shape)
2. Collocation heuristic
3. Frequent sentence starter heuristic

Problem: Ordinal numbers
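Why a period after an abbreviation or an ordinal number is a problem shows up as soon as text is split on every period. The sketch below contrasts a naive splitter with one that skips such periods. It is not Punkt, which learns abbreviations from raw text without supervision; here the abbreviation list is hand-written and hypothetical, and every number followed by a period is assumed to be an ordinal.

```python
import re

# Hypothetical hand-written list; Punkt would learn such items from a corpus.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc"}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def split_sentences(text):
    """Rejoin pieces that ended in an abbreviation or a number plus period."""
    sentences, current = [], []
    for piece in naive_split(text):
        if not piece:
            continue
        current.append(piece)
        last_word = piece.split()[-1].rstrip(".!?").lower()
        if last_word in ABBREVIATIONS or last_word.isdigit():
            continue  # assume the period did not end a sentence
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Smith arrived on the 3. day. He left early."
print(naive_split(text))      # four pieces, two of them wrong
print(split_sentences(text))  # two sentences
```

The fixed rule fails in the other direction for a sentence that really ends in a number ("It happened in 1999."), which is the kind of ambiguity the token-based heuristics are meant to resolve.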

Parts of Speech Tagging
• Part-of-speech tagging may not in itself solve any particular NLP problem. It is, however, a prerequisite that simplifies many different problems.
• English has 8 traditional parts of speech: noun, pronoun, verb, adjective, adverb, preposition, conjunction, interjection.
• There are open classes and closed classes.
• Open classes: noun, verb, adverb, adjective
• Some languages, such as Riau Indonesian, have been argued to make no part-of-speech distinctions. Korean is often described as having no separate class of adjectives.

• Tagsets range from the 8 traditional parts of speech to the 45 tags of the Penn Treebank tagset.
• The main tagged corpora:
• Brown corpus: a million words
• Wall Street Journal corpus: a million words
• Switchboard, a telephone speech corpus: 2 million words
Source: Speech and Language Processing, Daniel Jurafsky and James H. Martin

Parts of Speech Tagging algorithm
Generative: Hidden Markov Model
Source: Speech and Language Processing, Daniel Jurafsky and James H. Martin
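How an HMM tagger picks the most probable tag sequence can be illustrated with the Viterbi algorithm on a toy model. Everything numeric below is invented for illustration: the three-tag tagset and the start, transition, and emission probabilities are assumptions, not values estimated from a corpus, and the function assumes a non-empty sentence.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[i][tag] = (probability of the best path ending in tag at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


# Toy model with made-up probabilities.
TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6, "dog": 0.05},
}

tagged = viterbi(["the", "dog", "barks"], TAGS, START_P, TRANS_P, EMIT_P)
print(tagged)
```

The model is generative because it scores a tag sequence through P(tag | previous tag) and P(word | tag); a practical tagger would estimate both tables from a tagged corpus and work with log probabilities to avoid underflow on long sentences.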

Parts of Speech Tagging algorithm
Discriminative: Maximum Entropy Markov Model
A discriminative model can incorporate many features, which improves classification. The features are generated from a feature template.
Source: Speech and Language Processing, Daniel Jurafsky and James H. Martin
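A feature template can be pictured as a function that turns one position in a sentence into a set of named features. The features below are typical choices (neighbouring words, previous tag, suffix, word shape) picked for illustration; `extract_features` is a hypothetical helper, not the exact template of any particular tagger.

```python
def extract_features(words, i, prev_tag):
    """Build the feature dictionary for word i, given the tag chosen for word i-1."""
    word = words[i]
    return {
        "word": word.lower(),
        "prev_tag": prev_tag,
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "</s>",
        "suffix3": word[-3:].lower(),
        "is_capitalized": word[0].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
    }


words = ["Janet", "will", "back", "the", "bill"]
features = extract_features(words, 2, prev_tag="MD")
print(features)
```

A maximum entropy classifier would then learn a weight for each feature and tag pair and choose the tag with the highest score, moving through the sentence one word at a time.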

• Modern POS tagging algorithms use bidirectional methods.
• Stanford CoreNLP uses a log-linear part-of-speech tagger.
• It is based on the paper: https://nlp.stanford.edu/~manning/papers/tagging.pdf

Dependency Parser
• Dependency syntax postulates that syntactic structure consists of lexical items linked by binary asymmetric relations (“arrows”) called dependencies.
• The arrows are commonly typed with the names of grammatical relations.
• The dependencies form a tree (connected, acyclic, single-head).
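The three tree conditions can be checked mechanically once a parse is stored as a list of heads. In this sketch the encoding is an assumption: `heads[i]` holds the 1-based position of the head of word i+1, and 0 marks the root. Single-headedness then holds by construction, since each word has exactly one entry.

```python
def is_dependency_tree(heads):
    """Check that a head list describes a single-rooted, connected, acyclic tree."""
    n = len(heads)
    if n == 0 or heads.count(0) != 1:
        return False  # exactly one word must attach to the root
    if any(h < 0 or h > n for h in heads):
        return False  # a head must be the root or a word in the sentence
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:  # walk upwards until the root is reached
            if node in seen:
                return False  # returned to a visited word: cycle
            seen.add(node)
            node = heads[node - 1]
    return True


# "Economic news had little effect": Economic <- news <- had (root), little <- effect <- had
heads = [2, 3, 0, 5, 3]
print(is_dependency_tree(heads))
```

If every word reaches the root by following heads, the structure is connected and acyclic at the same time, so the single upward walk covers both conditions.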


• Stanford CoreNLP is based on the paper: https://nlp.stanford.edu/~sebschu/pubs/schuster-manning-lrec2016.pdf

Information Extraction Architecture
Source: https://www.nltk.org/book/ch07.html

Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organisations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Source: https://en.wikipedia.org/wiki/Named-entity_recognition

Noun Phrase Chunking
• This is a basic technique used for entity detection.
• In the accompanying figure, each of the larger boxes is called a chunk.
• Chunks are identified with the help of regular expressions (regex) over part-of-speech tags.
Source: https://www.nltk.org/book/ch07.html

Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which first identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and then links them to higher order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.).
Image source: https://www.nltk.org/book/ch07.html
Source: https://en.wikipedia.org/wiki/Shallow_parsing

• Noun chunking using regular expressions
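Regex-based noun chunking can be sketched by writing the tag sequence as a string and matching a pattern over it. The rule used here, an optional determiner, any number of adjectives, then a noun, is one commonly used noun-phrase pattern. The code relies only on Python's `re` module rather than a chunking library, `noun_chunks` is a hypothetical helper, and the tagged sentence is typed in by hand rather than produced by a tagger.

```python
import re

# Optional determiner, any number of adjectives, then a singular noun.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")


def noun_chunks(tagged):
    """Return noun-phrase chunks from a list of (word, tag) pairs."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each tag contributes exactly one "<", so counting them maps
        # character offsets back to word positions.
        first = tag_string.count("<", 0, match.start())
        length = match.group().count("<")
        chunks.append(" ".join(word for word, _ in tagged[first:first + length]))
    return chunks


tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]
print(noun_chunks(tagged_sentence))
```

Because the pattern works on tags rather than words, the quality of the chunks depends directly on the quality of the part-of-speech tagging that precedes it.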

• spaCy-based Named Entity Recognition
• It is trained on the OntoNotes 5 dataset.
• There are 18 pre-defined categories of entities.
Applications of NER
1. NLU
2. NLS

Discussion
For further queries, contact me at edu.venkateshdas@gmail.com
