NLP Introduction , applications, NLP Pipeline, Steps in NLP

Natural Language Processin Presented by: Dr.Kirti Verma CSE, LNCTE

Advantages of NLP •NLP helps users to ask questions about any subject and get a direct response within seconds. •NLP offers exact answers to the question means it does not offer unnecessary and unwanted information. •NLP helps computers to communicate with humans in their languages. •It is very time efficient. •Most of the companies use NLP to improve the efficiency of documentation processes, accuracy of documentation, and identify the information from large databases. Disadvantages of NLP A list of disadvantages of NLP is given below: •NLP may not show context. •NLP is unpredictable •NLP may require more keystrokes. •NLP is unable to adapt to the new domain, and it has a limited function that's why NLP is built for a single and specific tasksonly.

 Lexical means relating to words of a language.  During Lexical analysis given paragraphs are broken down into words or tokens. Each token has got specific meaning.  There can be instances where a single word can be interpreted in multiple ways.  The ambiguity that is caused by the word alone rather than the context is known as Lexical Ambiguity. Example: “Give me the bat!” In the above sentence, it is unclear whether bat refers to a nocturnal animal bat or a cricket bat. Just by looking at the word it does not provide enough information about the meaning hence we need to know the context in which it is used.  Lexical Ambiguity can be further categorized into Polysemy and homonymy. Lexical Ambiguity

a) Polysemy It refers to a single word having multiple but related meanings. Example: Light (adjective). • Thanks to the new windows, this room is now so light and airy = lit by the natural light of day. • The light green dress is better on you = pale colours. In the above example, light has different meanings but they are related to each other. b) Homonymy It refers to a single word having multiple but unrelated meanings. Example: Bear, left, Pole • A bear (the animal) can bear (tolerate) very cold temperatures. • The driver turned left (opposite of right) and left (departed from) the main road. Pole and Pole — The first Pole refers to a citizen of Poland who could either be referred to as Polish or a Pole. The second Pole refers to a bamboo pole or any other wooden pole.

 Syntactic meaning refers to the grammatical structure and rules that define how words should be combined to form sentences and phrases.  A sentence can be interpreted in more than one way due to its structure or syntax such ambiguity is referred to as Syntactic Ambiguity. Example 1: “Old men and women” The above sentence can have two possible meanings: All old men and young women. All old men and old women. Example 2: “John saw the boy with telescope. “ In the above case, two possible meanings are John saw the boy through his telescope. John saw the boy who was holding the telescope. Syntactic Ambiguity/ Structural ambiguity

 Semantics is nothing but “Meaning”.  The semantics of a word or phrase refers to the way it is typically understood or interpreted by people.  Syntax describes the rules by which words can be combined into sentences, while semantics describes what they mean.  Semantic Ambiguity occurs when a sentence has more than one interpretation or meaning. Scope abiguity Example 1: “Seema loves her mother and Sriya does too.” The interpretations can be Sriya loves Seema’s mother or Sriya likes her mother. Semantic Ambiguity

Anaphoric Ambiguity A word that gets its meaning from a preceding word or phrase is called an anaphor. Example: “Susan plays the piano. She likes music.” In this example, the word she is an anaphor and refers back to a preceding expression i.e., Susan. The linguistic element or elements to which an anaphor refers is called an antecedent. The relationship between anaphor and antecedent is termed ‘anaphora’. Ambiguity that arises when there is more than one reference to the antecedent is known as Anaphoric Ambiguity. Example 1: “The horse ran up the hill. It was very steep. It soon got tired.” In this example, there are two ‘it’, and it is unclear to which each ‘it’ refers, this leads to Anaphoric Ambiguity. The sentence will be meaningful if first ‘it’ refers to the hill and 2nd ‘it’ refers to the horse. Anaphors may not be in the immediately previous sentence. They may present in the sentences before the previous one or may present in the same sentence.

Pragmatic ambiguity Pragmatics focuses on the real-time usage of language like what the speaker wants to convey and how the listener infers it. Situational context, the individuals’ mental states, the preceding dialogue, and other elements play a major role in understanding what the speaker is trying to say and how the listeners perceive it. Example:

Step 1: Sentence segmentation Sentence segmentation is the first step in the NLP pipeline. It divides the entire paragraph into different sentences for better understanding. For example, "London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the southeast of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium." After using sentence segmentation, we get the following result: “London is the capital and most populous city of England and the United Kingdom.” “Standing on the River Thames in the southeast of the island of Great Britain, London has been a major settlement for two millennia.” “It was founded by the Romans, who named it Londinium.”

#Program for sentence tokenization Using NLTK import nltk from nltk.tokenize import sent_tokenize def tokenize_sentences(text): sentences = sent_tokenize(text) return sentences text = "NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.“ # Tokenize sentences sentences = tokenize_sentences(text) # Print tokenized sentences for i, sentence in enumerate(sentences): print(f"Sentence {i+1}: {sentence}")

Step 2: Word tokenization Word tokenization breaks the sentence into separate words or tokens. This helps understand the context of the text. When tokenizing the sentence “London is the capital and most populous city of England and the United Kingdom”, it is broken into separate words, i.e., “London”, “is”, “the”, “capital”, “and”, “most”, “populous”, “city”, “of”, “England”, “and”, “the”, “United”, “Kingdom”, “.”

import nltk #nltk.download('punkt') # Download the necessary tokenization models from nltk.tokenize import word_tokenize def tokenize_words(text): words = word_tokenize(text) return words # Example text text = "NLTK is a leading platform for building Python programs to work with human language data." # Tokenize words words = tokenize_words(text) # Print tokenized words print(words) Word tokenization using nltk

Step 3: Stemming Stemming helps in preprocessing text. The model analyzes the parts of speech to figure out what exactly the sentence is talking about. Stemming normalizes words into their base or root form. In other words, it helps to predict the parts of speech for each token. For example, intelligently, intelligence, and intelligent. These words originate from a single root word ‘intelligen’. However, in English there’s no such word as ‘intelligen’.

from nltk.stem import PorterStemmer porter = PorterStemmer() words = ['generous','fairly','sings','generation'] for word in words: print(word,"--->",porter.stem(word)) Step 3: Stemming code in Python using NLTK library

Step 4: Lemmatization Lemmatization removes inflectional endings and returns the canonical form of a word or lemma. It is similar to stemming except that the lemma is an actual word. For example, ‘playing’ and ‘plays’ are forms of the word ‘play’. Hence, play is the lemma of these words. Unlike a stem (recall ‘intelligen’), ‘play’ is a proper word.

### import necessary libraries from nltk.stem import WordNetLemmatizer from nltk.tokenize import word_tokenize text = "Very orderly and methodical he looked, with a hand on each knee, and a loud watch ticking a sonorous sermon under his flapped newly bought waist-coat, as though it pitted its gravity and longevity against the levity and evanescence of the brisk fire." # tokenise text tokens = word_tokenize(text) wordnet_lemmatizer = WordNetLemmatizer() lemmatized = [wordnet_lemmatizer.lemmatize(token) for token in tokens] print(lemmatized) Step 4: Lemmatization using nltk

Step 5: Stop word analysis The next step is to consider the importance of each and every word in a given sentence. In English, some words appear more frequently than others such as "is", "a", "the", "and". As they appear often, the NLP pipeline flags them as stop words. They are filtered out so as to focus on more important words.

program to eliminate stopwords using nltk import nltk from nltk.corpus import stopwords from nltk.tokenize import word_tokenize def remove_stopwords(text): # Tokenize the text into words words = word_tokenize(text) # Get English stopwords english_stopwords = set(stopwords.words('english')) # Remove stopwords from the tokenized words filtered_words = [word for word in words if word.lower() not in english_stopwords] # Join the filtered words back into a single string filtered_text = ' '.join(filtered_words) return filtered_text # Example text text = "NLTK is a leading platform for building Python programs to work with human language data." # Remove stopwords filtered_text = remove_stopwords(text) # Print filtered text print(filtered_text)

Step 6: Dependency parsing Next comes dependency parsing which is mainly used to find out how all the words in a sentence are related to each other. To find the dependency, we can build a tree and assign a single word as a parent word. The main verb in the sentence will act as the root node. The edges in a dependency tree represent grammatical relationships. These relationships define words’ roles in a sentence, such as subject, object, modifier, or adverbial. Subject-Verb Relationship: In a sentence like “She sings,” the word “She” depends on “sings” as the subject of the verb.

Modifier-Head Relationship: In the sentence “The big cat,” “big” modifies “cat,” creating a modifier-head relationship. Direct Object-Verb Relationship: In “She eats apples,” “apples” is the direct object that depends on the verb “eats.” Adverbial-Verb Relationship: In “He sings well,” “well” modifies the verb “sings” and forms an adverbial- verb relationship.

Dependency Tag Description acl clausal modifier of a noun (adnominal clause) acl:relcl relative clause modifier advcl adverbial clause modifier advmod adverbial modifier advmod:emph emphasizing phrase, intensifier advmod:lmod locative adverbial modifier amod adjectival modifier appos appositional modifier aux auxiliary aux:move passive auxiliary case case-marking cc coordinating conjunction cc:preconj preconjunct ccomp clausal complement clf classifier compound compound conj conjunct cop copula csubj clausal topic csubj:move clausal passive topic dep unspecified dependency det determiner det:numgov рrоnоminаl quаntifier gоverning the саse оf the nоun det:nummod рrоnоminаl quаntifier agreeing with the саse оf the nоun det:poss possessive determiner discourse discourse ingredient dislocated dislocated parts expl expletive expl:impers impersonal expletive expl:move reflexive pronoun utilized in reflexive passive expl:pv reflexive clitic with an inherently reflexive verb mounted mounted multiword expression flat flat multiword expression flat:overseas overseas phrases flat:title names goeswith goes with iobj oblique object checklist checklist mark marker nmod nominal modifier nmod:poss possessive nominal modifier nmod:tmod temporal modifier

Step 7: Part-of-speech (POS) tagging POS tags contain verbs, adverbs, nouns, and adjectives that help indicate the meaning of words in a grammatically correct way in a sentence.

POS tagging is a key step in NLP and is used in many applications, including: Text analysis Machine translation Information retrieval Speech recognition Parsing Sentiment analysis: Part-of-speech (POS) tagging is a process in Natural Language Processing (NLP) that assigns grammatical categories to words in a sentence. This helps algorithms understand the meaning and structure of a text.

Program to perform Parts of Speech tagging using nltk #Parts of Speech Tagging import nltk from nltk.tokenize import word_tokenize def pos_tagging(text): # Tokenize the text into words words = word_tokenize(text) # Perform POS tagging tagged_words = nltk.pos_tag(words) return tagged_words # Example text text = "NLTK is a leading platform for building Python programs to work with human language data." # Perform POS tagging tagged_text = pos_tagging(text) # Print POS tagged text print(tagged_text)

[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ('for', 'IN'), ('building', 'VBG'), ('Python', 'NNP'), ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'), ('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')] Output: # CODE to print all the POS TAGS import nltk nltk.download('tagsets') nltk.help.upenn_tagset()

Step 8: Named Entity Recognition (NER) Named Entity Recognition (NER) is the process of detecting the named entity such as person name, movie name, organization name, or location. Example: Steve Jobs introduced iPhone at the Macworld Conference in San Francisco, California.

• Lexicon Based Method The NER uses a dictionary with a list of words or terms. • Rule Based Method The Rule Based NER method uses a set of predefined rules guides the extraction of information. These rules are based on patterns and context. • Machine Learning-Based Method Multi-Class Classification with Machine Learning Algorithms One way is to train the model for multi-class classification using different machine learning algorithms, but it requires a lot of labelling Conditional Random Field (CRF) , it is implemented by both NLP Speech Tagger and NLTK. • Deep Learning Based Method Deep learning NER system is much more accurate than previous method, as it is capable to assemble words. Types of Named Entity Recognition

#Named Entity Recognition from nltk.tokenize import word_tokenize from nltk import pos_tag, ne_chunk nltk.download('averaged_perceptron_tagger') nltk.download('maxent_ne_chunker') nltk.download('words') def ner(text): words = word_tokenize(text) tagged_words = pos_tag(words) named_entities = ne_chunk(tagged_words) return named_entities text = "Apple is a company based in California, United States. Steve Jobs was one of its founders." named_entities = ner(text) print(named_entities)

(S (GPE Apple/NNP) is/VBZ a/DT company/NN based/VBN in/IN (GPE California/NNP) ,/, (GPE United/NNP States/NNPS) ./. (PERSON Steve/NNP Jobs/NNP) was/VBD one/CD of/IN its/PRP$ founders/NNS ./.)

Step 9: Chunking Chunking is used to collect the individual piece of information and grouping them into bigger pieces of sentences. Chunk extraction or partial parsing is a process of meaningful extracting short phrases from the sentence (tagged with Part-of- Speech). parts of speech namely the noun, verb, adjective, adverb, preposition, conjunction, pronoun, and interjection.

import nltk from nltk.chunk import RegexpParser from nltk.tokenize import word_tokenize # Example sentence sentence = "Educative Answers is a free web encyclopedia written by devs for devs." # Tokenization tokens = word_tokenize(sentence) # POS tagging pos_tags = nltk.pos_tag(tokens) # Chunking patterns chunk_patterns = r""" NP: {<DT>?<JJ>*<NN>} # Chunk noun phrases VP: {<VB.*><NP|PP>} # Chunk verb phrases "" # Create a chunk parser chunk_parser = RegexpParser(chunk_patterns) # Perform chunking result = chunk_parser.parse(pos_tags) # Print the chunked result print(result)

CHUNKING : "Educative Answers is a free web encyclopedia written by devs for devs." (S Educative/JJ Answers/NNPS (VP is/VBZ (NP a/DT free/JJ web/NN)) (NP encyclopedia/NN) written/VBN by/IN (NP devs/NN) for/IN (NP devs/NN) ./.)

After chunking the text result.draw will draw a chunking tree in python

• A lexicon is defined as a collection of words and phrases in a given language, with the analysis of this collection being the process of splitting the lexicon into components, based on what the user sets as parameters – paragraphs, phrases, words, or characters. • Morphological analysis is the process of identifying the morphemes of a word. • A morpheme is a basic unit of English language construction, which is a small element of a word, that carries meaning. • These can be either a free morpheme (e.g. walk) or a bound morpheme (e.g. -ing, -ed), with the difference between the two being that the latter cannot stand on it’s own to produce a word with meaning, and should be assigned to a free morpheme to attach meaning. Phase I: Lexical or Morphological analysis

Importance of Morphological Analysis Morphological analysis is crucial in NLP for several reasons: • Understanding Word Structure: It helps in deciphering the composition of complex words. • Predicting Word Forms: It aids in anticipating different forms of a word based on its morphemes. • Improving Accuracy: It enhances the accuracy of tasks such as part-of-speech tagging, syntactic parsing, and machine translation.

• This phase is essential for understanding the structure of a sentence and assessing its grammatical correctness. • It involves analyzing the relationships between words and ensuring their logical consistency by comparing their arrangement against standard grammatical rules. • Consider the following sentences: • Correct Syntax: “John eats an apple.” • Incorrect Syntax: “Apple eats John an.” • POS Tags: • John: Proper Noun (NNP) • eats: Verb (VBZ) • an: Determiner (DT) • apple: Noun (NN) Phase II: Syntactic analysis or Parsing

• Syntactically Correct but Semantically Incorrect: “Apple eats a John.” • This sentence is grammatically correct but does not make sense semantically. • An apple cannot eat a person, highlighting the importance of semantic analysis in ensuring logical coherence. • Literal Interpretation: “What time is it?” • This phrase is interpreted literally as someone asking for the current time, demonstrating how semantic analysis helps in understanding the intended meaning. Phase III: Semantic Analysis

Semantic Analysis Semantic Analysis is the third phase of Natural Language Processing (NLP), focusing on extracting the meaning from text. Semantic analysis aims to understand the dictionary definitions of words and their usage in context. It determines whether the arrangement of words in a sentence makes logical sense. Key Tasks in Semantic Analysis Named Entity Recognition (NER): NER identifies and classifies entities within the text, such as names of people, places, and organizations. These entities belong to predefined categories and are crucial for understanding the text’s content. Word Sense Disambiguation (WSD): WSD determines the correct meaning of ambiguous words based on context. For example, the word “bank” can refer to a financial institution or the side of a river. WSD uses contextual clues to assign the appropriate meaning.

Discourse integration is the analysis and identification of the larger context for any smaller part of natural language structure (e.g. a phrase, word or sentence). During this phase, it’s important to ensure that each phrase, word, and entity mentioned are mentioned within the appropriate context. Contextual Reference: “This is unfair!” To understand what “this” refers to, we need to examine the preceding or following sentences. Without context, the statement’s meaning remains unclear. Anaphora Resolution: “Taylor went to the store to buy some groceries. She realized she forgot her wallet.” In this example, the pronoun “she” refers back to “Taylor” in the first sentence. Understanding that “Taylor” is the antecedent of “she” is crucial for grasping the sentence’s meaning. Phase IV: Discourse integration

• It focusing on interpreting the inferred meaning of a text beyond its literal content. • Human language is often complex and layered with underlying assumptions, implications, and intentions that go beyond straightforward interpretation. Contextual Greeting: “Hello! What time is it?” “Hello!” is more than just a greeting; it serves to establish contact. “What time is it?” might be a straightforward request for the current time, but it could also imply concern about being late. Figurative Expression: “I’m falling for you.” The word “falling” literally means collapsing, but in this context, it means the speaker is expressing love for someone. Phase V: Pragmatic Analysis

What is the difference between large language models and generative AI? Generative AI is an umbrella term that refers to artificial intelligence models that have the capability to generate content. Generative AI can generate text, code, images, video, and music. Examples of generative AI include Midjourney, DALL-E, and ChatGPT. Large language models are a type of generative AI that are trained on text and produce textual content. ChatGPT is a popular example of generative text AI. All large language models are generative AI1 . LLMs have achieved remarkable advancements in various language-related applications such as text generation, translation, summarization, question-answering, and more.

A large language model is a computer program that learns and generates human-like language using a transformer architecture trained on vast training data. Large Language Models (LLMs) are foundational machine learning models that use deep learning algorithms to process and understand natural language. These models are trained on massive amounts of text data to learn patterns and entity relationships in the language. LLMs can perform many types of language tasks, such as translating languages, analyzing sentiments, chatbot conversations, and more. A large language model is an advanced type of language model that is trained using deep learning techniques on massive amounts of text data. These models are capable of generating human-like text and performing various natural language processing tasks. Large Language Model

NLP Introduction , applications, NLP Pipeline, Steps in NLP

NLP Introduction , applications, NLP Pipeline, Steps in NLP

More Related Content

What's hot

Similar to NLP Introduction , applications, NLP Pipeline, Steps in NLP

More from Kirti Verma

Recently uploaded

NLP Introduction , applications, NLP Pipeline, Steps in NLP