The Introduction and History of NLP
Lecture 1
Ms. Umaira Khurshid
Course Objective
Goal: provide a toolkit of concepts and methods to describe and tackle NLP problems in real life.
● Introduce the core ideas underlying modern NLP algorithms
● Focus on Machine Learning & Deep Learning applied to NLP
● Focus on empirical considerations (accuracy, memory, speed) as opposed to theoretical guarantees
Course Evaluation
Course Activities (Grading Criteria)
● Class Participation (10%)
● Assignments/Quizzes (10% + 10%)
● Semester Project (10%)
● Mid Term Exam (35%)
● Final Term Exam (75%)
Course Evaluation
● There will be 4-6 quizzes and 2-3 assignments
● Late submissions or email submissions will not be entertained
● There will be separate marks for class participation
Course Project
● Implementation of an advanced NLP algorithm
● To be done in groups of two
● More details will be announced in Week 4
Why Natural Language Processing?
What do we use language for?
● We communicate using language
● We think (partly) with language
● We tell stories in language
● We build scientific theories with language
● We make friends / build relationships
Why NLP?
● Access knowledge (search engines, recommender systems…)
● Communicate (e.g. translation)
● Linguistics and cognitive sciences (analyse languages themselves)
Why Natural Language Processing?
Amount of online textual data…
● 70 billion web pages online (1.9 billion websites)
● 55 million Wikipedia articles
…growing at a fast pace:
● 9,000 tweets per second
● 3 million emails per second (60% spam)
Why Natural Language Processing?
Potential users of Natural Language Processing
● 7.9 billion people use some sort of language (January 2022)
● 4.7 billion internet users (January 2021) (~59%)
● 4.2 billion social media users (January 2021) (~54%)
Why Natural Language Processing?
What products?
● Search: 2+ billion Google users, 700 million Baidu users
● Social media: 3+ billion users of social media (Facebook, Instagram, WeChat, Twitter...)
● Voice assistants: 100+ million users (Alexa, Siri, Google Assistant)
● Machine translation: 500 million users of Google Translate
Why is Language Hard to Model?
A Definition of Language
Definition 1: Language is a means to communicate; it is a semiotic system. By that we simply mean that it is a set of signs. A sign is a pair consisting of [...] a signifier and a signified.
Definition 2: A sign consists of a phonological structure, a morphological structure, a syntactic structure and a semantic structure.
The Six Levels of Linguistic Analysis
The Six Levels of Linguistic Analysis
● Extra-linguistic context: the broader context surrounding the language, such as the speaker's intent, the situation in which the communication takes place, or background knowledge.
● Linguistic context: the words and phrases within a sentence or passage that provide clues about the meaning of a particular word.
● Semantic level: the meaning of words and sentences.
● Syntactic level: the grammatical structure of sentences.
● Morphological level: the structure of words, including prefixes, suffixes, and root words.
● Phonological level: the sounds that make up words.
The 5 Challenges of NLP
1. Productivity
2. Ambiguity
3. Variability
4. Diversity
5. Sparsity
Productivity
Definition: "the property of the language-system which enables native speakers to construct and understand an indefinitely large number of utterances, including utterances that they have never previously encountered." (Lyons, 1977)
➔ New words, senses and structures are introduced into languages all the time.
Examples: staycation and social distance were added to the Oxford Dictionary in 2021.
Ambiguity
Ambiguity in words refers to the situation where a word has multiple possible meanings.
Most linguistic observations (speech, text) are open to several interpretations.
We (humans) disambiguate, i.e. find the correct interpretation, using all kinds of signals (linguistic and extra-linguistic).
Ambiguity can appear at all levels (phonology, graphemics, morphology, syntax, semantics).
Ambiguity: Syntactic Ambiguity
Syntactic ambiguity, also known as structural ambiguity or grammatical ambiguity, occurs when a sentence's structure allows for multiple interpretations. For example, "I saw the man with the telescope" can mean either that the telescope was used for seeing or that the man was carrying it.
Ambiguity: Semantic Ambiguity
● Polysemy: e.g. set, arm, head ("Head of New Zealand is a woman")
● Named entities: e.g. Michael Jordan ("Michael Jordan is a professor at Berkeley")
● Object/Color: e.g. cherry ("Your cherry coat")
Ambiguity: Pragmatic Ambiguity
Pragmatic ambiguity arises when the meaning of a sentence or utterance is unclear due to factors beyond the words themselves and the grammatical structure.
Examples (ambiguous headlines):
● Two Soviet ships collide, one dies
● Dealers will hear car talk at noon
Variation
Language varies at all levels:
● Phonetic (accent)
● Morphological, lexical (spelling)
● Syntactic
● Semantic
Phonetic Variation (cf. Sagot)
Spelling and Syntactic Variation
Variation: Determinants
● Who is talking?
● To whom?
● Where? Work, home, restaurant…
● When? 19th century, 2008, 2022…
● About what? Specialised domain, the weather…
Essentially, the variability of a language depends on:
● Social context
● Geography
● Sociology
● Date
● Topic
Diversity
● About 7,000 languages are spoken in the world
● About 60% are found in written form (cf. Omniglot)
Graphemic Diversity (source: Wikipedia)
Syntactic Diversity
A key characteristic of the syntax of a given language is its word order:
● Word order differs across languages
● The degree of freedom in word order also differs across languages
● We characterize word orders by their Subject (S), Verb (V), Object (O) order
Syntactic Diversity (Dyer et al., 2013)
Semantic Diversity
● Words partition the semantic space
● This partition is very diverse across languages (Dyer et al., 2013)
Statistical Description of a Corpus
We statistically describe a corpus of 800 scientific articles.
Question: if we plot the number of occurrences of each word against its rank, what will we observe?
Statistical Description of a Corpus
In this corpus of 800 scientific articles:
● the is the most frequent word (rank 1), with 8119 occurrences
● estimate is observed 56 times and is the 1001st most frequent word
● About 6,000 words are observed only once in the dataset (e.g. stakeholders, pending, score…)
Statistical Description of a Corpus
➔ In a large enough corpus, the word distribution follows a Zipf law, i.e. the frequency f_w of a word w is (roughly) inversely proportional to a power of its rank k: f_w ∝ 1 / k^s.
● A Zipf law is a power-law relation between rank and frequency: the most frequent entities are much more frequent than the less frequent ones.
● Under a Zipf law, log(f_w) and log(k) are linearly related.
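To make the rank-frequency relationship concrete, here is a minimal Python sketch (the corpus file name and the whitespace tokenization are assumptions for illustration, not the lecture's setup) that counts word occurrences and prints log-rank against log-frequency:

```python
import math
from collections import Counter

# Hypothetical corpus: any large plain-text file will do.
with open("articles.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Naive whitespace tokenization, just for illustration.
tokens = text.split()
counts = Counter(tokens)

# Sort words by frequency: rank 1 = most frequent word.
ranked = counts.most_common()
for rank, (word, freq) in enumerate(ranked[:10], start=1):
    # Under a Zipf law, log(freq) decreases roughly linearly with log(rank).
    print(rank, word, freq, round(math.log(rank), 2), round(math.log(freq), 2))
```

Plotting the last two columns against each other on a large corpus should give an approximately straight, downward-sloping line.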
What is Natural Language Processing?
In a nutshell, NLP consists of handling the complexities of natural languages "to do something":
● Raw text / speech → structured information
● Raw text / speech → (controlled) text / speech
In this course we will focus on textual data.
Framework
We assume:
● A token is the basic unit of discrete data, defined to be an item from a vocabulary indexed by 1, ..., V.
● A document is a sequence of N words denoted by d = (w1, w2, ..., wN), where wn is the n-th word in the sequence.
● A corpus is a collection of M documents denoted by D = (d1, d2, ..., dM).
Examples: Wikipedia, all the articles of the NYT in 2021…
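A minimal sketch of this framework in Python (the two documents are made-up examples): a vocabulary maps each token type to an index 1..V, a document is a sequence of tokens, and a corpus is a list of documents.

```python
# Toy corpus of M = 2 documents (made-up examples).
corpus = [
    ["my", "name", "is", "bob"],   # d1 = (w1, ..., w4)
    ["i", "live", "in", "ny"],     # d2
]

# Vocabulary: each distinct token type gets an index 1..V.
vocab = {tok: i + 1 for i, tok in enumerate(sorted({t for d in corpus for t in d}))}
V = len(vocab)

# A document can then be represented as a sequence of vocabulary indices.
encoded = [[vocab[t] for t in doc] for doc in corpus]
print(V, encoded)
```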
Token
Depending on the end task, a token can be:
● A word
● A sub-word: e.g. a sequence of 3 characters
● A character
● A sequence of characters (sometimes a word, sometimes several words, sometimes a sub-word…)
Document
A document can be:
● A sentence
● A paragraph
● A sequence of characters
Text Segmentation
Definition: text segmentation is the process of splitting raw text (i.e. a list of characters) into units of interest.
Two levels of segmentation are (usually) required:
● Split raw text into modeling units (e.g. sentences, paragraphs, 1000 characters, web pages...)
● Split modeling units into sequences of basic units, referred to as tokens (e.g. words, word pieces, characters, ...)
Two distinct approaches:
● Linguistically informed, e.g. word or sentence segmentation...
● Statistically informed, e.g. frequent sub-words (word pieces, sentence pieces...)
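A minimal sketch of the linguistically informed, rule-based approach (the regex rules and the second French sentence are assumptions for illustration; real segmenters handle many more cases, e.g. abbreviations):

```python
import re

raw = "Une industrie métallurgique existait. Elle employait des ouvriers."

# Level 1: split raw text into modeling units (here, sentences), using a
# simplistic rule: a sentence-final punctuation mark followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", raw)

# Level 2: split each modeling unit into tokens (word characters vs. punctuation).
tokens = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

print(sentences)
print(tokens)
```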
Tokenization
Definition: tokenization consists of segmenting raw textual data into tokens. It can be framed as a character-level task:
input: une industrie métallurgique existait.
output: IIIEIIIIIIIIIIIIIIEIIIIIIIIIIIIIIIIIIIIEIIIIIIIIIIIEE
● An easy task for most languages and domains
● Can be very complex in some cases (Chinese, social media...)
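Below is a sketch of one way to produce such character-level labels, under an assumed convention (E = last character of a token, I = every other character, including whitespace); the slide's exact labelling scheme may differ.

```python
def char_labels(text, tokens):
    """Label each character: 'E' if it ends a token, 'I' otherwise.
    This is one possible convention; the slide's scheme may differ."""
    labels = ["I"] * len(text)
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)   # locate the token in the raw string
        end = start + len(tok) - 1     # index of its last character
        labels[end] = "E"
        pos = end + 1
    return "".join(labels)

text = "une industrie métallurgique existait."
tokens = ["une", "industrie", "métallurgique", "existait", "."]
print(char_labels(text, tokens))  # one I/E label per character of the input
```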
NLP Tasks: Modeling Framework
Task taxonomy:
● If Y is a single label and X a sequence of tokens (e.g. a sentence): Sequence Classification
● If we have one label per token: Sequence Labelling
● If Y is a sequence of tokens: Sequence Prediction
● If Y is a graph, a tree or a complex structured output: Structure Prediction
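A minimal sketch of this taxonomy as Python type signatures (the function names and the arc representation are illustrative assumptions, not a standard API); it only shows the shape of X and Y for each task family:

```python
from typing import List, Tuple

Token = str
Label = str

# Sequence Classification: X is a sequence of tokens, Y is a single label.
def classify(x: List[Token]) -> Label: ...

# Sequence Labelling: one label per token (e.g. POS tagging, NER, slot filling).
def label_sequence(x: List[Token]) -> List[Label]: ...

# Sequence Prediction: Y is itself a sequence of tokens (e.g. machine translation).
def predict_sequence(x: List[Token]) -> List[Token]: ...

# Structure Prediction: Y is a tree/graph, sketched here as a list of
# (head_index, dependent_index, relation) arcs (e.g. dependency parsing).
Arc = Tuple[int, int, str]
def parse(x: List[Token]) -> List[Arc]: ...
```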
Document Classification
Example labels: Politics, Economy, Travel, Geopolitics, …
Document Ranking (Retriever)
NLP Tasks: Part-of-Speech Tagging
POS tagging: find the grammatical category of each word.
[My, name, is, Bob, and, I, live, in, NY, !]
[PRON, NOUN, VERB, NOUN, CC, PRON, VERB, PREP, NOUN, PUNCT]
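As a toy illustration of sequence labelling (not an actual POS tagger), here is a hypothetical dictionary-lookup baseline built from the slide's example; real taggers are learned from annotated corpora:

```python
# Toy lookup table built from the slide's example (purely illustrative).
TAGS = {
    "My": "PRON", "name": "NOUN", "is": "VERB", "Bob": "NOUN", "and": "CC",
    "I": "PRON", "live": "VERB", "in": "PREP", "NY": "NOUN", "!": "PUNCT",
}

def pos_tag(tokens):
    # One label per token: a sequence-labelling task.
    return [TAGS.get(tok, "UNK") for tok in tokens]

print(pos_tag(["My", "name", "is", "Bob", "and", "I", "live", "in", "NY", "!"]))
```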
Syntactic Parsing
Syntactic parsing consists of extracting the syntactic structure of a sentence. For instance, dependency parsing predicts a directed acyclic graph (a tree).
Slot-Filling / Intent Detection
Intent detection is a sequence classification task that consists of classifying the intent of a user into a pre-defined category.
Slot-filling is a sequence labelling task that consists of identifying specific parameters in a user request.
Can you please play Hello from Adele?
Intent: play_music
Slots:
[Can, you, please, play, Hello, from, Adele, ?]
[O, O, O, O, SONG, O, ARTIST, O]
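A small sketch of how the slide's slot labels can be turned into a structured request (the output format is an assumption for illustration):

```python
tokens = ["Can", "you", "please", "play", "Hello", "from", "Adele", "?"]
slots  = ["O", "O", "O", "O", "SONG", "O", "ARTIST", "O"]

# Group labelled tokens into a dictionary of slot values (toy post-processing).
extracted = {}
for tok, lab in zip(tokens, slots):
    if lab != "O":
        extracted.setdefault(lab, []).append(tok)

request = {"intent": "play_music",
           "slots": {k: " ".join(v) for k, v in extracted.items()}}
print(request)  # {'intent': 'play_music', 'slots': {'SONG': 'Hello', 'ARTIST': 'Adele'}}
```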
Semantic Role Labelling (SRL)
SRL is the task of finding, for each predicate in a sentence, the semantic roles of its arguments. Given a sentence, SRL predicts: who did what to whom, when, where, why, how.
NLP Tasks: Named Entity Recognition
NER: find the named entities in a sentence.
[My, name, is, Bob, and, I, live, in, NY, !]
[O, O, O, PERSON, O, O, O, O, LOCATION, O]
Machine Translation
INPUT: My name is Bob and I live in NY !
OUTPUT: Je m’appelle Bob et je vis à New-York !
Question Answering
INPUT: How many episodes in season 2 breaking bad?
OUTPUT: 13
Brief History of NLP
Symbolic (1940-2000): focus on rule-based systems and formal grammars; development of linguistic resources (lexicons, ontologies, grammars…)
Statistical Learning (1990-2010): statistical learning theory (SVM, Random Forest), probabilistic graphical models (e.g. LDA, HMM); development of annotated datasets
Deep Learning (2010-today): deep learning architectures (Transformer); transfer learning in NLP (word2vec, BERT, CamemBERT, GPT); more compute, larger (raw) datasets, open-source deep learning libraries
1949: The First Machine Translation “Model”
1949 Memorandum on Translation, Warren Weaver. First to propose the idea of using “electronic computers” to do translation:
● Using Shannon’s information theory to frame machine translation as a cryptographic problem
● Modeling context to disambiguate between word senses
● “Going down” from each language to a universal language in order to translate
1964: ELIZA, the First Conversational Bot
