The Introduction and History of NLP
Lecture 1
Ms. Umaira Khurshid
Course Objective
Goal: provide a toolkit of concepts and methods to describe and tackle NLP problems in real life.
● Introduce the core ideas underlying modern NLP algorithms
● Focus on Machine Learning & Deep Learning applied to NLP
● Focus on empirical considerations (accuracy, memory, speed) as opposed to theoretical guarantees
Course Evaluation
Course Activities (Grading Criteria)
● Class Participation (10%)
● Assignments/Quizzes (10% + 10%)
● Semester Project (10%)
● Mid Term Exam (35%)
● Final Term Exam (75%)
Course Evaluation
● There will be 4-6 quizzes and 2-3 assignments
● Late submissions or email submissions will not be entertained
● There will be separate marks for class participation
Course Project
● Implementation of an advanced NLP algorithm
● To be done in groups of two
● More details will be announced in Week 4
Why Natural Language Processing?
What do we use language for?
● We communicate using language
● We think (partly) with language
● We tell stories in language
● We build scientific theories with language
● We make friends / build relationships
Why NLP?
● Access knowledge (search engines, recommender systems…)
● Communicate (e.g. translation)
● Linguistics and cognitive sciences (analyse languages themselves)
Why Natural Language Processing?
Amount of online textual data…
● 70 billion web pages online (1.9 billion websites)
● 55 million Wikipedia articles
…growing at a fast pace:
● 9,000 tweets per second
● 3 million emails per second (60% spam)
Why Natural Language Processing?
Potential users of Natural Language Processing
● 7.9 billion people use some sort of language (January 2022)
● 4.7 billion internet users (January 2021) (~59%)
● 4.2 billion social media users (January 2021) (~54%)
Why Natural Language Processing?
What products?
● Search: 2+ billion Google users, 700 million Baidu users
● Social media: 3+ billion users of social media (Facebook, Instagram, WeChat, Twitter...)
● Voice assistants: 100+ million users (Alexa, Siri, Google Assistant)
● Machine translation: 500 million users of Google Translate
Why is Language Hard to Model?
A Definition of Language
Definition 1: Language is a means to communicate; it is a semiotic system. By that we simply mean that it is a set of signs. A sign is a pair consisting of [...] a signifier and a signified.
Definition 2: A sign consists of a phonological structure, a morphological structure, a syntactic structure and a semantic structure.
The Six Levels of Linguistic Analysis
The Six Levels of Linguistic Analysis
● Extra-linguistic context: the broader context surrounding the language, such as the speaker's intent, the situation in which the communication takes place, or background knowledge.
● Linguistic context: the words and phrases within a sentence or passage that provide clues about the meaning of a particular word.
● Semantic level: the meaning of words and sentences.
● Syntactic level: the grammatical structure of sentences.
● Morphological level: the structure of words, including prefixes, suffixes, and root words.
● Phonological level: the sounds that make up words.
The 5 Challenges of NLP
1. Productivity
2. Ambiguity
3. Variability
4. Diversity
5. Sparsity
Productivity
Definition: "the property of the language-system which enables native speakers to construct and understand an indefinitely large number of utterances, including utterances that they have never previously encountered." (Lyons, 1977)
➔ New words, senses and structures are introduced into languages all the time.
Examples: staycation and social distance were added to the Oxford Dictionary in 2021.
Ambiguity
Ambiguity in words refers to the situation where a word has multiple possible meanings.
Most linguistic observations (speech, text) are open to several interpretations.
We (humans) disambiguate, i.e. find the correct interpretation, using all kinds of signals (linguistic and extra-linguistic).
Ambiguity can appear at all levels (phonology, graphemics, morphology, syntax, semantics).
Ambiguity: Syntactic Ambiguity
Syntactic ambiguity, also known as structural ambiguity or grammatical ambiguity, occurs when a sentence's structure allows for multiple interpretations. For example, "I saw the man with the telescope" can mean either that the telescope was used for seeing or that the man was carrying it.
Ambiguity: Semantic Ambiguity
● Polysemy: e.g. set, arm, head ("Head of New Zealand is a woman")
● Named entities: e.g. Michael Jordan ("Michael Jordan is a professor at Berkeley")
● Object/Color: e.g. cherry ("Your cherry coat")
Ambiguity: Pragmatic Ambiguity
Pragmatic ambiguity arises when the meaning of a sentence or utterance is unclear due to factors beyond the words themselves and the grammatical structure.
Examples (ambiguous headlines):
● Two Soviet ships collide, one dies
● Dealers will hear car talk at noon
Variation
Language varies at all levels:
● Phonetic (accent)
● Morphological, lexical (spelling)
● Syntactic
● Semantic
Phonetic Variation (cf. Sagot)
Spelling and Syntactic Variation
Variation: Determinants
● Who is talking?
● To whom?
● Where? Work, home, restaurant…
● When? 19th century, 2008, 2022…
● About what? Specialised domain, the weather…
Essentially, the variability of a language depends on:
● Social context
● Geography
● Sociology
● Date
● Topic
Diversity
● About 7,000 languages are spoken in the world
● About 60% are found in written form (cf. Omniglot)
Graphemic Diversity (source: Wikipedia)
Syntactic Diversity
A key characteristic of the syntax of a given language is its word order:
● Word order differs across languages
● The degree of freedom in word order also differs across languages
● We characterize word orders by their Subject (S), Verb (V), Object (O) order
Syntactic Diversity (Dyer et al., 2013)
Semantic Diversity
● Words partition the semantic space
● This partition is very diverse across languages (Dyer et al., 2013)
Statistical Description of a Corpus
We statistically describe a corpus of 800 scientific articles.
Question: if we plot the number of occurrences of each word against its rank, what will we observe?
Statistical Description of a Corpus
In this corpus of 800 scientific articles:
● the is the most frequent word (rank 1), with 8119 occurrences
● estimate is observed 56 times and is the 1001st most frequent word
● About 6,000 words are observed only once in the dataset (e.g. stakeholders, pending, score…)
Statistical Description of a Corpus
➔ In a large enough corpus, the word distribution follows a Zipf law, i.e. the frequency f_w of a word w is (roughly) inversely proportional to a power of its rank k: f_w ∝ 1 / k^s.
● A Zipf law is a power-law relation between rank and frequency: the most frequent entities are much more frequent than the less frequent ones.
● Under a Zipf law, log(f_w) and log(k) are linearly related.
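To make the rank-frequency relationship concrete, here is a minimal Python sketch (the corpus file name and the whitespace tokenization are assumptions for illustration, not the lecture's setup) that counts word occurrences and prints log-rank against log-frequency:

```python
import math
from collections import Counter

# Hypothetical corpus: any large plain-text file will do.
with open("articles.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Naive whitespace tokenization, just for illustration.
tokens = text.split()
counts = Counter(tokens)

# Sort words by frequency: rank 1 = most frequent word.
ranked = counts.most_common()
for rank, (word, freq) in enumerate(ranked[:10], start=1):
    # Under a Zipf law, log(freq) decreases roughly linearly with log(rank).
    print(rank, word, freq, round(math.log(rank), 2), round(math.log(freq), 2))
```

Plotting the last two columns against each other on a large corpus should give an approximately straight, downward-sloping line.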
What is Natural Language Processing?
In a nutshell, NLP consists of handling the complexities of natural languages "to do something":
● Raw text / speech → structured information
● Raw text / speech → (controlled) text / speech
In this course we will focus on textual data.
Framework
We assume:
● A token is the basic unit of discrete data, defined to be an item from a vocabulary indexed by 1, ..., V.
● A document is a sequence of N words denoted by d = (w1, w2, ..., wN), where wn is the n-th word in the sequence.
● A corpus is a collection of M documents denoted by D = (d1, d2, ..., dM).
Examples: Wikipedia, all the articles of the NYT in 2021…
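A minimal sketch of this framework in Python (the two documents are made-up examples): a vocabulary maps each token type to an index 1..V, a document is a sequence of tokens, and a corpus is a list of documents.

```python
# Toy corpus of M = 2 documents (made-up examples).
corpus = [
    ["my", "name", "is", "bob"],   # d1 = (w1, ..., w4)
    ["i", "live", "in", "ny"],     # d2
]

# Vocabulary: each distinct token type gets an index 1..V.
vocab = {tok: i + 1 for i, tok in enumerate(sorted({t for d in corpus for t in d}))}
V = len(vocab)

# A document can then be represented as a sequence of vocabulary indices.
encoded = [[vocab[t] for t in doc] for doc in corpus]
print(V, encoded)
```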
Token
Depending on the end task, a token can be:
● A word
● A sub-word: e.g. a sequence of 3 characters
● A character
● A sequence of characters (sometimes a word, sometimes several words, sometimes a sub-word…)
Document
A document can be:
● A sentence
● A paragraph
● A sequence of characters
Text Segmentation
Definition: text segmentation is the process of splitting raw text (i.e. a list of characters) into units of interest.
Two levels of segmentation are (usually) required:
● Split raw text into modeling units (e.g. sentences, paragraphs, 1000 characters, web pages...)
● Split modeling units into sequences of basic units, referred to as tokens (e.g. words, word pieces, characters, ...)
Two distinct approaches:
● Linguistically informed, e.g. word or sentence segmentation...
● Statistically informed, e.g. frequent sub-words (word pieces, sentence pieces...)
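A minimal sketch of the linguistically informed, rule-based approach (the regex rules and the second French sentence are assumptions for illustration; real segmenters handle many more cases, e.g. abbreviations):

```python
import re

raw = "Une industrie métallurgique existait. Elle employait des ouvriers."

# Level 1: split raw text into modeling units (here, sentences), using a
# simplistic rule: a sentence-final punctuation mark followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", raw)

# Level 2: split each modeling unit into tokens (word characters vs. punctuation).
tokens = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

print(sentences)
print(tokens)
```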
Tokenization
Definition: tokenization consists of segmenting raw textual data into tokens. It can be framed as a character-level task:
input: une industrie métallurgique existait.
output: IIIEIIIIIIIIIIIIIIEIIIIIIIIIIIIIIIIIIIIEIIIIIIIIIIIEE
● An easy task for most languages and domains
● Can be very complex in some cases (Chinese, social media...)
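Below is a sketch of one way to produce such character-level labels, under an assumed convention (E = last character of a token, I = every other character, including whitespace); the slide's exact labelling scheme may differ.

```python
def char_labels(text, tokens):
    """Label each character: 'E' if it ends a token, 'I' otherwise.
    This is one possible convention; the slide's scheme may differ."""
    labels = ["I"] * len(text)
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)   # locate the token in the raw string
        end = start + len(tok) - 1     # index of its last character
        labels[end] = "E"
        pos = end + 1
    return "".join(labels)

text = "une industrie métallurgique existait."
tokens = ["une", "industrie", "métallurgique", "existait", "."]
print(char_labels(text, tokens))  # one I/E label per character of the input
```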
NLP Tasks: Modeling Framework
Task taxonomy:
● If Y is a single label and X a sequence of tokens (e.g. a sentence): Sequence Classification
● If we have one label per token: Sequence Labelling
● If Y is a sequence of tokens: Sequence Prediction
● If Y is a graph, a tree or a complex structured output: Structure Prediction
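A minimal sketch of this taxonomy as Python type signatures (the function names and the arc representation are illustrative assumptions, not a standard API); it only shows the shape of X and Y for each task family:

```python
from typing import List, Tuple

Token = str
Label = str

# Sequence Classification: X is a sequence of tokens, Y is a single label.
def classify(x: List[Token]) -> Label: ...

# Sequence Labelling: one label per token (e.g. POS tagging, NER, slot filling).
def label_sequence(x: List[Token]) -> List[Label]: ...

# Sequence Prediction: Y is itself a sequence of tokens (e.g. machine translation).
def predict_sequence(x: List[Token]) -> List[Token]: ...

# Structure Prediction: Y is a tree/graph, sketched here as a list of
# (head_index, dependent_index, relation) arcs (e.g. dependency parsing).
Arc = Tuple[int, int, str]
def parse(x: List[Token]) -> List[Arc]: ...
```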
Document Classification
Example labels: Politics, Economy, Travel, Geopolitics, …
Document Ranking (Retriever)
NLP Tasks: Part-of-Speech Tagging
POS tagging: find the grammatical category of each word.
[My, name, is, Bob, and, I, live, in, NY, !]
[PRON, NOUN, VERB, NOUN, CC, PRON, VERB, PREP, NOUN, PUNCT]
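As a toy illustration of sequence labelling (not an actual POS tagger), here is a hypothetical dictionary-lookup baseline built from the slide's example; real taggers are learned from annotated corpora:

```python
# Toy lookup table built from the slide's example (purely illustrative).
TAGS = {
    "My": "PRON", "name": "NOUN", "is": "VERB", "Bob": "NOUN", "and": "CC",
    "I": "PRON", "live": "VERB", "in": "PREP", "NY": "NOUN", "!": "PUNCT",
}

def pos_tag(tokens):
    # One label per token: a sequence-labelling task.
    return [TAGS.get(tok, "UNK") for tok in tokens]

print(pos_tag(["My", "name", "is", "Bob", "and", "I", "live", "in", "NY", "!"]))
```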
Syntactic Parsing
Syntactic parsing consists of extracting the syntactic structure of a sentence. For instance, dependency parsing predicts a directed acyclic graph (a tree).
Slot-Filling / Intent Detection
Intent detection is a sequence classification task that consists of classifying the intent of a user into a pre-defined category.
Slot-filling is a sequence labelling task that consists of identifying specific parameters in a user request.
Can you please play Hello from Adele?
Intent: play_music
Slots:
[Can, you, please, play, Hello, from, Adele, ?]
[O, O, O, O, SONG, O, ARTIST, O]
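A small sketch of how the slide's slot labels can be turned into a structured request (the output format is an assumption for illustration):

```python
tokens = ["Can", "you", "please", "play", "Hello", "from", "Adele", "?"]
slots  = ["O", "O", "O", "O", "SONG", "O", "ARTIST", "O"]

# Group labelled tokens into a dictionary of slot values (toy post-processing).
extracted = {}
for tok, lab in zip(tokens, slots):
    if lab != "O":
        extracted.setdefault(lab, []).append(tok)

request = {"intent": "play_music",
           "slots": {k: " ".join(v) for k, v in extracted.items()}}
print(request)  # {'intent': 'play_music', 'slots': {'SONG': 'Hello', 'ARTIST': 'Adele'}}
```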
Semantic Role Labelling (SRL)
SRL is the task of finding, for each predicate in a sentence, the semantic roles of its arguments. Given a sentence, SRL predicts: who did what to whom, when, where, why, how.
NLP Tasks: Named Entity Recognition
NER: find the named entities in a sentence.
[My, name, is, Bob, and, I, live, in, NY, !]
[O, O, O, PERSON, O, O, O, O, LOCATION, O]
Machine Translation
INPUT: My name is Bob and I live in NY !
OUTPUT: Je m’appelle Bob et je vis à New-York !
Question Answering
INPUT: How many episodes in season 2 breaking bad?
OUTPUT: 13
Brief History of NLP
Symbolic (1940-2000): focus on rule-based systems and formal grammars; development of linguistic resources (lexicons, ontologies, grammars…)
Statistical Learning (1990-2010): statistical learning theory (SVM, Random Forest), probabilistic graphical models (e.g. LDA, HMM); development of annotated datasets
Deep Learning (2010-today): deep learning architectures (Transformer); transfer learning in NLP (word2vec, BERT, CamemBERT, GPT); more compute, larger (raw) datasets, open-source deep learning libraries
1949: The First Machine Translation “Model”
1949 Memorandum on Translation, Warren Weaver. First to propose the idea of using “electronic computers” to do translation:
● Using Shannon’s information theory to frame machine translation as a cryptographic problem
● Modeling context to disambiguate between word senses
● “Going down” from each language to a universal language in order to translate
1964: ELIZA, the First Conversational Bot
