Natural Language Processing: Quick Introduction
Rohit Nayak, Talentica Software
Part 1: Semantic Web, Uses of NLP, Core Concepts, Intro to GATE
Part 2: GATE Detailed Demo
NLP 420
  Falling Tree Hits, Kills OR Forest Service Worker
  Time flies like an arrow
  Choosing a Program to Improve Your Future
  Monkeys like bananas when they wake up
  Monkeys like bananas when they are ripe
I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize. – Tim Berners-Lee, 1999
Disaster
  Type: earthquake
  location: Afghanistan
  date: 05/30/1998
  magnitude: 6.9
  epicenter: a remote part of the country
  damage:
    human-effect:
      victim: Thousands of people
      number: Thousands
      outcome: dead
    physical-effect:
      object: entire villages
      outcome: damaged
QUAKE IN AFGHANISTAN
  Thousands of people are feared dead following... (voice-over) ... a powerful earthquake that hit Afghanistan today. The quake registered 6.9 on the Richter scale, centered in a remote part of the country. (on camera) Details now hard to come by, but reports say entire villages were buried by the quake.
Text Categorization: Is the document about plants? sports? health and fitness? corporate acquisitions? … stock market?
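One possible way to make the categorization question concrete is a tiny multinomial naive Bayes classifier. This is only a sketch under stated assumptions: the class name `NaiveBayesCategorizer`, the training sentences, and the category labels are invented for illustration and do not come from the slides, and a real system would train on far more text.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesCategorizer:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of training documents
        self.vocabulary = set()

    def train(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical toy training set, invented for this sketch.
categorizer = NaiveBayesCategorizer()
categorizer.train("The striker scored two goals in the football match", "sports")
categorizer.train("The team won the cricket match after a late wicket", "sports")
categorizer.train("Shares fell as the stock market reacted to interest rates", "stock market")
categorizer.train("Investors bought shares after the market rallied", "stock market")
categorizer.train("Water the plants weekly and keep the leaves in sunlight", "plants")
```

Calling `categorizer.classify("The market closed higher and shares gained")` picks the category whose training words best explain the new document.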
Sentiment Classification: Is the overall sentiment in the document positive? negative? In general, sentiment classification appears to be harder than categorizing by topic.
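A small lexicon-based sketch may hint at why sentiment is harder than topic: the same word can point either way depending on context, for example under negation. The word lists and the function name `sentiment` below are assumptions made up for this illustration, not part of the original slides.

```python
import re

# Hypothetical sentiment lexicons, invented for this sketch.
POSITIVE = {"good", "great", "excellent", "love", "enjoyable"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "boring"}
NEGATORS = {"not", "never", "no"}


def sentiment(text):
    """Return 'positive', 'negative', or 'neutral' from a signed word count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        # A negator directly before a sentiment word flips its polarity.
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A topic classifier can often get by on word presence alone; here "not good" and "good" share the key word yet need opposite labels, and this sketch only handles the simplest such case.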
Information Extraction [diagram: a text collection passes through an Information Extraction System, which fills templates with the slots Who, What, Where, When, How]
Information Extraction (IE): Recognition, tagging, and extraction into a structured representation of certain key elements of information (e.g. persons, companies, locations, organizations) from large collections of text. These extractions can then be utilized for a range of applications including question-answering, visualization, and data mining.
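As a rough illustration of "extraction into a structured representation", the sketch below fills a few slots of a disaster record from a news sentence using regular expressions. The function name `extract_disaster`, the patterns, and the slot names are assumptions chosen for this example; production IE systems rely on much richer linguistic analysis than hand-written patterns like these.

```python
import re


def extract_disaster(text):
    """Fill a small disaster record from free text using hand-written patterns."""
    record = {}
    match = re.search(r"\b(earthquake|flood|hurricane)\b", text, re.IGNORECASE)
    if match:
        record["type"] = match.group(1).lower()
    # Assumes the location is the capitalised word right after "hit".
    match = re.search(r"\bhit ([A-Z][a-z]+)", text)
    if match:
        record["location"] = match.group(1)
    match = re.search(r"registered (\d+(?:\.\d+)?) on the Richter scale", text)
    if match:
        record["magnitude"] = float(match.group(1))
    match = re.search(r"centered in ([^.]+?)\s*\.", text)
    if match:
        record["epicenter"] = match.group(1)
    return record


report = (
    "Thousands of people are feared dead following a powerful earthquake "
    "that hit Afghanistan today. The quake registered 6.9 on the Richter "
    "scale, centered in a remote part of the country."
)
record = extract_disaster(report)
```

The resulting dictionary can then feed the downstream uses named above, such as question-answering or data mining.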
Question-Answering: In contrast to Information Retrieval, which provides a list of potentially relevant documents in response to a user’s query, question-answering provides the user with either just the text of the answer itself or answer-providing passages.
Summarization: Reduces a larger text to a shorter, yet richly constituted, abbreviated narrative representation of the original document.
Machine Translation: Perhaps the oldest of all NLP applications. Various levels of NLP have been utilized in MT systems, ranging from the ‘word-based’ approach to applications that include higher levels of analysis.
Dialogue Systems: Perhaps the omnipresent application of the future, in the systems envisioned by large providers of end-user applications. Dialogue systems usually focus on a narrowly defined application (e.g. your refrigerator or home sound system) and currently utilize the phonetic and lexical levels of language. It is believed that utilizing all the levels of language processing offers the potential for truly habitable dialogue systems.
Challenge of Semantic Web
  Machine processable data to complement hypertext
  Attach metadata to documents
    Explicit: title, author, creation date
    Implicit: deduced information like names of entities and their relations
Ontology
  Specification of a conceptualisation
  Basis of document “understanding”
  Creating and populating is very time-consuming, practically impossible
Simple Workflow
  Classification
  Tokeniser
  Gazetteer
  Sentence Splitter
  Part-of-Speech Tagging
  Named Entity Tagging
  Final Extraction
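The workflow stages can be sketched as a chain of small functions, each consuming the previous stage's output. This is a simplified illustration, not GATE's implementation: the gazetteer entries and function names are hypothetical, the classification and part-of-speech stages are omitted, and named entity tagging is reduced to keeping the tokens the gazetteer recognised.

```python
import re
from collections import defaultdict

# Hypothetical gazetteer entries, invented for this sketch.
GAZETTEER = {"Afghanistan": "location", "Talentica": "organization"}


def tokenise(text):
    """Tokeniser: split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def gazetteer_lookup(tokens):
    """Gazetteer: pair each token with its entry type, or None if unknown."""
    return [(tok, GAZETTEER.get(tok)) for tok in tokens]


def split_sentences(annotated):
    """Sentence splitter: close a sentence at each '.', '!' or '?' token."""
    sentences, current = [], []
    for item in annotated:
        current.append(item)
        if item[0] in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def tag_entities(sentence):
    """Named entity tagging: keep tokens that the gazetteer recognised."""
    return [(tok, kind) for tok, kind in sentence if kind is not None]


def extract(text):
    """Final extraction: collect entities by type across all sentences."""
    result = defaultdict(list)
    for sentence in split_sentences(gazetteer_lookup(tokenise(text))):
        for tok, kind in tag_entities(sentence):
            result[kind].append(tok)
    return dict(result)
```

For example, `extract("Rohit works at Talentica. A quake hit Afghanistan today.")` groups the two gazetteer hits by type. The naive splitter would break on abbreviations such as "Inc.", which is one reason real pipelines use dedicated components for each stage.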
Tools
  GATE
  OpenNLP
  NLTK (Python)
  Stanford Parser
  Weka for classification
GATE: General Architecture for Text Engineering
  Over 10 years, active development
  Most popular NLP platform
  Current version 5.0
  Built as a framework for both programmers and developers
  Powerful GUI and well-documented Java API
  Multilingual
GATE
  Clean separation of low-level tasks (e.g., data storage) from the NLP components
  Separation between linguistic data and algorithms that process it
JAPE: Java Annotation Patterns Engine (“Just A Pleasant Experience”)
  Pattern-matching over annotations
  Regular-expression-like
  Can use Java in actions
Rule: Company1
Priority: 25
(
  ({Token.orthography == upperInitial})+
  {Lookup.kind == companyDesignator}
):companyMatch
-->
:companyMatch.NamedEntity = {kind = "company", rule = "Company1"}
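To show what the Company1 rule is looking for, here is a rough Python paraphrase: one or more tokens with an initial capital, followed by a token the gazetteer marked as a company designator, producing a NamedEntity record. This is not how GATE executes JAPE. The dictionary layout, the helper names `annotate` and `match_company1`, and the designator set are assumptions for this sketch, and rule priority and JAPE's matching styles are ignored.

```python
def annotate(tokens, designators):
    """Build token annotations with the two features the rule inspects."""
    return [
        {
            "string": tok,
            # upperInitial: first letter uppercase, the rest lowercase.
            "orthography": "upperInitial"
            if tok[:1].isupper() and tok[1:].islower()
            else "other",
            "lookup_kind": "companyDesignator" if tok in designators else None,
        }
        for tok in tokens
    ]


def match_company1(annotations):
    """Match (upperInitial token)+ followed by a companyDesignator lookup."""
    entities = []
    i = 0
    while i < len(annotations):
        # Consume the run of upperInitial tokens starting at position i.
        j = i
        while j < len(annotations) and annotations[j]["orthography"] == "upperInitial":
            j += 1
        # The designator needs at least one upperInitial token before it;
        # it may sit inside the run or directly after it.
        end = None
        for k in range(i + 1, min(j + 1, len(annotations))):
            if annotations[k]["lookup_kind"] == "companyDesignator":
                end = k
                break
        if end is not None:
            text = " ".join(a["string"] for a in annotations[i:end + 1])
            entities.append({"text": text, "kind": "company", "rule": "Company1"})
            i = end + 1
        else:
            i = max(j, i + 1)
    return entities


# Hypothetical input: whitespace tokens and a two-entry designator list.
tokens = "We met Acme Widgets Ltd yesterday".split()
matches = match_company1(annotate(tokens, {"Ltd", "Inc"}))
```

Here `matches` holds a single record for "Acme Widgets Ltd", carrying the same `kind` and `rule` features that the right-hand side of the JAPE rule assigns.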
CREOLE components
  GATE plugins use CREOLE: Collection of Reusable Objects for Language Engineering
  Modified JavaBeans with XML configuration
  Minimal component: 10 lines of Java, 10 lines of XML
External Slideshow http://www.authorstream.com/presentation/Esteban-22479-ekaw2006-tutorial-Aims-Terminology-Semantic-Annotation-Motivation-Challenge-Web-Metadata-ext-as-Entertainment-ppt-powerpoint/ (27)
GATE Demo
  Quick look
  Detailed demo at the next SIG
