The document discusses relation extraction in language technology, focusing on the formal definition of relations and the importance of extracting structured information from unstructured text. It outlines methods for building relation extractors, such as supervised learning, bootstrapping, and unsupervised techniques, and explains summarization techniques for single and multiple documents, as well as query-focused summaries. Additionally, it touches on the evaluation of summarization through metrics like ROUGE, emphasizing the creation of concise, informative summaries for various applications.
Introduction to the presentation on semantic analysis in language technology with details about the speaker and university.
Explains what relation extraction is, its significance, and examples of binary relations like 'father-of' or 'located-in'.
Describes various approaches to build relation extractors, including supervised, semi-supervised, and bootstrapping methods.
Provides examples of practical activities in relation extraction, focusing on extracting author-book pairs using seed patterns.
Introduces various summarization types, including news and book summaries, and discusses human summarization and extractive vs. abstractive methods.
Examines stages of summarization, types including extractive and abstractive summarization, and various summarization algorithms.
Focuses on methods for content selection during summarization, including supervised and unsupervised techniques, and metrics for evaluation like ROUGE.
Semantic Analysis in Language Technology http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm Summarization Marina Santini santinim@stp.lingfil.uu.se Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden Spring 2016
What's a relation? • A relation can be formally defined as a tuple • t = (e1, e2, …, en) • where the ei are entities in a predefined relation r within document D. • Most relation extraction systems focus on extracting binary relations. • Examples of binary relations include • located-in(CMU, Pittsburgh), • father-of(Manuel Blum, Avrim Blum). • It is also possible to go to higher-order relations and extract more complex ones (e.g., in biomedicine).
Why Relation Extraction? • There exists a vast amount of unstructured electronic text on the Web, including newswire, blogs, emails, governmental documents, chats, and so on. • The whole idea of IE is to turn unstructured text into structured text by annotating semantic information. • RE is the task of recognizing relations between entities in unstructured text. • If a query to a search engine is "When was Gandhi born?", then the expected answer would be "Gandhi was born in 1869". The template of the answer is <PERSON> born-in <YEAR>, which is nothing but the relational triple born-in(PERSON, YEAR), where PERSON and YEAR are the entities.
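To make the template concrete, here is a minimal Python sketch (the regex and function name are illustrative, not a robust extractor; real RE systems handle far more surface variation):

```python
import re

# Illustrative pattern for the born-in(PERSON, YEAR) template:
# a capitalized name followed by "was born in" and a 4-digit year.
BORN_IN = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (\d{4})")

def extract_born_in(text):
    """Return born-in(PERSON, YEAR) triples found in the text."""
    return [("born-in", person, year) for person, year in BORN_IN.findall(text)]

print(extract_born_in("Gandhi was born in 1869."))
# [('born-in', 'Gandhi', '1869')]
```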
Watch out! • RE = extracting facts from unstructured texts, i.e., relations that exist between entities such as dates, proper names, and companies. • Other relations (related to word senses) are semantic relations between concepts: hypernyms, hyponyms, etc., as in WordNet.
How to build relation extractors 1. Hand-written patterns 2. Supervised machine learning 3. Semi-supervised and unsupervised • Bootstrapping (using seeds) • Distant supervision • Unsupervised learning from the web
Seed-based or bootstrapping approaches to relation extraction • No training set? Maybe you have: • a few seed tuples, or • a few high-precision patterns • Can you use those seeds to do something useful? • Bootstrapping: use the seeds to directly learn to populate a relation. Roughly said: use seeds to initialize a process of annotation, then refine through iterations.
DIPRE: Extract <author, book> pairs • Start with 5 seeds: Isaac Asimov / The Robots of Dawn; David Brin / Startide Rising; James Gleick / Chaos: Making a New Science; Charles Dickens / Great Expectations; William Shakespeare / The Comedy of Errors • Find instances: "The Comedy of Errors, by William Shakespeare, was", "The Comedy of Errors, by William Shakespeare, is", "The Comedy of Errors, one of William Shakespeare's earliest attempts", "The Comedy of Errors, one of William Shakespeare's most" • Extract patterns (group by middle, take longest common prefix/suffix): ?x , by ?y , and ?x , one of ?y 's • Now iterate, finding new seeds that match the patterns. Brin, Sergey. 1998. Extracting Patterns and Relations from the World Wide Web.
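A minimal sketch of DIPRE's pattern-induction step under simplified assumptions (contexts are plain strings; Brin's actual system also records URL prefixes and checks pattern specificity):

```python
import os

def longest_common_suffix(strings):
    """Longest common suffix, via the common prefix of the reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def induce_patterns(occurrences):
    """occurrences: (prefix, middle, suffix) contexts around an <x, y> pair,
    e.g. ('', ', by ', ', was ') for 'The Comedy of Errors, by William
    Shakespeare, was'. Group by the middle, then generalize the outer context."""
    by_middle = {}
    for prefix, middle, suffix in occurrences:
        by_middle.setdefault(middle, []).append((prefix, suffix))
    patterns = []
    for middle, contexts in by_middle.items():
        prefixes = [p for p, _ in contexts]
        suffixes = [s for _, s in contexts]
        # Keep only the context shared by all occurrences of this middle.
        patterns.append((longest_common_suffix(prefixes), middle,
                         os.path.commonprefix(suffixes)))
    return patterns

occs = [("", ", by ", ", was "), ("", ", by ", ", is "),
        ("", ", one of ", "'s earliest"), ("", ", one of ", "'s most")]
print(induce_patterns(occs))
# roughly: [('', ', by ', ', '), ('', ', one of ', "'s ")]
```

Grouping by the middle keeps patterns specific, while generalizing the outer context by longest common prefix/suffix is what turns the raw instances into the two patterns shown above.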
Practical Activity • Search for phrasal patterns on the web. Our seeds: "* is a novel by *", "* wrote the novel *", "the novel * was written by *" (optionally add more phrases…) • Further refinements that we felt are needed: • get rid of non-informative text included in the returned strings (maybe by adding additional patterns to the regular expressions) • identify named entities • maybe via regular expressions (e.g., identify words starting with uppercase) • maybe by combining seeds with a NER system • etc. • Google is fantastic, but also unpredictable: different behaviours depending on the machines, domains, and some "hidden" criteria…
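As a hedged illustration of these refinements, the sketch below turns one seed phrase into a regular expression and uses an uppercase-initial heuristic as a crude stand-in for NER (a real pipeline would use a proper NER system):

```python
import re

# Sequences of capitalized words approximate the title and author slots.
CAP_SEQ = r"((?:[A-Z][\w'-]*\s?)+)"
NOVEL_BY = re.compile(CAP_SEQ + r" is a novel by " + CAP_SEQ)

snippet = "Hard Times is a novel by Charles Dickens published in 1854."
for title, author in NOVEL_BY.findall(snippet):
    print(title.strip(), "|", author.strip())
# Hard Times | Charles Dickens
```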
Acknowledgements • Most slides borrowed or adapted from Dan Jurafsky and Christopher Manning, Coursera • Some inspiration from Dragomir Radev, Coursera • J&M (2009)
Text Summarization • Goal: produce an abridged version of a text that contains information that is important or relevant to a user. • Summarization applications: • outlines or abstracts of any document, article, etc. • summaries of email threads • action items from a meeting • simplifying text by compressing sentences
What to summarize? Single vs. multiple documents • Single-document summarization: given a single document, produce • an abstract • an outline • a headline • Multiple-document summarization: given a group of documents, produce a gist of the content: • a series of news stories on the same event • a set of web pages about some topic or question
Query-focused Summarization & Generic Summarization • Generic summarization: summarize the content of a document. • Query-focused summarization: summarize a document with respect to an information need expressed in a user query. • A kind of complex question answering: answer a question by summarizing a document that has the information to construct the answer.
Summarization for Question Answering: Snippets • Create snippets summarizing a web page for a query. • Google: 156 characters (about 26 words) plus title and link.
Summarization for Question Answering: Multiple documents • Create answers to complex questions by summarizing multiple documents. • Instead of giving a snippet for each document, create a cohesive answer that combines information from each document.
Extractive summarization & Abstractive summarization • Extractive summarization: create the summary from phrases or sentences in the source document(s). • Abstractive summarization: express the ideas in the source documents using (at least in part) different words.
Summarization: Three Stages 1. content selection: choose sentences to extract from the document 2. information ordering: choose an order to place them in the summary 3. sentence realization: clean up the sentences [Pipeline diagram: Document → Sentence Segmentation (all sentences from documents) → Content Selection (Sentence Simplification, Sentence Extraction) → extracted sentences → Information Ordering → Sentence Realization → Summary]
Basic Summarization Algorithm 1. content selection: choose sentences to extract from the document 2. information ordering: just use document order 3. sentence realization: keep original sentences [Pipeline diagram: as on the previous slide, Document → Sentence Segmentation → Content Selection → Information Ordering → Sentence Realization → Summary]
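A skeletal Python version of this baseline, assuming a sentence scoring function is supplied (the period split is a naive sentence segmenter, for illustration only):

```python
def summarize(document, score, n=3):
    """Baseline extractive summarizer: content selection by a scoring
    function, document order, original sentences kept as-is."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    # 1. Content selection: pick the n highest-scoring sentences.
    selected = set(sorted(range(len(sentences)),
                          key=lambda i: score(sentences[i]),
                          reverse=True)[:n])
    # 2. Information ordering: just use document order.
    # 3. Sentence realization: keep the original sentences.
    return ". ".join(sentences[i] for i in sorted(selected)) + "."
```

Passing something as trivial as `len` for `score` already yields a running summarizer; the next slides replace it with informative-word weights.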
Unsupervised content selection • Intuition dating back to Luhn (1958): choose sentences that have salient or informative words. • Two approaches to defining salient words: 1. tf-idf: weigh each word wi in document j by tf-idf: weight(wi) = tfij × idfi 2. topic signature: choose a smaller set of salient words, via • mutual information • log-likelihood ratio (LLR), Dunning (1993), Lin and Hovy (2000): weight(wi) = 1 if −2 log λ(wi) > 10, and 0 otherwise. H. P. Luhn. 1958. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2:2, 159-165.
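A hedged sketch of both weighting schemes (the idf table and the LLR statistics are assumed to be precomputed from a background corpus; names are illustrative):

```python
from collections import Counter

def tfidf_weights(doc_tokens, idf):
    """weight(w_i) = tf_ij * idf_i for each word w_i in document j.
    idf: dict word -> idf value from a background corpus (assumed given)."""
    tf = Counter(doc_tokens)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def topic_signature(llr_stats, threshold=10.0):
    """weight(w_i) = 1 if -2 log lambda(w_i) > 10, else 0.
    llr_stats: dict word -> -2 log lambda statistic (assumed precomputed)."""
    return {w: 1.0 if stat > threshold else 0.0 for w, stat in llr_stats.items()}
```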
Topic signature-based content selection with queries (Conroy, Schlesinger, and O'Leary 2006) • Choose words that are informative either by log-likelihood ratio (LLR) or by appearing in the query: weight(wi) = 1 if −2 log λ(wi) > 10, 1 if wi ∈ question, and 0 otherwise. • Weigh a sentence (or window) by the average weight of its words: weight(s) = (1/|S|) Σw∈S weight(w) • (Could learn more complex weights.)
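And a sketch of the sentence weight defined above, averaging word weights over the sentence (tokenization is assumed to be done already):

```python
def sentence_weight(sentence_tokens, salient_words, question_words):
    """weight(s) = (1/|S|) * sum of weight(w) for w in S, where
    weight(w) = 1 if w is a topic-signature word or appears in the query."""
    if not sentence_tokens:
        return 0.0
    hits = sum(1.0 for w in sentence_tokens
               if w in salient_words or w in question_words)
    return hits / len(sentence_tokens)
```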
Supervised content selection • Given: a labeled training set of good summaries for each document. • Align: the sentences in the document with sentences in the summary. • Extract features: • position (first sentence?) • length of sentence • word informativeness, cue phrases • cohesion • Train a binary classifier (put sentence in summary? yes or no). • Problems: • hard to get labeled training data • alignment is difficult • performance is not better than unsupervised algorithms • So in practice: unsupervised content selection is more common.
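A toy sketch of the "train a binary classifier" step, assuming scikit-learn is available (the features and training data are placeholders; the hard part, aligning document sentences with summary sentences, is not shown):

```python
from sklearn.linear_model import LogisticRegression

def features(sentence, position):
    """Simplified versions of the cues above: position, length, cue phrase."""
    tokens = sentence.split()
    return [1.0 if position == 0 else 0.0,      # first sentence?
            float(len(tokens)),                  # sentence length
            1.0 if "in conclusion" in sentence.lower() else 0.0]  # cue phrase

# Toy training data: y[i] = 1 if sentence i was aligned to the human summary.
X = [features("The study finds strong effects.", 0),
     features("Details of the apparatus follow.", 1),
     features("In conclusion, the effect is robust.", 2)]
y = [1, 0, 1]

clf = LogisticRegression().fit(X, y)  # binary decision: in summary, yes or no?
print(clf.predict([features("The results were clear.", 0)]))
```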
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) • Intrinsic metric for automatically evaluating summaries • Based on BLEU (a metric used for machine translation) • Not as good as human evaluation ("Did this answer the user's question?"), but much more convenient • Given a document D and an automatic summary X: 1. Have N humans produce a set of reference summaries of D 2. Run the system, giving automatic summary X 3. What percentage of the bigrams from the reference summaries appear in X? ROUGE-2 = [Σs∈{RefSummaries} Σbigrams i∈s min(count(i, X), count(i, s))] / [Σs∈{RefSummaries} Σbigrams i∈s count(i, s)] Lin and Hovy 2003
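A minimal ROUGE-2 sketch following the formula above (whitespace tokenization and lowercasing only; real implementations add preprocessing such as stemming). The example on the next slide works through the same computation by hand:

```python
from collections import Counter

def bigrams(text):
    """Bigram counts over a naively tokenized, lowercased text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge_2(system, references):
    """Sum of clipped bigram matches over all reference summaries,
    divided by the total number of reference bigrams."""
    sys_counts = bigrams(system)
    overlap = total = 0
    for ref in references:
        ref_counts = bigrams(ref)
        overlap += sum(min(sys_counts[bg], c) for bg, c in ref_counts.items())
        total += sum(ref_counts.values())
    return overlap / total
```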
A ROUGE example: Q: "What is water spinach?" • Human 1: Water spinach is a green leafy vegetable grown in the tropics. • Human 2: Water spinach is a semi-aquatic tropical plant grown as a vegetable. • Human 3: Water spinach is a commonly eaten leaf vegetable of Asia. • System answer: Water spinach is a leaf vegetable commonly eaten in tropical areas of Asia. • ROUGE-2 = (3 + 3 + 6) / (10 + 9 + 9) = 12/28 = .43
Definition questions • Q: What is water spinach? • A: Water spinach (Ipomoea aquatica) is a semi-aquatic leafy green plant with long hollow stems and spear- or heart-shaped leaves, widely grown throughout Asia as a leaf vegetable. The leaves and stems are often eaten stir-fried flavored with salt or in soups. Other common names include morning glory vegetable, kangkong (Malay), rau muong (Viet.), ong choi (Cant.), and kong xin cai (Mand.). It is not related to spinach, but is closely related to sweet potato and convolvulus.
Medical questions • Q: In children with an acute febrile illness, what is the efficacy of single-medication therapy with acetaminophen or ibuprofen in reducing fever? • A: Ibuprofen provided greater temperature decrement and longer duration of antipyresis than acetaminophen when the two drugs were administered in approximately equal doses. (PubMed ID: 1621668, Evidence Strength: A) Demner-Fushman and Lin (2007)
Other complex questions 1. How is compost made and used for gardening (including different types of compost, their uses, origins and benefits)? 2. What causes train wrecks and what can be done to prevent them? 3. Where have poachers endangered wildlife, what wildlife has been endangered and what steps have been taken to prevent poaching? 4. What has been the human toll in death or injury of tropical storms in recent years? Modified from the DUC 2005 competition (Hoa Trang Dang 2005)
Answering harder questions: Query-focused multi-document summarization • The (bottom-up) snippet method: • find a set of relevant documents • extract informative sentences from the documents • order and modify the sentences into an answer • The (top-down) information extraction method: • build specific answerers for different question types: • definition questions • biography questions • certain medical questions
Query-Focused Multi-Document Summarization [Pipeline diagram: Input Docs → Sentence Segmentation (all sentences from documents) → Sentence Simplification (all sentences plus simplified versions) → Content Selection / Sentence Extraction: LLR, MMR (informed by the Query) → extracted sentences → Information Ordering → Sentence Realization → Summary]
Information Ordering • Chronological ordering: order sentences by the date of the document (for summarizing news) (Barzilay, Elhadad, and McKeown 2002). • Coherence: • choose orderings that make neighboring sentences similar (by cosine) • choose orderings in which neighboring sentences discuss the same entity (Barzilay and Lapata 2007) • Topical ordering: learn the ordering of topics in the source documents.
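One hedged way to realize the coherence idea as code: a greedy ordering over bag-of-words cosine similarity (the greedy strategy is an illustrative simplification; the cited work searches over orderings more carefully):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def coherence_order(sentences):
    """Greedy ordering: start with the first sentence, then repeatedly
    append the remaining sentence most similar to the last one chosen."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    remaining = list(range(len(sentences)))
    order = [remaining.pop(0)]
    while remaining:
        nxt = max(remaining, key=lambda i: cosine(vecs[order[-1]], vecs[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return [sentences[i] for i in order]
```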
Domain-specific answering: the Information Extraction method • A good biography of a person contains: the person's birth/death, fame factor, education, nationality, and so on. • A good definition contains: a genus or hypernym ("The Hajj is a type of ritual"). • A medical answer about a drug's use contains: the problem (the medical condition), the intervention (the drug or procedure), and the outcome (the result of the study).