The document discusses relation extraction in language technology, focusing on the formal definition of relations and the importance of extracting structured information from unstructured text. It outlines methods for building relation extractors, such as supervised learning, bootstrapping, and unsupervised techniques, and explains summarization techniques for single and multiple documents, as well as query-focused summaries. Additionally, it touches on the evaluation of summarization through metrics like ROUGE, emphasizing the creation of concise, informative summaries for various applications.
Introduction to the presentation on semantic analysis in language technology with details about the speaker and university.
Explains what relation extraction is, its significance, and examples of binary relations like 'father-of' or 'located-in'.
Describes various approaches to build relation extractors, including supervised, semi-supervised, and bootstrapping methods.
Provides examples of practical activities in relation extraction, focusing on extracting author-book pairs using seed patterns.
Introduces various summarization types, including news and book summaries, and discusses human summarization and extractive vs. abstractive methods.
Examines stages of summarization, types including extractive and abstractive summarization, and various summarization algorithms.
Focuses on methods for content selection during summarization, including supervised and unsupervised techniques, and metrics for evaluation like ROUGE.
Semantic Analysis in Language Technology http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm Summarization Marina Santini santinim@stp.lingfil.uu.se Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden Spring 2016
What's a relation? • A relation can be formally defined as a tuple • t = (e1, e2, …, en) • where the ei are entities in a predefined relation r within document D. • Most relation extraction systems focus on extracting binary relations. • Examples of binary relations include • located-in(CMU, Pittsburgh), • father-of(Manuel Blum, Avrim Blum). • It is also possible to go to higher-order relations and extract more complex ones (e.g., in biomedicine).
Why Relation Extraction? • There exists a vast amount of unstructured electronic text on the Web, including newswire, blogs, emails, governmental documents, chats, and so on. • The whole idea of IE is to turn unstructured text into structured text by annotating semantic information. • RE is the task of recognizing relations between entities in unstructured text. • If a query to a search engine is "When was Gandhi born?", then the expected answer would be "Gandhi was born in 1869". The template of the answer is <PERSON> born-in <YEAR>, which is nothing but the relational triple born-in(PERSON, YEAR), where PERSON and YEAR are the entities.
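To make the template concrete, here is a minimal Python sketch (the regex and function name are illustrative, not a robust extractor; real RE systems handle far more surface variation):

```python
import re

# Illustrative pattern for the born-in(PERSON, YEAR) template:
# a capitalized name followed by "was born in" and a 4-digit year.
BORN_IN = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (\d{4})")

def extract_born_in(text):
    """Return born-in(PERSON, YEAR) triples found in the text."""
    return [("born-in", person, year) for person, year in BORN_IN.findall(text)]

print(extract_born_in("Gandhi was born in 1869."))
# [('born-in', 'Gandhi', '1869')]
```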
Watch out! • RE = extracting facts from unstructured texts, i.e., relations that exist between entities such as dates, proper names, and companies. • Other relations (related to word senses) are semantic relations between concepts: hypernyms, hyponyms, etc., as in WordNet.
How to build relation extractors 1. Hand-written patterns 2. Supervised machine learning 3. Semi-supervised and unsupervised • Bootstrapping (using seeds) • Distant supervision • Unsupervised learning from the web
Seed-based or bootstrapping approaches to relation extraction • No training set? Maybe you have: • a few seed tuples, or • a few high-precision patterns • Can you use those seeds to do something useful? • Bootstrapping: use the seeds to directly learn to populate a relation. Roughly said: use seeds to initialize a process of annotation, then refine through iterations.
DIPRE: Extract <author, book> pairs • Start with 5 seeds: Isaac Asimov / The Robots of Dawn; David Brin / Startide Rising; James Gleick / Chaos: Making a New Science; Charles Dickens / Great Expectations; William Shakespeare / The Comedy of Errors • Find instances: "The Comedy of Errors, by William Shakespeare, was", "The Comedy of Errors, by William Shakespeare, is", "The Comedy of Errors, one of William Shakespeare's earliest attempts", "The Comedy of Errors, one of William Shakespeare's most" • Extract patterns (group by middle, take longest common prefix/suffix): ?x , by ?y , and ?x , one of ?y 's • Now iterate, finding new seeds that match the patterns. Brin, Sergey. 1998. Extracting Patterns and Relations from the World Wide Web.
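A minimal sketch of DIPRE's pattern-induction step under simplified assumptions (contexts are plain strings; Brin's actual system also records URL prefixes and checks pattern specificity):

```python
import os

def longest_common_suffix(strings):
    """Longest common suffix, via the common prefix of the reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def induce_patterns(occurrences):
    """occurrences: (prefix, middle, suffix) contexts around an <x, y> pair,
    e.g. ('', ', by ', ', was ') for 'The Comedy of Errors, by William
    Shakespeare, was'. Group by the middle, then generalize the outer context."""
    by_middle = {}
    for prefix, middle, suffix in occurrences:
        by_middle.setdefault(middle, []).append((prefix, suffix))
    patterns = []
    for middle, contexts in by_middle.items():
        prefixes = [p for p, _ in contexts]
        suffixes = [s for _, s in contexts]
        # Keep only the context shared by all occurrences of this middle.
        patterns.append((longest_common_suffix(prefixes), middle,
                         os.path.commonprefix(suffixes)))
    return patterns

occs = [("", ", by ", ", was "), ("", ", by ", ", is "),
        ("", ", one of ", "'s earliest"), ("", ", one of ", "'s most")]
print(induce_patterns(occs))
# roughly: [('', ', by ', ', '), ('', ', one of ', "'s ")]
```

Grouping by the middle keeps patterns specific, while generalizing the outer context by longest common prefix/suffix is what turns the raw instances into the two patterns shown above.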
Practical Activity • Search for phrasal patterns on the web. Our seeds: "* is a novel by *", "* wrote the novel *", "the novel * was written by *" (optionally add more phrases…) • Further refinements that we felt are needed: • get rid of non-informative text included in the returned strings (maybe by adding additional patterns to the regular expressions) • identify named entities • maybe via regular expressions (e.g., identify words starting with uppercase) • maybe by combining seeds with a NER system • etc. • Google is fantastic, but also unpredictable: different behaviours depending on the machines, domains, and some "hidden" criteria…
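As a hedged illustration of these refinements, the sketch below turns one seed phrase into a regular expression and uses an uppercase-initial heuristic as a crude stand-in for NER (a real pipeline would use a proper NER system):

```python
import re

# Sequences of capitalized words approximate the title and author slots.
CAP_SEQ = r"((?:[A-Z][\w'-]*\s?)+)"
NOVEL_BY = re.compile(CAP_SEQ + r" is a novel by " + CAP_SEQ)

snippet = "Hard Times is a novel by Charles Dickens published in 1854."
for title, author in NOVEL_BY.findall(snippet):
    print(title.strip(), "|", author.strip())
# Hard Times | Charles Dickens
```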
Acknowledgements • Most slides borrowed or adapted from Dan Jurafsky and Christopher Manning, Coursera • Some inspiration from Dragomir Radev, Coursera • J&M (2009)
Text Summarization • Goal: produce an abridged version of a text that contains information that is important or relevant to a user. • Summarization applications: • outlines or abstracts of any document, article, etc. • summaries of email threads • action items from a meeting • simplifying text by compressing sentences
What to summarize? Single vs. multiple documents • Single-document summarization: given a single document, produce • an abstract • an outline • a headline • Multiple-document summarization: given a group of documents, produce a gist of the content: • a series of news stories on the same event • a set of web pages about some topic or question
Query-focused Summarization & Generic Summarization • Generic summarization: summarize the content of a document. • Query-focused summarization: summarize a document with respect to an information need expressed in a user query. • A kind of complex question answering: answer a question by summarizing a document that has the information to construct the answer.
Summarization for Question Answering: Snippets • Create snippets summarizing a web page for a query. • Google: 156 characters (about 26 words) plus title and link.
Summarization for Question Answering: Multiple documents • Create answers to complex questions by summarizing multiple documents. • Instead of giving a snippet for each document, create a cohesive answer that combines information from each document.
Extractive summarization & Abstractive summarization • Extractive summarization: create the summary from phrases or sentences in the source document(s). • Abstractive summarization: express the ideas in the source documents using (at least in part) different words.
Summarization: Three Stages 1. content selection: choose sentences to extract from the document 2. information ordering: choose an order to place them in the summary 3. sentence realization: clean up the sentences [Pipeline diagram: Document → Sentence Segmentation (all sentences from documents) → Content Selection (Sentence Simplification, Sentence Extraction) → extracted sentences → Information Ordering → Sentence Realization → Summary]
Basic Summarization Algorithm 1. content selection: choose sentences to extract from the document 2. information ordering: just use document order 3. sentence realization: keep original sentences [Pipeline diagram: as on the previous slide, Document → Sentence Segmentation → Content Selection → Information Ordering → Sentence Realization → Summary]
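A skeletal Python version of this baseline, assuming a sentence scoring function is supplied (the period split is a naive sentence segmenter, for illustration only):

```python
def summarize(document, score, n=3):
    """Baseline extractive summarizer: content selection by a scoring
    function, document order, original sentences kept as-is."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    # 1. Content selection: pick the n highest-scoring sentences.
    selected = set(sorted(range(len(sentences)),
                          key=lambda i: score(sentences[i]),
                          reverse=True)[:n])
    # 2. Information ordering: just use document order.
    # 3. Sentence realization: keep the original sentences.
    return ". ".join(sentences[i] for i in sorted(selected)) + "."
```

Passing something as trivial as `len` for `score` already yields a running summarizer; the next slides replace it with informative-word weights.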
Unsupervised content selection • Intuition dating back to Luhn (1958): choose sentences that have salient or informative words. • Two approaches to defining salient words: 1. tf-idf: weigh each word wi in document j by tf-idf: weight(wi) = tfij × idfi 2. topic signature: choose a smaller set of salient words, via • mutual information • log-likelihood ratio (LLR), Dunning (1993), Lin and Hovy (2000): weight(wi) = 1 if −2 log λ(wi) > 10, and 0 otherwise. H. P. Luhn. 1958. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2:2, 159-165.
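A hedged sketch of both weighting schemes (the idf table and the LLR statistics are assumed to be precomputed from a background corpus; names are illustrative):

```python
from collections import Counter

def tfidf_weights(doc_tokens, idf):
    """weight(w_i) = tf_ij * idf_i for each word w_i in document j.
    idf: dict word -> idf value from a background corpus (assumed given)."""
    tf = Counter(doc_tokens)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def topic_signature(llr_stats, threshold=10.0):
    """weight(w_i) = 1 if -2 log lambda(w_i) > 10, else 0.
    llr_stats: dict word -> -2 log lambda statistic (assumed precomputed)."""
    return {w: 1.0 if stat > threshold else 0.0 for w, stat in llr_stats.items()}
```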
Topic signature-based content selection with queries (Conroy, Schlesinger, and O'Leary 2006) • Choose words that are informative either by log-likelihood ratio (LLR) or by appearing in the query: weight(wi) = 1 if −2 log λ(wi) > 10, 1 if wi ∈ question, and 0 otherwise. • Weigh a sentence (or window) by the average weight of its words: weight(s) = (1/|S|) Σw∈S weight(w) • (Could learn more complex weights.)
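And a sketch of the sentence weight defined above, averaging word weights over the sentence (tokenization is assumed to be done already):

```python
def sentence_weight(sentence_tokens, salient_words, question_words):
    """weight(s) = (1/|S|) * sum of weight(w) for w in S, where
    weight(w) = 1 if w is a topic-signature word or appears in the query."""
    if not sentence_tokens:
        return 0.0
    hits = sum(1.0 for w in sentence_tokens
               if w in salient_words or w in question_words)
    return hits / len(sentence_tokens)
```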
Supervised content selection • Given: a labeled training set of good summaries for each document. • Align: the sentences in the document with sentences in the summary. • Extract features: • position (first sentence?) • length of sentence • word informativeness, cue phrases • cohesion • Train a binary classifier (put sentence in summary? yes or no). • Problems: • hard to get labeled training data • alignment is difficult • performance is not better than unsupervised algorithms • So in practice: unsupervised content selection is more common.
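A toy sketch of the "train a binary classifier" step, assuming scikit-learn is available (the features and training data are placeholders; the hard part, aligning document sentences with summary sentences, is not shown):

```python
from sklearn.linear_model import LogisticRegression

def features(sentence, position):
    """Simplified versions of the cues above: position, length, cue phrase."""
    tokens = sentence.split()
    return [1.0 if position == 0 else 0.0,      # first sentence?
            float(len(tokens)),                  # sentence length
            1.0 if "in conclusion" in sentence.lower() else 0.0]  # cue phrase

# Toy training data: y[i] = 1 if sentence i was aligned to the human summary.
X = [features("The study finds strong effects.", 0),
     features("Details of the apparatus follow.", 1),
     features("In conclusion, the effect is robust.", 2)]
y = [1, 0, 1]

clf = LogisticRegression().fit(X, y)  # binary decision: in summary, yes or no?
print(clf.predict([features("The results were clear.", 0)]))
```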
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) • Intrinsic metric for automatically evaluating summaries • Based on BLEU (a metric used for machine translation) • Not as good as human evaluation ("Did this answer the user's question?"), but much more convenient • Given a document D and an automatic summary X: 1. Have N humans produce a set of reference summaries of D 2. Run the system, giving automatic summary X 3. What percentage of the bigrams from the reference summaries appear in X? ROUGE-2 = [Σs∈{RefSummaries} Σbigrams i∈s min(count(i, X), count(i, s))] / [Σs∈{RefSummaries} Σbigrams i∈s count(i, s)] Lin and Hovy 2003
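A minimal ROUGE-2 sketch following the formula above (whitespace tokenization and lowercasing only; real implementations add preprocessing such as stemming). The example on the next slide works through the same computation by hand:

```python
from collections import Counter

def bigrams(text):
    """Bigram counts over a naively tokenized, lowercased text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge_2(system, references):
    """Sum of clipped bigram matches over all reference summaries,
    divided by the total number of reference bigrams."""
    sys_counts = bigrams(system)
    overlap = total = 0
    for ref in references:
        ref_counts = bigrams(ref)
        overlap += sum(min(sys_counts[bg], c) for bg, c in ref_counts.items())
        total += sum(ref_counts.values())
    return overlap / total
```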
A ROUGE example: Q: "What is water spinach?" • Human 1: Water spinach is a green leafy vegetable grown in the tropics. • Human 2: Water spinach is a semi-aquatic tropical plant grown as a vegetable. • Human 3: Water spinach is a commonly eaten leaf vegetable of Asia. • System answer: Water spinach is a leaf vegetable commonly eaten in tropical areas of Asia. • ROUGE-2 = (3 + 3 + 6) / (10 + 9 + 9) = 12/28 = .43
Definition questions • Q: What is water spinach? • A: Water spinach (Ipomoea aquatica) is a semi-aquatic leafy green plant with long hollow stems and spear- or heart-shaped leaves, widely grown throughout Asia as a leaf vegetable. The leaves and stems are often eaten stir-fried flavored with salt or in soups. Other common names include morning glory vegetable, kangkong (Malay), rau muong (Viet.), ong choi (Cant.), and kong xin cai (Mand.). It is not related to spinach, but is closely related to sweet potato and convolvulus.
Medical questions • Q: In children with an acute febrile illness, what is the efficacy of single-medication therapy with acetaminophen or ibuprofen in reducing fever? • A: Ibuprofen provided greater temperature decrement and longer duration of antipyresis than acetaminophen when the two drugs were administered in approximately equal doses. (PubMed ID: 1621668, Evidence Strength: A) Demner-Fushman and Lin (2007)
Other complex questions 1. How is compost made and used for gardening (including different types of compost, their uses, origins and benefits)? 2. What causes train wrecks and what can be done to prevent them? 3. Where have poachers endangered wildlife, what wildlife has been endangered and what steps have been taken to prevent poaching? 4. What has been the human toll in death or injury of tropical storms in recent years? Modified from the DUC 2005 competition (Hoa Trang Dang 2005)
Answering harder questions: Query-focused multi-document summarization • The (bottom-up) snippet method: • find a set of relevant documents • extract informative sentences from the documents • order and modify the sentences into an answer • The (top-down) information extraction method: • build specific answerers for different question types: • definition questions • biography questions • certain medical questions
Query-Focused Multi-Document Summarization [Pipeline diagram: Input Docs → Sentence Segmentation (all sentences from documents) → Sentence Simplification (all sentences plus simplified versions) → Content Selection / Sentence Extraction: LLR, MMR (informed by the Query) → extracted sentences → Information Ordering → Sentence Realization → Summary]
Information Ordering • Chronological ordering: order sentences by the date of the document (for summarizing news) (Barzilay, Elhadad, and McKeown 2002). • Coherence: • choose orderings that make neighboring sentences similar (by cosine) • choose orderings in which neighboring sentences discuss the same entity (Barzilay and Lapata 2007) • Topical ordering: learn the ordering of topics in the source documents.
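One hedged way to realize the coherence idea as code: a greedy ordering over bag-of-words cosine similarity (the greedy strategy is an illustrative simplification; the cited work searches over orderings more carefully):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def coherence_order(sentences):
    """Greedy ordering: start with the first sentence, then repeatedly
    append the remaining sentence most similar to the last one chosen."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    remaining = list(range(len(sentences)))
    order = [remaining.pop(0)]
    while remaining:
        nxt = max(remaining, key=lambda i: cosine(vecs[order[-1]], vecs[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return [sentences[i] for i in order]
```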
Domain-specific answering: the Information Extraction method • A good biography of a person contains: the person's birth/death, fame factor, education, nationality, and so on. • A good definition contains: a genus or hypernym ("The Hajj is a type of ritual"). • A medical answer about a drug's use contains: the problem (the medical condition), the intervention (the drug or procedure), and the outcome (the result of the study).