CARTIC RAMAKRISHNAN MEENAKSHI NAGARAJAN AMIT SHETH
We have used material from several popular books, papers, course notes and presentations made by experts in this area. We have provided all references to the best of our knowledge. This list, however, serves only as a pointer to work in this area and is by no means a comprehensive resource.
KNO.E.SIS knoesis.org Director: Amit Sheth knoesis.wright.edu/amit/ Graduate Students: Meena Nagarajan knoesis.wright.edu/students/meena/ [email_address] Cartic Ramakrishnan knoesis.wright.edu/students/cartic/ [email_address]
An Overview of Empirical Natural Language Processing, Eric Brill, Raymond J. Mooney Word Sequence Syntactic Parser Parse Tree Semantic Analyzer Literal Meaning Discourse Analyzer Meaning
Traditional (Rationalist) Natural Language Processing Main insight: Using rule-based representations of knowledge and grammar (hand-coded) for language study KB Text NLP System Analysis
Empirical Natural Language Processing Main insight: Using distributional environment of a word as a tool for language study KB Text NLP System Analysis Corpus Learning System
Two approaches not incompatible. Several systems use both. Many empirical systems make use of manually created domain knowledge. Many empirical systems use representations of rationalist methods replacing hand-coded rules with rules acquired from data.
Several algorithms, methods in each task, rationalist and empiricist approaches What does a NL processing task typically entail? How do systems, applications and tasks perform these tasks? Syntax : POS Tagging, Parser Semantics : Meaning of words, using context/domain knowledge to enhance tasks
Finding more about what we already know Ex. patterns that characterize known information The search/browse OR ‘finding a needle in a haystack’ paradigm Discovering what we did not know Deriving new information from data Ex. Relationships between known entities previously unknown The ‘extracting ore from rock’ paradigm
Information Extraction - those that operate directly on the text input this includes entity, relationship and event detection  Inferring new links and paths between key entities sophisticated representations for information content, beyond the "bag-of-words" representations used by IR systems Scenario detection techniques discover patterns of relationships between entities that signify some larger event, e.g. money laundering activities. 
They all make use of knowledge of language (exploiting syntax and structure, different extents) Named entities begin with capital letters Morphology and meanings of words They all use some fundamental text analysis operations Pre-processing, Parsing, chunking, part-of-speech, lemmatization, tokenization To some extent, they all deal with some language understanding challenges Ambiguity, co-reference resolution, entity variations etc. Use of a core subset of theoretical models and algorithms State machines, rule systems, probabilistic models, vector-space models, classifiers, EM etc.
Analysis for both these goals have many similarities Finding entities What are we interested in knowing more about? (the known) Is what we found something of interest? (the unknown) Text is structured (to some extent) Syntax, Structure Semantics, Pragmatics, Discourse Text is noisy Pre-processing is not an option in many cases Variations not uncommon
Wikipedia like text (GOOD) “ Thomas Edison invented the light bulb.” Scientific literature (BAD) “ This MEK dependency was observed in BRAF mutant cells regardless of tissue lineage, and correlated with both downregulation of cyclin D1 protein expression and the induction of G1 arrest.” Text from Social Media (UGLY) "heylooo..ano u must hear it loadsss bu your propa faabbb!!"
Illustrate analysis of and challenges posed by these three text types throughout the tutorial
WHAT CAN TM DO FOR HARRY POTTER? A bag of words
Discovering connections hidden in text UNDISCOVERED PUBLIC KNOWLEDGE
Undiscovered Public Knowledge [Swanson] – as mentioned in [Hearst99] Search no longer enough Information overload – prohibitively large number of hits UPK increases with increasing corpus size Manual analysis very tedious Examples [Hearst99] Example 1 – Using Text to Form Hypotheses about Disease Example 2 – Using Text to Uncover Social Impact
Swanson’s discoveries Associations between Migraine and Magnesium [Hearst99] stress is associated with migraines stress can lead to loss of magnesium calcium channel blockers prevent some migraines magnesium is a natural calcium channel blocker spreading cortical depression (SCD) is implicated in some migraines high levels of magnesium inhibit SCD migraine patients have high platelet aggregability magnesium can suppress platelet aggregability
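Each statement about migraine in the list above is paired with a statement about magnesium through a shared intermediate concept. A minimal sketch of that linking step follows. The tuples are a hand-encoded paraphrase of the statements, and `link_via_intermediate` is a hypothetical helper for illustration, not Swanson's actual procedure.

```python
from collections import defaultdict

# Hand-encoded paraphrase of the statements above: (concept, related concept).
migraine_facts = [
    ("migraine", "stress"),
    ("migraine", "calcium channel blockers"),
    ("migraine", "spreading cortical depression"),
    ("migraine", "platelet aggregability"),
]
magnesium_facts = [
    ("magnesium", "stress"),
    ("magnesium", "calcium channel blockers"),
    ("magnesium", "spreading cortical depression"),
    ("magnesium", "platelet aggregability"),
]

def link_via_intermediate(a_facts, c_facts):
    """Return {(A, C): sorted intermediates B} for every B shared by A and C."""
    by_b = defaultdict(set)
    for a, b in a_facts:
        by_b[b].add(a)
    links = defaultdict(list)
    for c, b in c_facts:
        for a in by_b.get(b, ()):
            links[(a, c)].append(b)
    return {pair: sorted(bs) for pair, bs in links.items()}

hypotheses = link_via_intermediate(migraine_facts, magnesium_facts)
```

Here `hypotheses` holds a single candidate pair, migraine and magnesium, supported by four intermediate concepts. On a real corpus the facts would come from extraction rather than being typed in by hand.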
Mining popularity from Social Media Goal: Top X artists from MySpace artist comment pages Traditional Top X lists are derived from radio plays and CD sales; this is an attempt at creating a list closer to listeners’ preferences Mining positive, negative affect / sentiment Slang, casual text necessitates transliteration ‘you are so bad’ == ‘you are good’
Mining text to improve existing information access mechanisms Search [Storylines] IR [QA systems] Browsing [Flamenco] Mining text for Discovery & insight [Relationship Extraction] Creation of new knowledge Ontology instance-base population Ontology schema learning
Web search – aims at optimizing for top k (~10) hits Beyond top 10 Pages expressing related latent views on topic Possible reliable sources of additional information Storylines in search results [3]
 
TextRunner[4] A system that uses the result of dependency parses of sentences to train a Naïve Bayes classifier for Web-scale extraction of relationships Does not require parsing for extraction – only required for training Training on features – POS tag sequences, if object is proper noun, number of tokens to right or left etc. This system is able to respond to queries like "What did Thomas Edison invent?"
Castanet [1] Semi-automatically builds faceted hierarchical metadata structures from text This is combined with Flamenco [2] to support faceted browsing of content
Documents Select terms WordNet Build core tree Augment core tree Remove top level categories Compress Tree Divide into facets
Domains used to prune applicable senses in WordNet (e.g. “dip”) [Figure: WordNet hypernym paths for “sundae” and “sherbet”, merged into a single tree; nodes: entity; substance, matter; nutriment; dessert; frozen dessert; ice cream sundae; sundae; sherbet, sorbet; sherbet]
Biologically active substance Lipid Disease or Syndrome affects causes affects causes complicates Fish Oils Raynaud’s Disease ??????? instance_of instance_of UMLS Semantic Network MeSH PubMed 9284 documents 4733 documents 5 documents
[Hearst92] Finding class instances [Ramakrishnan et. al. 08] [Nguyen07] Finding attribute “like” relation instances
Automatic acquisition of Class Labels Class hierarchies Attributes Relationships Constraints Rules
 
SYNTAX, SEMANTICS, STATISTICAL NLP, TOOLS, RESOURCES, GETTING STARTED
[hearst 97] Abstract concepts are difficult to represent “ Countless” combinations of subtle, abstract relationships among concepts Many ways to represent similar concepts E.g. space ship, flying saucer, UFO Concepts are difficult to visualize High dimensionality Tens or hundreds of thousands of features
Ambiguity (sense) Keep that smile playin’ (Smile is a track) Keep that smile on! Variations (spellings, synonyms, complex forms) Ileal Neoplasm vs. Adenomatous lesion of the Ileal wall Coreference resolution “John wanted a copy of Netscape to run on his PC on the desk in his den; fortunately, his ISP included it in their startup package.”
[hearst 97] Highly redundant data … most of the methods count on this property Just about any simple algorithm can get “good” results for simple tasks: Pull out “important” phrases Find “meaningfully” related words Create some sort of summary from documents
Concerned with processing documents in natural language Computational Linguistics, Information Retrieval, Machine learning, Statistics, Information Theory, Data Mining etc. TM generally concerned with practical applications As opposed to lexical acquisition (for example) in CL
Computing Resources Faster disks, CPUs, Networked Information Data Resources Large corpora, tree banks, lexical data for training and testing systems Tools for analysis NL analysis: taggers, parsers, noun-chunkers, tokenizers; Statistical Text Analysis: classifiers, nl model generators Emphasis on applications and evaluation Practical systems experimentally evaluated on real data
TYPES, WHAT THEY DO AND DON’T
Co-occurrence based Rule / knowledge based Statistical / machine-learning based Systems typically use a combination of two or more
Look for terms and posit relationships based on co-occurrence “ A word is known by the company it keeps” Non-trivial as they deal with problems of language – expression variability, ambiguity Sometimes used as a simple baseline when evaluating more sophisticated systems
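As a rough illustration of the co-occurrence baseline, the sketch below counts how many sentences mention each pair of terms. The sentences and term list are made up for the example, and whitespace tokenization is a simplifying assumption; a real system has to handle the expression variability and ambiguity noted above.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences, terms):
    """Count how many sentences mention each unordered pair of terms."""
    counts = Counter()
    for sentence in sentences:
        words = set(sentence.lower().split())
        present = sorted(t for t in terms if t in words)
        counts.update(combinations(present, 2))
    return counts

sentences = [
    "stress can lead to loss of magnesium",
    "stress is associated with migraines",
    "magnesium can suppress platelet aggregability",
]
pairs = cooccurrence_counts(sentences, {"stress", "magnesium", "migraines"})
```

A pair with a high count is posited as related; note that "magnesium" and "migraines" never co-occur in these three sentences, so this baseline alone would not link them.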
Exploit real-world knowledge About language About terms in the domain About relationship between terms in the domain About variations of terms we know of etc. Spectrum of work from using hard-coded patterns (text or linguistic) for TM to using linguistic + semantic analysis for TM Harder to develop, maintain rules; comprehensiveness also an issue
Using classifiers operating on text At pos level, n-grams, parse trees Classifying documents, words, attributes/properties Typically requires hand-tagged/labeled data Supervised, semi-supervised approaches http://compbio.uchsc.edu/Hunter_lab/Cohen/Hunter_Cohen_Molecular_Cell.pdf
Computational Linguistics - Syntax Parts of speech, morphology, phrase structure, parsing, chunking Semantics Lexical semantics, Syntax-driven semantic analysis, domain model-assisted semantic analysis (WordNet), Getting your hands dirty Text encoding, Tokenization, sentence splitting, morphology variants, lemmatization Using parsers, understanding outputs Tools, resources, frameworks
Statistical NLP Mathematical foundations, some information theory Words, Statistical Inference using n-grams, language models How are these used – some examples Collocations, Lexical Acquisition, Word sense disambiguation
Several algorithms, variations for each We won’t go into each algorithm; that’s a whole semester course We’ll define the problem, point out the general approach, show examples More importantly, we’ll show how results of these components are used by applications we know of today
POS Tags, Taggers, Ambiguities, Examples Word Sequence Syntactic Parser Parse Tree Semantic Analyzer Literal Meaning Discourse Analyzer Meaning
Word classes, syntactic/grammatical categories, parts-of-speech
Comprehensive lists used by taggers: 87 in the Brown Corpus tagset, 45 in the Penn Treebank tagset, 146 for the C7 tagset

POS Tag   Description                               Example
CC        coordinating conjunction                  and
CD        cardinal number                           1, third
DT        determiner                                the
EX        existential there                         there is
FW        foreign word                              d'hoevre
IN        preposition/subordinating conjunction     in, of, like
JJ        adjective                                 green
JJR       adjective, comparative                    greener
JJS       adjective, superlative                    greenest
LS        list marker                               1)
MD        modal                                     could, will
NN        noun, singular or mass                    table
NNS       noun, plural                              tables
NNP       proper noun, singular                     John
NNPS      proper noun, plural                       Vikings
PDT       predeterminer                             both the boys
POS       possessive ending                         friend's
PRP       personal pronoun                          I, he, it
PRP$      possessive pronoun                        my, his
Assigning a pos or syntactic class marker to a word in a sentence/corpus. Word classes, syntactic/grammatical categories Usually preceded by tokenization delimit sentence boundaries, tag punctuations and words. Publicly available tree banks, documents tagged for syntactic structure Typical input and output of a tagger Cancel that ticket. Cancel /VB that /DT ticket /NN ./.
Lexical ambiguity Words have multiple usages and parts-of-speech A duck in the pond ; Don’t duck when I bowl Is duck a noun or a verb? Yes, we can ; Can of soup ; I canned this idea Is can an auxiliary, a noun or a verb? The problem in tagging is resolving such ambiguities
Information about a word and its neighbors has implications on language models Possessive pronouns (mine, her, its) are usually followed by a noun Understand new words Toves did gyre and gimble. On IE Nouns as cues for named entities Adjectives as cues for subjective expressions
Useful in understanding words We can guess what a new word means by looking at the words around it. Toves did gyre and gimble. Toves is something that can perform an action. (from Hearst tagging)
Rule-based Database of hand-written/learned rules to resolve ambiguity -EngCG Probability / Stochastic taggers Use a training corpus to compute probability of a word taking a tag in a specific context - HMM Tagger Hybrids – transformation-based The Brill tagger A comprehensive list of available taggers http://www-nlp.stanford.edu/links/statnlp.html#Taggers
Not a complete representation EngCG based on the Constraint Grammar Approach Two step architecture Use a lexicon of words and likely pos tags to first tag words Use a large list of hand-coded disambiguation rules that assign a single pos tag for each word
Sample lexicon
Word     POS   Additional POS features
Slower   ADJ   COMPARATIVE
Show     V     PRESENT
Show     N     NOMINATIVE
Sample rules
What is the best possible tag given this sequence of words? Takes context into account; global Example: HMM (hidden Markov models) A special case of Bayesian Inference likely tag sequence is the one that maximizes the product of two terms: probability of sequence of tags and probability of each tag generating a word
Peter/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
to/TO race/???
$$t_i = \arg\max_j \; P(t_j \mid t_{i-1})\, P(w_i \mid t_j)$$
For the tag VB: $P(\mathrm{VB} \mid \mathrm{TO}) \times P(\mathrm{race} \mid \mathrm{VB})$
Based on the Brown Corpus: probability that you will see this POS transition and that the word will take this POS
$$P(\mathrm{VB} \mid \mathrm{TO}) \times P(\mathrm{race} \mid \mathrm{VB}) = 0.34 \times 0.00003 \approx 0.00001$$
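The tag choice for "race" can be reproduced with a few lines of arithmetic. The sketch below scores each candidate tag as transition probability times emission probability. Only the VB figures appear on the slide; the NN figures are assumed values added so that the argmax has a second candidate to beat. This is a single local decision, not a full Viterbi decoder over the sentence.

```python
# Transition P(tag | previous tag) and emission P(word | tag) estimates.
# The VB figures are from the slide; the NN figures are NOT on the slide and
# are assumed here only so that the argmax has two candidates to compare.
transition = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
emission = {("race", "VB"): 0.00003, ("race", "NN"): 0.00041}

def best_tag(prev_tag, word, candidates):
    """Pick the tag t maximizing P(t | prev_tag) * P(word | t)."""
    scores = {
        tag: transition.get((prev_tag, tag), 0.0) * emission.get((word, tag), 0.0)
        for tag in candidates
    }
    return max(scores, key=scores.get), scores

tag, scores = best_tag("TO", "race", ["VB", "NN"])
```

With these numbers `tag` is `"VB"`: the high chance of a verb following "to" outweighs the fact that "race" is more often emitted as a noun.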
Be aware of the possibility of ambiguities One may have to normalize content before sending it to the tagger
Pre-transliteration: “Rhi you were da coolest last eve” Rhi/VB you/PRP were/VBD da/VBG coolest/JJ last/JJ eve/NN
Post-transliteration: “Rhi you were the coolest last eve” Rhi/VB you/PRP were/VBD the/DT coolest/JJ last/JJ eve/NN
Understanding Phrase Structures, Parsing, Chunking Word Sequence Syntactic Parser Parse Tree Semantic Analyzer Literal Meaning Discourse Analyzer Meaning
Words don’t just occur in some order Words are organized in phrases: groupings of words that clump together Major phrase types Noun phrases Prepositional phrases Verb phrases
Deriving the syntactic structure of a sentence based on a language model (grammar) Natural Language Syntax described by a context free grammar the Start-Symbol S ≡ sentence Non-Terminals NT ≡ syntactic constituents Terminals T ≡ lexical entries/words Productions P ⊆ NT × (NT ∪ T)+ ≡ grammar rules http://www.cs.umanitoba.ca/~comp4190/2006/NLP-Parsing.ppt
S ∈ NT, Part-of-Speech ⊂ NT, Constituents ⊂ NT, Words ⊂ T
Rules:
S → NP VP (statement)
S → Aux NP VP (question)
S → VP (command)
NP → Det Nominal
NP → Proper-Noun
Nominal → Noun | Noun Nominal | Nominal PP
VP → Verb | Verb NP | Verb PP | Verb NP PP
PP → Prep NP
Det → that | this | a
Noun → book | flight | meal | money
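To make the grammar concrete, here is a small recognizer that checks whether a word sequence can be derived from S under these rules. It is a memoized search over spans, chosen because the left-recursive rule for Nominal would loop in a naive top-down parser; it is not one of the parsers named in this tutorial. The lexicon entries for Verb, Aux, Prep and Proper-Noun are assumptions, since the rules list only Det and Noun.

```python
from functools import lru_cache

GRAMMAR = {
    "S": [("NP", "VP"), ("Aux", "NP", "VP"), ("VP",)],
    "NP": [("Det", "Nominal"), ("Proper-Noun",)],
    "Nominal": [("Noun",), ("Noun", "Nominal"), ("Nominal", "PP")],
    "VP": [("Verb",), ("Verb", "NP"), ("Verb", "PP"), ("Verb", "NP", "PP")],
    "PP": [("Prep", "NP")],
}
LEXICON = {
    "Det": {"that", "this", "a"},
    "Noun": {"book", "flight", "meal", "money"},
    # Assumed entries, not listed in the rules above:
    "Verb": {"book", "include"},
    "Aux": {"does"},
    "Prep": {"from"},
    "Proper-Noun": {"Denver"},
}

def recognizes(words, start="S"):
    """Return True if `words` can be derived from `start`."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        """True if `symbol` can derive exactly words[i:j]."""
        if symbol in LEXICON:
            return j - i == 1 and words[i] in LEXICON[symbol]
        return any(matches(rhs, i, j) for rhs in GRAMMAR[symbol])

    def matches(rhs, i, j):
        """True if the symbol sequence `rhs` can derive exactly words[i:j]."""
        if len(rhs) == 1:
            return derives(rhs[0], i, j)
        # Every symbol must cover at least one word, so try each split point.
        return any(
            derives(rhs[0], i, k) and matches(rhs[1:], k, j)
            for k in range(i + 1, j - len(rhs) + 2)
        )

    return derives(start, 0, len(words))
```

For example, `recognizes("does this flight include a meal".split())` succeeds through the question rule S → Aux NP VP. The function only answers yes or no; building the parse tree would require recording which rule and split point succeeded.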
Bottom-up parsing (data-driven) vs. top-down parsing (goal-driven) [Figure: parse tree for “does this flight include a meal”: (S (Aux does) (NP (Det this) (Nominal (Noun flight))) (VP (Verb include) (NP (Det a) (Nominal (Noun meal)))))]
Natural Language Parsers, Peter Hellwig, Heidelberg Constituency Parse - Nested Phrasal Structures Dependency parse - Role Specific Structures
Tagging John/NNP bought/VBD a/DT book/NN ./. Constituency Parse Nested phrasal structure (ROOT (S (NP (NNP John)) (VP (VBD bought) (NP (DT a) (NN book))) (. .))) Typed dependencies Role specific structure nsubj(bought-2, John-1) det(book-4, a-3) dobj(bought-2, book-4)
Grammar checking: sentences that cannot be parsed may have grammatical errors Using results of Dependency parse Word sense disambiguation (dependencies as features or co-occurrence vectors)
MINIPAR http://www.cs.ualberta.ca/~lindek/minipar.htm Link Grammar parser: http://www.link.cs.cmu.edu/link/ Standard “CFG” parsers like the Stanford parser http://www-nlp.stanford.edu/software/lex-parser.shtml ENJU’s probabilistic HPSG grammar http://www-tsujii.is.s.u-tokyo.ac.jp/enju/
Some applications don’t need the complex output of a full parse Chunking / Shallow Parse / Partial Parse Identifying and classifying flat, non-overlapping contiguous units in text Segmenting and tagging Example of chunking a sentence [ NP The morning flight] from [ NP Denver] [ VP has arrived] Chunking algos mention
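The chunking example can be approximated by grouping runs of POS tags. The sketch below is a deliberately naive tag-run chunker, not any particular published algorithm, and the POS tags assigned to the sentence are an assumed tagging rather than tagger output.

```python
from itertools import groupby

def chunk_label(tag):
    """Map a Penn Treebank POS tag to a coarse chunk label, or None."""
    if tag in {"DT", "JJ", "PRP$"} or tag.startswith("NN"):
        return "NP"
    if tag == "MD" or tag.startswith("VB"):
        return "VP"
    return None

def chunk(tagged):
    """Group maximal runs of same-label tokens into flat, non-overlapping chunks."""
    chunks = []
    for label, run in groupby(tagged, key=lambda pair: chunk_label(pair[1])):
        words = [word for word, _ in run]
        if label is None:
            # Tokens outside any chunk are kept one by one.
            chunks.extend((None, [w]) for w in words)
        else:
            chunks.append((label, words))
    return chunks

tagged = [("The", "DT"), ("morning", "NN"), ("flight", "NN"), ("from", "IN"),
          ("Denver", "NNP"), ("has", "VBZ"), ("arrived", "VBN")]
chunks = chunk(tagged)
```

A known weakness of this shortcut is that two adjacent noun phrases (as in "gave Mary the book") are merged into one chunk, which is one reason practical chunkers use learned sequence models or richer rules.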
From Hearst 97
Entity recognition people, locations, organizations Studying linguistic patterns (Hearst 92) gave NP gave up NP in NP gave NP NP gave NP to NP
Stanford and Enju parser demos; analyzing results http://www-tsujii.is.s.u-tokyo.ac.jp/enju/demo.html http://nlp.stanford.edu:8080/parser/ If you want to know how to run it stand alone Talk to one of us or see their very helpful help pages
COLORLESS GREEN IDEAS SLEEP FURIOUSLY Word Sequence Syntactic Parser Parse Tree Semantic Analyzer Literal Meaning Discourse Analyzer Meaning
When neither raw linguistic inputs nor any structures derived from them will facilitate the required semantic processing When we need to link linguistic information to non-linguistic real-world knowledge Typical sources of knowledge Meaning of words, grammatical constructs, discourse, topic..
Typical sources of knowledge Meanings of words Meanings of grammatical constructs Knowledge about structure of discourse Common sense knowledge about topic Knowledge about state of affairs in which discourse is occurring
Lexical Semantics The meanings of individual words Formal Semantics (Compositional Semantics or Sentential Semantics) How those meanings combine to make meanings for individual sentences or utterances Discourse or Pragmatics How those meanings combine with each other and with other facts about various kinds of context to make meanings for a text or discourse http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
Lexeme: set of forms taken by a single word run, runs, ran and running forms of the same lexeme RUN Lemma: a particular form of a lexeme that is chosen to represent a canonical form Carpet for carpets; Sing for sing, sang, sung Lemmatization: Meaning of a word approximated by meaning of its lemma Mapping a morphological variant to its root Derivational and Inflectional Morphology
Word sense: Meaning of a word (lemma) Varies with context Significance Lexical ambiguity consequences on tasks like parsing and tagging implications on results of Machine translation, Text classification etc. Word Sense Disambiguation Selecting the correct sense for a word
The study of the way words are built up from smaller meaning units. Derivational and Inflectional morphology Forming new words from old words (derivational) Suffixes (inflections) Knowing root helps us understand new words
Porter Stemming Algorithm http://tartarus.org/~martin/PorterStemmer/ Catvar http://clipdemos.umiacs.umd.edu/catvar/ Lingsoft http://www2.lingsoft.fi/cgi-bin/engtwol?word=cmputers Wordnet http://www.shiffman.net/teaching/a2z/wordnet/
Homonymy Polysemy Synonymy Antonymy Hypernymy Hyponymy Meronymy Why do we care? http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
Homonymy: share a form, relatively unrelated senses Bank (financial institution, a sloping mound) Polysemy: semantically related Bank as a financial institution, as a blood bank Verbs tend more to polysemy
Synonymy: different words/lemmas that have the same sense Couch/sofa One sense more specific than the other (hyponymy) Car is a hyponym of vehicle One sense more general than the other (hypernymy) Vehicle is a hypernym of car
Meronymy Engine part of car; engine meronym of car Holonymy Car is a holonym of engine
Semantic fields Cohesive chunks of knowledge Air travel: Flight, travel, reservation, ticket, departure…
Models these sense relations A hierarchically organized lexical database On-line thesaurus + aspects of a dictionary Versions for other languages are under development http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
Verbs and Nouns in separate hierarchies http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
The set of near-synonyms for a WordNet sense is called a synset ( synonym set ) Their version of a sense or a concept Duck as a verb to mean to move (the head or body) quickly downwards or away dip, douse, hedge, fudge, evade, put off, circumvent, parry, elude, skirt, dodge, duck, sidestep
IR and QnA Indexing using similar (synonymous) words/query or specific to general words (hyponymy / hypernymy) improves text retrieval Machine translation, QnA Need to know if two words are similar to know if we can substitute one for another
Most well developed Synonymy or similarity Synonymy - a binary relationship between words, rather their senses Approaches Thesaurus based : measuring word/sense similarity in a thesaurus Distributional methods: finding other words with similar distributions in a corpus
Thesaurus based Path based similarity – two words are similar if they are close to each other in the thesaurus hierarchy http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
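Path-based similarity can be sketched over a toy hypernym hierarchy. The fragment below is hand-written for illustration and is not real WordNet data. The score 1 / (1 + path length) is one common choice; other path-based measures exist.

```python
HYPERNYM = {  # child -> parent; a toy fragment, not real WordNet data
    "sundae": "frozen dessert",
    "sherbet": "frozen dessert",
    "frozen dessert": "dessert",
    "dessert": "nutriment",
    "nutriment": "substance",
    "substance": "entity",
}

def ancestors(term):
    """Return [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def path_length(a, b):
    """Number of edges between a and b through their lowest common ancestor."""
    depth_a = {node: d for d, node in enumerate(ancestors(a))}
    for d_b, node in enumerate(ancestors(b)):
        if node in depth_a:
            return depth_a[node] + d_b
    raise ValueError("no common ancestor")

def path_similarity(a, b):
    """Similarity in (0, 1]; identical terms score 1.0."""
    return 1.0 / (1.0 + path_length(a, b))
```

Under this toy hierarchy, "sundae" and "sherbet" are two edges apart (through "frozen dessert"), so they score higher than "sundae" and "entity", which are five edges apart.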
We don’t have a thesaurus for every language. Even if we do, many words are missing Wordnet: Strong for nouns, but lacking for adjectives and even verbs Expensive to build They rely on hyponym info for similarity car hyponym of vehicle Alternative - Distributional methods for word similarity http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
Firth (1957): “You shall know a word by the company it keeps!” Similar words appear in similar contexts - Nida example noted by Lin: A bottle of tezgüino is on the table Everybody likes tezgüino Tezgüino makes you drunk We make tezgüino out of corn. Partial material from http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
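The tezgüino intuition can be sketched with context-count vectors and cosine similarity. The first three corpus sentences are from the example above; the sentences about wine and wood are invented for the comparison, and the window size and whitespace tokenization are arbitrary simplifications.

```python
import math
from collections import Counter

def context_vector(target, sentences, window=2):
    """Count words appearing within `window` tokens of each occurrence of target."""
    vec = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(t for t in tokens[lo:hi] if t != target)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

corpus = [
    "a bottle of tezgüino is on the table",
    "everybody likes tezgüino",
    "tezgüino makes you drunk",
    # The three sentences below are invented for the comparison.
    "a bottle of wine is on the table",
    "wine makes you drunk",
    "the table is made of wood",
]
sim_wine = cosine(context_vector("tezgüino", corpus), context_vector("wine", corpus))
sim_wood = cosine(context_vector("tezgüino", corpus), context_vector("wood", corpus))
```

Because "tezgüino" and "wine" share most of their neighbouring words, `sim_wine` comes out well above `sim_wood`, which is the distributional argument that tezgüino is some kind of alcoholic drink.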
So you want to build your own text miner!
Infrastructure intensive Luckily, plenty of open source tools, frameworks, resources.. http://www-nlp.stanford.edu/links/statnlp.html http://www.cedar.buffalo.edu/~rohini/CSE718/References2.html
Mining opinions from casual text Data – user comments on artist pages from MySpace “ Your musics the shit,…lovve your video you are so bad” “ Your music is wicked!!!!” Goal Popularity lists generated from listener’s comments to complement radio plays/cd sales lists
“ Your musics the shit,…lovve your video you are so bad” Pre-processing strip html, normalizing text from different sources.. Tokenization Splitting text into tokens : word tokens, number tokens, domain specific requirements Sentence splitting ! . ? … ; harder in casual text Normalizing words Stop word removal, lemmatization, stemming, transliterations (da == the)
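A minimal sketch of these pre-processing steps, using only the standard library. The slang map and stop-word list are illustrative assumptions (only "da" == "the" comes from the slide), and the regular-expression tokenizer is a simplification that would need domain-specific rules for real MySpace comments.

```python
import re

# Hypothetical slang map; only "da" -> "the" comes from the slide.
SLANG = {"da": "the", "lovve": "love"}
# Illustrative stop words, not a standard list.
STOP_WORDS = {"the", "your", "you", "are", "so"}

def normalize(comment):
    """Strip HTML tags, lowercase, tokenize, map slang, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", comment)
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    tokens = [SLANG.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = normalize("<b>Rhi</b> you were da coolest last eve")
```

The order matters: slang is mapped before stop-word removal so that "da" is first rewritten to "the" and then dropped. Stemming or lemmatization would follow as a further step.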
‘The smile is so wicked!!’ Syntax : Marking sentiment expression from syntax or a dictionary The/DT smile/NN is/VBZ so/RB wicked/JJ !/. !/. Semantics : Surrounding context On Lily Allen’s MySpace page. Cues for Co-ref resolution Smile is a track by Lily Allen. Ambiguity Background knowledge / resources Using urbandictionary.com for semantic orientation of ‘wicked’
GATE - General Architecture for Text Engineering, since 1995 at University of Sheffield, UK UIMA - Unstructured Information Management Architecture, IBM Document processing tools, Components syntactic tools, nlp tools, integrating framework
The GATE (General Architecture for Text Engineering) System: http://gate.ac.uk http://sourceforge.net/projects/gate User’s Guide: http://gate.ac.uk/sale/tao/ IBM’s UIMA (Unstructured Information Management Architecture): http://www.research.ibm.com/UIMA/ http://sourceforge.net/projects/uima-framework/ Other Resources WordNet: http://wordnet.princeton.edu/ MuNPEx: http://www.ipd.uka.de/~durm/tm/munpex/
TO COME: USAGE EXAMPLES OF WHAT WE COVERED THUS FAR
SAMPLE APPLICATIONS, SURVEY OF EFFORTS IN TWO SAMPLE AREAS
This MEK dependency was observed in BRAF mutant cells regardless of tissue lineage, and correlated with both downregulation of cyclin D1 protein expression and the induction of G1 arrest. *MEK dependency ISA Dependency_on_an_Organic_chemical *BRAF mutant cells ISA Cell_type *downregulation of cyclin D1 protein expression ISA Biological_process *tissue lineage ISA Biological_concept *induction of G1 arrest ISA Biological_process Information Extraction = segmentation+classification+association+mining Text mining = entity identification+named relationship extraction+discovering association chains…. Segmentation Classification Named Relationship Extraction MEK dependency observed in BRAF mutant cells downregulation of cyclin D1 protein expression correlated with induction of G1 arrest correlated with
The task of classifying token sequences in text into one or more predefined classes Approaches Look up a list Sliding window Use rules Machine learning Compound entities Applied to Wikipedia like text Biomedical text
The simplest approach Proper nouns make up majority of named entities Look up a gazetteer CIA fact book for organizations, country names etc. Poor recall coverage problems
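A gazetteer lookup can be sketched as a greedy longest-match scan over the token sequence. The three-entry gazetteer below is hypothetical; a real one would be loaded from a resource such as a fact book.

```python
GAZETTEER = {  # tiny hypothetical list; a real one would be loaded from a resource
    ("baghdad",): "LOC",
    ("david", "petraeus"): "PER",
    ("hughes", "communications", "ltd."): "ORG",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def lookup_entities(tokens):
    """Greedy longest-match lookup; returns (start, end, label) spans."""
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so multi-word names win.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                spans.append((i, i + n, GAZETTEER[key]))
                i += n
                break
        else:
            i += 1
    return spans

spans = lookup_entities("U.S. general David Petraeus heads for Baghdad .".split())
```

The lookup finds the person and the location but misses "U.S." because it is not in the list, which is the coverage problem behind the poor recall noted above.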
Rule based [Mikheev et al. 1999] Frequency based "China International Trust and Investment Corp" "Suspended Ceiling Contractors Ltd" "Hughes" when "Hughes Communications Ltd." is already marked as an organization Scalability issues: Expensive to create manually Leverages domain specific information – domain specific Tend to be corpus-specific – due to manual process
Machine learning approaches Ability to generalize better than rules Can capture complex patterns Requires training data Often the bottleneck Techniques [list taken from Agichtein2007 ] Naive Bayes SRV [Freitag 1998], Inductive Logic Programming Rapier [Califf and Mooney 1997] Hidden Markov Models [Leek 1997] Maximum Entropy Markov Models [McCallum et al. 2000] Conditional Random Fields [Lafferty et al. 2001]
Orthographical Features CD28 a protein Context Features Window of words Fixed Variable Part-of-speech features Current word Adjacent words – within fixed window Word shape features Kappa-B replaced with Aaaaa-A Dictionary features Inexact matches Prefixes and Suffixes “ ~ase” = protein
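Several of these features are easy to compute per token. The sketch below derives a word shape (which turns Kappa-B into Aaaaa-A), a digit flag, a three-character suffix and a fixed context window. The feature names and the window size are illustrative choices, not those of any specific system.

```python
import re

def word_shape(token):
    """Replace uppercase letters with 'A', lowercase with 'a', digits with '0'."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"[0-9]", "0", shape)

def token_features(tokens, i, window=1):
    """Build a feature dictionary for tokens[i] and its neighbours."""
    tok = tokens[i]
    feats = {
        "word": tok.lower(),
        "shape": word_shape(tok),
        "has_digit": any(c.isdigit() for c in tok),
        "suffix3": tok[-3:].lower(),
    }
    # Context features: neighbouring words inside a fixed window.
    for offset in range(-window, window + 1):
        if offset and 0 <= i + offset < len(tokens):
            feats[f"word[{offset:+d}]"] = tokens[i + offset].lower()
    return feats
```

A dictionary like this, one per token, is the usual input to the sequence classifiers discussed next; the "ase" suffix of "kinase", for instance, shows up as `suffix3`.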
HMMs: a powerful tool for representing sequential data They are probabilistic finite state models with parameters for state-transition probabilities and state-specific observation probabilities The observation probabilities are typically represented as a multinomial distribution over a discrete, finite vocabulary of words Training is used to learn parameters that maximize the probability of the observation sequences in the training data Generative: find parameters to maximize P(X,Y) When labeling X_i, future observations are taken into account (forward-backward) Problems Feature overlap in NER E.g. to extract previously unseen company names from a newswire article, the identity of a word alone is not very predictive; knowing that the word is capitalized, that it is a noun, that it is used in an appositive, and that it appears near the top of the article would all be quite predictive We would like the observations to be parameterized with these overlapping features Feature independence assumption Several features about the same word can affect parameters
MEMMs [McCallum et al., 2000] Discriminative: find parameters to maximize P(Y|X) No longer assume that features are independent f<Is-capitalized,Company>(“Apple”, Company) = 1. Do not take future observations into account (no forward-backward) Problems Label bias problem
CRFs [Lafferty et al., 2001] Discriminative Doesn’t assume that features are independent When labeling Y_i, future observations are taken into account Global optimization – label bias prevented The best of both worlds!
Example: [ORG U.S.] general [PER David Petraeus] heads for [LOC Baghdad] .

Token      POS   Chunk   Tag
U.S.       NNP   I-NP    I-ORG
general    NN    I-NP    O
David      NNP   I-NP    B-PER
Petraeus   NNP   I-NP    I-PER
heads      VBZ   I-VP    O
for        IN    I-PP    O
Baghdad    NNP   I-NP    I-LOC
.          .     O       O

CoNLL format – Mallet
Major bottleneck is training data
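Rows in this token-per-line format can be turned back into entity spans by reading the last column. The sketch below assumes whitespace-separated fields and treats an I- tag as continuing an entity only when the previous token carried the same label; it is a reading aid for the format, not part of Mallet.

```python
ROWS = """U.S. NNP I-NP I-ORG
general NN I-NP O
David NNP I-NP B-PER
Petraeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O"""

def entity_spans(rows):
    """Collect (label, tokens) spans from the last column of CoNLL-style rows."""
    spans, current = [], None
    for line in rows.splitlines():
        token, _pos, _chunk, tag = line.split()
        prefix, _, label = tag.partition("-")
        continues = current is not None and prefix == "I" and current[0] == label
        if tag == "O":
            current = None
        elif continues:
            current[1].append(token)
        else:  # a B- tag, or an I- tag that opens a new entity
            current = (label, [token])
            spans.append(current)
    return spans

spans = entity_spans(ROWS)
```

The result recovers the three bracketed entities of the example sentence: one ORG, one two-token PER and one LOC.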
Context Induction approach [Talukdar2006] Starting with a few seed entities, it is possible to induce high-precision context patterns by exploiting entity context redundancy. New entity instances of the same category can be extracted from unlabeled data with the induced patterns to create high-precision extensions of the seed lists. Features derived from token membership in the extended lists improve the accuracy of learned named-entity taggers. Pruned Extraction patterns Feature generation For CRF
Machine Learning Best performance Problem Training data bottleneck Pattern induction Reduce training data creation time
Knowledge Engineering approach Manually crafted rules Over lexical items <person> works for <organization> Over syntactic structures – parse trees GATE Machine learning approaches Supervised Semi-supervised Unsupervised
Supervised BioText – extraction of relationships between diseases and their treatments [Rosario et. al 2004] Rule-based supervised approach [Rinaldi et. al 2004] Semantics of specific relationship encoded as rules Identify a set of relations along with their morphological variants (bind, regulate, signal etc.) subj(bind,X,_,_),pobj(bind,Y,to,_) prep(Y,to,_,_) => bind(X,Y). Axiom formulation was however a manual process involving a domain expert.
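The bind rule can be read as a pattern over dependency facts: a subject of "bind" plus a "to" object of "bind" yields bind(X, Y). The sketch below is a simplified reading of that rule. The four-field fact format, the example sentence and the `extract_bind` helper are all assumptions for illustration and do not reproduce the system of [Rinaldi et al. 2004].

```python
# Simplified dependency facts for "ProteinA binds to ReceptorB" (invented sentence).
# Each fact is (relation, head lemma, dependent, preposition or None).
facts = [
    ("subj", "bind", "ProteinA", None),
    ("pobj", "bind", "ReceptorB", "to"),
]

def extract_bind(facts):
    """Emit bind(X, Y) when 'bind' has a subject X and a 'to'-object Y."""
    subjects = [dep for rel, head, dep, _ in facts
                if rel == "subj" and head == "bind"]
    objects = [dep for rel, head, dep, prep in facts
               if rel == "pobj" and head == "bind" and prep == "to"]
    return [("bind", x, y) for x in subjects for y in objects]

relations = extract_bind(facts)
```

Each new relation type (regulate, signal and so on) needs its own hand-written pattern of this kind, which is the manual effort the slide points out.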
Hand-coded domain specific rules that encode patterns used to extract Molecular pathways [Friedman et al. 2001] Protein interaction [Saric et al. 2006] All of the above are in the biomedical domain Notice – specificity of relationship types Amount of effort required Also notice types of entities involved in the relationships
IMPLICIT EXPLICIT
Semantic Role Labeling Features Detailed tutorial on SRL is available By Yih & Toutanova here
Other approaches Discovering concept-specific relations Dmitry Davidov, et. al 2007, preemptive IE approach Rosenfeld & Feldman 2007 Open Information Extraction Banko et. al 2007 Self supervised approach Uses dependency parses to train extractors On-demand information extraction Sekine 2006 IR driven Patterns discovery Paraphrase
Rule and Heuristic based method YAGO Suchanek et. al, 2007 Pattern-based approach Uses WordNet Subtree mining over dependency parse trees Nguyen et. al, 2007
Entities (MeSH terms) in sentences occur in modified forms “ adenomatous ” modifies “ hyperplasia ” “ An excessive endogenous or exogenous stimulation ” modifies “ estrogen ” Entities can also occur as composites of 2 or more other entities “ adenomatous hyperplasia ” and “ endometrium ” occur as “ adenomatous hyperplasia of the endometrium”
Relationship head Subject head Object head Object head Small set of rules over dependency types dealing with modifiers ( amod, nn ) etc. subjects, objects ( nsubj, nsubjpass ) etc. Since dependency types are arranged in a hierarchy We use this hierarchy to generalize the more specific rules There are only 4 rules in our current implementation Carroll, J., G. Minnen and E. Briscoe (1999) `Corpus annotation for parser evaluation'. In Proceedings of the EACL-99 Post-Conference Workshop on Linguistically Interpreted Corpora, Bergen, Norway. 35-41. Also in Proceedings of the ATALA Workshop on Corpus Annotés pour la Syntaxe - Treebanks, Paris, France. 13-20.
Modifiers Modified entities Composite Entities
Manual Evaluation Test if the RDF conveys same “meaning” as the sentence Juxtapose the triple with the sentence Allow user to assess correctness/incorrectness of the subject, object and triple
 
 
Discovering informative subgraphs (Harry Potter) Given a pair of end-points (entities) Produce a subgraph with relationships connecting them such that The subgraph is small enough to be visualized And contains relevant “interesting” connections We defined an interestingness measure based on the ontology schema In future biomedical domain the scientist will control this with the help of a browsable ontology Our interestingness measure takes into account Specificity of the relationships and entity classes involved Rarity of relationships etc. Cartic Ramakrishnan , William H. Milnor, Matthew Perry, Amit P. Sheth: Discovering informative connection subgraphs in multi-relational graphs. SIGKDD Explorations 7 (2): 56-63 (2005)
Two factors influencing interestingness
Bidirectional lock-step growth from S and T Choice of next node based on interestingness measure Stop when there are enough connections between the frontiers This is treated as the candidate graph
Model the candidate graph as an electrical circuit: S is the source and T the sink. Edge weights derived from the ontology schema are treated as conductance values. Using Ohm’s and Kirchhoff’s laws we find maximum current flow paths through the candidate graph from S to T. At each step we add this path to the output graph to be displayed, and we repeat this process until a predefined number of nodes is reached. Results: Arnold Schwarzenegger, Edward Kennedy. Other related work: Semantic Associations
 
 
Text Mining, Analysis → understanding → utilization in decision making → knowledge discovery. Entity Identification → focus change from simple to compound. Relationship extraction → implicit vs. explicit. Need more unsupervised approaches. Need to think of incentives to evaluate
Existing corpora GENIA , BioInfer many others Narrow focus Precision and Recall Utility How useful is the extracted information? How do we measure utility? Swanson’s discovery, Enrichment of Browsing experience Text types and mining Systematically compensating for (in)formality
http://www.cs.famaf.unc.edu.ar/~laura/text_mining/ http://www.stanford.edu/class/cs276/cs276-2005-syllabus.html http://www-nlp.stanford.edu/links/statnlp.html http://www.cedar.buffalo.edu/~rohini/CSE718/References2.html
Automating Creation of Hierarchical Faceted Metadata Structures, Emilia Stoica, Marti Hearst, and Megan Richardson, in the proceedings of NAACL-HLT, Rochester NY, April 2007. Finding the Flow in Web Site Search, Marti Hearst, Jennifer English, Rashmi Sinha, Kirsten Swearingen, and Ping Yee, Communications of the ACM, 45 (9), September 2002, pp. 42-49. R. Kumar, U. Mahadevan, and D. Sivakumar, "A graph-theoretic approach to extract storylines from search results", in Proc. KDD, 2004, pp. 216-225. Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. IJCAI 2007: 2670-2676. Hearst, M. A. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2 (Nantes, France, August 23 - 28, 1992). Dat P. T. Nguyen, Yutaka Matsuo, Mitsuru Ishizuka: Relation Extraction from Wikipedia Using Subtree Mining. AAAI 2007: 1414-1420. "Unsupervised Discovery of Compound Entities for Relationship Extraction", Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang and Amit P. Sheth, EKAW 2008 - 16th International Conference on Knowledge Engineering and Knowledge Management Knowledge Patterns. Mikheev, A., Moens, M., and Grover, C. 1999. Named Entity recognition without gazetteers. In Proceedings of the Ninth Conference on European Chapter of the Association For Computational Linguistics (Bergen, Norway, June 08 - 12, 1999). McCallum, A., Freitag, D., and Pereira, F. C. 2000. Maximum Entropy Markov Models for Information Extraction and Segmentation. In Proceedings of the Seventeenth International Conference on Machine Learning. Lafferty, J. D., McCallum, A., and Pereira, F. C. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning
Rosario, B. and Hearst, M. A., Classifying semantic relations in bioscience texts, in Proceedings of the 42nd ACL. 2004, Association for Computational Linguistics: Barcelona, Spain. M.A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING '92, pages 539–545. M. Hearst, "Untangling text data mining," 1999. [Online]. Available: http://citeseer.ist.psu.edu/563035.html Friedman, C., et al., GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 2001. 17 Suppl 1: p. 1367-4803. Saric, J., et al., Extraction of regulatory gene/protein networks from Medline. Bioinformatics, 2005. Ciaramita, M., et al., Unsupervised Learning of Semantic Relations between Concepts of a Molecular Biology Ontology, in 19th IJCAI. 2005. Dmitry Davidov, Ari Rappoport, Moshe Koppel. Fully Unsupervised Discovery of Concept-Specific Relationships by Web Mining. Proceedings, ACL 2007, June 2007, Prague. Rosenfeld, B. and Feldman, R. 2007. Clustering for unsupervised relation identification. In Proceedings of the Sixteenth ACM Conference on Conference on information and Knowledge Management (Lisbon, Portugal, November 06 - 10, 2007). Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. IJCAI 2007: 2670-2676. Sekine, S. 2006. On-demand information extraction. In Proceedings of the COLING/ACL on Main Conference Poster Sessions (Sydney, Australia, July 17 - 18, 2006). Annual Meeting of the ACL. Association for Computational Linguistics, Morristown, NJ, 731-738. Suchanek, F. M., Kasneci, G., and Weikum, G. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th international Conference on World Wide Web (Banff, Alberta, Canada, May 08 - 12, 2007). WWW '07.

Text Analytics for Semantic Computing

  • 11.
    They all make use of knowledge of language (exploiting syntax and structure, to different extents) Named entities begin with capital letters Morphology and meanings of words They all use some fundamental text analysis operations Pre-processing, parsing, chunking, part-of-speech tagging, lemmatization, tokenization To some extent, they all deal with some language understanding challenges Ambiguity, co-reference resolution, entity variations etc. Use of a core subset of theoretical models and algorithms State machines, rule systems, probabilistic models, vector-space models, classifiers, EM etc.
  • 12.
    Analysis for both these goals has many similarities Finding entities What are we interested in knowing more about? (the known) Is what we found something of interest? (the unknown) Text is structured (to some extent) Syntax, Structure Semantics, Pragmatics, Discourse Text is noisy Pre-processing is not an option in many cases Variations not uncommon
  • 13.
    Wikipedia-like text (GOOD) “Thomas Edison invented the light bulb.” Scientific literature (BAD) “This MEK dependency was observed in BRAF mutant cells regardless of tissue lineage, and correlated with both downregulation of cyclin D1 protein expression and the induction of G1 arrest.” Text from Social Media (UGLY) "heylooo..ano u must hear it loadsss bu your propa faabbb!!"
  • 14.
    Illustrate analysis of and challenges posed by these three text types throughout the tutorial
  • 15.
    WHAT CAN TM DO FOR HARRY PORTER? A bag of words
  • 16.
    Discovering connections hidden in text UNDISCOVERED PUBLIC KNOWLEDGE
  • 17.
    Undiscovered Public Knowledge [Swanson] – as mentioned in [Hearst99] Search no longer enough Information overload – prohibitively large number of hits UPK increases with increasing corpus size Manual analysis very tedious Examples [Hearst99] Example 1 – Using Text to Form Hypotheses about Disease Example 2 – Using Text to Uncover Social Impact
  • 18.
    Swanson’s discoveries Associations between Migraine and Magnesium [Hearst99]: stress is associated with migraines; stress can lead to loss of magnesium; calcium channel blockers prevent some migraines; magnesium is a natural calcium channel blocker; spreading cortical depression (SCD) is implicated in some migraines; high levels of magnesium inhibit SCD; migraine patients have high platelet aggregability; magnesium can suppress platelet aggregability
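The linking pattern behind these associations can be sketched in a few lines: two terms that are never mentioned together may still share intermediate terms. The statement sets below paraphrase the slide; the function names and the reduction of each statement to a pair of terms are assumptions made for illustration.

```python
from collections import defaultdict

# Each statement is reduced to the set of terms it mentions together.
STATEMENTS = [
    {"stress", "migraine"},
    {"stress", "magnesium"},
    {"calcium channel blockers", "migraine"},
    {"calcium channel blockers", "magnesium"},
    {"spreading cortical depression", "migraine"},
    {"spreading cortical depression", "magnesium"},
    {"platelet aggregability", "migraine"},
    {"platelet aggregability", "magnesium"},
]


def co_mentions(statements):
    """Index every term by the terms it is mentioned with."""
    index = defaultdict(set)
    for terms in statements:
        for term in terms:
            index[term] |= terms - {term}
    return index


def linking_terms(a, c, statements):
    """Terms that connect a and c when a and c are never mentioned together."""
    index = co_mentions(statements)
    if c in index[a]:
        return []  # already directly connected, nothing hidden to report
    return sorted(index[a] & index[c])


LINKS = linking_terms("migraine", "magnesium", STATEMENTS)
print(LINKS)
```

On real literature the hard part is the one this sketch skips: recognising the terms and their variants in text before they can be indexed.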
  • 19.
    Mining popularity from Social Media Goal: Top X artists from MySpace artist comment pages Traditional Top X lists are derived from radio plays and CD sales An attempt at creating a list closer to listeners’ preferences Mining positive, negative affect / sentiment Slang, casual text necessitates transliteration ‘you are so bad’ == ‘you are good’
  • 20.
    Mining text to improve existing information access mechanisms Search [Storylines] IR [QA systems] Browsing [Flamenco] Mining text for Discovery & insight [Relationship Extraction] Creation of new knowledge Ontology instance-base population Ontology schema learning
  • 21.
    Web search – aims at optimizing for top k (~10) hits Beyond top 10 Pages expressing related latent views on topic Possible reliable sources of additional information Storylines in search results [3]
  • 22.
  • 23.
    TextRunner [4] A system that uses the result of dependency parses of sentences to train a Naïve Bayes classifier for Web-scale extraction of relationships Does not require parsing for extraction – only required for training Training on features – POS tag sequences, if object is proper noun, number of tokens to right or left etc. This system is able to respond to queries like "What did Thomas Edison invent?"
  • 24.
    Castanet [1] Semi-automatically builds faceted hierarchical metadata structures from text This is combined with Flamenco [2] to support faceted browsing of content
  • 25.
    Documents Select terms WordNet Build core tree Augment core tree Remove top level categories Compress Tree Divide into facets
  • 26.
    Domains used to prune applicable senses in WordNet (e.g. “dip”) frozen dessert sundae entity substance,matter nutriment dessert ice cream sundae frozen dessert entity substance,matter nutriment dessert sherbet,sorbet sherbet sundae sherbet substance,matter nutriment dessert sherbet,sorbet frozen dessert entity ice cream sundae
  • 27.
    Biologically active substance Lipid Disease or Syndrome affects causes affects causes complicates Fish Oils Raynaud’s Disease ??????? instance_of instance_of UMLS Semantic Network MeSH PubMed 9284 documents 4733 documents 5 documents
  • 28.
    [Hearst92] Finding class instances [Ramakrishnan et al. 08] [Nguyen07] Finding attribute “like” relation instances
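As an illustration of finding class instances with a lexico-syntactic pattern, here is a minimal sketch of the "NP such as NP, NP and NP" pattern from [Hearst92]. Noun phrases are approximated by single words, which is a strong simplification, and the regular expression and function name are assumptions rather than the original implementation.

```python
import re

# "<class> such as <instance>, <instance> and <instance>"
# Each noun phrase is approximated by one run of word characters.
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+(?:, | and | or )?)+)")


def class_instances(sentence):
    """Return (instance, class) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        class_label = match.group(1)
        for instance in re.split(r", | and | or ", match.group(2)):
            if instance:
                pairs.append((instance, class_label))
    return pairs


PAIRS = class_instances("He studied authors such as Herrick, Goldsmith and Shakespeare.")
print(PAIRS)
```

A usable extractor would run the pattern over chunked noun phrases instead of raw words, so that multi-word instances and class labels are kept whole.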
  • 29.
    Automatic acquisition of Class Labels Class hierarchies Attributes Relationships Constraints Rules
  • 30.
  • 31.
    SYNTAX, SEMANTICS, STATISTICAL NLP, TOOLS, RESOURCES, GETTING STARTED
  • 32.
    [Hearst 97] Abstract concepts are difficult to represent “Countless” combinations of subtle, abstract relationships among concepts Many ways to represent similar concepts E.g. space ship, flying saucer, UFO Concepts are difficult to visualize High dimensionality Tens or hundreds of thousands of features
  • 33.
    Ambiguity (sense) Keep that smile playin’ (Smile is a track) Keep that smile on! Variations (spellings, synonyms, complex forms) Ileal Neoplasm vs. Adenomatous lesion of the Ileal wall Coreference resolution “John wanted a copy of Netscape to run on his PC on the desk in his den; fortunately, his ISP included it in their startup package,”
  • 34.
    [Hearst 97] Highly redundant data … most of the methods count on this property Just about any simple algorithm can get “good” results for simple tasks: Pull out “important” phrases Find “meaningfully” related words Create some sort of summary from documents
  • 35.
    Concerned with processing documents in natural language Computational Linguistics, Information Retrieval, Machine learning, Statistics, Information Theory, Data Mining etc. TM generally concerned with practical applications As opposed to lexical acquisition (for ex.) in CL
  • 36.
    Computing Resources Faster disks, CPUs, Networked Information Data Resources Large corpora, tree banks, lexical data for training and testing systems Tools for analysis NL analysis: taggers, parsers, noun-chunkers, tokenizers; Statistical Text Analysis: classifiers, NL model generators Emphasis on applications and evaluation Practical systems experimentally evaluated on real data
  • 37.
    TYPES, WHAT THEY DO AND DON’T
  • 38.
    Co-occurrence based Rule/ knowledge based Statistical / machine-learning based Systems typically use a combination of two or more
  • 39.
    Look for terms and posit relationships based on co-occurrence “A word is known by the company it keeps” Non-trivial as they deal with problems of language – expression variability, ambiguity Sometimes used as a simple baseline when evaluating more sophisticated systems
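A co-occurrence baseline of this kind can be sketched as sentence-level pair counting. The sentences, the term list and the function name below are made up for illustration; a real system would first have to handle the term variability and ambiguity mentioned above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sentences and terms of interest.
SENTENCES = [
    "Magnesium deficiency was reported in migraine patients",
    "Stress can trigger migraine",
    "Oral magnesium reduced migraine frequency",
    "Stress depletes magnesium",
]
TERMS = {"magnesium", "migraine", "stress"}


def cooccurrence_counts(sentences, terms):
    """Count how often each pair of terms appears in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        present = sorted(term for term in terms if term in tokens)
        counts.update(combinations(present, 2))
    return counts


COUNTS = cooccurrence_counts(SENTENCES, TERMS)
for pair, count in COUNTS.most_common():
    print(pair, count)
```

The counts say that two terms are associated but not how, which is why the method is mostly useful as a baseline.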
  • 40.
    Exploit real-world knowledge About language About terms in the domain About relationship between terms in the domain About variations of terms we know of etc. Spectrum of work from using hard-coded patterns (text or linguistic) for TM to using linguistic + semantic analysis for TM Harder to develop, maintain rules; comprehensiveness also an issue
  • 41.
    Using classifiers operating on text At POS level, n-grams, parse trees Classifying documents, words, attributes/properties Typically requires hand-tagged/labeled data Supervised, semi-supervised approaches http://compbio.uchsc.edu/Hunter_lab/Cohen/Hunter_Cohen_Molecular_Cell.pdf
  • 42.
    Computational Linguistics - Syntax Parts of speech, morphology, phrase structure, parsing, chunking Semantics Lexical semantics, syntax-driven semantic analysis, domain model-assisted semantic analysis (WordNet) Getting your hands dirty Text encoding, tokenization, sentence splitting, morphology variants, lemmatization Using parsers, understanding outputs Tools, resources, frameworks
  • 43.
    Statistical NLP Mathematical foundations, some information theory Words, Statistical Inference using n-grams, language models How are these used – some examples Collocations, Lexical Acquisition, Word sense disambiguation
  • 44.
    Several algorithms, variations for each We won’t go into each algorithm That’s a whole semester course We’ll define the problem, point out general approach, show examples More importantly, we’ll show how results of these components are used by applications we know of today
  • 45.
    POS Tags, Taggers, Ambiguities, Examples Word Sequence Syntactic Parser Parse Tree Semantic Analyzer Literal Meaning Discourse Analyzer Meaning
  • 46.
    Word classes, syntactic/grammatical categories, parts-of-speech Comprehensive lists used by taggers: 87 in the Brown Corpus tagset, 45 in the Penn Treebank tagset, 146 for the C7 tagset
    POS Tag | Description | Example
    CC | coordinating conjunction | and
    CD | cardinal number | 1, third
    DT | determiner | the
    EX | existential there | there is
    FW | foreign word | d'oeuvre
    IN | preposition/subordinating conjunction | in, of, like
    JJ | adjective | green
    JJR | adjective, comparative | greener
    JJS | adjective, superlative | greenest
    LS | list marker | 1)
    MD | modal | could, will
    NN | noun, singular or mass | table
    NNS | noun, plural | tables
    NNP | proper noun, singular | John
    NNPS | proper noun, plural | Vikings
    PDT | predeterminer | both the boys
    POS | possessive ending | friend's
    PRP | personal pronoun | I, he, it
    PRP$ | possessive pronoun | my, his
  • 47.
    Assigning a POS or syntactic class marker to a word in a sentence/corpus. Word classes, syntactic/grammatical categories Usually preceded by tokenization: delimit sentence boundaries, tag punctuation and words. Publicly available tree banks, documents tagged for syntactic structure Typical input and output of a tagger: Cancel that ticket. Cancel/VB that/DT ticket/NN ./.
  • 48.
    Lexical ambiguity Words have multiple usages and parts-of-speech A duck in the pond; Don’t duck when I bowl Is duck a noun or a verb? Yes, we can; Can of soup; I canned this idea Is can an auxiliary, a noun or a verb? Problem in tagging is resolving such ambiguities
  • 49.
    Information about a word and its neighbors has implications on language models Possessive pronouns (mine, her, its) are usually followed by a noun Understand new words Toves did gyre and gimble. On IE Nouns as cues for named entities Adjectives as cues for subjective expressions
  • 50.
    Useful in understanding words We can guess what a new word means by looking at words around it. Toves did gyre and gimble. Toves is something that can perform an action. (from Hearst, tagging)
  • 51.
    Rule-based Database of hand-written/learned rules to resolve ambiguity - EngCG Probability / Stochastic taggers Use a training corpus to compute probability of a word taking a tag in a specific context - HMM Tagger Hybrids – transformation-based The Brill tagger A comprehensive list of available taggers http://www-nlp.stanford.edu/links/statnlp.html#Taggers
  • 52.
    Not a complete representation EngCG based on the Constraint Grammar Approach Two step architecture Use a lexicon of words and likely POS tags to first tag words Use a large list of hand-coded disambiguation rules that assign a single POS tag for each word
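The two-step architecture can be sketched as a lexicon lookup followed by context rules that discard readings. The lexicon entries and the two rules below are hypothetical and far simpler than EngCG's hand-coded constraint rules; they only show the shape of the approach.

```python
# Step 1 resource: a lexicon of words and their possible tags (hypothetical).
LEXICON = {
    "the": {"DT"},
    "show": {"V", "N"},
    "slower": {"ADJ"},
    "to": {"TO"},
    "must": {"MD"},
    "go": {"V"},
    "on": {"IN"},
}


def candidate_tags(tokens):
    """Step 1: look up every possible tag; unknown words default to noun."""
    return [set(LEXICON.get(token.lower(), {"N"})) for token in tokens]


def disambiguate(candidates):
    """Step 2: apply context rules so that each word ends with a single tag."""
    tags = []
    for position, options in enumerate(candidates):
        options = set(options)
        if len(options) > 1 and position > 0:
            previous = tags[position - 1]
            if previous == "DT":
                options.discard("V")   # no verb reading right after a determiner
            if previous in {"TO", "MD"}:
                options.discard("N")   # no noun reading after "to" or a modal
        tags.append(sorted(options)[0])  # fallback: first remaining tag
    return tags


def tag(sentence):
    tokens = sentence.split()
    return list(zip(tokens, disambiguate(candidate_tags(tokens))))


print(tag("The show must go on"))
print(tag("to show"))
```

The same word "show" receives N in the first sentence and V in the second, which is the ambiguity the rule step exists to resolve.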
  • 53.
    Sample lexicon
    Word | POS | Additional POS features
    Slower | ADJ | COMPARATIVE
    Show | V | PRESENT
    Show | N | NOMINATIVE
    Sample rules
  • 54.
    What is the best possible tag given this sequence of words? Takes context into account; global Example: HMM (hidden Markov models) A special case of Bayesian inference: the most likely tag sequence is the one that maximizes the product of two terms: probability of sequence of tags and probability of each tag generating a word
  • 55.
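    Peter/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN. Which tag for race in to/TO race/??? $t_i = \arg\max_j P(t_j \mid t_{i-1})\, P(w_i \mid t_j)$, here $P(\mathrm{VB} \mid \mathrm{TO}) \times P(\mathrm{race} \mid \mathrm{VB})$. Based on the Brown Corpus: probability that you will see this POS transition and that the word will take this POS: $P(\mathrm{VB} \mid \mathrm{TO}) \times P(\mathrm{race} \mid \mathrm{VB}) = 0.34 \times 0.00003 \approx 0.00001$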
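The tag choice for "race" can be worked through numerically. The VB figures are the ones on the slide; the NN figures are assumed values added only so that there is a competing tag to compare against.

```python
# P(tag | previous tag). The NN value is an assumed illustration.
TRANSITION = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
# P(word | tag). The NN value is an assumed illustration.
EMISSION = {("VB", "race"): 0.00003, ("NN", "race"): 0.00041}


def best_tag(previous_tag, word, candidates):
    """Pick the tag maximizing P(tag | previous_tag) * P(word | tag)."""
    scores = {
        candidate: TRANSITION[(previous_tag, candidate)] * EMISSION[(candidate, word)]
        for candidate in candidates
    }
    return max(scores, key=scores.get), scores


BEST, SCORES = best_tag("TO", "race", ["VB", "NN"])
print(BEST, SCORES)
```

A full HMM tagger applies the same product along the whole sentence and searches for the best tag sequence rather than deciding one word at a time.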
  • 56.
    Be aware of the possibility of ambiguities One may have to normalize content before sending it to the tagger Pre- and post-transliteration: “Rhi you were da coolest last eve” Rhi/VB you/PRP were/VBD da/VBG coolest/JJ last/JJ eve/NN “Rhi you were the coolest last eve” Rhi/VB you/PRP were/VBD the/DT coolest/JJ last/JJ eve/NN
  • 57.
    Understanding Phrase Structures, Parsing, Chunking Word Sequence Syntactic Parser Parse Tree Semantic Analyzer Literal Meaning Discourse Analyzer Meaning
  • 58.
    Words don’t just occur in some order Words are organized in phrases: groupings of words that chunk together Major phrase types Noun Phrases Prepositional phrases Verb phrases
  • 59.
    Deriving the syntactic structure of a sentence based on a language model (grammar) Natural Language Syntax described by a context free grammar: the Start-Symbol S ≡ sentence; Non-Terminals NT ≡ syntactic constituents; Terminals T ≡ lexical entries/words; Productions P ⊆ NT × (NT ∪ T)+ ≡ grammar rules http://www.cs.umanitoba.ca/~comp4190/2006/NLP-Parsing.ppt
  • 60.
    S ∈ NT, Part-of-Speech ⊂ NT, Constituents ⊂ NT, Words ⊂ T, Rules:
    S → NP VP (statement)
    S → Aux NP VP (question)
    S → VP (command)
    NP → Det Nominal
    NP → Proper-Noun
    Nominal → Noun | Noun Nominal | Nominal PP
    VP → Verb | Verb NP | Verb PP | Verb NP PP
    PP → Prep NP
    Det → that | this | a
    Noun → book | flight | meal | money
  • 61.
    Bottom-up Parsing or data-driven Top-down Parsing or goal-driven S Aux NP VP Det Nominal Verb NP Noun Det Nominal does this flight include a meal
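A goal-driven (top-down) recognizer for the grammar above can be sketched with memoized span checks. The Verb, Aux, Prep and Proper-Noun entries are assumptions added so that the example sentence is covered; the slide lists only Det and Noun words. The sketch answers yes or no and does not build a parse tree.

```python
from functools import lru_cache

RULES = {
    "S": [("NP", "VP"), ("Aux", "NP", "VP"), ("VP",)],
    "NP": [("Det", "Nominal"), ("Proper-Noun",)],
    "Nominal": [("Noun",), ("Noun", "Nominal"), ("Nominal", "PP")],
    "VP": [("Verb",), ("Verb", "NP"), ("Verb", "PP"), ("Verb", "NP", "PP")],
    "PP": [("Prep", "NP")],
}
LEXICON = {
    "Det": {"that", "this", "a"},
    "Noun": {"book", "flight", "meal", "money"},
    # The entries below are assumptions, not taken from the slide.
    "Verb": {"book", "include"},
    "Aux": {"does"},
    "Prep": {"from", "to"},
    "Proper-Noun": {"denver"},
}


def recognizes(sentence, start="S"):
    """Return True if the grammar derives the whole sentence from `start`."""
    words = tuple(sentence.lower().split())

    @lru_cache(maxsize=None)
    def derives(symbols, i, j):
        # Can this sequence of symbols produce exactly words[i:j]?
        if len(symbols) == 1:
            symbol = symbols[0]
            if symbol in LEXICON:
                return j == i + 1 and words[i] in LEXICON[symbol]
            return any(derives(rhs, i, j) for rhs in RULES[symbol])
        rest = symbols[1:]
        # Every symbol must cover at least one word, so split points are bounded.
        return any(
            derives(symbols[:1], i, k) and derives(rest, k, j)
            for k in range(i + 1, j - len(rest) + 1)
        )

    return len(words) > 0 and derives((start,), 0, len(words))


print(recognizes("does this flight include a meal"))
print(recognizes("flight this does"))
```

Because each symbol must cover at least one word, the left-recursive rule Nominal → Nominal PP always recurses on a shorter span and terminates.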
  • 62.
    Natural Language Parsers, Peter Hellwig, Heidelberg Constituency Parse - Nested Phrasal Structures Dependency parse - Role Specific Structures
  • 63.
    Tagging: John/NNP bought/VBD a/DT book/NN ./. Constituency Parse (nested phrasal structure): (ROOT (S (NP (NNP John)) (VP (VBD bought) (NP (DT a) (NN book))) (. .))) Typed dependencies (role specific structure): nsubj(bought-2, John-1) det(book-4, a-3) dobj(bought-2, book-4)
  • 64.
    Grammar checking: sentences that cannot be parsed may have grammatical errors Using results of Dependency parse Word sense disambiguation (dependencies as features or co-occurrence vectors)
  • 65.
    MINIPAR http://www.cs.ualberta.ca/~lindek/minipar.htm LinkGrammar parser: http://www.link.cs.cmu.edu/link/ Standard “CFG” parsers like the Stanford parser http://www-nlp.stanford.edu/software/lex-parser.shtml ENJU’s probabilistic HPSG grammar http://www-tsujii.is.s.u-tokyo.ac.jp/enju/
  • 66.
    Some applications don’t need the complex output of a full parse Chunking / Shallow Parse / Partial Parse Identifying and classifying flat, non-overlapping contiguous units in text Segmenting and tagging Example of chunking a sentence [NP The morning flight] from [NP Denver] [VP has arrived] Chunking algos mention
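The chunking example can be reproduced with a small pattern-based chunker over POS tags. The two chunk patterns (an optional determiner, adjectives and nouns for NP; a run of verbs for VP) and the tags assigned to the example sentence are assumptions for illustration.

```python
import re


def tag_class(tag):
    """Collapse Penn Treebank tags into one-letter classes."""
    if tag == "DT":
        return "D"
    if tag.startswith("JJ"):
        return "J"
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("VB"):
        return "V"
    return "O"


# NP: optional determiner, any adjectives, one or more nouns. VP: run of verbs.
CHUNK_PATTERN = re.compile(r"(?P<NP>D?J*N+)|(?P<VP>V+)")


def chunk(tagged):
    """Return flat, non-overlapping (label, words) chunks from (word, tag) pairs."""
    classes = "".join(tag_class(tag) for _, tag in tagged)
    return [
        (match.lastgroup, [word for word, _ in tagged[match.start():match.end()]])
        for match in CHUNK_PATTERN.finditer(classes)
    ]


TAGGED = [
    ("The", "DT"), ("morning", "NN"), ("flight", "NN"), ("from", "IN"),
    ("Denver", "NNP"), ("has", "VBZ"), ("arrived", "VBN"),
]
CHUNKS = chunk(TAGGED)
print(CHUNKS)
```

Encoding the tag sequence as a string lets one regular expression do the segmenting, since each character position maps back to one token.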
  • 67.
  • 68.
    Entity recognition: people, locations, organizations Studying linguistic patterns (Hearst 92) gave NP gave up NP in NP gave NP NP gave NP to NP
  • 69.
    Stanford and Enju parser demos; analyzing results http://www-tsujii.is.s.u-tokyo.ac.jp/enju/demo.html http://nlp.stanford.edu:8080/parser/ If you want to know how to run it stand alone Talk to one of us or see their very helpful help pages
  • 70.
    COLORLESS GREEN IDEAS SLEEP FURIOUSLY Word Sequence Syntactic Parser Parse Tree Semantic Analyzer Literal Meaning Discourse Analyzer Meaning
  • 71.
    When neither raw linguistic inputs nor any structures derived from them will facilitate the required semantic processing When we need to link linguistic information to non-linguistic real-world knowledge Typical sources of knowledge Meaning of words, grammatical constructs, discourse, topic..
  • 72.
    Typical sources of knowledge Meanings of words Meanings of grammatical constructs Knowledge about structure of discourse Common sense knowledge about topic Knowledge about state of affairs in which discourse is occurring
  • 73.
    Lexical Semantics The meanings of individual words Formal Semantics (Compositional Semantics or Sentential Semantics) How those meanings combine to make meanings for individual sentences or utterances Discourse or Pragmatics How those meanings combine with each other and with other facts about various kinds of context to make meanings for a text or discourse http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
  • 74.
    Lexeme: set of forms taken by a single word: run, runs, ran and running are forms of the same lexeme RUN Lemma: a particular form of a lexeme that is chosen to represent a canonical form: carpet for carpets; sing for sing, sang, sung Lemmatization: Meaning of a word approximated by meaning of its lemma Mapping a morphological variant to its root Derivational and Inflectional Morphology
  • 75.
    Word sense: Meaning of a word (lemma) Varies with context Significance Lexical ambiguity: consequences on tasks like parsing and tagging; implications on results of Machine translation, Text classification etc. Word Sense Disambiguation Selecting the correct sense for a word
  • 76.
    The study of the way words are built up from smaller meaning units. Derivational and Inflectional morphology Forming new words from old words (derivational) Suffixes (inflections) Knowing root helps us understand new words
  • 77.
    Porter Stemming Algorithm http://tartarus.org/~martin/PorterStemmer/ Catvar http://clipdemos.umiacs.umd.edu/catvar/ Lingsoft http://www2.lingsoft.fi/cgi-bin/engtwol?word=cmputers WordNet http://www.shiffman.net/teaching/a2z/wordnet/
  • 78.
    Homonymy Polysemy Synonymy Antonymy Hypernymy Hyponymy Meronymy Why do we care? http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
  • 79.
    Homonymy: share a form, relatively unrelated senses Bank (financial institution, a sloping mound) Polysemy: semantically related Bank as a financial institution, as a blood bank Verbs tend more to polysemy
  • 80.
    Different words/lemmas that have the same sense (synonymy) Couch/sofa One sense more specific than the other (hyponymy) Car is a hyponym of vehicle One sense more general than the other (hypernymy) Vehicle is a hypernym of car
  • 81.
    Meronymy Engine is part of car; engine is a meronym of car Holonymy Car is a holonym of engine
  • 82.
    Semantic fields Cohesive chunks of knowledge Air travel: Flight, travel, reservation, ticket, departure…
  • 83.
    Models these sense relations A hierarchically organized lexical database On-line thesaurus + aspects of a dictionary Versions for other languages are under development http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
  • 84.
  • 85.
    Verbs and Nouns in separate hierarchies http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
  • 86.
    The set of near-synonyms for a WordNet sense is called a synset (synonym set) Their version of a sense or a concept Duck as a verb to mean to move (the head or body) quickly downwards or away: dip, douse, hedge, fudge, evade, put off, circumvent, parry, elude, skirt, dodge, duck, sidestep
  • 87.
    IR and QnA Indexing using similar (synonymous) words/query or specific to general words (hyponymy / hypernymy) improves text retrieval Machine translation, QnA Need to know if two words are similar to know if we can substitute one for another
  • 88.
    Most well developed: synonymy or similarity Synonymy - a binary relationship between words, or rather between their senses Approaches Thesaurus based: measuring word/sense similarity in a thesaurus Distributional methods: finding other words with similar distributions in a corpus
  • 89.
    Thesaurus based Path based similarity – two words are similar if they are close to each other in the thesaurus hierarchy http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
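Path based similarity can be sketched over a toy hypernym hierarchy. The hierarchy is made up for illustration, and the score 1 / (1 + path length) is one common choice among several; a real system would walk WordNet's hierarchy instead.

```python
# Toy hypernym hierarchy: child -> parent (assumed for illustration).
HYPERNYM = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "artifact",
    "fork": "artifact",
    "artifact": "entity",
    "dog": "animal",
    "animal": "entity",
}


def ancestors(word):
    """Return the word followed by its chain of hypernyms up to the root."""
    chain = [word]
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def path_length(a, b):
    """Number of edges on the path from a to b through their lowest common ancestor."""
    depth_in_b = {node: steps for steps, node in enumerate(ancestors(b))}
    for steps, node in enumerate(ancestors(a)):
        if node in depth_in_b:
            return steps + depth_in_b[node]
    raise ValueError("no common ancestor for %r and %r" % (a, b))


def path_similarity(a, b):
    """Shorter paths give higher similarity; identical words score 1.0."""
    return 1.0 / (1 + path_length(a, b))


print(path_similarity("car", "truck"))
print(path_similarity("car", "dog"))
```

Here car and truck meet at "vehicle" after two edges, while car and dog only meet at the root, so the first pair scores higher.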
  • 90.
    We don’t have a thesaurus for every language. Even if we do, many words are missing WordNet: Strong for nouns, but lacking for adjectives and even verbs Expensive to build They rely on hyponym info for similarity: car is a hyponym of vehicle Alternative - Distributional methods for word similarity http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
  • 91.
    Firth (1957): “You shall know a word by the company it keeps!” Similar words appear in similar contexts - Nida example noted by Lin: A bottle of tezgüino is on the table. Everybody likes tezgüino. Tezgüino makes you drunk. We make tezgüino out of corn. Partial material from http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
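The tezgüino example can be turned into a small distributional comparison: build a vector of neighbouring words for each target and compare vectors by cosine. The first four sentences come from the slide; the sentences about "wine" and "fork", the window size and the function names are assumptions added for the comparison.

```python
import math
from collections import Counter

CORPUS = [
    "a bottle of tezgüino is on the table",
    "everybody likes tezgüino",
    "tezgüino makes you drunk",
    "we make tezgüino out of corn",
    # Assumed comparison sentences, not from the slide.
    "a bottle of wine is on the table",
    "everybody likes wine",
    "wine makes you drunk",
    "a fork is on the table",
    "we eat with a fork",
]


def context_vector(target, sentences, window=2):
    """Count the words appearing within `window` positions of the target."""
    vector = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                vector.update(left + right)
    return vector


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


TEZGUINO = context_vector("tezgüino", CORPUS)
SIM_WINE = cosine(TEZGUINO, context_vector("wine", CORPUS))
SIM_FORK = cosine(TEZGUINO, context_vector("fork", CORPUS))
print(round(SIM_WINE, 2), round(SIM_FORK, 2))
```

No thesaurus entry for tezgüino is needed: its contexts alone place it closer to "wine" than to "fork".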
  • 92.
  • 93.
    So you want to build your own text miner!
  • 94.
    Infrastructure intensive Luckily, plenty of open source tools, frameworks, resources.. http://www-nlp.stanford.edu/links/statnlp.html http://www.cedar.buffalo.edu/~rohini/CSE718/References2.html
  • 95.
    Mining opinions from casual text Data – user comments on artist pages from MySpace “Your musics the shit,…lovve your video you are so bad” “Your music is wicked!!!!” Goal Popularity lists generated from listeners’ comments to complement radio plays/CD sales lists
  • 96.
    “Your musics the shit,…lovve your video you are so bad” Pre-processing: strip html, normalizing text from different sources.. Tokenization: splitting text into tokens: word tokens, number tokens, domain specific requirements Sentence splitting: ! . ? … ; harder in casual text Normalizing words: stop word removal, lemmatization, stemming, transliterations (da == the)
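These pre-processing steps can be chained in a few lines. The slang table, the stop word list and the function name are hypothetical; only the mapping da == the and the sample comment come from the slides.

```python
import re

# Hypothetical transliteration table and stop word list.
SLANG = {"da": "the", "lovve": "love", "u": "you"}
STOP_WORDS = {"the", "a", "is", "are", "so", "your", "you"}


def preprocess(comment, remove_stop_words=True):
    """Strip HTML, tokenize, transliterate slang and optionally drop stop words."""
    text = re.sub(r"<[^>]+>", " ", comment)            # strip HTML tags
    tokens = re.findall(r"[a-z0-9']+", text.lower())   # word and number tokens
    tokens = [SLANG.get(token, token) for token in tokens]
    if remove_stop_words:
        tokens = [token for token in tokens if token not in STOP_WORDS]
    return tokens


TOKENS = preprocess("<b>Your musics the shit</b>,...lovve your video you are so bad")
print(TOKENS)
```

Stop word removal is left optional because intensifiers such as "so" can matter for sentiment, as in "you are so bad".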
  • 97.
    ‘The smile is so wicked!!’ Syntax: Marking sentiment expression from syntax or a dictionary The/DT smile/NN is/VBZ so/RB wicked/JJ !/. !/. Semantics: Surrounding context On Lily Allen’s MySpace page. Cues for Co-ref resolution Smile is a track by Lily Allen. Ambiguity Background knowledge / resources Using urbandictionary.com for semantic orientation of ‘wicked’
  • 98.
    GATE - General Architecture for Text Engineering, since 1995 at the University of Sheffield, UK UIMA - Unstructured Information Management Architecture, IBM Document processing tools, Components: syntactic tools, NLP tools, integrating framework
  • 99.
    The GATE (General Architecture for Text Engineering) System: http://gate.ac.uk http://sourceforge.net/projects/gate User’s Guide: http://gate.ac.uk/sale/tao/ IBM’s UIMA (Unstructured Information Management Architecture): http://www.research.ibm.com/UIMA/ http://sourceforge.net/projects/uima-framework/ Other Resources WordNet: http://wordnet.princeton.edu/ MuNPEx: http://www.ipd.uka.de/~durm/tm/munpex/
  • 100.
    TO COME: USAGE EXAMPLES OF WHAT WE COVERED THUS FAR
  • 101.
    SAMPLE APPLICATIONS, SURVEY OF EFFORTS IN TWO SAMPLE AREAS
  • 102.
    This MEK dependency was observed in BRAF mutant cells regardless of tissue lineage, and correlated with both downregulation of cyclin D1 protein expression and the induction of G1 arrest. *MEK dependency ISA Dependency_on_an_Organic_chemical *BRAF mutant cells ISA Cell_type *downregulation of cyclin D1 protein expression ISA Biological_process *tissue lineage ISA Biological_concept *induction of G1 arrest ISA Biological_process Information Extraction = segmentation+classification+association+mining Text mining = entity identification+named relationship extraction+discovering association chains…. Segmentation Classification Named Relationship Extraction MEK dependency observed in BRAF mutant cells downregulation of cyclin D1 protein expression correlated with induction of G1 arrest correlated with
  • 103.
    MEK dependency observed in BRAF mutant cells downregulation of cyclin D1 protein expression correlated with induction of G1 arrest correlated with
  • 104.
    The task of classifying token sequences in text into one or more predefined classes Approaches Look up a list Sliding window Use rules Machine learning Compound entities Applied to Wikipedia-like text Biomedical text
  • 105.
    The simplest approach Proper nouns make up majority of named entities Look up a gazetteer CIA fact book for organizations, country names etc. Poor recall: coverage problems
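A gazetteer lookup can be sketched as longest-match search over token n-grams. The gazetteer entries, the example sentence and the function name are made up; the last name in the sentence is deliberately missing from the list to show the coverage problem.

```python
# Hypothetical gazetteer: lower-cased token tuples mapped to an entity class.
GAZETTEER = {
    ("united", "states"): "Country",
    ("india",): "Country",
    ("world", "health", "organization"): "Organization",
}
PUNCTUATION = ".,;"


def lookup_entities(text):
    """Longest-match lookup of token n-grams against the gazetteer."""
    tokens = text.split()
    lowered = [token.lower().strip(PUNCTUATION) for token in tokens]
    longest = max(len(key) for key in GAZETTEER)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in GAZETTEER:
                surface = " ".join(tokens[i:i + n]).strip(PUNCTUATION)
                found.append((surface, GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found


ENTITIES = lookup_entities(
    "The World Health Organization opened an office in India, not in Freedonia."
)
print(ENTITIES)
```

"Freedonia" goes unrecognised because it is not in the list, which is exactly the poor recall the slide warns about.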
  • 106.
    Rule based [Mikheev et al. 1999] Frequency based "China International Trust and Investment Corp" "Suspended Ceiling Contractors Ltd" "Hughes" when "Hughes Communications Ltd." is already marked as an organization Scalability issues: Expensive to create manually Leverages domain specific information – domain specific Tend to be corpus-specific – due to manual process
  • 107.
    Machine learning approaches: ability to generalize better than rules; can capture complex patterns; require training data, which is often the bottleneck. Techniques [list taken from Agichtein2007]: Naive Bayes; SRV [Freitag 1998], Inductive Logic Programming; Rapier [Califf and Mooney 1997]; Hidden Markov Models [Leek 1997]; Maximum Entropy Markov Models [McCallum et al. 2000]; Conditional Random Fields [Lafferty et al. 2001]
  • 108.
    Orthographical Features CD28 a protein Context Features Window of words Fixed Variable Part-of-speech features Current word Adjacent words – within fixed window Word shape features Kappa-B replaced with Aaaaa-A Dictionary features Inexact matches Prefixes and Suffixes “ ~ase” = protein
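A few of the feature families listed above can be sketched as a per-token feature extractor. The feature names, the window size, and the `<pad>` marker are illustrative assumptions, not a fixed standard.

```python
import re


def word_shape(token):
    """Map characters to classes: uppercase -> A, lowercase -> a, digit -> 0."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    return shape


def token_features(tokens, i, window=1):
    """Orthographic, affix and fixed-window context features for tokens[i]."""
    tok = tokens[i]
    feats = {
        "word": tok.lower(),
        "shape": word_shape(tok),
        "has_digit": any(ch.isdigit() for ch in tok),
        "is_capitalized": tok[:1].isupper(),
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),  # e.g. "ase" hints at a protein
    }
    # Context features: neighbouring words within a fixed window.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<pad>"
    return feats


feats = token_features(["CD28", "is", "a", "kinase"], 3)
```

Feature dictionaries of this kind are what a sequence labeler such as a CRF toolkit typically consumes, one dictionary per token.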
  • 109.
    HMMs, a powerful tool for representing sequential data, are probabilistic finite state models with parameters for state-transition probabilities and state-specific observation probabilities. The observation probabilities are typically represented as a multinomial distribution over a discrete, finite vocabulary of words. Training is used to learn parameters that maximize the probability of the observation sequences in the training data. Generative: find parameters to maximize P(X,Y). When labeling X_i, future observations are taken into account (forward-backward). Problems: feature overlap in NER. E.g. to extract previously unseen company names from a newswire article, the identity of a word alone is not very predictive; knowing that the word is capitalized, that it is a noun, that it is used in an appositive, and that it appears near the top of the article would all be quite predictive. We would like the observations to be parameterized with these overlapping features. Feature independence assumption: several features about the same word can affect parameters
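How forward-backward lets future observations influence the label at each position can be sketched on a toy HMM. All states, words and probabilities below are made-up illustrative values, not parameters from any trained model.

```python
def forward_backward(obs, states, start_p, trans_p, emit_p):
    """Posterior P(state at t | whole observation sequence) for a discrete HMM."""
    n = len(obs)
    # Forward pass: joint probability of the prefix and the current state.
    fwd = [{} for _ in range(n)]
    for s in states:
        fwd[0][s] = start_p[s] * emit_p[s][obs[0]]
    for t in range(1, n):
        for s in states:
            fwd[t][s] = emit_p[s][obs[t]] * sum(
                fwd[t - 1][r] * trans_p[r][s] for r in states
            )
    # Backward pass: probability of the suffix given the current state.
    bwd = [{} for _ in range(n)]
    for s in states:
        bwd[n - 1][s] = 1.0
    for t in range(n - 2, -1, -1):
        for s in states:
            bwd[t][s] = sum(
                trans_p[s][r] * emit_p[r][obs[t + 1]] * bwd[t + 1][r]
                for r in states
            )
    evidence = sum(fwd[n - 1][s] for s in states)
    return [
        {s: fwd[t][s] * bwd[t][s] / evidence for s in states} for t in range(n)
    ]


# Hypothetical two-state model: inside an organization name (ORG) or not (O).
STATES = ["ORG", "O"]
START = {"ORG": 0.3, "O": 0.7}
TRANS = {"ORG": {"ORG": 0.6, "O": 0.4}, "O": {"ORG": 0.2, "O": 0.8}}
EMIT = {
    "ORG": {"Apple": 0.5, "Inc": 0.4, "said": 0.1},
    "O": {"Apple": 0.2, "Inc": 0.05, "said": 0.75},
}

posteriors = forward_backward(["Apple", "Inc", "said"], STATES, START, TRANS, EMIT)
# Using only the first word, P(ORG) would be 0.15 / 0.29; seeing "Inc"
# afterwards raises the posterior for "Apple" being part of an ORG.
```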
  • 110.
    MEMMs [McCallum et al., 2000] Discriminative: find parameters to maximize P(Y|X). No longer assume that features are independent: f<Is-capitalized,Company>(“Apple”, Company) = 1. Do not take future observations into account (no forward-backward). Problems: label bias problem
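The binary feature on this slide pairs a predicate on the observation with a target label. A minimal sketch follows, assuming a single previous state and made-up weights; the helper names are hypothetical, and a full MEMM would keep one such exponential model per previous state.

```python
import math


def make_feature(predicate, target_label):
    """Build f<predicate, label>(observation, label): 1 if both hold, else 0."""
    def feature(observation, label):
        return 1 if predicate(observation) and label == target_label else 0
    return feature


def is_capitalized(word):
    return word[:1].isupper()


def ends_with_inc(word):
    return word.endswith("Inc")


f_cap_company = make_feature(is_capitalized, "Company")
f_inc_company = make_feature(ends_with_inc, "Company")


def next_label_distribution(observation, labels, weighted_features):
    """Maximum-entropy distribution over labels for one previous state.

    Overlapping features simply add their weights in the exponent,
    so no independence assumption between them is needed.
    """
    scores = {
        label: math.exp(sum(w * f(observation, label) for f, w in weighted_features))
        for label in labels
    }
    z = sum(scores.values())
    return {label: s / z for label, s in scores.items()}


# Hypothetical weights; in practice they are learned from training data.
WEIGHTED = [(f_cap_company, 1.5), (f_inc_company, 2.0)]
dist = next_label_distribution("Apple", ["Company", "Other"], WEIGHTED)
```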
  • 111.
    CRFs [Lafferty et al., 2001] Discriminative. Doesn’t assume that features are independent. When labeling Y_i, future observations are taken into account. Global optimization: label bias prevented. The best of both worlds!
  • 112.
    Example: [ORG U.S. ] general [PER David Petraeus ] heads for [LOC Baghdad ] .
    Token     POS  Chunk  Tag
    -----------------------------
    U.S.      NNP  I-NP   I-ORG
    general   NN   I-NP   O
    David     NNP  I-NP   B-PER
    Petraeus  NNP  I-NP   I-PER
    heads     VBZ  I-VP   O
    for       IN   I-PP   O
    Baghdad   NNP  I-NP   I-LOC
    .         .    O      O
    CoNLL format – Mallet. The major bottleneck is training data
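Reading the four-column rows above back into entity spans can be sketched as follows. The grouping rule (a `B-` tag, or an `I-` tag whose type differs from the running entity, starts a new entity) is an assumption chosen to fit the example rows, and `extract_entities` is a hypothetical helper, not part of Mallet.

```python
SAMPLE = """\
U.S. NNP I-NP I-ORG
general NN I-NP O
David NNP I-NP B-PER
Petraeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
"""


def extract_entities(rows):
    """Group consecutive tagged tokens into (type, text) entities."""
    entities = []
    current_tokens, current_type = [], None
    for token, _pos, _chunk, tag in rows:
        prefix, _, etype = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and etype != current_type)
        if prefix == "O" or starts_new:
            # Close the entity that was being built, if any.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
        if prefix in ("B", "I"):
            current_tokens.append(token)
            current_type = etype
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


rows = [line.split() for line in SAMPLE.splitlines() if line.strip()]
entities = extract_entities(rows)
```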
  • 113.
    Context Induction approach [Talukdar2006] Starting with a few seed entities, it is possible to induce high-precision context patterns by exploiting entity context redundancy. New entity instances of the same category can be extracted from unlabeled data with the induced patterns to create high-precision extensions of the seed lists. Features derived from token membership in the extended lists improve the accuracy of learned named-entity taggers. Pruned extraction patterns. Feature generation for CRF
  • 114.
    Machine Learning Best performance Problem Training data bottleneck Pattern induction Reduce training data creation time
  • 115.
    Knowledge Engineering approach: manually crafted rules over lexical items (<person> works for <organization>) and over syntactic structures (parse trees); GATE. Machine learning approaches: supervised, semi-supervised, unsupervised
  • 116.
    Supervised: BioText, extraction of relationships between diseases and their treatments [Rosario et al. 2004]. Rule-based supervised approach [Rinaldi et al. 2004]: semantics of specific relationships encoded as rules. Identify a set of relations along with their morphological variants (bind, regulate, signal etc.) subj(bind,X,_,_),pobj(bind,Y,to,_) prep(Y,to,_,_) => bind(X,Y). Axiom formulation was, however, a manual process involving a domain expert.
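The bind rule above can be read as a conjunction of dependency facts that licenses a bind(X,Y) relation. The sketch below assumes a hypothetical tuple encoding of those facts, (relation, first argument, second argument, preposition), which is only one possible reading of the argument positions on the slide; the entity names are placeholders.

```python
def apply_bind_rule(facts, verb="bind"):
    """Fire bind(X, Y) when subj(verb, X), pobj(verb, Y, to) and prep(Y, to) all hold."""
    subjects = [f[2] for f in facts if f[0] == "subj" and f[1] == verb]
    objects = [
        f[2] for f in facts if f[0] == "pobj" and f[1] == verb and f[3] == "to"
    ]
    attached_with_to = {f[1] for f in facts if f[0] == "prep" and f[2] == "to"}
    return [
        (verb, x, y)
        for x in subjects
        for y in objects
        if y in attached_with_to
    ]


# Hypothetical facts for a sentence such as "ProteinA binds to ProteinB".
FACTS = [
    ("subj", "bind", "ProteinA", None),
    ("pobj", "bind", "ProteinB", "to"),
    ("prep", "ProteinB", "to", None),
]

relations = apply_bind_rule(FACTS)
```

Each new relation type needs its own hand-written rule of this kind, which is where the manual effort mentioned on the slide comes from.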
  • 117.
    Hand-coded domain-specific rules that encode patterns used to extract: molecular pathways [Friedman et al. 2001], protein interaction [Saric et al. 2006]. All of the above are in the biomedical domain. Notice the specificity of relationship types and the amount of effort required. Also notice the types of entities involved in the relationships
  • 118.
  • 119.
    Semantic Role Labeling Features. A detailed tutorial on SRL by Yih & Toutanova is available
  • 120.
    Other approaches: discovering concept-specific relations, Dmitry Davidov et al. 2007; preemptive IE approach, Rosenfeld & Feldman 2007; Open Information Extraction, Banko et al. 2007 (self-supervised approach, uses dependency parses to train extractors); on-demand information extraction, Sekine 2006 (IR driven, pattern discovery, paraphrase)
  • 121.
    Rule and heuristic based method: YAGO, Suchanek et al., 2007. Pattern-based approach, uses WordNet. Subtree mining over dependency parse trees: Nguyen et al., 2007
  • 122.
    Entities (MeSH terms) in sentences occur in modified forms: “adenomatous” modifies “hyperplasia”; “An excessive endogenous or exogenous stimulation” modifies “estrogen”. Entities can also occur as composites of 2 or more other entities: “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”
  • 123.
    Relationship head, Subject head, Object head, Object head. A small set of rules over dependency types dealing with modifiers (amod, nn, etc.) and with subjects and objects (nsubj, nsubjpass, etc.). Since dependency types are arranged in a hierarchy, we use this hierarchy to generalize the more specific rules. There are only 4 rules in our current implementation. Carroll, J., G. Minnen and E. Briscoe (1999) `Corpus annotation for parser evaluation'. In Proceedings of the EACL-99 Post-Conference Workshop on Linguistically Interpreted Corpora, Bergen, Norway. 35-41. Also in Proceedings of the ATALA Workshop on Corpus Annotés pour la Syntaxe - Treebanks, Paris, France. 13-20.
  • 124.
    Modifiers, Modified entities, Composite Entities
  • 125.
    Manual Evaluation: test if the RDF conveys the same “meaning” as the sentence. Juxtapose the triple with the sentence. Allow the user to assess correctness/incorrectness of the subject, object and triple
  • 126.
  • 127.
  • 128.
    Discovering informative subgraphs (Harry Potter). Given a pair of end-points (entities), produce a subgraph with relationships connecting them such that the subgraph is small enough to be visualized and contains relevant “interesting” connections. We defined an interestingness measure based on the ontology schema. In the future, in the biomedical domain, the scientist will control this with the help of a browsable ontology. Our interestingness measure takes into account the specificity of the relationships and entity classes involved, the rarity of relationships, etc. Cartic Ramakrishnan, William H. Milnor, Matthew Perry, Amit P. Sheth: Discovering informative connection subgraphs in multi-relational graphs. SIGKDD Explorations 7 (2): 56-63 (2005)
  • 129.
    Two factors influencing interestingness
  • 130.
    Bidirectional lock-step growth from S and T. Choice of next node based on interestingness measure. Stop when there are enough connections between the frontiers. This is treated as the candidate graph
  • 131.
    Model the candidate graph as an electrical circuit: S is the source and T the sink. Edge weights derived from the ontology schema are treated as conductance values. Using Ohm’s and Kirchhoff’s laws we find maximum current flow paths through the candidate graph from S to T. At each step we add this path to the output graph to be displayed, and we repeat this process until a certain predefined number of nodes is reached. Results: Arnold Schwarzenegger, Edward Kennedy. Other related work: Semantic Associations
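The circuit analogy can be sketched on a toy candidate graph. The conductance values are made up, and both the simple relaxation used to solve for node voltages and the greedy path selection are assumptions for illustration; they are not taken from the cited paper.

```python
def node_voltages(conductance, source, sink, iterations=500):
    """Solve Kirchhoff's current law by relaxation with V(source)=1, V(sink)=0.

    conductance is a symmetric adjacency dict: conductance[u][v] == conductance[v][u].
    """
    voltage = {node: 0.0 for node in conductance}
    voltage[source] = 1.0
    for _ in range(iterations):
        for node in conductance:
            if node in (source, sink):
                continue
            total = sum(conductance[node].values())
            # Zero net current at an interior node means its voltage is the
            # conductance-weighted average of its neighbours' voltages.
            voltage[node] = sum(
                c * voltage[nbr] for nbr, c in conductance[node].items()
            ) / total
    return voltage


def edge_current(conductance, voltage, u, v):
    """Ohm's law: current from u to v."""
    return conductance[u][v] * (voltage[u] - voltage[v])


def greedy_max_current_path(conductance, voltage, source, sink):
    """Follow the edge carrying the most current at each step (a simplification)."""
    path = [source]
    while path[-1] != sink:
        u = path[-1]
        nxt = max(conductance[u], key=lambda v: edge_current(conductance, voltage, u, v))
        if edge_current(conductance, voltage, u, nxt) <= 0:
            break  # no edge leads further towards the sink
        path.append(nxt)
    return path


# Hypothetical candidate graph: the route through A has higher conductance.
GRAPH = {
    "S": {"A": 2.0, "B": 1.0},
    "A": {"S": 2.0, "T": 2.0},
    "B": {"S": 1.0, "T": 1.0},
    "T": {"A": 2.0, "B": 1.0},
}

volts = node_voltages(GRAPH, "S", "T")
best_path = greedy_max_current_path(GRAPH, volts, "S", "T")
```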
  • 132.
  • 133.
  • 134.
    Text Mining, Analysis → understanding → utilization in decision making → knowledge discovery. Entity Identification → focus change from simple to compound. Relationship extraction → implicit vs. explicit. Need more unsupervised approaches. Need to think of incentives to evaluate
  • 135.
    Existing corpora GENIA, BioInfer many others Narrow focus Precision and Recall Utility How useful is the extracted information? How do we measure utility? Swanson’s discovery, Enrichment of Browsing experience Text types and mining Systematically compensating for (in)formality
  • 136.
    http://www.cs.famaf.unc.edu.ar/~laura/text_mining/ http://www.stanford.edu/class/cs276/cs276-2005-syllabus.html http://www-nlp.stanford.edu/links/statnlp.html http://www.cedar.buffalo.edu/~rohini/CSE718/References2.html
  • 137.
    Emilia Stoica, Marti Hearst, and Megan Richardson. Automating Creation of Hierarchical Faceted Metadata Structures. In Proceedings of NAACL-HLT, Rochester NY, April 2007.
    Marti Hearst, Jennifer English, Rashmi Sinha, Kirsten Swearingen, and Ping Yee. Finding the Flow in Web Site Search. Communications of the ACM, 45 (9), September 2002, pp. 42-49.
    R. Kumar, U. Mahadevan, and D. Sivakumar. "A graph-theoretic approach to extract storylines from search results", in Proc. KDD, 2004, pp. 216-225.
    Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. IJCAI 2007: 2670-2676.
    Hearst, M. A. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2 (Nantes, France, August 23 - 28, 1992).
    Dat P. T. Nguyen, Yutaka Matsuo, Mitsuru Ishizuka: Relation Extraction from Wikipedia Using Subtree Mining. AAAI 2007: 1414-1420.
    Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang and Amit P. Sheth. "Unsupervised Discovery of Compound Entities for Relationship Extraction". EKAW 2008 - 16th International Conference on Knowledge Engineering and Knowledge Management Knowledge Patterns.
    Mikheev, A., Moens, M., and Grover, C. 1999. Named Entity recognition without gazetteers. In Proceedings of the Ninth Conference on European Chapter of the Association for Computational Linguistics (Bergen, Norway, June 08 - 12, 1999).
    McCallum, A., Freitag, D., and Pereira, F. C. 2000. Maximum Entropy Markov Models for Information Extraction and Segmentation. In Proceedings of the Seventeenth International Conference on Machine Learning.
    Lafferty, J. D., McCallum, A., and Pereira, F. C. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning.
  • 138.
    Rosario, B. and Hearst, M. A. Classifying semantic relations in bioscience texts. In Proceedings of the 42nd ACL. 2004, Association for Computational Linguistics: Barcelona, Spain.
    Hearst, M. A. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING ’92, pages 539–545.
    Hearst, M. "Untangling text data mining," 1999. [Online]. Available: http://citeseer.ist.psu.edu/563035.html
    Friedman, C., et al., GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 2001. 17 Suppl 1: p. 1367-4803.
    Saric, J., et al., Extraction of regulatory gene/protein networks from Medline. Bioinformatics, 2005.
    Ciaramita, M., et al., Unsupervised Learning of Semantic Relations between Concepts of a Molecular Biology Ontology, in 19th IJCAI. 2005.
    Dmitry Davidov, Ari Rappoport, Moshe Koppel. Fully Unsupervised Discovery of Concept-Specific Relationships by Web Mining. Proceedings, ACL 2007, June 2007, Prague.
    Rosenfeld, B. and Feldman, R. 2007. Clustering for unsupervised relation identification. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (Lisbon, Portugal, November 06 - 10, 2007).
    Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. IJCAI 2007: 2670-2676.
    Sekine, S. 2006. On-demand information extraction. In Proceedings of the COLING/ACL on Main Conference Poster Sessions (Sydney, Australia, July 17 - 18, 2006). Annual Meeting of the ACL. Association for Computational Linguistics, Morristown, NJ, 731-738.
    Suchanek, F. M., Kasneci, G., and Weikum, G. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web (Banff, Alberta, Canada, May 08 - 12, 2007). WWW '07.

Editor's Notes

  • #5 One of the primary goals of AI has been natural language understanding. Understanding language requires not only lexical and grammatical information but also semantic, pragmatic and general world knowledge. It is a complex task that involves many levels of processing and a variety of subtasks. Typical components of a language understanding system: understanding language deals with (what is said) the syntax and structure of language + (what the thing being said says/asks/informs of the world) understanding (semantics, pragmatics, discourse)
  • #6 In the 1970s, AI systems were developed that demonstrated interesting aspects of language understanding using hand-coded symbolic grammars and knowledge bases. Although there was some corpus-based language learning in the 1950s, following Shannon’s information theory etc., Chomsky’s argument that language is more an innate property than a learned one was instrumental in redefining the goals of linguistics in the 1950s. Emphasis on symbolic grammars and representing innate linguistic knowledge (universal grammar)
  • #7 Developing such systems, however, was very labor intensive and required extensive knowledge engineering; the systems typically ran on toy examples and were rather brittle. Partially in response to these issues there was a paradigm shift in NL understanding. Approaches moved from rationalist methods based on hand-coded rules derived through introspection to empirical or corpus-based methods. Development is more data driven and at least partially automated through the use of statistical or machine learning methods.
  • #10 When we are analyzing text for semantic components, we are doing one of two things. The first is finding more about what we already know; often termed the "finding a needle in a haystack" paradigm, this search/browse method contrasts with the other goal of text analysis: finding what we do not know, or discovering undiscovered knowledge
  • #14 Assertional, simple, easier to parse and understand. Biomedical literature, however, contains text that describes complex scientific investigations which do not always contain explicit factual assertions. Instead, there is often a series of arguments, opinions and experiments supported by evidence that collectively corroborate or refute a hypothesis that may not be explicitly stated in a simple sentence. Sentences tend to be rather long and convoluted. Furthermore, domain-specific terms, abbreviations, number ranges and symbols often make sentences hard for the human reader to parse, further complicating automated information extraction. These factors make the task of mining biomedical text substantially more complex than Wikipedia-like text. Casual: the goal is largely interactive as opposed to informative; grammatical errors, misspellings and entity variations are not uncommon
  • #19 Using Text to Form Hypotheses about Disease. For more than a decade, Don Swanson has eloquently argued why it is plausible to expect new information to be derivable from text collections: experts can only read a small subset of what is published in their fields and are often unaware of developments in related fields. Thus it should be possible to find useful linkages between information in related literatures, if the authors of those literatures rarely refer to one another's work. Swanson has shown how chains of causal implication within the medical literature can lead to hypotheses for causes of rare diseases, some of which have received supporting experimental evidence [Swanson1987, Swanson1991, Swanson and Smalheiser1994, Swanson and Smalheiser1997]. For example, when investigating causes of migraine headaches, he extracted various pieces of evidence from titles of articles in the biomedical literature. Some of these clues can be paraphrased as follows: stress is associated with migraines; stress can lead to loss of magnesium; calcium channel blockers prevent some migraines; magnesium is a natural calcium channel blocker; spreading cortical depression (SCD) is implicated in some migraines; high levels of magnesium inhibit SCD; migraine patients have high platelet aggregability; magnesium can suppress platelet aggregability. These clues suggest that magnesium deficiency may play a role in some kinds of migraine headache, a hypothesis which did not exist in the literature at the time Swanson found these links. The hypothesis has to be tested via non-textual means, but the important point is that a new, potentially plausible medical hypothesis was derived from a combination of text fragments and the explorer's medical expertise. (According to Swanson (1991), subsequent study found support for the magnesium-migraine hypothesis [Ramadan et al. 1989].) This approach has been only partially automated.
There is, of course, a potential for combinatorial explosion of potentially valid links. Beeferman (1998) has developed a flexible interface and analysis tool for exploring certain kinds of chains of links among lexical relations within WordNet. However, sophisticated new algorithms are needed for helping in the pruning process, since a good pruning algorithm will want to take into account various kinds of semantic constraints. This may be an interesting area of investigation for computational linguists.
  • #21 Beyond search Analytical operations over text to answer complex questions Requiring aggregation of information across a corpus Context specific Domain specific Application specific Text mining, also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. As most information (over 80%) is stored as text, text mining is believed to have a high commercial potential value.
  • #23 The first query is “Flu Epidemic.” In Table 1, we see that the first storyline contains information about flu (identified by terms like ‘vaccines’, ‘strains’), the second contains seasonal news (identified by terms like ‘deaths’, ‘reported’), the third is about bird flu (identified by terms like ‘avian’, ‘bird’), and the fourth is about Spanish flu epidemic from 1918 (identified by terms like ‘spanish’, ‘ 1918’).
  • #24 http://turing.cs.washington.edu/papers/ijcai07.pdf
  • #27 Select well-distributed terms from the collection Eliminate stopwords Retain only those terms with a distribution higher than a threshold (default: top 10%) Build a “backbone” Create paths from unambiguous terms only Bias the structure towards appropriate senses of words Get hypernym path if term: - has only one sense, or - matches a pre-selected WordNet domain Adding a new term increases a count at each node on its path by # of docs with the term.
  • #36 We will cover some fundamentals that are a core part of most TM systems.
  • #43 Tabulate Task, Tools/Methods, Resource, Frameworks. E.g. NEI, syntactic parse based, POS taggers, UIMA, GATE; NEI, lexicon based, lexicons available for download, UIMA, GATE
  • #48 Syntactic analysis involves determining the grammatical structure of a sentence. One subtask is POS tagging or
  • #54 Local word characteristics *s not always plural nouns (he works) (his works)
  • #59 Headed by nouns and provide information about the noun in the sentence. Headed by prepositions, contain noun phrases and express spatial, temporal and other attributes. Organizes all elements of a sentence that syntactically depend on the verb
  • #61 a grammar describes which of the possible sequences of symbols (strings) in a language constitute valid words or statements in that language, but it does not describe their semantics (i.e. what they mean).
  • #72 Typical sources of knowledge Meanings of words Meanings of grammatical constructs Knowledge about structure of discourse Common sense knowledge about topic Knowledge about state of affairs in which discourse is occurring
  • #75 Forming new words from old words (derivational) Suffixes (inflections)
  • #81 Synonym test for words/lemmas - Propositional meaning When one word can be substituted for another without changing meaning of sentence Car and automobile substitutable but not identical in meaning
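  • #90 $\mathrm{pathlen}(c_1,c_2)$ = number of edges in the shortest path in the thesaurus graph between the sense nodes $c_1$ and $c_2$; $\mathrm{sim}_{\mathrm{path}}(c_1,c_2) = -\log \mathrm{pathlen}(c_1,c_2)$; $\mathrm{wordsim}(w_1,w_2) = \max_{c_1 \in \mathrm{senses}(w_1),\, c_2 \in \mathrm{senses}(w_2)} \mathrm{sim}(c_1,c_2)$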
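These three definitions can be sketched with a breadth-first search over a tiny thesaurus graph. The graph, the sense inventory and the sense names are hypothetical toy data, not WordNet, and returning infinity for identical senses (path length 0) is an assumption made to keep the logarithm defined.

```python
import math
from collections import deque

# Hypothetical undirected thesaurus graph over sense nodes.
EDGES = [
    ("nickel.coin", "coin"), ("dime.coin", "coin"), ("coin", "money"),
    ("money", "entity"), ("nickel.metal", "metal"), ("metal", "entity"),
]
SENSES = {"nickel": ["nickel.coin", "nickel.metal"], "dime": ["dime.coin"]}

GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)


def pathlen(c1, c2):
    """Number of edges on the shortest path between two sense nodes."""
    seen, queue = {c1}, deque([(c1, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nbr in GRAPH.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return math.inf  # the two senses are not connected


def sim_path(c1, c2):
    length = pathlen(c1, c2)
    if length == 0:
        return math.inf  # identical senses; -log(0) treated as +infinity
    return -math.log(length)


def word_sim(w1, w2):
    """Maximum sense-level similarity over all sense pairs of the two words."""
    return max(sim_path(c1, c2) for c1 in SENSES[w1] for c2 in SENSES[w2])


score = word_sim("nickel", "dime")
```

The coin sense of "nickel" is two edges from "dime" while the metal sense is five edges away, so the maximum picks the coin sense.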
  • #107 If "Adam Kluver Ltd" had already been recognised as an organisation by the sure-fire rule, in this second step any occurrences of "Kluver Ltd", "Adam Ltd" and "Adam Kluver" are also tagged as possible organizations. This assignment, however, is not definite since some of these words (such as "Adam") could refer to a different entity. This information goes to a pre-trained maximum entropy model (see Mikheev (1998) for more details on this approach).
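Generating the partial-match candidates from an already recognised name can be sketched as below. Treating the variants as order-preserving token subsets of at least two tokens is an assumption that happens to reproduce the three variants named in the note; `partial_name_variants` is a hypothetical helper.

```python
from itertools import combinations


def partial_name_variants(name_tokens, min_len=2):
    """Order-preserving proper subsets of a recognised name's tokens."""
    variants = set()
    for n in range(min_len, len(name_tokens)):
        for combo in combinations(name_tokens, n):
            variants.add(" ".join(combo))
    return variants


variants = partial_name_variants(["Adam", "Kluver", "Ltd"])
# Each variant is only a *possible* organization; a later classifier decides.
```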
  • #108 SRV - Two trends are evident in the recent evolution of the field of information extraction: a preference for simple, often corpus-driven techniques over linguistically sophisticated ones; and a broadening of the central problem definition to include many non-traditional text domains. This development calls for information extraction systems which are as retargetable and general as possible. Here, we describe SRV, a learning architecture for information extraction which is designed for maximum generality and flexibility. SRV can exploit domain-specific information, including linguistic syntax and lexical information, in the form of features provided to the system explicitly as input for training. This process is illustrated using a domain created from Reuters corporate acquisitions articles. Features are derived from two general-purpose NLP systems, Sleator and Temperley's link grammar parser and WordNet. Experiments compare the learner's performance with and without such linguistic information. Surprisingly, in many cases, the system performs as well without this information as with it.
  • #111 The label bias problem can be illustrated with a simple finite-state model designed to distinguish between the two words rib and rob. In the first time step, r matches both transitions from the start state, so the probability mass gets distributed roughly equally among those two transitions. Next we observe i. Both states 1 and 4 have only one outgoing transition. State 1 has seen this observation often in training, state 4 has almost never seen this observation; but like state 1, state 4 has no choice but to pass all its mass to its single outgoing transition, since it is not generating the observation, only conditioning on it. Thus, states with a single outgoing transition effectively ignore their observations. The top path and the bottom path will be about equally likely, independently of the observation sequence. If one of the two words is slightly more common in the training set, the transitions out of the start state will slightly prefer its corresponding transition, and that word’s state sequence will always win.
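The rib/rob example can be reproduced numerically. The sketch below uses made-up compatibility scores for each (state, next state, observation) triple and compares per-state normalization, as in an MEMM, with normalization over whole paths, as in a CRF. The state numbering follows the note (top path 0-1-2-3 for rib, bottom path 0-4-5-3 for rob); everything else is an illustrative assumption.

```python
TRANSITIONS = {0: [1, 4], 1: [2], 4: [5], 2: [3], 5: [3]}

# Hypothetical unnormalized scores: state 1 likes "i", state 4 likes "o".
SCORES = {
    (0, 1, "r"): 1.0, (0, 4, "r"): 1.0,
    (1, 2, "i"): 5.0, (1, 2, "o"): 0.01,
    (4, 5, "o"): 5.0, (4, 5, "i"): 0.01,
    (2, 3, "b"): 1.0, (5, 3, "b"): 1.0,
}
TOP, BOTTOM = [0, 1, 2, 3], [0, 4, 5, 3]


def edge_score(state, nxt, obs):
    return SCORES.get((state, nxt, obs), 0.01)


def locally_normalized(path, observations):
    """MEMM-style: each state normalizes over its own outgoing transitions."""
    prob = 1.0
    for state, nxt, obs in zip(path, path[1:], observations):
        z = sum(edge_score(state, other, obs) for other in TRANSITIONS[state])
        prob *= edge_score(state, nxt, obs) / z
    return prob


def globally_normalized(path, observations, all_paths):
    """CRF-style: normalize the product of scores over complete paths."""
    def path_score(p):
        score = 1.0
        for state, nxt, obs in zip(p, p[1:], observations):
            score *= edge_score(state, nxt, obs)
        return score
    return path_score(path) / sum(path_score(p) for p in all_paths)


rib = ["r", "i", "b"]
memm_top = locally_normalized(TOP, rib)
memm_bottom = locally_normalized(BOTTOM, rib)
crf_top = globally_normalized(TOP, rib, [TOP, BOTTOM])
```

With local normalization, state 4 passes all of its mass to its only successor even though it scores "i" poorly, so both paths end up equally likely; with global normalization the poor score is retained and the top path dominates.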
  • #114 i. Starting with a few seed entities, it is possible to induce high-precision context patterns by exploiting entity context redundancy. ii. New entity instances of the same category can be extracted from unlabeled data with the induced patterns to create high-precision extensions of the seed lists. iii. Features derived from token membership in the extended lists improve the accuracy of learned named-entity taggers.