so it's 9 o'clock on the west coast. so good morning, everyone, and good afternoon to those who are joining us from the east coast. i'm alex bui. i'm the co-director of the bd2k centers coordination center. so we're wrapping up section two of our data science course, which to remind you has been an overview of data representation issues. we've covered a spectrum of different topics, starting with databases and data warehousing challenges, through to issues related to data wrangling and exploration.
and today we're going to hear about natural language processing, or nlp, from dr. noemie elhadad of columbia university. and for any of you who have had to deal with either unstructured clinical reports or other types of data, such as biomedical literature, you can already appreciate many of the issues of trying to get a computer to understand a textual document. nlp has been and continues to be an active area of research, particularly in the biomedical space. so it's my pleasure to introduce dr. elhadad. she is an associate professor of biomedical informatics at columbia.
and as a data scientist, her efforts focus on the development of techniques that support clinicians, patients, and health researchers in their information workflows by automatically extracting and making accessible information from large clinical datasets. two particular areas of interest in her research include computational approaches that infer models of health phenomena and that account for the specific biases seen in large health datasets, and the translation of these models into actionable knowledge and applications. dr. elhadad, we're delighted to have you with us this morning. and so without further ado, let's get started. the floor is yours.
great. thank you, alex. hi everyone. so i'm going to be talking to you about natural language processing, and nlp in health in particular. and the way i think i'm going to structure this lecture is into four topics, and i'm going to spend most of my time on the first two. the first one is to give you an overview of the applications of natural language processing in health and health care. it's a very active field of research for core nlp methods, but for the sake of this audience
here, i think it's interesting to think about all the different ways in which texts and semi-structured texts can be leveraged for advancing biomedical knowledge or for helping clinicians in their daily care of patients. i am then going to spend quite some time talking with you through some examples of why computing language is difficult, and i'll give you examples both in the general and in the health domain. and then i'll move on to some approaches in health natural language processing. there's been a shift in computing approaches, and so it's interesting to give an overview of those.
and i'll conclude with the current developments in both domain-generic nlp and health nlp research. so let me just start by saying something that might be obvious to you by now as human beings, which is that language is ubiquitous. as humans, when we talk to each other, we use language in written and not written form, and the big revolution for us in the computing world is that there are more and more instances of written text available to us. so this is an example of a de-identified clinical note. and the thing to notice here, which is going to become a theme, is that it's english, but
it's not your typical english, right? there's a lot of abbreviations, like hpi, which means history of present illness. and so if i were to read this first sentence, i would say 77-year-old male, with history of hypertension, coronary artery disease, status post coronary artery bypass graft, 1988, meaning that happened in 1988. and so you can already think about the challenges that we have here, right? we are trying to maybe understand or extract some information from these types of texts, and we have to somehow be able to encode or to reason about the fact that this note is about a male patient of that particular age at the time the note was written, and that some
procedures were given to the patient, like the cabg, and we actually don't have timing for these procedures. but there is also information about medications, symptoms, disorders, et cetera, et cetera. another really important source of text for people who do nlp is the scientific literature. and here maybe the language is more structured, but it's still a very specific genre of text. and there are some things that we can leverage. we know that scientific articles have some structure to them: their backgrounds, methods, results, conclusions, et cetera.
but they also have fairly complex sentences from a syntactic standpoint. and so if we want to be able to understand what type of information is being conveyed in these articles, and do so at scale across many articles, then we need to be able to parse these sentences accurately. health news is becoming an important source of knowledge and trends in health, and in other domains as well. i'm sure you know that outside of health, there are all these interesting new things happening with fake news, for example. so how do we detect that something is fake news?
how do we detect that vitamin b12 is indeed correlated with higher cancer risk? you could think of the health news as a way to disseminate information that comes from science, but with a very biased slant. and interestingly, from an nlp standpoint, we see yet a different style of language. and finally, another type of text that is of interest to natural language processing people is all the information that patients and health consumers are exchanging on the social web. so this is an example from a publicly available breast cancer community. and i like this example because it describes to us in a very clear fashion what some of our issues are going to be if we want to parse this type of text.
and so this sounds like a very exciting-- this is a thread. chainsawz is the first person writing, and she's very happy. she says about someone else that she's glad you found reggie, but i hope you don't consume for the two-to-one ned or ned. and it sounds like it's not about cancer. it actually is about breast cancer, but what happens here is a common phenomenon in online communities, where the community is coming up with their own abbreviations and nearly their own register of language. and so in fact "reggie" is regression of tumor, and ned is "no evidence of disease."
and so this is referring to the fact that someone has regression of her tumor, and everyone is very happy. there's a lot of positive sentiment, which can be easily identified automatically. but we also would love to be able to automatically identify that this person we're talking about had a regression in her tumor. so what are some of the applications that have been heavily using nlp? one that is a classic application, and a very robust one by now, and that for interesting reasons is becoming yet again a new question, is automated coding of clinical notes. so the task here is: i have a clinical note, typically a discharge summary.
and on the other hand, i have a very large taxonomy of diagnosis codes, the icd-9 codes, and i would like to be able to automatically say what are the most, say, eight or 10 or 15 likely codes that can help me bill for a visit, or that can help me represent what this note is about. so from a computing standpoint, the difficulty here is twofold. on one hand, we have free text, and the second difficulty is that it's not a simple text categorization task, because it's multilabel classification. that universe of labels is really, really wide. so icd-9 codes, there's about 15,000 of them.
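to make the coding task concrete, here is a minimal sketch. the code-to-phrase table below is a toy i made up for illustration (a real coder would be trained on thousands of labeled discharge summaries, not a hand-written lexicon), but it shows the multilabel shape of the problem: one note in, several candidate codes out.

```python
import re

# toy mapping from icd-9 codes to trigger phrases -- a hand-made stand-in
# for what a trained multilabel classifier would learn from labeled notes
CODE_TRIGGERS = {
    "401.9": ["hypertension", "htn"],
    "414.01": ["coronary artery disease", "cad"],
    "250.00": ["diabetes mellitus", "dm"],
}

def candidate_codes(note, top_k=8):
    """return codes whose trigger phrases appear in the note, most matches first."""
    text = note.lower()
    scores = {}
    for code, phrases in CODE_TRIGGERS.items():
        hits = sum(len(re.findall(r"\b" + re.escape(p) + r"\b", text))
                   for p in phrases)
        if hits:
            scores[code] = hits
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

note = "77-year-old male with history of hypertension and coronary artery disease (cad)."
print(candidate_codes(note))  # ['414.01', '401.9']
```

note how brittle this is: any abbreviation or phrasing not in the table is simply missed, which is one reason the task stays hard.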
and i'm sure you guys know, we've actually in the united states moved to icd-10 coding, where now we're talking about 65,000-plus codes. so again, from a computing standpoint, how do we train algorithms to recognize what are the most likely codes is an interesting question. another very classical question, where nlp has been used again in the clinical world, is one of cohort detection. and i know you have had a lecture on information retrieval, so i'm not going to go too much into it, but it's an important one. by the way, i'm giving examples of articles that i think represent very well what the nlp challenges
are and what the tasks or methods are. and so you're welcome to go and look at them, but they're just simple examples. there's actually quite a lot of research on each of those areas. and so here again, the interesting fact is that we want to select a cohort, say, from the electronic health record. so if we want to identify, in the electronic health record, who are the patients who have a particular disease-- in this example, we're talking about peripheral arterial disease-- then we know we could use some of the metadata that's available to us, but we also would
probably want to use or leverage the clinical notes. and as a matter of fact, it turns out that when it comes to identifying cohorts, nlp is often quite critical to identifying the right population. clinical decision support is a traditional example of clinical informatics where, again, nlp would be useful. the idea here is to leverage information from clinical notes within the logic of clinical decision support.
so if we want to have an alert that looks at potentially dangerous drug-drug interactions at the time of prescription in the electronic health record, or if we want to make sure that we're not prescribing a drug when a patient has a specific allergy, this type of clinical decision support system will go and look at the structured information that is made available in the electronic health record. but again, there is no guarantee that all the information about the patient is included in the structured part of the ehr. and in fact, very often, old allergies or old things fall through the cracks, and are documented in the notes, but not in the structured part.
and so nlp here, again, is extremely helpful. another type of application is one of data exploration. and this comes from the fact that there is so much text available to clinicians, to health practitioners, to researchers, that we need to help and support them in just making sense of all the data that's available. so this is an example of a patient record summarization system that we have at columbia university. and it's actually deployed and used by physicians. and the idea here is that we're taking all of the notes longitudinally.
we have 20-plus years' worth of data for each of our patients. and we can, through natural language processing, identify what are the primary problems for that patient, and on top of it, have an interactive visualization that allows the clinicians and ehr users to explore and go through the timeline of a patient. this is another example of how nlp can be used for data exploration, in a slightly different genre of text. we're looking at autism spectrum disorder communities. so these are typically parents of children who are on the autism spectrum. and the question, again, is: there is a lot of discussion.
how do i make sense of all of these discussions? why would i want to do this? maybe i'm a public health practitioner, and i want to look into health communication issues. or maybe i want to learn about how parents are viewing certain treatments, or what the awareness is about particular disorders or symptoms for patients. and so here what we've done is, we've created a named entity recognition system that can identify terms in posts. and so on the top right here you see an example, a snippet from a post. we then did so for all the posts in the community, and built, again, a visualization
to help with exploration on top of it, one that builds a network of significantly co-occurring pairs of terms. and so in this case, we were, again, in an autism community. and [inaudible] is very specific to the language and the register and the domain of autism. but it's interesting to see in which contexts, basically, that type of term is being talked about. this is another example of data exploration from text that has been extremely successful in many different areas, actually. and this is one of topic modeling.
and so this is taken from a paper that looked at all of the publications from the journal science, and that is trying to make sense, in a very unsupervised fashion, of all the different topics or themes that have been talked about in the journal over 100-plus years. so what is the challenge here? the challenge is that we do not know in advance-- we're completely unsupervised. we're not making any hypotheses about what type of themes are being discussed in these articles.
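the idea of unsupervised theme discovery can be sketched in miniature. this is not the correlated topic model from the paper-- it's a tiny collapsed-gibbs lda on a four-document corpus i invented, with two obvious themes, just to show the mechanics: we never label anything, and word clusters still fall out.

```python
import random
from collections import defaultdict

random.seed(0)

# toy corpus with two obvious themes (genetics vs. combustion) -- a stand-in
# for a century of science articles; real corpora have thousands of documents
docs = [
    "dna gene genome sequencing".split(),
    "gene dna genome variant".split(),
    "oxygen reaction combustion energy".split(),
    "energy oxygen combustion reaction".split(),
]

K, ALPHA, BETA = 2, 0.1, 0.1  # number of topics and smoothing priors
vocab = sorted({w for d in docs for w in d})

# random initial topic assignment for every word token
z = [[random.randrange(K) for _ in d] for d in docs]
doc_topic = [[0] * K for _ in docs]                # topic counts per document
topic_word = [defaultdict(int) for _ in range(K)]  # word counts per topic
topic_total = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        doc_topic[d][t] += 1
        topic_word[t][w] += 1
        topic_total[t] += 1

# collapsed gibbs sampling: resample each token's topic given all the others
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d][t] -= 1
            topic_word[t][w] -= 1
            topic_total[t] -= 1
            weights = [(doc_topic[d][k] + ALPHA)
                       * (topic_word[k][w] + BETA)
                       / (topic_total[k] + BETA * len(vocab))
                       for k in range(K)]
            t = random.choices(range(K), weights=weights)[0]
            z[d][i] = t
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

# each topic is summarized by its most frequent words
top_words = [sorted(topic_word[k], key=topic_word[k].get, reverse=True)[:3]
             for k in range(K)]
print(top_words)
```

on a corpus this separable, the sampler reliably pulls the genetics words and the combustion words into different topics, which is exactly the "rna/dna cluster" effect described above.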
and we want a method that can allow us to identify what are the primary themes. and moreover, in the case of correlated topic models, like in this example, we want to know how these themes relate to each other. and so we can build this type of visualization, again, that tells us that rna, dna, and that type of cluster of words here correspond to a significant topic, or a cluster. and moreover, it is quite related-- meaning it occurs or correlates often-- with a topic about genetic sequencing, and another one about proteins, for example. so the value here is that when you have a very, very large corpus given to you-- for
example, all of the articles published in a journal, and you're trying to extract, to understand, what are the primary topics discussed in that journal, like in this example; or if you have a patient cohort with lots of health-related notes related to them; or if you have access to an online health community-- then prior to detecting a particular phenomenon, maybe you want to just explore and understand what is being talked about. and so a lot of nlp methods rely on this type of topic modeling for data exploration. so another thing that you can do with this type of topic modeling is you can then look at how topics change through time. so again, this is from the science journal article, where you can look at-- these are
examples of two different topics. and you can look at, through time, from the earliest years of publication of the journal in 1880 up to 2000, what, for example, the mentions of the term "oxygen" were, and what the interest in oxygen in the journal basically was. so keep in mind, we're not extracting information here. we're just identifying salient terms that are representative of the corpus. and again, it should not be interpreted as a literal representation of the corpus, but rather as a tool for data exploration. and a more and more prevalent type of application has to do with surveillance from publicly
available data sets. this is a very active field of research. i'm putting here just two examples of papers, but there are many, many more, where the authors are looking at public social media sources-- so things like twitter or yelp-- and trying to identify very specific occurrences of a condition. in this case, the authors and the department of health and [inaudible] are interested in identifying who has had food-borne illness after eating in a restaurant, so that they can go and investigate what's going on with that restaurant.
so what are the challenges here, from an nlp standpoint? well, on the top left, you see a yelp review. and knowing that we're looking at food-borne illness, we want to be able to go and extract that someone was sick-- not just generally sick, but sick from eating. and not only that: the symptoms of food-borne illness happened after enough time that it would support the idea that the food-borne illness came from that restaurant. so it's a complex information extraction task. there is a bit of reasoning involved here.
another challenge is that, especially if we're talking about twitter, not all of the information is going to be nicely contained in a single text. there are going to be, maybe, conversations, and all sorts of interesting linguistic phenomena, such as coreferences, that are going to need to be taken care of to identify, for example-- like in this example on the bottom right here-- that this person indeed got sick, and that a particular location, the restaurant, was the potential culprit for the food-borne illness. researchers have been thinking about this idea of info surveillance quite a lot. this is healthmap, which i encourage you to go and visit.
this is work from harvard and john brownstein. and if you google healthmap, you'll find it right away. and it's a very nice interactive visualization of all the potential diseases that are being talked about, typically in the news. but they are incorporating other data sources as well in here. and so again, the same challenges come up. how do we identify that a particular disease is being talked about? and moreover, in this case, what is the location where the cases are being reported? so let me switch gears and continue with my summary of applications.
this is going back to the clinical domain. we're interested in high-throughput phenotyping. why do we want to do high-throughput phenotyping? maybe we want to be able to identify all conditions that are being talked about across a large cohort of patients. or maybe we want to do that [inaudible] identification, but instead of finding methods to do so very well for one disease at a time, maybe we want to do it for all diseases at once. and the reason i'm bringing up this example is mostly because the methods are very similar to the ones for topic modeling.
and they're interesting, because they're taking into account not only the text, but also the other structured information that we have in the electronic health record, such as laboratory tests, medications, and diagnosis codes. and so we can build nlp-based methods that incorporate this additional information, and bring better cues, basically, to identify what are the observations that enable us to decide that there is probably diabetes mellitus type 2 for a particular patient, for example. and so in this case, it's very much like topic modeling: it is an unsupervised method. and we can, moreover, get a description of those typical conditions or phenotypes
we are looking for. so for example, here we have lupus. and we're finding, again, in a very unsupervised fashion, that anti-malarials are a medication class that actually makes sense for lupus. and for the non-clinicians out there: that's correct, anti-malarials are a way of treating lupus. another type of application, which is really, really big in health care but is becoming really big also in social media types of situations, is predictive analytics. so the question here is: can we use text-- either text which involves a patient or, in
this example that i'm giving, text that is written by individuals on reddit-- to predict whether something is likely to happen, or whether they're likely to behave in a certain fashion? so in this example, we are interested in mental health. and the authors are looking at subreddits that have people talking about their depression. and they want to know: within six months, are those people actually moving on to another type of subreddit, which is more about suicide watch? and so the question here is, can i predict that someone is going to start ideating about suicide?
and it turns out that there are very specific linguistic indicators that can help us predict who is going to be likely to move on from the mental health type of subreddit to the suicide watch type of subreddit. you can imagine, there's a lot of interesting applications that can come out of this work. but what's nice is that there are these cues that come from the syntax, or from the type of words that are being used by people, from which we can actually predict what's going to happen in their behavior in an online community. similarly, in the clinical world, we can, again, try to predict adverse outcomes. so here, this is an example of a predictive model that looks at progression of chronic
kidney disease. and what's interesting here is that, because we're in the clinical world, we want to be able to predict, but we also want to be able to interpret some of our learned rules. and so the type of methods that are used here are, again, unsupervised, and are basically learning topics together with a kind of temporal model-- a kalman filter in this case. but primarily for nlp purposes, the point is that we can learn topics of words that occur in clinical notes that are going to be more or less predictive of increasing risk of progression-- those are the orange/red topics in here-- and the ones that are going
to have a protective effect on patients. and so that's, again, useful, because we can go further, generate hypotheses, and work further with clinicians on these questions. i spent a lot of time on those applications, but i think it's useful. and again, there are many, many others. it's just useful, hopefully, for you-- and it's definitely helpful for me in my research-- to think very widely about all the different types of texts as data or as sources of knowledge, versus the various types of tasks we want to do.
but i want to now focus on: ok, so we know that there are lots of applications of natural language processing, but why is it so hard? and more specifically, why can't i just use regular expressions? regular expressions are just like models of what i expect a string of characters to look like, and they help me search for a particular type of information in a large collection of text, for example. why do i need more complex models of language? well, the first problem is that language is ambiguous. and it's ambiguous at all linguistic levels.
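to make the regular-expression point concrete, here is a tiny sketch. the three notes are invented for illustration: a pattern for the literal word "diabetes" misses the abbreviation and happily matches a negated mention, which is exactly why string matching alone is not enough.

```python
import re

# three invented note fragments illustrating how a literal pattern fails
notes = [
    "patient has type 2 diabetes mellitus",  # matched, as intended
    "pmh: t2dm, htn",                        # abbreviation -- the pattern misses it
    "no evidence of diabetes",               # negated -- the pattern still matches
]

pattern = re.compile(r"\bdiabetes\b")
matches = [bool(pattern.search(n)) for n in notes]
print(matches)  # [True, False, True]
```

the false negative (variability) and the false positive (no reasoning about negation) preview the two challenges discussed next.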
and for those interested in what i mean by linguistic levels: morphology, the lexical level, syntax, discourse, and pragmatics are all interesting levels to think about. and here are some of the classic examples. these examples are actually taken from the excellent, excellent course from chris manning at stanford on natural language processing. and so here is one where there's definitely some syntax you can [inaudible]: "boy paralyzed after tumor fights back to gain black belt." same for this other headline in the news: "san jose cops kill man with a knife." so where is the ambiguity here?
well, are we saying who had the knife, basically? are we saying that the cops had a knife, and that's how they killed the man? or are we saying that the cops killed a man who had a knife? that type of ambiguity is called pp attachment ambiguity. if we were to build a syntactic parse for that sentence, we don't know where we should attach that phrase "with a knife." should we attach it to the noun phrase "man"? or should we attach it to the noun phrase "san jose cops"? it's actually ambiguous.
we do not know. maybe we do know as humans when we read it, because maybe there is some more likely scenario. but it's still ambiguous. there could be a scenario where the cops are killing the man with a knife-- sorry, that was ambiguous, too: they are killing a man, and they're using a knife to kill him. some of our more clinical examples, where ambiguity creeps in, are very lexical for us in clinical texts.
this is an actual example from the social history of a patient, and this patient's history reads: ca in mother, breast ca. and so we have here a term, ca, which happens to be an abbreviation. and those types of abbreviations are usually fairly ambiguous. in the first occurrence of ca, we mean california. in the second one, we mean cancer-- in particular, breast cancer. so how do we build a language model that can help us disambiguate between these cases? moreover, there are other senses to this type of term. for example, ca also means calcium.
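one simple way to attack this is to score each sense by its surrounding context words. the cue lists below are hand-made toys (a real system would learn context representations from labeled examples rather than use a hand list), but the sketch shows the mechanism:

```python
# toy context profiles for three senses of "ca" -- hand-written stand-ins
# for what a word sense disambiguation model would learn from data
SENSE_CUES = {
    "california": {"lives", "moved", "state", "los", "angeles"},
    "cancer": {"breast", "mother", "history", "tumor", "mass"},
    "calcium": {"level", "serum", "mg", "elevated"},
}

def disambiguate_ca(sentence):
    """pick the sense whose cue words overlap most with the sentence."""
    words = set(sentence.lower().replace(",", " ").split())
    return max(SENSE_CUES, key=lambda s: len(SENSE_CUES[s] & words))

print(disambiguate_ca("ca in mother, breast ca"))   # cancer
print(disambiguate_ca("serum ca level elevated"))   # calcium
```

the key idea-- meaning from co-occurring words-- is the same one the lecture returns to later with the co-occurrence networks.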
so this lexical ambiguity is quite difficult. the other challenge with language is the one of variability. and again, it happens at all linguistic levels. another way to think of it is that there's a high paraphrasing power in language. and it's a good thing, actually, because as humans, we want to have the ability to create new ways of saying new things, but also old things. there usually are some reasons why you would say a sentence one way or another. but for the sake of computing, we want to be able to identify that different forms of a term may actually refer to the same concept, like in the example on the left.
diabetes mellitus 2, diabetes, t2 dm-- all of these three, and many others-- how can i build a model that recognizes that these are all talking about type 2 diabetes? and note that i put diabetes here as an example, and that's actually ambiguous. are we talking about type 1 or type 2 diabetes? and so maybe there's some additional reasoning that would tell me that in my corpus, really, when i see diabetes, i mean type 2 diabetes, and i'm able to map or normalize all of these forms to a single concept. and so that's an example of variability from the lexical standpoint.
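the simplest possible normalizer is just a synonym table. the concept id "DM2" below is a made-up placeholder-- a real system would map to an actual concept identifier in a terminology-- but the sketch shows what "mapping all forms to a single concept" means operationally:

```python
# toy synonym table mapping surface forms to one concept id -- "DM2" is a
# made-up placeholder for a real terminology identifier
NORMALIZE = {
    "diabetes mellitus 2": "DM2",
    "type 2 diabetes": "DM2",
    "t2 dm": "DM2",
    "t2dm": "DM2",
    "diabetes": "DM2",  # corpus-specific decision: bare "diabetes" means type 2 here
}

def normalize(term):
    """map a surface form to its concept id, or None if unknown."""
    return NORMALIZE.get(" ".join(term.lower().split()))

forms = ["Diabetes Mellitus 2", "T2DM", "diabetes"]
print({f: normalize(f) for f in forms})
```

notice the last table entry encodes exactly the corpus-level decision described above: in this corpus, "diabetes" is taken to mean type 2.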
but there is, obviously, a lot of variability syntactically as well. john loves mary. mary is loved by john. the man who is named john loves mary. all these things mean the same thing. they're paraphrases of each other, and are, nevertheless, conveyed with different syntactic structures. it's interesting to think further about the lexical ambiguity, and really go back to that question about diabetes.
do we mean diabetes type 2, or diabetes in general? so the context is really helpful in these things. and as a matter of fact, a lot of the computational methods are incorporating context to determine this type of language model. and so i thought i would show you a very simple visualization of what we mean by context. so here we have two online health communities. one is the same autism spectrum disorder discussion forum that i mentioned earlier. and the other one is a general parenting discussion forum. and again, we built a representation of all the terms that are being mentioned
in those forums. and we do some normalization, as much as we can, but we then also try to look for co-occurrence patterns. and so two terms are going to have, in this network, an edge between them if they co-occur together more often than not. and so why am i putting these two networks here? those are networks centered, actually, on a single node-- on the node "drug" here, and similarly here on the asd side on your left, on the node "drug" as it is mentioned in the autism communities.
and so what do we see? well, two very, very different networks. and clearly, you can say a lot about how, even though they are exactly the same terms, they mean completely different things in different contexts. in the general parenting forum, "drug" is typically meant as an illicit or illegal drug, or a problematic substance like alcohol, marijuana, substance abuse. pharmaceutical company is more about the pharma type of drug, and illicit-- those are the terms that are found as being statistically important in relation to "drug." in the autism world, clearly, they mean different types of drugs.
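the core of such a network can be sketched in a few lines. the posts below are invented, and i threshold raw pair counts, whereas the system described above tests for statistically significant co-occurrence-- but the structure is the same: nodes are terms, edges are pairs that co-occur often enough.

```python
from collections import Counter
from itertools import combinations

# invented term lists extracted from three toy posts
posts = [
    ["drug", "risperidone", "melatonin"],
    ["drug", "risperidone", "sleep"],
    ["drug", "alcohol"],
]

def cooccurrence_edges(docs, min_count=2):
    """edges between term pairs that co-occur in at least min_count documents.
    a real system would replace the raw-count threshold with a significance test."""
    pairs = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            pairs[(a, b)] += 1
    return {p for p, c in pairs.items() if c >= min_count}

print(cooccurrence_edges(posts))  # {('drug', 'risperidone')}
```

centering the resulting graph on the node "drug" and drawing its neighbors is exactly what the two visualizations above do.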
and moreover, all of these drugs are very representative of the types of treatments that children on the spectrum are using. so again, from an nlp standpoint, the interest here is that, even though these are terms that have exactly the same lexical form, they actually mean different things in different contexts. and so a lot of the methods that are being developed nowadays have to do with language modeling-- with trying to represent or encode what the meaning of a particular word is through the types of words it occurs or co-occurs with. this is a similar example again.
the same word on both sides means very, very different things in the two types of context. in general parenting, when parents talk about language, they mean foul language, or maybe bilingual language. whereas in the autism community, language is all about the potential symptoms related to autism, as well as potential therapies that have to do with helping with language difficulties. the last challenge that i will focus on, through very simple examples, is the fact that language is vague. and another way to think about it is that maybe we need to reason about language.
and like the ambiguity and the variability challenges, it's present at all linguistic levels. so there are three types of example that i want to focus on. the first one is one that we see quite a lot in the clinical world. this is two sentences-- mi three weeks ago, mi three days ago. so what's the question here? first of all, we need to recognize that mi means myocardial infarction in this case. that's fine. that's a lexical ambiguity or paraphrasing [inaudible] question. but here, i want to focus on the temporal expression, which here is three weeks ago,
and here is three days ago. if we want to actually reason over this information, and if we want to understand or represent the meaning of those occurrences, we need to understand, first of all, that this is a temporal expression, and moreover, that it is about that clinical event, which was a myocardial infarction. the natural thing to do, then, would be to say: well, ok, if we were to establish a timeline of events for the patient, then i would use the time at which the note was written, and i would go three weeks into the past, and put the myocardial infarction at that time. that would be a sensible thing to do automatically.
it turns out that in the clinical world, that's not always correct. that's not always the right thing to do. i mean, it is the right thing to do, but it's more complex than that. so what's going on? well, we can actually find that when someone says, "mi three days ago," in fact, the person means three days ago. they really mean that it happened three days ago. but when a clinician says three weeks ago, it could be three weeks ago-- 21 days ago. but it could be 20 days ago, or 19 days ago, or something like that.
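one way to encode that asymmetry is to attach an uncertainty interval to the extracted date, wider for coarser units. the slack rule below (half a unit for weeks and months, exact for days) is my own toy heuristic, not the published finding, but it shows what "encoding this type of uncertainty" can look like:

```python
import re

WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
UNITS = {"day": 1, "week": 7, "month": 30}

def anchor_event(text):
    """return (days_ago, plus_minus) for a simple temporal expression.
    toy heuristic: speakers round coarse units, so weeks and months get
    +/- half a unit of slack, while days are taken literally."""
    m = re.search(r"\b(one|two|three|four|five)\s+(day|week|month)s?\s+ago", text)
    if not m:
        return None
    days = WORDS[m.group(1)] * UNITS[m.group(2)]
    slack = 0 if m.group(2) == "day" else UNITS[m.group(2)] // 2
    return days, slack

print(anchor_event("mi three weeks ago"))  # (21, 3)
print(anchor_event("mi three days ago"))   # (3, 0)
```

downstream, a timeline builder can then place the event at note_date minus days_ago, carrying the interval along instead of pretending the date is exact.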
and it sounds like it's not an important problem, but it actually is quite an important problem, for two reasons. one is that it's not always the case that this type of temporal information is available in the structured part of the electronic health record. we really want to rely on the notes, because the notes are often talking about events that did not happen in the current institution, or for various other reasons. the second one is, if we want to study temporal or time-related phenomena, like adverse drug reactions or prediction of particular events, we really want to have the timeline down correctly. and so how do we know that when someone says "three weeks ago," that really means 21 days
plus or minus four days, whereas when someone says "three days ago," that really means three days ago? we need some reasoning here. or at least we need a way to encode this type of uncertainty about the language. speaking of uncertainty, the next example is interesting as well. and that's something that happens very often in the clinical and in the health domain. so when i say, "paul thinks that john loves mary," there are a few things happening from a linguistic standpoint.
first, there is the fact that john loves mary. and now, what can we say about that fact? is it true? or is it untrue? what do we know exactly? we actually do not know whether that fact is true. but we know that a third person thinks that that fact is true. similarly, this type of pragmatic thing creeps up even in clinical nlp. when someone says, "patient denies smoking," that's typically paraphrased as "patient does not
smoke." but really, when we think about it, it means, "i do not know whether the patient smokes or not, but i'm going to encode it, and i'm going to represent it, by saying the patient denies it." it's something to think about when we do this type of information extraction and reasoning. in the health domain, that comes up quite often. you can imagine, if a patient writes a question online, or searches for some symptoms somewhere, or writes to their peers, they often would say, "i think this is happening." and there's a lot of uncertainty about all of these phenomena that they're discussing.
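this kind of assertion labeling is often handled with cue lists in the spirit of systems like negex and context. the cue lists and the label names below are tiny hand-made stand-ins (real systems also bound the textual scope a cue applies to), but they show the basic move: extract the concept and a status, never the bare "fact":

```python
# minimal assertion classifier sketch -- toy cue lists, invented label names
NEGATION_CUES = ("denies", "no evidence of", "negative for")
HEDGE_CUES = ("thinks", "i think", "possible", "may have")

def assertion_status(sentence, concept):
    """label a concept mention: absent / reported-negative / uncertain / asserted."""
    s = sentence.lower()
    if concept not in s:
        return "absent"
    if any(c in s for c in NEGATION_CUES):
        return "reported-negative"  # asserted negative, not independently verified
    if any(c in s for c in HEDGE_CUES):
        return "uncertain"
    return "asserted"

print(assertion_status("patient denies smoking", "smoking"))       # reported-negative
print(assertion_status("i think this is happening", "happening"))  # uncertain
print(assertion_status("severe facial rash noted", "rash"))        # asserted
```

note the label "reported-negative" rather than "false": it encodes exactly the point above, that we only know what was asserted, not what is true.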
and so the last thing we would want to do as nlp practitioners would be to extract this information and just say, look, it says, "john loves mary," i'm going to extract that fact as being true, and ignore the other part of the sentence. so taking into account this uncertainty of language is very interesting. the final type of reasoning that's needed, i think, is illustrated well by that tweet example, where this person says that they just ate a bowl of xanax for breakfast. it's true it's a well-formed sentence-- which, by the way, does not happen often on twitter, so that's good. even the medication is well written here, so there's no problem of identifying that "xanax" means xanax. the question is, did this person actually eat a bowl of xanax for breakfast? probably not. and so is it sarcasm? is it humor? or did this person actually mean that they didn't eat a whole bowl of it, but maybe they took a lot of xanax? and maybe we would want to pay attention to that type of signal when doing monitoring of twitter.
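as a concrete illustration, one common way to operationalize this kind of reasoning over "patient denies smoking" and "i think this is happening" is a trigger-based assertion classifier, in the spirit of rule-based systems such as negex. here is a minimal python sketch; the trigger lists and the context window are my own illustrative assumptions, not something from the lecture:

```python
# Toy assertion classifier: decide whether a concept mention is negated,
# uncertain, or affirmed based on trigger phrases preceding it.
NEGATION_TRIGGERS = ("denies", "no evidence of", "without", "negative for")
UNCERTAINTY_TRIGGERS = ("possible", "may have", "i think", "cannot rule out")

def assertion_status(sentence, concept):
    """Return 'negated', 'uncertain', or 'affirmed' for a concept mention,
    or None if the concept does not appear in the sentence."""
    s = sentence.lower()
    idx = s.find(concept.lower())
    if idx == -1:
        return None
    window = s[max(0, idx - 40):idx]  # look back a few words before the mention
    if any(t in window for t in NEGATION_TRIGGERS):
        return "negated"
    if any(t in window for t in UNCERTAINTY_TRIGGERS):
        return "uncertain"
    return "affirmed"
```

on the examples above, `assertion_status("patient denies smoking", "smoking")` yields `"negated"`, while `assertion_status("i think this is happening", "happening")` yields `"uncertain"`-- a crude but surprisingly effective baseline for encoding uncertainty in clinical text.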
we want to be able to build models that can recognize that there's something uncertain about this information. so, approaches to health nlp-- there are actually many, many different approaches out there. and i made an editorial choice in focusing on two types of methods. and i'll talk about a third one in the current developments, but keep in mind, again, i'm just giving you an overview. and i definitely suggest that if you're interested in the field of natural language processing in the health domain, there are wonderful classes and tutorials online. and i'm happy, if you want to get in touch with me, to point you to some of them.
so a task for which a lot of methods have been designed over time in clinical texts and clinical nlp is one of information extraction. the task is, if i have a sentence like this one, "patient should come back if severe facial rash occurs," i want to be able to do what's called "named entity recognition," meaning that i recognize that "facial rash" is a disorder. it's a type of problem that can happen to the patient. not only that, it's a particular concept in a terminology. in this case, i'm showing you a concept in the umls, the unified medical language system, which is a really large terminology.
and maybe i'd like to extract other modifiers about these particular disorders. so in this case, maybe i would want to identify that "severe" is an important modifier here. and the "if" here is also, as you can imagine, quite important, and is indicating that this is a conditional type of disorder. in other words, the patient actually does not have a facial rash, but the facial rash is discussed, nevertheless, in the context of this particular patient. so what other types of modifiers have been thought about? many, many different ones.
this is an example of a set of modifiers for disorders. we would want the disorder normalized to a terminology. we would want to be able to identify negation; who the subject of that disorder is, or who the disorder is attributed to, is another way to think about it. is there uncertainty? is there some sort of progression or course mentioned about that disorder? what is the severity of the disorder? is it conditional, like in the example earlier? or is it actually not a disorder, like in this example, when we say, "the patient goes to the hiv clinic"? it really says "hiv," and hiv is a disorder, but "hiv clinic" is really a different kind of entity, and that distinction is something we would want to pay attention to. and if there is a particular body location that's affected by the disorder, like facial rash, "facial" is actually the important body location we would want to extract here. so there's been a lot of work in natural language processing to define these types of schemas, to figure out what it is that we want to extract in clinical texts like this one. that's one type of work. and typically, this is done through lots of annotations of text.
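to make this concrete, here is a toy python sketch of dictionary-based extraction of a disorder mention together with two of the modifiers just described, severity and conditional status. the tiny lexicons and the placeholder concept code are my own illustrative assumptions; a real system would map terms against the umls:

```python
# Toy lexicons -- a real system would look terms up in the UMLS.
DISORDERS = {"facial rash": "C0000000"}  # placeholder concept id, not a real CUI
SEVERITY_WORDS = {"mild", "moderate", "severe"}
CONDITIONAL_WORDS = {"if", "unless", "should"}

def extract_disorders(sentence):
    """Find disorder mentions and attach severity/conditional modifiers."""
    tokens = sentence.lower().rstrip(".").split()
    text = " ".join(tokens)
    mentions = []
    for term, concept_id in DISORDERS.items():
        if term in text:
            start = tokens.index(term.split()[0])
            severity = (tokens[start - 1]
                        if start > 0 and tokens[start - 1] in SEVERITY_WORDS else None)
            conditional = any(t in CONDITIONAL_WORDS for t in tokens[:start])
            mentions.append({"term": term, "concept": concept_id,
                             "severity": severity, "conditional": conditional})
    return mentions
```

on the lecture's example sentence this yields one mention of "facial rash" with severity "severe" and conditional set to true-- exactly the modifiers discussed above.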
and there's a whole field of research about how to do these annotations in a way that is correct, in ways that we can trust the annotations, and such that the schemas are generic enough to be applied to many different types of information extraction systems. and then, of course, we want to be able to build methods that are going to actually use the schema, go to a new type of text, and identify those named entities being mentioned, along with all of these modifiers. this is a really, really large field of research in clinical nlp. and so i am giving here literally three examples of famous papers in this domain.
but there are many, many more. the example i'm giving here is from the ctakes system, which is actually open source. and you can try it, if you have any clinical texts. and the reason i like it is it shows you all of the different parts of the pipeline that the system goes through in order to do this named entity recognition. so there is a sentence here that says, "family history"-- fx is an abbreviation for family history-- "of obesity, but no family history of coronary artery disease." so the first question is, how do we tokenize?
and this, i believe, you had some discussion of in the text mining lecture. we want to then identify the part of speech of the different words in the sentence, and maybe do what's called a shallow parsing, meaning we would identify not the whole syntactic structure of the sentence, but rather what are the primary noun phrases. so a noun phrase would be a set of words that together refer to a noun, a [inaudible] noun. and from there, we would then go and look for particular noun phrases that can be mapped to a terminology, like here. so obesity is found, and coronary artery disease is found.
but also coronary artery, and artery, and diseases. and all of those are potential named entities we would want to identify. furthermore, the status for obesity is that it's a family history, as opposed to the patient having obesity, and it's not negated. so nowadays, i think that the state-of-the-art named entity recognition systems are using sequence labellers. and again, [inaudible] kind of mentioned it, and so i thought i would move on from there. but you can see what i mean here by sequence labelling, where i'm able to go through each of these words, and i'm slowly building up this particular term, "coronary artery disease," by saying that this is the beginning of my entity, and i'm going to keep adding to my sequence of terms until i feel like a word doesn't belong to the named entity anymore. you've heard me refer to terminologies quite a lot. and this is outside the scope of this lecture, but you realize, i hope, that we are relying heavily in this type of clinical task on lexicons, or terminologies, or ontologies. and one huge challenge is, how do we ensure that the lexicon actually captures the language of the corpus under study? the umls is very large, but it's also quite specific, for example, to clinical and technical types of language. and so now, when we are working in online health communities or on twitter, we need to figure out ways to either augment or adapt the umls. and nlp can help with this as well. another type of approach that's been really helpful, and that i've mentioned, is probabilistic topic models. and for this i would like to refer you to the work of david blei, and all of the people who have worked on this for the past 10-plus years. the idea here is a very different approach from clinical named entity recognition: it is completely unsupervised, and tries to find patterns in text-- in particular, to identify clusters of words that tend to co-occur quite often with each other, such that the documents they appear in are considered to be a probabilistic mixture of those clusters, or topics. and so in the output of your topic modeling-- in this case, latent dirichlet allocation-- you can identify that all of these different words together are very likely to be part of that topic. and moreover, we could think of that topic, as humans, as being about genetics, or evolution, or disease in particular for that example.
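for intuition, here is a compact, self-contained sketch of lda inference via collapsed gibbs sampling in plain python. it is a teaching toy (tiny corpus, fixed seed, no convergence checks), not a substitute for mature implementations of blei's model:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.
    Returns the top words per topic and per-document topic counts."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]  # topic assignments
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1; topic_word[k][w] += 1; topic_total[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                doc_topic[d][k] -= 1; topic_word[k][w] -= 1; topic_total[k] -= 1
                # sample a new topic proportional to P(topic | doc) * P(word | topic)
                weights = [(doc_topic[d][t] + alpha)
                           * (topic_word[t][w] + beta)
                           / (topic_total[t] + vocab_size * beta)
                           for t in range(n_topics)]
                r = rng.random() * sum(weights)
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                doc_topic[d][k] += 1; topic_word[k][w] += 1; topic_total[k] += 1
    top_words = [sorted(topic_word[t], key=topic_word[t].get, reverse=True)[:3]
                 for t in range(n_topics)]
    return top_words, doc_topic
```

run on a handful of documents about, say, genes versus heart disease, the two recovered topics tend to separate the two vocabularies, mirroring the "genetics, or evolution, or disease" reading described above.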
i have not talked at all about evaluation. but this is something that people think a lot about in health natural language processing. there's a lot of task-based evaluation. so when we do disease progression, for instance, task-based evaluation asks: did i accurately predict who is going to progress? and that's fine. but when we think about the core nlp methods, we usually need gold-standard annotations to do some sort of internal validation, or intrinsic evaluation, of the core nlp methods.
and again, it's a whole topic of its own. so i'm not touching on it, but i'm mentioning it here. so i want to finish with the current developments in health nlp and nlp research. and this is a very exciting type of research that has been happening over the past five years, less than that actually, which is the advent of neural nets in the world of linguistics. and so i felt like i couldn't really point you to very specific papers. but if you are interested in deep learning in the context of natural language processing, richard socher at stanford has an excellent class. and all the material is available online.
so i suggest that you go and either read the material, or even try to do the homework. but i thought that, basically, there are four areas that are very interesting, i think, from an nlp standpoint as far as new areas of work. one is better or improved language models. we know already that we need a good contextual representation of what words are. and now, through neural nets, we have better contextual representations, not only of the words, but of what a sequence of words, or even a sequence of characters, represents. and that has been shown to be extremely powerful. we also have better sequence models.
and so remember that named entity recognition task, where we're looking for a sequence of words that together define a medical term? for instance, recurrent neural nets, either on words or on characters, are very exciting, because they can capture rich, long-distance dependencies in text, which was not possible with earlier models, like crfs, for example. there is also really interesting work in learning models for mixed modalities. so what do i mean by mixed modalities? things like text plus images, text plus laboratory tests, et cetera. and we've had some examples in the applications, where we know that by incorporating all this metadata, if you want, about a patient, or about an individual online, we can actually predict better or understand better what a person is talking about, or what a doctor is talking about with respect to a patient. and finally, there are interesting new types of machine learning-- i'm thinking about adversarial neural nets and reinforcement learning, in particular-- that are starting to show some very useful progress in nlp. so again, i'm not pointing you to any specific papers. but if you're interested, there are some classes online for you to look at. so i want to conclude by telling you that language is pervasive. it's hard to process automatically.
and sometimes, it's also hard for humans. but that's because there is ambiguity, variability, and vagueness. still, much progress has been made. and i'll finish by saying that this is an active field of research, both in terms of the core nlp methods that are being developed, and in ways of thinking about how to leverage text in general for biomedical knowledge discovery, characterizing diseases and behavior, improving care, et cetera. so with that, i'll stop.
and hopefully, there is enough time for questions. thank you, dr. elhadad, for a great talk and overview of nlp in the medical and health domains. so we have one question for you from the audience. this first question is regarding privacy issues. a lot of the examples that you gave are from social media like reddit and other areas where you might be monitoring forums and such. can you address how issues related to privacy and potentially de-identification are addressed in this type of work?
obviously, that applies also, more specifically, when you're monitoring clinical records. but if you could touch broadly on that, that would be appreciated. great, yes, that's a great question. and you're right, i didn't talk about this. i gave you an overview of different papers out there. and everyone has their own [inaudible] at their own institution that usually makes sure that the methods are being respectful of the participants-- in this case, the people who write those texts, or the people about whom these texts are written.
so when it comes to online communities and social media, most of the papers i've seen deal with publicly available data sets and text corpora. so for example, the breast cancer data set that i mentioned very early on is completely available online. twitter is available if you don't pay attention to private messages, et cetera. there are some questions about, say you go and crawl the entirety of an online health community-- how do you go about distributing these data? as researchers, we do not have the right to distribute them.
we need to have, somehow, the agreement of the people who created and are moderating these data. and so people have been looking a lot into how to de-identify texts. and there are some de-identifiers out there, both commercial and research ones. and it turns out it's a very institution-specific type of thing, where some institutions are ok with letting you distribute your data if you de-identified it, and others are not. there are de-identified clinical corpora out there; mimic is one of them. and there are publicly available online health communities out there.
great, so our second question is actually about evaluation. what is the typical error rate in interpreting these texts? as you apply natural language processing techniques, obviously, it's not necessarily perfect. and as you pointed out, there is a lot of inherent vagueness to the texts, so even humans are not necessarily perfect. but can you discuss what the typical error rate might be? and are there any examples where any misinterpretations have actually led to bad decisions? ok, so i feel like there are two questions here-- unintended consequences of recognizing something, and what the typical error rate is. it's hard to say what typical is. if you think of a named entity recognition system, it's typically built with a terminology in mind for a particular type of text. and so if i take the umls and ctakes-- or metamap or medlee-- and apply it to an autism community, i might find that i get an f-measure of, say, 40%. but if i were to do exactly the same and apply it to clinical notes, i would get an f-measure of 85%. so it's difficult to say, without thinking of the task, the corpus, and the terminologies, for example, how accurate a named entity recognition system typically is. and what this is pointing at is that there's not a whole lot of benchmarking in health nlp. that's something that the community is working on heavily. so there have been, for the past few years, data sets-- de-identified, in the case of clinical texts-- where gold-standard annotations are provided for people to share. and there are shared challenges where many different research groups try different methods, but everyone agrees to use the same evaluation metrics on the same benchmark annotations and texts.
and we can then compare and understand which methods work better than others. i would say, though, that we're probably not as far ahead as we could be compared to other domains of nlp and other areas of machine learning, in part because of those privacy issues, for sure. great, so we're actually at 10 o'clock now. so being mindful of the time, thank you again so much for giving this talk.
and we'll close it here. happy holidays, everyone. we will be back in january 2017 with the next set of talks.
thank you, and bye, everyone.