If you like learning by doing, this article gives you a list of practical projects you can use to learn spaCy in depth.
Before you begin
This article is meant for programmers. If you don't have any programming experience, you can still skim this article to see what kind of problems can be solved by Natural Language Processing. But to be able to work on these projects, you need to already have some Python experience.
The projects are not ordered by difficulty. In fact, some of them are probably open problems in their specific domains. What matters isn't that you completely solve a problem, but that you understand the techniques used to tackle it.
You might say that these projects are just "tasks" within a single project. And I do agree, to an extent, although some of the tasks are quite large in scope.
I don't explain how to actually do the projects, although I plan to create some material on that topic in the future.
I use the abbreviation SME (subject matter expert) to refer to all the medical folks - epidemiologists, doctors etc.
All these projects were inspired by a Kaggle competition which is trying to use NLP to better understand medical literature around COVID-19.
I contributed a tool called a dynamic evidence gap map to this effort. There is also a much larger effort called CoronaWhy which is a large group of data science folks and SMEs who are creating other tools for the same competition.
I expect that you already understand what I am referring to when I say intervention (risk factor), outcome and study design. If you don't, I suggest either reading the article about the dynamic evidence gap map (linked before), or you can also do an online search to see what these terms mean in the field of epidemiology.
The COVID-19 research dataset includes a metadata.csv file, which provides metadata about each research paper (such as title, date published, journal name etc).
There are also multiple folders (each folder corresponds to a data source such as arxiv, biorxiv) which contain the full text of the research papers in JSON format. The text was extracted from the original PDF files by a parser, so the text in the JSON files may not be very clean. For example, it will often include footer text about the license of the paper.
It is important to understand the actual structure of the JSON file which was parsed.
Let us use a single JSON file as an example. (It is possible the ID of the file changes and the link may not work in the future. But you can still follow the concepts by reading the description below).
First, we can see that the following fields are available at the top level: paper_id, metadata, abstract, body_text, bib_entries, ref_entries and back_matter.
The paper ID looks like a randomly generated GUID, but it is actually a SHA-1 hash of the source PDF, and the same value is (usually) in the metadata.csv file under the field called SHA. This ID is used to link the information in the metadata CSV to the full body text of a research paper.
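Here is a minimal sketch of that lookup using only the standard library. It assumes the column is literally named `sha` (lowercase) and that one cell may hold several IDs separated by semicolons. Both are assumptions you should check against your copy of metadata.csv, and the sample rows are made up.

```python
import csv
import io


def index_metadata_by_sha(csv_file):
    """Map each ID found in the (assumed) 'sha' column to its metadata row."""
    index = {}
    for row in csv.DictReader(csv_file):
        # Assumption: a cell may hold several IDs separated by ';'.
        for sha in (row.get("sha") or "").split(";"):
            sha = sha.strip()
            if sha:
                index[sha] = row
    return index


# Tiny in-memory stand-in for metadata.csv (made-up values).
sample_csv = io.StringIO(
    "sha,title,journal\n"
    "abc123,An example title,Example Journal\n"
    ",A paper without full text,Another Journal\n"
)
metadata_by_sha = index_metadata_by_sha(sample_csv)

paper = {"paper_id": "abc123"}  # stand-in for a parsed full-text JSON file
row = metadata_by_sha.get(paper["paper_id"])
print(row["title"] if row else "no metadata row found")
```

With the real dataset you would pass an open file handle for metadata.csv and a dictionary loaded with `json.load` instead of the stand-ins.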
The metadata field provides information about the title and list of authors.
The abstract, as you might expect, provides the text of the abstract.
The body_text field is the main section, of course.
There are some things we need to notice here:
- Each body text JSON object has a subfield called section, which is the name of the section
- A single section could span multiple JSON objects
- Sometimes section names are empty
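The three points above suggest a small preprocessing step: collect the paragraphs that share a section name and keep the unnamed ones in their own bucket. This is only a sketch. The `section` key comes from the description above, while the `text` key and the `(no section name)` placeholder are my assumptions.

```python
def group_by_section(body_text, unnamed="(no section name)"):
    """Join all paragraphs that share a section name into one string."""
    sections = {}
    for paragraph in body_text:
        # Empty or missing section names go into a single placeholder bucket.
        name = (paragraph.get("section") or "").strip() or unnamed
        sections.setdefault(name, []).append(paragraph.get("text", ""))
    return {name: "\n".join(texts) for name, texts in sections.items()}


# Made-up stand-in for the body_text list of one paper.
body_text = [
    {"section": "Introduction", "text": "First paragraph."},
    {"section": "Introduction", "text": "Second paragraph."},
    {"section": "", "text": "Paragraph with no section name."},
    {"section": "Methods", "text": "How the study was done."},
]
sections = group_by_section(body_text)
for name, text in sections.items():
    print(name, "->", len(text), "characters")
```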
The bib_entries field has details about the papers which are cited by the current paper.
The ref_entries field has information about tables and figures from the paper. Since this is extracted from PDF, the formatting is often messed up, meaning the data often isn't usable.
The back_matter field has information about the Acknowledgements.
There have already been plenty of tools built for the CORD19 dataset. You can see a list of them by going to the Tools tab from the Contributions page.
There are all kinds of exploration and visualization tools, and I would like to bring a couple of them to your attention.
Dynamic Evidence Gap Map
You can read more about dynamic evidence gap maps here.
Notice that the structure of this dynamic evidence gap map requires the following pieces of information:
- what is the risk factor or intervention being studied?
- which outcome is being studied?
- what kind of study design was used?
We will go into these in more detail later.
There is also a challenge on the Kaggle website to generate AI-powered literature reviews. To do this, the SMEs first proposed a "result normalization" format. This is a CSV format with the different attributes of interest (i.e. what do they want to extract from the research paper?).
The challenge participants were asked to populate these fields for the papers they summarized.
Here is an example CSV format:
There are a couple of big advantages of defining such a CSV format:
- the data science folks have targets to aim at
- the CSV data can be easily exported into a lot of different visualization tools, such as the one below (which uses Microsoft Power BI).
Now, with this background, let us take a look at the different projects which have been posed by this challenge/competition.
Is this a Methods section?
Most clinical trial papers include a section about the methods used. Unfortunately, not all authors actually name that section "Methods".
Suppose you find a keyword or phrase corresponding to one of the study designs. If we know that it appears in a Methods section, then it is much more likely that the authors are describing the study design used for their own paper.
Is this a Results section?
Suppose our algorithm identifies some keywords corresponding to outcomes. If we know that this is a Results section, then the odds are much higher that the outcome mentioned in this section is the actual outcome of the trial.
Takeaway: identifying the Methods and Results sections is an important preliminary task. If we manage to identify these sections, the text within them is highly likely to contain attributes of interest.
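The easy part of this task is the case where the section name gives it away. A rough baseline, assuming you already have the section name as a string, is a keyword lookup like the one below. The keyword lists are hypothetical starting points, not something the SMEs provided.

```python
import re

# Hypothetical keyword lists; extend them after looking at real section names.
SECTION_PATTERNS = {
    "methods": re.compile(
        r"\b(methods?|methodology|materials|study design)\b", re.IGNORECASE
    ),
    "results": re.compile(r"\b(results?|findings)\b", re.IGNORECASE),
}


def classify_section(section_name):
    """Return 'methods', 'results' or 'other' based on the section name alone."""
    for label, pattern in SECTION_PATTERNS.items():
        if pattern.search(section_name):
            return label
    return "other"


for name in ["Materials and Methods", "Results and Discussion", "Introduction", ""]:
    print(repr(name), "->", classify_section(name))
```

The interesting work starts where this baseline returns "other": empty or unusual section names need a classifier that looks at the paragraph text itself.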
Is a specific risk factor being considered in this paper?
As humans, we can read the abstract and very quickly determine what risk factor is being considered in a paper. Can an algorithm identify if a risk factor is being considered in the study?
There are a lot of risk factors being studied for COVID-19 since it is still not very well understood. You can find a list of risk factors by looking at the evidence gap map and selecting one of the "Interventions" drop-down boxes.
Is a specific outcome being considered?
Similar to risk factors, can we identify if a specific outcome is being considered in this paper?
You can find a list of outcomes by taking a look at the Outcomes dropdown in the dynamic EGM tool.
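For both risk factors and outcomes, the simplest possible starting point is to check whether a known term appears in the text. The sketch below uses plain regular expressions so that it runs without any dependencies; spaCy's PhraseMatcher applies the same idea to tokenized text. The risk factor list and the abstract are made up for illustration.

```python
import re


def find_terms(text, terms):
    """Return the terms that occur in the text as whole words (case-insensitive)."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


# Illustrative term lists; real ones would come from the EGM drop-downs.
risk_factors = ["diabetes", "hypertension", "smoking"]
outcomes = ["death", "ICU admission", "mechanical ventilation"]

abstract = (
    "We studied whether diabetes and hypertension were associated "
    "with ICU admission in hospitalized patients."
)
print("risk factors:", find_terms(abstract, risk_factors))
print("outcomes:", find_terms(abstract, outcomes))
```

A match only tells you that the term is mentioned, not that it is what the paper actually studied.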
What is the study type?
Different types of studies provide different levels of evidence. The SMEs are more interested in identifying papers which provide higher levels of evidence.
This is a mind map depicting how the study type is defined (according to an SME):
We don't have to apply this flowchart to determine the study type. Quite often, it is just stated up front in the paper's abstract.
What type of study design was used?
The study design describes (again, according to the SME) the attributes of the study type.
"The design is a bit more (but includes the study type): structure, specific details of the studied population, time frame, etc."
We would like to know the severity of the outcome - e.g. death, mechanical ventilation, ICU admission etc.
We would like to know if there was a fatal outcome in any of the patients, and associated parameters.
Notice that severity and fatality are two of the columns provided in the CSV format.
Identify sample size
How many patients were studied as part of the trial? Other things being equal, a larger sample provides stronger evidence.
Here is an example from the CSV visualization we saw before:
Identify sample population
Can we identify how the sample population was selected?
Here is an example from one of the papers, which explains how the sample was selected:
Identify statistical measures
The research papers often report a lot of statistical measures, and surfacing this information can be very useful when building an NLP-powered literature review.
A good example is the confidence interval, an important statistical measure which indicates how precise a reported estimate is (a narrower interval means less uncertainty).
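Confidence intervals are usually written in a fairly regular way, so a regular expression is a reasonable first pass. The sketch below assumes phrasings such as "95% CI 1.3-4.5" or "95% confidence interval: 1.1 to 2.9". Other formats will need more patterns, and the example sentence is made up.

```python
import re

CI_PATTERN = re.compile(
    r"(\d{2,3}(?:\.\d+)?)\s*%\s*(?:CI|confidence interval)"  # level, e.g. "95% CI"
    r"[\s:,=(\[]*"                                            # optional separators
    r"(-?\d+(?:\.\d+)?)"                                      # lower bound
    r"\s*(?:[-\u2013,]|to)\s*"                                # hyphen, en dash, comma or "to"
    r"(-?\d+(?:\.\d+)?)",                                     # upper bound
    re.IGNORECASE,
)


def find_confidence_intervals(text):
    """Return (level, lower, upper) tuples for each interval found in the text."""
    return [
        (float(level), float(lower), float(upper))
        for level, lower, upper in CI_PATTERN.findall(text)
    ]


text = (
    "The odds ratio was 2.4 (95% CI 1.3-4.5) for diabetes and "
    "1.8 (95% confidence interval: 1.1 to 2.9) for hypertension."
)
print(find_confidence_intervals(text))
```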
Whose work is being discussed?
One of the biggest sources of false positives (wrongly identifying a risk factor or outcome as being studied by the current paper) is the inability to distinguish between the work of the author and the work of other people cited by the paper.
For example, in the search result highlighted below, poor hand hygiene is not a risk factor of the current paper even though it is mentioned in it. Rather, it is (likely) discussed as a risk factor in the paper cited in the previous sentence.
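One crude heuristic for this problem is to treat a sentence as "possibly about someone else's work" when it, or the sentence right before it, contains a citation marker. The sketch below assumes citations look like "[12]" or "(Smith et al., 2020)" and uses a naive sentence splitter. Both are simplifications (spaCy's sentence segmentation would be a better fit for real text), and the sample text is invented.

```python
import re

CITATION_PATTERN = re.compile(
    r"\[\d+(?:\s*[,\u2013-]\s*\d+)*\]"            # [12], [3, 4], [5-7]
    r"|\([A-Z][A-Za-z-]+(?: et al\.)?,? \d{4}\)"  # (Smith et al., 2020)
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def flag_cited_context(text):
    """Pair each sentence with True if it or the previous sentence has a citation."""
    flags = []
    previous_has_citation = False
    for sentence in split_sentences(text):
        has_citation = bool(CITATION_PATTERN.search(sentence))
        flags.append((sentence, has_citation or previous_has_citation))
        previous_has_citation = has_citation
    return flags


text = (
    "Transmission in hospitals has been reported before [12]. "
    "Poor hand hygiene was a likely contributor. "
    "We enrolled 200 patients from two hospitals."
)
for sentence, cited in flag_cited_context(text):
    print("cited context" if cited else "own work?   ", "|", sentence)
```

Here the hand hygiene sentence is flagged because it follows a sentence with a citation, while the sentence about enrolled patients is not.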