|
|
||
|---|---|---|
| code | ||
| configs | ||
| datasets | ||
| supplementary | ||
| README.md | ||
Towards Reliable Named Entity Recognition in the Biomedical Domain
This repository contains the corpora and supplementary data, along with instructions for recreating the experiments, for our paper: "Towards reliable named entity recognition in the biomedical domain".
Table of Contents
Model
The model used in this study is Saber, a tool we are building for text-mining and information extraction of biomedical text. The named entity recognizer (NER) implemented in Saber is based on a bi-directional long short term memory network-conditional random field (BiLSTM-CRF) [1]. The tool can be accessed here.
Documentation for the tool can be found here.
Data
Word Embeddings
The word embeddings used in this study were obtained from here [2].
Datasets
Datasets used in this study are listed below, along with links where they can be publicly accessed. We obtained most datasets in a pre-processed state from here [3]. The final, preprocessed datasets that we used in this study are available under datasets.
| Corpora | Text Genre | Standard | Entities | Publication |
|---|---|---|---|---|
| BioCreative II GM (BC2GM) | Scientific Article | Gold | genes/proteins | link |
| *BioCreative V Chemical-Disease Relation (CDR) Task Corpus (BC5CDR) | Scientific Article | Gold | chemicals, diseases | link |
| CALBC-III-Small | Scientific Article | Silver | chemicals, diseases, species, genes/proteins | link |
| CRAFT | Scientific Article | Gold | chemicals, species, genes/proteins, sequence ontology, gene ontology, cell lines | link |
| BC5CDR | Scientific Article | Gold | chemicals, diseases | link |
| Linneaus | Scientific Article | Gold | species | link |
| NCBI-Disease | Scientific Article | Gold | diseases | link |
| S800 | Scientific Article | Gold | species | link |
| Variome | Scientific Article | Gold | diseases, species, genes/proteins | link |
*Requires that you create an account at https://biocreative.bioinformatics.udel.edu/ to access.
Supplementary Information
Blacklists used during transfer learning can be found in supplementary. There are two sets:
entity_blacklists: Contain a list of single-token entities that were annotated in the silver-standard corpus (SSC) and present in at least one of the gold-standard corpora but never annotated.pmid_blacklists: PMIDs corresponding to documents in the gold-standard corpora (GSCs).
Recreating the Experiments
To recreate the experiments, you must first install our package and collect the relevant data. Start by cloning and moving into this repo
$ git clone https://github.com/BaderLab/Towards-reliable-BioNER.git
$ cd Towards-reliable-BioNER
Installation
First, you will need to install python 3.6. If not already installed, python3 can be installed via
- The official installer
- Homebrew, on MacOS (
brew install python3) - Miniconda3 / Anaconda3
Run
python --versionat the command line to make sure installation was successful. You may need to typepython3(not justpython) depending on your install method.
It is also highly recommended that you use a virtual environment. See (Optional) Creating and Activating a Virtual Environment.
Finally, download and install the fork of Saber that we used in this paper.
(saber) $ pip install -e git+https://github.com/JohnGiorgi/saber.git@master#egg=saber
(Optional) Creating and Activating a Virtual Environment
To create a virtual environment named saber
Using Conda
$ conda create -n saber -y python=3.6
To activate the environment
$ conda activate saber
# Notice your command prompt has changed to indicate that the environment is active
(saber) $
Using virtualenv or venv
Using virtualenv
$ virtualenv --python=python3 /path/to/new/venv/saber
Using venv
$ python3 -m venv /path/to/new/venv/saber
To activate the environment
$ source /path/to/new/venv/saber/bin/activate
# Notice your command prompt has changed to indicate that the environment is active
(saber) $
Collecting Data
Datasets
Preprocessed datasets are provided under the datasets directory for convenience. They just need to be unzipped
$ (saber) tar -xvjf datasets/datasets.tar.bz2 datasets/
Word Emebddings
Finally, you will need to collect the word embeddings
$ (saber) mkdir word_embeddings
$ (saber) wget -O word_embeddings/wikipedia-pubmed-and-PMC-w2v.bin http://evexdb.org/pmresources/vec-space-models/wikipedia-pubmed-and-PMC-w2v.bin
Note that this file is 4GB and can take a while to download.
Running the Experiments
The following instructions assume that you have git cloned and moved into this repository locally, that Saber is installed, and that the datasets are available under Towards-reliable-BioNER/datasets and that word embeddings are available under Towards-reliable-BioNER/word_embeddings.
The results of the experiments will be saved under
Towards-reliable-BioNER/outputby default. If you would like the results to be saved elsewhere, provide a different path with the--output_folderargument or change theoutput_folderargument value in one of the config files.
Baseline Experiments
To run the baseline experiments (Results section 3.1, Supplementary data Table 3)
$ (saber) python -m saber.cli.train --config_filepath configs/baseline.ini
Generalization Experiments
To run the generalization experiments (Results section 3.2, Supplementary data Table 4)
$ (saber) python -m saber.cli.train --config_filepath configs/generalization.ini
Variational Dropout Experiments
To run the variational dropout experiments (Results section 3.3, Supplementary data Table 5 and 6)
$ (saber) python -m saber.cli.train --config_filepath configs/variational.ini
Transfer Learning Experiments
To run the transfer learning experiments (Results section 3.4, Supplementary data Table 7 and 8)
TODO.
Multi-task Learning Experiments
To run the multi-task learning experiments (Results section 3.5, Supplementary data Table 9 and 10)
$ (saber) python -m saber.cli.train --config_filepath configs/multi_task_learning.ini
Choosing the Dataset
For each experiment, we provide configurations files under configs. The only thing you should need to modify is the dataset_folder argument. To train on a certain dataset, either provide its path with the dataset_folder argument
$ (saber) python -m saber.cli.train --config_filepath configs/baseline.ini --dataset_folder ./datasets/BC2GM_BIO
or change the value for dataset_folder in configs/baseline.ini.
For "in-corpus" experiments (see our paper), use the standalone dataset (e.g. "NCBI_Disease_BIO", "BC5CDR_DISO_BIO", etc.). For the "out-of-corpus" experiments, use one of the datasets named train_on_<x>_test_on_<y>. In these datasets, the train.* and the test.* data come from different datasets, allowing you to evaluate how well a model trained on one dataset performs on data from another dataset.
For multi-task experiments, simply provide multiple arguments to dataset_folder, either at the command line, separated by a space, e.g.
$ (saber) python -m saber.cli.train --config_filepath configs/multi_task_learning.ini --dataset_folder ./datasets/NCBI_Disease_BIO ./datasets/BC5CDR_DISO_BIO
or in the config.ini file, separated by a comma,
dataset_folder = ./datasets/NCBI_Disease_BIO, ./datasets/BC5CDR_DISO_BIO
Combining Modifications
To run the combined modifications experiments (Results section 3.6, Figure 1), just combine the above instructions as appropriate. E.g., to perform variational dropout and multi-task learning, set dropout_rate to 0.3, 0.3, 0.1, variational_dropout to True and provide two datasets to dataset_folder.
Issues
Please open an issue if you have any questions.
Citations
- Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
- Moen, S. P. F. G. H., & Ananiadou, T. S. S. (2013). Distributional semantics resources for biomedical text processing. In Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan (pp. 39-43).
- Crichton, G., Pyysalo, S., Chiu, B., & Korhonen, A. (2017). A neural network multi-task learning approach to biomedical named entity recognition. BMC bioinformatics, 18(1), 368.