mirror of https://github.com/BaderLab/Towards-reliable-BioNER.git synced 2026-05-23 07:38:35 -06:00

No description

Find a file

John Giorgi 840a707d28 Merge pull request #3 from BaderLab/clean-up-after-publication Clean up after publication		2019-07-26 13:13:03 -04:00
code	♻️ Cleanup blacklists	2019-02-25 12:09:47 -05:00
configs	🔧 Update configs	2019-07-26 12:59:27 -04:00
datasets	🗃️ Update datasets	2019-07-26 13:05:49 -04:00
supplementary	♻️ Cleanup blacklists	2019-02-25 12:09:47 -05:00
README.md	🐛 Fix instructions for unzipping datasets	2019-07-26 13:12:48 -04:00

README.md

Towards Reliable Named Entity Recognition in the Biomedical Domain

This repository contains the corpora and supplementary data, along with instructions for recreating the experiments, for our paper: "Towards reliable named entity recognition in the biomedical domain".

Model
Data
Recreating the Experiments
Issues
Citations

Model

The model used in this study is Saber, a tool we are building for text-mining and information extraction of biomedical text. The named entity recognizer (NER) implemented in Saber is based on a bi-directional long short term memory network-conditional random field (BiLSTM-CRF) [1]. The tool can be accessed here.

Documentation for the tool can be found here.

Data

Word Embeddings

The word embeddings used in this study were obtained from here [2].

Datasets

Datasets used in this study are listed below, along with links where they can be publicly accessed. We obtained most datasets in a pre-processed state from here [3]. The final, preprocessed datasets that we used in this study are available under datasets.

Corpora	Text Genre	Standard	Entities	Publication
BioCreative II GM (BC2GM)	Scientific Article	Gold	genes/proteins	link
*BioCreative V Chemical-Disease Relation (CDR) Task Corpus (BC5CDR)	Scientific Article	Gold	chemicals, diseases	link
CALBC-III-Small	Scientific Article	Silver	chemicals, diseases, species, genes/proteins	link
CRAFT	Scientific Article	Gold	chemicals, species, genes/proteins, sequence ontology, gene ontology, cell lines	link
BC5CDR	Scientific Article	Gold	chemicals, diseases	link
Linneaus	Scientific Article	Gold	species	link
NCBI-Disease	Scientific Article	Gold	diseases	link
S800	Scientific Article	Gold	species	link
Variome	Scientific Article	Gold	diseases, species, genes/proteins	link

*Requires that you create an account at https://biocreative.bioinformatics.udel.edu/ to access.

Supplementary Information

Blacklists used during transfer learning can be found in supplementary. There are two sets:

entity_blacklists: Contain a list of single-token entities that were annotated in the silver-standard corpus (SSC) and present in at least one of the gold-standard corpora but never annotated.
pmid_blacklists: PMIDs corresponding to documents in the gold-standard corpora (GSCs).

Recreating the Experiments

To recreate the experiments, you must first install our package and collect the relevant data. Start by cloning and moving into this repo

$ git clone https://github.com/BaderLab/Towards-reliable-BioNER.git
$ cd Towards-reliable-BioNER

Installation

First, you will need to install python 3.6. If not already installed, python3 can be installed via

The official installer
Homebrew, on MacOS (brew install python3)
Miniconda3 / Anaconda3

Run python --version at the command line to make sure installation was successful. You may need to type python3 (not just python) depending on your install method.

It is also highly recommended that you use a virtual environment. See (Optional) Creating and Activating a Virtual Environment.

Finally, download and install the fork of Saber that we used in this paper.

(saber) $ pip install -e git+https://github.com/JohnGiorgi/saber.git@master#egg=saber

(Optional) Creating and Activating a Virtual Environment

To create a virtual environment named saber

Using Conda

Using Conda / Miniconda

$ conda create -n saber -y python=3.6

To activate the environment

$ conda activate saber
# Notice your command prompt has changed to indicate that the environment is active
(saber) $

Using virtualenv or venv

Using virtualenv

$ virtualenv --python=python3 /path/to/new/venv/saber

Using venv

$ python3 -m venv /path/to/new/venv/saber

To activate the environment

$ source /path/to/new/venv/saber/bin/activate
# Notice your command prompt has changed to indicate that the environment is active
(saber) $

Collecting Data

Datasets

Preprocessed datasets are provided under the datasets directory for convenience. They just need to be unzipped

$ (saber) tar -xvjf datasets/datasets.tar.bz2 datasets/

Word Emebddings

Finally, you will need to collect the word embeddings

$ (saber) mkdir word_embeddings
$ (saber) wget -O word_embeddings/wikipedia-pubmed-and-PMC-w2v.bin http://evexdb.org/pmresources/vec-space-models/wikipedia-pubmed-and-PMC-w2v.bin

Note that this file is 4GB and can take a while to download.

Running the Experiments

The following instructions assume that you have git cloned and moved into this repository locally, that Saber is installed, and that the datasets are available under Towards-reliable-BioNER/datasets and that word embeddings are available under Towards-reliable-BioNER/word_embeddings.

The results of the experiments will be saved under Towards-reliable-BioNER/output by default. If you would like the results to be saved elsewhere, provide a different path with the --output_folder argument or change the output_folder argument value in one of the config files.

Baseline Experiments

To run the baseline experiments (Results section 3.1, Supplementary data Table 3)

$ (saber) python -m saber.cli.train --config_filepath configs/baseline.ini

Generalization Experiments

To run the generalization experiments (Results section 3.2, Supplementary data Table 4)

$ (saber) python -m saber.cli.train --config_filepath configs/generalization.ini

Variational Dropout Experiments

To run the variational dropout experiments (Results section 3.3, Supplementary data Table 5 and 6)

$ (saber) python -m saber.cli.train --config_filepath configs/variational.ini

Transfer Learning Experiments

To run the transfer learning experiments (Results section 3.4, Supplementary data Table 7 and 8)

TODO.

Multi-task Learning Experiments

To run the multi-task learning experiments (Results section 3.5, Supplementary data Table 9 and 10)

$ (saber) python -m saber.cli.train --config_filepath configs/multi_task_learning.ini

Choosing the Dataset

For each experiment, we provide configurations files under configs. The only thing you should need to modify is the dataset_folder argument. To train on a certain dataset, either provide its path with the dataset_folder argument

$ (saber) python -m saber.cli.train --config_filepath configs/baseline.ini --dataset_folder ./datasets/BC2GM_BIO

or change the value for dataset_folder in configs/baseline.ini.

For "in-corpus" experiments (see our paper), use the standalone dataset (e.g. "NCBI_Disease_BIO", "BC5CDR_DISO_BIO", etc.). For the "out-of-corpus" experiments, use one of the datasets named train_on_<x>_test_on_<y>. In these datasets, the train.* and the test.* data come from different datasets, allowing you to evaluate how well a model trained on one dataset performs on data from another dataset.

For multi-task experiments, simply provide multiple arguments to dataset_folder, either at the command line, separated by a space, e.g.

$ (saber) python -m saber.cli.train --config_filepath configs/multi_task_learning.ini --dataset_folder ./datasets/NCBI_Disease_BIO ./datasets/BC5CDR_DISO_BIO

or in the config.ini file, separated by a comma,

dataset_folder = ./datasets/NCBI_Disease_BIO, ./datasets/BC5CDR_DISO_BIO

Combining Modifications

To run the combined modifications experiments (Results section 3.6, Figure 1), just combine the above instructions as appropriate. E.g., to perform variational dropout and multi-task learning, set dropout_rate to 0.3, 0.3, 0.1, variational_dropout to True and provide two datasets to dataset_folder.

Issues

Please open an issue if you have any questions.

Citations

Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
Moen, S. P. F. G. H., & Ananiadou, T. S. S. (2013). Distributional semantics resources for biomedical text processing. In Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan (pp. 39-43).
Crichton, G., Pyysalo, S., Chiu, B., & Korhonen, A. (2017). A neural network multi-task learning approach to biomedical named entity recognition. BMC bioinformatics, 18(1), 368.

README.md

Towards Reliable Named Entity Recognition in the Biomedical Domain

Table of Contents

Model

Data

Word Embeddings

Datasets

Supplementary Information

Recreating the Experiments

Installation

(Optional) Creating and Activating a Virtual Environment

Using Conda

Using virtualenv or venv

Collecting Data

Datasets

Word Emebddings

Running the Experiments

Baseline Experiments

Generalization Experiments

Variational Dropout Experiments

Transfer Learning Experiments

Multi-task Learning Experiments

Choosing the Dataset

Combining Modifications

Issues

Citations