To get contextualized embeddings from BioBERT-v1.1 (base), run the command below.
Note that since the output is saved in HDF5 format, you need to install the `h5py` package first (`pip install h5py`).
We also provide a sample input file (`pubmed_entity_2048.txt`), which contains one biomedical concept per line.
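For illustration, the input format is plain text with one concept name per line, as in the excerpt below. Only `Lohmann Selected Leghorn` is taken from the sample output later in this section; the other lines are hypothetical placeholders, not the actual file contents:

```
Lohmann Selected Leghorn
aspirin
type 2 diabetes mellitus
```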
```bash
export MAX_LENGTH=384
export DATA_PATH=pubmed_entity_2048.txt
export OUTPUT_PATH=pubmed_entity_2048.h5
export BATCH_SIZE=64

python run_embedding.py \
    --model_name_or_path dmis-lab/biobert-base-cased-v1.1 \
    --max_seq_length ${MAX_LENGTH} \
    --data_path ${DATA_PATH} \
    --output_path ${OUTPUT_PATH} \
    --batch_size ${BATCH_SIZE} \
    --pooling mean
```
- `--pooling`: how token embeddings are aggregated into the output (see the sketch after this list)
  - `none`: the full sequence of token embeddings
  - `first`: the embedding of the first token (i.e., the embedding at `[CLS]`)
  - `mean`: the mean of the token embeddings
  - `sum`: the sum of the token embeddings
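For reference, here is a minimal sketch of what these pooling strategies compute, using the `transformers` and `torch` libraries directly. It illustrates the pooling math only and is not the actual code in `run_embedding.py`:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Model identifier matches the run_embedding.py command above.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model.eval()

inputs = tokenizer("Lohmann Selected Leghorn", return_tensors="pt",
                   truncation=True, max_length=384)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)

# Mask out padding tokens so they do not contribute to sum/mean.
mask = inputs["attention_mask"].unsqueeze(-1).float()

none_emb = hidden                       # --pooling none: all token embeddings
first_emb = hidden[:, 0]                # --pooling first: the [CLS] token
sum_emb = (hidden * mask).sum(dim=1)    # --pooling sum
mean_emb = sum_emb / mask.sum(dim=1)    # --pooling mean
```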
To load and check the saved embeddings, run `load_embedding.py`:

```bash
export DATA_PATH=pubmed_entity_2048.txt
export OUTPUT_PATH=pubmed_entity_2048.h5

python load_embedding.py \
    --inputtext_path ${DATA_PATH} \
    --indexed_path ${OUTPUT_PATH}
```
This prints output like:

```
The number of keys in h5: 2048
entity_name = Lohmann Selected Leghorn
embedding = [2.77513593e-01 2.03759596e-02 1.59252986e-01 ... 7.65920877e-02 2.49284402e-01 -1.48969248e-01]
```
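You can also inspect the HDF5 file directly with `h5py`. The sketch below assumes the layout implied by the output above, i.e., one dataset per entity name holding its embedding vector; see `load_embedding.py` for the authoritative access pattern:

```python
import h5py
import numpy as np

with h5py.File("pubmed_entity_2048.h5", "r") as f:
    print("The number of keys in h5:", len(f.keys()))
    # Assumed layout: one dataset per entity name holding its embedding.
    embedding = np.array(f["Lohmann Selected Leghorn"])
    print(embedding[:5])  # first few dimensions of the vector
```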
The embeddings of different biomedical concepts (obtained from here) are visualized below with t-SNE. Each color or shape corresponds to a unique biomedical concept (having multiple synonyms).
For help or issues using BioBERT-PyTorch, please create an issue and tag @mjeensung.