Taken from the GitHub repo of bert-as-service: https://github.com/hanxiao/bert-as-service
Installation
pip install bert-serving-server bert-serving-client
Getting Started
1. Download a Pre-trained BERT Model
Download a model listed below, then uncompress the zip file into some folder, say /tmp/english_L-12_H-768_A-12/
List of released pretrained BERT models:
BERT-Base, Uncased | 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Uncased | 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Cased | 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Cased | 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Multilingual Cased (New) | 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Multilingual Cased (Old) | 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Chinese | Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
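The uncompress step can also be scripted. The following is a minimal sketch, not part of bert-as-service: extract_model is a hypothetical helper, and it assumes the downloaded archive contains a bert_config.json next to the checkpoint files, which it uses to locate the model folder.

```python
import zipfile
from pathlib import Path


def extract_model(zip_path, dest_dir):
    """Unzip a downloaded BERT archive and return the folder holding the model files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    # Assumption: the archive ships a bert_config.json alongside the checkpoint.
    configs = sorted(dest.rglob("bert_config.json"))
    if not configs:
        raise FileNotFoundError(f"no bert_config.json found under {dest}")
    return configs[0].parent
```

The returned path is what you would pass as -model_dir when starting the service.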
2. Start the BERT service
After installing the server, you should be able to use the bert-serving-start CLI as follows:
bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=4
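If you prefer to launch the service from a Python script, the command above can be wrapped with subprocess. This is only a sketch: build_start_command and start_server are hypothetical helper names, and only the two flags shown above are used.

```python
import shutil
import subprocess


def build_start_command(model_dir, num_worker=4):
    """Return the argv list for the bert-serving-start call shown above."""
    return ["bert-serving-start", "-model_dir", str(model_dir), f"-num_worker={num_worker}"]


def start_server(model_dir, num_worker=4):
    """Launch the server as a child process; the caller must terminate it when done."""
    if shutil.which("bert-serving-start") is None:
        raise RuntimeError("bert-serving-start not found on PATH; is bert-serving-server installed?")
    return subprocess.Popen(build_start_command(model_dir, num_worker))
```

For example, proc = start_server("/tmp/english_L-12_H-768_A-12/") starts the service, and proc.terminate() stops it.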
3. Use Client to Get Sentence Encodes
Now you can encode sentences simply as follows:
from bert_serving.client import BertClient
bc = BertClient()
bc.encode(['First do it', 'then do it right', 'then do it better'])
It returns an ndarray (or List[List[float]] if you wish), in which each row is a fixed-length vector representing a sentence. Having thousands of sentences? Just encode! Don't even bother to batch; the server will take care of it.
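Since each row is a fixed-length vector, a common next step is to compare sentences by cosine similarity. The sketch below is an illustration rather than part of the library: cosine_similarity and most_similar are hypothetical helpers, written in plain Python so they accept either ndarray rows or the List[List[float]] form.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_vec, candidate_vecs):
    """Return (index, score) of the candidate row closest to query_vec."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

With a running server, vecs = bc.encode(sentences) followed by most_similar(vecs[0], vecs[1:]) would pick the sentence closest to the first one.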
As a feature of BERT, you can get the encoding of a pair of sentences by concatenating them with ||| (with whitespace before and after), e.g.
bc.encode(['First do it ||| then do it right'])
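When the pairs come from data rather than literals, a small helper avoids getting the separator spacing wrong. This is a sketch under the convention described above; make_pair is a hypothetical name, not a function of bert_serving.client.

```python
SEP = " ||| "  # separator with whitespace before and after, as described above


def make_pair(first, second):
    """Join two sentences into the single 'a ||| b' string used for pair encoding."""
    if "|||" in first or "|||" in second:
        raise ValueError("sentences must not already contain '|||'")
    return f"{first.strip()}{SEP}{second.strip()}"
```

A list of pairs can then be encoded in one call, e.g. bc.encode([make_pair(a, b) for a, b in pairs]).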