This is a step-by-step tutorial on how to get started with LiBai:
We have prepared relevant datasets, which can be downloaded from the following links:
Download the dataset and move the data files into a folder. The file structure should look like this:
```
$ tree data
path/to/bert_data
├── bert-base-chinese-vocab.txt
├── loss_compara_content_sentence.bin
└── loss_compara_content_sentence.idx
```

How to Train Bert_large Model with Parallelism
We provide train.sh for launching training. Before invoking the script, perform the following steps.
Step 1. Set data path and vocab path
Update the data path and vocab path in the bert_large_pretrain config file:
```python
# Refine data path and vocab path to data folder
vocab_file = "/path/to/bert_data/bert-base-chinese-vocab.txt"
data_prefix = "/path/to/bert_data/loss_compara_content_sentence"
```
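Before launching, you can sanity-check that the vocab file and the indexed-dataset pair from the tree above are in place; note that data_prefix carries no extension. The following is a minimal sketch (a hypothetical check script, not part of LiBai):

```python
# Hypothetical sanity check (not part of LiBai): confirm the files from the
# tree above sit where the config expects them.
import os

vocab_file = "/path/to/bert_data/bert-base-chinese-vocab.txt"
data_prefix = "/path/to/bert_data/loss_compara_content_sentence"

assert os.path.isfile(vocab_file), f"missing vocab file: {vocab_file}"
for ext in (".bin", ".idx"):
    assert os.path.isfile(data_prefix + ext), f"missing {data_prefix + ext}"
print("BERT data layout looks good.")
```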
Step 2. Configure your parameters
The provided configs/bert_large_pretrain.py defines a set of parameters, including the training scheme, the model, and so on.
You can also modify the parameter settings. For example, if you want to use 8 GPUs for training, you can refer to the file configs/common/train.py. If you want to train the model with 2D mesh hybrid parallelism (4 groups for data parallelism and 2 groups for tensor parallelism), set the parameters as follows:
```python
train.dist.data_parallel_size = 4
train.dist.tensor_parallel_size = 2
```
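These two sizes describe a 4 × 2 device mesh, so their product should match the number of GPUs you launch with (8 here). A minimal sketch of that arithmetic, assuming the usual data-parallel × tensor-parallel mesh convention:

```python
# Assumption: the mesh spans data_parallel_size * tensor_parallel_size
# processes, one per GPU, so 4 * 2 must equal the 8 GPUs being launched.
num_gpus = 8
data_parallel_size = 4
tensor_parallel_size = 2

assert data_parallel_size * tensor_parallel_size == num_gpus, \
    "parallel sizes must multiply to the number of GPUs"
```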
Step 3. Invoke parallel training
To train the BertForPreTraining model on a single node with 8 GPUs, run:
```shell
bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 8
```
To train the BertForPreTraining model on 2 nodes with 16 GPUs, run on node0:
```shell
NODE=2 NODE_RANK=0 ADDR=192.168.0.0 PORT=12345 bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 8
```
- NODE=2: the total number of nodes
- NODE_RANK=0: the rank of the current node (node0)
- ADDR=192.168.0.0: the IP address of node0
- PORT=12345: the port of node0
On node1, run:
```shell
NODE=2 NODE_RANK=1 ADDR=192.168.0.0 PORT=12345 bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 8
```
- NODE=2: the total number of nodes
- NODE_RANK=1: the rank of the current node (node1)
- ADDR=192.168.0.0: the IP address of node0
- PORT=12345: the port of node0
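Putting the two launches together gives NODE × 8 = 16 processes in total. The sketch below illustrates the usual launcher arithmetic (a generic convention for distributed launchers, not a LiBai-specific API): each process derives a global rank from its node rank and local GPU index.

```python
# Generic distributed-launch arithmetic (illustration only, not LiBai's API):
NODE, GPUS_PER_NODE = 2, 8
world_size = NODE * GPUS_PER_NODE  # 16 processes across both nodes

for node_rank in range(NODE):                 # the NODE_RANK given to each node
    for local_rank in range(GPUS_PER_NODE):   # one process per GPU
        global_rank = node_rank * GPUS_PER_NODE + local_rank
        print(f"node {node_rank}, gpu {local_rank} -> rank {global_rank}")
```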
For ImageNet, we use the standard ImageNet dataset, which can be downloaded from http://image-net.org/.
For the standard folder dataset, move the validation images into labeled sub-folders. The file structure should look like this:
```
$ tree data
imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...
```

Train vit Model from Scratch
Update the data path in the vit_imagenet config file:
```python
# Refine data path to imagenet data folder
dataloader.train.dataset[0].root = "/path/to/imagenet"
dataloader.test[0].dataset.root = "/path/to/imagenet"
```
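Before training, it can help to confirm that the folder actually follows the train/val layout shown above. A minimal sketch (a hypothetical helper, not part of LiBai):

```python
# Hypothetical check (not part of LiBai): verify the ImageNet folder layout.
from pathlib import Path

root = Path("/path/to/imagenet")
for split in ("train", "val"):
    class_dirs = [d for d in (root / split).iterdir() if d.is_dir()]
    assert class_dirs, f"{root / split} contains no class sub-folders"
    print(f"{split}: {len(class_dirs)} class folders")
```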
To train the vit_tiny_patch16_224 model on ImageNet on a single node with 8 GPUs for 300 epochs, run:
```shell
bash tools/train.sh tools/train_net.py configs/vit_imagenet.py 8
```
The default vit model in LiBai is vit_tiny_patch16_224. To train another vit model, update the vit_imagenet config file by importing it in place of the default, as follows:
```python
# from .common.models.vit.vit_tiny_patch16_224 import model
from .common.models.vit.vit_base_patch16_224 import model
```
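Since model here is a lazy config object, its constructor arguments can typically be overridden right after the import. A hypothetical tweak, assuming LiBai's LazyCall-style configs expose constructor arguments as attributes (num_classes is an illustrative field; check the variant's definition):

```python
# Hypothetical override, assuming LazyCall-style config attributes:
from .common.models.vit.vit_base_patch16_224 import model

model.num_classes = 100  # e.g., adapt the classification head to 100 classes
```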