pyf98 committed on
Commit 1c2921a (1 parent: 9813a67)

Upload training logs for owsm_v3

exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/train.1.log ADDED
The diff for this file is too large to render. See raw diff
 
exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/train.10.log ADDED
The diff for this file is too large to render. See raw diff
 
exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/train.12.log ADDED
The diff for this file is too large to render. See raw diff
 
exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/train.2.log ADDED
The diff for this file is too large to render. See raw diff
 
exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/train.4.log ADDED
The diff for this file is too large to render. See raw diff
 
exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/train.5.log ADDED
The diff for this file is too large to render. See raw diff
 
exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/train.6.log ADDED
The diff for this file is too large to render. See raw diff
 
exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/train.8.log ADDED
The diff for this file is too large to render. See raw diff
 
exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/train.9.log ADDED
The diff for this file is too large to render. See raw diff
 
exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/train.log ADDED
@@ -0,0 +1,1294 @@
+ # Running on gpub074.delta.ncsa.illinois.edu
+ # Started at Sun Jul 16 00:42:43 CDT 2023
+ # SLURMD_NODENAME=gpub074
+ # SLURM_CLUSTER_NAME=delta
+ # SLURM_CONF=/var/spool/slurmd/conf-cache/slurm.conf
+ # SLURM_CPUS_ON_NODE=64
+ # SLURM_CPUS_PER_TASK=64
+ # SLURM_EXPORT_ENV=PATH
+ # SLURM_GET_USER_ENV=1
+ # SLURM_GPUS_ON_NODE=4
+ # SLURM_GTIDS=0
+ # SLURM_JOBID=2179250
+ # SLURM_JOB_ACCOUNT=bbjs-delta-gpu
+ # SLURM_JOB_CPUS_PER_NODE=64
+ # SLURM_JOB_GID=202
+ # SLURM_JOB_GPUS=0,1,2,3
+ # SLURM_JOB_ID=2179250
+ # SLURM_JOB_NAME=exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/train.log
+ # SLURM_JOB_NODELIST=gpub074
+ # SLURM_JOB_NUM_NODES=1
+ # SLURM_JOB_PARTITION=gpuA40x4
+ # SLURM_JOB_QOS=bbjs-delta-gpu
+ # SLURM_JOB_UID=68077
+ # SLURM_JOB_USER=peng6
+ # SLURM_LOCALID=0
+ # SLURM_MEM_PER_NODE=240000
+ # SLURM_NNODES=1
+ # SLURM_NODEID=0
+ # SLURM_NODELIST=gpub074
+ # SLURM_NODE_ALIASES='(null)'
+ # SLURM_OPEN_MODE=a
+ # SLURM_PRIO_PROCESS=0
+ # SLURM_PROCID=0
+ # SLURM_SUBMIT_DIR=/scratch/bbjs/peng6/espnet-whisper-public/egs2/mixed_v3/s2t1
+ # SLURM_SUBMIT_HOST=dt-login02.delta.internal.ncsa.edu
+ # SLURM_TASKS_PER_NODE=1
+ # SLURM_TASK_PID=4188774
+ # SLURM_TOPOLOGY_ADDR=ss00.ss12.gpub074
+ # SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
+ # SLURM_WORKING_CLUSTER=delta:dt-sched:6817:9728:109
+ # python3 -m espnet2.bin.s2t_train --use_preprocessor true --bpemodel data/token_list/bpe_unigram50000/bpe.model --token_type bpe --token_list data/token_list/bpe_unigram50000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type dump/raw/dev/wav.scp,speech,kaldi_ark --valid_shape_file exp/s2t_stats_raw_bpe50000/valid/speech_shape --resume true --fold_length 80000 --output_dir exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000 --config conf/train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune.yaml --frontend_conf fs=16k --normalize=global_mvn --normalize_conf stats_file=exp/s2t_stats_raw_bpe50000/train/feats_stats.npz --train_data_path_and_name_and_type exp/s2t_stats_raw_bpe50000/splits12/wav.scp,speech,kaldi_ark --train_shape_file exp/s2t_stats_raw_bpe50000/splits12/speech_shape --fold_length 150 --train_data_path_and_name_and_type exp/s2t_stats_raw_bpe50000/splits12/text.prev,text_prev,text --train_shape_file exp/s2t_stats_raw_bpe50000/splits12/text_prev_shape.bpe --fold_length 150 --train_data_path_and_name_and_type exp/s2t_stats_raw_bpe50000/splits12/text.ctc,text_ctc,text --train_shape_file exp/s2t_stats_raw_bpe50000/splits12/text_ctc_shape.bpe --fold_length 150 --train_data_path_and_name_and_type exp/s2t_stats_raw_bpe50000/splits12/text,text,text --train_shape_file exp/s2t_stats_raw_bpe50000/splits12/text_shape.bpe --multiple_iterator true --valid_data_path_and_name_and_type dump/raw/dev/text.prev,text_prev,text --valid_shape_file exp/s2t_stats_raw_bpe50000/valid/text_prev_shape.bpe --valid_data_path_and_name_and_type dump/raw/dev/text.ctc,text_ctc,text --valid_shape_file exp/s2t_stats_raw_bpe50000/valid/text_ctc_shape.bpe --valid_data_path_and_name_and_type dump/raw/dev/text,text,text --valid_shape_file exp/s2t_stats_raw_bpe50000/valid/text_shape.bpe --ngpu 4 --multiprocessing_distributed True
+ /scratch/bbjs/peng6/espnet-whisper-public/tools/miniconda/envs/espnet/bin/python3 /scratch/bbjs/peng6/espnet-whisper-public/espnet2/bin/s2t_train.py --use_preprocessor true --bpemodel data/token_list/bpe_unigram50000/bpe.model --token_type bpe --token_list data/token_list/bpe_unigram50000/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --valid_data_path_and_name_and_type dump/raw/dev/wav.scp,speech,kaldi_ark --valid_shape_file exp/s2t_stats_raw_bpe50000/valid/speech_shape --resume true --fold_length 80000 --output_dir exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000 --config conf/train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune.yaml --frontend_conf fs=16k --normalize=global_mvn --normalize_conf stats_file=exp/s2t_stats_raw_bpe50000/train/feats_stats.npz --train_data_path_and_name_and_type exp/s2t_stats_raw_bpe50000/splits12/wav.scp,speech,kaldi_ark --train_shape_file exp/s2t_stats_raw_bpe50000/splits12/speech_shape --fold_length 150 --train_data_path_and_name_and_type exp/s2t_stats_raw_bpe50000/splits12/text.prev,text_prev,text --train_shape_file exp/s2t_stats_raw_bpe50000/splits12/text_prev_shape.bpe --fold_length 150 --train_data_path_and_name_and_type exp/s2t_stats_raw_bpe50000/splits12/text.ctc,text_ctc,text --train_shape_file exp/s2t_stats_raw_bpe50000/splits12/text_ctc_shape.bpe --fold_length 150 --train_data_path_and_name_and_type exp/s2t_stats_raw_bpe50000/splits12/text,text,text --train_shape_file exp/s2t_stats_raw_bpe50000/splits12/text_shape.bpe --multiple_iterator true --valid_data_path_and_name_and_type dump/raw/dev/text.prev,text_prev,text --valid_shape_file exp/s2t_stats_raw_bpe50000/valid/text_prev_shape.bpe --valid_data_path_and_name_and_type dump/raw/dev/text.ctc,text_ctc,text --valid_shape_file exp/s2t_stats_raw_bpe50000/valid/text_ctc_shape.bpe --valid_data_path_and_name_and_type dump/raw/dev/text,text,text --valid_shape_file exp/s2t_stats_raw_bpe50000/valid/text_shape.bpe --ngpu 4 --multiprocessing_distributed True
+ [gpub074:0/4] 2023-07-16 00:44:44,966 (distributed_c10d:319) INFO: Added key: store_based_barrier_key:1 to store for rank: 0
+ [gpub074:0/4] 2023-07-16 00:44:44,967 (distributed_c10d:353) INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
+ [gpub074:0/4] 2023-07-16 00:44:45,026 (s2t:483) INFO: Vocabulary size: 50002
+ [gpub074:0/4] 2023-07-16 00:44:56,171 (abs_task:1201) INFO: pytorch.version=1.13.1, cuda.available=True, cudnn.version=8500, cudnn.benchmark=False, cudnn.deterministic=True
+ [gpub074:0/4] 2023-07-16 00:44:56,225 (abs_task:1202) INFO: Model structure:
+ ESPnetS2TModel(
+ (frontend): DefaultFrontend(
+ (stft): Stft(n_fft=512, win_length=400, hop_length=160, center=True, normalized=False, onesided=True)
+ (frontend): Frontend()
+ (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
+ )
+ (specaug): SpecAug(
+ (freq_mask): MaskAlongAxis(mask_width_range=[0, 27], num_mask=2, axis=freq)
+ (time_mask): MaskAlongAxisVariableMaxWidth(mask_width_ratio_range=[0.0, 0.05], num_mask=10, axis=time)
+ )
+ (normalize): GlobalMVN(stats_file=exp/s2t_stats_raw_bpe50000/train/feats_stats.npz, norm_means=True, norm_vars=True)
+ (encoder): TransformerEncoder(
+ (embed): Conv2dSubsampling(
+ (conv): Sequential(
+ (0): Conv2d(1, 1024, kernel_size=(3, 3), stride=(2, 2))
+ (1): ReLU()
+ (2): Conv2d(1024, 1024, kernel_size=(3, 3), stride=(2, 2))
+ (3): ReLU()
+ )
+ (out): Sequential(
+ (0): Linear(in_features=19456, out_features=1024, bias=True)
+ (1): PositionalEncoding(
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ )
+ )
+ (encoders): MultiSequential(
+ (0): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (1): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (2): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (3): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (4): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (5): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (6): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (7): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (8): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (9): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (10): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (11): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (12): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (13): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (14): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (15): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (16): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (17): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (18): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (19): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (20): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (21): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (22): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (23): EncoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ )
+ (after_norm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ )
+ (decoder): TransformerDecoder(
+ (embed): Sequential(
+ (0): Embedding(50002, 1024)
+ (1): PositionalEncoding(
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ )
+ (after_norm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (output_layer): Linear(in_features=1024, out_features=50002, bias=True)
+ (decoders): MultiSequential(
+ (0): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (1): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (2): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (3): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (4): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (5): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (6): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (7): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (8): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (9): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (10): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (11): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (12): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (13): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (14): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (15): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (16): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (17): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (18): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (19): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (feed_forward): PositionwiseFeedForward(
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ (activation): ReLU()
+ )
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (20): DecoderLayer(
+ (self_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
+ (dropout): Dropout(p=0.1, inplace=False)
+ )
+ (src_attn): MultiHeadedAttention(
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
1052
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
1053
+ (dropout): Dropout(p=0.1, inplace=False)
1054
+ )
1055
+ (feed_forward): PositionwiseFeedForward(
1056
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
1057
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
1058
+ (dropout): Dropout(p=0.1, inplace=False)
1059
+ (activation): ReLU()
1060
+ )
1061
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
1062
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
1063
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
1064
+ (dropout): Dropout(p=0.1, inplace=False)
1065
+ )
1066
+ (21): DecoderLayer(
1067
+ (self_attn): MultiHeadedAttention(
1068
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
1069
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
1070
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
1071
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
1072
+ (dropout): Dropout(p=0.1, inplace=False)
1073
+ )
1074
+ (src_attn): MultiHeadedAttention(
1075
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
1076
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
1077
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
1078
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
1079
+ (dropout): Dropout(p=0.1, inplace=False)
1080
+ )
1081
+ (feed_forward): PositionwiseFeedForward(
1082
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
1083
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
1084
+ (dropout): Dropout(p=0.1, inplace=False)
1085
+ (activation): ReLU()
1086
+ )
1087
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
1088
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
1089
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
1090
+ (dropout): Dropout(p=0.1, inplace=False)
1091
+ )
1092
+ (22): DecoderLayer(
1093
+ (self_attn): MultiHeadedAttention(
1094
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
1095
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
1096
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
1097
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
1098
+ (dropout): Dropout(p=0.1, inplace=False)
1099
+ )
1100
+ (src_attn): MultiHeadedAttention(
1101
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
1102
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
1103
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
1104
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
1105
+ (dropout): Dropout(p=0.1, inplace=False)
1106
+ )
1107
+ (feed_forward): PositionwiseFeedForward(
1108
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
1109
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
1110
+ (dropout): Dropout(p=0.1, inplace=False)
1111
+ (activation): ReLU()
1112
+ )
1113
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
1114
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
1115
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
1116
+ (dropout): Dropout(p=0.1, inplace=False)
1117
+ )
1118
+ (23): DecoderLayer(
1119
+ (self_attn): MultiHeadedAttention(
1120
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
1121
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
1122
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
1123
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
1124
+ (dropout): Dropout(p=0.1, inplace=False)
1125
+ )
1126
+ (src_attn): MultiHeadedAttention(
1127
+ (linear_q): Linear(in_features=1024, out_features=1024, bias=True)
1128
+ (linear_k): Linear(in_features=1024, out_features=1024, bias=True)
1129
+ (linear_v): Linear(in_features=1024, out_features=1024, bias=True)
1130
+ (linear_out): Linear(in_features=1024, out_features=1024, bias=True)
1131
+ (dropout): Dropout(p=0.1, inplace=False)
1132
+ )
1133
+ (feed_forward): PositionwiseFeedForward(
1134
+ (w_1): Linear(in_features=1024, out_features=4096, bias=True)
1135
+ (w_2): Linear(in_features=4096, out_features=1024, bias=True)
1136
+ (dropout): Dropout(p=0.1, inplace=False)
1137
+ (activation): ReLU()
1138
+ )
1139
+ (norm1): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
1140
+ (norm2): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
1141
+ (norm3): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
1142
+ (dropout): Dropout(p=0.1, inplace=False)
1143
+ )
1144
+ )
1145
+ )
1146
+ (criterion_att): LabelSmoothingLoss(
+ (criterion): KLDivLoss()
+ )
+ (ctc): CTC(
+ (ctc_lo): Linear(in_features=1024, out_features=50002, bias=True)
+ (ctc_loss): CTCLoss()
+ )
+ )
+
+ Model summary:
+ Class Name: ESPnetS2TModel
+ Total Number of model parameters: 888.51 M
+ Number of trainable parameters: 888.51 M (100.0%)
+ Size: 3.55 GB
+ Type: torch.float32
+ [gpub074:0/4] 2023-07-16 00:44:56,225 (abs_task:1205) INFO: Optimizer:
+ AdamW (
+ Parameter Group 0
+ amsgrad: False
+ betas: [0.9, 0.98]
+ capturable: False
+ eps: 1e-06
+ foreach: None
+ initial_lr: 0.00025
+ lr: 2.5e-08
+ maximize: False
+ weight_decay: 0.0
+ )
+ [gpub074:0/4] 2023-07-16 00:44:56,225 (abs_task:1206) INFO: Scheduler: WarmupLR(warmup_steps=10000)
+ [gpub074:0/4] 2023-07-16 00:44:56,240 (abs_task:1215) INFO: Saving the configuration in exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/config.yaml
+ [gpub074:0/4] 2023-07-16 00:44:56,958 (abs_task:1272) INFO: Loading pretrained params from /scratch/bbjs/peng6/espnet-whisper-public/egs2/mixed_v2/s2t1/exp/s2t_train_s2t_transformer_conv2d_size1024_e18_d18_lr5e-4_warmup20k_raw_bpe50000/valid.acc.ave.pth
+ [gpub074:0/4] 2023-07-16 00:45:04,695 (s2t:454) INFO: Optional Data Names: ('text_prev', 'text_ctc', 'text_spk2', 'text_spk3', 'text_spk4')
+ [gpub074:0/4] 2023-07-16 00:45:04,919 (abs_task:1570) INFO: [valid] dataset:
+ ESPnetDataset(
+ speech: {"path": "dump/raw/dev/wav.scp", "type": "kaldi_ark"}
+ text_prev: {"path": "dump/raw/dev/text.prev", "type": "text"}
+ text_ctc: {"path": "dump/raw/dev/text.ctc", "type": "text"}
+ text: {"path": "dump/raw/dev/text", "type": "text"}
+ preprocess: <espnet2.train.preprocessor.S2TPreprocessor object at 0x7f5d36853df0>)
+ [gpub074:0/4] 2023-07-16 00:45:04,919 (abs_task:1571) INFO: [valid] Batch sampler: UnsortedBatchSampler(N-batch=1012, batch_size=128, key_file=exp/s2t_stats_raw_bpe50000/valid/speech_shape,
+ [gpub074:0/4] 2023-07-16 00:45:04,927 (abs_task:1572) INFO: [valid] mini-batch sizes summary: N-batch=1012, mean=128.1, min=128, max=129
+ [gpub074:0/4] 2023-07-16 00:45:05,429 (s2t:454) INFO: Optional Data Names: ('text_prev', 'text_ctc', 'text_spk2', 'text_spk3', 'text_spk4')
+ [gpub074:0/4] 2023-07-16 00:45:05,815 (abs_task:1570) INFO: [plot_att] dataset:
+ ESPnetDataset(
+ speech: {"path": "dump/raw/dev/wav.scp", "type": "kaldi_ark"}
+ text_prev: {"path": "dump/raw/dev/text.prev", "type": "text"}
+ text_ctc: {"path": "dump/raw/dev/text.ctc", "type": "text"}
+ text: {"path": "dump/raw/dev/text", "type": "text"}
+ preprocess: <espnet2.train.preprocessor.S2TPreprocessor object at 0x7f5d36853a00>)
+ [gpub074:0/4] 2023-07-16 00:45:05,815 (abs_task:1571) INFO: [plot_att] Batch sampler: UnsortedBatchSampler(N-batch=129591, batch_size=1, key_file=exp/s2t_stats_raw_bpe50000/valid/speech_shape,
+ [gpub074:0/4] 2023-07-16 00:45:05,815 (abs_task:1572) INFO: [plot_att] mini-batch sizes summary: N-batch=3, mean=1.0, min=1, max=1
+ [gpub074:0/4] 2023-07-16 00:45:33,488 (trainer:159) INFO: The training was resumed using exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/checkpoint.pth
+ [gpub074:0/4] 2023-07-16 00:45:33,492 (trainer:218) WARNING: The training has already reached max_epoch: 56
+ gpub074:4188818:4188818 [0] NCCL INFO Bootstrap : Using eth1:172.28.23.174<0>
+ gpub074:4188818:4188818 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
+ gpub074:4188818:4188818 [0] NCCL INFO cudaDriverVersion 12010
+ NCCL version 2.14.3+cuda11.7
+ gpub074:4188819:4188819 [1] NCCL INFO cudaDriverVersion 12010
+ gpub074:4188819:4188819 [1] NCCL INFO Bootstrap : Using eth1:172.28.23.174<0>
+ gpub074:4188819:4188819 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
+ gpub074:4188819:4188945 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eth1:172.28.23.174<0>
+ gpub074:4188819:4188945 [1] NCCL INFO Using network IB
+ gpub074:4188819:4188945 [1] NCCL INFO Setting affinity for GPU 1 to ffff,00000000
+ gpub074:4188819:4188945 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
+ gpub074:4188819:4188945 [1] NCCL INFO Channel 00/0 : 1[46000] -> 2[85000] via P2P/IPC
+ gpub074:4188819:4188945 [1] NCCL INFO Channel 01/0 : 1[46000] -> 2[85000] via P2P/IPC
+ gpub074:4188819:4188945 [1] NCCL INFO Channel 02/0 : 1[46000] -> 2[85000] via P2P/IPC
+ gpub074:4188819:4188945 [1] NCCL INFO Channel 03/0 : 1[46000] -> 2[85000] via P2P/IPC
+ gpub074:4188819:4188945 [1] NCCL INFO Connected all rings
+ gpub074:4188819:4188945 [1] NCCL INFO Channel 00/0 : 1[46000] -> 0[7000] via P2P/IPC
+ gpub074:4188819:4188945 [1] NCCL INFO Channel 01/0 : 1[46000] -> 0[7000] via P2P/IPC
+ gpub074:4188819:4188945 [1] NCCL INFO Channel 02/0 : 1[46000] -> 0[7000] via P2P/IPC
+ gpub074:4188819:4188945 [1] NCCL INFO Channel 03/0 : 1[46000] -> 0[7000] via P2P/IPC
+ gpub074:4188819:4188945 [1] NCCL INFO Connected all trees
+ gpub074:4188819:4188945 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
+ gpub074:4188819:4188945 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
+ gpub074:4188819:4188945 [1] NCCL INFO comm 0xa71a630 rank 1 nranks 4 cudaDev 1 busId 46000 - Init COMPLETE
+ gpub074:4188821:4188821 [3] NCCL INFO cudaDriverVersion 12010
+ gpub074:4188821:4188821 [3] NCCL INFO Bootstrap : Using eth1:172.28.23.174<0>
+ gpub074:4188821:4188821 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
+ gpub074:4188821:4188946 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eth1:172.28.23.174<0>
+ gpub074:4188821:4188946 [3] NCCL INFO Using network IB
+ gpub074:4188821:4188946 [3] NCCL INFO Setting affinity for GPU 3 to ffff
+ gpub074:4188821:4188946 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2
+ gpub074:4188821:4188946 [3] NCCL INFO Channel 00/0 : 3[c7000] -> 0[7000] via P2P/IPC
+ gpub074:4188821:4188946 [3] NCCL INFO Channel 01/0 : 3[c7000] -> 0[7000] via P2P/IPC
+ gpub074:4188821:4188946 [3] NCCL INFO Channel 02/0 : 3[c7000] -> 0[7000] via P2P/IPC
+ gpub074:4188821:4188946 [3] NCCL INFO Channel 03/0 : 3[c7000] -> 0[7000] via P2P/IPC
+ gpub074:4188821:4188946 [3] NCCL INFO Connected all rings
+ gpub074:4188821:4188946 [3] NCCL INFO Channel 00/0 : 3[c7000] -> 2[85000] via P2P/IPC
+ gpub074:4188821:4188946 [3] NCCL INFO Channel 01/0 : 3[c7000] -> 2[85000] via P2P/IPC
+ gpub074:4188821:4188946 [3] NCCL INFO Channel 02/0 : 3[c7000] -> 2[85000] via P2P/IPC
+ gpub074:4188821:4188946 [3] NCCL INFO Channel 03/0 : 3[c7000] -> 2[85000] via P2P/IPC
+ gpub074:4188821:4188946 [3] NCCL INFO Connected all trees
+ gpub074:4188821:4188946 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
+ gpub074:4188821:4188946 [3] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
+ gpub074:4188821:4188946 [3] NCCL INFO comm 0xa2e9cb0 rank 3 nranks 4 cudaDev 3 busId c7000 - Init COMPLETE
+ gpub074:4188820:4188820 [2] NCCL INFO cudaDriverVersion 12010
+ gpub074:4188820:4188820 [2] NCCL INFO Bootstrap : Using eth1:172.28.23.174<0>
+ gpub074:4188820:4188820 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
+ gpub074:4188820:4188947 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eth1:172.28.23.174<0>
+ gpub074:4188820:4188947 [2] NCCL INFO Using network IB
+ gpub074:4188820:4188947 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000
+ gpub074:4188820:4188947 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1
+ gpub074:4188820:4188947 [2] NCCL INFO Channel 00/0 : 2[85000] -> 3[c7000] via P2P/IPC
+ gpub074:4188820:4188947 [2] NCCL INFO Channel 01/0 : 2[85000] -> 3[c7000] via P2P/IPC
+ gpub074:4188820:4188947 [2] NCCL INFO Channel 02/0 : 2[85000] -> 3[c7000] via P2P/IPC
+ gpub074:4188820:4188947 [2] NCCL INFO Channel 03/0 : 2[85000] -> 3[c7000] via P2P/IPC
+ gpub074:4188820:4188947 [2] NCCL INFO Connected all rings
+ gpub074:4188820:4188947 [2] NCCL INFO Channel 00/0 : 2[85000] -> 1[46000] via P2P/IPC
+ gpub074:4188820:4188947 [2] NCCL INFO Channel 01/0 : 2[85000] -> 1[46000] via P2P/IPC
+ gpub074:4188820:4188947 [2] NCCL INFO Channel 02/0 : 2[85000] -> 1[46000] via P2P/IPC
+ gpub074:4188820:4188947 [2] NCCL INFO Channel 03/0 : 2[85000] -> 1[46000] via P2P/IPC
+ gpub074:4188820:4188947 [2] NCCL INFO Connected all trees
+ gpub074:4188820:4188947 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
+ gpub074:4188820:4188947 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
+ gpub074:4188820:4188947 [2] NCCL INFO comm 0x4f8c6c10 rank 2 nranks 4 cudaDev 2 busId 85000 - Init COMPLETE
+ gpub074:4188819:4188953 [1] NCCL INFO [Service thread] Connection closed by localRank 1
+ gpub074:4188819:4188819 [1] NCCL INFO comm 0xa71a630 rank 1 nranks 4 cudaDev 1 busId 46000 - Abort COMPLETE
+ gpub074:4188821:4188955 [3] NCCL INFO [Service thread] Connection closed by localRank 3
+ gpub074:4188821:4188821 [3] NCCL INFO comm 0xa2e9cb0 rank 3 nranks 4 cudaDev 3 busId c7000 - Abort COMPLETE
+ gpub074:4188820:4188952 [2] NCCL INFO [Service thread] Connection closed by localRank 2
+ gpub074:4188820:4188820 [2] NCCL INFO comm 0x4f8c6c10 rank 2 nranks 4 cudaDev 2 busId 85000 - Abort COMPLETE
+ [gpub074:0/4] 2023-07-16 00:45:37,470 (trainer:458) INFO: The training was finished at 55 epochs
+ [gpub074:0/4] 2023-07-16 00:45:37,508 (average_nbest_models:69) INFO: Averaging 5best models: criterion="valid.acc": exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/valid.acc.ave_5best.pth
+ [gpub074:0/4] 2023-07-16 00:46:24,407 (average_nbest_models:69) INFO: Averaging 5best models: criterion="valid.total_count": exp/s2t_train_s2t_transformer_conv2d_size1024_e24_d24_lr2.5e-4_warmup10k_finetune_raw_bpe50000/valid.total_count.ave_5best.pth
+ gpub074:4188818:4188944 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eth1:172.28.23.174<0>
+ gpub074:4188818:4188944 [0] NCCL INFO Using network IB
+ gpub074:4188818:4188944 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000
+ gpub074:4188818:4188944 [0] NCCL INFO Channel 00/04 : 0 1 2 3
+ gpub074:4188818:4188944 [0] NCCL INFO Channel 01/04 : 0 1 2 3
+ gpub074:4188818:4188944 [0] NCCL INFO Channel 02/04 : 0 1 2 3
+ gpub074:4188818:4188944 [0] NCCL INFO Channel 03/04 : 0 1 2 3
+ gpub074:4188818:4188944 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
+ gpub074:4188818:4188944 [0] NCCL INFO Channel 00/0 : 0[7000] -> 1[46000] via P2P/IPC
+ gpub074:4188818:4188944 [0] NCCL INFO Channel 01/0 : 0[7000] -> 1[46000] via P2P/IPC
+ gpub074:4188818:4188944 [0] NCCL INFO Channel 02/0 : 0[7000] -> 1[46000] via P2P/IPC
+ gpub074:4188818:4188944 [0] NCCL INFO Channel 03/0 : 0[7000] -> 1[46000] via P2P/IPC
+ gpub074:4188818:4188944 [0] NCCL INFO Connected all rings
+ gpub074:4188818:4188944 [0] NCCL INFO Connected all trees
+ gpub074:4188818:4188944 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
+ gpub074:4188818:4188944 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
+ gpub074:4188818:4188944 [0] NCCL INFO comm 0x4f7f3ad0 rank 0 nranks 4 cudaDev 0 busId 7000 - Init COMPLETE
+ gpub074:4188818:4188954 [0] NCCL INFO [Service thread] Connection closed by localRank 0
+ gpub074:4188818:4188818 [0] NCCL INFO comm 0x4f7f3ad0 rank 0 nranks 4 cudaDev 0 busId 7000 - Abort COMPLETE
+ # Accounting: begin_time=1689486163
+ # Accounting: end_time=1689486432
+ # Accounting: time=269 threads=1
+ # Finished at Sun Jul 16 00:47:12 CDT 2023 with status 0