Your Name committed on
Commit
8b4b30a
1 Parent(s): d29f72c
README.md CHANGED
@@ -4,6 +4,25 @@
4
 
5
  If you like this project, please give it a Star. If you've come up with more useful academic shortcuts, feel free to open an issue or pull request.
6
 
7
 
8
  - Supports markdown tables in GPT output
9
  <div align="center">
 
4
 
5
  If you like this project, please give it a Star. If you've come up with more useful academic shortcuts, feel free to open an issue or pull request.
6
 
7
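+ Feature | Description
8
+ --- | ---
9
+ One-click polishing | One-click polishing and one-click grammar checking for papers
10
+ One-click Chinese-English translation | One-click translation between Chinese and English
11
+ One-click code explanation | Displays and explains code correctly
12
+ Custom shortcuts | Supports user-defined shortcuts
13
+ Proxy server configuration | Supports configuring a proxy server
14
+ Modular design | Supports custom higher-order experimental features
15
+ Self-analysis | [Experimental] Can read and understand its own source code
16
+ Program analysis | [Experimental] Can analyze other Python/C++ projects
17
+ Paper reading | [Experimental] Interprets LaTeX papers and writes abstracts
18
+ Batch comment generation | [Experimental] Generates function comments in batch
19
+ Formula display | Shows both the TeX source and the rendered form of formulas
20
+ Image display | Displays images in markdown
21
+ Markdown tables in GPT output | Renders markdown tables produced by GPT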
22
+
23
+
24
+
25
+
26
 
27
  - Supports markdown tables in GPT output
28
  <div align="center">
crazy_functions/test_project/latex/attention/results.tex DELETED
@@ -1,166 +0,0 @@
1
- \subsection{Machine Translation}
2
- \begin{table}[t]
3
- \begin{center}
4
- \caption{The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost. }
5
- \label{tab:wmt-results}
6
- \vspace{-2mm}
7
- %\scalebox{1.0}{
8
- \begin{tabular}{lccccc}
9
- \toprule
10
- \multirow{2}{*}{\vspace{-2mm}Model} & \multicolumn{2}{c}{BLEU} & & \multicolumn{2}{c}{Training Cost (FLOPs)} \\
11
- \cmidrule{2-3} \cmidrule{5-6}
12
- & EN-DE & EN-FR & & EN-DE & EN-FR \\
13
- \hline
14
- ByteNet \citep{NalBytenet2017} & 23.75 & & & &\\
15
- Deep-Att + PosUnk \citep{DBLP:journals/corr/ZhouCWLX16} & & 39.2 & & & $1.0\cdot10^{20}$ \\
16
- GNMT + RL \citep{wu2016google} & 24.6 & 39.92 & & $2.3\cdot10^{19}$ & $1.4\cdot10^{20}$\\
17
- ConvS2S \citep{JonasFaceNet2017} & 25.16 & 40.46 & & $9.6\cdot10^{18}$ & $1.5\cdot10^{20}$\\
18
- MoE \citep{shazeer2017outrageously} & 26.03 & 40.56 & & $2.0\cdot10^{19}$ & $1.2\cdot10^{20}$ \\
19
- \hline
20
- \rule{0pt}{2.0ex}Deep-Att + PosUnk Ensemble \citep{DBLP:journals/corr/ZhouCWLX16} & & 40.4 & & &
21
- $8.0\cdot10^{20}$ \\
22
- GNMT + RL Ensemble \citep{wu2016google} & 26.30 & 41.16 & & $1.8\cdot10^{20}$ & $1.1\cdot10^{21}$\\
23
- ConvS2S Ensemble \citep{JonasFaceNet2017} & 26.36 & \textbf{41.29} & & $7.7\cdot10^{19}$ & $1.2\cdot10^{21}$\\
24
- \specialrule{1pt}{-1pt}{0pt}
25
- \rule{0pt}{2.2ex}Transformer (base model) & 27.3 & 38.1 & & \multicolumn{2}{c}{\boldmath$3.3\cdot10^{18}$}\\
26
- Transformer (big) & \textbf{28.4} & \textbf{41.8} & & \multicolumn{2}{c}{$2.3\cdot10^{19}$} \\
27
- %\hline
28
- %\specialrule{1pt}{-1pt}{0pt}
29
- %\rule{0pt}{2.0ex}
30
- \bottomrule
31
- \end{tabular}
32
- %}
33
- \end{center}
34
- \end{table}
35
-
36
-
37
- On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table~\ref{tab:wmt-results}) outperforms the best previously reported models (including ensembles) by more than $2.0$ BLEU, establishing a new state-of-the-art BLEU score of $28.4$. The configuration of this model is listed in the bottom line of Table~\ref{tab:variations}. Training took $3.5$ days on $8$ P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
38
-
39
- On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of $41.8$, outperforming all of the previously published single models, at less than $1/4$ the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate $P_{drop}=0.1$, instead of $0.3$.
40
-
41
- For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of $4$ and length penalty $\alpha=0.6$ \citep{wu2016google}. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + $50$, but terminate early when possible \citep{wu2016google}.
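Review note: the checkpoint averaging described in the paragraph above can be sketched as below. This is only an illustration under stated assumptions: each checkpoint is taken to be a plain dict mapping parameter names to flat lists of floats, and `average_checkpoints` is a hypothetical helper name, not the authors' code.

```python
from typing import Dict, List

# Hypothetical checkpoint layout: parameter name -> flat list of values.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint], last_n: int) -> Checkpoint:
    """Average the parameters of the last `last_n` checkpoints element-wise."""
    if last_n <= 0 or last_n > len(checkpoints):
        raise ValueError("last_n must be between 1 and the number of checkpoints")
    recent = checkpoints[-last_n:]
    averaged: Checkpoint = {}
    for name in recent[0]:
        # zip(*...) walks the same position across all selected checkpoints.
        columns = zip(*(ckpt[name] for ckpt in recent))
        averaged[name] = [sum(values) / last_n for values in columns]
    return averaged
```

With the settings quoted above, `last_n` would be 5 for the base models and 20 for the big models.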
42
-
43
- Table \ref{tab:wmt-results} summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU \footnote{We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.}.
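Review note: the training-cost estimate described above (training time × number of GPUs × sustained throughput per GPU) can be reproduced with a few lines. The throughput values come from the footnote; the 12-hour and 3.5-day training times on 8 P100 GPUs are the ones stated in the training section of the same test project. The function name is an assumption made for this sketch.

```python
SECONDS_PER_HOUR = 3600.0

# Sustained single-precision throughput per GPU in TFLOPS, from the footnote.
SUSTAINED_TFLOPS = {"K80": 2.8, "K40": 3.7, "M40": 6.0, "P100": 9.5}


def estimate_training_flops(hours: float, num_gpus: int, gpu: str) -> float:
    """Training time x number of GPUs x sustained FLOPS per GPU."""
    return hours * SECONDS_PER_HOUR * num_gpus * SUSTAINED_TFLOPS[gpu] * 1e12


base = estimate_training_flops(12, 8, "P100")        # base model: 12 hours
big = estimate_training_flops(3.5 * 24, 8, "P100")   # big model: 3.5 days
print(f"base: {base:.1e}, big: {big:.1e}")  # base: 3.3e+18, big: 2.3e+19
```

Both values agree with the Transformer rows of the training-cost columns in the results table.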
44
- %where we compare against the leading machine translation results in the literature. Even our smaller model, with number of parameters comparable to ConvS2S, outperforms all existing single models, and achieves results close to the best ensemble model.
45
-
46
- \subsection{Model Variations}
47
-
48
- \begin{table}[t]
49
- \caption{Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.}
50
- \label{tab:variations}
51
- \begin{center}
52
- \vspace{-2mm}
53
- %\scalebox{1.0}{
54
- \begin{tabular}{c|ccccccccc|ccc}
55
- \hline\rule{0pt}{2.0ex}
56
- & \multirow{2}{*}{$N$} & \multirow{2}{*}{$\dmodel$} &
57
- \multirow{2}{*}{$\dff$} & \multirow{2}{*}{$h$} &
58
- \multirow{2}{*}{$d_k$} & \multirow{2}{*}{$d_v$} &
59
- \multirow{2}{*}{$P_{drop}$} & \multirow{2}{*}{$\epsilon_{ls}$} &
60
- train & PPL & BLEU & params \\
61
- & & & & & & & & & steps & (dev) & (dev) & $\times10^6$ \\
62
- % & & & & & & & & & & & & \\
63
- \hline\rule{0pt}{2.0ex}
64
- base & 6 & 512 & 2048 & 8 & 64 & 64 & 0.1 & 0.1 & 100K & 4.92 & 25.8 & 65 \\
65
- \hline\rule{0pt}{2.0ex}
66
- \multirow{4}{*}{(A)}
67
- & & & & 1 & 512 & 512 & & & & 5.29 & 24.9 & \\
68
- & & & & 4 & 128 & 128 & & & & 5.00 & 25.5 & \\
69
- & & & & 16 & 32 & 32 & & & & 4.91 & 25.8 & \\
70
- & & & & 32 & 16 & 16 & & & & 5.01 & 25.4 & \\
71
- \hline\rule{0pt}{2.0ex}
72
- \multirow{2}{*}{(B)}
73
- & & & & & 16 & & & & & 5.16 & 25.1 & 58 \\
74
- & & & & & 32 & & & & & 5.01 & 25.4 & 60 \\
75
- \hline\rule{0pt}{2.0ex}
76
- \multirow{7}{*}{(C)}
77
- & 2 & & & & & & & & & 6.11 & 23.7 & 36 \\
78
- & 4 & & & & & & & & & 5.19 & 25.3 & 50 \\
79
- & 8 & & & & & & & & & 4.88 & 25.5 & 80 \\
80
- & & 256 & & & 32 & 32 & & & & 5.75 & 24.5 & 28 \\
81
- & & 1024 & & & 128 & 128 & & & & 4.66 & 26.0 & 168 \\
82
- & & & 1024 & & & & & & & 5.12 & 25.4 & 53 \\
83
- & & & 4096 & & & & & & & 4.75 & 26.2 & 90 \\
84
- \hline\rule{0pt}{2.0ex}
85
- \multirow{4}{*}{(D)}
86
- & & & & & & & 0.0 & & & 5.77 & 24.6 & \\
87
- & & & & & & & 0.2 & & & 4.95 & 25.5 & \\
88
- & & & & & & & & 0.0 & & 4.67 & 25.3 & \\
89
- & & & & & & & & 0.2 & & 5.47 & 25.7 & \\
90
- \hline\rule{0pt}{2.0ex}
91
- (E) & & \multicolumn{7}{c}{positional embedding instead of sinusoids} & & 4.92 & 25.7 & \\
92
- \hline\rule{0pt}{2.0ex}
93
- big & 6 & 1024 & 4096 & 16 & & & 0.3 & & 300K & \textbf{4.33} & \textbf{26.4} & 213 \\
94
- \hline
95
- \end{tabular}
96
- %}
97
- \end{center}
98
- \end{table}
99
-
100
-
101
- %Table \ref{tab:ende-results}. Our base model for this task uses 6 attention layers, 512 hidden dim, 2048 filter dim, 8 attention heads with both attention and symbol dropout of 0.2 and 0.1 respectively. Increasing the filter size of our feed forward component to 8192 increases the BLEU score on En $\to$ De by $?$. For both the models, we use beam search decoding of size $?$ and length penalty with an alpha of $?$ \cite? \todo{Update results}
102
-
103
- To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table~\ref{tab:variations}.
104
-
105
- In Table~\ref{tab:variations} rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section \ref{sec:multihead}. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
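Review note: rows (A) keep the amount of computation constant by shrinking the per-head dimension as the number of heads grows, i.e. $d_k = d_v = d_{model} / h$. A small sketch, with a hypothetical helper name, that reproduces the $d_k$ column for $d_{model} = 512$:

```python
def per_head_dim(d_model: int, num_heads: int) -> int:
    """Per-head key/value size that keeps total attention width equal to d_model."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads


for h in (1, 4, 8, 16, 32):
    print(h, per_head_dim(512, h))
```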
106
-
107
- In Table~\ref{tab:variations} rows (B), we observe that reducing the attention key size $d_k$ hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings \citep{JonasFaceNet2017}, and observe nearly identical results to the base model.
108
-
109
- %To evaluate the importance of different components of the Transformer, we use our base model to ablate on a single hyperparameter at a time and measure the change in performance on English$\to$German translation. Our variations in Table~\ref{tab:variations} show that the number of attention layers and attention heads is the most important architecture hyperparameter. However, we do not see performance gains beyond 6 layers, suggesting that we either don't have enough data to train a large model or we need to turn up regularization. We leave this exploration for future work. Among our regularizers, attention dropout has the most significant impact on performance.
110
-
111
-
112
- %Increasing the width of our feed forward component helps both on log ppl and Accuracy \marginpar{Intuition?}
113
- %Using dropout to regularize our models helps to prevent overfitting
114
-
115
- \subsection{English Constituency Parsing}
116
-
117
- \begin{table}[t]
118
- \begin{center}
119
- \caption{The Transformer generalizes well to English constituency parsing (Results are on Section 23 of WSJ)}
120
- \label{tab:parsing-results}
121
- \vspace{-2mm}
122
- %\scalebox{1.0}{
123
- \begin{tabular}{c|c|c}
124
- \hline
125
- {\bf Parser} & {\bf Training} & {\bf WSJ 23 F1} \\ \hline
126
- Vinyals \& Kaiser et al. (2014) \cite{KVparse15}
127
- & WSJ only, discriminative & 88.3 \\
128
- Petrov et al. (2006) \cite{petrov-EtAl:2006:ACL}
129
- & WSJ only, discriminative & 90.4 \\
130
- Zhu et al. (2013) \cite{zhu-EtAl:2013:ACL}
131
- & WSJ only, discriminative & 90.4 \\
132
- Dyer et al. (2016) \cite{dyer-rnng:16}
133
- & WSJ only, discriminative & 91.7 \\
134
- \specialrule{1pt}{-1pt}{0pt}
135
- Transformer (4 layers) & WSJ only, discriminative & 91.3 \\
136
- \specialrule{1pt}{-1pt}{0pt}
137
- Zhu et al. (2013) \cite{zhu-EtAl:2013:ACL}
138
- & semi-supervised & 91.3 \\
139
- Huang \& Harper (2009) \cite{huang-harper:2009:EMNLP}
140
- & semi-supervised & 91.3 \\
141
- McClosky et al. (2006) \cite{mcclosky-etAl:2006:NAACL}
142
- & semi-supervised & 92.1 \\
143
- Vinyals \& Kaiser et al. (2014) \cite{KVparse15}
144
- & semi-supervised & 92.1 \\
145
- \specialrule{1pt}{-1pt}{0pt}
146
- Transformer (4 layers) & semi-supervised & 92.7 \\
147
- \specialrule{1pt}{-1pt}{0pt}
148
- Luong et al. (2015) \cite{multiseq2seq}
149
- & multi-task & 93.0 \\
150
- Dyer et al. (2016) \cite{dyer-rnng:16}
151
- & generative & 93.3 \\
152
- \hline
153
- \end{tabular}
154
- \end{center}
155
- \end{table}
156
-
157
- To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input.
158
- Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes \cite{KVparse15}.
159
-
160
- We trained a 4-layer transformer with $d_{model} = 1024$ on the Wall Street Journal (WSJ) portion of the Penn Treebank \citep{marcus1993building}, about 40K training sentences. We also trained it in a semi-supervised setting, using the larger high-confidence and BerkeleyParser corpora with approximately 17M sentences \citep{KVparse15}. We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.
161
-
162
- We performed only a small number of experiments to select the dropout, both attention and residual (section~\ref{sec:reg}), learning rates and beam size on the Section 22 development set; all other parameters remained unchanged from the English-to-German base translation model. During inference, we increased the maximum output length to input length + $300$. We used a beam size of $21$ and $\alpha=0.3$ for both WSJ only and the semi-supervised setting.
163
-
164
- Our results in Table~\ref{tab:parsing-results} show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar \cite{dyer-rnng:16}.
165
-
166
- In contrast to RNN sequence-to-sequence models \citep{KVparse15}, the Transformer outperforms the BerkeleyParser \cite{petrov-EtAl:2006:ACL} even when training only on the WSJ training set of 40K sentences.
crazy_functions/test_project/latex/attention/sqrt_d_trick.tex DELETED
@@ -1,28 +0,0 @@
1
- \section*{Justification of the Scaling Factor in Dot-product Attention}
2
-
3
- In Section~\ref{sec:scaled-dot-prod}, we introduced Scaled dot-product attention, where we scale down the dot products by $\sqrt{d_k}$. In this section, we will give a rough justification of this scaling factor. If we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean $0$ and variance $1$, then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_ik_i$, has mean $0$ and variance $d_k$. Since we would prefer these values to have variance $1$, we divide by $\sqrt{d_k}$.
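Review note: the variance claim above is easy to check empirically. The sketch below draws the components of $q$ and $k$ from a standard normal distribution (one convenient choice satisfying the mean-0, variance-1 assumption; the argument itself does not require normality) and estimates the variance of $q \cdot k$. The function name, sample count and seed are arbitrary choices for this illustration.

```python
import random
import statistics


def dot_product_variance(d_k: int, num_samples: int, seed: int = 0) -> float:
    """Empirical variance of q . k for independent zero-mean, unit-variance components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(num_samples):
        dot = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))
        dots.append(dot)
    return statistics.pvariance(dots)


d_k = 64
raw_variance = dot_product_variance(d_k, num_samples=5000)
# Dividing every dot product by sqrt(d_k) divides its variance by d_k.
scaled_variance = raw_variance / d_k
print(f"raw variance ~ {raw_variance:.1f}, after scaling ~ {scaled_variance:.2f}")
```

The raw estimate should land close to $d_k$ and the scaled one close to $1$, which is the motivation for the $\sqrt{d_k}$ factor.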
4
-
5
-
6
-
7
- %For any two $d_k$-dimensional vectors $\vec{u}$ and $\vec{v}$, whose dimensions are independent, the mean and variance of the dot product will be the summation of the product of means and variances over the dimensions, that is, $E[<\vec{u},\vec{v}>] = \sum_{i=1}^{d_k} E[u_i]E[v_i]$, and $E[(<\vec{u},\vec{v}>-E[<\vec{u},\vec{v}>])^2] = \sum_{i=1}^{d_k} E[({u_i}-E[u_i])^2] E[({v_i}-E[v_i])^2]$. Layer norm encourages the mean and variance of each dimension to be $0$ and $1$ respectively, resulting in the dot product having mean $0$ and variance $d_k$ respectively. Therefore, scaling by $\sqrt{d_k}$ encourages the logits to be normalized as well.
8
-
9
- \iffalse
10
-
11
- In this section, we will give a rough justification of this scaling factor, that is, we will show that for any two vectors, $\vec{u}$ and $\vec{v}$, whose components have variance $1$ and mean $0$, the variance and the mean of the dot product are $d_k$ and $0$ respectively. Therefore, dividing by $\sqrt{d_k}$ ensures that each component of the attention logits is normalized. The repeated layer norms at each transformer layer encourage $\vec{u}$ and $\vec{v}$ to be normalized.
12
-
13
-
14
- \begin{align*}
15
- E[<\vec{u},\vec{v}>] & = \sum_{i=1}^{d_k} E[u_i v_i] &\text{By linearity of expectation} \\
16
- & =\sum_{i=1}^{d_k} E[u_i]E[v_i] & \text{Assuming independence} \\
17
- & = 0
18
- \end{align*}
19
-
20
- \begin{align*}
21
- E[(<\vec{u},\vec{v}>-E[<\vec{u},\vec{v}>])^2] & = E[(<\vec{u},\vec{v}>)^2] - E[<\vec{u},\vec{v}>]^2 \\
22
- & = E[(<\vec{u},\vec{v}>)^2] \\
23
- & = \sum_{i=1}^{d_k} E[{u_i}^2] E[{v_i}^2] &\text{By linearity of expectation and independence} \\
24
- & = d_k
25
- \end{align*}
26
-
27
-
28
- \fi
crazy_functions/test_project/latex/attention/training.tex DELETED
@@ -1,42 +0,0 @@
1
- This section describes the training regime for our models.
2
-
3
- %In order to speed up experimentation, our ablations are performed relative to a smaller base model described in detail in Section \ref{sec:results}.
4
-
5
- \subsection{Training Data and Batching}
6
- We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding \citep{DBLP:journals/corr/BritzGLL17}, which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary \citep{wu2016google}. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
7
-
8
- \subsection{Hardware and Schedule}
9
-
10
- We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the bottom line of Table~\ref{tab:variations}), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).
11
-
12
- \subsection{Optimizer} We used the Adam optimizer~\citep{kingma2014adam} with $\beta_1=0.9$, $\beta_2=0.98$ and $\epsilon=10^{-9}$. We varied the learning rate over the course of training, according to the formula:
13
-
14
- \begin{equation}
15
- lrate = \dmodel^{-0.5} \cdot
16
- \min({step\_num}^{-0.5},
17
- {step\_num} \cdot {warmup\_steps}^{-1.5})
18
- \end{equation}
19
-
20
- This corresponds to increasing the learning rate linearly for the first $warmup\_steps$ training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used $warmup\_steps=4000$.
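Review note: the schedule in the equation above translates directly into code. A minimal sketch, assuming the base model width $d_{model} = 512$ as the default and using a hypothetical function name:

```python
def transformer_lrate(step_num: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lrate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)."""
    if step_num < 1:
        raise ValueError("step_num must be >= 1")
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)


peak = transformer_lrate(4000)
for step in (1000, 2000, 4000, 16000, 100000):
    print(step, transformer_lrate(step) / peak)
```

The two arguments of `min` cross at `step_num == warmup_steps`, so the rate grows linearly up to step 4000 and then decays as the inverse square root of the step number. For example, it is half the peak value both at step 2000 and at step 16000.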
21
-
22
- \subsection{Regularization} \label{sec:reg}
23
-
24
- We employ three types of regularization during training:
25
- \paragraph{Residual Dropout} We apply dropout \citep{srivastava2014dropout} to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop}=0.1$.
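Review note: the order of operations described above is LayerNorm(x + Dropout(Sublayer(x))). The sketch below shows that ordering on plain Python lists. It assumes the common inverted-dropout formulation (surviving activations are rescaled by 1 / (1 - rate)) and omits the learned gain and bias of layer normalization; all names are hypothetical.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def dropout(x: Vector, rate: float, rng: random.Random, training: bool = True) -> Vector:
    """Inverted dropout: zero each element with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Normalize to zero mean and unit variance (learned gain and bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x: Vector, sublayer: Callable[[Vector], Vector], rate: float,
                   rng: random.Random, training: bool = True) -> Vector:
    """Dropout on the sub-layer output, then the residual add, then normalization."""
    dropped = dropout(sublayer(x), rate, rng, training)
    return layer_norm([a + b for a, b in zip(x, dropped)])
```

Calling `residual_block(x, sublayer, 0.1, random.Random(0))` would correspond to the base-model rate $P_{drop}=0.1$.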
26
-
27
- % \paragraph{Attention Dropout} Query to key attentions are structurally similar to hidden-to-hidden weights in a feed-forward network, albeit across positions. The softmax activations yielding attention weights can then be seen as the analogue of hidden layer activations. A natural possibility is to extend dropout \citep{srivastava2014dropout} to attention. We implement attention dropout by dropping out attention weights as,
28
- % \begin{equation*}
29
- % \mathrm{Attention}(Q, K, V) = \mathrm{dropout}(\mathrm{softmax}(\frac{QK^T}{\sqrt{d}}))V
30
- % \end{equation*}
31
- % In addition to residual dropout, we found attention dropout to be beneficial for our parsing experiments.
32
-
33
- %\paragraph{Symbol Dropout} In the source and target embedding layers, we replace a random subset of the token ids with a sentinel id. For the base model, we use a rate of $symbol\_dropout\_rate=0.1$. Note that this applies only to the auto-regressive use of the target ids - not their use in the cross-entropy loss.
34
-
35
- %\paragraph{Attention Dropout} Query to memory attentions are structurally similar to hidden-to-hidden weights in a feed-forward network, albeit across positions. The softmax activations yielding attention weights can then be seen as the analogue of hidden layer activations. A natural possibility is to extend dropout \citep{srivastava2014dropout} to attentions. We implement attention dropout by dropping out attention weights as,
36
- %\begin{equation*}
37
- % A(Q, K, V) = \mathrm{dropout}(\mathrm{softmax}(\frac{QK^T}{\sqrt{d}}))V
38
- %\end{equation*}
39
- %As a result, the query will not be able to access the memory values at the dropped out position. In our experiments, we tried attention dropout rates of 0.2, and 0.3, and found it to work favorably for English-to-German translation.
40
- %$attention\_dropout\_rate=0.2$.
41
-
42
- \paragraph{Label Smoothing} During training, we employed label smoothing of value $\epsilon_{ls}=0.1$ \citep{DBLP:journals/corr/SzegedyVISW15}. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
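Review note: a minimal sketch of label smoothing with $\epsilon_{ls}=0.1$. It assumes the formulation in which the one-hot target is mixed with a uniform distribution over the whole vocabulary; some implementations spread the mass over the other classes only. Function names are hypothetical.

```python
import math
from typing import List


def smoothed_targets(true_index: int, vocab_size: int, epsilon: float = 0.1) -> List[float]:
    """Mix the one-hot target with a uniform distribution over the vocabulary."""
    uniform = epsilon / vocab_size
    targets = [uniform] * vocab_size
    targets[true_index] += 1.0 - epsilon
    return targets


def cross_entropy(targets: List[float], probs: List[float]) -> float:
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0.0)


targets = smoothed_targets(true_index=2, vocab_size=4)
confident = [0.001, 0.001, 0.997, 0.001]
hedged = [0.025, 0.025, 0.925, 0.025]
print(cross_entropy(targets, confident), cross_entropy(targets, hedged))
```

Against smoothed targets, a prediction that matches the smoothed distribution scores a lower loss than a near one-hot prediction, which is why the model "learns to be more unsure".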
crazy_functions/test_project/latex/attention/visualizations.tex DELETED
@@ -1,18 +0,0 @@
1
- \pagebreak
2
- \section*{Attention Visualizations}\label{sec:viz-att}
3
- \begin{figure*}[h]
4
- {\includegraphics[width=\textwidth, trim=0 0 0 36, clip]{./vis/making_more_difficult5_new.pdf}}
5
- \caption{An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb `making', completing the phrase `making...more difficult'. Attentions here shown only for the word `making'. Different colors represent different heads. Best viewed in color.}
6
- \end{figure*}
7
-
8
- \begin{figure*}
9
- {\includegraphics[width=\textwidth, trim=0 0 0 45, clip]{./vis/anaphora_resolution_new.pdf}}
10
- {\includegraphics[width=\textwidth, trim=0 0 0 37, clip]{./vis/anaphora_resolution2_new.pdf}}
11
- \caption{Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word `its' for attention heads 5 and 6. Note that the attentions are very sharp for this word.}
12
- \end{figure*}
13
-
14
- \begin{figure*}
15
- {\includegraphics[width=\textwidth, trim=0 0 0 36, clip]{./vis/attending_to_head_new.pdf}}
16
- {\includegraphics[width=\textwidth, trim=0 0 0 36, clip]{./vis/attending_to_head2_new.pdf}}
17
- \caption{Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.}
18
- \end{figure*}
crazy_functions/test_project/latex/attention/why_self_attention.tex DELETED
@@ -1,98 +0,0 @@
1
- %We focus on the general task of mapping one variable-length sequence of symbol representations ${x_1, ..., x_n} \in \mathbb{R}^d$ to another sequence of the same length ${y_1, ..., y_n} \in \mathbb{R}^d$. \marginpar{should we use this notation? alternatively we can just say "d-dimensional vectors"}
2
-
3
- In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_1, ..., x_n)$ to another sequence of equal length $(z_1, ..., z_n)$, with $x_i, z_i \in \mathbb{R}^d$, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.
4
-
5
- One is the total computational complexity per layer.
6
- Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
7
-
8
- The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies \citep{hochreiter2001gradient}. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
9
-
10
- %\subsection{Computational Performance and Path Lengths}
11
-
12
- \begin{table}[t]
13
- \caption{
14
- Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. $n$ is the sequence length, $d$ is the representation dimension, $k$ is the kernel size of convolutions and $r$ the size of the neighborhood in restricted self-attention.}
15
- %Attention models are quite efficient for cross-positional communications when sequence length is smaller than channel depth.
16
- \label{tab:op_complexities}
17
- \begin{center}
18
- \vspace{-1mm}
19
- %\scalebox{0.75}{
20
-
21
- \begin{tabular}{lccc}
22
- \toprule
23
- Layer Type & Complexity per Layer & Sequential & Maximum Path Length \\
24
- & & Operations & \\
25
- \hline
26
- \rule{0pt}{2.0ex}Self-Attention & $O(n^2 \cdot d)$ & $O(1)$ & $O(1)$ \\
27
- Recurrent & $O(n \cdot d^2)$ & $O(n)$ & $O(n)$ \\
28
-
29
- Convolutional & $O(k \cdot n \cdot d^2)$ & $O(1)$ & $O(\log_k(n))$ \\
30
- %\cmidrule
31
- Self-Attention (restricted)& $O(r \cdot n \cdot d)$ & $O(1)$ & $O(n/r)$ \\
32
-
33
- %Convolutional (separable) & $O(k \cdot n \cdot d + n \cdot d^2)$ & $O(1)$ & $O(log_k(n))$ \\
34
-
35
- %Position-wise Feed-Forward & $O(n \cdot d^2)$ & $O(1)$ & $\infty$ \\
36
-
37
- %Fully Connected & $O(n^2 \cdot d^2)$ & $O(1)$ & $O(1)$ \\
38
- %Convolutional (separable) & $O(k \cdot n \cdot d + n \cdot d^2)$ & $O(1)$ & $O(log_k(n))$ \\
39
-
40
- %Position-wise Feed-Forward & $O(n \cdot d^2)$ & $O(1)$ & $\infty$ \\
41
-
42
- %Fully Connected & $O(n^2 \cdot d^2)$ & $O(1)$ & $O(1)$ \\
43
- \bottomrule
44
- \end{tabular}
45
- %}
46
- \end{center}
47
- \end{table}
48
-
49
-
50
- %\begin{table}[b]
51
- %\caption{
52
- % Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. $n$ is the sequence length, $d$ is the representation dimensionality, $k$ is the kernel size of convolutions and $r$ the size of the neighborhood in localized self-attention.}
53
- %Attention models are quite efficient for cross-positional communications when sequence length is smaller than channel depth.
54
- %\label{tab:op_complexities}
55
- %\begin{center}
56
- %\vspace{-1mm}
57
- %%\scalebox{0.75}{
58
- %
59
- %\begin{tabular}{lccc}
60
- %\hline
61
- %Layer Type & Receptive & Complexity per Layer & Sequential %\\
62
- % & Field Size & & Operations \\
63
- %\hline
64
- %Self-Attention & $n$ & $O(n^2 \cdot d)$ & $O(1)$ \\
65
- %Recurrent & $n$ & $O(n \cdot d^2)$ & $O(n)$ \\
66
-
67
- %Convolutional & $k$ & $O(k \cdot n \cdot d^2)$ & %$O(log_k(n))$ \\
68
- %\hline
69
- %Self-Attention (localized)& $r$ & $O(r \cdot n \cdot d)$ & %$O(1)$ \\
70
-
71
- %Convolutional (separable) & $k$ & $O(k \cdot n \cdot d + n %\cdot d^2)$ & $O(log_k(n))$ \\
72
-
73
- %Position-wise Feed-Forward & $1$ & $O(n \cdot d^2)$ & $O(1)$ %\\
74
-
75
- %Fully Connected & $n$ & $O(n^2 \cdot d^2)$ & $O(1)$ \\
76
-
77
- %\end{tabular}
78
- %%}
79
- %\end{center}
80
- %\end{table}
81
-
82
- %The receptive field size of a layer is the number of different input representations that can influence any particular output representation. Recurrent layers and self-attention layers have a full receptive field equal to the sequence length $n$. Convolutional layers have a limited receptive field equal to their kernel width $k$, which is generally chosen to be small in order to limit computational cost.
83
-
84
- As noted in Table \ref{tab:op_complexities}, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O(n)$ sequential operations.
85
- In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length $n$ is smaller than the representation dimensionality $d$, which is most often the case with sentence representations used by state-of-the-art models in machine translation, such as word-piece \citep{wu2016google} and byte-pair \citep{sennrich2015neural} representations.
86
- To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size $r$ in the input sequence centered around the respective output position. This would increase the maximum path length to $O(n/r)$. We plan to investigate this approach further in future work.
87
-
88
- A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions. Doing so requires a stack of $O(n/k)$ convolutional layers in the case of contiguous kernels, or $O(\log_k(n))$ in the case of dilated convolutions \citep{NalBytenet2017}, increasing the length of the longest paths between any two positions in the network.
89
- Convolutional layers are generally more expensive than recurrent layers, by a factor of $k$. Separable convolutions \citep{xception2016}, however, decrease the complexity considerably, to $O(k \cdot n \cdot d + n \cdot d^2)$. Even with $k=n$, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
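Review note: the per-layer complexity comparison from the table can be made concrete by plugging in numbers. The functions below return only the dominant term of each big-O expression, with constants dropped, so they are useful for comparing orders of magnitude and not as exact operation counts. The names are hypothetical.

```python
def self_attention_ops(n: int, d: int) -> int:
    """O(n^2 * d): every position attends to every other position."""
    return n * n * d


def recurrent_ops(n: int, d: int) -> int:
    """O(n * d^2): one d x d state update per position."""
    return n * d * d


def convolutional_ops(n: int, d: int, k: int) -> int:
    """O(k * n * d^2): a width-k kernel over d input and d output channels."""
    return k * n * d * d


def restricted_self_attention_ops(n: int, d: int, r: int) -> int:
    """O(r * n * d): each position attends to a neighborhood of size r."""
    return r * n * d


for n in (50, 512, 4096):
    print(n, self_attention_ops(n, 512), recurrent_ops(n, 512), convolutional_ops(n, 512, 3))
```

With $d = 512$, self-attention is the cheaper of the two for $n = 50$ and the more expensive one for $n = 4096$, matching the $n < d$ condition stated above.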
90
-
91
- %\subsection{Unfiltered Bottleneck Argument}
92
-
93
- %An orthogonal argument can be made for self-attention layers based on when the layer imposes the bottleneck of mapping all of the information used to compute a given output position into a single, fixed-length vector. ...
94
-
95
- %There is a second argument for self-attention layers which we call the unfiltered bottleneck argument. In both recurrent and the convolutional layers, the information that position $i$ receives from the other positions is compressed to a vector of dimension $d$ before it ever can be filtered by the content $x_i$. More precisely, we can express $y_i = F(i, x_i, G(i, \{x_{j \neq i}\}))$, where $G(i, \{x_{j \neq i}\})$ is a vector of dimension $d$. Intuitively, we would expect that this would cause a large amount of irrelevant information to crowd out the relevant information. Self-attention does not suffer from the unfiltered bottleneck problem, since the aggregation happens after filtering, and so, intuitively, we have the chance of transmitting lots of relevant information.
96
-
97
- As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
98
-
crazy_functions/test_project/其他测试 CHANGED
@@ -6,4 +6,22 @@ To ensure the Cooperation Graph initialization has higher entropy,
6
  we will randomly generate multiple initial states,
7
  rank by their entropy and then pick the one with maximum $H$."
8
 
 
 
9
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6
  we will randomly generate multiple initial states,
7
  rank by their entropy and then pick the one with maximum $H$."
8
 
9
+ ```
10
+ FROM ubuntu:latest
11
 
12
+ RUN apt-get update && \
13
+ apt-get install -y python3 python3-pip && \
14
+ rm -rf /var/lib/apt/lists/*
15
+
16
+ RUN echo '[global]' > /etc/pip.conf && \
17
+ echo 'index-url = https://mirrors.aliyun.com/pypi/simple/' >> /etc/pip.conf && \
18
+ echo 'trusted-host = mirrors.aliyun.com' >> /etc/pip.conf
19
+
20
+ RUN pip3 install gradio requests[socks] mdtex2html
21
+
22
+ COPY . /gpt
23
+ WORKDIR /gpt
24
+
25
+
26
+ CMD ["python3", "main.py"]
27
+ ```
crazy_functions/生成函数注释.py CHANGED
@@ -19,13 +19,7 @@ def 生成函数注释(file_manifest, project_folder, top_p, temperature, chatbo
19
  if not fast_debug:
20
  msg = '正常'
21
  # ** gpt request **
22
- while True:
23
- try:
24
- gpt_say = yield from predict_no_ui_but_counting_down(i_say, i_say_show_user, chatbot, top_p, temperature, history=[]) # with timeout countdown
25
- break
26
- except ConnectionAbortedError as e:
27
- i_say = i_say[:len(i_say)//2]
28
- msg = '文件太长,进行了拦腰截断'
29
 
30
  print('[2] end gpt req')
31
  chatbot[-1] = (i_say_show_user, gpt_say)
 
19
  if not fast_debug:
20
  msg = '正常'
21
  # ** gpt request **
22
+ gpt_say = yield from predict_no_ui_but_counting_down(i_say, i_say_show_user, chatbot, top_p, temperature, history=[]) # with timeout countdown
23
 
24
  print('[2] end gpt req')
25
  chatbot[-1] = (i_say_show_user, gpt_say)
crazy_functions/解析项目源代码.py CHANGED
@@ -13,27 +13,17 @@ def 解析源代码(file_manifest, project_folder, top_p, temperature, chatbot,
13
  i_say = 前言 + f'请对下面的程序文件做一个概述文件名是{os.path.relpath(fp, project_folder)},文件代码是 ```{file_content}```'
14
  i_say_show_user = 前言 + f'[{index}/{len(file_manifest)}] 请对下面的程序文件做一个概述: {os.path.abspath(fp)}'
15
  chatbot.append((i_say_show_user, "[Local Message] waiting gpt response."))
16
- print('[1] yield chatbot, history')
17
  yield chatbot, history, '正常'
18
 
19
  if not fast_debug:
20
  msg = '正常'
 
21
  # ** gpt request **
22
- while True:
23
- try:
24
- # gpt_say = predict_no_ui(inputs=i_say, top_p=top_p, temperature=temperature)
25
- gpt_say = yield from predict_no_ui_but_counting_down(i_say, i_say_show_user, chatbot, top_p, temperature, history=[]) # with timeout countdown
26
- break
27
- except ConnectionAbortedError as e:
28
- i_say = i_say[:len(i_say)//2]
29
- msg = '文件太长,进行了拦腰截断'
30
-
31
- print('[2] end gpt req')
32
  chatbot[-1] = (i_say_show_user, gpt_say)
33
  history.append(i_say_show_user); history.append(gpt_say)
34
- print('[3] yield chatbot, history')
35
  yield chatbot, history, msg
36
- print('[4] next')
37
  if not fast_debug: time.sleep(2)
38
 
39
  all_file = ', '.join([os.path.relpath(fp, project_folder) for index, fp in enumerate(file_manifest)])
@@ -44,16 +34,8 @@ def 解析源代码(file_manifest, project_folder, top_p, temperature, chatbot,
44
  if not fast_debug:
45
  msg = '正常'
46
  # ** gpt request **
47
- while True:
48
- try:
49
- # gpt_say = predict_no_ui(inputs=i_say, top_p=top_p, temperature=temperature, history=history)
50
- gpt_say = yield from predict_no_ui_but_counting_down(i_say, i_say, chatbot, top_p, temperature, history=history) # with timeout countdown
51
- break
52
- except ConnectionAbortedError as e:
53
- history = [his[len(his)//2:] for his in history]
54
- msg = '对话历史太长,每段历史拦腰截断'
55
 
56
-
57
  chatbot[-1] = (i_say, gpt_say)
58
  history.append(i_say); history.append(gpt_say)
59
  yield chatbot, history, msg
 
13
  i_say = 前言 + f'请对下面的程序文件做一个概述文件名是{os.path.relpath(fp, project_folder)},文件代码是 ```{file_content}```'
14
  i_say_show_user = 前言 + f'[{index}/{len(file_manifest)}] 请对下面的程序文件做一个概述: {os.path.abspath(fp)}'
15
  chatbot.append((i_say_show_user, "[Local Message] waiting gpt response."))
 
16
  yield chatbot, history, '正常'
17
 
18
  if not fast_debug:
19
  msg = '正常'
20
+
21
  # ** gpt request **
22
+ gpt_say = yield from predict_no_ui_but_counting_down(i_say, i_say_show_user, chatbot, top_p, temperature, history=[]) # with timeout countdown
23
+
24
  chatbot[-1] = (i_say_show_user, gpt_say)
25
  history.append(i_say_show_user); history.append(gpt_say)
 
26
  yield chatbot, history, msg
 
27
  if not fast_debug: time.sleep(2)
28
 
29
  all_file = ', '.join([os.path.relpath(fp, project_folder) for index, fp in enumerate(file_manifest)])
 
34
  if not fast_debug:
35
  msg = '正常'
36
  # ** gpt request **
37
+ gpt_say = yield from predict_no_ui_but_counting_down(i_say, i_say, chatbot, top_p, temperature, history=history) # with timeout countdown
38
 
 
39
  chatbot[-1] = (i_say, gpt_say)
40
  history.append(i_say); history.append(gpt_say)
41
  yield chatbot, history, msg
crazy_functions/读文章写摘要.py CHANGED
@@ -20,13 +20,7 @@ def 解析Paper(file_manifest, project_folder, top_p, temperature, chatbot, hist
20
  if not fast_debug:
21
  msg = '正常'
22
  # ** gpt request **
23
- while True:
24
- try:
25
- gpt_say = yield from predict_no_ui_but_counting_down(i_say, i_say_show_user, chatbot, top_p, temperature, history=[]) # with timeout countdown
26
- break
27
- except ConnectionAbortedError as e:
28
- i_say = i_say[:len(i_say)//2]
29
- msg = '文件太长,进行了拦腰截断'
30
 
31
  print('[2] end gpt req')
32
  chatbot[-1] = (i_say_show_user, gpt_say)
@@ -44,14 +38,7 @@ def 解析Paper(file_manifest, project_folder, top_p, temperature, chatbot, hist
44
  if not fast_debug:
45
  msg = '正常'
46
  # ** gpt request **
47
- while True:
48
- try:
49
- gpt_say = yield from predict_no_ui_but_counting_down(i_say, i_say, chatbot, top_p, temperature, history=history) # with timeout countdown
50
- break
51
- except ConnectionAbortedError as e:
52
- history = [his[len(his)//2:] for his in history]
53
- msg = '对话历史太长,每段历史拦腰截断'
54
-
55
 
56
  chatbot[-1] = (i_say, gpt_say)
57
  history.append(i_say); history.append(gpt_say)
 
20
  if not fast_debug:
21
  msg = '正常'
22
  # ** gpt request **
23
+ gpt_say = yield from predict_no_ui_but_counting_down(i_say, i_say_show_user, chatbot, top_p, temperature, history=[]) # with timeout countdown
24
 
25
  print('[2] end gpt req')
26
  chatbot[-1] = (i_say_show_user, gpt_say)
 
38
  if not fast_debug:
39
  msg = '正常'
40
  # ** gpt request **
41
+ gpt_say = yield from predict_no_ui_but_counting_down(i_say, i_say, chatbot, top_p, temperature, history=history) # with timeout countdown
42
 
43
  chatbot[-1] = (i_say, gpt_say)
44
  history.append(i_say); history.append(gpt_say)
predict.py CHANGED
@@ -32,7 +32,7 @@ def predict_no_ui(inputs, top_p, temperature, history=[]):
32
  # make a POST request to the API endpoint, stream=False
33
  response = requests.post(API_URL, headers=headers, proxies=proxies,
34
  json=payload, stream=False, timeout=TIMEOUT_SECONDS*2); break
35
- except TimeoutError as e:
36
  retry += 1
37
  traceback.print_exc()
38
  if MAX_RETRY!=0: print(f'请求超时,正在重试 ({retry}/{MAX_RETRY}) ……')
@@ -110,7 +110,8 @@ def predict(inputs, top_p, temperature, chatbot=[], history=[], system_prompt=''
110
  chunk = get_full_error(chunk, stream_response)
111
  error_msg = chunk.decode()
112
  if "reduce the length" in error_msg:
113
- chatbot[-1] = (history[-1], "[local] input is too long, reduce input or clear history.")
 
114
  yield chatbot, history, "Json解析不合常规,很可能是文本过长" + error_msg
115
  return
116
 
 
32
  # make a POST request to the API endpoint, stream=False
33
  response = requests.post(API_URL, headers=headers, proxies=proxies,
34
  json=payload, stream=False, timeout=TIMEOUT_SECONDS*2); break
35
+ except requests.exceptions.ReadTimeout as e:
36
  retry += 1
37
  traceback.print_exc()
38
  if MAX_RETRY!=0: print(f'请求超时,正在重试 ({retry}/{MAX_RETRY}) ……')
 
110
  chunk = get_full_error(chunk, stream_response)
111
  error_msg = chunk.decode()
112
  if "reduce the length" in error_msg:
113
+ chatbot[-1] = (history[-1], "[Local Message] Input (or history) is too long, please reduce input or clear history by refreshing this page.")
114
+ history = []
115
  yield chatbot, history, "Json解析不合常规,很可能是文本过长" + error_msg
116
  return
117
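The `predict.py` change above swaps `except TimeoutError` for `except requests.exceptions.ReadTimeout`. The likely reason is that the timeout classes in `requests` derive from `RequestException` (an `IOError`), not from Python's built-in `TimeoutError`, so the old handler would never have caught a read timeout. The sketch below illustrates the same retry pattern using only the standard library. `FakeReadTimeout`, `call_with_retry`, and `make_flaky_sender` are hypothetical names introduced for illustration, and raising `TimeoutError` once the retries run out is an assumption, since that part of `predict_no_ui` is not shown in this diff.

```python
import traceback


class FakeReadTimeout(OSError):
    """Hypothetical stand-in for requests.exceptions.ReadTimeout.

    Like the real class, it is an OSError but NOT a built-in TimeoutError.
    """


def call_with_retry(send, retry_on, max_retry=2):
    """Call send() until it succeeds, retrying only on `retry_on` exceptions.

    Raises the built-in TimeoutError once more than `max_retry` retries
    have been used (assumed behavior, not confirmed by the diff).
    """
    retry = 0
    while True:
        try:
            return send()
        except retry_on:
            retry += 1
            traceback.print_exc()
            if retry > max_retry:
                raise TimeoutError(f"gave up after {max_retry} retries")
            print(f"request timed out, retrying ({retry}/{max_retry}) ...")


def make_flaky_sender(failures):
    """Build a sender that raises FakeReadTimeout `failures` times, then succeeds."""
    state = {"left": failures}

    def send():
        if state["left"] > 0:
            state["left"] -= 1
            raise FakeReadTimeout("read timed out")
        return "ok"

    return send
```

Passing `TimeoutError` as `retry_on` here would let `FakeReadTimeout` propagate on the first failure, which mirrors the behavior the commit appears to fix.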
 
toolbox.py CHANGED
@@ -4,19 +4,33 @@ from functools import wraps
4
 
5
  def predict_no_ui_but_counting_down(i_say, i_say_show_user, chatbot, top_p, temperature, history=[]):
6
  """
7
- Calls the simple predict_no_ui interface while still keeping a basic UI heartbeat
8
  """
9
  import time
10
  try: from config_private import TIMEOUT_SECONDS
11
  except: from config import TIMEOUT_SECONDS
12
  from predict import predict_no_ui
13
- mutable = [None]
14
- def mt(): mutable[0] = predict_no_ui(inputs=i_say, top_p=top_p, temperature=temperature, history=history)
15
- thread_name = threading.Thread(target=mt); thread_name.start()
16
  cnt = 0
17
  while thread_name.is_alive():
18
  cnt += 1
19
- chatbot[-1] = (i_say_show_user, f"[Local Message] waiting gpt response {cnt}/{TIMEOUT_SECONDS*2}"+''.join(['.']*(cnt%4)))
20
  yield chatbot, history, '正常'
21
  time.sleep(1)
22
  gpt_say = mutable[0]
 
4
 
5
  def predict_no_ui_but_counting_down(i_say, i_say_show_user, chatbot, top_p, temperature, history=[]):
6
  """
7
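+ Calls the simple predict_no_ui interface while still keeping a basic UI heartbeat; when the conversation is too long, it is automatically truncated by repeated halving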
8
  """
9
  import time
10
  try: from config_private import TIMEOUT_SECONDS
11
  except: from config import TIMEOUT_SECONDS
12
  from predict import predict_no_ui
13
+ mutable = [None, '']
14
+ def mt(i_say, history):
15
+ while True:
16
+ try:
17
+ mutable[0] = predict_no_ui(inputs=i_say, top_p=top_p, temperature=temperature, history=history)
18
+ break
19
+ except ConnectionAbortedError as e:
20
+ if len(history) > 0:
21
+ history = [his[len(his)//2:] for his in history if his is not None]
22
+ mutable[1] = 'Warning! Conversation history is too long, cut in half. '
23
+ else:
24
+ i_say = i_say[:len(i_say)//2]
25
+ mutable[1] = 'Warning! Input file is too long, cut in half. '
26
+ except TimeoutError as e:
27
+ mutable[0] = '[Local Message] Failed with timeout'
28
+
29
+ thread_name = threading.Thread(target=mt, args=(i_say, history)); thread_name.start()
30
  cnt = 0
31
  while thread_name.is_alive():
32
  cnt += 1
33
+ chatbot[-1] = (i_say_show_user, f"[Local Message] {mutable[1]}waiting gpt response {cnt}/{TIMEOUT_SECONDS*2}"+''.join(['.']*(cnt%4)))
34
  yield chatbot, history, '正常'
35
  time.sleep(1)
36
  gpt_say = mutable[0]
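The `toolbox.py` change moves the halve-and-retry loop out of each call site and into the worker thread, while the generator keeps yielding heartbeat updates. A minimal, self-contained sketch of that pattern follows. `fake_predict`, `predict_with_halving`, and `run` are hypothetical names, and the length limit is invented for the demo. Two details differ from the diff and are assumptions about the intended behavior: the timeout branch leaves the loop with `break` (as shown in the diff, that branch sets the message but stays inside `while True`), and the worker gives up once nothing can be halved any further (a one-character history entry is unchanged by `his[len(his)//2:]`).

```python
import threading
import time


def fake_predict(inputs, history, limit=8):
    """Hypothetical stand-in for predict_no_ui: rejects over-long requests."""
    total = len(inputs) + sum(len(h) for h in history)
    if total > limit:
        raise ConnectionAbortedError("input too long")
    return f"reply to {inputs!r}"


def predict_with_halving(i_say, history, predict=fake_predict, poll=0.01):
    """Run `predict` in a worker thread, halving history, then input, on overflow.

    Yields one status string per heartbeat; the generator's return value
    is the final reply, so callers can use `reply = yield from ...`.
    """
    result = [None, ""]  # [reply, warning prefix shown in the heartbeat]

    def worker(i_say, history):
        while True:
            try:
                result[0] = predict(i_say, history)
                break
            except ConnectionAbortedError:
                if any(len(h) > 1 for h in history if h is not None):
                    # Keep the most recent half of every history entry.
                    history = [h[len(h) // 2:] for h in history if h is not None]
                    result[1] = "history cut in half; "
                elif len(i_say) > 1:
                    # Keep the first half of the input.
                    i_say = i_say[: len(i_say) // 2]
                    result[1] = "input cut in half; "
                else:
                    # Assumption: stop when nothing is left to truncate.
                    result[0] = "[Local Message] Still too long after truncation"
                    break
            except TimeoutError:
                result[0] = "[Local Message] Failed with timeout"
                break  # assumption: give up instead of retrying forever

    thread = threading.Thread(target=worker, args=(i_say, history))
    thread.start()
    cnt = 0
    while thread.is_alive():
        cnt += 1
        yield f"{result[1]}waiting for response {cnt}"
        time.sleep(poll)
    thread.join()
    return result[0]


def run(gen):
    """Drain a generator and return (yielded items, return value)."""
    seen = []
    while True:
        try:
            seen.append(next(gen))
        except StopIteration as stop:
            return seen, stop.value
```

Because the reply is the generator's return value, call sites shrink to a single `gpt_say = yield from ...` line, which is the simplification this commit applies across the `crazy_functions` modules.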