diff --git "a/methods.json" "b/methods.json" --- "a/methods.json" +++ "b/methods.json" @@ -1 +1 @@ -{"title":{"0":"Causal Inference","1":"AutoEncoder","2":"LDA","3":"SVM","4":"GloVe","5":"Residual Connection","6":"Attention Dropout","7":"Linear Warmup With Linear Decay","8":"Weight Decay","9":"GELU","10":"Dense Connections","11":"Adam","12":"WordPiece","13":"Softmax","14":"Dropout","15":"Multi-Head Attention","16":"Layer Normalization","17":"Scaled Dot-Product Attention","18":"BERT","19":"Absolute Position Encodings","20":"Position-Wise Feed-Forward Layer","21":"BPE","22":"Label Smoothing","23":"ReLU","24":"Transformer","25":"Convolution","26":"Dilated Convolution","27":"PCA","28":"Graph Convolutional Networks","29":"Tanh Activation","30":"Sigmoid Activation","31":"LSTM","32":"BiLSTM","33":"ELMo","34":"RPN","35":"RoIPool","36":"Faster R-CNN","37":"GAN","38":"Concatenated Skip Connection","39":"Max Pooling","40":"U-Net","41":"Interpretability","42":"CurricularFace","43":"Weight Normalization","44":"L1 Regularization","45":"Softsign Activation","46":"Leaky ReLU","47":"GLU","48":"Normalizing Flows","49":"DV3 Attention Block","50":"DV3 Convolution Block","51":"Bridge-net","52":"ClariNet","53":"Mixture of Logistic Distributions","54":"Dilated Causal Convolution","55":"WaveNet","56":"Dot-Product Attention","57":"GCN","58":"Batch Normalization","59":"TuckER","60":"Average Pooling","61":"1x1 Convolution","62":"Bottleneck Residual Block","63":"Global Average Pooling","64":"Residual Block","65":"Kaiming Initialization","66":"ResNet","67":"Q-Learning","68":"1D CNN","69":"XLM","70":"Cosine Annealing","71":"Strided Attention","72":"Linear Warmup With Cosine Annealing","73":"Fixed Factorized Attention","74":"GPT-3","75":"AWARE","76":"Experience Replay","77":"Entropy Regularization","78":"Soft Actor Critic","79":"PPO","80":"VQ-VAE","81":"Non Maximum Suppression","82":"SSD","83":"RoIAlign","84":"Mask R-CNN","85":"Inpainting","86":"Local Response Normalization","87":"Grouped Convolution","88":"AlexNet","89":"Non-Local Operation","90":"Non-Local Block","91":"k-Means Clustering","92":"Logistic Regression","93":"Darknet-53","94":"YOLOv3","95":"DTW","96":"ADMM","97":"Knowledge Distillation","98":"Additive Attention","99":"CNN BiLSTM","100":"Cross-View Training","101":"ALIGN","102":"Restricted Boltzmann Machine","103":"ReLIC","104":"GRU","105":"GTrXL","106":"CoBERL","107":"Seq2Seq","108":"VERSE","109":"Temporal attention","110":"DQN","111":"Hourglass Module","112":"Random Scaling","113":"Stacked Hourglass Network","114":"Corner Pooling","115":"CornerNet","116":"Soft-NMS","117":"Random Horizontal Flip","118":"Step Decay","119":"MatrixNet","120":"Global-Local Attention","121":"R1 Regularization","122":"Feedforward Network","123":"Adaptive Instance Normalization","124":"StyleGAN","125":"MAML","126":"NeRF","127":"ELECTRA","128":"CARLA","129":"CRF","130":"MDL","131":"Memory Network","132":"Diffusion","133":"Discriminative Fine-Tuning","134":"GPT-2","135":"Gaussian Process","136":"RoBERTa","137":"node2vec","138":"Triplet Loss","139":"CLIP","140":"FCN","141":"ENet Dilated Bottleneck","142":"ENet Bottleneck","143":"ENet Initial Block","144":"SpatialDropout","145":"PReLU","146":"ENet","147":"PatchGAN","148":"Instance Normalization","149":"GAN Least Squares Loss","150":"Cycle Consistency Loss","151":"CycleGAN","152":"PointNet","153":"Adaptive Dropout","154":"Vision Transformer","155":"DistilBERT","156":"SimCSE","157":"TS","158":"Adafactor","159":"Inverse Square Root Schedule","160":"SentencePiece","161":"T5","162":"Dense 
Block","163":"DenseNet","164":"TGN","165":"Random Erasing","166":"HyperNetwork","167":"Temporal Activation Regularization","168":"DropConnect","169":"Activation Regularization","170":"Embedding Dropout","171":"Variational Dropout","172":"Weight Tying","173":"AWD-LSTM","174":"Slanted Triangular Learning Rates","175":"ULMFiT","176":"Random Gaussian Blur","177":"ColorJitter","178":"Random Resized Crop","179":"NT-Xent","180":"InfoNCE","181":"SimCLR","182":"MoCo","183":"Pointer Network","184":"Detr","185":"SGD","186":"VGG","187":"DeepMask","188":"GAT","189":"PGM","190":"Mixup","191":"Dice Loss","192":"RMSProp","193":"Depthwise Convolution","194":"Swish","195":"Pointwise Convolution","196":"Depthwise Separable Convolution","197":"Squeeze-and-Excitation Block","198":"Inverted Residual Block","199":"EfficientNet","200":"DCNN","201":"REINFORCE","202":"Focal Loss","203":"Monte-Carlo Tree Search","204":"GPS","205":"fastText","206":"Softplus","207":"Mish","208":"Spatial Pyramid Pooling","209":"RepPoints","210":"Xception","211":"Nesterov Accelerated Gradient","212":"Stochastic Depth","213":"Swin Transformer","214":"DeiT","215":"Dynamic Convolution","216":"FixMatch","217":"Early Stopping","218":"Auxiliary Classifier","219":"Inception-v3 Module","220":"Inception-v3","221":"VAE","222":"Double Q-learning","223":"Double DQN","224":"Pyramidal Bottleneck Residual Unit","225":"Zero-padded Shortcut Connection","226":"Pyramidal Residual Unit","227":"PyramidNet","228":"SegNet","229":"Linear Regression","230":"DualGCN","231":"TrOCR","232":"Cross-Attention Module","233":"FPN","234":"RetinaNet","235":"Laplacian PE","236":"Graph Transformer","237":"Spectral Clustering","238":"Highway Layer","239":"Highway Network","240":"BiGRU","241":"CBHG","242":"Residual GRU","243":"Griffin-Lim Algorithm","244":"Tacotron","245":"Random Search","246":"CodeBERT","247":"IPL","248":"ScatNet","249":"Contrastive Predictive Coding","250":"MobileNetV2","251":"GPT","252":"Colorization","253":"Darknet-19","254":"YOLOv2","255":"EMF","256":"Channel Attention Module","257":"Spatial Attention Module","258":"Channel attention","259":"GradDrop","260":"Gradient Sparsification","261":"Clipped Double Q-learning","262":"Target Policy Smoothing","263":"TD3","264":"DDPG","265":"GAIL","266":"Maxout","267":"Minibatch Discrimination","268":"Orthogonal Regularization","269":"Multiscale Dilated Convolution Block","270":"IAN","271":"Barlow Twins","272":"Poincar\u00e9 Embeddings","273":"RAM","274":"Cutout","275":"DropPath","276":"ProxylessNAS","277":"Retrace","278":"FCOS","279":"TILDEv2","280":"Early exiting","281":"Contextualized Topic Models","282":"Capsule Network","283":"GraphSAGE","284":"Siamese Network","285":"MobileNetV1","286":"Position-Sensitive RoI Pooling","287":"R-FCN","288":"DARTS","289":"LIME","290":"Focal Transformers","291":"DeepWalk","292":"Inception Module","293":"Inception v2","294":"ResNeXt Block","295":"ResNeXt","296":"RAN","297":"MADDPG","298":"Spatial Transformer","299":"RESCAL","300":"AE","301":"SOM","302":"Rendezvous","303":"XLNet","304":"Self-Adversarial Negative Sampling","305":"RotatE","306":"CAM","307":"Label Quality Model","308":"SMOTE","309":"InfoGAN","310":"Soft Actor-Critic (Autotuned Temperature)","311":"V-trace","312":"IMPALA","313":"A2C","314":"A3C","315":"Affine Coupling","316":"Invertible 1x1 Convolution","317":"WaveGlow","318":"TransE","319":"Apollo","320":"WGAN","321":"Denoising Autoencoder","322":"ReLU6","323":"Hard Swish","324":"MobileNetV3","325":"MnasNet","326":"GA","327":"HiFi-GAN","328":"SENet","329":"3D 
Convolution","330":"Activation Normalization","331":"GLOW","332":"MADGRAD","333":"AdaGrad","334":"DCGAN","335":"PresGAN","336":"SGD with Momentum","337":"Wide Residual Block","338":"WideResNet","339":"GoogLeNet","340":"Expected Sarsa","341":"Sarsa","342":"Agglomerative Contextual Decomposition","343":"DeepLab","344":"mBART","345":"BART","346":"Deformable Convolution","347":"ConvLSTM","348":"OASIS","349":"DropBlock","350":"TRPO","351":"RandAugment","352":"Noisy Student","353":"STN","354":"Deep Belief Network","355":"AMP","356":"Temporal ROIAlign","357":"ICA","358":"DINO","359":"BYOL","360":"Two-Way Dense Layer","361":"PeleeNet","362":"NAM","363":"HANet","364":"NeuroTactic","365":"EWC","366":"CayleyNet","367":"wav2vec-U","368":"SELU","369":"SNN","370":"CodeT5","371":"Fast R-CNN","372":"N-step Returns","373":"ELU","374":"PixelCNN","375":"Pyramid Pooling Module","376":"PSPNet","377":"Spectral Normalization","378":"Levenshtein Transformer","379":"Spatial Gating Unit","380":"gMLP","381":"AdamW","382":"Dilated Sliding Window Attention","383":"Sliding Window Attention","384":"Global and Sliding Window Attention","385":"Longformer","386":"Electric","387":"BAM","388":"Discriminative Adversarial Search","389":"ArcFace","390":"SAGAN Self-Attention Module","391":"SAGAN","392":"Truncation Trick","393":"Off-Diagonal Orthogonal Regularization","394":"GAN Hinge Loss","395":"TTUR","396":"Conditional Batch Normalization","397":"Linear Layer","398":"Projection Discriminator","399":"BigGAN","400":"Blender","401":"GAM","402":"Attention Gate","403":"SANet","404":"Discrete Cosine Transform","405":"Procrustes","406":"AccoMontage","407":"SAC","408":"DANet","409":"MelGAN Residual Block","410":"Window-based Discriminator","411":"MelGAN","412":"FBNet Block","413":"FBNet","414":"Location-based Attention","415":"Content-based Attention","416":"Neural Turing Machine","417":"Path Length Regularization","418":"Weight Demodulation","419":"StyleGAN2","420":"SepFormer","421":"FAVOR+","422":"Performer","423":"DLA","424":"CSPDarknet53","425":"Bottom-up Path Augmentation","426":"Grid Sensitive","427":"CutMix","428":"PAFPN","429":"YOLOv4","430":"Disentangled Attention Mechanism","431":"DeBERTa","432":"PnP","433":"Highway networks","434":"SSE","435":"PVTv2","436":"DeCLUTR","437":"Fire Module","438":"Xavier Initialization","439":"SqueezeNet","440":"GMVAE","441":"Dense Contrastive Learning","442":"GLN","443":"Jigsaw","444":"Res2Net Block","445":"Res2Net","446":"Channel Shuffle","447":"ShuffleNet V2 Block","448":"DetNASNet","449":"DetNAS","450":"Spatial Broadcast Decoder","451":"Prioritized Experience Replay","452":"D4PG","453":"Stochastic Weight Averaging","454":"SHAP","455":"DEQ","456":"CSL","457":"mBERT","458":"Gradient Clipping","459":"Linear Warmup","460":"CTRL","461":"Boost-GNN","462":"Exponential Decay","463":"SRM","464":"PolarNet","465":"Groupwise Point Convolution","466":"ShuffleNet Block","467":"ShuffleNet","468":"Gradient Checkpointing","469":"SPNet","470":"Denoising Score Matching","471":"LAMB","472":"ALBERT","473":"TDN","474":"SFT","475":"Sparse Autoencoder","476":"WGAN-GP Loss","477":"Gravity","478":"ProGAN","479":"MuZero","480":"Prioritized Sweeping","481":"DPG","482":"MUSIQ","483":"ASPP","484":"DeepLabv3","485":"CenterTrack","486":"DAEL","487":"LAMA","488":"TayPO","489":"Adaptive Loss","490":"VOS","491":"Jukebox","492":"DAC","493":"Residual SRM","494":"Style-based Recalibration Module","495":"Reversible Residual Block","496":"RevNet","497":"TAM","498":"Concatenation Affinity","499":"Embedded Dot Product 
Affinity","500":"Embedded Gaussian Affinity","501":"WaveRNN","502":"Graph Self-Attention","503":"Hierarchical Feature Fusion","504":"ESP","505":"Sharpness-Aware Minimization","506":"SABL","507":"Cascade R-CNN","508":"Synthesizer","509":"CBAM","510":"LXMERT","511":"AugMix","512":"AMSGrad","513":"LV-ViT","514":"Concrete Dropout","515":"PointASNL","516":"IndexNet","517":"MAS","518":"ShuffleNet V2 Downsampling Block","519":"ShuffleNet v2","520":"Auxiliary Batch Normalization","521":"AdvProp","522":"MoCo v2","523":"Relative Position Encodings","524":"ETC","525":"T2T-ViT","526":"Dynamic Memory Network","527":"RandomRotate","528":"Polynomial Rate Decay","529":"GPipe","530":"Hydra","531":"Gumbel Softmax","532":"PixelShuffle","533":"Models Genesis","534":"Multi-Head Linear Attention","535":"Neural Architecture Search","536":"NAS-FPN","537":"Cyclical Learning Rate Policy","538":"Manifold Mixup","539":"STAC","540":"ResNeXt-Elastic","541":"DenseNet-Elastic","542":"Elastic Dense Block","543":"Elastic ResNeXt Block","544":"TAPAS","545":"k-NN","546":"Sparsemax","547":"NON","548":"R2D2","549":"R-CNN","550":"ZoomNet","551":"(2+1)D Convolution","552":"R(2+1)D","553":"RFB","554":"VLMo","555":"Pix2Pix","556":"Squared ReLU","557":"Multi-DConv-Head Attention","558":"Primer","559":"Channel-wise Soft Attention","560":"Anti-Alias Downsampling","561":"Selective Kernel Convolution","562":"Selective Kernel","563":"Big-Little Module","564":"AutoAugment","565":"Assemble-ResNet","566":"ResNet-D","567":"MPNN","568":"RAdam","569":"HypE","570":"Supervised Contrastive Loss","571":"Perceiver IO","572":"HyperDenseNet","573":"CR-NET","574":"Fast-OCR","575":"AlphaZero","576":"RAG","577":"Selective Search","578":"Ape-X","579":"VisualBERT","580":"ViLBERT","581":"Cosine Power Annealing","582":"AdaDelta","583":"AdaSmooth","584":"BiFPN","585":"EfficientDet","586":"MLP-Mixer","587":"mT5","588":"Adaptive Input Representations","589":"Adaptive Softmax","590":"SCNN_UNet_ConvLSTM","591":"TridentNet Block","592":"GAN Feature Matching","593":"Laplacian Pyramid","594":"Viewmaker Network","595":"SLR","596":"CoOp","597":"HITNet","598":"ScheduledDropPath","599":"Accumulating Eligibility Trace","600":"TD Lambda","601":"TD-Gammon","602":"Skip-gram Word2Vec","603":"MeRL","604":"OODformer","605":"RGA","606":"Content-Conditioned Style Encoder","607":"COCO-FUNIT","608":"Natural Gradient Descent","609":"Neural Probabilistic Language Model","610":"InstaBoost","611":"Inception-A","612":"Inception-C","613":"Reduction-A","614":"Inception-B","615":"Reduction-B","616":"Inception-v4","617":"LeNet","618":"Split Attention","619":"CoordConv","620":"DDParser","621":"Meta-augmentation","622":"Polya-Gamma Augmentation","623":"Deformable Attention Module","624":"Deformable DETR","625":"Tofu","626":"DSGN","627":"Fraternal Dropout","628":"Adversarial Color Enhancement","629":"Phase Shuffle","630":"WaveGAN","631":"AdaMax","632":"Axial Attention","633":"Local SGD","634":"DGCNN","635":"SRS","636":"Causal Convolution","637":"SPADE","638":"SCAN-clustering","639":"CRF-RNN","640":"CoVe","641":"ARMA","642":"MODERN","643":"RGCN","644":"TD-VAE","645":"Disentangled Attribution Curves","646":"CCNet","647":"Adaptive Masking","648":"Adaptive Span Transformer","649":"Sandwich Transformer","650":"SimAdapter","651":"Deep Boltzmann Machine","652":"Hopfield Layer","653":"LayerScale","654":"Spatial Group-wise Enhance","655":"VSF","656":"OSCAR","657":"RAE","658":"ESPNet","659":"DE-GAN","660":"PISA","661":"Noisy Linear Layer","662":"Dueling Network","663":"Rainbow DQN","664":"Center 
Pooling","665":"Cascade Corner Pooling","666":"CenterNet","667":"VL-T5","668":"Polyak Averaging","669":"CheXNet","670":"StruBERT","671":"MViT","672":"Transformer-XL","673":"HaloNet","674":"K-Net","675":"UNITER","676":"RFP","677":"BASNet","678":"Cascade Mask R-CNN","679":"Hit-Detector","680":"DeepCluster","681":"CvT","682":"SFAM","683":"SRGAN Residual Block","684":"VGG Loss","685":"SRGAN","686":"LCC","687":"ESIM","688":"SIFA","689":"Epsilon Greedy Exploration","690":"Affine Operator","691":"MATE","692":"CenterPoint","693":"AlphaFold","694":"Teacher-Tutor-Student Knowledge Distillation","695":"MagFace","696":"Dual Softmax Loss","697":"CAMoE","698":"CBNet","699":"EEND","700":"CCT","701":"VoiceFilter-Lite","702":"SGDW","703":"Deformable ConvNets","704":"Deformable Position-Sensitive RoI Pooling","705":"Deformable RoI Pooling","706":"LightGCN","707":"SCA-CNN","708":"CMCL","709":"UNIMO","710":"Demon","711":"T-Fixup","712":"Sparse Transformer","713":"Spatial Feature Transform","714":"SpreadsheetCoder","715":"Inception-ResNet-v2 Reduction-B","716":"Inception-ResNet-v2-A","717":"Inception-ResNet-v2-B","718":"Inception-ResNet-v2-C","719":"Inception-ResNet-v2","720":"NODE","721":"Mechanism Transfer","722":"DD-PPO","723":"FLICA","724":"Spatially Separable Convolution","725":"GNS","726":"SoftPool","727":"Style Transfer Module","728":"OHEM","729":"PIRL","730":"DVD-GAN DBlock","731":"DVD-GAN GBlock","732":"TSRUc","733":"TSRUp","734":"TSRUs","735":"TrIVD-GAN","736":"L-GCN","737":"E2EAdaptiveDistTraining","738":"BigBird","739":"CGNN","740":"MAVL","741":"LMOT","742":"Latent Optimisation","743":"GIN","744":"nnFormer","745":"LayoutReader","746":"DECA","747":"REM","748":"ChebNet","749":"FixRes","750":"Single-Headed Attention","751":"Boom Layer","752":"SHA-RNN","753":"SortCut Sinkhorn Attention","754":"Sparse Sinkhorn Attention","755":"Sinkhorn Transformer","756":"ORB-SLAM2","757":"YOLOv1","758":"Multiplicative Attention","759":"U2-Net","760":"K3M","761":"RegNetY","762":"LARS","763":"SwAV","764":"SEER","765":"PP-OCR","766":"RFB Net","767":"Parallax","768":"Symbolic Deep Learning","769":"Parrot","770":"Probabilistic Anchor Assignment","771":"Pixel-BERT","772":"SAGA","773":"Ape-X DQN","774":"OFA","775":"YOHO","776":"HRNet","777":"Universal Transformer","778":"Temporal Distribution Matching","779":"Temporal Distribution Characterization","780":"AdaRNN","781":"CTC Loss","782":"CRISS","783":"SM3","784":"Beta-VAE","785":"DynaBERT","786":"DiffPool","787":"DASPP","788":"LiteSeg","789":"MFF","790":"UNet++","791":"TaBERT","792":"Macaw","793":"End-To-End Memory Network","794":"LRNet","795":"Network Dissection","796":"Visual Parsing","797":"Image Scale Augmentation","798":"GShard","799":"NICE","800":"Voxel RoI Pooling","801":"Voxel R-CNN","802":"State-Aware Tracker","803":"LGCL","804":"ABC","805":"DistanceNet","806":"Population Based Training","807":"PAR Transformer","808":"Fractal Block","809":"FractalNet","810":"BLIP","811":"FastSGT","812":"FoveaBox","813":"Context Enhancement Module","814":"Spatial Attention Module (ThunderNet)","815":"Position-Sensitive RoIAlign","816":"SNet","817":"ThunderNet","818":"HTC","819":"SIG"},"description":{"0":"Causal inference is the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect. 
The main difference between causal inference and inference of association is that the former analyzes the response of the effect variable when the cause is changed.","1":"An **Autoencoder** is a bottleneck architecture that turns a high-dimensional input into a latent low-dimensional code (encoder), and then performs a reconstruction of the input with this latent code (the decoder).\r\n\r\nImage: [Michael Massi](https:\/\/en.wikipedia.org\/wiki\/Autoencoder#\/media\/File:Autoencoder_schema.png)","2":"**Linear discriminant analysis** (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.\r\n\r\nExtracted from [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Linear_discriminant_analysis)\r\n\r\n**Source**:\r\n\r\nPaper: [Linear Discriminant Analysis: A Detailed Tutorial](https:\/\/dx.doi.org\/10.3233\/AIC-170729)\r\n\r\nPublic version: [Linear Discriminant Analysis: A Detailed Tutorial](https:\/\/usir.salford.ac.uk\/id\/eprint\/52074\/)","3":"A **Support Vector Machine**, or **SVM**, is a non-parametric supervised learning model. For non-linear classification and regression, it utilises the kernel trick to map inputs to high-dimensional feature spaces. SVMs construct a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. The figure to the right shows the decision function for a linearly separable problem, with three samples on the margin boundaries, called \u201csupport vectors\u201d. \r\n\r\nSource: [scikit-learn](https:\/\/scikit-learn.org\/stable\/modules\/svm.html)","4":"**GloVe Embeddings** are a type of word embedding that encode the co-occurrence probability ratio between two words as vector differences. GloVe uses a weighted least squares objective $J$ that minimizes the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences:\r\n\r\n$$ J=\sum\_{i, j=1}^{V}f\left(X\_{ij}\right)(w^{T}\_{i}\tilde{w}_{j} + b\_{i} + \tilde{b}\_{j} - \log{X}\_{ij})^{2} $$\r\n\r\nwhere $w\_{i}$ and $b\_{i}$ are the word vector and bias respectively of word $i$, $\tilde{w}_{j}$ and $\tilde{b}\_{j}$ are the context word vector and bias respectively of word $j$, $X\_{ij}$ is the number of times word $i$ occurs in the context of word $j$, and $f$ is a weighting function that assigns lower weights to rare and frequent co-occurrences.","5":"**Residual Connections** are a type of skip-connection that learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. \r\n\r\nFormally, denoting the desired underlying mapping as $\mathcal{H}({x})$, we let the stacked nonlinear layers fit another mapping of $\mathcal{F}({x}):=\mathcal{H}({x})-{x}$. 
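\r\n\r\nAs a minimal sketch (assuming PyTorch; the wrapped block here is hypothetical), the connection just adds the block input back to the block output:\r\n\r\n```python\r\nimport torch\r\nimport torch.nn as nn\r\n\r\nclass ResidualWrapper(nn.Module):\r\n    # Wraps any shape-preserving block F and returns F(x) + x.\r\n    def __init__(self, block):\r\n        super().__init__()\r\n        self.block = block\r\n\r\n    def forward(self, x):\r\n        return self.block(x) + x\r\n\r\nres = ResidualWrapper(nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)))\r\ny = res(torch.randn(8, 64))  # output shape matches the input: (8, 64)\r\n```\r\n\r\n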
The original mapping is recast into $\mathcal{F}({x})+{x}$.\r\n\r\nThe intuition is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.","6":"**Attention Dropout** is a type of [dropout](https:\/\/paperswithcode.com\/method\/dropout) used in attention-based architectures, where elements are randomly dropped out of the [softmax](https:\/\/paperswithcode.com\/method\/softmax) in the attention equation. For example, for scaled dot-product attention, we would drop elements from the first term:\r\n\r\n$$ {\text{Attention}}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V $$","7":"**Linear Warmup With Linear Decay** is a learning rate schedule in which we increase the learning rate linearly for $n$ updates and then linearly decay afterwards.","8":"**Weight Decay**, or **$L_{2}$ Regularization**, is a regularization technique applied to the weights of a neural network. We minimize a loss function comprising both the primary loss function and a penalty on the $L\_{2}$ Norm of the weights:\r\n\r\n$$L\_{new}\left(w\right) = L\_{original}\left(w\right) + \lambda{w^{T}w}$$\r\n\r\nwhere $\lambda$ is a value determining the strength of the penalty (encouraging smaller weights). \r\n\r\nWeight decay can be incorporated directly into the weight update rule, rather than just implicitly by defining it through the objective function. Often weight decay refers to the implementation where we specify it directly in the weight update rule (whereas L2 regularization is usually the implementation which is specified in the objective function).\r\n\r\nImage Source: Deep Learning, Goodfellow et al","9":"The **Gaussian Error Linear Unit**, or **GELU**, is an activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their percentile, rather than gating inputs by their sign as in [ReLUs](https:\/\/paperswithcode.com\/method\/relu) ($x\mathbf{1}_{x>0}$). Consequently the GELU can be thought of as a smoother ReLU.\r\n\r\n$$\text{GELU}\left(x\right) = x{P}\left(X\leq{x}\right) = x\Phi\left(x\right) = x \cdot \frac{1}{2}\left[1 + \text{erf}(x\/\sqrt{2})\right],$$\r\nif $X\sim \mathcal{N}(0,1)$.\r\n\r\nOne can approximate the GELU with\r\n$0.5x\left(1+\tanh\left[\sqrt{2\/\pi}\left(x + 0.044715x^{3}\right)\right]\right)$ or $x\sigma\left(1.702x\right),$\r\nbut PyTorch's exact implementation is sufficiently fast such that these approximations may be unnecessary. (See also the [SiLU](https:\/\/paperswithcode.com\/method\/silu) $x\sigma(x)$ which was also coined in the paper that introduced the GELU.)\r\n\r\nGELUs are used in [GPT-3](https:\/\/paperswithcode.com\/method\/gpt-3), [BERT](https:\/\/paperswithcode.com\/method\/bert), and most other Transformers.","10":"**Dense Connections**, or **Fully Connected Connections**, are a type of layer in a deep neural network that use a linear operation where every input is connected to every output by a weight. 
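\r\n\r\nA minimal NumPy sketch of such a layer (shapes are illustrative):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef dense(h_prev, W, g=np.tanh):\r\n    # Every input connects to every output through the weight matrix W.\r\n    return g(W.T @ h_prev)\r\n\r\nW = np.random.randn(64, 32)         # 64 inputs x 32 outputs = 2048 weights\r\nh = dense(np.random.randn(64), W)   # h has shape (32,)\r\n```\r\n\r\n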
This means there are $n\_{\text{inputs}}*n\_{\text{outputs}}$ parameters, which can lead to a lot of parameters for a sizeable network.\r\n\r\n$$h\_{l} = g\left(\textbf{W}^{T}h\_{l-1}\right)$$\r\n\r\nwhere $g$ is an activation function.\r\n\r\nImage Source: Deep Learning by Goodfellow, Bengio and Courville","11":"**Adam** is an adaptive learning rate optimization algorithm that utilises both momentum and scaling, combining the benefits of [RMSProp](https:\/\/paperswithcode.com\/method\/rmsprop) and [SGD with Momentum](https:\/\/paperswithcode.com\/method\/sgd-with-momentum). The optimizer is designed to be appropriate for non-stationary objectives and problems with very noisy and\/or sparse gradients. \r\n\r\nThe weight updates are performed as:\r\n\r\n$$ w_{t} = w_{t-1} - \eta\frac{\hat{m}\_{t}}{\sqrt{\hat{v}\_{t}} + \epsilon} $$\r\n\r\nwith\r\n\r\n$$ \hat{m}\_{t} = \frac{m_{t}}{1-\beta^{t}_{1}} $$\r\n\r\n$$ \hat{v}\_{t} = \frac{v_{t}}{1-\beta^{t}_{2}} $$\r\n\r\n$$ m_{t} = \beta_{1}m_{t-1} + (1-\beta_{1})g_{t} $$\r\n\r\n$$ v_{t} = \beta_{2}v_{t-1} + (1-\beta_{2})g_{t}^{2} $$\r\n\r\n\r\n$ \eta $ is the step size\/learning rate, around 1e-3 in the original paper. $ \epsilon $ is a small number, typically 1e-8 or 1e-10, to prevent dividing by zero. $ \beta_{1} $ and $ \beta_{2} $ are forgetting parameters, with typical values 0.9 and 0.999, respectively.","12":"**WordPiece** is a subword segmentation algorithm used in natural language processing. The vocabulary is initialized with individual characters in the language, then the most frequent combinations of symbols in the vocabulary are iteratively added to the vocabulary. The process is:\r\n\r\n1. Initialize the word unit inventory with all the characters in the text.\r\n2. Build a language model on the training data using the inventory from 1.\r\n3. Generate a new word unit by combining two units out of the current word inventory to increment the word unit inventory by one. Choose the new word unit out of all the possible ones that increases the likelihood on the training data the most when added to the model.\r\n4. Go to 2 until a predefined limit of word units is reached or the likelihood increase falls below a certain threshold.\r\n\r\nText: [Source](https:\/\/stackoverflow.com\/questions\/55382596\/how-is-wordpiece-tokenization-helpful-to-effectively-deal-with-rare-words-proble\/55416944#55416944)\r\n\r\nImage: WordPiece as used in [BERT](https:\/\/paperswithcode.com\/method\/bert)","13":"The **Softmax** output function transforms a previous layer's output into a vector of probabilities. It is commonly used for multiclass classification. Given an input vector $x$ and a weighting vector $w$ we have:\r\n\r\n$$ P(y=j \mid{x}) = \frac{e^{x^{T}w_{j}}}{\sum^{K}_{k=1}e^{x^{T}w_{k}}} $$","14":"**Dropout** is a regularization technique for neural networks that drops a unit (along with connections) at training time with a specified probability $p$ (a common value is $p=0.5$). At test time, all units are present, but with weights scaled by $p$ (i.e. $w$ becomes $pw$).\r\n\r\nThe idea is to prevent co-adaptation, where the neural network becomes too reliant on particular connections, as this could be symptomatic of overfitting. Intuitively, dropout can be thought of as creating an implicit ensemble of neural networks.","15":"**Multi-head Attention** is a module for attention mechanisms which runs through an attention mechanism several times in parallel. 
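\r\n\r\nA minimal NumPy sketch (assuming scaled dot-product attention per head; all weight matrices are illustrative):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef softmax(x):\r\n    e = np.exp(x - x.max(axis=-1, keepdims=True))\r\n    return e \/ e.sum(axis=-1, keepdims=True)\r\n\r\ndef multi_head_attention(Q, K, V, heads, W_o):\r\n    # heads: list of (W_q, W_k, W_v) projection triples, one per head.\r\n    outs = []\r\n    for W_q, W_k, W_v in heads:\r\n        q, k, v = Q @ W_q, K @ W_k, V @ W_v\r\n        scores = softmax(q @ k.T \/ np.sqrt(q.shape[-1]))  # scaled dot-product\r\n        outs.append(scores @ v)\r\n    return np.concatenate(outs, axis=-1) @ W_o  # concatenate, then project\r\n```\r\n\r\n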
The independent attention outputs are then concatenated and linearly transformed into the expected dimension. Intuitively, multiple attention heads allow for attending to parts of the sequence differently (e.g. longer-term dependencies versus shorter-term dependencies). \r\n\r\n$$ \text{MultiHead}\left(\textbf{Q}, \textbf{K}, \textbf{V}\right) = \left[\text{head}\_{1},\dots,\text{head}\_{h}\right]\textbf{W}_{0}$$\r\n\r\n$$\text{where} \text{ head}\_{i} = \text{Attention} \left(\textbf{Q}\textbf{W}\_{i}^{Q}, \textbf{K}\textbf{W}\_{i}^{K}, \textbf{V}\textbf{W}\_{i}^{V} \right) $$\r\n\r\nAbove $\textbf{W}$ are all learnable parameter matrices.\r\n\r\nNote that [scaled dot-product attention](https:\/\/paperswithcode.com\/method\/scaled) is most commonly used in this module, although in principle it can be swapped out for other types of attention mechanism.\r\n\r\nSource: [Lilian Weng](https:\/\/lilianweng.github.io\/lil-log\/2018\/06\/24\/attention-attention.html#a-family-of-attention-mechanisms)","16":"Unlike [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization), **Layer Normalization** directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any new dependencies between training cases. It works well for [RNNs](https:\/\/paperswithcode.com\/methods\/category\/recurrent-neural-networks) and improves both the training time and the generalization performance of several existing RNN models. More recently, it has been used with [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers) models.\r\n\r\nWe compute the layer normalization statistics over all the hidden units in the same layer as follows:\r\n\r\n$$ \mu^{l} = \frac{1}{H}\sum^{H}\_{i=1}a\_{i}^{l} $$\r\n\r\n$$ \sigma^{l} = \sqrt{\frac{1}{H}\sum^{H}\_{i=1}\left(a\_{i}^{l}-\mu^{l}\right)^{2}} $$\r\n\r\nwhere $H$ denotes the number of hidden units in a layer. Under layer normalization, all the hidden units in a layer share the same normalization terms $\mu$ and $\sigma$, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of the mini-batch and it can be used in the pure online regime with batch size 1.","17":"**Scaled dot-product attention** is an attention mechanism where the dot products are scaled down by $\sqrt{d_k}$. Formally we have a query $Q$, a key $K$ and a value $V$ and calculate the attention as:\r\n\r\n$$ {\text{Attention}}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V $$\r\n\r\nIf we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean $0$ and variance $1$, then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_ik_i$, has mean $0$ and variance $d_k$. Since we would prefer these values to have variance $1$, we divide by $\sqrt{d_k}$.","18":"**BERT**, or Bidirectional Encoder Representations from Transformers, improves upon standard [Transformers](http:\/\/paperswithcode.com\/method\/transformer) by removing the unidirectionality constraint by using a *masked language model* (MLM) pre-training objective. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. 
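\r\n\r\nA minimal sketch of the masking step (hypothetical token ids; BERT masks about 15% of tokens, and the paper additionally replaces some selections with random or unchanged tokens rather than the mask token, omitted here for brevity):\r\n\r\n```python\r\nimport random\r\n\r\nMASK_ID = 103  # illustrative id for the mask token\r\n\r\ndef mask_tokens(token_ids, p=0.15):\r\n    # Returns masked inputs and the (position, original id) prediction targets.\r\n    inputs, targets = list(token_ids), []\r\n    for i, t in enumerate(token_ids):\r\n        if random.random() < p:\r\n            inputs[i] = MASK_ID\r\n            targets.append((i, t))\r\n    return inputs, targets\r\n```\r\n\r\n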
Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, BERT uses a *next sentence prediction* task that jointly pre-trains text-pair representations. \r\n\r\nThere are two steps in BERT: *pre-training* and *fine-tuning*. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters.","19":"**Absolute Position Encodings** are a type of position embeddings for [Transformer](https:\/\/paperswithcode.com\/method\/transformer)-based models where positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d\_{model}$ as the embeddings, so that the two can be summed. In the original implementation, sine and cosine functions of different frequencies are used:\r\n\r\n$$ \text{PE}\left(pos, 2i\right) = \sin\left(pos\/10000^{2i\/d\_{model}}\right) $$\r\n\r\n$$ \text{PE}\left(pos, 2i+1\right) = \cos\left(pos\/10000^{2i\/d\_{model}}\right) $$\r\n\r\nwhere $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. This function was chosen because the authors hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $\text{PE}\_{pos+k}$ can be represented as a linear function of $\text{PE}\_{pos}$.\r\n\r\nImage Source: [D2L.ai](https:\/\/d2l.ai\/chapter_attention-mechanisms\/self-attention-and-positional-encoding.html)","20":"**Position-Wise Feed-Forward Layer** is a type of [feedforward layer](https:\/\/www.paperswithcode.com\/method\/category\/feedforwad-networks) consisting of two [dense layers](https:\/\/www.paperswithcode.com\/method\/dense-connections) that apply to the last dimension, which means the same dense layers are used for each position item in the sequence, so-called position-wise.","21":"**Byte Pair Encoding**, or **BPE**, is a subword segmentation algorithm that encodes rare and unknown words as sequences of subword units. The intuition is that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).\r\n\r\n[Lei Mao](https:\/\/leimao.github.io\/blog\/Byte-Pair-Encoding\/) has a detailed blog post that explains how this works.","22":"**Label Smoothing** is a regularization technique that introduces noise for the labels. This accounts for the fact that datasets may have mistakes in them, so maximizing the likelihood of $\log{p}\left(y\mid{x}\right)$ directly can be harmful. Assume for a small constant $\epsilon$, the training set label $y$ is correct with probability $1-\epsilon$ and incorrect otherwise. 
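\r\n\r\nUnder this scheme the hard one-hot targets become soft targets; a minimal NumPy sketch (the eps value is illustrative):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef smooth_targets(y, k, eps=0.1):\r\n    # eps \/ (k - 1) mass on each wrong class, 1 - eps on the true class y.\r\n    t = np.full(k, eps \/ (k - 1))\r\n    t[y] = 1.0 - eps\r\n    return t\r\n\r\nprint(smooth_targets(y=2, k=4))  # approx [0.033 0.033 0.9 0.033]\r\n```\r\n\r\n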
Label Smoothing regularizes a model based on a [softmax](https:\/\/paperswithcode.com\/method\/softmax) with $k$ output values by replacing the hard $0$ and $1$ classification targets with targets of $\frac{\epsilon}{k-1}$ and $1-\epsilon$ respectively.\r\n\r\nSource: Deep Learning, Goodfellow et al\r\n\r\nImage Source: [When Does Label Smoothing Help?](https:\/\/arxiv.org\/abs\/1906.02629)","23":"**Rectified Linear Units**, or **ReLUs**, are a type of activation function that are linear in the positive dimension, but zero in the negative dimension. The kink in the function is the source of the non-linearity. Linearity in the positive dimension has the attractive property that it prevents saturation of gradients (contrast with [sigmoid activations](https:\/\/paperswithcode.com\/method\/sigmoid-activation)), although for half of the real line its gradient is zero.\r\n\r\n$$ f\left(x\right) = \max\left(0, x\right) $$","24":"A **Transformer** is a model architecture that eschews recurrence and instead relies entirely on an [attention mechanism](https:\/\/paperswithcode.com\/methods\/category\/attention-mechanisms-1) to draw global dependencies between input and output. Before Transformers, the dominant sequence transduction models were based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The Transformer also employs an encoder and decoder, but removing recurrence in favor of [attention mechanisms](https:\/\/paperswithcode.com\/methods\/category\/attention-mechanisms-1) allows for significantly more parallelization than methods like [RNNs](https:\/\/paperswithcode.com\/methods\/category\/recurrent-neural-networks) and [CNNs](https:\/\/paperswithcode.com\/methods\/category\/convolutional-neural-networks).","25":"A **convolution** is a type of matrix operation, consisting of a kernel, a small matrix of weights, that slides over input data performing element-wise multiplication with the part of the input it is on, then summing the results into an output.\r\n\r\nIntuitively, a convolution allows for weight sharing - reducing the number of effective parameters - and translation equivariance (allowing for the same feature to be detected in different parts of the input space).\r\n\r\nImage Source: [https:\/\/arxiv.org\/pdf\/1603.07285.pdf](https:\/\/arxiv.org\/pdf\/1603.07285.pdf)","26":"**Dilated Convolutions** are a type of [convolution](https:\/\/paperswithcode.com\/method\/convolution) that \u201cinflate\u201d the kernel by inserting holes between the kernel elements. An additional parameter $l$ (dilation rate) indicates how much the kernel is widened. There are usually $l-1$ spaces inserted between kernel elements. \r\n\r\nNote that the concept has existed in past literature under different names, for instance the *algorithme a trous*, an algorithm for wavelet decomposition (Holschneider et al., 1987; Shensa, 1992).","27":"**Principal Component Analysis (PCA)** is an unsupervised method primarily used for dimensionality reduction within machine learning. PCA is calculated via a singular value decomposition (SVD) of the design matrix, or alternatively, by calculating the covariance matrix of the data and performing eigenvalue decomposition on the covariance matrix. 
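\r\n\r\nA minimal NumPy sketch of the SVD route (center, decompose, project):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef pca(X, n_components):\r\n    # Center the data, then the right singular vectors give the components.\r\n    Xc = X - X.mean(axis=0)\r\n    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)\r\n    return Xc @ Vt[:n_components].T  # projections onto leading components\r\n\r\nZ = pca(np.random.randn(100, 5), n_components=2)  # Z has shape (100, 2)\r\n```\r\n\r\n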
The results of PCA provide a low-dimensional picture of the structure of the data and the leading (uncorrelated) latent factors determining variation in the data.\r\n\r\nImage Source: [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Principal_component_analysis#\/media\/File:GaussianScatterPCA.svg)","28":"A Graph Convolutional Network, or GCN, is an approach for semi-supervised learning on graph-structured data. It is based on an efficient variant of convolutional neural networks which operate directly on graphs.\r\n\r\nImage source: [Semi-Supervised Classification with Graph Convolutional Networks](https:\/\/arxiv.org\/pdf\/1609.02907v4.pdf)","29":"**Tanh Activation** is an activation function used for neural networks:\r\n\r\n$$f\\left(x\\right) = \\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$\r\n\r\nHistorically, the tanh function became preferred over the [sigmoid function](https:\/\/paperswithcode.com\/method\/sigmoid-activation) as it gave better performance for multi-layer neural networks. But it did not solve the vanishing gradient problem that sigmoids suffered, which was tackled more effectively with the introduction of [ReLU](https:\/\/paperswithcode.com\/method\/relu) activations.\r\n\r\nImage Source: [Junxi Feng](https:\/\/www.researchgate.net\/profile\/Junxi_Feng)","30":"**Sigmoid Activations** are a type of activation function for neural networks:\r\n\r\n$$f\\left(x\\right) = \\frac{1}{\\left(1+\\exp\\left(-x\\right)\\right)}$$\r\n\r\nSome drawbacks of this activation that have been noted in the literature are: sharp damp gradients during backpropagation from deeper hidden layers to inputs, gradient saturation, and slow convergence.","31":"An **LSTM** is a type of [recurrent neural network](https:\/\/paperswithcode.com\/methods\/category\/recurrent-neural-networks) that addresses the vanishing gradient problem in vanilla RNNs through additional cells, input and output gates. Intuitively, vanishing gradients are solved through additional *additive* components, and forget gate activations, that allow the gradients to flow through the network without vanishing as quickly.\r\n\r\n(Image Source [here](https:\/\/medium.com\/datadriveninvestor\/how-do-lstm-networks-solve-the-problem-of-vanishing-gradients-a6784971a577))\r\n\r\n(Introduced by Hochreiter and Schmidhuber)","32":"A **Bidirectional LSTM**, or **biLSTM**, is a sequence processing model that consists of two LSTMs: one taking the input in a forward direction, and the other in a backwards direction. BiLSTMs effectively increase the amount of information available to the network, improving the context available to the algorithm (e.g. knowing what words immediately follow *and* precede a word in a sentence).\r\n\r\nImage Source: Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks, Cornegruta et al","33":"**Embeddings from Language Models**, or **ELMo**, is a type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.\r\n\r\nA biLM combines both a forward and backward LM. ELMo jointly maximizes the log likelihood of the forward and backward directions. 
To add ELMo to a supervised model, we freeze the weights of the biLM and then concatenate the ELMo vector $\textbf{ELMO}^{task}_k$ with $\textbf{x}_k$ and pass the ELMo-enhanced representation $[\textbf{x}_k; \textbf{ELMO}^{task}_k]$ into the task RNN. Here $\textbf{x}_k$ is a context-independent token representation for each token position. \r\n\r\nImage Source: [here](https:\/\/medium.com\/@duyanhnguyen_38925\/create-a-strong-text-classification-with-the-help-from-elmo-e90809ba29da)","34":"A **Region Proposal Network**, or **RPN**, is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals. RPN and algorithms like [Fast R-CNN](https:\/\/paperswithcode.com\/method\/fast-r-cnn) can be merged into a single network by sharing their convolutional features - using the recently popular terminology of neural networks with attention mechanisms, the RPN component tells the unified network where to look.\r\n\r\nRPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. RPNs use anchor boxes that serve as references at multiple scales and aspect ratios. The scheme can be thought of as a pyramid of regression references, which avoids enumerating images or filters of multiple scales or aspect ratios.","35":"**Region of Interest Pooling**, or **RoIPool**, is an operation for extracting a small feature map (e.g., $7\u00d77$) from each RoI in detection and segmentation based tasks. Features are extracted from each candidate box and thereafter, in models like [Fast R-CNN](https:\/\/paperswithcode.com\/method\/fast-r-cnn), classified and refined via bounding box regression.\r\n\r\nThe actual scaling to, e.g., $7\u00d77$, occurs by dividing the region proposal into equally sized sections, finding the largest value in each section, and then copying these max values to the output buffer. In essence, **RoIPool** is [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) on a discrete grid based on a box.\r\n\r\nImage Source: [Joyce Xu](https:\/\/towardsdatascience.com\/deep-learning-for-object-detection-a-comprehensive-review-73930816d8d9)","36":"**Faster R-CNN** is an object detection model that improves on [Fast R-CNN](https:\/\/paperswithcode.com\/method\/fast-r-cnn) by utilising a region proposal network ([RPN](https:\/\/paperswithcode.com\/method\/rpn)) with the CNN model. The RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. It is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by [Fast R-CNN](https:\/\/paperswithcode.com\/method\/fast-r-cnn) for detection. RPN and Fast [R-CNN](https:\/\/paperswithcode.com\/method\/r-cnn) are merged into a single network by sharing their convolutional features: the RPN component tells the unified network where to look.\r\n\r\nAs a whole, Faster R-CNN consists of two modules. 
The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions.","37":"A **GAN**, or **Generative Adversarial Network**, is a generative model that simultaneously trains two models: a generative model $G$ that captures the data distribution, and a discriminative model $D$ that estimates the probability that a sample came from the training data rather than $G$.\r\n\r\nThe training procedure for $G$ is to maximize the probability of $D$ making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions $G$ and $D$, a unique solution exists, with $G$ recovering the training data distribution and $D$ equal to $\frac{1}{2}$ everywhere. In the case where $G$ and $D$ are defined by multilayer perceptrons, the entire system can be trained with backpropagation. \r\n\r\n(Image Source: [here](http:\/\/www.kdnuggets.com\/2017\/01\/generative-adversarial-networks-hot-topic-machine-learning.html))","38":"A **Concatenated Skip Connection** is a type of skip connection that seeks to reuse features by concatenating them to new layers, allowing more information to be retained from previous layers of the network. This contrasts with, say, residual connections, where element-wise summation is used instead to incorporate information from previous layers. This type of skip connection is prominently used in DenseNets (and also Inception networks), which the Figure to the right illustrates.","39":"**Max Pooling** is a pooling operation that calculates the maximum value for patches of a feature map, and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer. It adds a small amount of translation invariance - meaning translating the image by a small amount does not significantly affect the values of most pooled outputs.\r\n\r\nImage Source: [here](https:\/\/computersciencewiki.org\/index.php\/File:MaxpoolSample2.png)","40":"**U-Net** is an architecture for semantic segmentation. It consists of a contracting path and an expansive path. The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit ([ReLU](https:\/\/paperswithcode.com\/method\/relu)) and a 2x2 [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels. Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 [convolution](https:\/\/paperswithcode.com\/method\/convolution) (\u201cup-convolution\u201d) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer a [1x1 convolution](https:\/\/paperswithcode.com\/method\/1x1-convolution) is used to map each 64-component feature vector to the desired number of classes. 
In total the network has 23 convolutional layers.\r\n\r\n[Original MATLAB Code](https:\/\/lmb.informatik.uni-freiburg.de\/people\/ronneber\/u-net\/u-net-release-2015-10-02.tar.gz)","41":"**Interpretability** methods aim to explain a model's predictions in terms understandable to humans, for example by attributing a prediction to the input features that most influenced it.","42":"**CurricularFace**, or **Adaptive Curriculum Learning**, is a method for face recognition that embeds the idea of curriculum learning into the loss function to achieve a new training scheme. This training scheme mainly addresses easy samples in the early training stage and hard ones in the later stage. Specifically, CurricularFace adaptively adjusts the relative importance of easy and hard samples during different training stages.","43":"**Weight Normalization** is a normalization method for training neural networks. It is inspired by [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization), but it is a deterministic method that does not share batch normalization's property of adding noise to the gradients. It reparameterizes each weight vector $\textbf{w}$ in terms of a parameter vector $\textbf{v}$ and a scalar parameter $g$, and performs stochastic gradient descent with respect to those parameters instead. Weight vectors are expressed in terms of the new parameters using:\r\n\r\n$$ \textbf{w} = \frac{g}{\Vert\textbf{v}\Vert}\textbf{v}$$\r\n\r\nwhere $\textbf{v}$ is a $k$-dimensional vector, $g$ is a scalar, and $\Vert\textbf{v}\Vert$ denotes the Euclidean norm of $\textbf{v}$. This reparameterization has the effect of fixing the Euclidean norm of the weight vector $\textbf{w}$: we now have $\Vert\textbf{w}\Vert = g$, independent of the parameters $\textbf{v}$.","44":"**$L_{1}$ Regularization** is a regularization technique applied to the weights of a neural network. We minimize a loss function comprising both the primary loss function and a penalty on the $L\_{1}$ Norm of the weights:\r\n\r\n$$L\_{new}\left(w\right) = L\_{original}\left(w\right) + \lambda{||w||}\_{1}$$\r\n\r\nwhere $\lambda$ is a value determining the strength of the penalty. In contrast to [weight decay](https:\/\/paperswithcode.com\/method\/weight-decay), $L_{1}$ regularization promotes sparsity; i.e. some parameters have an optimal value of zero.\r\n\r\nImage Source: [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Regularization_(mathematics)#\/media\/File:Sparsityl1.png)","45":"**Softsign** is an activation function for neural networks:\r\n\r\n$$ f\left(x\right) = \left(\frac{x}{|x|+1}\right)$$\r\n\r\nImage Source: [Sefik Ilkin Serengil](https:\/\/sefiks.com\/2017\/11\/10\/softsign-as-a-neural-networks-activation-function\/)","46":"**Leaky Rectified Linear Unit**, or **Leaky ReLU**, is a type of activation function based on a [ReLU](https:\/\/paperswithcode.com\/method\/relu), but it has a small slope for negative values instead of a flat slope. The slope coefficient is determined before training, i.e. it is not learnt during training. This type of activation function is popular in tasks where we may suffer from sparse gradients, for example training generative adversarial networks.","47":"A **Gated Linear Unit**, or **GLU**, computes:\r\n\r\n$$ \text{GLU}\left(a, b\right) = a\otimes \sigma\left(b\right) $$\r\n\r\nIt is used in natural language processing architectures, for example the [Gated CNN](https:\/\/paperswithcode.com\/method\/gated-convolution-network), because here $b$ is the gate that controls what information from $a$ is passed up to the following layer. 
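\r\n\r\nA minimal NumPy sketch (sigmoid gate applied element-wise):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef glu(x):\r\n    # Split the input in half: values a, gate b; output a * sigmoid(b).\r\n    a, b = np.split(x, 2, axis=-1)\r\n    return a * (1.0 \/ (1.0 + np.exp(-b)))\r\n\r\ny = glu(np.random.randn(10, 128))  # y has shape (10, 64)\r\n```\r\n\r\n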
Intuitively, for a language modeling task, the gating mechanism allows selection of words or features that are important for predicting the next word. The GLU also has non-linear capabilities, but has a linear path for the gradient so it diminishes the vanishing gradient problem.","48":"**Normalizing Flows** are a method for constructing complex distributions by transforming a probability density through a series of invertible mappings. By repeatedly applying the rule for change of variables, the initial density \u2018flows\u2019 through the sequence of invertible mappings. At the end of this sequence we obtain a valid probability distribution and hence this type of flow is referred to as a normalizing flow.\r\n\r\nIn the case of finite flows, the basic rule for the transformation of densities considers an invertible, smooth mapping $f : \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}$ with inverse $f^{-1} = g$, i.e. the composition $g \circ f\left(z\right) = z$. If we use this mapping to transform a random variable $z$ with distribution $q\left(z\right)$, the resulting random variable $z' = f\left(z\right)$ has a distribution:\r\n\r\n$$ q\left(\mathbf{z}'\right) = q\left(\mathbf{z}\right)\bigl\vert{\text{det}}\frac{\delta{f}^{-1}}{\delta{\mathbf{z'}}}\bigr\vert = q\left(\mathbf{z}\right)\bigl\vert{\text{det}}\frac{\delta{f}}{\delta{\mathbf{z}}}\bigr\vert ^{-1} $$\r\n\r\nwhere the last equality can be seen by applying the chain rule (inverse function theorem) and is a property of Jacobians of invertible functions. We can construct arbitrarily complex densities by composing several simple maps and successively applying the above equation. The density $q\_{K}\left(\mathbf{z}\right)$ obtained by successively transforming a random variable $z\_{0}$ with distribution $q\_{0}$ through a chain of $K$ transformations $f\_{k}$ is:\r\n\r\n$$ z\_{K} = f\_{K} \circ \dots \circ f\_{2} \circ f\_{1}\left(z\_{0}\right) $$\r\n\r\n$$ \ln{q}\_{K}\left(z\_{K}\right) = \ln{q}\_{0}\left(z\_{0}\right) - \sum^{K}\_{k=1}\ln\vert\det\frac{\delta{f\_{k}}}{\delta{\mathbf{z\_{k-1}}}}\vert $$\r\n\r\nThe path traversed by the random variables $z\_{k} = f\_{k}\left(z\_{k-1}\right)$ with initial distribution $q\_{0}\left(z\_{0}\right)$ is called the flow and the path formed by the successive distributions $q\_{k}$ is a normalizing flow.","49":"**DV3 Attention Block** is an attention-based module used in the [Deep Voice 3](https:\/\/paperswithcode.com\/method\/deep-voice-3) architecture. It uses a [dot-product attention](https:\/\/paperswithcode.com\/method\/dot-product-attention) mechanism. A query vector (the hidden states of the decoder) and the per-timestep key vectors from the encoder are used to compute attention weights. This then outputs a context vector computed as the weighted average of the value vectors.","50":"**DV3 Convolution Block** is a convolutional block used for the [Deep Voice 3](https:\/\/paperswithcode.com\/method\/deep-voice-3) text-to-speech architecture. It consists of a 1-D [convolution](https:\/\/paperswithcode.com\/method\/convolution) with a gated linear unit and a [residual connection](https:\/\/paperswithcode.com\/method\/residual-connection). In the Figure, $c$ denotes the dimensionality of the input. The convolution output of size $2 \cdot c$ is split into equal-sized portions: the gate vector and the input vector. 
A scaling factor $\\sqrt{0.5}$ is used to ensure that we preserve the input variance early in training. The gated linear unit provides a linear path for the gradient flow, which alleviates the vanishing gradient issue for stacked convolution blocks while retaining non-linearity. To introduce speaker-dependent control, a speaker-dependent embedding is added as a bias to the convolution filter output, after a softsign function. The authors use the softsign nonlinearity because it limits the range of the output while also avoiding the saturation problem that exponential based nonlinearities sometimes exhibit. Convolution filter weights are initialized with zero-mean and unit-variance activations throughout the entire network.","51":"**Bridge-net** is an audio model block used in the [ClariNet](https:\/\/paperswithcode.com\/method\/clarinet) text-to-speech architecture. Bridge-net maps frame-level hidden representation to sample-level through several [convolution](https:\/\/paperswithcode.com\/method\/convolution) blocks and [transposed convolution](https:\/\/paperswithcode.com\/method\/transposed-convolution) layers interleaved with softsign non-linearities.","52":"**ClariNet** is an end-to-end text-to-speech architecture. Unlike previous TTS systems which use text-to-spectogram models with a separate waveform [synthesizer](https:\/\/paperswithcode.com\/method\/synthesizer) (vocoder), ClariNet is a text-to-wave architecture that is fully convolutional and can be trained from scratch. In ClariNet, the [WaveNet](https:\/\/paperswithcode.com\/method\/wavenet) module is conditioned on the hidden states instead of the mel-spectogram. The architecture is otherwise based on [Deep Voice 3](https:\/\/paperswithcode.com\/method\/deep-voice-3).","53":"**Mixture of Logistic Distributions (MoL)** is a type of output function, and an alternative to a [softmax](https:\/\/paperswithcode.com\/method\/softmax) layer. Discretized logistic mixture likelihood is used in [PixelCNN](https:\/\/paperswithcode.com\/method\/pixelcnn)++ and [WaveNet](https:\/\/paperswithcode.com\/method\/wavenet) to predict discrete values.\r\n\r\nImage Credit: [Hao Gao](https:\/\/medium.com\/@smallfishbigsea\/an-explanation-of-discretized-logistic-mixture-likelihood-bdfe531751f0)","54":"A **Dilated Causal Convolution** is a [causal convolution](https:\/\/paperswithcode.com\/method\/causal-convolution) where the filter is applied over an area larger than its length by skipping input values with a certain step. A dilated causal [convolution](https:\/\/paperswithcode.com\/method\/convolution) effectively allows the network to have very large receptive fields with just a few layers.","55":"**WaveNet** is an audio generative model based on the [PixelCNN](https:\/\/paperswithcode.com\/method\/pixelcnn) architecture. 
To deal with the long-range temporal dependencies needed for raw audio generation, the architecture stacks such dilated causal convolutions, which exhibit very large receptive fields.\r\n\r\nThe joint probability of a waveform $\\vec{x} = \\{ x_1, \\dots, x_T \\}$ is factorised as a product of conditional probabilities as follows:\r\n\r\n$$p\\left(\\vec{x}\\right) = \\prod_{t=1}^{T} p\\left(x_t \\mid x_1, \\dots ,x_{t-1}\\right)$$\r\n\r\nEach audio sample $x_t$ is therefore conditioned on the samples at all previous timesteps.","56":"**Dot-Product Attention** is an attention mechanism where the alignment score function is calculated as: \r\n\r\n$$f_{att}\\left(\\textbf{h}_{i}, \\textbf{s}\\_{j}\\right) = h\\_{i}^{T}s\\_{j}$$\r\n\r\nIt is equivalent to [multiplicative attention](https:\/\/paperswithcode.com\/method\/multiplicative-attention) (without a trainable weight matrix, assuming this is instead an identity matrix). Here $\\textbf{h}$ refers to the hidden states for the encoder, and $\\textbf{s}$ is the hidden states for the decoder. The function above is thus a type of alignment score function. \r\n\r\nWithin a neural network, once we have the alignment scores, we calculate the final scores\/weights using a [softmax](https:\/\/paperswithcode.com\/method\/softmax) function of these alignment scores (ensuring it sums to 1).","57":"A **Graph Convolutional Network**, or **GCN**, is an approach for semi-supervised learning on graph-structured data. It is based on an efficient variant of [convolutional neural networks](https:\/\/paperswithcode.com\/methods\/category\/convolutional-neural-networks) which operate directly on graphs. The choice of convolutional architecture is motivated via a localized first-order approximation of spectral graph convolutions. The model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes.","58":"**Batch Normalization** aims to reduce internal covariate shift, and in doing so aims to accelerate the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows for use of much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for [Dropout](https:\/\/paperswithcode.com\/method\/dropout).\r\n\r\nWe apply a batch normalization layer as follows for a minibatch $\\mathcal{B}$:\r\n\r\n$$ \\mu\\_{\\mathcal{B}} = \\frac{1}{m}\\sum^{m}\\_{i=1}x\\_{i} $$\r\n\r\n$$ \\sigma^{2}\\_{\\mathcal{B}} = \\frac{1}{m}\\sum^{m}\\_{i=1}\\left(x\\_{i}-\\mu\\_{\\mathcal{B}}\\right)^{2} $$\r\n\r\n$$ \\hat{x}\\_{i} = \\frac{x\\_{i} - \\mu\\_{\\mathcal{B}}}{\\sqrt{\\sigma^{2}\\_{\\mathcal{B}}+\\epsilon}} $$\r\n\r\n$$ y\\_{i} = \\gamma\\hat{x}\\_{i} + \\beta = \\text{BN}\\_{\\gamma, \\beta}\\left(x\\_{i}\\right) $$\r\n\r\nwhere $\\gamma$ and $\\beta$ are learnable parameters.","59":"**TuckER** is a linear model for link prediction on knowledge graphs, based on the Tucker decomposition of the binary tensor of known facts.","60":"**Average Pooling** is a pooling operation that calculates the average value for patches of a feature map, and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer. 
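\r\n\r\nA minimal NumPy sketch of non-overlapping average pooling (illustrative; assumes the spatial dimensions are divisible by the pool size $k$):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef average_pool_2d(x, k=2):\r\n    # Each output value is the mean of a k x k patch of the input feature map.\r\n    h, w = x.shape\r\n    return x.reshape(h \/\/ k, k, w \/\/ k, k).mean(axis=(1, 3))\r\n\r\nfmap = np.arange(16.0).reshape(4, 4)\r\nprint(average_pool_2d(fmap))  # [[ 2.5  4.5]\r\n                              #  [10.5 12.5]]\r\n```\r\n\r\n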
It adds a small amount of translation invariance - meaning translating the image by a small amount does not significantly affect the values of most pooled outputs. It extracts features more smoothly than [Max Pooling](https:\/\/paperswithcode.com\/method\/max-pooling), whereas max pooling extracts more pronounced features like edges.\r\n\r\nImage Source: [here](https:\/\/www.researchgate.net\/figure\/Illustration-of-Max-Pooling-and-Average-Pooling-Figure-2-above-shows-an-example-of-max_fig2_333593451)","61":"A **1 x 1 Convolution** is a [convolution](https:\/\/paperswithcode.com\/method\/convolution) with some special properties in that it can be used for dimensionality reduction, efficient low dimensional embeddings, and applying non-linearity after convolutions. It maps an input pixel with all its channels to an output pixel which can be squeezed to a desired output depth. It can be viewed as an [MLP](https:\/\/paperswithcode.com\/method\/feedforward-network) looking at a particular pixel location.\r\n\r\nImage Credit: [http:\/\/deeplearning.ai](http:\/\/deeplearning.ai)","62":"A **Bottleneck Residual Block** is a variant of the [residual block](https:\/\/paperswithcode.com\/method\/residual-block) that utilises 1x1 convolutions to create a bottleneck. The use of a bottleneck reduces the number of parameters and matrix multiplications. The idea is to make residual blocks as thin as possible to increase depth while using fewer parameters. They were introduced as part of the [ResNet](https:\/\/paperswithcode.com\/method\/resnet) architecture, and are used as part of deeper ResNets such as ResNet-50 and ResNet-101.","63":"**Global Average Pooling** is a pooling operation designed to replace fully connected layers in classical CNNs. The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the [softmax](https:\/\/paperswithcode.com\/method\/softmax) layer. \r\n\r\nOne advantage of global [average pooling](https:\/\/paperswithcode.com\/method\/average-pooling) over the fully connected layers is that it is more native to the [convolution](https:\/\/paperswithcode.com\/method\/convolution) structure by enforcing correspondences between feature maps and categories. Thus the feature maps can be easily interpreted as categories confidence maps. Another advantage is that there is no parameter to optimize in the global average pooling thus overfitting is avoided at this layer. Furthermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input.","64":"**Residual Blocks** are skip-connection blocks that learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. They were introduced as part of the [ResNet](https:\/\/paperswithcode.com\/method\/resnet) architecture.\r\n \r\nFormally, denoting the desired underlying mapping as $\\mathcal{H}({x})$, we let the stacked nonlinear layers fit another mapping of $\\mathcal{F}({x}):=\\mathcal{H}({x})-{x}$. The original mapping is recast into $\\mathcal{F}({x})+{x}$. The $\\mathcal{F}({x})$ acts like a residual, hence the name 'residual block'.\r\n\r\nThe intuition is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. 
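\r\n\r\nA minimal sketch of the forward computation (illustrative; real ResNet blocks also include batch normalization, omitted here for brevity):\r\n\r\n```python\r\nimport torch.nn as nn\r\n\r\nclass ResidualBlock(nn.Module):\r\n    # Two 3x3 convolutions form F(x); the skip connection adds x back.\r\n    def __init__(self, channels):\r\n        super().__init__()\r\n        self.f = nn.Sequential(\r\n            nn.Conv2d(channels, channels, 3, padding=1),\r\n            nn.ReLU(),\r\n            nn.Conv2d(channels, channels, 3, padding=1),\r\n        )\r\n        self.relu = nn.ReLU()\r\n\r\n    def forward(self, x):\r\n        return self.relu(self.f(x) + x)  # F(x) + x\r\n```\r\n\r\n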
To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers. Having skip connections allows the network to more easily learn identity-like mappings.\r\n\r\nNote that in practice, [Bottleneck Residual Blocks](https:\/\/paperswithcode.com\/method\/bottleneck-residual-block) are used for deeper ResNets, such as ResNet-50 and ResNet-101, as these bottleneck blocks are less computationally intensive.","65":"**Kaiming Initialization**, or **He Initialization**, is an initialization method for neural networks that takes into account the non-linearity of activation functions, such as [ReLU](https:\/\/paperswithcode.com\/method\/relu) activations.\r\n\r\nA proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially. Using a derivation they work out that the condition to stop this happening is:\r\n\r\n$$\\frac{1}{2}n\\_{l}\\text{Var}\\left[w\\_{l}\\right] = 1 $$\r\n\r\nThis implies an initialization scheme of:\r\n\r\n$$ w\\_{l} \\sim \\mathcal{N}\\left(0, 2\/n\\_{l}\\right)$$\r\n\r\nThat is, a zero-centered Gaussian with standard deviation of $\\sqrt{2\/{n}\\_{l}}$ (variance shown in equation above). Biases are initialized at $0$.","66":"**Residual Networks**, or **ResNets**, learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. Instead of hoping each few stacked layers directly fit a desired underlying mapping, residual nets let these layers fit a residual mapping. They stack [residual blocks](https:\/\/paperswithcode.com\/method\/residual-block) on top of each other to form a network: e.g. a ResNet-50 has fifty layers using these blocks. \r\n\r\nFormally, denoting the desired underlying mapping as $\\mathcal{H}(x)$, we let the stacked nonlinear layers fit another mapping of $\\mathcal{F}(x):=\\mathcal{H}(x)-x$. The original mapping is recast into $\\mathcal{F}(x)+x$.\r\n\r\nThere is empirical evidence that these types of network are easier to optimize, and can gain accuracy from considerably increased depth.","67":"**Q-Learning** is an off-policy temporal difference control algorithm:\r\n\r\n$$Q\\left(S\\_{t}, A\\_{t}\\right) \\leftarrow Q\\left(S\\_{t}, A\\_{t}\\right) + \\alpha\\left[R_{t+1} + \\gamma\\max\\_{a}Q\\left(S\\_{t+1}, a\\right) - Q\\left(S\\_{t}, A\\_{t}\\right)\\right] $$\r\n\r\nThe learned action-value function $Q$ directly approximates $q\\_{*}$, the optimal action-value function, independent of the policy being followed.\r\n\r\nSource: Sutton and Barto, Reinforcement Learning, 2nd Edition","68":"**1D Convolutional Neural Networks** are similar to the well-known and more established 2D Convolutional Neural Networks. They are mainly used on text and 1D signals.","69":"**XLM** is a [Transformer](https:\/\/paperswithcode.com\/method\/transformer) based architecture that is pre-trained using one of three language modelling objectives:\r\n\r\n1. Causal Language Modeling - models the probability of a word given the previous words in a sentence.\r\n2. Masked Language Modeling - the masked language modeling objective of [BERT](https:\/\/paperswithcode.com\/method\/bert).\r\n3. 
Translation Language Modeling - a (new) translation language modeling objective for improving cross-lingual pre-training.\r\n\r\nThe authors find that both the CLM and MLM approaches provide strong cross-lingual features that can be used for pretraining models.","70":"**Cosine Annealing** is a type of learning rate schedule that has the effect of starting with a large learning rate that is relatively rapidly decreased to a minimum value before being increased rapidly again. The resetting of the learning rate acts like a simulated restart of the learning process and the re-use of good weights as the starting point of the restart is referred to as a \"warm restart\" in contrast to a \"cold restart\" where a new set of small random numbers may be used as a starting point.\r\n\r\n$$\\eta\\_{t} = \\eta\\_{min}^{i} + \\frac{1}{2}\\left(\\eta\\_{max}^{i}-\\eta\\_{min}^{i}\\right)\\left(1+\\cos\\left(\\frac{T\\_{cur}}{T\\_{i}}\\pi\\right)\\right)\r\n$$\r\n\r\nwhere $\\eta\\_{min}^{i}$ and $\\eta\\_{max}^{i}$ are ranges for the learning rate, and $T\\_{cur}$ accounts for how many epochs have been performed since the last restart.\r\n\r\nText Source: [Jason Brownlee](https:\/\/machinelearningmastery.com\/snapshot-ensemble-deep-learning-neural-network\/)\r\n\r\nImage Source: [Gao Huang](https:\/\/www.researchgate.net\/figure\/Training-loss-of-100-layer-DenseNet-on-CIFAR10-using-standard-learning-rate-blue-and-M_fig2_315765130)","71":"**Strided Attention** is a factorized attention pattern that has one head attend to the previous $l$ locations, and the other head attend to every $l$th location, where $l$ is the stride and chosen to be close to $\\sqrt{n}$. It was proposed as part of the [Sparse Transformer](https:\/\/paperswithcode.com\/method\/sparse-transformer) architecture.\r\n\r\nA self-attention layer maps a matrix of input embeddings $X$ to an output matrix and is parameterized by a connectivity pattern $S = \\text{set}\\left(S\\_{1}, \\dots, S\\_{n}\\right)$, where $S\\_{i}$ denotes the set of indices of the input vectors to which the $i$th output vector attends. The output vector is a weighted sum of transformations of the input vectors:\r\n\r\n$$ \\text{Attend}\\left(X, S\\right) = \\left(a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right)\\right)\\_{i\\in\\text{set}\\left(1,\\dots,n\\right)}$$\r\n\r\n$$ a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right) = \\text{softmax}\\left(\\frac{\\left(W\\_{q}\\mathbf{x}\\_{i}\\right)K^{T}\\_{S\\_{i}}}{\\sqrt{d}}\\right)V\\_{S\\_{i}} $$\r\n\r\n$$ K\\_{Si} = \\left(W\\_{k}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\n$$ V\\_{Si} = \\left(W\\_{v}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\nHere $W\\_{q}$, $W\\_{k}$, and $W\\_{v}$ represent the weight matrices which transform a given $x\\_{i}$ into a query, key, or value, and $d$ is the inner dimension of the queries and keys. The output at each position is a sum of the values weighted by the scaled dot-product similarity of the keys and queries.\r\n\r\nFull self-attention for autoregressive models defines $S\\_{i} = \\text{set}\\left(j : j \\leq i\\right)$, allowing every element to attend to all previous positions and its own position.\r\n\r\nFactorized self-attention instead has $p$ separate attention heads, where the $m$th head defines a subset of the indices $A\\_{i}^{(m)} \u2282 \\text{set}\\left(j : j \\leq i\\right)$ and lets $S\\_{i} = A\\_{i}^{(m)}$. 
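\r\n\r\nA dense NumPy sketch of the $\\text{Attend}\\left(X, S\\right)$ equation, with the connectivity pattern expressed as a boolean mask (illustrative only; a real sparse implementation would never materialize the full $n \\times n$ score matrix):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef attend(X, mask, Wq, Wk, Wv):\r\n    # mask[i, j] is True iff j is in S_i (each row needs at least one True).\r\n    Q, K, V = X @ Wq, X @ Wk, X @ Wv\r\n    scores = Q @ K.T \/ np.sqrt(Q.shape[-1])\r\n    scores = np.where(mask, scores, -np.inf)  # attend only within S_i\r\n    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))\r\n    weights \/= weights.sum(axis=-1, keepdims=True)\r\n    return weights @ V\r\n```\r\n\r\n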
The goal with the Sparse [Transformer](https:\/\/paperswithcode.com\/method\/transformer) was to find efficient choices for the subset $A$.\r\n\r\nFormally for Strided Attention, $A^{(1)}\\_{i} = ${$t, t + 1, ..., i$} for $t = \\max\\left(0, i \u2212 l\\right)$, and $A^{(2)}\\_{i} = ${$j : (i \u2212 j) \\mod l = 0$}. The $i$-th output vector of the attention head attends to all input vectors either from $A^{(1)}\\_{i}$ or $A^{(2)}\\_{i}$. This pattern can be visualized in the figure to the right.\r\n\r\nThis formulation is convenient if the data naturally has a structure that aligns with the stride, like images or some types of music. For data without a periodic structure, like text, however, the authors find that the network can fail to properly route information with the strided pattern, as spatial coordinates for an element do not necessarily correlate with the positions where the element may be most relevant in the future.","72":"**Linear Warmup With Cosine Annealing** is a learning rate schedule where we increase the learning rate linearly for $n$ updates and then anneal according to a cosine schedule afterwards.","73":"**Fixed Factorized Attention** is a factorized attention pattern where specific cells summarize previous locations and propagate that information to all future cells. It was proposed as part of the [Sparse Transformer](https:\/\/paperswithcode.com\/method\/sparse-transformer) architecture.\r\n\r\n\r\nA self-attention layer maps a matrix of input embeddings $X$ to an output matrix and is parameterized by a connectivity pattern $S = \\text{set}\\left(S\\_{1}, \\dots, S\\_{n}\\right)$, where $S\\_{i}$ denotes the set of indices of the input vectors to which the $i$th output vector attends. The output vector is a weighted sum of transformations of the input vectors:\r\n\r\n$$ \\text{Attend}\\left(X, S\\right) = \\left(a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right)\\right)\\_{i\\in\\text{set}\\left(1,\\dots,n\\right)}$$\r\n\r\n$$ a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right) = \\text{softmax}\\left(\\frac{\\left(W\\_{q}\\mathbf{x}\\_{i}\\right)K^{T}\\_{S\\_{i}}}{\\sqrt{d}}\\right)V\\_{S\\_{i}} $$\r\n\r\n$$ K\\_{Si} = \\left(W\\_{k}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\n$$ V\\_{Si} = \\left(W\\_{v}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\nHere $W\\_{q}$, $W\\_{k}$, and $W\\_{v}$ represent the weight matrices which transform a given $x\\_{i}$ into a query, key, or value, and $d$ is the inner dimension of the queries and keys. The output at each position is a sum of the values weighted by the scaled dot-product similarity of the keys and queries.\r\n\r\nFull self-attention for autoregressive models defines $S\\_{i} = \\text{set}\\left(j : j \\leq i\\right)$, allowing every element to attend to all previous positions and its own position.\r\n\r\nFactorized self-attention instead has $p$ separate attention heads, where the $m$th head defines a subset of the indices $A\\_{i}^{(m)} \u2282 \\text{set}\\left(j : j \\leq i\\right)$ and lets $S\\_{i} = A\\_{i}^{(m)}$. The goal with the Sparse [Transformer](https:\/\/paperswithcode.com\/method\/transformer) was to find efficient choices for the subset $A$.\r\n\r\nFormally for Fixed Factorized Attention, $A^{(1)}\\_{i} = ${$j : \\left(\\lfloor{j\/l\\rfloor}=\\lfloor{i\/l\\rfloor}\\right)$}, where the brackets denote the floor operation, and $A^{(2)}\\_{i} = ${$j : j \\mod l \\in ${$t, t+1, \\ldots, l$}}, where $t=l-c$ and $c$ is a hyperparameter. 
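\r\n\r\nA small sketch that materializes these two index sets as boolean masks, using 0-indexed positions (illustrative; `n` is the sequence length, `l` the stride\/block length, `c` the hyperparameter above):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef fixed_factorized_masks(n, l, c):\r\n    i, j = np.indices((n, n))\r\n    causal = j <= i\r\n    head1 = causal & (j \/\/ l == i \/\/ l)  # same length-l block as position i\r\n    head2 = causal & (j % l >= l - c)    # the last c positions of each block\r\n    return head1, head2\r\n```\r\n\r\n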
The $i$-th output vector of the attention head attends to all input vectors either from $A^{(1)}\\_{i}$ or $A^{(2)}\\_{i}$. This pattern can be visualized in the figure to the right.\r\n\r\nIf the stride is 128 and $c = 8$, then all future positions greater than 128 can attend to positions 120-128, all positions greater than 256 can attend to 248-256, and so forth. \r\n\r\nA fixed-attention pattern with $c = 1$ limits the expressivity of the network significantly, as many representations in the network are only used for one block whereas a small number of locations are used by all blocks. The authors found choosing $c \\in ${$8, 16, 32$} for typical values of $l \\in ${$128, 256$} performs well, although this increases the computational cost of this method by $c$ in comparison to the [strided attention](https:\/\/paperswithcode.com\/method\/strided-attention).\r\n\r\nAdditionally, the authors found that when using multiple heads, having them attend to distinct subblocks of length $c$ within the block of size $l$ was preferable to having them attend to the same subblock.","74":"**GPT-3** is an autoregressive [transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers) model with 175 billion parameters. It uses the same architecture\/model as [GPT-2](https:\/\/paperswithcode.com\/method\/gpt-2), including the modified initialization, pre-normalization, and reversible tokenization, with the exception that GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the [transformer](https:\/\/paperswithcode.com\/method\/transformer), similar to the [Sparse Transformer](https:\/\/paperswithcode.com\/method\/sparse-transformer).","75":"We propose to theoretically and empirically examine the effect of incorporating weighting schemes into walk-aggregating GNNs. To this end, we propose a simple, interpretable, and end-to-end supervised GNN model, called AWARE (Attentive Walk-Aggregating GRaph Neural NEtwork), for graph-level prediction. AWARE aggregates the walk information by means of weighting schemes at distinct levels (vertex-, walk-, and graph-level) in a principled manner. By virtue of the incorporated weighting schemes at these different levels, AWARE can emphasize the information important for prediction while diminishing the irrelevant ones\u2014leading to representations that can improve learning performance.","76":"**Experience Replay** is a replay memory technique used in reinforcement learning where we store the agent\u2019s experiences at each time-step, $e\\_{t} = \\left(s\\_{t}, a\\_{t}, r\\_{t}, s\\_{t+1}\\right)$ in a data-set $D = e\\_{1}, \\cdots, e\\_{N}$ , pooled over many episodes into a replay memory. We then usually sample the memory randomly for a minibatch of experience, and use this to learn off-policy, as with Deep Q-Networks. This tackles the problem of autocorrelation leading to unstable training, by making the problem more like a supervised learning problem.\r\n\r\nImage Credit: [Hands-On Reinforcement Learning with Python, Sudharsan Ravichandiran](https:\/\/subscription.packtpub.com\/book\/big_data_and_business_intelligence\/9781788836524)","77":"**Entropy Regularization** is a type of regularization used in [reinforcement learning](https:\/\/paperswithcode.com\/methods\/area\/reinforcement-learning). 
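\r\n\r\nIn practice it is usually implemented by subtracting a weighted entropy bonus from the policy loss; a minimal sketch for a categorical policy (names and the weight `beta` are illustrative, not from the original description):\r\n\r\n```python\r\nimport torch\r\n\r\ndef policy_loss_with_entropy(logits, log_prob_taken, advantage, beta=0.01):\r\n    # H(pi) = -sum_a pi(a|s) * log pi(a|s), computed from the action logits\r\n    log_probs = torch.log_softmax(logits, dim=-1)\r\n    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)\r\n    pg_loss = -(log_prob_taken * advantage)   # standard policy-gradient term\r\n    return (pg_loss - beta * entropy).mean()  # the bonus promotes action diversity\r\n```\r\n\r\n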
For on-policy policy gradient based methods like [A3C](https:\/\/paperswithcode.com\/method\/a3c), the mutual reinforcement between actor and critic leads to a highly-peaked $\\pi\\left(a\\mid{s}\\right)$ towards a few actions or action sequences, since it is easier for the actor and critic to overoptimise to a small portion of the environment. To reduce this problem, entropy regularization adds an entropy term to the loss to promote action diversity:\r\n\r\n$$H(X) = -\\sum\\pi\\left(x\\right)\\log\\left(\\pi\\left(x\\right)\\right) $$\r\n\r\nImage Credit: Wikipedia","78":"**Soft Actor Critic**, or **SAC**, is an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as [Q-learning methods](https:\/\/paperswithcode.com\/method\/q-learning). [SAC](https:\/\/paperswithcode.com\/method\/sac) combines off-policy updates with a stable stochastic actor-critic formulation.\r\n\r\nThe SAC objective has a number of advantages. First, the policy is incentivized to explore more widely, while giving up on clearly unpromising avenues. Second, the policy can capture multiple modes of near-optimal behavior. In problem settings where multiple actions seem equally attractive, the policy will commit equal probability mass to those actions. Lastly, the authors present evidence that it improves learning speed over state-of-art methods that optimize the conventional RL objective function.","79":"**Proximal Policy Optimization**, or **PPO**, is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of [TRPO](https:\/\/paperswithcode.com\/method\/trpo), while using only first-order optimization. \r\n\r\nLet $r\\_{t}\\left(\\theta\\right)$ denote the probability ratio $r\\_{t}\\left(\\theta\\right) = \\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}$, so $r\\_{t}\\left(\\theta\\_{old}\\right) = 1$. TRPO maximizes a \u201csurrogate\u201d objective:\r\n\r\n$$ L^{\\text{CPI}}\\left({\\theta}\\right) = \\hat{\\mathbb{E}}\\_{t}\\left[\\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}\\hat{A}\\_{t}\\right] = \\hat{\\mathbb{E}}\\_{t}\\left[r\\_{t}\\left(\\theta\\right)\\hat{A}\\_{t}\\right] $$\r\n\r\nwhere $CPI$ refers to conservative policy iteration. Without a constraint, maximization of $L^{CPI}$ would lead to an excessively large policy update; hence, PPO modifies the objective to penalize changes to the policy that move $r\\_{t}\\left(\\theta\\right)$ away from 1:\r\n\r\n$$ J^{\\text{CLIP}}\\left({\\theta}\\right) = \\hat{\\mathbb{E}}\\_{t}\\left[\\min\\left(r\\_{t}\\left(\\theta\\right)\\hat{A}\\_{t}, \\text{clip}\\left(r\\_{t}\\left(\\theta\\right), 1-\\epsilon, 1+\\epsilon\\right)\\hat{A}\\_{t}\\right)\\right] $$\r\n\r\nwhere $\\epsilon$ is a hyperparameter, say, $\\epsilon = 0.2$. The motivation for this objective is as follows. The first term inside the min is $L^{CPI}$. 
The second term, $\\text{clip}\\left(r\\_{t}\\left(\\theta\\right), 1-\\epsilon, 1+\\epsilon\\right)\\hat{A}\\_{t}$, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving $r\\_{t}$ outside of the interval $\\left[1 \u2212 \\epsilon, 1 + \\epsilon\\right]$. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse. \r\n\r\nOne detail to note is that when we apply PPO to a network where parameters are shared between the actor and critic functions, we typically add to the objective function an error term on value estimation and an entropy term to encourage exploration.","80":"**VQ-VAE** is a type of variational autoencoder that uses vector quantisation to obtain a discrete latent representation. It differs from [VAEs](https:\/\/paperswithcode.com\/method\/vae) in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, ideas from vector quantisation (VQ) are incorporated. Using the VQ method allows the model to circumvent issues of posterior collapse - where the latents are ignored when they are paired with a powerful autoregressive decoder - typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes.","81":"**Non Maximum Suppression** is a computer vision method that selects a single entity out of many overlapping entities (for example bounding boxes in object detection). The criterion is usually to discard entities that are below a given probability bound. From the remaining entities, we repeatedly pick the entity with the highest probability, output that as the prediction, and discard any remaining box with an $\\text{IoU} \\geq 0.5$ with the box output in the previous step.\r\n\r\nImage Credit: [Martin Kersner](https:\/\/github.com\/martinkersner\/non-maximum-suppression-cpp)","82":"**SSD** is a single-stage object detection method that discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. \r\n\r\nThe fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. Improvements over competing single-stage methods include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales.","83":"**Region of Interest Align**, or **RoIAlign**, is an operation for extracting a small feature map from each RoI in detection and segmentation based tasks. 
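\r\n\r\nIts key ingredient, discussed below, is evaluating the feature map at non-integer coordinates via bilinear interpolation; a minimal sketch of that sampling step (illustrative only, no bounds checks):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef bilinear_sample(fmap, y, x):\r\n    # Interpolate a (H, W) feature map at the real-valued point (y, x)\r\n    # from its four integer neighbours.\r\n    y0, x0 = int(np.floor(y)), int(np.floor(x))\r\n    y1, x1 = y0 + 1, x0 + 1\r\n    wy, wx = y - y0, x - x0\r\n    return ((1 - wy) * (1 - wx) * fmap[y0, x0] + (1 - wy) * wx * fmap[y0, x1]\r\n            + wy * (1 - wx) * fmap[y1, x0] + wy * wx * fmap[y1, x1])\r\n```\r\n\r\n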
It removes the harsh quantization of [RoI Pool](https:\/\/paperswithcode.com\/method\/roi-pooling), properly *aligning* the extracted features with the input. To avoid any quantization of the RoI boundaries or bins (using $x\/16$ instead of $[x\/16]$), RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and the result is then aggregated (using max or average).","84":"**Mask R-CNN** extends [Faster R-CNN](http:\/\/paperswithcode.com\/method\/faster-r-cnn) to solve instance segmentation tasks. It achieves this by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. In principle, Mask R-CNN is an intuitive extension of Faster [R-CNN](https:\/\/paperswithcode.com\/method\/r-cnn), but constructing the mask branch properly is critical for good results. \r\n\r\nMost importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is evident in how [RoIPool](http:\/\/paperswithcode.com\/method\/roi-pooling), the *de facto* core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, Mask R-CNN utilises a simple, quantization-free layer, called [RoIAlign](http:\/\/paperswithcode.com\/method\/roi-align), that faithfully preserves exact spatial locations. \r\n\r\nSecondly, Mask R-CNN *decouples* mask and class prediction: it predicts a binary mask for each class independently, without competition among classes, and relies on the network's RoI classification branch to predict the category. In contrast, an [FCN](http:\/\/paperswithcode.com\/method\/fcn) usually performs per-pixel multi-class categorization, which couples segmentation and classification.","85":"**Inpainting** is the task of generating the contents of an arbitrary image region conditioned on its surroundings; a common approach is to train a convolutional neural network to do this.","86":"**Local Response Normalization** is a normalization layer that implements the idea of lateral inhibition. Lateral inhibition is a concept in neurobiology that refers to the phenomenon of an excited neuron inhibiting its neighbours: this leads to a peak in the form of a local maximum, creating contrast in that area and increasing sensory perception. In practice, we can either normalize within the same channel or normalize across channels when we apply LRN to convolutional neural networks.\r\n\r\n$$ b_{c} = a_{c}\\left(k + \\frac{\\alpha}{n}\\sum_{c'=\\max(0, c-n\/2)}^{\\min(N-1,c+n\/2)}a_{c'}^2\\right)^{-\\beta} $$\r\n\r\nwhere $n$ is the number of neighbouring channels used for normalization, $\\alpha$ is a multiplicative factor, $\\beta$ an exponent and $k$ an additive factor.","87":"A **Grouped Convolution** uses a group of convolutions - multiple kernels per layer - resulting in multiple channel outputs per layer. This leads to wider networks, helping a network learn a varied set of low level and high level features. The original motivation of using Grouped Convolutions in [AlexNet](https:\/\/paperswithcode.com\/method\/alexnet) was to distribute the model over multiple GPUs as an engineering compromise. But later, with models such as [ResNeXt](https:\/\/paperswithcode.com\/method\/resnext), it was shown this module could be used to improve classification accuracy. 
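\r\n\r\nIn modern frameworks this is a one-argument change; a PyTorch sketch (the sizes are arbitrary):\r\n\r\n```python\r\nimport torch.nn as nn\r\n\r\n# groups=4 splits the 32 input channels into 4 groups of 8; each group of\r\n# 16 output channels only sees its own 8 input channels.\r\ngrouped = nn.Conv2d(32, 64, kernel_size=3, padding=1, groups=4)\r\ndense = nn.Conv2d(32, 64, kernel_size=3, padding=1)  # groups=1 baseline\r\n\r\nprint(sum(p.numel() for p in grouped.parameters()))  # roughly 4x fewer weights\r\nprint(sum(p.numel() for p in dense.parameters()))\r\n```\r\n\r\n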
Specifically, grouped convolutions expose a new dimension, *cardinality* (the size of the set of transformations), and increasing cardinality can increase accuracy.","88":"**AlexNet** is a classic convolutional neural network architecture. It consists of convolutions, [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) and dense layers as the basic building blocks. Grouped convolutions are used in order to fit the model across two GPUs.","89":"A **Non-Local Operation** is a component for capturing long-range dependencies with deep neural networks. It is a generalization of the classical non-local mean operation in computer vision. Intuitively, a non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps. The set of positions can be in space, time, or spacetime, implying that these operations are applicable for image, sequence, and video problems.\r\n\r\nFollowing the non-local mean operation, a generic non-local operation for deep neural networks is defined as:\r\n\r\n$$ \\mathbb{y}\\_{i} = \\frac{1}{\\mathcal{C}\\left(\\mathbb{x}\\right)}\\sum\\_{\\forall{j}}f\\left(\\mathbb{x}\\_{i}, \\mathbb{x}\\_{j}\\right)g\\left(\\mathbb{x}\\_{j}\\right) $$\r\n\r\nHere $i$ is the index of an output position (in space, time, or spacetime) whose response is to be computed and $j$ is the index that enumerates all possible positions. $x$ is the input signal (image, sequence, video; often their features) and $y$ is the output signal of the same size as $x$. A pairwise function $f$ computes a scalar (representing relationship such as affinity) between $i$ and all $j$. The unary function $g$ computes a representation of the input signal at the position $j$. The response is normalized by a factor $\\mathcal{C}\\left(x\\right)$.\r\n\r\nThe non-local behavior is due to the fact that all positions ($\\forall{j}$) are considered in the operation. As a comparison, a convolutional operation sums up the weighted input in a local neighborhood (e.g., $i \u2212 1 \\leq j \\leq i + 1$ in a 1D case with kernel size 3), and a recurrent operation at time $i$ is often based only on the current and the latest time steps (e.g., $j = i$ or $i \u2212 1$).\r\n\r\nThe non-local operation is also different from a fully-connected (fc) layer. The equation above computes responses based on relationships between different locations, whereas fc uses learned weights. In other words, the relationship between $x\\_{j}$ and $x\\_{i}$ is not a function of the input data in fc, unlike in non-local layers. Furthermore, the formulation in the equation above supports inputs of variable sizes, and maintains the corresponding size in the output. On the contrary, an fc layer requires a fixed-size input\/output and loses positional correspondence (e.g., that from $x\\_{i}$ to $y\\_{i}$ at the position $i$).\r\n\r\nA non-local operation is a flexible building block and can be easily used together with convolutional\/recurrent layers. It can be added into the earlier part of deep neural networks, unlike fc layers that are often used in the end. This allows us to build a richer hierarchy that combines both non-local and local information.\r\n\r\nIn terms of parameterisation, we usually parameterise $g$ as a linear embedding of the form $g\\left(x\\_{j}\\right) = W\\_{g}\\mathbb{x}\\_{j}$, where $W\\_{g}$ is a weight matrix to be learned. 
This is implemented as, e.g., 1\u00d71 [convolution](https:\/\/paperswithcode.com\/method\/convolution) in space or 1\u00d71\u00d71 convolution in spacetime. For $f$ we use an affinity function, a list of which can be found [here](https:\/\/paperswithcode.com\/methods\/category\/affinity-functions).","90":"A **Non-Local Block** is an image block module used in neural networks that wraps a [non-local operation](https:\/\/paperswithcode.com\/method\/non-local-operation). We can define a non-local block as:\r\n\r\n$$ \\mathbb{z}\\_{i} = W\\_{z}\\mathbb{y}\\_{i} + \\mathbb{x}\\_{i} $$\r\n\r\nwhere $y\\_{i}$ is the output from the non-local operation and $+ \\mathbb{x}\\_{i}$ is a [residual connection](https:\/\/paperswithcode.com\/method\/residual-connection).","91":"**k-Means Clustering** is a clustering algorithm that divides a training set into $k$ different clusters of examples that are near each other. It works by initializing $k$ different centroids {$\\mu^{(1)},\\ldots,\\mu^{(k)}$} to different values, then alternating between two steps until convergence:\r\n\r\n(i) each training example is assigned to cluster $i$, where $i$ is the index of the nearest centroid $\\mu^{(i)}$\r\n\r\n(ii) each centroid $\\mu^{(i)}$ is updated to the mean of all training examples $x^{(j)}$ assigned to cluster $i$.\r\n\r\nText Source: Deep Learning, Goodfellow et al\r\n\r\nImage Source: [scikit-learn](https:\/\/scikit-learn.org\/stable\/auto_examples\/cluster\/plot_kmeans_digits.html)","92":"**Logistic Regression**, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.\r\n\r\nSource: [scikit-learn](https:\/\/scikit-learn.org\/stable\/modules\/linear_model.html#logistic-regression)\r\n\r\nImage: [Michaelg2015](https:\/\/commons.wikimedia.org\/wiki\/User:Michaelg2015)","93":"**Darknet-53** is a convolutional neural network that acts as a backbone for the [YOLOv3](https:\/\/paperswithcode.com\/method\/yolov3) object detection approach. The improvements upon its predecessor [Darknet-19](https:\/\/paperswithcode.com\/method\/darknet-19) include the use of residual connections, as well as more layers.","94":"**YOLOv3** is a real-time, single-stage object detection model that builds on [YOLOv2](https:\/\/paperswithcode.com\/method\/yolov2) with several improvements. Improvements include the use of a new backbone network, [Darknet-53](https:\/\/paperswithcode.com\/method\/darknet-53) that utilises residual connections, or in the words of the author, \"those newfangled residual network stuff\", as well as some improvements to the bounding box prediction step, and use of three different scales from which to extract features (similar to an [FPN](https:\/\/paperswithcode.com\/method\/fpn)).","95":"Dynamic Time Warping (DTW) [1] is one of the well-known distance measures between a pair of time series. The main idea of DTW is to compute the distance from the matching of similar elements between time series. 
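\r\n\r\nA classic $O(nm)$ dynamic-programming sketch of the DTW distance between two 1-D sequences (illustrative; absolute difference as the local cost):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef dtw_distance(a, b):\r\n    n, m = len(a), len(b)\r\n    D = np.full((n + 1, m + 1), np.inf)\r\n    D[0, 0] = 0.0\r\n    for i in range(1, n + 1):\r\n        for j in range(1, m + 1):\r\n            cost = abs(a[i - 1] - b[j - 1])\r\n            # match a[i-1] with b[j-1], extending the cheapest valid alignment\r\n            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])\r\n    return D[n, m]\r\n\r\nprint(dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 2, 3]))  # 0.0: same shape, different speed\r\n```\r\n\r\n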
It uses this dynamic programming technique to find the optimal temporal matching between elements of two time series.\r\n\r\nFor instance, similarities in walking could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio, and graphics data \u2014 indeed, any data that can be turned into a linear sequence can be analyzed with DTW. A well known application has been automatic speech recognition, to cope with different speaking speeds. Other applications include speaker recognition and online signature recognition. It can also be used in partial shape matching applications.\r\n\r\nIn general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restrictions and rules:\r\n\r\n1. Every index from the first sequence must be matched with one or more indices from the other sequence, and vice versa\r\n2. The first index from the first sequence must be matched with the first index from the other sequence (but it does not have to be its only match)\r\n3. The last index from the first sequence must be matched with the last index from the other sequence (but it does not have to be its only match)\r\n4. The mapping of the indices from the first sequence to indices from the other sequence must be monotonically increasing, and vice versa, i.e. if $j > i$ are indices from the first sequence, then there must not be two indices $l > k$ in the other sequence, such that index $i$ is matched with index $l$ and index $j$ is matched with index $k$, and vice versa.\r\n\r\n[1] Sakoe, Hiroaki, and Seibi Chiba. \"Dynamic programming algorithm optimization for spoken word recognition.\" IEEE Transactions on Acoustics, Speech, and Signal Processing 26, no. 1 (1978): 43-49.","96":"The **alternating direction method of multipliers** (**ADMM**) is an algorithm that solves convex optimization problems by breaking them into smaller pieces, each of which is then easier to handle. It takes the form of a decomposition-coordination procedure, in which the solutions to small local subproblems are coordinated to find a solution to a large global problem. ADMM can be viewed as an attempt to blend the benefits of dual decomposition and augmented Lagrangian methods for constrained optimization. It turns out to be equivalent or closely related to many other algorithms as well, such as Douglas-Rachford splitting from numerical analysis, Spingarn\u2019s method of partial inverses, Dykstra\u2019s alternating projections method, Bregman iterative algorithms for $\\ell_1$ problems in signal processing, proximal methods, and many others.\r\n\r\nText Source: [https:\/\/stanford.edu\/~boyd\/papers\/pdf\/admm_distr_stats.pdf](https:\/\/stanford.edu\/~boyd\/papers\/pdf\/admm_distr_stats.pdf)\r\n\r\nImage Source: [here](https:\/\/www.slideshare.net\/derekcypang\/alternating-direction)","97":"A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. 
Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\r\nSource: [Distilling the Knowledge in a Neural Network](https:\/\/arxiv.org\/abs\/1503.02531)","98":"**Additive Attention**, also known as **Bahdanau Attention**, uses a one-hidden layer feed-forward network to calculate the attention alignment score:\r\n\r\n$$f_{att}\\left(\\textbf{h}_{i}, \\textbf{s}\\_{j}\\right) = v\\_{a}^{T}\\tanh\\left(\\textbf{W}\\_{a}\\left[\\textbf{h}\\_{i};\\textbf{s}\\_{j}\\right]\\right)$$\r\n\r\nwhere $\\textbf{v}\\_{a}$ and $\\textbf{W}\\_{a}$ are learned attention parameters. Here $\\textbf{h}$ refers to the hidden states for the encoder, and $\\textbf{s}$ is the hidden states for the decoder. The function above is thus a type of alignment score function. We can use a matrix of alignment scores to show the correlation between source and target words, as the Figure to the right shows.\r\n\r\nWithin a neural network, once we have the alignment scores, we calculate the final scores using a [softmax](https:\/\/paperswithcode.com\/method\/softmax) function of these alignment scores (ensuring it sums to 1).","99":"A **CNN BiLSTM** is a hybrid bidirectional [LSTM](https:\/\/paperswithcode.com\/method\/lstm) and CNN architecture. In the original formulation applied to named entity recognition, it learns both character-level and word-level features. The CNN component is used to induce the character-level features. For each word the model employs a [convolution](https:\/\/paperswithcode.com\/method\/convolution) and a [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) layer to extract a new feature vector from the per-character feature vectors such as character embeddings and (optionally) character type.","100":"**Cross View Training**, or **CVT**, is a semi-supervised algorithm for training distributed word representations that makes use of unlabelled and labelled examples. \r\n\r\nCVT adds $k$ auxiliary prediction modules to the model, a Bi-[LSTM](https:\/\/paperswithcode.com\/method\/lstm) encoder, which are used when learning on unlabeled examples. A prediction module is usually a small neural network (e.g., a hidden layer followed by a [softmax](https:\/\/paperswithcode.com\/method\/softmax) layer). Each one takes as input an intermediate representation $h^j(x_i)$ produced by the model (e.g., the outputs of one of the LSTMs in a Bi-LSTM model). It outputs a distribution over labels $p\\_{j}^{\\theta}\\left(y\\mid{x\\_{i}}\\right)$.\r\n\r\nEach $h^j$ is chosen such that it only uses a part of the input $x_i$; the particular choice can depend on the task and model architecture. 
The auxiliary prediction modules are only used during training; the test-time predictions come from the primary prediction module that produces $p_\\theta$.","101":"In the ALIGN method, visual and language representations are jointly trained from noisy image alt-text data. The image and text encoders are learned via a contrastive loss (formulated as normalized softmax) that pushes the embeddings of matched image-text pairs together and pushes those of non-matched image-text pairs apart. The model learns to align visual and language representations of the image and text pairs using the contrastive loss. The representations can be used for vision-only or vision-language task transfer. Without any fine-tuning, ALIGN powers zero-shot visual classification and cross-modal search including image-to-text search, text-to-image search and even search with joint image+text queries.","102":"**Restricted Boltzmann Machines**, or **RBMs**, are two-layer generative neural networks that learn a probability distribution over the inputs. They are a special class of Boltzmann Machine in that they have a restricted number of connections between visible and hidden units. Every node in the visible layer is connected to every node in the hidden layer, but no nodes in the same group are connected. RBMs are usually trained using the contrastive divergence learning procedure.\r\n\r\nImage Source: [here](https:\/\/medium.com\/datatype\/restricted-boltzmann-machine-a-complete-analysis-part-1-introduction-model-formulation-1a4404873b3)","103":"**ReLIC**, or **Representation Learning via Invariant Causal Mechanisms**, is a self-supervised learning objective that enforces invariant prediction of proxy targets across augmentations through an invariance regularizer which yields improved generalization guarantees. \r\n\r\nWe can write the objective as:\r\n\r\n$$\r\n\\underset{X}{\\mathbb{E}}\\, \\underset{a\\_{lk}, a\\_{qt} \\sim \\mathcal{A}}{\\mathbb{E}} \\sum\\_{b \\in \\{a\\_{lk}, a\\_{qt}\\}} \\mathcal{L}\\_{b}\\left(Y^{R}, f(X)\\right) \\text{ s.t. } KL\\left(p^{do\\left(a\\_{lk}\\right)}\\left(Y^{R} \\mid f(X)\\right), p^{do\\left(a\\_{qt}\\right)}\\left(Y^{R} \\mid f(X)\\right)\\right) \\leq \\rho\r\n$$\r\n\r\nwhere $\\mathcal{L}$ is the proxy task loss and $KL$ is the Kullback-Leibler (KL) divergence. Note that any distance measure on distributions can be used in place of the KL divergence.\r\n\r\nConcretely, as proxy task we associate to every datapoint $x\\_{i}$ the label $y\\_{i}^{R}=i$. This corresponds to the instance discrimination task, commonly used in contrastive learning. We take pairs of points $\\left(x\\_{i}, x\\_{j}\\right)$ to compute similarity scores and use pairs of augmentations $a\\_{lk}=\\left(a\\_{l}, a\\_{k}\\right) \\in \\mathcal{A} \\times \\mathcal{A}$ to perform a style intervention. Given a batch of samples $\\{x\\_{i}\\}\\_{i=1}^{N} \\sim \\mathcal{D}$, we use\r\n\r\n$$\r\np^{do\\left(a\\_{lk}\\right)}\\left(Y^{R}=j \\mid f\\left(x\\_{i}\\right)\\right) \\propto \\exp \\left(\\phi\\left(f\\left(x\\_{i}^{a\\_{l}}\\right), h\\left(x\\_{j}^{a\\_{k}}\\right)\\right) \/ \\tau\\right)\r\n$$\r\n\r\nwith $x^{a}$ data augmented with $a$ and $\\tau$ a softmax temperature parameter. We encode $f$ using a neural network and choose $h$ to be related to $f$, e.g. $h=f$ or as a network with an exponential moving average of the weights of $f$ (e.g. target networks). 
To compare representations we use the function $\\phi\\left(f\\left(x\\_{i}\\right), h\\left(x\\_{j}\\right)\\right)=\\left\\langle g\\left(f\\left(x\\_{i}\\right)\\right), g\\left(h\\left(x\\_{j}\\right)\\right)\\right\\rangle$ where $g$ is a fully-connected neural network often called the critic.\r\n\r\nCombining these pieces, we learn representations by minimizing the following objective over the full set of data $x\\_{i} \\in \\mathcal{D}$ and augmentations $a\\_{lk} \\in \\mathcal{A} \\times \\mathcal{A}$\r\n\r\n$$\r\n-\\sum\\_{i=1}^{N} \\sum\\_{a\\_{lk}} \\log \\frac{\\exp \\left(\\phi\\left(f\\left(x\\_{i}^{a\\_{l}}\\right), h\\left(x\\_{i}^{a\\_{k}}\\right)\\right) \/ \\tau\\right)}{\\sum\\_{m=1}^{M} \\exp \\left(\\phi\\left(f\\left(x\\_{i}^{a\\_{l}}\\right), h\\left(x\\_{m}^{a\\_{k}}\\right)\\right) \/ \\tau\\right)}+\\alpha \\sum\\_{a\\_{lk}, a\\_{qt}} KL\\left(p^{do\\left(a\\_{lk}\\right)}, p^{do\\left(a\\_{qt}\\right)}\\right)\r\n$$\r\n\r\nwith $M$ the number of points we use to construct the contrast set and $\\alpha$ the weighting of the invariance penalty. The shorthand $p^{do(a)}$ is used for $p^{do(a)}\\left(Y^{R}=j \\mid f\\left(x\\_{i}\\right)\\right)$. The Figure shows a schematic of the ReLIC objective.","104":"A **Gated Recurrent Unit**, or **GRU**, is a type of recurrent neural network. It is similar to an [LSTM](https:\/\/paperswithcode.com\/method\/lstm), but only has two gates - a reset gate and an update gate - and notably lacks an output gate. Fewer parameters means GRUs are generally easier\/faster to train than their LSTM counterparts.\r\n\r\nImage Source: [here](https:\/\/www.google.com\/url?sa=i&url=https%3A%2F%2Fcommons.wikimedia.org%2Fwiki%2FFile%3AGated_Recurrent_Unit%2C_type_1.svg&psig=AOvVaw3EmNX8QXC5hvyxeenmJIUn&ust=1590332062671000&source=images&cd=vfe&ved=0CA0QjhxqFwoTCMiev9-eyukCFQAAAAAdAAAAABAR)","105":"**Gated Transformer-XL**, or **GTrXL**, is a [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers)-based architecture for reinforcement learning. It introduces architectural modifications that improve the stability and learning speed of the original Transformer and XL variant. Changes include:\r\n\r\n- Placing the [layer normalization](https:\/\/paperswithcode.com\/method\/layer-normalization) on only the input stream of the submodules. A key benefit to this reordering is that it now enables an identity map from the input of the transformer at the first layer to the output of the transformer after the last layer. This is in contrast to the canonical transformer, where there are a series of layer normalization operations that non-linearly transform the state encoding.\r\n- Replacing [residual connections](https:\/\/paperswithcode.com\/method\/residual-connection) with gating layers. The authors' experiments found that [GRUs](https:\/\/www.paperswithcode.com\/method\/gru) were the most effective form of gating.","106":"**Contrastive BERT** is a reinforcement learning agent that combines a new contrastive loss and a hybrid [LSTM](https:\/\/paperswithcode.com\/method\/lstm)-[transformer](https:\/\/paperswithcode.com\/method\/transformer) architecture to tackle the challenge of improving data efficiency for RL. 
It uses bidirectional masked prediction in combination with a generalization of recent contrastive methods to learn better representations for transformers in RL, without the need for hand-engineered data augmentations.\r\n\r\nFor the architecture, a residual network is used to encode observations into embeddings $Y\\_{t}$. $Y\\_{t}$ is fed through a causally masked [GTrXL transformer](https:\/\/www.paperswithcode.com\/method\/gtrxl), which computes the predicted masked inputs $X\\_{t}$ and passes those together with $Y\\_{t}$ to a learnt gate. The output of the gate is passed through a single [LSTM](https:\/\/www.paperswithcode.com\/method\/lstm) layer to produce the values that we use for computing the RL loss. A contrastive loss is computed using predicted masked inputs $X\\_{t}$ and $Y\\_{t}$ as targets. For this, we do not use the causal mask of the Transformer.","107":"**Seq2Seq**, or **Sequence To Sequence**, is a model used in sequence prediction tasks, such as language modelling and machine translation. The idea is to use one [LSTM](https:\/\/paperswithcode.com\/method\/lstm), the *encoder*, to read the input sequence one timestep at a time, to obtain a large fixed dimensional vector representation (a context vector), and then to use another LSTM, the *decoder*, to extract the output sequence from that vector. The second LSTM is essentially a recurrent neural network language model except that it is conditioned on the input sequence.\r\n\r\n(Note that this page refers to the original seq2seq, not general sequence-to-sequence models)","108":"VERtex Similarity Embeddings (VERSE) is a simple, versatile, and memory-efficient method that derives graph embeddings explicitly calibrated to preserve the distributions of a selected vertex-to-vertex similarity measure. VERSE learns such embeddings by training a single-layer neural network.\r\n\r\nSource: [Tsitsulin et al.](https:\/\/arxiv.org\/pdf\/1803.04742v1.pdf)\r\n\r\nImage source: [Tsitsulin et al.](https:\/\/arxiv.org\/pdf\/1803.04742v1.pdf)","109":"Temporal attention can be seen as a dynamic time selection mechanism determining when to pay attention, and is thus usually used for video processing.","110":"A **DQN**, or Deep Q-Network, approximates an action-value function in a [Q-Learning](https:\/\/paperswithcode.com\/method\/q-learning) framework with a neural network. In the Atari games case, the network takes in several frames of the game as input and outputs a Q-value for each action. \r\n\r\nIt is usually used in conjunction with [Experience Replay](https:\/\/paperswithcode.com\/method\/experience-replay), for storing the episode steps in memory for off-policy learning, where samples are drawn from the replay memory at random. Additionally, the Q-Network is usually optimized towards a frozen target network that is periodically updated with the latest weights every $k$ steps (where $k$ is a hyperparameter). The latter makes training more stable by preventing short-term oscillations from a moving target. The former tackles autocorrelation that would occur from on-line learning, and having a replay memory makes the problem more like a supervised learning problem.\r\n\r\nImage Source: [here](https:\/\/www.researchgate.net\/publication\/319643003_Autonomous_Quadrotor_Landing_using_Deep_Reinforcement_Learning)","111":"An **Hourglass Module** is an image block module used mainly for pose estimation tasks. The design of the hourglass is motivated by the need to capture information at every scale. 
While local evidence is essential for identifying features like faces and hands, a final pose estimate requires a coherent understanding of the full body. The person\u2019s orientation, the arrangement of their limbs, and the relationships of adjacent joints are among the many cues that are best recognized at different scales in the image. The hourglass is a simple, minimal design that has the capacity to capture all of these features and bring them together to output pixel-wise predictions.\r\n\r\nThe network must have some mechanism to effectively process and consolidate features across scales. The Hourglass uses a single pipeline with skip layers to preserve spatial information at each resolution. The network reaches its lowest resolution at 4x4 pixels allowing smaller spatial filters to be applied that compare features across the entire space of the image.\r\n\r\nThe hourglass is set up as follows: Convolutional and [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) layers are used to process features down to a very low resolution. At each max pooling step, the network branches off and applies more convolutions at the original pre-pooled resolution. After reaching the lowest resolution, the network begins the top-down sequence of upsampling and combination of features across scales. To bring together information across two adjacent resolutions, we do nearest neighbor upsampling of the lower resolution followed by an elementwise addition of the two sets of features. The topology of the hourglass is symmetric, so for every layer present on the way down there is a corresponding layer going up.\r\n\r\nAfter reaching the output resolution of the network, two consecutive rounds of 1x1 convolutions are applied to produce the final network predictions. The output of the network is a set of heatmaps where for a given [heatmap](https:\/\/paperswithcode.com\/method\/heatmap) the network predicts the probability of a joint\u2019s presence at each and every pixel.","112":"**Random Scaling** is a type of image data augmentation where we randomly change the scale of the image within a specified range.","113":"**Stacked Hourglass Networks** are a type of convolutional neural network for pose estimation. They are based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.","114":"**Corner Pooling** is a pooling technique for object detection that seeks to better localize corners by encoding explicit prior knowledge. Suppose we want to determine if a pixel at location $\\left(i, j\\right)$ is a top-left corner. Let $f\\_{t}$ and $f\\_{l}$ be the feature maps that are the inputs to the top-left corner pooling layer, and let $f\\_{t\\_{ij}}$ and $f\\_{l\\_{ij}}$ be the vectors at location $\\left(i, j\\right)$ in $f\\_{t}$ and $f\\_{l}$ respectively. With $H \\times W$ feature maps, the corner pooling layer first max-pools all feature vectors between $\\left(i, j\\right)$ and $\\left(i, H\\right)$ in $f\\_{t}$ to a feature vector $t\\_{ij}$, and max-pools all feature vectors between $\\left(i, j\\right)$ and $\\left(W, j\\right)$ in $f\\_{l}$ to a feature vector $l\\_{ij}$. Finally, it adds $t\\_{ij}$ and $l\\_{ij}$ together.","115":"**CornerNet** is an object detection model that detects an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single [convolution](https:\/\/paperswithcode.com\/method\/convolution) neural network. 
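\r\n\r\nThe corner pooling operation defined above reduces to a reversed cumulative maximum along each axis; a NumPy sketch for single-channel $(H, W)$ maps (illustrative only):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef top_left_corner_pool(f_t, f_l):\r\n    # At each (i, j): max over f_t[i:, j] plus max over f_l[i, j:].\r\n    t = np.maximum.accumulate(f_t[::-1, :], axis=0)[::-1, :]\r\n    l = np.maximum.accumulate(f_l[:, ::-1], axis=1)[:, ::-1]\r\n    return t + l\r\n```\r\n\r\n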
","115":"**CornerNet** is an object detection model that detects an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single [convolutional](https:\/\/paperswithcode.com\/method\/convolution) neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. It also utilises [corner pooling](https:\/\/paperswithcode.com\/method\/corner-pooling), a new type of pooling layer that helps the network better localize corners.","116":"Non-maximum suppression is an integral part of the object detection pipeline. First, it sorts all detection boxes on the basis of their scores. The detection box $M$ with the maximum score is selected and all other detection boxes with a significant overlap (using a pre-defined threshold)\r\nwith $M$ are suppressed. This process is recursively applied on the remaining boxes. As per the design of the algorithm, if a genuine object's box overlaps $M$ by more than the predefined threshold, it is suppressed, leading to a miss. \r\n\r\n**Soft-NMS** solves this problem by decaying the detection scores of all other objects as a continuous function of their overlap with $M$. Hence, no object is eliminated in this process.
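\r\n\r\nA minimal NumPy-style sketch of the Gaussian-decay variant (the `sigma` and score threshold values are illustrative; a linear decay variant exists as well):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef iou(a, b):\r\n    # boxes as (x1, y1, x2, y2)\r\n    x1, y1 = max(a[0], b[0]), max(a[1], b[1])\r\n    x2, y2 = min(a[2], b[2]), min(a[3], b[3])\r\n    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)\r\n    area = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1])\r\n    return inter \/ (area - inter)\r\n\r\ndef soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):\r\n    scores = scores.astype(float)\r\n    idxs, keep = list(range(len(scores))), []\r\n    while idxs:\r\n        m = max(idxs, key=lambda i: scores[i])  # box with the maximum score\r\n        keep.append(m)\r\n        idxs.remove(m)\r\n        for i in idxs:\r\n            # decay, rather than zero out, the scores of overlapping boxes\r\n            scores[i] *= np.exp(-iou(boxes[m], boxes[i]) ** 2 \/ sigma)\r\n        idxs = [i for i in idxs if scores[i] >= score_thresh]\r\n    return keep, scores\r\n```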
","117":"**RandomHorizontalFlip** is a type of image data augmentation which horizontally flips a given image with a given probability.\r\n\r\nImage Credit: [Apache MXNet](https:\/\/mxnet.apache.org\/versions\/1.5.0\/tutorials\/gluon\/data_augmentation.html)","118":"**Step Decay** is a learning rate schedule that drops the learning rate by a factor every few epochs, where the number of epochs is a hyperparameter.\r\n\r\nImage Credit: [Suki Lau](https:\/\/towardsdatascience.com\/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1)","119":"**MatrixNet** is a scale and aspect ratio aware building block for object detection that seeks to handle objects of different sizes and aspect ratios. It has several matrix layers; each layer handles an object of a specific size and aspect ratio. MatrixNets can be seen as an alternative to [FPNs](https:\/\/paperswithcode.com\/method\/fpn). While FPNs are capable of handling objects of different sizes, they do not have a solution for objects of different aspect ratios. Objects such as a high tower, a giraffe, or a knife introduce a design difficulty for FPNs: does one map these objects to layers according to their width or height? Assigning the object to a layer according to its larger dimension would result in loss of information along the smaller dimension due to aggressive downsampling, and vice versa. \r\n\r\nMatrixNets assign objects of different sizes and aspect ratios to layers such that object sizes within their assigned layers are close to uniform. This assignment allows a square output [convolution](https:\/\/paperswithcode.com\/method\/convolution) kernel to equally gather information about objects of all aspect ratios and scales. MatrixNets can be applied to any backbone, similar to FPNs. We denote this by appending a \"-X\" to the backbone, e.g. ResNet50-X.","120":"**Global-Local Attention** is a type of attention mechanism used in the [ETC](https:\/\/paperswithcode.com\/method\/etc) architecture. ETC receives two separate input sequences: the global input $x^{g} = (x^{g}\\_{1}, \\dots, x^{g}\\_{n\\_{g}})$ and the long input $x^{l} = (x^{l}\\_{1}, \\dots x^{l}\\_{n\\_{l}})$. Typically, the long input contains the input a [standard Transformer](https:\/\/paperswithcode.com\/method\/transformer) would receive, while the global input contains a much smaller number of auxiliary tokens ($n\\_{g} \\ll n\\_{l}$). Attention is then split into four separate pieces: global-to-global (g2g), global-to-long (g2l), long-to-global (l2g), and long-to-long (l2l). Attention in the l2l piece (the most computationally expensive piece) is restricted to a fixed radius $r \\ll n\\_{l}$. To compensate for this limited attention span, the tokens in the global input have unrestricted attention, and thus long input tokens can transfer information to each other through global input tokens. Accordingly, the g2g, g2l, and l2g pieces of attention are unrestricted.","121":"**$R\\_{1}$ Regularization** is a regularization technique and gradient penalty for training [generative adversarial networks](https:\/\/paperswithcode.com\/methods\/category\/generative-adversarial-networks). It penalizes the discriminator for deviating from the Nash equilibrium by penalizing the gradient on real data alone: when the generator distribution produces the true data distribution and the discriminator is equal to 0 on the data manifold, the gradient penalty ensures that the discriminator cannot create a non-zero gradient orthogonal to the data manifold without suffering a loss in the [GAN](https:\/\/paperswithcode.com\/method\/gan) game.\r\n\r\nThis leads to the following regularization term:\r\n\r\n$$ R\\_{1}\\left(\\psi\\right) = \\frac{\\gamma}{2}E\\_{p\\_{D}\\left(x\\right)}\\left[||\\nabla{D\\_{\\psi}\\left(x\\right)}||^{2}\\right] $$
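\r\n\r\nA PyTorch-style sketch of this penalty (the `discriminator` module and the $\\gamma$ default are illustrative):\r\n\r\n```python\r\nimport torch\r\n\r\ndef r1_penalty(discriminator, real_images, gamma=10.0):\r\n    real_images = real_images.detach().requires_grad_(True)\r\n    scores = discriminator(real_images)\r\n    # gradient of the discriminator output w.r.t. the real data alone\r\n    (grad,) = torch.autograd.grad(scores.sum(), real_images, create_graph=True)\r\n    return (gamma \/ 2) * grad.pow(2).flatten(1).sum(dim=1).mean()\r\n```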
","122":"A **Feedforward Network**, or a **Multilayer Perceptron (MLP)**, is a neural network with solely densely connected layers. This is the classic neural network architecture of the literature. It consists of inputs $x$ passed through units $h$ (of which there can be many layers) to predict a target $y$. Activation functions are generally chosen to be non-linear to allow for flexible functional approximation.\r\n\r\nImage Source: Deep Learning, Goodfellow et al.","123":"**Adaptive Instance Normalization** is a normalization method that aligns the mean and variance of the content features with those of the style features. \r\n\r\n[Instance Normalization](https:\/\/paperswithcode.com\/method\/instance-normalization) normalizes the input to a single style specified by the affine parameters. Adaptive Instance Normalization is an extension. In AdaIN, we receive a content input $x$ and a style input $y$, and we simply align the channel-wise mean and variance of $x$ to match those of $y$. Unlike [Batch Normalization](https:\/\/paperswithcode.com\/method\/batch-normalization), Instance Normalization or [Conditional Instance Normalization](https:\/\/paperswithcode.com\/method\/conditional-instance-normalization), AdaIN has no learnable affine parameters. Instead, it adaptively computes the affine parameters from the style input:\r\n\r\n$$\r\n\\textrm{AdaIN}(x, y)= \\sigma(y)\\left(\\frac{x-\\mu(x)}{\\sigma(x)}\\right)+\\mu(y)\r\n$$
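\r\n\r\nA direct PyTorch-style transcription of this formula (illustrative; statistics are taken per channel over the spatial dimensions):\r\n\r\n```python\r\nimport torch\r\n\r\ndef adain(x, y, eps=1e-5):\r\n    # x: content features, y: style features, both (N, C, H, W)\r\n    mu_x = x.mean(dim=(2, 3), keepdim=True)\r\n    mu_y = y.mean(dim=(2, 3), keepdim=True)\r\n    sigma_x = x.std(dim=(2, 3), keepdim=True) + eps  # eps avoids division by zero\r\n    sigma_y = y.std(dim=(2, 3), keepdim=True)\r\n    return sigma_y * (x - mu_x) \/ sigma_x + mu_y\r\n```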
","124":"**StyleGAN** is a type of generative adversarial network. It uses an alternative generator architecture for generative adversarial networks, borrowing from the style transfer literature; in particular, the use of [adaptive instance normalization](https:\/\/paperswithcode.com\/method\/adaptive-instance-normalization). Otherwise it follows Progressive [GAN](https:\/\/paperswithcode.com\/method\/gan) in using a progressively growing training regime. Another quirk is that it generates from a fixed-value tensor, not from stochastically generated latent variables as in regular GANs. The stochastically generated latent variables are instead used as style vectors in the adaptive [instance normalization](https:\/\/paperswithcode.com\/method\/instance-normalization) at each resolution, after being transformed by an 8-layer [feedforward network](https:\/\/paperswithcode.com\/method\/feedforward-network). Lastly, it employs a form of regularization called mixing regularization, which mixes two style latent variables during training.","125":"**MAML**, or **Model-Agnostic Meta-Learning**, is a model- and task-agnostic algorithm for meta-learning that trains a model\u2019s parameters such that a small number of gradient updates will lead to fast learning on a new task.\r\n\r\nConsider a model represented by a parametrized function $f\\_{\\theta}$ with parameters $\\theta$. When adapting to a new task $\\mathcal{T}\\_{i}$, the model\u2019s parameters $\\theta$ become $\\theta'\\_{i}$. With MAML, the updated parameter vector $\\theta'\\_{i}$ is computed using one or more gradient descent updates on task $\\mathcal{T}\\_{i}$. For example, when using one gradient update,\r\n\r\n$$ \\theta'\\_{i} = \\theta - \\alpha\\nabla\\_{\\theta}\\mathcal{L}\\_{\\mathcal{T}\\_{i}}\\left(f\\_{\\theta}\\right) $$\r\n\r\nThe step size $\\alpha$ may be fixed as a hyperparameter or meta-learned. The model parameters are trained by optimizing for the performance of $f\\_{\\theta'\\_{i}}$ with respect to $\\theta$ across tasks sampled from $p\\left(\\mathcal{T}\\_{i}\\right)$. More concretely, the meta-objective is as follows:\r\n\r\n$$ \\min\\_{\\theta} \\sum\\_{\\mathcal{T}\\_{i} \\sim p\\left(\\mathcal{T}\\right)} \\mathcal{L}\\_{\\mathcal{T\\_{i}}}\\left(f\\_{\\theta'\\_{i}}\\right) = \\sum\\_{\\mathcal{T}\\_{i} \\sim p\\left(\\mathcal{T}\\right)} \\mathcal{L}\\_{\\mathcal{T\\_{i}}}\\left(f\\_{\\theta - \\alpha\\nabla\\_{\\theta}\\mathcal{L}\\_{\\mathcal{T}\\_{i}}\\left(f\\_{\\theta}\\right)}\\right) $$\r\n\r\nNote that the meta-optimization is performed over the model parameters $\\theta$, whereas the objective is computed using the updated model parameters $\\theta'$. In effect, MAML aims to optimize the model parameters such that one or a small number of gradient steps on a new task will produce maximally effective behavior on that task. The meta-optimization across tasks is performed via stochastic gradient descent ([SGD](https:\/\/paperswithcode.com\/method\/sgd)), such that the model parameters $\\theta$ are updated as follows:\r\n\r\n$$ \\theta \\leftarrow \\theta - \\beta\\nabla\\_{\\theta} \\sum\\_{\\mathcal{T}\\_{i} \\sim p\\left(\\mathcal{T}\\right)} \\mathcal{L}\\_{\\mathcal{T\\_{i}}}\\left(f\\_{\\theta'\\_{i}}\\right)$$\r\n\r\nwhere $\\beta$ is the meta step size.
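\r\n\r\nA compact PyTorch-style sketch of one meta-update with a single inner step (hedged: `torch.func.functional_call` evaluates the model under the adapted parameters, and the MSE task loss is an illustrative stand-in for $\\mathcal{L}\\_{\\mathcal{T}\\_{i}}$):\r\n\r\n```python\r\nimport torch\r\nfrom torch.func import functional_call\r\n\r\ndef task_loss(model, params, batch):\r\n    x, y = batch\r\n    return torch.nn.functional.mse_loss(functional_call(model, params, (x,)), y)\r\n\r\ndef maml_meta_loss(model, tasks, alpha=0.01):\r\n    params = dict(model.named_parameters())\r\n    meta_loss = 0.0\r\n    for support, query in tasks:\r\n        inner = task_loss(model, params, support)  # loss of f_theta on the support set\r\n        grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)\r\n        adapted = {n: p - alpha * g for (n, p), g in zip(params.items(), grads)}\r\n        meta_loss = meta_loss + task_loss(model, adapted, query)  # loss of the adapted model\r\n    return meta_loss  # differentiate w.r.t. theta and step with the meta step size beta\r\n```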
","126":"NeRF represents a scene with a learned, continuous volumetric radiance field $F_\\theta$ defined over a bounded 3D volume. In a NeRF, $F_\\theta$ is a multilayer perceptron (MLP) that takes as input a 3D position $x = (x, y, z)$ and unit-norm viewing direction $d = (dx, dy, dz)$, and produces as output a density $\\sigma$ and color $c = (r, g, b)$. The weights of the multilayer perceptron that parameterize $F_\\theta$ are optimized so as to encode the radiance field of the scene. Volume rendering is used to compute the color of a single pixel.","127":"**ELECTRA** is a [transformer](https:\/\/paperswithcode.com\/method\/transformer) with a new pre-training approach which trains two transformer models: the generator and the discriminator. The generator - trained as a masked language model - replaces tokens in the sequence, and the discriminator (the ELECTRA contribution) attempts to identify which tokens in the sequence were replaced by the generator. This pre-training task is called replaced token detection, and is a replacement for masking the input.","128":"CARLA is an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support development, training, and validation of autonomous urban driving systems. In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. \r\n\r\nSource: [Dosovitskiy et al.](https:\/\/arxiv.org\/pdf\/1711.03938v1.pdf)\r\n\r\nImage source: [Dosovitskiy et al.](https:\/\/arxiv.org\/pdf\/1711.03938v1.pdf)","129":"**Conditional Random Fields** or **CRFs** are a type of probabilistic graphical model that takes neighboring sample context into account for tasks like classification. Prediction is modeled as a graphical model, which implements dependencies between the predictions. The choice of graph depends on the application; for example, linear-chain CRFs are popular in natural language processing, whereas in image-based tasks the graph would connect neighboring locations in an image to enforce that they have similar predictions.\r\n\r\nImage Credit: [Charles Sutton and Andrew McCallum, An Introduction to Conditional Random Fields](https:\/\/homepages.inf.ed.ac.uk\/csutton\/publications\/crftut-fnt.pdf)","130":"**Minimum Description Length** provides a criterion for the selection of models, regardless of their complexity, without the restrictive assumption that the data form a sample from a 'true' distribution.\r\n\r\nExtracted from [scholarpedia](http:\/\/scholarpedia.org\/article\/Minimum_description_length)\r\n\r\n**Source**:\r\n\r\nPaper: [J. Rissanen (1978) Modeling by the shortest data description. Automatica 14, 465-471](https:\/\/doi.org\/10.1016\/0005-1098(78)90005-5)\r\n\r\nBook: [P. D. Gr\u00fcnwald (2007) The Minimum Description Length Principle, MIT Press, June 2007, 570 pages](https:\/\/ieeexplore.ieee.org\/servlet\/opac?bknumber=6267274)","131":"A **Memory Network** provides a memory component that can be read from and written to with the inference capabilities of a neural network model. The motivation is that many neural networks lack a long-term memory component, and their existing memory component encoded by states and weights is too small and not compartmentalized enough to accurately remember facts from the past (RNNs, for example, have difficulty memorizing and doing tasks like copying). \r\n\r\nA memory network consists of a memory $\\textbf{m}$ (an array of objects indexed by $\\textbf{m}\\_{i}$) and four potentially learned components:\r\n\r\n- Input feature map $I$ - feature representation of the data input.\r\n- Generalization $G$ - updates old memories given the new input.\r\n- Output feature map $O$ - produces new feature map given $I$ and $G$.\r\n- Response $R$ - converts output into the desired response. \r\n\r\nGiven an input $x$ (e.g., an input character, word or sentence depending on the granularity chosen, an image or an audio signal) the flow of the model is as follows:\r\n\r\n1. Convert $x$ to an internal feature representation $I\\left(x\\right)$.\r\n2. Update memories $m\\_{i}$ given the new input: $m\\_{i} = G\\left(m\\_{i}, I\\left(x\\right), m\\right)$, $\\forall{i}$.\r\n3. Compute output features $o$ given the new input and the memory: $o = O\\left(I\\left(x\\right), m\\right)$.\r\n4. Finally, decode output features $o$ to give the final response: $r = R\\left(o\\right)$.\r\n\r\nThis process is applied at both train and test time, if there is a distinction between such phases, that\r\nis, memories are also stored at test time, but the model parameters of $I$, $G$, $O$ and $R$ are not updated. Memory networks cover a wide class of possible implementations. The components $I$, $G$, $O$ and $R$ can potentially use any existing ideas from the machine learning literature.\r\n\r\nImage Source: [Adrian Colyer](https:\/\/blog.acolyer.org\/2016\/03\/10\/memory-networks\/)","132":"Diffusion models generate samples by gradually removing noise from a signal, and their training objective can be expressed as a reweighted variational lower bound ([Ho et al., 2020](https:\/\/arxiv.org\/abs\/2006.11239)).","133":"**Discriminative Fine-Tuning** is a fine-tuning strategy that is used for [ULMFiT](https:\/\/paperswithcode.com\/method\/ulmfit) type models. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with a different learning rate. For context, the regular stochastic gradient descent ([SGD](https:\/\/paperswithcode.com\/method\/sgd)) update of a model\u2019s parameters $\\theta$ at time step $t$ looks like the following (Ruder, 2016):\r\n\r\n$$ \\theta\\_{t} = \\theta\\_{t-1} \u2212 \\eta\\cdot\\nabla\\_{\\theta}J\\left(\\theta\\right)$$\r\n\r\nwhere $\\eta$ is the learning rate and $\\nabla\\_{\\theta}J\\left(\\theta\\right)$ is the gradient with regard to the model\u2019s objective function. For discriminative fine-tuning, we split the parameters $\\theta$ into {$\\theta\\_{1}, \\ldots, \\theta\\_{L}$} where $\\theta\\_{l}$ contains the parameters of the model at the $l$-th layer and $L$ is the number of layers of the model. Similarly, we obtain {$\\eta\\_{1}, \\ldots, \\eta\\_{L}$}, where $\\eta\\_{l}$ is the learning rate of the $l$-th layer. The SGD update with discriminative fine-tuning is then:\r\n\r\n$$ \\theta\\_{t}^{l} = \\theta\\_{t-1}^{l} - \\eta^{l}\\cdot\\nabla\\_{\\theta^{l}}J\\left(\\theta\\right) $$\r\n\r\nThe authors find that empirically it worked well to first choose the learning rate $\\eta^{L}$ of the last layer by fine-tuning only the last layer and using $\\eta^{l-1}=\\eta^{l}\/2.6$ as the learning rate for lower layers.
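\r\n\r\nA PyTorch-style sketch using one parameter group per layer (the base rate and the 2.6 factor follow the description; the ordering of `layers` from lowest to highest is an assumption):\r\n\r\n```python\r\nimport torch\r\n\r\ndef discriminative_optimizer(layers, eta_last=0.01, factor=2.6):\r\n    # assign geometrically decaying learning rates from the last layer downwards\r\n    groups, eta = [], eta_last\r\n    for layer in reversed(layers):\r\n        groups.append({\"params\": list(layer.parameters()), \"lr\": eta})\r\n        eta = eta \/ factor\r\n    return torch.optim.SGD(groups)\r\n```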
","134":"**GPT-2** is a [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers) architecture that was notable for its size (1.5 billion parameters) on its release. The model is pretrained on a WebText dataset - text from 45 million website links. It largely follows the previous [GPT](https:\/\/paperswithcode.com\/method\/gpt) architecture with some modifications:\r\n\r\n- [Layer normalization](https:\/\/paperswithcode.com\/method\/layer-normalization) is moved to the input of each sub-block, similar to a\r\npre-activation residual network, and an additional layer normalization is added after the final self-attention block. \r\n\r\n- A modified initialization which accounts for the accumulation on the residual path with model depth\r\nis used. Weights of residual layers are scaled at initialization by a factor of $1\/\\sqrt{N}$, where $N$ is the number of residual layers. \r\n\r\n- The vocabulary is expanded to 50,257. The context size is expanded from 512 to 1024 tokens and\r\na larger batch size of 512 is used.","135":"**Gaussian Processes** are non-parametric models for approximating functions. They rely upon a measure of similarity between points (the kernel function) to predict the value for an unseen point from training data. The models are fully probabilistic, so uncertainty bounds are baked into the model.\r\n\r\nImage Source: Gaussian Processes for Machine Learning, C. E. Rasmussen & C. K. I. Williams","136":"**RoBERTa** is an extension of [BERT](https:\/\/paperswithcode.com\/method\/bert) with changes to the pretraining procedure. The modifications include: \r\n\r\n- training the model longer, with bigger batches, over more data\r\n- removing the next sentence prediction objective\r\n- training on longer sequences\r\n- dynamically changing the masking pattern applied to the training data\r\n\r\nThe authors also collect a large new dataset ($\\text{CC-News}$) of comparable size to other privately used datasets, to better control for training set size effects.","137":"**node2vec** is a framework for learning graph embeddings for nodes in graphs. node2vec maximizes a likelihood objective over mappings which preserve neighbourhood distances in higher dimensional spaces. From an algorithm design perspective, node2vec exploits the freedom to define neighbourhoods for nodes and provides an explanation for the effect of the choice of neighbourhood on the learned representations. \r\n\r\nFor each node, node2vec simulates biased random walks based on an efficient network-aware search strategy, and the nodes appearing in the random walk define neighbourhoods. The search strategy accounts for the relative influence nodes exert in a network. It also generalizes prior work alluding to naive search strategies by providing flexibility in exploring neighbourhoods.","138":"The goal of **Triplet loss**, in the context of Siamese Networks, is to maximize the joint probability among all score-pairs, i.e. the product of all probabilities. By using its negative logarithm, we can get the loss formulation as follows:\r\n\r\n$$\r\nL\\_{t}\\left(\\mathcal{V}\\_{p}, \\mathcal{V}\\_{n}\\right)=-\\frac{1}{M N} \\sum\\_{i}^{M} \\sum\\_{j}^{N} \\log \\operatorname{prob}\\left(v p\\_{i}, v n\\_{j}\\right)\r\n$$\r\n\r\nwhere the balance weight $1\/MN$ is used to keep the loss at the same scale for different numbers of instance sets.","139":"**Contrastive Language-Image Pre-training** (**CLIP**), consisting of a simplified version of ConVIRT trained from scratch, is an efficient method of image representation learning from natural language supervision. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset\u2019s classes. \r\n\r\nFor pre-training, CLIP is trained to predict which of the $N \\times N$ possible (image, text) pairings across a batch actually occurred. CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ real pairs in the batch while minimizing the cosine similarity of the embeddings of the $N^2 - N$ incorrect pairings. A symmetric cross entropy loss is optimized over these similarity scores.
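\r\n\r\nA PyTorch-style sketch of this symmetric loss (the encoders are omitted, and the fixed temperature is illustrative; CLIP actually learns the temperature):\r\n\r\n```python\r\nimport torch\r\nimport torch.nn.functional as F\r\n\r\ndef clip_loss(image_emb, text_emb, temperature=0.07):\r\n    # image_emb, text_emb: (N, D) embeddings of the N aligned (image, text) pairs\r\n    image_emb = F.normalize(image_emb, dim=-1)\r\n    text_emb = F.normalize(text_emb, dim=-1)\r\n    logits = image_emb @ text_emb.t() \/ temperature  # N x N cosine similarities\r\n    labels = torch.arange(len(logits), device=logits.device)  # true pairs on the diagonal\r\n    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) \/ 2\r\n```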
\r\n\r\nImage credit: [Learning Transferable Visual Models From Natural Language Supervision](https:\/\/arxiv.org\/pdf\/2103.00020.pdf)","140":"**Fully Convolutional Networks**, or **FCNs**, are an architecture used mainly for semantic segmentation. They employ solely locally connected layers, such as [convolution](https:\/\/paperswithcode.com\/method\/convolution), pooling and upsampling. Avoiding the use of dense layers means fewer parameters (making the networks faster to train). It also means an FCN can work for variable image sizes, given that all connections are local.\r\n\r\nThe network consists of a downsampling path, used to extract and interpret the context, and an upsampling path, which allows for localization. \r\n\r\nFCNs also employ skip connections to recover the fine-grained spatial information lost in the downsampling path.","141":"**ENet Dilated Bottleneck** is an image model block used in the [ENet](https:\/\/paperswithcode.com\/method\/enet) semantic segmentation architecture. It is the same as a regular [ENet Bottleneck](https:\/\/paperswithcode.com\/method\/enet-bottleneck) but employs dilated convolutions instead.","142":"**ENet Bottleneck** is an image model block used in the [ENet](https:\/\/paperswithcode.com\/method\/enet) semantic segmentation architecture. Each block consists of three convolutional layers: a 1 \u00d7 1 projection that reduces the dimensionality, a main convolutional layer, and a 1 \u00d7 1 expansion. We place [Batch Normalization](https:\/\/paperswithcode.com\/method\/batch-normalization) and [PReLU](https:\/\/paperswithcode.com\/method\/prelu) between all convolutions. If the bottleneck is downsampling, a [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) layer is added to the main branch.\r\nAlso, the first 1 \u00d7 1 projection is replaced with a 2 \u00d7 2 [convolution](https:\/\/paperswithcode.com\/method\/convolution) with stride 2 in both dimensions. We zero-pad the activations to match the number of feature maps.","143":"The **ENet Initial Block** is an image model block used in the [ENet](https:\/\/paperswithcode.com\/method\/enet) semantic segmentation architecture. [Max Pooling](https:\/\/paperswithcode.com\/method\/max-pooling) is performed with non-overlapping 2 \u00d7 2 windows, and the [convolution](https:\/\/paperswithcode.com\/method\/convolution) has 13 filters, which sums up to 16 feature maps after concatenation. This is heavily inspired by Inception Modules.","144":"**SpatialDropout** is a type of [dropout](https:\/\/paperswithcode.com\/method\/dropout) for convolutional networks. For a given [convolution](https:\/\/paperswithcode.com\/method\/convolution) feature tensor of size $n\\_{\\text{feats}}$\u00d7height\u00d7width, we perform only $n\\_{\\text{feats}}$ dropout\r\ntrials and extend the dropout value across the entire feature map. Therefore, adjacent pixels in the dropped-out feature\r\nmap are either all 0 (dropped-out) or all active as illustrated in the figure to the right.
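\r\n\r\nPyTorch exposes this as `torch.nn.Dropout2d`; a manual sketch of the idea (illustrative):\r\n\r\n```python\r\nimport torch\r\n\r\ndef spatial_dropout(x, p=0.5, training=True):\r\n    # x: (N, C, H, W); one Bernoulli trial per feature map, not per pixel\r\n    if not training or p == 0:\r\n        return x\r\n    mask = (torch.rand(x.shape[0], x.shape[1], 1, 1, device=x.device) >= p).float()\r\n    return x * mask \/ (1 - p)  # rescale so activations keep the same expectation\r\n```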
","145":"A **Parametric Rectified Linear Unit**, or **PReLU**, is an activation function that generalizes the traditional rectified unit with a slope for negative values. Formally:\r\n\r\n$$f\\left(y\\_{i}\\right) = y\\_{i} \\text{ if } y\\_{i} \\ge 0$$\r\n$$f\\left(y\\_{i}\\right) = a\\_{i}y\\_{i} \\text{ if } y\\_{i} < 0$$\r\n\r\nThe intuition is that different layers may require different types of nonlinearity. Indeed, the authors find in experiments with convolutional neural networks that PReLUs for the initial layer have more positive slopes, i.e. are closer to linear. Since the filters of the first layers are Gabor-like filters, such as edge or texture detectors, this suggests a circumstance where both positive and negative responses of the filters are respected. In contrast, the authors find that deeper layers have smaller coefficients, suggesting the model becomes more discriminative at later layers (while it wants to retain more information at earlier layers).","146":"**ENet** is a semantic segmentation architecture which utilises a compact encoder-decoder architecture. Some design choices include:\r\n\r\n1. Using the [SegNet](https:\/\/paperswithcode.com\/method\/segnet) approach to downsampling by saving indices of elements chosen in max\r\npooling layers, and using them to produce sparse upsampled maps in the decoder.\r\n2. Early downsampling to optimize the early stages of the network and reduce the cost of processing large input frames. The first two blocks of ENet heavily reduce the input size, and use only a small set of feature maps. \r\n3. Using PReLUs as an activation function\r\n4. Using dilated convolutions \r\n5. Using Spatial [Dropout](https:\/\/paperswithcode.com\/method\/dropout)","147":"**PatchGAN** is a type of discriminator for generative adversarial networks which only penalizes structure at the scale of local image patches. The PatchGAN discriminator tries to classify if each $N \\times N$ patch in an image is real or fake. This discriminator is run convolutionally across the image, averaging all responses to provide the ultimate output of $D$. Such a discriminator effectively models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. It can be understood as a type of texture\/style loss.","148":"**Instance Normalization** (also known as contrast normalization) is a normalization layer where:\r\n\r\n$$\r\n y_{tijk} = \\frac{x_{tijk} - \\mu_{ti}}{\\sqrt{\\sigma_{ti}^2 + \\epsilon}},\r\n \\quad\r\n \\mu_{ti} = \\frac{1}{HW}\\sum_{l=1}^W \\sum_{m=1}^H x_{tilm},\r\n \\quad\r\n \\sigma_{ti}^2 = \\frac{1}{HW}\\sum_{l=1}^W \\sum_{m=1}^H (x_{tilm} - \\mu_{ti})^2.\r\n$$\r\n\r\nThis prevents instance-specific mean and covariance shift, simplifying the learning process. Intuitively, the normalization removes instance-specific contrast information from the content image in a task like image stylization, which simplifies generation.
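\r\n\r\nA direct PyTorch-style transcription of these equations (illustrative; `torch.nn.InstanceNorm2d` provides the same computation):\r\n\r\n```python\r\nimport torch\r\n\r\ndef instance_norm(x, eps=1e-5):\r\n    # x: (N, C, H, W); statistics are per instance t and per channel i\r\n    mu = x.mean(dim=(2, 3), keepdim=True)\r\n    var = x.var(dim=(2, 3), unbiased=False, keepdim=True)\r\n    return (x - mu) \/ torch.sqrt(var + eps)\r\n```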
","149":"**GAN Least Squares Loss** is a least squares loss function for generative adversarial networks. Minimizing this objective function is equivalent to minimizing the Pearson $\\chi^{2}$ divergence. The objective function (here for [LSGAN](https:\/\/www.paperswithcode.com\/method\/lsgan)) can be defined as:\r\n\r\n$$ \\min\\_{D}V\\_{LS}\\left(D\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{x} \\sim p\\_{data}\\left(\\mathbf{x}\\right)}\\left[\\left(D\\left(\\mathbf{x}\\right) - b\\right)^{2}\\right] + \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z}\\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - a\\right)^{2}\\right] $$\r\n\r\n$$ \\min\\_{G}V\\_{LS}\\left(G\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z} \\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - c\\right)^{2}\\right] $$\r\n\r\nwhere $a$ and $b$ are the labels for fake data and real data, respectively, and $c$ denotes the value that $G$ wants $D$ to believe for fake data.","150":"**Cycle Consistency Loss** is a type of loss used for generative adversarial networks that perform unpaired image-to-image translation. It was introduced with the [CycleGAN](https:\/\/paperswithcode.com\/method\/cyclegan) architecture. For two domains $X$ and $Y$, we want to learn a mapping $G : X \\rightarrow Y$ and a mapping $F: Y \\rightarrow X$. We want to enforce the intuition that these mappings should be reverses of each other and that both mappings should be bijections. Cycle Consistency Loss encourages $F\\left(G\\left(x\\right)\\right) \\approx x$ and $G\\left(F\\left(y\\right)\\right) \\approx y$. It reduces the space of possible mapping functions by enforcing forward and backward consistency:\r\n\r\n$$ \\mathcal{L}\\_{cyc}\\left(G, F\\right) = \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[||F\\left(G\\left(x\\right)\\right) - x||\\_{1}\\right] + \\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[||G\\left(F\\left(y\\right)\\right) - y||\\_{1}\\right] $$
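\r\n\r\nA PyTorch-style sketch of this term (illustrative; `G` and `F` are the two generator networks):\r\n\r\ndef cycle_consistency_loss(G, F, x, y):\r\n    # forward cycle x -> G(x) -> F(G(x)) and backward cycle y -> F(y) -> G(F(y))\r\n    forward = (F(G(x)) - x).abs().mean()   # L1 norm, averaged over the batch\r\n    backward = (G(F(y)) - y).abs().mean()\r\n    return forward + backward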
","151":"**CycleGAN**, or **Cycle-Consistent GAN**, is a type of generative adversarial network for unpaired image-to-image translation. For two domains $X$ and $Y$, CycleGAN learns a mapping $G : X \\rightarrow Y$ and a mapping $F: Y \\rightarrow X$. The novelty lies in trying to enforce the intuition that these mappings should be reverses of each other and that both mappings should be bijections. This is achieved through a [cycle consistency loss](https:\/\/paperswithcode.com\/method\/cycle-consistency-loss) that encourages $F\\left(G\\left(x\\right)\\right) \\approx x$ and $G\\left(F\\left(y\\right)\\right) \\approx y$. Combining this loss with the adversarial losses on $X$ and $Y$ yields the full objective for unpaired image-to-image translation.\r\n\r\nFor the mapping $G : X \\rightarrow Y$ and its discriminator $D\\_{Y}$ we have the objective:\r\n\r\n$$ \\mathcal{L}\\_{GAN}\\left(G, D\\_{Y}, X, Y\\right) =\\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[\\log D\\_{Y}\\left(y\\right)\\right] + \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[\\log\\left(1 \u2212 D\\_{Y}\\left(G\\left(x\\right)\\right)\\right)\\right] $$\r\n\r\nwhere $G$ tries to generate images $G\\left(x\\right)$ that look similar to images from domain $Y$, while $D\\_{Y}$ tries to discriminate between translated samples $G\\left(x\\right)$ and real samples $y$. A similar loss is postulated for the mapping $F: Y \\rightarrow X$ and its discriminator $D\\_{X}$.\r\n\r\nThe Cycle Consistency Loss reduces the space of possible mapping functions by enforcing forward and backward consistency:\r\n\r\n$$ \\mathcal{L}\\_{cyc}\\left(G, F\\right) = \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[||F\\left(G\\left(x\\right)\\right) - x||\\_{1}\\right] + \\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[||G\\left(F\\left(y\\right)\\right) - y||\\_{1}\\right] $$\r\n\r\nThe full objective is:\r\n\r\n$$ \\mathcal{L}\\left(G, F, D\\_{X}, D\\_{Y}\\right) = \\mathcal{L}\\_{GAN}\\left(G, D\\_{Y}, X, Y\\right) + \\mathcal{L}\\_{GAN}\\left(F, D\\_{X}, Y, X\\right) + \\lambda\\mathcal{L}\\_{cyc}\\left(G, F\\right) $$\r\n\r\nwhere we aim to solve:\r\n\r\n$$ G^{\\*}, F^{\\*} = \\arg \\min\\_{G, F} \\max\\_{D\\_{X}, D\\_{Y}} \\mathcal{L}\\left(G, F, D\\_{X}, D\\_{Y}\\right) $$\r\n\r\nFor the original architecture the authors use:\r\n\r\n- two stride-2 convolutions, several residual blocks, and two fractionally strided convolutions with stride $\\frac{1}{2}$.\r\n- [instance normalization](https:\/\/paperswithcode.com\/method\/instance-normalization)\r\n- PatchGANs for the discriminator\r\n- Least Squares Loss for the [GAN](https:\/\/paperswithcode.com\/method\/gan) objectives.","152":"**PointNet** provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. It directly takes point clouds as input and outputs either class labels for the entire input or per-point segment\/part labels for each point of the input.\r\n\r\nSource: [Qi et al.](https:\/\/arxiv.org\/pdf\/1612.00593v2.pdf)\r\n\r\nImage source: [Qi et al.](https:\/\/arxiv.org\/pdf\/1612.00593v2.pdf)","153":"**Adaptive Dropout** is a regularization technique that extends dropout by allowing the dropout probability to be different for different units. The intuition is that there may be hidden units that can individually make confident predictions for the presence or absence of an important feature or combination of features. [Dropout](https:\/\/paperswithcode.com\/method\/dropout) will ignore this confidence and drop the unit out 50% of the time. \r\n\r\nDenote the activity of unit $j$ in a deep neural network by $a\\_{j}$ and assume that its inputs are {$a\\_{i}: i