diff --git "a/methods.json" "b/methods.json" --- "a/methods.json" +++ "b/methods.json" @@ -1 +1 @@ -{"title":{"0":"Causal Inference","1":"AutoEncoder","2":"LDA","3":"SVM","4":"GloVe","5":"Residual Connection","6":"Attention Dropout","7":"Linear Warmup With Linear Decay","8":"Weight Decay","9":"GELU","10":"Dense Connections","11":"Adam","12":"WordPiece","13":"Softmax","14":"Dropout","15":"Multi-Head Attention","16":"Layer Normalization","17":"Scaled Dot-Product Attention","18":"BERT","19":"Absolute Position Encodings","20":"Position-Wise Feed-Forward Layer","21":"BPE","22":"Label Smoothing","23":"ReLU","24":"Transformer","25":"Convolution","26":"Dilated Convolution","27":"PCA","28":"Graph Convolutional Networks","29":"Tanh Activation","30":"Sigmoid Activation","31":"LSTM","32":"BiLSTM","33":"ELMo","34":"RPN","35":"RoIPool","36":"Faster R-CNN","37":"GAN","38":"Concatenated Skip Connection","39":"Max Pooling","40":"U-Net","41":"Interpretability","42":"CurricularFace","43":"Weight Normalization","44":"L1 Regularization","45":"Softsign Activation","46":"Leaky ReLU","47":"GLU","48":"Normalizing Flows","49":"DV3 Attention Block","50":"DV3 Convolution Block","51":"Bridge-net","52":"ClariNet","53":"Mixture of Logistic Distributions","54":"Dilated Causal Convolution","55":"WaveNet","56":"Dot-Product Attention","57":"GCN","58":"Batch Normalization","59":"TuckER","60":"Average Pooling","61":"1x1 Convolution","62":"Bottleneck Residual Block","63":"Global Average Pooling","64":"Residual Block","65":"Kaiming Initialization","66":"ResNet","67":"Q-Learning","68":"1D CNN","69":"XLM","70":"Cosine Annealing","71":"Strided Attention","72":"Linear Warmup With Cosine Annealing","73":"Fixed Factorized Attention","74":"GPT-3","75":"AWARE","76":"Experience Replay","77":"Entropy Regularization","78":"Soft Actor Critic","79":"PPO","80":"VQ-VAE","81":"Non Maximum Suppression","82":"SSD","83":"RoIAlign","84":"Mask R-CNN","85":"Inpainting","86":"Local Response Normalization","87":"Grouped Convolution","88":"AlexNet","89":"Non-Local Operation","90":"Non-Local Block","91":"k-Means Clustering","92":"Logistic Regression","93":"Darknet-53","94":"YOLOv3","95":"DTW","96":"ADMM","97":"Knowledge Distillation","98":"Additive Attention","99":"CNN BiLSTM","100":"Cross-View Training","101":"ALIGN","102":"Restricted Boltzmann Machine","103":"ReLIC","104":"GRU","105":"GTrXL","106":"CoBERL","107":"Seq2Seq","108":"VERSE","109":"Temporal attention","110":"DQN","111":"Hourglass Module","112":"Random Scaling","113":"Stacked Hourglass Network","114":"Corner Pooling","115":"CornerNet","116":"Soft-NMS","117":"Random Horizontal Flip","118":"Step Decay","119":"MatrixNet","120":"Global-Local Attention","121":"R1 Regularization","122":"Feedforward Network","123":"Adaptive Instance Normalization","124":"StyleGAN","125":"MAML","126":"NeRF","127":"ELECTRA","128":"CARLA","129":"CRF","130":"MDL","131":"Memory Network","132":"Diffusion","133":"Discriminative Fine-Tuning","134":"GPT-2","135":"Gaussian Process","136":"RoBERTa","137":"node2vec","138":"Triplet Loss","139":"CLIP","140":"FCN","141":"ENet Dilated Bottleneck","142":"ENet Bottleneck","143":"ENet Initial Block","144":"SpatialDropout","145":"PReLU","146":"ENet","147":"PatchGAN","148":"Instance Normalization","149":"GAN Least Squares Loss","150":"Cycle Consistency Loss","151":"CycleGAN","152":"PointNet","153":"Adaptive Dropout","154":"Vision Transformer","155":"DistilBERT","156":"SimCSE","157":"TS","158":"Adafactor","159":"Inverse Square Root Schedule","160":"SentencePiece","161":"T5","162":"Dense 
Block","163":"DenseNet","164":"TGN","165":"Random Erasing","166":"HyperNetwork","167":"Temporal Activation Regularization","168":"DropConnect","169":"Activation Regularization","170":"Embedding Dropout","171":"Variational Dropout","172":"Weight Tying","173":"AWD-LSTM","174":"Slanted Triangular Learning Rates","175":"ULMFiT","176":"Random Gaussian Blur","177":"ColorJitter","178":"Random Resized Crop","179":"NT-Xent","180":"InfoNCE","181":"SimCLR","182":"MoCo","183":"Pointer Network","184":"Detr","185":"SGD","186":"VGG","187":"DeepMask","188":"GAT","189":"PGM","190":"Mixup","191":"Dice Loss","192":"RMSProp","193":"Depthwise Convolution","194":"Swish","195":"Pointwise Convolution","196":"Depthwise Separable Convolution","197":"Squeeze-and-Excitation Block","198":"Inverted Residual Block","199":"EfficientNet","200":"DCNN","201":"REINFORCE","202":"Focal Loss","203":"Monte-Carlo Tree Search","204":"GPS","205":"fastText","206":"Softplus","207":"Mish","208":"Spatial Pyramid Pooling","209":"RepPoints","210":"Xception","211":"Nesterov Accelerated Gradient","212":"Stochastic Depth","213":"Swin Transformer","214":"DeiT","215":"Dynamic Convolution","216":"FixMatch","217":"Early Stopping","218":"Auxiliary Classifier","219":"Inception-v3 Module","220":"Inception-v3","221":"VAE","222":"Double Q-learning","223":"Double DQN","224":"Pyramidal Bottleneck Residual Unit","225":"Zero-padded Shortcut Connection","226":"Pyramidal Residual Unit","227":"PyramidNet","228":"SegNet","229":"Linear Regression","230":"DualGCN","231":"TrOCR","232":"Cross-Attention Module","233":"FPN","234":"RetinaNet","235":"Laplacian PE","236":"Graph Transformer","237":"Spectral Clustering","238":"Highway Layer","239":"Highway Network","240":"BiGRU","241":"CBHG","242":"Residual GRU","243":"Griffin-Lim Algorithm","244":"Tacotron","245":"Random Search","246":"CodeBERT","247":"IPL","248":"ScatNet","249":"Contrastive Predictive Coding","250":"MobileNetV2","251":"GPT","252":"Colorization","253":"Darknet-19","254":"YOLOv2","255":"EMF","256":"Channel Attention Module","257":"Spatial Attention Module","258":"Channel attention","259":"GradDrop","260":"Gradient Sparsification","261":"Clipped Double Q-learning","262":"Target Policy Smoothing","263":"TD3","264":"DDPG","265":"GAIL","266":"Maxout","267":"Minibatch Discrimination","268":"Orthogonal Regularization","269":"Multiscale Dilated Convolution Block","270":"IAN","271":"Barlow Twins","272":"Poincar\u00e9 Embeddings","273":"RAM","274":"Cutout","275":"DropPath","276":"ProxylessNAS","277":"Retrace","278":"FCOS","279":"TILDEv2","280":"Early exiting","281":"Contextualized Topic Models","282":"Capsule Network","283":"GraphSAGE","284":"Siamese Network","285":"MobileNetV1","286":"Position-Sensitive RoI Pooling","287":"R-FCN","288":"DARTS","289":"LIME","290":"Focal Transformers","291":"DeepWalk","292":"Inception Module","293":"Inception v2","294":"ResNeXt Block","295":"ResNeXt","296":"RAN","297":"MADDPG","298":"Spatial Transformer","299":"RESCAL","300":"AE","301":"SOM","302":"Rendezvous","303":"XLNet","304":"Self-Adversarial Negative Sampling","305":"RotatE","306":"CAM","307":"Label Quality Model","308":"SMOTE","309":"InfoGAN","310":"Soft Actor-Critic (Autotuned Temperature)","311":"V-trace","312":"IMPALA","313":"A2C","314":"A3C","315":"Affine Coupling","316":"Invertible 1x1 Convolution","317":"WaveGlow","318":"TransE","319":"Apollo","320":"WGAN","321":"Denoising Autoencoder","322":"ReLU6","323":"Hard Swish","324":"MobileNetV3","325":"MnasNet","326":"GA","327":"HiFi-GAN","328":"SENet","329":"3D 
Convolution","330":"Activation Normalization","331":"GLOW","332":"MADGRAD","333":"AdaGrad","334":"DCGAN","335":"PresGAN","336":"SGD with Momentum","337":"Wide Residual Block","338":"WideResNet","339":"GoogLeNet","340":"Expected Sarsa","341":"Sarsa","342":"Agglomerative Contextual Decomposition","343":"DeepLab","344":"mBART","345":"BART","346":"Deformable Convolution","347":"ConvLSTM","348":"OASIS","349":"DropBlock","350":"TRPO","351":"RandAugment","352":"Noisy Student","353":"STN","354":"Deep Belief Network","355":"AMP","356":"Temporal ROIAlign","357":"ICA","358":"DINO","359":"BYOL","360":"Two-Way Dense Layer","361":"PeleeNet","362":"NAM","363":"HANet","364":"NeuroTactic","365":"EWC","366":"CayleyNet","367":"wav2vec-U","368":"SELU","369":"SNN","370":"CodeT5","371":"Fast R-CNN","372":"N-step Returns","373":"ELU","374":"PixelCNN","375":"Pyramid Pooling Module","376":"PSPNet","377":"Spectral Normalization","378":"Levenshtein Transformer","379":"Spatial Gating Unit","380":"gMLP","381":"AdamW","382":"Dilated Sliding Window Attention","383":"Sliding Window Attention","384":"Global and Sliding Window Attention","385":"Longformer","386":"Electric","387":"BAM","388":"Discriminative Adversarial Search","389":"ArcFace","390":"SAGAN Self-Attention Module","391":"SAGAN","392":"Truncation Trick","393":"Off-Diagonal Orthogonal Regularization","394":"GAN Hinge Loss","395":"TTUR","396":"Conditional Batch Normalization","397":"Linear Layer","398":"Projection Discriminator","399":"BigGAN","400":"Blender","401":"GAM","402":"Attention Gate","403":"SANet","404":"Discrete Cosine Transform","405":"Procrustes","406":"AccoMontage","407":"SAC","408":"DANet","409":"MelGAN Residual Block","410":"Window-based Discriminator","411":"MelGAN","412":"FBNet Block","413":"FBNet","414":"Location-based Attention","415":"Content-based Attention","416":"Neural Turing Machine","417":"Path Length Regularization","418":"Weight Demodulation","419":"StyleGAN2","420":"SepFormer","421":"FAVOR+","422":"Performer","423":"DLA","424":"CSPDarknet53","425":"Bottom-up Path Augmentation","426":"Grid Sensitive","427":"CutMix","428":"PAFPN","429":"YOLOv4","430":"Disentangled Attention Mechanism","431":"DeBERTa","432":"PnP","433":"Highway networks","434":"SSE","435":"PVTv2","436":"DeCLUTR","437":"Fire Module","438":"Xavier Initialization","439":"SqueezeNet","440":"GMVAE","441":"Dense Contrastive Learning","442":"GLN","443":"Jigsaw","444":"Res2Net Block","445":"Res2Net","446":"Channel Shuffle","447":"ShuffleNet V2 Block","448":"DetNASNet","449":"DetNAS","450":"Spatial Broadcast Decoder","451":"Prioritized Experience Replay","452":"D4PG","453":"Stochastic Weight Averaging","454":"SHAP","455":"DEQ","456":"CSL","457":"mBERT","458":"Gradient Clipping","459":"Linear Warmup","460":"CTRL","461":"Boost-GNN","462":"Exponential Decay","463":"SRM","464":"PolarNet","465":"Groupwise Point Convolution","466":"ShuffleNet Block","467":"ShuffleNet","468":"Gradient Checkpointing","469":"SPNet","470":"Denoising Score Matching","471":"LAMB","472":"ALBERT","473":"TDN","474":"SFT","475":"Sparse Autoencoder","476":"WGAN-GP Loss","477":"Gravity","478":"ProGAN","479":"MuZero","480":"Prioritized Sweeping","481":"DPG","482":"MUSIQ","483":"ASPP","484":"DeepLabv3","485":"CenterTrack","486":"DAEL","487":"LAMA","488":"TayPO","489":"Adaptive Loss","490":"VOS","491":"Jukebox","492":"DAC","493":"Residual SRM","494":"Style-based Recalibration Module","495":"Reversible Residual Block","496":"RevNet","497":"TAM","498":"Concatenation Affinity","499":"Embedded Dot Product 
Affinity","500":"Embedded Gaussian Affinity","501":"WaveRNN","502":"Graph Self-Attention","503":"Hierarchical Feature Fusion","504":"ESP","505":"Sharpness-Aware Minimization","506":"SABL","507":"Cascade R-CNN","508":"Synthesizer","509":"CBAM","510":"LXMERT","511":"AugMix","512":"AMSGrad","513":"LV-ViT","514":"Concrete Dropout","515":"PointASNL","516":"IndexNet","517":"MAS","518":"ShuffleNet V2 Downsampling Block","519":"ShuffleNet v2","520":"Auxiliary Batch Normalization","521":"AdvProp","522":"MoCo v2","523":"Relative Position Encodings","524":"ETC","525":"T2T-ViT","526":"Dynamic Memory Network","527":"RandomRotate","528":"Polynomial Rate Decay","529":"GPipe","530":"Hydra","531":"Gumbel Softmax","532":"PixelShuffle","533":"Models Genesis","534":"Multi-Head Linear Attention","535":"Neural Architecture Search","536":"NAS-FPN","537":"Cyclical Learning Rate Policy","538":"Manifold Mixup","539":"STAC","540":"ResNeXt-Elastic","541":"DenseNet-Elastic","542":"Elastic Dense Block","543":"Elastic ResNeXt Block","544":"TAPAS","545":"k-NN","546":"Sparsemax","547":"NON","548":"R2D2","549":"R-CNN","550":"ZoomNet","551":"(2+1)D Convolution","552":"R(2+1)D","553":"RFB","554":"VLMo","555":"Pix2Pix","556":"Squared ReLU","557":"Multi-DConv-Head Attention","558":"Primer","559":"Channel-wise Soft Attention","560":"Anti-Alias Downsampling","561":"Selective Kernel Convolution","562":"Selective Kernel","563":"Big-Little Module","564":"AutoAugment","565":"Assemble-ResNet","566":"ResNet-D","567":"MPNN","568":"RAdam","569":"HypE","570":"Supervised Contrastive Loss","571":"Perceiver IO","572":"HyperDenseNet","573":"CR-NET","574":"Fast-OCR","575":"AlphaZero","576":"RAG","577":"Selective Search","578":"Ape-X","579":"VisualBERT","580":"ViLBERT","581":"Cosine Power Annealing","582":"AdaDelta","583":"AdaSmooth","584":"BiFPN","585":"EfficientDet","586":"MLP-Mixer","587":"mT5","588":"Adaptive Input Representations","589":"Adaptive Softmax","590":"SCNN_UNet_ConvLSTM","591":"TridentNet Block","592":"GAN Feature Matching","593":"Laplacian Pyramid","594":"Viewmaker Network","595":"SLR","596":"CoOp","597":"HITNet","598":"ScheduledDropPath","599":"Accumulating Eligibility Trace","600":"TD Lambda","601":"TD-Gammon","602":"Skip-gram Word2Vec","603":"MeRL","604":"OODformer","605":"RGA","606":"Content-Conditioned Style Encoder","607":"COCO-FUNIT","608":"Natural Gradient Descent","609":"Neural Probabilistic Language Model","610":"InstaBoost","611":"Inception-A","612":"Inception-C","613":"Reduction-A","614":"Inception-B","615":"Reduction-B","616":"Inception-v4","617":"LeNet","618":"Split Attention","619":"CoordConv","620":"DDParser","621":"Meta-augmentation","622":"Polya-Gamma Augmentation","623":"Deformable Attention Module","624":"Deformable DETR","625":"Tofu","626":"DSGN","627":"Fraternal Dropout","628":"Adversarial Color Enhancement","629":"Phase Shuffle","630":"WaveGAN","631":"AdaMax","632":"Axial Attention","633":"Local SGD","634":"DGCNN","635":"SRS","636":"Causal Convolution","637":"SPADE","638":"SCAN-clustering","639":"CRF-RNN","640":"CoVe","641":"ARMA","642":"MODERN","643":"RGCN","644":"TD-VAE","645":"Disentangled Attribution Curves","646":"CCNet","647":"Adaptive Masking","648":"Adaptive Span Transformer","649":"Sandwich Transformer","650":"SimAdapter","651":"Deep Boltzmann Machine","652":"Hopfield Layer","653":"LayerScale","654":"Spatial Group-wise Enhance","655":"VSF","656":"OSCAR","657":"RAE","658":"ESPNet","659":"DE-GAN","660":"PISA","661":"Noisy Linear Layer","662":"Dueling Network","663":"Rainbow DQN","664":"Center 
Pooling","665":"Cascade Corner Pooling","666":"CenterNet","667":"VL-T5","668":"Polyak Averaging","669":"CheXNet","670":"StruBERT","671":"MViT","672":"Transformer-XL","673":"HaloNet","674":"K-Net","675":"UNITER","676":"RFP","677":"BASNet","678":"Cascade Mask R-CNN","679":"Hit-Detector","680":"DeepCluster","681":"CvT","682":"SFAM","683":"SRGAN Residual Block","684":"VGG Loss","685":"SRGAN","686":"LCC","687":"ESIM","688":"SIFA","689":"Epsilon Greedy Exploration","690":"Affine Operator","691":"MATE","692":"CenterPoint","693":"AlphaFold","694":"Teacher-Tutor-Student Knowledge Distillation","695":"MagFace","696":"Dual Softmax Loss","697":"CAMoE","698":"CBNet","699":"EEND","700":"CCT","701":"VoiceFilter-Lite","702":"SGDW","703":"Deformable ConvNets","704":"Deformable Position-Sensitive RoI Pooling","705":"Deformable RoI Pooling","706":"LightGCN","707":"SCA-CNN","708":"CMCL","709":"UNIMO","710":"Demon","711":"T-Fixup","712":"Sparse Transformer","713":"Spatial Feature Transform","714":"SpreadsheetCoder","715":"Inception-ResNet-v2 Reduction-B","716":"Inception-ResNet-v2-A","717":"Inception-ResNet-v2-B","718":"Inception-ResNet-v2-C","719":"Inception-ResNet-v2","720":"NODE","721":"Mechanism Transfer","722":"DD-PPO","723":"FLICA","724":"Spatially Separable Convolution","725":"GNS","726":"SoftPool","727":"Style Transfer Module","728":"OHEM","729":"PIRL","730":"DVD-GAN DBlock","731":"DVD-GAN GBlock","732":"TSRUc","733":"TSRUp","734":"TSRUs","735":"TrIVD-GAN","736":"L-GCN","737":"E2EAdaptiveDistTraining","738":"BigBird","739":"CGNN","740":"MAVL","741":"LMOT","742":"Latent Optimisation","743":"GIN","744":"nnFormer","745":"LayoutReader","746":"DECA","747":"REM","748":"ChebNet","749":"FixRes","750":"Single-Headed Attention","751":"Boom Layer","752":"SHA-RNN","753":"SortCut Sinkhorn Attention","754":"Sparse Sinkhorn Attention","755":"Sinkhorn Transformer","756":"ORB-SLAM2","757":"YOLOv1","758":"Multiplicative Attention","759":"U2-Net","760":"K3M","761":"RegNetY","762":"LARS","763":"SwAV","764":"SEER","765":"PP-OCR","766":"RFB Net","767":"Parallax","768":"Symbolic Deep Learning","769":"Parrot","770":"Probabilistic Anchor Assignment","771":"Pixel-BERT","772":"SAGA","773":"Ape-X DQN","774":"OFA","775":"YOHO","776":"HRNet","777":"Universal Transformer","778":"Temporal Distribution Matching","779":"Temporal Distribution Characterization","780":"AdaRNN","781":"CTC Loss","782":"CRISS","783":"SM3","784":"Beta-VAE","785":"DynaBERT","786":"DiffPool","787":"DASPP","788":"LiteSeg","789":"MFF","790":"UNet++","791":"TaBERT","792":"Macaw","793":"End-To-End Memory Network","794":"LRNet","795":"Network Dissection","796":"Visual Parsing","797":"Image Scale Augmentation","798":"GShard","799":"NICE","800":"Voxel RoI Pooling","801":"Voxel R-CNN","802":"State-Aware Tracker","803":"LGCL","804":"ABC","805":"DistanceNet","806":"Population Based Training","807":"PAR Transformer","808":"Fractal Block","809":"FractalNet","810":"BLIP","811":"FastSGT","812":"FoveaBox","813":"Context Enhancement Module","814":"Spatial Attention Module (ThunderNet)","815":"Position-Sensitive RoIAlign","816":"SNet","817":"ThunderNet","818":"HTC","819":"SIG"},"description":{"0":"Causal inference is the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect. 
The main difference between causal inference and inference of association is that the former analyzes the response of the effect variable when the cause is changed.","1":"An **Autoencoder** is a bottleneck architecture that turns a high-dimensional input into a latent low-dimensional code (encoder), and then performs a reconstruction of the input with this latent code (the decoder).\r\n\r\nImage: [Michael Massi](https:\/\/en.wikipedia.org\/wiki\/Autoencoder#\/media\/File:Autoencoder_schema.png)","2":"**Linear discriminant analysis** (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.\r\n\r\nExtracted from [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Linear_discriminant_analysis)\r\n\r\n**Source**:\r\n\r\nPaper: [Linear Discriminant Analysis: A Detailed Tutorial](https:\/\/dx.doi.org\/10.3233\/AIC-170729)\r\n\r\nPublic version: [Linear Discriminant Analysis: A Detailed Tutorial](https:\/\/usir.salford.ac.uk\/id\/eprint\/52074\/)","3":"A **Support Vector Machine**, or **SVM**, is a non-parametric supervised learning model. For non-linear classification and regression, it utilises the kernel trick to map inputs to high-dimensional feature spaces. SVMs construct a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. The figure to the right shows the decision function for a linearly separable problem, with three samples on the margin boundaries, called \u201csupport vectors\u201d. \r\n\r\nSource: [scikit-learn](https:\/\/scikit-learn.org\/stable\/modules\/svm.html)","4":"**GloVe Embeddings** are a type of word embedding that encode the co-occurrence probability ratio between two words as vector differences. GloVe uses a weighted least squares objective $J$ that minimizes the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences:\r\n\r\n$$ J=\sum\_{i, j=1}^{V}f\left(X\_{ij}\right)(w^{T}\_{i}\tilde{w}_{j} + b\_{i} + \tilde{b}\_{j} - \log{X}\_{ij})^{2} $$\r\n\r\nwhere $w\_{i}$ and $b\_{i}$ are the word vector and bias respectively of word $i$, $\tilde{w}_{j}$ and $\tilde{b}\_{j}$ are the context word vector and bias respectively of word $j$, $X\_{ij}$ is the number of times word $i$ occurs in the context of word $j$, and $f$ is a weighting function that assigns lower weights to rare and frequent co-occurrences.","5":"**Residual Connections** are a type of skip-connection that learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. \r\n\r\nFormally, denoting the desired underlying mapping as $\mathcal{H}({x})$, we let the stacked nonlinear layers fit another mapping of $\mathcal{F}({x}):=\mathcal{H}({x})-{x}$. 
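\r\n\r\nAs a minimal sketch (assuming PyTorch; the wrapped block here is hypothetical), the connection just adds the block input back to the block output:\r\n\r\n```python\r\nimport torch\r\nimport torch.nn as nn\r\n\r\nclass ResidualWrapper(nn.Module):\r\n    # Wraps any shape-preserving block F and returns F(x) + x.\r\n    def __init__(self, block):\r\n        super().__init__()\r\n        self.block = block\r\n\r\n    def forward(self, x):\r\n        return self.block(x) + x\r\n\r\nres = ResidualWrapper(nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)))\r\ny = res(torch.randn(8, 64))  # output shape matches the input: (8, 64)\r\n```\r\n\r\n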
The original mapping is recast into $\mathcal{F}({x})+{x}$.\r\n\r\nThe intuition is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.","6":"**Attention Dropout** is a type of [dropout](https:\/\/paperswithcode.com\/method\/dropout) used in attention-based architectures, where elements are randomly dropped out of the [softmax](https:\/\/paperswithcode.com\/method\/softmax) in the attention equation. For example, for scaled dot-product attention, we would drop elements from the first term:\r\n\r\n$$ {\text{Attention}}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V $$","7":"**Linear Warmup With Linear Decay** is a learning rate schedule in which we increase the learning rate linearly for $n$ updates and then linearly decay afterwards.","8":"**Weight Decay**, or **$L_{2}$ Regularization**, is a regularization technique applied to the weights of a neural network. We minimize a loss function comprising both the primary loss function and a penalty on the $L\_{2}$ Norm of the weights:\r\n\r\n$$L\_{new}\left(w\right) = L\_{original}\left(w\right) + \lambda{w^{T}w}$$\r\n\r\nwhere $\lambda$ is a value determining the strength of the penalty (encouraging smaller weights). \r\n\r\nWeight decay can be incorporated directly into the weight update rule, rather than just implicitly by defining it through the objective function. Often weight decay refers to the implementation where we specify it directly in the weight update rule (whereas L2 regularization is usually the implementation which is specified in the objective function).\r\n\r\nImage Source: Deep Learning, Goodfellow et al","9":"The **Gaussian Error Linear Unit**, or **GELU**, is an activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their percentile, rather than gating inputs by their sign as in [ReLUs](https:\/\/paperswithcode.com\/method\/relu) ($x\mathbf{1}_{x>0}$). Consequently the GELU can be thought of as a smoother ReLU.\r\n\r\n$$\text{GELU}\left(x\right) = x{P}\left(X\leq{x}\right) = x\Phi\left(x\right) = x \cdot \frac{1}{2}\left[1 + \text{erf}(x\/\sqrt{2})\right],$$\r\nif $X\sim \mathcal{N}(0,1)$.\r\n\r\nOne can approximate the GELU with\r\n$0.5x\left(1+\tanh\left[\sqrt{2\/\pi}\left(x + 0.044715x^{3}\right)\right]\right)$ or $x\sigma\left(1.702x\right),$\r\nbut PyTorch's exact implementation is sufficiently fast such that these approximations may be unnecessary. (See also the [SiLU](https:\/\/paperswithcode.com\/method\/silu) $x\sigma(x)$ which was also coined in the paper that introduced the GELU.)\r\n\r\nGELUs are used in [GPT-3](https:\/\/paperswithcode.com\/method\/gpt-3), [BERT](https:\/\/paperswithcode.com\/method\/bert), and most other Transformers.","10":"**Dense Connections**, or **Fully Connected Connections**, are a type of layer in a deep neural network that use a linear operation where every input is connected to every output by a weight. 
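\r\n\r\nA minimal NumPy sketch of such a layer (shapes are illustrative):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef dense(h_prev, W, g=np.tanh):\r\n    # Every input connects to every output through the weight matrix W.\r\n    return g(W.T @ h_prev)\r\n\r\nW = np.random.randn(64, 32)         # 64 inputs x 32 outputs = 2048 weights\r\nh = dense(np.random.randn(64), W)   # h has shape (32,)\r\n```\r\n\r\n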
This means there are $n\_{\text{inputs}}*n\_{\text{outputs}}$ parameters, which can lead to a lot of parameters for a sizeable network.\r\n\r\n$$h\_{l} = g\left(\textbf{W}^{T}h\_{l-1}\right)$$\r\n\r\nwhere $g$ is an activation function.\r\n\r\nImage Source: Deep Learning by Goodfellow, Bengio and Courville","11":"**Adam** is an adaptive learning rate optimization algorithm that utilises both momentum and scaling, combining the benefits of [RMSProp](https:\/\/paperswithcode.com\/method\/rmsprop) and [SGD with Momentum](https:\/\/paperswithcode.com\/method\/sgd-with-momentum). The optimizer is designed to be appropriate for non-stationary objectives and problems with very noisy and\/or sparse gradients. \r\n\r\nThe weight updates are performed as:\r\n\r\n$$ w_{t} = w_{t-1} - \eta\frac{\hat{m}\_{t}}{\sqrt{\hat{v}\_{t}} + \epsilon} $$\r\n\r\nwith\r\n\r\n$$ \hat{m}\_{t} = \frac{m_{t}}{1-\beta^{t}_{1}} $$\r\n\r\n$$ \hat{v}\_{t} = \frac{v_{t}}{1-\beta^{t}_{2}} $$\r\n\r\n$$ m_{t} = \beta_{1}m_{t-1} + (1-\beta_{1})g_{t} $$\r\n\r\n$$ v_{t} = \beta_{2}v_{t-1} + (1-\beta_{2})g_{t}^{2} $$\r\n\r\n\r\n$ \eta $ is the step size\/learning rate, around 1e-3 in the original paper. $ \epsilon $ is a small number, typically 1e-8 or 1e-10, to prevent dividing by zero. $ \beta_{1} $ and $ \beta_{2} $ are forgetting parameters, with typical values 0.9 and 0.999, respectively.","12":"**WordPiece** is a subword segmentation algorithm used in natural language processing. The vocabulary is initialized with individual characters in the language, then the most frequent combinations of symbols in the vocabulary are iteratively added to the vocabulary. The process is:\r\n\r\n1. Initialize the word unit inventory with all the characters in the text.\r\n2. Build a language model on the training data using the inventory from 1.\r\n3. Generate a new word unit by combining two units out of the current word inventory to increment the word unit inventory by one. Choose the new word unit out of all the possible ones that increases the likelihood on the training data the most when added to the model.\r\n4. Go to 2 until a predefined limit of word units is reached or the likelihood increase falls below a certain threshold.\r\n\r\nText: [Source](https:\/\/stackoverflow.com\/questions\/55382596\/how-is-wordpiece-tokenization-helpful-to-effectively-deal-with-rare-words-proble\/55416944#55416944)\r\n\r\nImage: WordPiece as used in [BERT](https:\/\/paperswithcode.com\/method\/bert)","13":"The **Softmax** output function transforms a previous layer's output into a vector of probabilities. It is commonly used for multiclass classification. Given an input vector $x$ and a weighting vector $w$ we have:\r\n\r\n$$ P(y=j \mid{x}) = \frac{e^{x^{T}w_{j}}}{\sum^{K}_{k=1}e^{x^{T}w_{k}}} $$","14":"**Dropout** is a regularization technique for neural networks that drops a unit (along with connections) at training time with a specified probability $p$ (a common value is $p=0.5$). At test time, all units are present, but with weights scaled by $p$ (i.e. $w$ becomes $pw$).\r\n\r\nThe idea is to prevent co-adaptation, where the neural network becomes too reliant on particular connections, as this could be symptomatic of overfitting. Intuitively, dropout can be thought of as creating an implicit ensemble of neural networks.","15":"**Multi-head Attention** is a module for attention mechanisms which runs through an attention mechanism several times in parallel. 
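\r\n\r\nA minimal NumPy sketch (assuming scaled dot-product attention per head; all weight matrices are illustrative):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef softmax(x):\r\n    e = np.exp(x - x.max(axis=-1, keepdims=True))\r\n    return e \/ e.sum(axis=-1, keepdims=True)\r\n\r\ndef multi_head_attention(Q, K, V, heads, W_o):\r\n    # heads: list of (W_q, W_k, W_v) projection triples, one per head.\r\n    outs = []\r\n    for W_q, W_k, W_v in heads:\r\n        q, k, v = Q @ W_q, K @ W_k, V @ W_v\r\n        scores = softmax(q @ k.T \/ np.sqrt(q.shape[-1]))  # scaled dot-product\r\n        outs.append(scores @ v)\r\n    return np.concatenate(outs, axis=-1) @ W_o  # concatenate, then project\r\n```\r\n\r\n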
The independent attention outputs are then concatenated and linearly transformed into the expected dimension. Intuitively, multiple attention heads allow for attending to parts of the sequence differently (e.g. longer-term dependencies versus shorter-term dependencies). \r\n\r\n$$ \text{MultiHead}\left(\textbf{Q}, \textbf{K}, \textbf{V}\right) = \left[\text{head}\_{1},\dots,\text{head}\_{h}\right]\textbf{W}_{0}$$\r\n\r\n$$\text{where} \text{ head}\_{i} = \text{Attention} \left(\textbf{Q}\textbf{W}\_{i}^{Q}, \textbf{K}\textbf{W}\_{i}^{K}, \textbf{V}\textbf{W}\_{i}^{V} \right) $$\r\n\r\nAbove $\textbf{W}$ are all learnable parameter matrices.\r\n\r\nNote that [scaled dot-product attention](https:\/\/paperswithcode.com\/method\/scaled) is most commonly used in this module, although in principle it can be swapped out for other types of attention mechanism.\r\n\r\nSource: [Lilian Weng](https:\/\/lilianweng.github.io\/lil-log\/2018\/06\/24\/attention-attention.html#a-family-of-attention-mechanisms)","16":"Unlike [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization), **Layer Normalization** directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any new dependencies between training cases. It works well for [RNNs](https:\/\/paperswithcode.com\/methods\/category\/recurrent-neural-networks) and improves both the training time and the generalization performance of several existing RNN models. More recently, it has been used with [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers) models.\r\n\r\nWe compute the layer normalization statistics over all the hidden units in the same layer as follows:\r\n\r\n$$ \mu^{l} = \frac{1}{H}\sum^{H}\_{i=1}a\_{i}^{l} $$\r\n\r\n$$ \sigma^{l} = \sqrt{\frac{1}{H}\sum^{H}\_{i=1}\left(a\_{i}^{l}-\mu^{l}\right)^{2}} $$\r\n\r\nwhere $H$ denotes the number of hidden units in a layer. Under layer normalization, all the hidden units in a layer share the same normalization terms $\mu$ and $\sigma$, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of the mini-batch and it can be used in the pure online regime with batch size 1.","17":"**Scaled dot-product attention** is an attention mechanism where the dot products are scaled down by $\sqrt{d_k}$. Formally we have a query $Q$, a key $K$ and a value $V$ and calculate the attention as:\r\n\r\n$$ {\text{Attention}}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V $$\r\n\r\nIf we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean $0$ and variance $1$, then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_ik_i$, has mean $0$ and variance $d_k$. Since we would prefer these values to have variance $1$, we divide by $\sqrt{d_k}$.","18":"**BERT**, or Bidirectional Encoder Representations from Transformers, improves upon standard [Transformers](http:\/\/paperswithcode.com\/method\/transformer) by removing the unidirectionality constraint by using a *masked language model* (MLM) pre-training objective. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. 
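\r\n\r\nA minimal sketch of the masking step (hypothetical token ids; BERT masks about 15% of tokens, and the paper additionally replaces some selections with random or unchanged tokens rather than the mask token, omitted here for brevity):\r\n\r\n```python\r\nimport random\r\n\r\nMASK_ID = 103  # illustrative id for the mask token\r\n\r\ndef mask_tokens(token_ids, p=0.15):\r\n    # Returns masked inputs and the (position, original id) prediction targets.\r\n    inputs, targets = list(token_ids), []\r\n    for i, t in enumerate(token_ids):\r\n        if random.random() < p:\r\n            inputs[i] = MASK_ID\r\n            targets.append((i, t))\r\n    return inputs, targets\r\n```\r\n\r\n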
Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, BERT uses a *next sentence prediction* task that jointly pre-trains text-pair representations. \r\n\r\nThere are two steps in BERT: *pre-training* and *fine-tuning*. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters.","19":"**Absolute Position Encodings** are a type of position embeddings for [Transformer](https:\/\/paperswithcode.com\/method\/transformer)-based models where positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d\_{model}$ as the embeddings, so that the two can be summed. In the original implementation, sine and cosine functions of different frequencies are used:\r\n\r\n$$ \text{PE}\left(pos, 2i\right) = \sin\left(pos\/10000^{2i\/d\_{model}}\right) $$\r\n\r\n$$ \text{PE}\left(pos, 2i+1\right) = \cos\left(pos\/10000^{2i\/d\_{model}}\right) $$\r\n\r\nwhere $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. This function was chosen because the authors hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $\text{PE}\_{pos+k}$ can be represented as a linear function of $\text{PE}\_{pos}$.\r\n\r\nImage Source: [D2L.ai](https:\/\/d2l.ai\/chapter_attention-mechanisms\/self-attention-and-positional-encoding.html)","20":"**Position-Wise Feed-Forward Layer** is a type of [feedforward layer](https:\/\/www.paperswithcode.com\/method\/category\/feedforwad-networks) consisting of two [dense layers](https:\/\/www.paperswithcode.com\/method\/dense-connections) that apply to the last dimension, which means the same dense layers are used for each position item in the sequence, so-called position-wise.","21":"**Byte Pair Encoding**, or **BPE**, is a subword segmentation algorithm that encodes rare and unknown words as sequences of subword units. The intuition is that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).\r\n\r\n[Lei Mao](https:\/\/leimao.github.io\/blog\/Byte-Pair-Encoding\/) has a detailed blog post that explains how this works.","22":"**Label Smoothing** is a regularization technique that introduces noise for the labels. This accounts for the fact that datasets may have mistakes in them, so maximizing the likelihood of $\log{p}\left(y\mid{x}\right)$ directly can be harmful. Assume for a small constant $\epsilon$, the training set label $y$ is correct with probability $1-\epsilon$ and incorrect otherwise. 
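\r\n\r\nUnder this scheme the hard one-hot targets become soft targets; a minimal NumPy sketch (the eps value is illustrative):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef smooth_targets(y, k, eps=0.1):\r\n    # eps \/ (k - 1) mass on each wrong class, 1 - eps on the true class y.\r\n    t = np.full(k, eps \/ (k - 1))\r\n    t[y] = 1.0 - eps\r\n    return t\r\n\r\nprint(smooth_targets(y=2, k=4))  # approx [0.033 0.033 0.9 0.033]\r\n```\r\n\r\n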
Label Smoothing regularizes a model based on a [softmax](https:\/\/paperswithcode.com\/method\/softmax) with $k$ output values by replacing the hard $0$ and $1$ classification targets with targets of $\frac{\epsilon}{k-1}$ and $1-\epsilon$ respectively.\r\n\r\nSource: Deep Learning, Goodfellow et al\r\n\r\nImage Source: [When Does Label Smoothing Help?](https:\/\/arxiv.org\/abs\/1906.02629)","23":"**Rectified Linear Units**, or **ReLUs**, are a type of activation function that are linear in the positive dimension, but zero in the negative dimension. The kink in the function is the source of the non-linearity. Linearity in the positive dimension has the attractive property that it prevents saturation of gradients (contrast with [sigmoid activations](https:\/\/paperswithcode.com\/method\/sigmoid-activation)), although for half of the real line its gradient is zero.\r\n\r\n$$ f\left(x\right) = \max\left(0, x\right) $$","24":"A **Transformer** is a model architecture that eschews recurrence and instead relies entirely on an [attention mechanism](https:\/\/paperswithcode.com\/methods\/category\/attention-mechanisms-1) to draw global dependencies between input and output. Before Transformers, the dominant sequence transduction models were based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The Transformer also employs an encoder and decoder, but removing recurrence in favor of [attention mechanisms](https:\/\/paperswithcode.com\/methods\/category\/attention-mechanisms-1) allows for significantly more parallelization than methods like [RNNs](https:\/\/paperswithcode.com\/methods\/category\/recurrent-neural-networks) and [CNNs](https:\/\/paperswithcode.com\/methods\/category\/convolutional-neural-networks).","25":"A **convolution** is a type of matrix operation, consisting of a kernel, a small matrix of weights, that slides over input data performing element-wise multiplication with the part of the input it is on, then summing the results into an output.\r\n\r\nIntuitively, a convolution allows for weight sharing - reducing the number of effective parameters - and translation equivariance (allowing for the same feature to be detected in different parts of the input space).\r\n\r\nImage Source: [https:\/\/arxiv.org\/pdf\/1603.07285.pdf](https:\/\/arxiv.org\/pdf\/1603.07285.pdf)","26":"**Dilated Convolutions** are a type of [convolution](https:\/\/paperswithcode.com\/method\/convolution) that \u201cinflate\u201d the kernel by inserting holes between the kernel elements. An additional parameter $l$ (dilation rate) indicates how much the kernel is widened. There are usually $l-1$ spaces inserted between kernel elements. \r\n\r\nNote that the concept has existed in past literature under different names, for instance the *algorithme a trous*, an algorithm for wavelet decomposition (Holschneider et al., 1987; Shensa, 1992).","27":"**Principal Component Analysis (PCA)** is an unsupervised method primarily used for dimensionality reduction within machine learning. PCA is calculated via a singular value decomposition (SVD) of the design matrix, or alternatively, by calculating the covariance matrix of the data and performing eigenvalue decomposition on the covariance matrix. 
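\r\n\r\nA minimal NumPy sketch of the SVD route (center, decompose, project):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef pca(X, n_components):\r\n    # Center the data, then the right singular vectors give the components.\r\n    Xc = X - X.mean(axis=0)\r\n    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)\r\n    return Xc @ Vt[:n_components].T  # projections onto leading components\r\n\r\nZ = pca(np.random.randn(100, 5), n_components=2)  # Z has shape (100, 2)\r\n```\r\n\r\n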
The results of PCA provide a low-dimensional picture of the structure of the data and the leading (uncorrelated) latent factors determining variation in the data.\r\n\r\nImage Source: [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Principal_component_analysis#\/media\/File:GaussianScatterPCA.svg)","28":"A Graph Convolutional Network, or GCN, is an approach for semi-supervised learning on graph-structured data. It is based on an efficient variant of convolutional neural networks which operate directly on graphs.\r\n\r\nImage source: [Semi-Supervised Classification with Graph Convolutional Networks](https:\/\/arxiv.org\/pdf\/1609.02907v4.pdf)","29":"**Tanh Activation** is an activation function used for neural networks:\r\n\r\n$$f\\left(x\\right) = \\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$\r\n\r\nHistorically, the tanh function became preferred over the [sigmoid function](https:\/\/paperswithcode.com\/method\/sigmoid-activation) as it gave better performance for multi-layer neural networks. But it did not solve the vanishing gradient problem that sigmoids suffered, which was tackled more effectively with the introduction of [ReLU](https:\/\/paperswithcode.com\/method\/relu) activations.\r\n\r\nImage Source: [Junxi Feng](https:\/\/www.researchgate.net\/profile\/Junxi_Feng)","30":"**Sigmoid Activations** are a type of activation function for neural networks:\r\n\r\n$$f\\left(x\\right) = \\frac{1}{\\left(1+\\exp\\left(-x\\right)\\right)}$$\r\n\r\nSome drawbacks of this activation that have been noted in the literature are: sharp damp gradients during backpropagation from deeper hidden layers to inputs, gradient saturation, and slow convergence.","31":"An **LSTM** is a type of [recurrent neural network](https:\/\/paperswithcode.com\/methods\/category\/recurrent-neural-networks) that addresses the vanishing gradient problem in vanilla RNNs through additional cells, input and output gates. Intuitively, vanishing gradients are solved through additional *additive* components, and forget gate activations, that allow the gradients to flow through the network without vanishing as quickly.\r\n\r\n(Image Source [here](https:\/\/medium.com\/datadriveninvestor\/how-do-lstm-networks-solve-the-problem-of-vanishing-gradients-a6784971a577))\r\n\r\n(Introduced by Hochreiter and Schmidhuber)","32":"A **Bidirectional LSTM**, or **biLSTM**, is a sequence processing model that consists of two LSTMs: one taking the input in a forward direction, and the other in a backwards direction. BiLSTMs effectively increase the amount of information available to the network, improving the context available to the algorithm (e.g. knowing what words immediately follow *and* precede a word in a sentence).\r\n\r\nImage Source: Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks, Cornegruta et al","33":"**Embeddings from Language Models**, or **ELMo**, is a type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.\r\n\r\nA biLM combines both a forward and backward LM. ELMo jointly maximizes the log likelihood of the forward and backward directions. 
To add ELMo to a supervised model, we freeze the weights of the biLM and then concatenate the ELMo vector $\textbf{ELMO}^{task}_k$ with $\textbf{x}_k$ and pass the ELMo-enhanced representation $[\textbf{x}_k; \textbf{ELMO}^{task}_k]$ into the task RNN. Here $\textbf{x}_k$ is a context-independent token representation for each token position. \r\n\r\nImage Source: [here](https:\/\/medium.com\/@duyanhnguyen_38925\/create-a-strong-text-classification-with-the-help-from-elmo-e90809ba29da)","34":"A **Region Proposal Network**, or **RPN**, is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals. RPN and algorithms like [Fast R-CNN](https:\/\/paperswithcode.com\/method\/fast-r-cnn) can be merged into a single network by sharing their convolutional features - using the recently popular terminology of neural networks with attention mechanisms, the RPN component tells the unified network where to look.\r\n\r\nRPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. RPNs use anchor boxes that serve as references at multiple scales and aspect ratios. The scheme can be thought of as a pyramid of regression references, which avoids enumerating images or filters of multiple scales or aspect ratios.","35":"**Region of Interest Pooling**, or **RoIPool**, is an operation for extracting a small feature map (e.g., $7\u00d77$) from each RoI in detection and segmentation based tasks. Features are extracted from each candidate box and thereafter, in models like [Fast R-CNN](https:\/\/paperswithcode.com\/method\/fast-r-cnn), classified and refined via bounding box regression.\r\n\r\nThe actual scaling to, e.g., $7\u00d77$, occurs by dividing the region proposal into equally sized sections, finding the largest value in each section, and then copying these max values to the output buffer. In essence, **RoIPool** is [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) on a discrete grid based on a box.\r\n\r\nImage Source: [Joyce Xu](https:\/\/towardsdatascience.com\/deep-learning-for-object-detection-a-comprehensive-review-73930816d8d9)","36":"**Faster R-CNN** is an object detection model that improves on [Fast R-CNN](https:\/\/paperswithcode.com\/method\/fast-r-cnn) by utilising a region proposal network ([RPN](https:\/\/paperswithcode.com\/method\/rpn)) with the CNN model. The RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. It is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by [Fast R-CNN](https:\/\/paperswithcode.com\/method\/fast-r-cnn) for detection. RPN and Fast [R-CNN](https:\/\/paperswithcode.com\/method\/r-cnn) are merged into a single network by sharing their convolutional features: the RPN component tells the unified network where to look.\r\n\r\nAs a whole, Faster R-CNN consists of two modules. 
The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions.","37":"A **GAN**, or **Generative Adversarial Network**, is a generative model that simultaneously trains two models: a generative model $G$ that captures the data distribution, and a discriminative model $D$ that estimates the probability that a sample came from the training data rather than $G$.\r\n\r\nThe training procedure for $G$ is to maximize the probability of $D$ making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions $G$ and $D$, a unique solution exists, with $G$ recovering the training data distribution and $D$ equal to $\frac{1}{2}$ everywhere. In the case where $G$ and $D$ are defined by multilayer perceptrons, the entire system can be trained with backpropagation. \r\n\r\n(Image Source: [here](http:\/\/www.kdnuggets.com\/2017\/01\/generative-adversarial-networks-hot-topic-machine-learning.html))","38":"A **Concatenated Skip Connection** is a type of skip connection that seeks to reuse features by concatenating them to new layers, allowing more information to be retained from previous layers of the network. This contrasts with, say, residual connections, where element-wise summation is used instead to incorporate information from previous layers. This type of skip connection is prominently used in DenseNets (and also Inception networks), which the Figure to the right illustrates.","39":"**Max Pooling** is a pooling operation that calculates the maximum value for patches of a feature map, and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer. It adds a small amount of translation invariance - meaning translating the image by a small amount does not significantly affect the values of most pooled outputs.\r\n\r\nImage Source: [here](https:\/\/computersciencewiki.org\/index.php\/File:MaxpoolSample2.png)","40":"**U-Net** is an architecture for semantic segmentation. It consists of a contracting path and an expansive path. The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit ([ReLU](https:\/\/paperswithcode.com\/method\/relu)) and a 2x2 [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels. Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 [convolution](https:\/\/paperswithcode.com\/method\/convolution) (\u201cup-convolution\u201d) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer a [1x1 convolution](https:\/\/paperswithcode.com\/method\/1x1-convolution) is used to map each 64-component feature vector to the desired number of classes. 
In total the network has 23 convolutional layers.\r\n\r\n[Original MATLAB Code](https:\/\/lmb.informatik.uni-freiburg.de\/people\/ronneber\/u-net\/u-net-release-2015-10-02.tar.gz)","41":"**Interpretability** methods aim to explain a model's predictions in terms understandable to humans, for example by attributing a prediction to the input features that most influenced it.","42":"**CurricularFace**, or **Adaptive Curriculum Learning**, is a method for face recognition that embeds the idea of curriculum learning into the loss function to achieve a new training scheme. This training scheme mainly addresses easy samples in the early training stage and hard ones in the later stage. Specifically, CurricularFace adaptively adjusts the relative importance of easy and hard samples during different training stages.","43":"**Weight Normalization** is a normalization method for training neural networks. It is inspired by [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization), but it is a deterministic method that does not share batch normalization's property of adding noise to the gradients. It reparameterizes each weight vector $\textbf{w}$ in terms of a parameter vector $\textbf{v}$ and a scalar parameter $g$, and performs stochastic gradient descent with respect to those parameters instead. Weight vectors are expressed in terms of the new parameters using:\r\n\r\n$$ \textbf{w} = \frac{g}{\Vert\textbf{v}\Vert}\textbf{v}$$\r\n\r\nwhere $\textbf{v}$ is a $k$-dimensional vector, $g$ is a scalar, and $\Vert\textbf{v}\Vert$ denotes the Euclidean norm of $\textbf{v}$. This reparameterization has the effect of fixing the Euclidean norm of the weight vector $\textbf{w}$: we now have $\Vert\textbf{w}\Vert = g$, independent of the parameters $\textbf{v}$.","44":"**$L_{1}$ Regularization** is a regularization technique applied to the weights of a neural network. We minimize a loss function comprising both the primary loss function and a penalty on the $L\_{1}$ Norm of the weights:\r\n\r\n$$L\_{new}\left(w\right) = L\_{original}\left(w\right) + \lambda{||w||}\_{1}$$\r\n\r\nwhere $\lambda$ is a value determining the strength of the penalty. In contrast to [weight decay](https:\/\/paperswithcode.com\/method\/weight-decay), $L_{1}$ regularization promotes sparsity; i.e. some parameters have an optimal value of zero.\r\n\r\nImage Source: [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Regularization_(mathematics)#\/media\/File:Sparsityl1.png)","45":"**Softsign** is an activation function for neural networks:\r\n\r\n$$ f\left(x\right) = \left(\frac{x}{|x|+1}\right)$$\r\n\r\nImage Source: [Sefik Ilkin Serengil](https:\/\/sefiks.com\/2017\/11\/10\/softsign-as-a-neural-networks-activation-function\/)","46":"**Leaky Rectified Linear Unit**, or **Leaky ReLU**, is a type of activation function based on a [ReLU](https:\/\/paperswithcode.com\/method\/relu), but it has a small slope for negative values instead of a flat slope. The slope coefficient is determined before training, i.e. it is not learnt during training. This type of activation function is popular in tasks where we may suffer from sparse gradients, for example training generative adversarial networks.","47":"A **Gated Linear Unit**, or **GLU**, computes:\r\n\r\n$$ \text{GLU}\left(a, b\right) = a\otimes \sigma\left(b\right) $$\r\n\r\nIt is used in natural language processing architectures, for example the [Gated CNN](https:\/\/paperswithcode.com\/method\/gated-convolution-network), because here $b$ is the gate that controls what information from $a$ is passed up to the following layer. 
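\r\n\r\nA minimal NumPy sketch (sigmoid gate applied element-wise):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef glu(x):\r\n    # Split the input in half: values a, gate b; output a * sigmoid(b).\r\n    a, b = np.split(x, 2, axis=-1)\r\n    return a * (1.0 \/ (1.0 + np.exp(-b)))\r\n\r\ny = glu(np.random.randn(10, 128))  # y has shape (10, 64)\r\n```\r\n\r\n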
Intuitively, for a language modeling task, the gating mechanism allows selection of words or features that are important for predicting the next word. The GLU also has non-linear capabilities, but has a linear path for the gradient so it diminishes the vanishing gradient problem.","48":"**Normalizing Flows** are a method for constructing complex distributions by transforming a probability density through a series of invertible mappings. By repeatedly applying the rule for change of variables, the initial density \u2018flows\u2019 through the sequence of invertible mappings. At the end of this sequence we obtain a valid probability distribution and hence this type of flow is referred to as a normalizing flow.\r\n\r\nIn the case of finite flows, the basic rule for the transformation of densities considers an invertible, smooth mapping $f : \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}$ with inverse $f^{-1} = g$, i.e. the composition $g \circ f\left(z\right) = z$. If we use this mapping to transform a random variable $z$ with distribution $q\left(z\right)$, the resulting random variable $z' = f\left(z\right)$ has a distribution:\r\n\r\n$$ q\left(\mathbf{z}'\right) = q\left(\mathbf{z}\right)\bigl\vert{\text{det}}\frac{\delta{f}^{-1}}{\delta{\mathbf{z'}}}\bigr\vert = q\left(\mathbf{z}\right)\bigl\vert{\text{det}}\frac{\delta{f}}{\delta{\mathbf{z}}}\bigr\vert ^{-1} $$\r\n\r\nwhere the last equality can be seen by applying the chain rule (inverse function theorem) and is a property of Jacobians of invertible functions. We can construct arbitrarily complex densities by composing several simple maps and successively applying the above equation. The density $q\_{K}\left(\mathbf{z}\right)$ obtained by successively transforming a random variable $z\_{0}$ with distribution $q\_{0}$ through a chain of $K$ transformations $f\_{k}$ is:\r\n\r\n$$ z\_{K} = f\_{K} \circ \dots \circ f\_{2} \circ f\_{1}\left(z\_{0}\right) $$\r\n\r\n$$ \ln{q}\_{K}\left(z\_{K}\right) = \ln{q}\_{0}\left(z\_{0}\right) - \sum^{K}\_{k=1}\ln\vert\det\frac{\delta{f\_{k}}}{\delta{\mathbf{z\_{k-1}}}}\vert $$\r\n\r\nThe path traversed by the random variables $z\_{k} = f\_{k}\left(z\_{k-1}\right)$ with initial distribution $q\_{0}\left(z\_{0}\right)$ is called the flow and the path formed by the successive distributions $q\_{k}$ is a normalizing flow.","49":"**DV3 Attention Block** is an attention-based module used in the [Deep Voice 3](https:\/\/paperswithcode.com\/method\/deep-voice-3) architecture. It uses a [dot-product attention](https:\/\/paperswithcode.com\/method\/dot-product-attention) mechanism. A query vector (the hidden states of the decoder) and the per-timestep key vectors from the encoder are used to compute attention weights. This then outputs a context vector computed as the weighted average of the value vectors.","50":"**DV3 Convolution Block** is a convolutional block used for the [Deep Voice 3](https:\/\/paperswithcode.com\/method\/deep-voice-3) text-to-speech architecture. It consists of a 1-D [convolution](https:\/\/paperswithcode.com\/method\/convolution) with a gated linear unit and a [residual connection](https:\/\/paperswithcode.com\/method\/residual-connection). In the Figure, $c$ denotes the dimensionality of the input. The convolution output of size $2 \cdot c$ is split into equal-sized portions: the gate vector and the input vector. 
A scaling factor $\\sqrt{0.5}$ is used to ensure that we preserve the input variance early in training. The gated linear unit provides a linear path for the gradient flow, which alleviates the vanishing gradient issue for stacked convolution blocks while retaining non-linearity. To introduce speaker-dependent control, a speaker-dependent embedding is added as a bias to the convolution filter output, after a softsign function. The authors use the softsign nonlinearity because it limits the range of the output while also avoiding the saturation problem that exponential based nonlinearities sometimes exhibit. Convolution filter weights are initialized with zero-mean and unit-variance activations throughout the entire network.","51":"**Bridge-net** is an audio model block used in the [ClariNet](https:\/\/paperswithcode.com\/method\/clarinet) text-to-speech architecture. Bridge-net maps frame-level hidden representation to sample-level through several [convolution](https:\/\/paperswithcode.com\/method\/convolution) blocks and [transposed convolution](https:\/\/paperswithcode.com\/method\/transposed-convolution) layers interleaved with softsign non-linearities.","52":"**ClariNet** is an end-to-end text-to-speech architecture. Unlike previous TTS systems which use text-to-spectogram models with a separate waveform [synthesizer](https:\/\/paperswithcode.com\/method\/synthesizer) (vocoder), ClariNet is a text-to-wave architecture that is fully convolutional and can be trained from scratch. In ClariNet, the [WaveNet](https:\/\/paperswithcode.com\/method\/wavenet) module is conditioned on the hidden states instead of the mel-spectogram. The architecture is otherwise based on [Deep Voice 3](https:\/\/paperswithcode.com\/method\/deep-voice-3).","53":"**Mixture of Logistic Distributions (MoL)** is a type of output function, and an alternative to a [softmax](https:\/\/paperswithcode.com\/method\/softmax) layer. Discretized logistic mixture likelihood is used in [PixelCNN](https:\/\/paperswithcode.com\/method\/pixelcnn)++ and [WaveNet](https:\/\/paperswithcode.com\/method\/wavenet) to predict discrete values.\r\n\r\nImage Credit: [Hao Gao](https:\/\/medium.com\/@smallfishbigsea\/an-explanation-of-discretized-logistic-mixture-likelihood-bdfe531751f0)","54":"A **Dilated Causal Convolution** is a [causal convolution](https:\/\/paperswithcode.com\/method\/causal-convolution) where the filter is applied over an area larger than its length by skipping input values with a certain step. A dilated causal [convolution](https:\/\/paperswithcode.com\/method\/convolution) effectively allows the network to have very large receptive fields with just a few layers.","55":"**WaveNet** is an audio generative model based on the [PixelCNN](https:\/\/paperswithcode.com\/method\/pixelcnn) architecture. 
To deal with the long-range temporal dependencies needed for raw audio generation, the architecture stacks such dilated causal convolutions, which exhibit very large receptive fields.\r\n\r\nThe joint probability of a waveform $\\vec{x} = \\{ x_1, \\dots, x_T \\}$ is factorised as a product of conditional probabilities as follows:\r\n\r\n$$p\\left(\\vec{x}\\right) = \\prod_{t=1}^{T} p\\left(x_t \\mid x_1, \\dots ,x_{t-1}\\right)$$\r\n\r\nEach audio sample $x_t$ is therefore conditioned on the samples at all previous timesteps.","56":"**Dot-Product Attention** is an attention mechanism where the alignment score function is calculated as: \r\n\r\n$$f_{att}\\left(\\textbf{h}_{i}, \\textbf{s}\\_{j}\\right) = h\\_{i}^{T}s\\_{j}$$\r\n\r\nIt is equivalent to [multiplicative attention](https:\/\/paperswithcode.com\/method\/multiplicative-attention) (without a trainable weight matrix, assuming this is instead an identity matrix). Here $\\textbf{h}$ refers to the hidden states for the encoder, and $\\textbf{s}$ is the hidden states for the decoder. The function above is thus a type of alignment score function. \r\n\r\nWithin a neural network, once we have the alignment scores, we calculate the final scores\/weights using a [softmax](https:\/\/paperswithcode.com\/method\/softmax) function of these alignment scores (ensuring it sums to 1).","57":"A **Graph Convolutional Network**, or **GCN**, is an approach for semi-supervised learning on graph-structured data. It is based on an efficient variant of [convolutional neural networks](https:\/\/paperswithcode.com\/methods\/category\/convolutional-neural-networks) which operate directly on graphs. The choice of convolutional architecture is motivated via a localized first-order approximation of spectral graph convolutions. The model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes.","58":"**Batch Normalization** aims to reduce internal covariate shift, and in doing so aims to accelerate the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows for use of much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for [Dropout](https:\/\/paperswithcode.com\/method\/dropout).\r\n\r\nWe apply a batch normalization layer as follows for a minibatch $\\mathcal{B}$:\r\n\r\n$$ \\mu\\_{\\mathcal{B}} = \\frac{1}{m}\\sum^{m}\\_{i=1}x\\_{i} $$\r\n\r\n$$ \\sigma^{2}\\_{\\mathcal{B}} = \\frac{1}{m}\\sum^{m}\\_{i=1}\\left(x\\_{i}-\\mu\\_{\\mathcal{B}}\\right)^{2} $$\r\n\r\n$$ \\hat{x}\\_{i} = \\frac{x\\_{i} - \\mu\\_{\\mathcal{B}}}{\\sqrt{\\sigma^{2}\\_{\\mathcal{B}}+\\epsilon}} $$\r\n\r\n$$ y\\_{i} = \\gamma\\hat{x}\\_{i} + \\beta = \\text{BN}\\_{\\gamma, \\beta}\\left(x\\_{i}\\right) $$\r\n\r\nwhere $\\gamma$ and $\\beta$ are learnable parameters.","59":"**TuckER** is a linear model for link prediction on knowledge graphs, based on the Tucker decomposition of the binary tensor of known facts.","60":"**Average Pooling** is a pooling operation that calculates the average value for patches of a feature map, and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer. 
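\r\n\r\nA minimal NumPy sketch of non-overlapping average pooling (illustrative; assumes the spatial dimensions are divisible by the pool size $k$):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef average_pool_2d(x, k=2):\r\n    # Each output value is the mean of a k x k patch of the input feature map.\r\n    h, w = x.shape\r\n    return x.reshape(h \/\/ k, k, w \/\/ k, k).mean(axis=(1, 3))\r\n\r\nfmap = np.arange(16.0).reshape(4, 4)\r\nprint(average_pool_2d(fmap))  # [[ 2.5  4.5]\r\n                              #  [10.5 12.5]]\r\n```\r\n\r\n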
It adds a small amount of translation invariance - meaning translating the image by a small amount does not significantly affect the values of most pooled outputs. It extracts features more smoothly than [Max Pooling](https:\/\/paperswithcode.com\/method\/max-pooling), whereas max pooling extracts more pronounced features like edges.\r\n\r\nImage Source: [here](https:\/\/www.researchgate.net\/figure\/Illustration-of-Max-Pooling-and-Average-Pooling-Figure-2-above-shows-an-example-of-max_fig2_333593451)","61":"A **1 x 1 Convolution** is a [convolution](https:\/\/paperswithcode.com\/method\/convolution) with some special properties in that it can be used for dimensionality reduction, efficient low dimensional embeddings, and applying non-linearity after convolutions. It maps an input pixel with all its channels to an output pixel which can be squeezed to a desired output depth. It can be viewed as an [MLP](https:\/\/paperswithcode.com\/method\/feedforward-network) looking at a particular pixel location.\r\n\r\nImage Credit: [http:\/\/deeplearning.ai](http:\/\/deeplearning.ai)","62":"A **Bottleneck Residual Block** is a variant of the [residual block](https:\/\/paperswithcode.com\/method\/residual-block) that utilises 1x1 convolutions to create a bottleneck. The use of a bottleneck reduces the number of parameters and matrix multiplications. The idea is to make residual blocks as thin as possible to increase depth while using fewer parameters. They were introduced as part of the [ResNet](https:\/\/paperswithcode.com\/method\/resnet) architecture, and are used as part of deeper ResNets such as ResNet-50 and ResNet-101.","63":"**Global Average Pooling** is a pooling operation designed to replace fully connected layers in classical CNNs. The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the [softmax](https:\/\/paperswithcode.com\/method\/softmax) layer. \r\n\r\nOne advantage of global [average pooling](https:\/\/paperswithcode.com\/method\/average-pooling) over the fully connected layers is that it is more native to the [convolution](https:\/\/paperswithcode.com\/method\/convolution) structure by enforcing correspondences between feature maps and categories. Thus the feature maps can be easily interpreted as categories confidence maps. Another advantage is that there is no parameter to optimize in the global average pooling thus overfitting is avoided at this layer. Furthermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input.","64":"**Residual Blocks** are skip-connection blocks that learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. They were introduced as part of the [ResNet](https:\/\/paperswithcode.com\/method\/resnet) architecture.\r\n \r\nFormally, denoting the desired underlying mapping as $\\mathcal{H}({x})$, we let the stacked nonlinear layers fit another mapping of $\\mathcal{F}({x}):=\\mathcal{H}({x})-{x}$. The original mapping is recast into $\\mathcal{F}({x})+{x}$. The $\\mathcal{F}({x})$ acts like a residual, hence the name 'residual block'.\r\n\r\nThe intuition is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. 
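\r\n\r\nA minimal sketch of the forward computation (illustrative; real ResNet blocks also include batch normalization, omitted here for brevity):\r\n\r\n```python\r\nimport torch.nn as nn\r\n\r\nclass ResidualBlock(nn.Module):\r\n    # Two 3x3 convolutions form F(x); the skip connection adds x back.\r\n    def __init__(self, channels):\r\n        super().__init__()\r\n        self.f = nn.Sequential(\r\n            nn.Conv2d(channels, channels, 3, padding=1),\r\n            nn.ReLU(),\r\n            nn.Conv2d(channels, channels, 3, padding=1),\r\n        )\r\n        self.relu = nn.ReLU()\r\n\r\n    def forward(self, x):\r\n        return self.relu(self.f(x) + x)  # F(x) + x\r\n```\r\n\r\n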
To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers. Having skip connections allows the network to more easily learn identity-like mappings.\r\n\r\nNote that in practice, [Bottleneck Residual Blocks](https:\/\/paperswithcode.com\/method\/bottleneck-residual-block) are used for deeper ResNets, such as ResNet-50 and ResNet-101, as these bottleneck blocks are less computationally intensive.","65":"**Kaiming Initialization**, or **He Initialization**, is an initialization method for neural networks that takes into account the non-linearity of activation functions, such as [ReLU](https:\/\/paperswithcode.com\/method\/relu) activations.\r\n\r\nA proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially. Using a derivation they work out that the condition to stop this happening is:\r\n\r\n$$\\frac{1}{2}n\\_{l}\\text{Var}\\left[w\\_{l}\\right] = 1 $$\r\n\r\nThis implies an initialization scheme of:\r\n\r\n$$ w\\_{l} \\sim \\mathcal{N}\\left(0, 2\/n\\_{l}\\right)$$\r\n\r\nThat is, a zero-centered Gaussian with standard deviation of $\\sqrt{2\/{n}\\_{l}}$ (variance shown in equation above). Biases are initialized at $0$.","66":"**Residual Networks**, or **ResNets**, learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. Instead of hoping each few stacked layers directly fit a desired underlying mapping, residual nets let these layers fit a residual mapping. They stack [residual blocks](https:\/\/paperswithcode.com\/method\/residual-block) on top of each other to form a network: e.g. a ResNet-50 has fifty layers using these blocks. \r\n\r\nFormally, denoting the desired underlying mapping as $\\mathcal{H}(x)$, we let the stacked nonlinear layers fit another mapping of $\\mathcal{F}(x):=\\mathcal{H}(x)-x$. The original mapping is recast into $\\mathcal{F}(x)+x$.\r\n\r\nThere is empirical evidence that these types of network are easier to optimize, and can gain accuracy from considerably increased depth.","67":"**Q-Learning** is an off-policy temporal difference control algorithm:\r\n\r\n$$Q\\left(S\\_{t}, A\\_{t}\\right) \\leftarrow Q\\left(S\\_{t}, A\\_{t}\\right) + \\alpha\\left[R_{t+1} + \\gamma\\max\\_{a}Q\\left(S\\_{t+1}, a\\right) - Q\\left(S\\_{t}, A\\_{t}\\right)\\right] $$\r\n\r\nThe learned action-value function $Q$ directly approximates $q\\_{*}$, the optimal action-value function, independent of the policy being followed.\r\n\r\nSource: Sutton and Barto, Reinforcement Learning, 2nd Edition","68":"**1D Convolutional Neural Networks** are similar to the well-known and more established 2D Convolutional Neural Networks. They are mainly used on text and 1D signals.","69":"**XLM** is a [Transformer](https:\/\/paperswithcode.com\/method\/transformer) based architecture that is pre-trained using one of three language modelling objectives:\r\n\r\n1. Causal Language Modeling - models the probability of a word given the previous words in a sentence.\r\n2. Masked Language Modeling - the masked language modeling objective of [BERT](https:\/\/paperswithcode.com\/method\/bert).\r\n3. 
Translation Language Modeling - a (new) translation language modeling objective for improving cross-lingual pre-training.\r\n\r\nThe authors find that both the CLM and MLM approaches provide strong cross-lingual features that can be used for pretraining models.","70":"**Cosine Annealing** is a type of learning rate schedule that has the effect of starting with a large learning rate that is relatively rapidly decreased to a minimum value before being increased rapidly again. The resetting of the learning rate acts like a simulated restart of the learning process and the re-use of good weights as the starting point of the restart is referred to as a \"warm restart\" in contrast to a \"cold restart\" where a new set of small random numbers may be used as a starting point.\r\n\r\n$$\\eta\\_{t} = \\eta\\_{min}^{i} + \\frac{1}{2}\\left(\\eta\\_{max}^{i}-\\eta\\_{min}^{i}\\right)\\left(1+\\cos\\left(\\frac{T\\_{cur}}{T\\_{i}}\\pi\\right)\\right)\r\n$$\r\n\r\nwhere $\\eta\\_{min}^{i}$ and $\\eta\\_{max}^{i}$ are ranges for the learning rate, and $T\\_{cur}$ accounts for how many epochs have been performed since the last restart.\r\n\r\nText Source: [Jason Brownlee](https:\/\/machinelearningmastery.com\/snapshot-ensemble-deep-learning-neural-network\/)\r\n\r\nImage Source: [Gao Huang](https:\/\/www.researchgate.net\/figure\/Training-loss-of-100-layer-DenseNet-on-CIFAR10-using-standard-learning-rate-blue-and-M_fig2_315765130)","71":"**Strided Attention** is a factorized attention pattern that has one head attend to the previous $l$ locations, and the other head attend to every $l$th location, where $l$ is the stride and chosen to be close to $\\sqrt{n}$. It was proposed as part of the [Sparse Transformer](https:\/\/paperswithcode.com\/method\/sparse-transformer) architecture.\r\n\r\nA self-attention layer maps a matrix of input embeddings $X$ to an output matrix and is parameterized by a connectivity pattern $S = \\text{set}\\left(S\\_{1}, \\dots, S\\_{n}\\right)$, where $S\\_{i}$ denotes the set of indices of the input vectors to which the $i$th output vector attends. The output vector is a weighted sum of transformations of the input vectors:\r\n\r\n$$ \\text{Attend}\\left(X, S\\right) = \\left(a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right)\\right)\\_{i\\in\\text{set}\\left(1,\\dots,n\\right)}$$\r\n\r\n$$ a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right) = \\text{softmax}\\left(\\frac{\\left(W\\_{q}\\mathbf{x}\\_{i}\\right)K^{T}\\_{S\\_{i}}}{\\sqrt{d}}\\right)V\\_{S\\_{i}} $$\r\n\r\n$$ K\\_{Si} = \\left(W\\_{k}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\n$$ V\\_{Si} = \\left(W\\_{v}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\nHere $W\\_{q}$, $W\\_{k}$, and $W\\_{v}$ represent the weight matrices which transform a given $x\\_{i}$ into a query, key, or value, and $d$ is the inner dimension of the queries and keys. The output at each position is a sum of the values weighted by the scaled dot-product similarity of the keys and queries.\r\n\r\nFull self-attention for autoregressive models defines $S\\_{i} = \\text{set}\\left(j : j \\leq i\\right)$, allowing every element to attend to all previous positions and its own position.\r\n\r\nFactorized self-attention instead has $p$ separate attention heads, where the $m$th head defines a subset of the indices $A\\_{i}^{(m)} \u2282 \\text{set}\\left(j : j \\leq i\\right)$ and lets $S\\_{i} = A\\_{i}^{(m)}$. 
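\r\n\r\nA dense NumPy sketch of the $\\text{Attend}\\left(X, S\\right)$ equation, with the connectivity pattern expressed as a boolean mask (illustrative only; a real sparse implementation would never materialize the full $n \\times n$ score matrix):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef attend(X, mask, Wq, Wk, Wv):\r\n    # mask[i, j] is True iff j is in S_i (each row needs at least one True).\r\n    Q, K, V = X @ Wq, X @ Wk, X @ Wv\r\n    scores = Q @ K.T \/ np.sqrt(Q.shape[-1])\r\n    scores = np.where(mask, scores, -np.inf)  # attend only within S_i\r\n    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))\r\n    weights \/= weights.sum(axis=-1, keepdims=True)\r\n    return weights @ V\r\n```\r\n\r\n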
The goal with the Sparse [Transformer](https:\/\/paperswithcode.com\/method\/transformer) was to find efficient choices for the subset $A$.\r\n\r\nFormally for Strided Attention, $A^{(1)}\\_{i} = ${$t, t + 1, ..., i$} for $t = \\max\\left(0, i \u2212 l\\right)$, and $A^{(2)}\\_{i} = ${$j : (i \u2212 j) \\mod l = 0$}. The $i$-th output vector of the attention head attends to all input vectors either from $A^{(1)}\\_{i}$ or $A^{(2)}\\_{i}$. This pattern can be visualized in the figure to the right.\r\n\r\nThis formulation is convenient if the data naturally has a structure that aligns with the stride, like images or some types of music. For data without a periodic structure, like text, however, the authors find that the network can fail to properly route information with the strided pattern, as spatial coordinates for an element do not necessarily correlate with the positions where the element may be most relevant in the future.","72":"**Linear Warmup With Cosine Annealing** is a learning rate schedule where we increase the learning rate linearly for $n$ updates and then anneal according to a cosine schedule afterwards.","73":"**Fixed Factorized Attention** is a factorized attention pattern where specific cells summarize previous locations and propagate that information to all future cells. It was proposed as part of the [Sparse Transformer](https:\/\/paperswithcode.com\/method\/sparse-transformer) architecture.\r\n\r\n\r\nA self-attention layer maps a matrix of input embeddings $X$ to an output matrix and is parameterized by a connectivity pattern $S = \\text{set}\\left(S\\_{1}, \\dots, S\\_{n}\\right)$, where $S\\_{i}$ denotes the set of indices of the input vectors to which the $i$th output vector attends. The output vector is a weighted sum of transformations of the input vectors:\r\n\r\n$$ \\text{Attend}\\left(X, S\\right) = \\left(a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right)\\right)\\_{i\\in\\text{set}\\left(1,\\dots,n\\right)}$$\r\n\r\n$$ a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right) = \\text{softmax}\\left(\\frac{\\left(W\\_{q}\\mathbf{x}\\_{i}\\right)K^{T}\\_{S\\_{i}}}{\\sqrt{d}}\\right)V\\_{S\\_{i}} $$\r\n\r\n$$ K\\_{Si} = \\left(W\\_{k}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\n$$ V\\_{Si} = \\left(W\\_{v}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\nHere $W\\_{q}$, $W\\_{k}$, and $W\\_{v}$ represent the weight matrices which transform a given $x\\_{i}$ into a query, key, or value, and $d$ is the inner dimension of the queries and keys. The output at each position is a sum of the values weighted by the scaled dot-product similarity of the keys and queries.\r\n\r\nFull self-attention for autoregressive models defines $S\\_{i} = \\text{set}\\left(j : j \\leq i\\right)$, allowing every element to attend to all previous positions and its own position.\r\n\r\nFactorized self-attention instead has $p$ separate attention heads, where the $m$th head defines a subset of the indices $A\\_{i}^{(m)} \u2282 \\text{set}\\left(j : j \\leq i\\right)$ and lets $S\\_{i} = A\\_{i}^{(m)}$. The goal with the Sparse [Transformer](https:\/\/paperswithcode.com\/method\/transformer) was to find efficient choices for the subset $A$.\r\n\r\nFormally for Fixed Factorized Attention, $A^{(1)}\\_{i} = ${$j : \\left(\\lfloor{j\/l\\rfloor}=\\lfloor{i\/l\\rfloor}\\right)$}, where the brackets denote the floor operation, and $A^{(2)}\\_{i} = ${$j : j \\mod l \\in ${$t, t+1, \\ldots, l$}}, where $t=l-c$ and $c$ is a hyperparameter. 
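\r\n\r\nA small sketch that materializes these two index sets as boolean masks, using 0-indexed positions (illustrative; `n` is the sequence length, `l` the stride\/block length, `c` the hyperparameter above):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef fixed_factorized_masks(n, l, c):\r\n    i, j = np.indices((n, n))\r\n    causal = j <= i\r\n    head1 = causal & (j \/\/ l == i \/\/ l)  # same length-l block as position i\r\n    head2 = causal & (j % l >= l - c)    # the last c positions of each block\r\n    return head1, head2\r\n```\r\n\r\n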
The $i$-th output vector of the attention head attends to all input vectors either from $A^{(1)}\\_{i}$ or $A^{(2)}\\_{i}$. This pattern can be visualized in the figure to the right.\r\n\r\nIf the stride is 128 and $c = 8$, then all future positions greater than 128 can attend to positions 120-128, all positions greater than 256 can attend to 248-256, and so forth. \r\n\r\nA fixed-attention pattern with $c = 1$ limits the expressivity of the network significantly, as many representations in the network are only used for one block whereas a small number of locations are used by all blocks. The authors found choosing $c \\in ${$8, 16, 32$} for typical values of $l \\in ${$128, 256$} performs well, although this increases the computational cost of this method by $c$ in comparison to the [strided attention](https:\/\/paperswithcode.com\/method\/strided-attention).\r\n\r\nAdditionally, the authors found that when using multiple heads, having them attend to distinct subblocks of length $c$ within the block of size $l$ was preferable to having them attend to the same subblock.","74":"**GPT-3** is an autoregressive [transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers) model with 175 billion parameters. It uses the same architecture\/model as [GPT-2](https:\/\/paperswithcode.com\/method\/gpt-2), including the modified initialization, pre-normalization, and reversible tokenization, with the exception that GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the [transformer](https:\/\/paperswithcode.com\/method\/transformer), similar to the [Sparse Transformer](https:\/\/paperswithcode.com\/method\/sparse-transformer).","75":"We propose to theoretically and empirically examine the effect of incorporating weighting schemes into walk-aggregating GNNs. To this end, we propose a simple, interpretable, and end-to-end supervised GNN model, called AWARE (Attentive Walk-Aggregating GRaph Neural NEtwork), for graph-level prediction. AWARE aggregates the walk information by means of weighting schemes at distinct levels (vertex-, walk-, and graph-level) in a principled manner. By virtue of the incorporated weighting schemes at these different levels, AWARE can emphasize the information important for prediction while diminishing the irrelevant ones\u2014leading to representations that can improve learning performance.","76":"**Experience Replay** is a replay memory technique used in reinforcement learning where we store the agent\u2019s experiences at each time-step, $e\\_{t} = \\left(s\\_{t}, a\\_{t}, r\\_{t}, s\\_{t+1}\\right)$ in a data-set $D = e\\_{1}, \\cdots, e\\_{N}$ , pooled over many episodes into a replay memory. We then usually sample the memory randomly for a minibatch of experience, and use this to learn off-policy, as with Deep Q-Networks. This tackles the problem of autocorrelation leading to unstable training, by making the problem more like a supervised learning problem.\r\n\r\nImage Credit: [Hands-On Reinforcement Learning with Python, Sudharsan Ravichandiran](https:\/\/subscription.packtpub.com\/book\/big_data_and_business_intelligence\/9781788836524)","77":"**Entropy Regularization** is a type of regularization used in [reinforcement learning](https:\/\/paperswithcode.com\/methods\/area\/reinforcement-learning). 
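\r\n\r\nIn practice it is usually implemented by subtracting a weighted entropy bonus from the policy loss; a minimal sketch for a categorical policy (names and the weight `beta` are illustrative, not from the original description):\r\n\r\n```python\r\nimport torch\r\n\r\ndef policy_loss_with_entropy(logits, log_prob_taken, advantage, beta=0.01):\r\n    # H(pi) = -sum_a pi(a|s) * log pi(a|s), computed from the action logits\r\n    log_probs = torch.log_softmax(logits, dim=-1)\r\n    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)\r\n    pg_loss = -(log_prob_taken * advantage)   # standard policy-gradient term\r\n    return (pg_loss - beta * entropy).mean()  # the bonus promotes action diversity\r\n```\r\n\r\n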
For on-policy policy gradient based methods like [A3C](https:\/\/paperswithcode.com\/method\/a3c), the mutual reinforcement between actor and critic leads to a highly-peaked $\\pi\\left(a\\mid{s}\\right)$ towards a few actions or action sequences, since it is easier for the actor and critic to overoptimise to a small portion of the environment. To reduce this problem, entropy regularization adds an entropy term to the loss to promote action diversity:\r\n\r\n$$H(X) = -\\sum\\pi\\left(x\\right)\\log\\left(\\pi\\left(x\\right)\\right) $$\r\n\r\nImage Credit: Wikipedia","78":"**Soft Actor Critic**, or **SAC**, is an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as [Q-learning methods](https:\/\/paperswithcode.com\/method\/q-learning). [SAC](https:\/\/paperswithcode.com\/method\/sac) combines off-policy updates with a stable stochastic actor-critic formulation.\r\n\r\nThe SAC objective has a number of advantages. First, the policy is incentivized to explore more widely, while giving up on clearly unpromising avenues. Second, the policy can capture multiple modes of near-optimal behavior. In problem settings where multiple actions seem equally attractive, the policy will commit equal probability mass to those actions. Lastly, the authors present evidence that it improves learning speed over state-of-art methods that optimize the conventional RL objective function.","79":"**Proximal Policy Optimization**, or **PPO**, is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of [TRPO](https:\/\/paperswithcode.com\/method\/trpo), while using only first-order optimization. \r\n\r\nLet $r\\_{t}\\left(\\theta\\right)$ denote the probability ratio $r\\_{t}\\left(\\theta\\right) = \\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}$, so $r\\_{t}\\left(\\theta\\_{old}\\right) = 1$. TRPO maximizes a \u201csurrogate\u201d objective:\r\n\r\n$$ L^{\\text{CPI}}\\left({\\theta}\\right) = \\hat{\\mathbb{E}}\\_{t}\\left[\\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}\\hat{A}\\_{t}\\right] = \\hat{\\mathbb{E}}\\_{t}\\left[r\\_{t}\\left(\\theta\\right)\\hat{A}\\_{t}\\right] $$\r\n\r\nwhere $CPI$ refers to conservative policy iteration. Without a constraint, maximization of $L^{CPI}$ would lead to an excessively large policy update; hence, PPO modifies the objective to penalize changes to the policy that move $r\\_{t}\\left(\\theta\\right)$ away from 1:\r\n\r\n$$ J^{\\text{CLIP}}\\left({\\theta}\\right) = \\hat{\\mathbb{E}}\\_{t}\\left[\\min\\left(r\\_{t}\\left(\\theta\\right)\\hat{A}\\_{t}, \\text{clip}\\left(r\\_{t}\\left(\\theta\\right), 1-\\epsilon, 1+\\epsilon\\right)\\hat{A}\\_{t}\\right)\\right] $$\r\n\r\nwhere $\\epsilon$ is a hyperparameter, say, $\\epsilon = 0.2$. The motivation for this objective is as follows. The first term inside the min is $L^{CPI}$. 
The second term, $\\text{clip}\\left(r\\_{t}\\left(\\theta\\right), 1-\\epsilon, 1+\\epsilon\\right)\\hat{A}\\_{t}$, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving $r\\_{t}$ outside of the interval $\\left[1 \u2212 \\epsilon, 1 + \\epsilon\\right]$. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse. \r\n\r\nOne detail to note is that when we apply PPO to a network where parameters are shared between the actor and critic functions, we typically add to the objective function an error term on value estimation and an entropy term to encourage exploration.","80":"**VQ-VAE** is a type of variational autoencoder that uses vector quantisation to obtain a discrete latent representation. It differs from [VAEs](https:\/\/paperswithcode.com\/method\/vae) in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, ideas from vector quantisation (VQ) are incorporated. Using the VQ method allows the model to circumvent issues of posterior collapse - where the latents are ignored when they are paired with a powerful autoregressive decoder - typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes.","81":"**Non Maximum Suppression** is a computer vision method that selects a single entity out of many overlapping entities (for example bounding boxes in object detection). The criterion is usually to discard entities that are below a given probability bound. From the remaining entities, we repeatedly pick the entity with the highest probability, output that as the prediction, and discard any remaining box with an $\\text{IoU} \\geq 0.5$ with the box output in the previous step.\r\n\r\nImage Credit: [Martin Kersner](https:\/\/github.com\/martinkersner\/non-maximum-suppression-cpp)","82":"**SSD** is a single-stage object detection method that discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. \r\n\r\nThe fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. Improvements over competing single-stage methods include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales.","83":"**Region of Interest Align**, or **RoIAlign**, is an operation for extracting a small feature map from each RoI in detection and segmentation based tasks. 
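\r\n\r\nIts key ingredient, discussed below, is evaluating the feature map at non-integer coordinates via bilinear interpolation; a minimal sketch of that sampling step (illustrative only, no bounds checks):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef bilinear_sample(fmap, y, x):\r\n    # Interpolate a (H, W) feature map at the real-valued point (y, x)\r\n    # from its four integer neighbours.\r\n    y0, x0 = int(np.floor(y)), int(np.floor(x))\r\n    y1, x1 = y0 + 1, x0 + 1\r\n    wy, wx = y - y0, x - x0\r\n    return ((1 - wy) * (1 - wx) * fmap[y0, x0] + (1 - wy) * wx * fmap[y0, x1]\r\n            + wy * (1 - wx) * fmap[y1, x0] + wy * wx * fmap[y1, x1])\r\n```\r\n\r\n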
It removes the harsh quantization of [RoI Pool](https:\/\/paperswithcode.com\/method\/roi-pooling), properly *aligning* the extracted features with the input. To avoid any quantization of the RoI boundaries or bins (using $x\/16$ instead of $[x\/16]$), RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and the result is then aggregated (using max or average).","84":"**Mask R-CNN** extends [Faster R-CNN](http:\/\/paperswithcode.com\/method\/faster-r-cnn) to solve instance segmentation tasks. It achieves this by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. In principle, Mask R-CNN is an intuitive extension of Faster [R-CNN](https:\/\/paperswithcode.com\/method\/r-cnn), but constructing the mask branch properly is critical for good results. \r\n\r\nMost importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is evident in how [RoIPool](http:\/\/paperswithcode.com\/method\/roi-pooling), the *de facto* core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, Mask R-CNN utilises a simple, quantization-free layer, called [RoIAlign](http:\/\/paperswithcode.com\/method\/roi-align), that faithfully preserves exact spatial locations. \r\n\r\nSecondly, Mask R-CNN *decouples* mask and class prediction: it predicts a binary mask for each class independently, without competition among classes, and relies on the network's RoI classification branch to predict the category. In contrast, an [FCN](http:\/\/paperswithcode.com\/method\/fcn) usually performs per-pixel multi-class categorization, which couples segmentation and classification.","85":"**Inpainting** is the task of generating the contents of an arbitrary image region conditioned on its surroundings; a common approach is to train a convolutional neural network to do this.","86":"**Local Response Normalization** is a normalization layer that implements the idea of lateral inhibition. Lateral inhibition is a concept in neurobiology that refers to the phenomenon of an excited neuron inhibiting its neighbours: this leads to a peak in the form of a local maximum, creating contrast in that area and increasing sensory perception. In practice, we can either normalize within the same channel or normalize across channels when we apply LRN to convolutional neural networks.\r\n\r\n$$ b_{c} = a_{c}\\left(k + \\frac{\\alpha}{n}\\sum_{c'=\\max(0, c-n\/2)}^{\\min(N-1,c+n\/2)}a_{c'}^2\\right)^{-\\beta} $$\r\n\r\nwhere $n$ is the number of neighbouring channels used for normalization, $\\alpha$ is a multiplicative factor, $\\beta$ an exponent and $k$ an additive factor.","87":"A **Grouped Convolution** uses a group of convolutions - multiple kernels per layer - resulting in multiple channel outputs per layer. This leads to wider networks, helping a network learn a varied set of low level and high level features. The original motivation of using Grouped Convolutions in [AlexNet](https:\/\/paperswithcode.com\/method\/alexnet) was to distribute the model over multiple GPUs as an engineering compromise. But later, with models such as [ResNeXt](https:\/\/paperswithcode.com\/method\/resnext), it was shown this module could be used to improve classification accuracy. 
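\r\n\r\nIn modern frameworks this is a one-argument change; a PyTorch sketch (the sizes are arbitrary):\r\n\r\n```python\r\nimport torch.nn as nn\r\n\r\n# groups=4 splits the 32 input channels into 4 groups of 8; each group of\r\n# 16 output channels only sees its own 8 input channels.\r\ngrouped = nn.Conv2d(32, 64, kernel_size=3, padding=1, groups=4)\r\ndense = nn.Conv2d(32, 64, kernel_size=3, padding=1)  # groups=1 baseline\r\n\r\nprint(sum(p.numel() for p in grouped.parameters()))  # roughly 4x fewer weights\r\nprint(sum(p.numel() for p in dense.parameters()))\r\n```\r\n\r\n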
Specifically, grouped convolutions expose a new dimension, *cardinality* (the size of the set of transformations), and increasing cardinality can increase accuracy.","88":"**AlexNet** is a classic convolutional neural network architecture. It consists of convolutions, [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) and dense layers as the basic building blocks. Grouped convolutions are used in order to fit the model across two GPUs.","89":"A **Non-Local Operation** is a component for capturing long-range dependencies with deep neural networks. It is a generalization of the classical non-local mean operation in computer vision. Intuitively, a non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps. The set of positions can be in space, time, or spacetime, implying that these operations are applicable for image, sequence, and video problems.\r\n\r\nFollowing the non-local mean operation, a generic non-local operation for deep neural networks is defined as:\r\n\r\n$$ \\mathbb{y}\\_{i} = \\frac{1}{\\mathcal{C}\\left(\\mathbb{x}\\right)}\\sum\\_{\\forall{j}}f\\left(\\mathbb{x}\\_{i}, \\mathbb{x}\\_{j}\\right)g\\left(\\mathbb{x}\\_{j}\\right) $$\r\n\r\nHere $i$ is the index of an output position (in space, time, or spacetime) whose response is to be computed and $j$ is the index that enumerates all possible positions. $x$ is the input signal (image, sequence, video; often their features) and $y$ is the output signal of the same size as $x$. A pairwise function $f$ computes a scalar (representing relationship such as affinity) between $i$ and all $j$. The unary function $g$ computes a representation of the input signal at the position $j$. The response is normalized by a factor $\\mathcal{C}\\left(x\\right)$.\r\n\r\nThe non-local behavior is due to the fact that all positions ($\\forall{j}$) are considered in the operation. As a comparison, a convolutional operation sums up the weighted input in a local neighborhood (e.g., $i \u2212 1 \\leq j \\leq i + 1$ in a 1D case with kernel size 3), and a recurrent operation at time $i$ is often based only on the current and the latest time steps (e.g., $j = i$ or $i \u2212 1$).\r\n\r\nThe non-local operation is also different from a fully-connected (fc) layer. The equation above computes responses based on relationships between different locations, whereas fc uses learned weights. In other words, the relationship between $x\\_{j}$ and $x\\_{i}$ is not a function of the input data in fc, unlike in non-local layers. Furthermore, the formulation in the equation above supports inputs of variable sizes, and maintains the corresponding size in the output. On the contrary, an fc layer requires a fixed-size input\/output and loses positional correspondence (e.g., that from $x\\_{i}$ to $y\\_{i}$ at the position $i$).\r\n\r\nA non-local operation is a flexible building block and can be easily used together with convolutional\/recurrent layers. It can be added into the earlier part of deep neural networks, unlike fc layers that are often used in the end. This allows us to build a richer hierarchy that combines both non-local and local information.\r\n\r\nIn terms of parameterisation, we usually parameterise $g$ as a linear embedding of the form $g\\left(x\\_{j}\\right) = W\\_{g}\\mathbb{x}\\_{j}$, where $W\\_{g}$ is a weight matrix to be learned. 
This is implemented as, e.g., 1\u00d71 [convolution](https:\/\/paperswithcode.com\/method\/convolution) in space or 1\u00d71\u00d71 convolution in spacetime. For $f$ we use an affinity function, a list of which can be found [here](https:\/\/paperswithcode.com\/methods\/category\/affinity-functions).","90":"A **Non-Local Block** is an image block module used in neural networks that wraps a [non-local operation](https:\/\/paperswithcode.com\/method\/non-local-operation). We can define a non-local block as:\r\n\r\n$$ \\mathbb{z}\\_{i} = W\\_{z}\\mathbb{y}\\_{i} + \\mathbb{x}\\_{i} $$\r\n\r\nwhere $y\\_{i}$ is the output from the non-local operation and $+ \\mathbb{x}\\_{i}$ is a [residual connection](https:\/\/paperswithcode.com\/method\/residual-connection).","91":"**k-Means Clustering** is a clustering algorithm that divides a training set into $k$ different clusters of examples that are near each other. It works by initializing $k$ different centroids {$\\mu^{(1)},\\ldots,\\mu^{(k)}$} to different values, then alternating between two steps until convergence:\r\n\r\n(i) each training example is assigned to cluster $i$, where $i$ is the index of the nearest centroid $\\mu^{(i)}$\r\n\r\n(ii) each centroid $\\mu^{(i)}$ is updated to the mean of all training examples $x^{(j)}$ assigned to cluster $i$.\r\n\r\nText Source: Deep Learning, Goodfellow et al\r\n\r\nImage Source: [scikit-learn](https:\/\/scikit-learn.org\/stable\/auto_examples\/cluster\/plot_kmeans_digits.html)","92":"**Logistic Regression**, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.\r\n\r\nSource: [scikit-learn](https:\/\/scikit-learn.org\/stable\/modules\/linear_model.html#logistic-regression)\r\n\r\nImage: [Michaelg2015](https:\/\/commons.wikimedia.org\/wiki\/User:Michaelg2015)","93":"**Darknet-53** is a convolutional neural network that acts as a backbone for the [YOLOv3](https:\/\/paperswithcode.com\/method\/yolov3) object detection approach. The improvements upon its predecessor [Darknet-19](https:\/\/paperswithcode.com\/method\/darknet-19) include the use of residual connections, as well as more layers.","94":"**YOLOv3** is a real-time, single-stage object detection model that builds on [YOLOv2](https:\/\/paperswithcode.com\/method\/yolov2) with several improvements. Improvements include the use of a new backbone network, [Darknet-53](https:\/\/paperswithcode.com\/method\/darknet-53) that utilises residual connections, or in the words of the author, \"those newfangled residual network stuff\", as well as some improvements to the bounding box prediction step, and use of three different scales from which to extract features (similar to an [FPN](https:\/\/paperswithcode.com\/method\/fpn)).","95":"Dynamic Time Warping (DTW) [1] is one of the well-known distance measures between a pair of time series. The main idea of DTW is to compute the distance from the matching of similar elements between time series. 
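\r\n\r\nA classic $O(nm)$ dynamic-programming sketch of the DTW distance between two 1-D sequences (illustrative; absolute difference as the local cost):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef dtw_distance(a, b):\r\n    n, m = len(a), len(b)\r\n    D = np.full((n + 1, m + 1), np.inf)\r\n    D[0, 0] = 0.0\r\n    for i in range(1, n + 1):\r\n        for j in range(1, m + 1):\r\n            cost = abs(a[i - 1] - b[j - 1])\r\n            # match a[i-1] with b[j-1], extending the cheapest valid alignment\r\n            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])\r\n    return D[n, m]\r\n\r\nprint(dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 2, 3]))  # 0.0: same shape, different speed\r\n```\r\n\r\n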
It uses this dynamic programming technique to find the optimal temporal matching between elements of two time series.\r\n\r\nFor instance, similarities in walking could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio, and graphics data \u2014 indeed, any data that can be turned into a linear sequence can be analyzed with DTW. A well known application has been automatic speech recognition, to cope with different speaking speeds. Other applications include speaker recognition and online signature recognition. It can also be used in partial shape matching applications.\r\n\r\nIn general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restrictions and rules:\r\n\r\n1. Every index from the first sequence must be matched with one or more indices from the other sequence, and vice versa\r\n2. The first index from the first sequence must be matched with the first index from the other sequence (but it does not have to be its only match)\r\n3. The last index from the first sequence must be matched with the last index from the other sequence (but it does not have to be its only match)\r\n4. The mapping of the indices from the first sequence to indices from the other sequence must be monotonically increasing, and vice versa, i.e. if $j > i$ are indices from the first sequence, then there must not be two indices $l > k$ in the other sequence, such that index $i$ is matched with index $l$ and index $j$ is matched with index $k$, and vice versa.\r\n\r\n[1] Sakoe, Hiroaki, and Seibi Chiba. \"Dynamic programming algorithm optimization for spoken word recognition.\" IEEE Transactions on Acoustics, Speech, and Signal Processing 26, no. 1 (1978): 43-49.","96":"The **alternating direction method of multipliers** (**ADMM**) is an algorithm that solves convex optimization problems by breaking them into smaller pieces, each of which is then easier to handle. It takes the form of a decomposition-coordination procedure, in which the solutions to small local subproblems are coordinated to find a solution to a large global problem. ADMM can be viewed as an attempt to blend the benefits of dual decomposition and augmented Lagrangian methods for constrained optimization. It turns out to be equivalent or closely related to many other algorithms as well, such as Douglas-Rachford splitting from numerical analysis, Spingarn\u2019s method of partial inverses, Dykstra\u2019s alternating projections method, Bregman iterative algorithms for $\\ell_1$ problems in signal processing, proximal methods, and many others.\r\n\r\nText Source: [https:\/\/stanford.edu\/~boyd\/papers\/pdf\/admm_distr_stats.pdf](https:\/\/stanford.edu\/~boyd\/papers\/pdf\/admm_distr_stats.pdf)\r\n\r\nImage Source: [here](https:\/\/www.slideshare.net\/derekcypang\/alternating-direction)","97":"A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. 
Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\r\nSource: [Distilling the Knowledge in a Neural Network](https:\/\/arxiv.org\/abs\/1503.02531)","98":"**Additive Attention**, also known as **Bahdanau Attention**, uses a one-hidden layer feed-forward network to calculate the attention alignment score:\r\n\r\n$$f_{att}\\left(\\textbf{h}_{i}, \\textbf{s}\\_{j}\\right) = v\\_{a}^{T}\\tanh\\left(\\textbf{W}\\_{a}\\left[\\textbf{h}\\_{i};\\textbf{s}\\_{j}\\right]\\right)$$\r\n\r\nwhere $\\textbf{v}\\_{a}$ and $\\textbf{W}\\_{a}$ are learned attention parameters. Here $\\textbf{h}$ refers to the hidden states for the encoder, and $\\textbf{s}$ is the hidden states for the decoder. The function above is thus a type of alignment score function. We can use a matrix of alignment scores to show the correlation between source and target words, as the Figure to the right shows.\r\n\r\nWithin a neural network, once we have the alignment scores, we calculate the final scores using a [softmax](https:\/\/paperswithcode.com\/method\/softmax) function of these alignment scores (ensuring it sums to 1).","99":"A **CNN BiLSTM** is a hybrid bidirectional [LSTM](https:\/\/paperswithcode.com\/method\/lstm) and CNN architecture. In the original formulation applied to named entity recognition, it learns both character-level and word-level features. The CNN component is used to induce the character-level features. For each word the model employs a [convolution](https:\/\/paperswithcode.com\/method\/convolution) and a [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) layer to extract a new feature vector from the per-character feature vectors such as character embeddings and (optionally) character type.","100":"**Cross View Training**, or **CVT**, is a semi-supervised algorithm for training distributed word representations that makes use of unlabelled and labelled examples. \r\n\r\nCVT adds $k$ auxiliary prediction modules to the model, a Bi-[LSTM](https:\/\/paperswithcode.com\/method\/lstm) encoder, which are used when learning on unlabeled examples. A prediction module is usually a small neural network (e.g., a hidden layer followed by a [softmax](https:\/\/paperswithcode.com\/method\/softmax) layer). Each one takes as input an intermediate representation $h^j(x_i)$ produced by the model (e.g., the outputs of one of the LSTMs in a Bi-LSTM model). It outputs a distribution over labels $p\\_{j}^{\\theta}\\left(y\\mid{x\\_{i}}\\right)$.\r\n\r\nEach $h^j$ is chosen such that it only uses a part of the input $x_i$; the particular choice can depend on the task and model architecture. 
The auxiliary prediction modules are only used during training; the test-time predictions come from the primary prediction module that produces $p_\\theta$.","101":"In the ALIGN method, visual and language representations are jointly trained from noisy image alt-text data. The image and text encoders are learned via a contrastive loss (formulated as normalized softmax) that pushes the embeddings of matched image-text pairs together and pushes those of non-matched image-text pairs apart. The model learns to align visual and language representations of the image and text pairs using the contrastive loss. The representations can be used for vision-only or vision-language task transfer. Without any fine-tuning, ALIGN powers zero-shot visual classification and cross-modal search including image-to-text search, text-to-image search and even search with joint image+text queries.","102":"**Restricted Boltzmann Machines**, or **RBMs**, are two-layer generative neural networks that learn a probability distribution over the inputs. They are a special class of Boltzmann Machine in that they have a restricted number of connections between visible and hidden units. Every node in the visible layer is connected to every node in the hidden layer, but no nodes in the same group are connected. RBMs are usually trained using the contrastive divergence learning procedure.\r\n\r\nImage Source: [here](https:\/\/medium.com\/datatype\/restricted-boltzmann-machine-a-complete-analysis-part-1-introduction-model-formulation-1a4404873b3)","103":"**ReLIC**, or **Representation Learning via Invariant Causal Mechanisms**, is a self-supervised learning objective that enforces invariant prediction of proxy targets across augmentations through an invariance regularizer which yields improved generalization guarantees. \r\n\r\nWe can write the objective as:\r\n\r\n$$\r\n\\underset{X}{\\mathbb{E}}\\, \\underset{a\\_{lk}, a\\_{qt} \\sim \\mathcal{A}}{\\mathbb{E}} \\sum\\_{b \\in \\{a\\_{lk}, a\\_{qt}\\}} \\mathcal{L}\\_{b}\\left(Y^{R}, f(X)\\right) \\text{ s.t. } KL\\left(p^{do\\left(a\\_{lk}\\right)}\\left(Y^{R} \\mid f(X)\\right), p^{do\\left(a\\_{qt}\\right)}\\left(Y^{R} \\mid f(X)\\right)\\right) \\leq \\rho\r\n$$\r\n\r\nwhere $\\mathcal{L}$ is the proxy task loss and $KL$ is the Kullback-Leibler (KL) divergence. Note that any distance measure on distributions can be used in place of the KL divergence.\r\n\r\nConcretely, as proxy task we associate to every datapoint $x\\_{i}$ the label $y\\_{i}^{R}=i$. This corresponds to the instance discrimination task, commonly used in contrastive learning. We take pairs of points $\\left(x\\_{i}, x\\_{j}\\right)$ to compute similarity scores and use pairs of augmentations $a\\_{lk}=\\left(a\\_{l}, a\\_{k}\\right) \\in \\mathcal{A} \\times \\mathcal{A}$ to perform a style intervention. Given a batch of samples $\\{x\\_{i}\\}\\_{i=1}^{N} \\sim \\mathcal{D}$, we use\r\n\r\n$$\r\np^{do\\left(a\\_{lk}\\right)}\\left(Y^{R}=j \\mid f\\left(x\\_{i}\\right)\\right) \\propto \\exp \\left(\\phi\\left(f\\left(x\\_{i}^{a\\_{l}}\\right), h\\left(x\\_{j}^{a\\_{k}}\\right)\\right) \/ \\tau\\right)\r\n$$\r\n\r\nwith $x^{a}$ data augmented with $a$ and $\\tau$ a softmax temperature parameter. We encode $f$ using a neural network and choose $h$ to be related to $f$, e.g. $h=f$ or as a network with an exponential moving average of the weights of $f$ (e.g. target networks). 
To compare representations we use the function $\\phi\\left(f\\left(x\\_{i}\\right), h\\left(x\\_{j}\\right)\\right)=\\left\\langle g\\left(f\\left(x\\_{i}\\right)\\right), g\\left(h\\left(x\\_{j}\\right)\\right)\\right\\rangle$ where $g$ is a fully-connected neural network often called the critic.\r\n\r\nCombining these pieces, we learn representations by minimizing the following objective over the full set of data $x\\_{i} \\in \\mathcal{D}$ and augmentations $a\\_{lk} \\in \\mathcal{A} \\times \\mathcal{A}$\r\n\r\n$$\r\n-\\sum\\_{i=1}^{N} \\sum\\_{a\\_{lk}} \\log \\frac{\\exp \\left(\\phi\\left(f\\left(x\\_{i}^{a\\_{l}}\\right), h\\left(x\\_{i}^{a\\_{k}}\\right)\\right) \/ \\tau\\right)}{\\sum\\_{m=1}^{M} \\exp \\left(\\phi\\left(f\\left(x\\_{i}^{a\\_{l}}\\right), h\\left(x\\_{m}^{a\\_{k}}\\right)\\right) \/ \\tau\\right)}+\\alpha \\sum\\_{a\\_{lk}, a\\_{qt}} KL\\left(p^{do\\left(a\\_{lk}\\right)}, p^{do\\left(a\\_{qt}\\right)}\\right)\r\n$$\r\n\r\nwith $M$ the number of points we use to construct the contrast set and $\\alpha$ the weighting of the invariance penalty. The shorthand $p^{do(a)}$ is used for $p^{do(a)}\\left(Y^{R}=j \\mid f\\left(x\\_{i}\\right)\\right)$. The Figure shows a schematic of the ReLIC objective.","104":"A **Gated Recurrent Unit**, or **GRU**, is a type of recurrent neural network. It is similar to an [LSTM](https:\/\/paperswithcode.com\/method\/lstm), but only has two gates - a reset gate and an update gate - and notably lacks an output gate. Fewer parameters means GRUs are generally easier\/faster to train than their LSTM counterparts.\r\n\r\nImage Source: [here](https:\/\/www.google.com\/url?sa=i&url=https%3A%2F%2Fcommons.wikimedia.org%2Fwiki%2FFile%3AGated_Recurrent_Unit%2C_type_1.svg&psig=AOvVaw3EmNX8QXC5hvyxeenmJIUn&ust=1590332062671000&source=images&cd=vfe&ved=0CA0QjhxqFwoTCMiev9-eyukCFQAAAAAdAAAAABAR)","105":"**Gated Transformer-XL**, or **GTrXL**, is a [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers)-based architecture for reinforcement learning. It introduces architectural modifications that improve the stability and learning speed of the original Transformer and XL variant. Changes include:\r\n\r\n- Placing the [layer normalization](https:\/\/paperswithcode.com\/method\/layer-normalization) on only the input stream of the submodules. A key benefit to this reordering is that it now enables an identity map from the input of the transformer at the first layer to the output of the transformer after the last layer. This is in contrast to the canonical transformer, where there are a series of layer normalization operations that non-linearly transform the state encoding.\r\n- Replacing [residual connections](https:\/\/paperswithcode.com\/method\/residual-connection) with gating layers. The authors' experiments found that [GRUs](https:\/\/www.paperswithcode.com\/method\/gru) were the most effective form of gating.","106":"**Contrastive BERT** is a reinforcement learning agent that combines a new contrastive loss and a hybrid [LSTM](https:\/\/paperswithcode.com\/method\/lstm)-[transformer](https:\/\/paperswithcode.com\/method\/transformer) architecture to tackle the challenge of improving data efficiency for RL. 
It uses bidirectional masked prediction in combination with a generalization of recent contrastive methods to learn better representations for transformers in RL, without the need for hand-engineered data augmentations.\r\n\r\nFor the architecture, a residual network is used to encode observations into embeddings $Y\\_{t}$. $Y\\_{t}$ is fed through a causally masked [GTrXL transformer](https:\/\/www.paperswithcode.com\/method\/gtrxl), which computes the predicted masked inputs $X\\_{t}$ and passes those together with $Y\\_{t}$ to a learnt gate. The output of the gate is passed through a single [LSTM](https:\/\/www.paperswithcode.com\/method\/lstm) layer to produce the values that we use for computing the RL loss. A contrastive loss is computed using predicted masked inputs $X\\_{t}$ and $Y\\_{t}$ as targets. For this, we do not use the causal mask of the Transformer.","107":"**Seq2Seq**, or **Sequence To Sequence**, is a model used in sequence prediction tasks, such as language modelling and machine translation. The idea is to use one [LSTM](https:\/\/paperswithcode.com\/method\/lstm), the *encoder*, to read the input sequence one timestep at a time, to obtain a large fixed dimensional vector representation (a context vector), and then to use another LSTM, the *decoder*, to extract the output sequence from that vector. The second LSTM is essentially a recurrent neural network language model except that it is conditioned on the input sequence.\r\n\r\n(Note that this page refers to the original seq2seq, not general sequence-to-sequence models)","108":"VERtex Similarity Embeddings (VERSE) is a simple, versatile, and memory-efficient method that derives graph embeddings explicitly calibrated to preserve the distributions of a selected vertex-to-vertex similarity measure. VERSE learns such embeddings by training a single-layer neural network.\r\n\r\nSource: [Tsitsulin et al.](https:\/\/arxiv.org\/pdf\/1803.04742v1.pdf)\r\n\r\nImage source: [Tsitsulin et al.](https:\/\/arxiv.org\/pdf\/1803.04742v1.pdf)","109":"Temporal attention can be seen as a dynamic time selection mechanism determining when to pay attention, and is thus usually used for video processing.","110":"A **DQN**, or Deep Q-Network, approximates an action-value function in a [Q-Learning](https:\/\/paperswithcode.com\/method\/q-learning) framework with a neural network. In the Atari games case, the network takes in several frames of the game as input and outputs a Q-value for each action. \r\n\r\nIt is usually used in conjunction with [Experience Replay](https:\/\/paperswithcode.com\/method\/experience-replay), for storing the episode steps in memory for off-policy learning, where samples are drawn from the replay memory at random. Additionally, the Q-Network is usually optimized towards a frozen target network that is periodically updated with the latest weights every $k$ steps (where $k$ is a hyperparameter). The latter makes training more stable by preventing short-term oscillations from a moving target. The former tackles autocorrelation that would occur from on-line learning, and having a replay memory makes the problem more like a supervised learning problem.\r\n\r\nImage Source: [here](https:\/\/www.researchgate.net\/publication\/319643003_Autonomous_Quadrotor_Landing_using_Deep_Reinforcement_Learning)","111":"An **Hourglass Module** is an image block module used mainly for pose estimation tasks. The design of the hourglass is motivated by the need to capture information at every scale. 
While local evidence is essential for identifying features like faces and hands, a final pose estimate requires a coherent understanding of the full body. The person\u2019s orientation, the arrangement of their limbs, and the relationships of adjacent joints are among the many cues that are best recognized at different scales in the image. The hourglass is a simple, minimal design that has the capacity to capture all of these features and bring them together to output pixel-wise predictions.\r\n\r\nThe network must have some mechanism to effectively process and consolidate features across scales. The Hourglass uses a single pipeline with skip layers to preserve spatial information at each resolution. The network reaches its lowest resolution at 4x4 pixels allowing smaller spatial filters to be applied that compare features across the entire space of the image.\r\n\r\nThe hourglass is set up as follows: Convolutional and [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) layers are used to process features down to a very low resolution. At each max pooling step, the network branches off and applies more convolutions at the original pre-pooled resolution. After reaching the lowest resolution, the network begins the top-down sequence of upsampling and combination of features across scales. To bring together information across two adjacent resolutions, we do nearest neighbor upsampling of the lower resolution followed by an elementwise addition of the two sets of features. The topology of the hourglass is symmetric, so for every layer present on the way down there is a corresponding layer going up.\r\n\r\nAfter reaching the output resolution of the network, two consecutive rounds of 1x1 convolutions are applied to produce the final network predictions. The output of the network is a set of heatmaps where for a given [heatmap](https:\/\/paperswithcode.com\/method\/heatmap) the network predicts the probability of a joint\u2019s presence at each and every pixel.","112":"**Random Scaling** is a type of image data augmentation where we randomly change the scale of the image within a specified range.","113":"**Stacked Hourglass Networks** are a type of convolutional neural network for pose estimation. They are based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.","114":"**Corner Pooling** is a pooling technique for object detection that seeks to better localize corners by encoding explicit prior knowledge. Suppose we want to determine if a pixel at location $\\left(i, j\\right)$ is a top-left corner. Let $f\\_{t}$ and $f\\_{l}$ be the feature maps that are the inputs to the top-left corner pooling layer, and let $f\\_{t\\_{ij}}$ and $f\\_{l\\_{ij}}$ be the vectors at location $\\left(i, j\\right)$ in $f\\_{t}$ and $f\\_{l}$ respectively. With $H \\times W$ feature maps, the corner pooling layer first max-pools all feature vectors between $\\left(i, j\\right)$ and $\\left(i, H\\right)$ in $f\\_{t}$ to a feature vector $t\\_{ij}$, and max-pools all feature vectors between $\\left(i, j\\right)$ and $\\left(W, j\\right)$ in $f\\_{l}$ to a feature vector $l\\_{ij}$. Finally, it adds $t\\_{ij}$ and $l\\_{ij}$ together.","115":"**CornerNet** is an object detection model that detects an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single [convolution](https:\/\/paperswithcode.com\/method\/convolution) neural network. 
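\r\n\r\nThe corner pooling operation defined above reduces to a reversed cumulative maximum along each axis; a NumPy sketch for single-channel $(H, W)$ maps (illustrative only):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef top_left_corner_pool(f_t, f_l):\r\n    # At each (i, j): max over f_t[i:, j] plus max over f_l[i, j:].\r\n    t = np.maximum.accumulate(f_t[::-1, :], axis=0)[::-1, :]\r\n    l = np.maximum.accumulate(f_l[:, ::-1], axis=1)[:, ::-1]\r\n    return t + l\r\n```\r\n\r\n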
","115":"**CornerNet** is an object detection model that detects an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single [convolutional](https:\/\/paperswithcode.com\/method\/convolution) neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. It also utilises [corner pooling](https:\/\/paperswithcode.com\/method\/corner-pooling), a new type of pooling layer that helps the network better localize corners.","116":"Non-maximum suppression is an integral part of the object detection pipeline. First, it sorts all detection boxes on the basis of their scores. The detection box $M$ with the maximum score is selected and all other detection boxes with a significant overlap (using a pre-defined threshold)\r\nwith $M$ are suppressed. This process is recursively applied on the remaining boxes. As per the design of the algorithm, if a genuine object's box overlaps $M$ by more than the predefined threshold, it is suppressed, leading to a miss. \r\n\r\n**Soft-NMS** solves this problem by decaying the detection scores of all other objects as a continuous function of their overlap with $M$. Hence, no object is eliminated in this process.
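\r\n\r\nA minimal NumPy-style sketch of the Gaussian-decay variant (the `sigma` and score threshold values are illustrative; a linear decay variant exists as well):\r\n\r\n```python\r\nimport numpy as np\r\n\r\ndef iou(a, b):\r\n    # boxes as (x1, y1, x2, y2)\r\n    x1, y1 = max(a[0], b[0]), max(a[1], b[1])\r\n    x2, y2 = min(a[2], b[2]), min(a[3], b[3])\r\n    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)\r\n    area = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1])\r\n    return inter \/ (area - inter)\r\n\r\ndef soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):\r\n    scores = scores.astype(float)\r\n    idxs, keep = list(range(len(scores))), []\r\n    while idxs:\r\n        m = max(idxs, key=lambda i: scores[i])  # box with the maximum score\r\n        keep.append(m)\r\n        idxs.remove(m)\r\n        for i in idxs:\r\n            # decay, rather than zero out, the scores of overlapping boxes\r\n            scores[i] *= np.exp(-iou(boxes[m], boxes[i]) ** 2 \/ sigma)\r\n        idxs = [i for i in idxs if scores[i] >= score_thresh]\r\n    return keep, scores\r\n```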
","117":"**RandomHorizontalFlip** is a type of image data augmentation which horizontally flips a given image with a given probability.\r\n\r\nImage Credit: [Apache MXNet](https:\/\/mxnet.apache.org\/versions\/1.5.0\/tutorials\/gluon\/data_augmentation.html)","118":"**Step Decay** is a learning rate schedule that drops the learning rate by a factor every few epochs, where the number of epochs is a hyperparameter.\r\n\r\nImage Credit: [Suki Lau](https:\/\/towardsdatascience.com\/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1)","119":"**MatrixNet** is a scale and aspect ratio aware building block for object detection that seeks to handle objects of different sizes and aspect ratios. It has several matrix layers; each layer handles an object of a specific size and aspect ratio. MatrixNets can be seen as an alternative to [FPNs](https:\/\/paperswithcode.com\/method\/fpn). While FPNs are capable of handling objects of different sizes, they do not have a solution for objects of different aspect ratios. Objects such as a high tower, a giraffe, or a knife introduce a design difficulty for FPNs: does one map these objects to layers according to their width or height? Assigning the object to a layer according to its larger dimension would result in loss of information along the smaller dimension due to aggressive downsampling, and vice versa. \r\n\r\nMatrixNets assign objects of different sizes and aspect ratios to layers such that object sizes within their assigned layers are close to uniform. This assignment allows a square output [convolution](https:\/\/paperswithcode.com\/method\/convolution) kernel to equally gather information about objects of all aspect ratios and scales. MatrixNets can be applied to any backbone, similar to FPNs. We denote this by appending a \"-X\" to the backbone, e.g. ResNet50-X.","120":"**Global-Local Attention** is a type of attention mechanism used in the [ETC](https:\/\/paperswithcode.com\/method\/etc) architecture. ETC receives two separate input sequences: the global input $x^{g} = (x^{g}\\_{1}, \\dots, x^{g}\\_{n\\_{g}})$ and the long input $x^{l} = (x^{l}\\_{1}, \\dots x^{l}\\_{n\\_{l}})$. Typically, the long input contains the input a [standard Transformer](https:\/\/paperswithcode.com\/method\/transformer) would receive, while the global input contains a much smaller number of auxiliary tokens ($n\\_{g} \\ll n\\_{l}$). Attention is then split into four separate pieces: global-to-global (g2g), global-to-long (g2l), long-to-global (l2g), and long-to-long (l2l). Attention in the l2l piece (the most computationally expensive piece) is restricted to a fixed radius $r \\ll n\\_{l}$. To compensate for this limited attention span, the tokens in the global input have unrestricted attention, and thus long input tokens can transfer information to each other through global input tokens. Accordingly, the g2g, g2l, and l2g pieces of attention are unrestricted.","121":"**$R\\_{1}$ Regularization** is a regularization technique and gradient penalty for training [generative adversarial networks](https:\/\/paperswithcode.com\/methods\/category\/generative-adversarial-networks). It penalizes the discriminator for deviating from the Nash equilibrium by penalizing the gradient on real data alone: when the generator distribution produces the true data distribution and the discriminator is equal to 0 on the data manifold, the gradient penalty ensures that the discriminator cannot create a non-zero gradient orthogonal to the data manifold without suffering a loss in the [GAN](https:\/\/paperswithcode.com\/method\/gan) game.\r\n\r\nThis leads to the following regularization term:\r\n\r\n$$ R\\_{1}\\left(\\psi\\right) = \\frac{\\gamma}{2}E\\_{p\\_{D}\\left(x\\right)}\\left[||\\nabla{D\\_{\\psi}\\left(x\\right)}||^{2}\\right] $$
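\r\n\r\nA PyTorch-style sketch of this penalty (the `discriminator` module and the $\\gamma$ default are illustrative):\r\n\r\n```python\r\nimport torch\r\n\r\ndef r1_penalty(discriminator, real_images, gamma=10.0):\r\n    real_images = real_images.detach().requires_grad_(True)\r\n    scores = discriminator(real_images)\r\n    # gradient of the discriminator output w.r.t. the real data alone\r\n    (grad,) = torch.autograd.grad(scores.sum(), real_images, create_graph=True)\r\n    return (gamma \/ 2) * grad.pow(2).flatten(1).sum(dim=1).mean()\r\n```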
","122":"A **Feedforward Network**, or a **Multilayer Perceptron (MLP)**, is a neural network with solely densely connected layers. This is the classic neural network architecture of the literature. It consists of inputs $x$ passed through units $h$ (of which there can be many layers) to predict a target $y$. Activation functions are generally chosen to be non-linear to allow for flexible functional approximation.\r\n\r\nImage Source: Deep Learning, Goodfellow et al.","123":"**Adaptive Instance Normalization** is a normalization method that aligns the mean and variance of the content features with those of the style features. \r\n\r\n[Instance Normalization](https:\/\/paperswithcode.com\/method\/instance-normalization) normalizes the input to a single style specified by the affine parameters. Adaptive Instance Normalization is an extension. In AdaIN, we receive a content input $x$ and a style input $y$, and we simply align the channel-wise mean and variance of $x$ to match those of $y$. Unlike [Batch Normalization](https:\/\/paperswithcode.com\/method\/batch-normalization), Instance Normalization or [Conditional Instance Normalization](https:\/\/paperswithcode.com\/method\/conditional-instance-normalization), AdaIN has no learnable affine parameters. Instead, it adaptively computes the affine parameters from the style input:\r\n\r\n$$\r\n\\textrm{AdaIN}(x, y)= \\sigma(y)\\left(\\frac{x-\\mu(x)}{\\sigma(x)}\\right)+\\mu(y)\r\n$$
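\r\n\r\nA direct PyTorch-style transcription of this formula (illustrative; statistics are taken per channel over the spatial dimensions):\r\n\r\n```python\r\nimport torch\r\n\r\ndef adain(x, y, eps=1e-5):\r\n    # x: content features, y: style features, both (N, C, H, W)\r\n    mu_x = x.mean(dim=(2, 3), keepdim=True)\r\n    mu_y = y.mean(dim=(2, 3), keepdim=True)\r\n    sigma_x = x.std(dim=(2, 3), keepdim=True) + eps  # eps avoids division by zero\r\n    sigma_y = y.std(dim=(2, 3), keepdim=True)\r\n    return sigma_y * (x - mu_x) \/ sigma_x + mu_y\r\n```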
","124":"**StyleGAN** is a type of generative adversarial network. It uses an alternative generator architecture for generative adversarial networks, borrowing from the style transfer literature; in particular, the use of [adaptive instance normalization](https:\/\/paperswithcode.com\/method\/adaptive-instance-normalization). Otherwise it follows Progressive [GAN](https:\/\/paperswithcode.com\/method\/gan) in using a progressively growing training regime. Another quirk is that it generates from a fixed-value tensor, not from stochastically generated latent variables as in regular GANs. The stochastically generated latent variables are instead used as style vectors in the adaptive [instance normalization](https:\/\/paperswithcode.com\/method\/instance-normalization) at each resolution, after being transformed by an 8-layer [feedforward network](https:\/\/paperswithcode.com\/method\/feedforward-network). Lastly, it employs a form of regularization called mixing regularization, which mixes two style latent variables during training.","125":"**MAML**, or **Model-Agnostic Meta-Learning**, is a model- and task-agnostic algorithm for meta-learning that trains a model\u2019s parameters such that a small number of gradient updates will lead to fast learning on a new task.\r\n\r\nConsider a model represented by a parametrized function $f\\_{\\theta}$ with parameters $\\theta$. When adapting to a new task $\\mathcal{T}\\_{i}$, the model\u2019s parameters $\\theta$ become $\\theta'\\_{i}$. With MAML, the updated parameter vector $\\theta'\\_{i}$ is computed using one or more gradient descent updates on task $\\mathcal{T}\\_{i}$. For example, when using one gradient update,\r\n\r\n$$ \\theta'\\_{i} = \\theta - \\alpha\\nabla\\_{\\theta}\\mathcal{L}\\_{\\mathcal{T}\\_{i}}\\left(f\\_{\\theta}\\right) $$\r\n\r\nThe step size $\\alpha$ may be fixed as a hyperparameter or meta-learned. The model parameters are trained by optimizing for the performance of $f\\_{\\theta'\\_{i}}$ with respect to $\\theta$ across tasks sampled from $p\\left(\\mathcal{T}\\_{i}\\right)$. More concretely, the meta-objective is as follows:\r\n\r\n$$ \\min\\_{\\theta} \\sum\\_{\\mathcal{T}\\_{i} \\sim p\\left(\\mathcal{T}\\right)} \\mathcal{L}\\_{\\mathcal{T\\_{i}}}\\left(f\\_{\\theta'\\_{i}}\\right) = \\sum\\_{\\mathcal{T}\\_{i} \\sim p\\left(\\mathcal{T}\\right)} \\mathcal{L}\\_{\\mathcal{T\\_{i}}}\\left(f\\_{\\theta - \\alpha\\nabla\\_{\\theta}\\mathcal{L}\\_{\\mathcal{T}\\_{i}}\\left(f\\_{\\theta}\\right)}\\right) $$\r\n\r\nNote that the meta-optimization is performed over the model parameters $\\theta$, whereas the objective is computed using the updated model parameters $\\theta'$. In effect, MAML aims to optimize the model parameters such that one or a small number of gradient steps on a new task will produce maximally effective behavior on that task. The meta-optimization across tasks is performed via stochastic gradient descent ([SGD](https:\/\/paperswithcode.com\/method\/sgd)), such that the model parameters $\\theta$ are updated as follows:\r\n\r\n$$ \\theta \\leftarrow \\theta - \\beta\\nabla\\_{\\theta} \\sum\\_{\\mathcal{T}\\_{i} \\sim p\\left(\\mathcal{T}\\right)} \\mathcal{L}\\_{\\mathcal{T\\_{i}}}\\left(f\\_{\\theta'\\_{i}}\\right)$$\r\n\r\nwhere $\\beta$ is the meta step size.
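\r\n\r\nA compact PyTorch-style sketch of one meta-update with a single inner step (hedged: `torch.func.functional_call` evaluates the model under the adapted parameters, and the MSE task loss is an illustrative stand-in for $\\mathcal{L}\\_{\\mathcal{T}\\_{i}}$):\r\n\r\n```python\r\nimport torch\r\nfrom torch.func import functional_call\r\n\r\ndef task_loss(model, params, batch):\r\n    x, y = batch\r\n    return torch.nn.functional.mse_loss(functional_call(model, params, (x,)), y)\r\n\r\ndef maml_meta_loss(model, tasks, alpha=0.01):\r\n    params = dict(model.named_parameters())\r\n    meta_loss = 0.0\r\n    for support, query in tasks:\r\n        inner = task_loss(model, params, support)  # loss of f_theta on the support set\r\n        grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)\r\n        adapted = {n: p - alpha * g for (n, p), g in zip(params.items(), grads)}\r\n        meta_loss = meta_loss + task_loss(model, adapted, query)  # loss of the adapted model\r\n    return meta_loss  # differentiate w.r.t. theta and step with the meta step size beta\r\n```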
","126":"NeRF represents a scene with a learned, continuous volumetric radiance field $F_\\theta$ defined over a bounded 3D volume. In a NeRF, $F_\\theta$ is a multilayer perceptron (MLP) that takes as input a 3D position $x = (x, y, z)$ and unit-norm viewing direction $d = (dx, dy, dz)$, and produces as output a density $\\sigma$ and color $c = (r, g, b)$. The weights of the multilayer perceptron that parameterize $F_\\theta$ are optimized so as to encode the radiance field of the scene. Volume rendering is used to compute the color of a single pixel.","127":"**ELECTRA** is a [transformer](https:\/\/paperswithcode.com\/method\/transformer) with a new pre-training approach which trains two transformer models: the generator and the discriminator. The generator - trained as a masked language model - replaces tokens in the sequence, and the discriminator (the ELECTRA contribution) attempts to identify which tokens in the sequence were replaced by the generator. This pre-training task is called replaced token detection, and is a replacement for masking the input.","128":"CARLA is an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support development, training, and validation of autonomous urban driving systems. In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. \r\n\r\nSource: [Dosovitskiy et al.](https:\/\/arxiv.org\/pdf\/1711.03938v1.pdf)\r\n\r\nImage source: [Dosovitskiy et al.](https:\/\/arxiv.org\/pdf\/1711.03938v1.pdf)","129":"**Conditional Random Fields** or **CRFs** are a type of probabilistic graphical model that takes neighboring sample context into account for tasks like classification. Prediction is modeled as a graphical model, which implements dependencies between the predictions. The choice of graph depends on the application; for example, linear-chain CRFs are popular in natural language processing, whereas in image-based tasks the graph would connect neighboring locations in an image to enforce that they have similar predictions.\r\n\r\nImage Credit: [Charles Sutton and Andrew McCallum, An Introduction to Conditional Random Fields](https:\/\/homepages.inf.ed.ac.uk\/csutton\/publications\/crftut-fnt.pdf)","130":"**Minimum Description Length** provides a criterion for the selection of models, regardless of their complexity, without the restrictive assumption that the data form a sample from a 'true' distribution.\r\n\r\nExtracted from [scholarpedia](http:\/\/scholarpedia.org\/article\/Minimum_description_length)\r\n\r\n**Source**:\r\n\r\nPaper: [J. Rissanen (1978) Modeling by the shortest data description. Automatica 14, 465-471](https:\/\/doi.org\/10.1016\/0005-1098(78)90005-5)\r\n\r\nBook: [P. D. Gr\u00fcnwald (2007) The Minimum Description Length Principle, MIT Press, June 2007, 570 pages](https:\/\/ieeexplore.ieee.org\/servlet\/opac?bknumber=6267274)","131":"A **Memory Network** provides a memory component that can be read from and written to with the inference capabilities of a neural network model. The motivation is that many neural networks lack a long-term memory component, and their existing memory component encoded by states and weights is too small and not compartmentalized enough to accurately remember facts from the past (RNNs, for example, have difficulty memorizing and doing tasks like copying). \r\n\r\nA memory network consists of a memory $\\textbf{m}$ (an array of objects indexed by $\\textbf{m}\\_{i}$) and four potentially learned components:\r\n\r\n- Input feature map $I$ - feature representation of the data input.\r\n- Generalization $G$ - updates old memories given the new input.\r\n- Output feature map $O$ - produces new feature map given $I$ and $G$.\r\n- Response $R$ - converts output into the desired response. \r\n\r\nGiven an input $x$ (e.g., an input character, word or sentence depending on the granularity chosen, an image or an audio signal) the flow of the model is as follows:\r\n\r\n1. Convert $x$ to an internal feature representation $I\\left(x\\right)$.\r\n2. Update memories $m\\_{i}$ given the new input: $m\\_{i} = G\\left(m\\_{i}, I\\left(x\\right), m\\right)$, $\\forall{i}$.\r\n3. Compute output features $o$ given the new input and the memory: $o = O\\left(I\\left(x\\right), m\\right)$.\r\n4. Finally, decode output features $o$ to give the final response: $r = R\\left(o\\right)$.\r\n\r\nThis process is applied at both train and test time, if there is a distinction between such phases, that\r\nis, memories are also stored at test time, but the model parameters of $I$, $G$, $O$ and $R$ are not updated. Memory networks cover a wide class of possible implementations. The components $I$, $G$, $O$ and $R$ can potentially use any existing ideas from the machine learning literature.\r\n\r\nImage Source: [Adrian Colyer](https:\/\/blog.acolyer.org\/2016\/03\/10\/memory-networks\/)","132":"Diffusion models generate samples by gradually removing noise from a signal, and their training objective can be expressed as a reweighted variational lower bound ([Ho et al., 2020](https:\/\/arxiv.org\/abs\/2006.11239)).","133":"**Discriminative Fine-Tuning** is a fine-tuning strategy that is used for [ULMFiT](https:\/\/paperswithcode.com\/method\/ulmfit) type models. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with a different learning rate. For context, the regular stochastic gradient descent ([SGD](https:\/\/paperswithcode.com\/method\/sgd)) update of a model\u2019s parameters $\\theta$ at time step $t$ looks like the following (Ruder, 2016):\r\n\r\n$$ \\theta\\_{t} = \\theta\\_{t-1} \u2212 \\eta\\cdot\\nabla\\_{\\theta}J\\left(\\theta\\right)$$\r\n\r\nwhere $\\eta$ is the learning rate and $\\nabla\\_{\\theta}J\\left(\\theta\\right)$ is the gradient with regard to the model\u2019s objective function. For discriminative fine-tuning, we split the parameters $\\theta$ into {$\\theta\\_{1}, \\ldots, \\theta\\_{L}$} where $\\theta\\_{l}$ contains the parameters of the model at the $l$-th layer and $L$ is the number of layers of the model. Similarly, we obtain {$\\eta\\_{1}, \\ldots, \\eta\\_{L}$}, where $\\eta\\_{l}$ is the learning rate of the $l$-th layer. The SGD update with discriminative fine-tuning is then:\r\n\r\n$$ \\theta\\_{t}^{l} = \\theta\\_{t-1}^{l} - \\eta^{l}\\cdot\\nabla\\_{\\theta^{l}}J\\left(\\theta\\right) $$\r\n\r\nThe authors find that empirically it worked well to first choose the learning rate $\\eta^{L}$ of the last layer by fine-tuning only the last layer and using $\\eta^{l-1}=\\eta^{l}\/2.6$ as the learning rate for lower layers.
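\r\n\r\nA PyTorch-style sketch using one parameter group per layer (the base rate and the 2.6 factor follow the description; the ordering of `layers` from lowest to highest is an assumption):\r\n\r\n```python\r\nimport torch\r\n\r\ndef discriminative_optimizer(layers, eta_last=0.01, factor=2.6):\r\n    # assign geometrically decaying learning rates from the last layer downwards\r\n    groups, eta = [], eta_last\r\n    for layer in reversed(layers):\r\n        groups.append({\"params\": list(layer.parameters()), \"lr\": eta})\r\n        eta = eta \/ factor\r\n    return torch.optim.SGD(groups)\r\n```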
","134":"**GPT-2** is a [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers) architecture that was notable for its size (1.5 billion parameters) on its release. The model is pretrained on a WebText dataset - text from 45 million website links. It largely follows the previous [GPT](https:\/\/paperswithcode.com\/method\/gpt) architecture with some modifications:\r\n\r\n- [Layer normalization](https:\/\/paperswithcode.com\/method\/layer-normalization) is moved to the input of each sub-block, similar to a\r\npre-activation residual network, and an additional layer normalization is added after the final self-attention block. \r\n\r\n- A modified initialization which accounts for the accumulation on the residual path with model depth\r\nis used. Weights of residual layers are scaled at initialization by a factor of $1\/\\sqrt{N}$, where $N$ is the number of residual layers. \r\n\r\n- The vocabulary is expanded to 50,257. The context size is expanded from 512 to 1024 tokens and\r\na larger batch size of 512 is used.","135":"**Gaussian Processes** are non-parametric models for approximating functions. They rely upon a measure of similarity between points (the kernel function) to predict the value for an unseen point from training data. The models are fully probabilistic, so uncertainty bounds are baked into the model.\r\n\r\nImage Source: Gaussian Processes for Machine Learning, C. E. Rasmussen & C. K. I. Williams","136":"**RoBERTa** is an extension of [BERT](https:\/\/paperswithcode.com\/method\/bert) with changes to the pretraining procedure. The modifications include: \r\n\r\n- training the model longer, with bigger batches, over more data\r\n- removing the next sentence prediction objective\r\n- training on longer sequences\r\n- dynamically changing the masking pattern applied to the training data\r\n\r\nThe authors also collect a large new dataset ($\\text{CC-News}$) of comparable size to other privately used datasets, to better control for training set size effects.","137":"**node2vec** is a framework for learning graph embeddings for nodes in graphs. node2vec maximizes a likelihood objective over mappings which preserve neighbourhood distances in higher dimensional spaces. From an algorithm design perspective, node2vec exploits the freedom to define neighbourhoods for nodes and provides an explanation for the effect of the choice of neighbourhood on the learned representations. \r\n\r\nFor each node, node2vec simulates biased random walks based on an efficient network-aware search strategy, and the nodes appearing in the random walk define neighbourhoods. The search strategy accounts for the relative influence nodes exert in a network. It also generalizes prior work alluding to naive search strategies by providing flexibility in exploring neighbourhoods.","138":"The goal of **Triplet loss**, in the context of Siamese Networks, is to maximize the joint probability among all score-pairs, i.e. the product of all probabilities. By using its negative logarithm, we can get the loss formulation as follows:\r\n\r\n$$\r\nL\\_{t}\\left(\\mathcal{V}\\_{p}, \\mathcal{V}\\_{n}\\right)=-\\frac{1}{M N} \\sum\\_{i}^{M} \\sum\\_{j}^{N} \\log \\operatorname{prob}\\left(v p\\_{i}, v n\\_{j}\\right)\r\n$$\r\n\r\nwhere the balance weight $1\/MN$ is used to keep the loss at the same scale for different numbers of instance sets.","139":"**Contrastive Language-Image Pre-training** (**CLIP**), consisting of a simplified version of ConVIRT trained from scratch, is an efficient method of image representation learning from natural language supervision. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset\u2019s classes. \r\n\r\nFor pre-training, CLIP is trained to predict which of the $N \\times N$ possible (image, text) pairings across a batch actually occurred. CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ real pairs in the batch while minimizing the cosine similarity of the embeddings of the $N^2 - N$ incorrect pairings. A symmetric cross entropy loss is optimized over these similarity scores.
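\r\n\r\nA PyTorch-style sketch of this symmetric loss (the encoders are omitted, and the fixed temperature is illustrative; CLIP actually learns the temperature):\r\n\r\n```python\r\nimport torch\r\nimport torch.nn.functional as F\r\n\r\ndef clip_loss(image_emb, text_emb, temperature=0.07):\r\n    # image_emb, text_emb: (N, D) embeddings of the N aligned (image, text) pairs\r\n    image_emb = F.normalize(image_emb, dim=-1)\r\n    text_emb = F.normalize(text_emb, dim=-1)\r\n    logits = image_emb @ text_emb.t() \/ temperature  # N x N cosine similarities\r\n    labels = torch.arange(len(logits), device=logits.device)  # true pairs on the diagonal\r\n    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) \/ 2\r\n```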
\r\n\r\nImage credit: [Learning Transferable Visual Models From Natural Language Supervision](https:\/\/arxiv.org\/pdf\/2103.00020.pdf)","140":"**Fully Convolutional Networks**, or **FCNs**, are an architecture used mainly for semantic segmentation. They employ solely locally connected layers, such as [convolution](https:\/\/paperswithcode.com\/method\/convolution), pooling and upsampling. Avoiding the use of dense layers means fewer parameters (making the networks faster to train). It also means an FCN can work for variable image sizes, given that all connections are local.\r\n\r\nThe network consists of a downsampling path, used to extract and interpret the context, and an upsampling path, which allows for localization. \r\n\r\nFCNs also employ skip connections to recover the fine-grained spatial information lost in the downsampling path.","141":"**ENet Dilated Bottleneck** is an image model block used in the [ENet](https:\/\/paperswithcode.com\/method\/enet) semantic segmentation architecture. It is the same as a regular [ENet Bottleneck](https:\/\/paperswithcode.com\/method\/enet-bottleneck) but employs dilated convolutions instead.","142":"**ENet Bottleneck** is an image model block used in the [ENet](https:\/\/paperswithcode.com\/method\/enet) semantic segmentation architecture. Each block consists of three convolutional layers: a 1 \u00d7 1 projection that reduces the dimensionality, a main convolutional layer, and a 1 \u00d7 1 expansion. We place [Batch Normalization](https:\/\/paperswithcode.com\/method\/batch-normalization) and [PReLU](https:\/\/paperswithcode.com\/method\/prelu) between all convolutions. If the bottleneck is downsampling, a [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) layer is added to the main branch.\r\nAlso, the first 1 \u00d7 1 projection is replaced with a 2 \u00d7 2 [convolution](https:\/\/paperswithcode.com\/method\/convolution) with stride 2 in both dimensions. We zero-pad the activations to match the number of feature maps.","143":"The **ENet Initial Block** is an image model block used in the [ENet](https:\/\/paperswithcode.com\/method\/enet) semantic segmentation architecture. [Max Pooling](https:\/\/paperswithcode.com\/method\/max-pooling) is performed with non-overlapping 2 \u00d7 2 windows, and the [convolution](https:\/\/paperswithcode.com\/method\/convolution) has 13 filters, which sums up to 16 feature maps after concatenation. This is heavily inspired by Inception Modules.","144":"**SpatialDropout** is a type of [dropout](https:\/\/paperswithcode.com\/method\/dropout) for convolutional networks. For a given [convolution](https:\/\/paperswithcode.com\/method\/convolution) feature tensor of size $n\\_{\\text{feats}}$\u00d7height\u00d7width, we perform only $n\\_{\\text{feats}}$ dropout\r\ntrials and extend the dropout value across the entire feature map. Therefore, adjacent pixels in the dropped-out feature\r\nmap are either all 0 (dropped-out) or all active as illustrated in the figure to the right.
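\r\n\r\nPyTorch exposes this as `torch.nn.Dropout2d`; a manual sketch of the idea (illustrative):\r\n\r\n```python\r\nimport torch\r\n\r\ndef spatial_dropout(x, p=0.5, training=True):\r\n    # x: (N, C, H, W); one Bernoulli trial per feature map, not per pixel\r\n    if not training or p == 0:\r\n        return x\r\n    mask = (torch.rand(x.shape[0], x.shape[1], 1, 1, device=x.device) >= p).float()\r\n    return x * mask \/ (1 - p)  # rescale so activations keep the same expectation\r\n```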
","145":"A **Parametric Rectified Linear Unit**, or **PReLU**, is an activation function that generalizes the traditional rectified unit with a slope for negative values. Formally:\r\n\r\n$$f\\left(y\\_{i}\\right) = y\\_{i} \\text{ if } y\\_{i} \\ge 0$$\r\n$$f\\left(y\\_{i}\\right) = a\\_{i}y\\_{i} \\text{ if } y\\_{i} < 0$$\r\n\r\nThe intuition is that different layers may require different types of nonlinearity. Indeed, the authors find in experiments with convolutional neural networks that PReLUs for the initial layer have more positive slopes, i.e. are closer to linear. Since the filters of the first layers are Gabor-like filters, such as edge or texture detectors, this suggests a circumstance where both positive and negative responses of the filters are respected. In contrast, the authors find that deeper layers have smaller coefficients, suggesting the model becomes more discriminative at later layers (while it wants to retain more information at earlier layers).","146":"**ENet** is a semantic segmentation architecture which utilises a compact encoder-decoder architecture. Some design choices include:\r\n\r\n1. Using the [SegNet](https:\/\/paperswithcode.com\/method\/segnet) approach to downsampling by saving indices of elements chosen in max\r\npooling layers, and using them to produce sparse upsampled maps in the decoder.\r\n2. Early downsampling to optimize the early stages of the network and reduce the cost of processing large input frames. The first two blocks of ENet heavily reduce the input size, and use only a small set of feature maps. \r\n3. Using PReLUs as an activation function\r\n4. Using dilated convolutions \r\n5. Using Spatial [Dropout](https:\/\/paperswithcode.com\/method\/dropout)","147":"**PatchGAN** is a type of discriminator for generative adversarial networks which only penalizes structure at the scale of local image patches. The PatchGAN discriminator tries to classify if each $N \\times N$ patch in an image is real or fake. This discriminator is run convolutionally across the image, averaging all responses to provide the ultimate output of $D$. Such a discriminator effectively models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. It can be understood as a type of texture\/style loss.","148":"**Instance Normalization** (also known as contrast normalization) is a normalization layer where:\r\n\r\n$$\r\n y_{tijk} = \\frac{x_{tijk} - \\mu_{ti}}{\\sqrt{\\sigma_{ti}^2 + \\epsilon}},\r\n \\quad\r\n \\mu_{ti} = \\frac{1}{HW}\\sum_{l=1}^W \\sum_{m=1}^H x_{tilm},\r\n \\quad\r\n \\sigma_{ti}^2 = \\frac{1}{HW}\\sum_{l=1}^W \\sum_{m=1}^H (x_{tilm} - \\mu_{ti})^2.\r\n$$\r\n\r\nThis prevents instance-specific mean and covariance shift, simplifying the learning process. Intuitively, the normalization removes instance-specific contrast information from the content image in a task like image stylization, which simplifies generation.
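\r\n\r\nA direct PyTorch-style transcription of these equations (illustrative; `torch.nn.InstanceNorm2d` provides the same computation):\r\n\r\n```python\r\nimport torch\r\n\r\ndef instance_norm(x, eps=1e-5):\r\n    # x: (N, C, H, W); statistics are per instance t and per channel i\r\n    mu = x.mean(dim=(2, 3), keepdim=True)\r\n    var = x.var(dim=(2, 3), unbiased=False, keepdim=True)\r\n    return (x - mu) \/ torch.sqrt(var + eps)\r\n```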
","149":"**GAN Least Squares Loss** is a least squares loss function for generative adversarial networks. Minimizing this objective function is equivalent to minimizing the Pearson $\\chi^{2}$ divergence. The objective function (here for [LSGAN](https:\/\/www.paperswithcode.com\/method\/lsgan)) can be defined as:\r\n\r\n$$ \\min\\_{D}V\\_{LS}\\left(D\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{x} \\sim p\\_{data}\\left(\\mathbf{x}\\right)}\\left[\\left(D\\left(\\mathbf{x}\\right) - b\\right)^{2}\\right] + \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z}\\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - a\\right)^{2}\\right] $$\r\n\r\n$$ \\min\\_{G}V\\_{LS}\\left(G\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z} \\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - c\\right)^{2}\\right] $$\r\n\r\nwhere $a$ and $b$ are the labels for fake data and real data, respectively, and $c$ denotes the value that $G$ wants $D$ to believe for fake data.","150":"**Cycle Consistency Loss** is a type of loss used for generative adversarial networks that perform unpaired image-to-image translation. It was introduced with the [CycleGAN](https:\/\/paperswithcode.com\/method\/cyclegan) architecture. For two domains $X$ and $Y$, we want to learn a mapping $G : X \\rightarrow Y$ and a mapping $F: Y \\rightarrow X$. We want to enforce the intuition that these mappings should be reverses of each other and that both mappings should be bijections. Cycle Consistency Loss encourages $F\\left(G\\left(x\\right)\\right) \\approx x$ and $G\\left(F\\left(y\\right)\\right) \\approx y$. It reduces the space of possible mapping functions by enforcing forward and backward consistency:\r\n\r\n$$ \\mathcal{L}\\_{cyc}\\left(G, F\\right) = \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[||F\\left(G\\left(x\\right)\\right) - x||\\_{1}\\right] + \\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[||G\\left(F\\left(y\\right)\\right) - y||\\_{1}\\right] $$
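\r\n\r\nA PyTorch-style sketch of this term (illustrative; `G` and `F` are the two generator networks):\r\n\r\ndef cycle_consistency_loss(G, F, x, y):\r\n    # forward cycle x -> G(x) -> F(G(x)) and backward cycle y -> F(y) -> G(F(y))\r\n    forward = (F(G(x)) - x).abs().mean()   # L1 norm, averaged over the batch\r\n    backward = (G(F(y)) - y).abs().mean()\r\n    return forward + backward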
","151":"**CycleGAN**, or **Cycle-Consistent GAN**, is a type of generative adversarial network for unpaired image-to-image translation. For two domains $X$ and $Y$, CycleGAN learns a mapping $G : X \\rightarrow Y$ and a mapping $F: Y \\rightarrow X$. The novelty lies in trying to enforce the intuition that these mappings should be reverses of each other and that both mappings should be bijections. This is achieved through a [cycle consistency loss](https:\/\/paperswithcode.com\/method\/cycle-consistency-loss) that encourages $F\\left(G\\left(x\\right)\\right) \\approx x$ and $G\\left(F\\left(y\\right)\\right) \\approx y$. Combining this loss with the adversarial losses on $X$ and $Y$ yields the full objective for unpaired image-to-image translation.\r\n\r\nFor the mapping $G : X \\rightarrow Y$ and its discriminator $D\\_{Y}$ we have the objective:\r\n\r\n$$ \\mathcal{L}\\_{GAN}\\left(G, D\\_{Y}, X, Y\\right) =\\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[\\log D\\_{Y}\\left(y\\right)\\right] + \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[\\log\\left(1 \u2212 D\\_{Y}\\left(G\\left(x\\right)\\right)\\right)\\right] $$\r\n\r\nwhere $G$ tries to generate images $G\\left(x\\right)$ that look similar to images from domain $Y$, while $D\\_{Y}$ tries to discriminate between translated samples $G\\left(x\\right)$ and real samples $y$. A similar loss is postulated for the mapping $F: Y \\rightarrow X$ and its discriminator $D\\_{X}$.\r\n\r\nThe Cycle Consistency Loss reduces the space of possible mapping functions by enforcing forward and backward consistency:\r\n\r\n$$ \\mathcal{L}\\_{cyc}\\left(G, F\\right) = \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[||F\\left(G\\left(x\\right)\\right) - x||\\_{1}\\right] + \\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[||G\\left(F\\left(y\\right)\\right) - y||\\_{1}\\right] $$\r\n\r\nThe full objective is:\r\n\r\n$$ \\mathcal{L}\\left(G, F, D\\_{X}, D\\_{Y}\\right) = \\mathcal{L}\\_{GAN}\\left(G, D\\_{Y}, X, Y\\right) + \\mathcal{L}\\_{GAN}\\left(F, D\\_{X}, Y, X\\right) + \\lambda\\mathcal{L}\\_{cyc}\\left(G, F\\right) $$\r\n\r\nwhere we aim to solve:\r\n\r\n$$ G^{\\*}, F^{\\*} = \\arg \\min\\_{G, F} \\max\\_{D\\_{X}, D\\_{Y}} \\mathcal{L}\\left(G, F, D\\_{X}, D\\_{Y}\\right) $$\r\n\r\nFor the original architecture the authors use:\r\n\r\n- two stride-2 convolutions, several residual blocks, and two fractionally strided convolutions with stride $\\frac{1}{2}$.\r\n- [instance normalization](https:\/\/paperswithcode.com\/method\/instance-normalization)\r\n- PatchGANs for the discriminator\r\n- Least Squares Loss for the [GAN](https:\/\/paperswithcode.com\/method\/gan) objectives.","152":"**PointNet** provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. It directly takes point clouds as input and outputs either class labels for the entire input or per-point segment\/part labels for each point of the input.\r\n\r\nSource: [Qi et al.](https:\/\/arxiv.org\/pdf\/1612.00593v2.pdf)\r\n\r\nImage source: [Qi et al.](https:\/\/arxiv.org\/pdf\/1612.00593v2.pdf)","153":"**Adaptive Dropout** is a regularization technique that extends dropout by allowing the dropout probability to be different for different units. The intuition is that there may be hidden units that can individually make confident predictions for the presence or absence of an important feature or combination of features. [Dropout](https:\/\/paperswithcode.com\/method\/dropout) will ignore this confidence and drop the unit out 50% of the time. \r\n\r\nDenote the activity of unit $j$ in a deep neural network by $a\\_{j}$ and assume that its inputs are {$a\\_{i}: i