{"title":{"0":"Average Pooling","1":"1x1 Convolution","2":"Global Average Pooling","3":"Batch Normalization","4":"ReLU","5":"Kaiming Initialization","6":"Residual Connection","7":"Max Pooling","8":"Residual Block","9":"Bottleneck Residual Block","10":"ResNet","11":"Convolution","12":"AutoEncoder","13":"PCA","14":"Q-Learning","15":"Causal Inference","16":"Sigmoid Activation","17":"Tanh Activation","18":"LSTM","19":"Weight Decay","20":"WordPiece","21":"Softmax","22":"Dense Connections","23":"Scaled Dot-Product Attention","24":"Linear Warmup With Linear Decay","25":"Dropout","26":"Layer Normalization","27":"GELU","28":"Adam","29":"Multi-Head Attention","30":"Attention Dropout","31":"BERT","32":"GAN","33":"GPS","34":"Softplus","35":"Mish","36":"BiGRU","37":"Memory Network","38":"GRU","39":"Concatenated Skip Connection","40":"U-Net","41":"PatchGAN","42":"Instance Normalization","43":"Leaky ReLU","44":"GAN Least Squares Loss","45":"Cycle Consistency Loss","46":"CycleGAN","47":"Mixup","48":"Entropy Regularization","49":"PPO","50":"RPN","51":"RoIAlign","52":"Mask R-CNN","53":"Absolute Position Encodings","54":"Position-Wise Feed-Forward Layer","55":"Label Smoothing","56":"BPE","57":"Transformer","58":"Spectral Clustering","59":"IndexNet","60":"Depthwise Convolution","61":"Pointwise Convolution","62":"Depthwise Separable Convolution","63":"Inverted Residual Block","64":"MobileNetV2","65":"LIME","66":"BiLSTM","67":"Logistic Regression","68":"Diffusion","69":"AMP","70":"Triplet Loss","71":"A3C","72":"SHAP","73":"Knowledge Distillation","74":"Cosine Annealing","75":"Adaptive Input Representations","76":"Linear Warmup With Cosine Annealing","77":"Squeeze-and-Excitation Block","78":"Variational Dropout","79":"Adaptive Softmax","80":"Discriminative Fine-Tuning","81":"Transformer-XL","82":"SENet","83":"GPT-2","84":"SGD","85":"Dilated Causal Convolution","86":"Causal Convolution","87":"Affine Coupling","88":"Normalizing Flows","89":"NICE","90":"MAML","91":"LAMB","92":"RoBERTa","93":"ALBERT","94":"DARTS","95":"CAM","96":"SVM","97":"DQN","98":"Temporal attention","99":"fastText","100":"Seq2Seq","101":"Local Response Normalization","102":"Grouped Convolution","103":"AlexNet","104":"DLA","105":"Center Pooling","106":"Cascade Corner Pooling","107":"CenterNet","108":"Interpretability","109":"RoIPool","110":"Faster R-CNN","111":"BYOL","112":"ReLU6","113":"node2vec","114":"Graph Convolutional Networks","115":"Gravity","116":"Relativistic GAN","117":"WGAN","118":"ADMM","119":"Natural Gradient Descent","120":"AE","121":"ELMo","122":"VAE","123":"ATSS","124":"Focal Loss","125":"Xception","126":"Spectral Normalization","127":"FPN","128":"HTC","129":"k-NN","130":"GloVe","131":"BLANC","132":"NeRF","133":"Gaussian Process","134":"DE-GAN","135":"k-Means Clustering","136":"Darknet-19","137":"Darknet-53","138":"YOLOv3","139":"YOLOv2","140":"3D Convolution","141":"Early Stopping","142":"ResNeXt Block","143":"ResNeXt","144":"Linear Regression","145":"Swish","146":"RMSProp","147":"EfficientNet","148":"AWARE","149":"TRPO","150":"Linear Layer","151":"Capsule Network","152":"FAVOR+","153":"Performer","154":"GCN","155":"DistilBERT","156":"XLM","157":"Weight Demodulation","158":"R1 Regularization","159":"Path Length Regularization","160":"StyleGAN2","161":"Denoising Autoencoder","162":"Deformable Convolution","163":"SELU","164":"SNN","165":"InfoNCE","166":"Contrastive Predictive 
Coding","167":"Inception-A","168":"Reduction-A","169":"Inception-B","170":"Reduction-B","171":"Inception-C","172":"Inception-v4","173":"DTW","174":"HRNet","175":"rTPNN","176":"Base Boosting","177":"mBERT","178":"AlphaZero","179":"Sparse Transformer","180":"Jigsaw","181":"Population Based Training","182":"Population Based Augmentation","183":"AutoAugment","184":"TGN","185":"HyperNetwork","186":"GLU","187":"Adafactor","188":"Inverse Square Root Schedule","189":"SentencePiece","190":"T5","191":"Morphence","192":"MDETR","193":"Position-Sensitive RoI Pooling","194":"R-FCN","195":"ALIGN","196":"Dilated Convolution","197":"Fast Voxel Query","198":"VoTr","199":"RAN","200":"MPNN","201":"GA","202":"DeepWalk","203":"GraphSAGE","204":"Vision Transformer","205":"VGG","206":"ELECTRA","207":"Axial Attention","208":"Deformable Attention Module","209":"Feedforward Network","210":"Deformable DETR","211":"Detr","212":"MoCo","213":"Target Policy Smoothing","214":"Clipped Double Q-learning","215":"Experience Replay","216":"DDPG","217":"TD3","218":"SAC","219":"Fixed Factorized Attention","220":"Strided Attention","221":"GPT-3","222":"ENet Initial Block","223":"ENet Dilated Bottleneck","224":"SpatialDropout","225":"PReLU","226":"ENet Bottleneck","227":"ENet","228":"R-CNN","229":"PyTorch DDP","230":"Random Search","231":"SimCSE","232":"CARLA","233":"GPT","234":"Sliding Window Attention","235":"Dilated Sliding Window Attention","236":"Global and Sliding Window Attention","237":"AdamW","238":"Longformer","239":"ABC","240":"Discrete Cosine Transform","241":"Inpainting","242":"CutMix","243":"Stochastic Depth","244":"Swin Transformer","245":"Submanifold Convolution","246":"PULSE","247":"CodeBERT","248":"WaveRNN","249":"Mixture of Logistic Distributions","250":"WaveNet","251":"NT-Xent","252":"Random Gaussian Blur","253":"Random Resized Crop","254":"ColorJitter","255":"SimCLR","256":"DCGAN","257":"DANCE","258":"LDA","259":"CRF","260":"SGD with Momentum","261":"Demon ADAM","262":"Demon CM","263":"Demon","264":"Feature Intertwiner","265":"Gradient Clipping","266":"Non Maximum Suppression","267":"Random Horizontal Flip","268":"PointNet","269":"SSD","270":"FCN","271":"WideResNet","272":"Auxiliary Classifier","273":"OFA","274":"ConvLSTM","275":"mBART","276":"Hourglass Module","277":"Stacked Hourglass Network","278":"Image Scale Augmentation","279":"EfficientDet","280":"BiFPN","281":"Siamese Network","282":"Routing Attention","283":"Routing Transformer","284":"Weight Normalization","285":"NON","286":"mT5","287":"BART","288":"Prioritized Experience Replay","289":"Monte-Carlo Tree Search","290":"MuZero","291":"DCNN","292":"Double Q-learning","293":"Double DQN","294":"GMI","295":"Self-Adversarial Negative Sampling","296":"RotatE","297":"VEGA","298":"Procrustes","299":"Inception Module","300":"GoogLeNet","301":"Retrace","302":"Stochastic Dueling Network","303":"ACER","304":"Additive Attention","305":"Pointer Network","306":"FBNet Block","307":"FBNet","308":"Cascade R-CNN","309":"Dense Block","310":"DenseNet","311":"Grid Sensitive","312":"Bottom-up Path Augmentation","313":"Spatial Pyramid Pooling","314":"PAFPN","315":"Spatial Attention Module","316":"DropBlock","317":"CSPDarknet53","318":"YOLOv4","319":"Lovasz-Softmax","320":"Inception-v3 Module","321":"Inception-v3","322":"Step Decay","323":"MobileNetV1","324":"Wide Residual Block","325":"Channel Attention Module","326":"CBAM","327":"FFB6D","328":"IPL","329":"Random Erasing","330":"Spatial Transformer","331":"Adaptive Instance Normalization","332":"StyleGAN","333":"Exponential 
Decay","334":"Restricted Boltzmann Machine","335":"Maxout","336":"CLIP","337":"Laplacian Pyramid","338":"AccoMontage","339":"Asynchronous Interaction Aggregation","340":"Res2Net Block","341":"Res2Net","342":"Attention Gate","343":"SSE","344":"TopK Copy","345":"TS","346":"GAGNN","347":"SESAME Discriminator","348":"TransE","349":"GENet","350":"Hierarchical Feature Fusion","351":"ESP","352":"ESPNet","353":"GAM","354":"MLP-Mixer","355":"Blender","356":"LMU","357":"RegionViT","358":"Hydra","359":"RetinaNet","360":"AdaGrad","361":"ESIM","362":"Channel Shuffle","363":"Dynamic Convolution","364":"CR-NET","365":"Colorization","366":"LeNet","367":"SMOTE","368":"OASIS","369":"SiLU","370":"Temporal Activation Regularization","371":"Activation Regularization","372":"Weight Tying","373":"Embedding Dropout","374":"DropConnect","375":"AWD-LSTM","376":"Mixture of Softmaxes","377":"Highway Layer","378":"Highway Network","379":"Soft Actor Critic","380":"ORB-SLAM2","381":"SortCut Sinkhorn Attention","382":"Sparse Sinkhorn Attention","383":"Sinkhorn Transformer","384":"Activation Normalization","385":"Invertible 1x1 Convolution","386":"GLOW","387":"MATE","388":"PP-OCR","389":"Dice Loss","390":"Corner Pooling","391":"CornerNet","392":"Soft-NMS","393":"VQ-VAE","394":"GAT","395":"ICA","396":"RealNVP","397":"Xavier Initialization","398":"Spatially Separable Convolution","399":"SqueezeNeXt Block","400":"SqueezeNeXt","401":"Cross-Attention Module","402":"REINFORCE","403":"Cutout","404":"Shake-Shake Regularization","405":"CPN","406":"CoOp","407":"Griffin-Lim Algorithm","408":"Residual GRU","409":"CBHG","410":"Tacotron","411":"SOM","412":"XLNet","413":"CTC Loss","414":"DPN Block","415":"DPN","416":"Early exiting","417":"SPNet","418":"Strip Pooling","419":"SLR","420":"MixConv","421":"MixNet","422":"CaiT","423":"Class Attention","424":"LayerScale","425":"DeiT","426":"Context Enhancement Module","427":"ShuffleNet V2 Block","428":"Spatial Attention Module (ThunderNet)","429":"Position-Sensitive RoIAlign","430":"SNet","431":"ThunderNet","432":"Spatial Broadcast Decoder","433":"Affine Operator","434":"ResMLP","435":"Disentangled Attention Mechanism","436":"DeBERTa","437":"TuckER","438":"V-trace","439":"IMPALA","440":"TayPO","441":"Neural Tangent Transfer","442":"DECA","443":"VoiceFilter-Lite","444":"Channel attention","445":"Efficient Channel Attention","446":"FixRes","447":"Weight Standardization","448":"Group Normalization","449":"Distributed Shampoo","450":"Enhanced Fusion Framework","451":"EMF","452":"MFF","453":"MushroomRL","454":"EWC","455":"L1 Regularization","456":"Sparse Autoencoder","457":"FCOS","458":"AlignPS","459":"VOS","460":"Local SGD","461":"ASLFeat","462":"FastSpeech 2","463":"MAVL","464":"MViT","465":"context2vec","466":"Dot-Product Attention","467":"Spatial Gating Unit","468":"gMLP","469":"VLG-Net","470":"SAGA","471":"UNETR","472":"Global Convolutional Network","473":"FixMatch","474":"RAG","475":"Location-based Attention","476":"CoVe","477":"Graph Self-Attention","478":"RAdam","479":"HypE","480":"Metrix","481":"MoCo v2","482":"DeepMask","483":"Wide&Deep","484":"Laplacian PE","485":"Graph Transformer","486":"Local Contrast Normalization","487":"ZFNet","488":"Electric","489":"COLA","490":"(2+1)D Convolution","491":"R(2+1)D","492":"CodeT5","493":"Lambda Layer","494":"LightGCN","495":"lda2vec","496":"Deep-MAC","497":"Positional Encoding Generator","498":"Conditional Positional Encoding","499":"Global Sub-Sampled Attention","500":"Locally-Grouped Self-Attention","501":"Spatially Separable 
Self-Attention","502":"Twins-SVT","503":"Twins-PCPVT","504":"Meta-augmentation","505":"DiffPool","506":"WGAN-GP Loss","507":"Phase Shuffle","508":"WaveGAN","509":"PnP","510":"Fire Module","511":"SqueezeNet","512":"Go-Explore","513":"DINO","514":"MoCo v3","515":"AMSGrad","516":"SuperpixelGridMasks","517":"SCA-CNN","518":"Hard Swish","519":"MobileNetV3","520":"T2T-ViT","521":"MagFace","522":"Pix2Pix","523":"LARS","524":"SwAV","525":"PGM","526":"YOHO","527":"Hit-Detector","528":"Coordinate attention","529":"Neural Architecture Search","530":"ScheduledDropPath","531":"DD-PPO","532":"AdaMax","533":"Slanted Triangular Learning Rates","534":"ULMFiT","535":"PQ-Transformer","536":"Fractal Block","537":"FractalNet","538":"Contractive Autoencoder","539":"MADDPG","540":"Disentangled Attribution Curves","541":"LayoutLMv2","542":"Manifold Mixup","543":"RepVGG","544":"TUM","545":"SAGAN Self-Attention Module","546":"Non-Local Operation","547":"Truncation Trick","548":"SAGAN","549":"GAN Hinge Loss","550":"TTUR","551":"Conditional Batch Normalization","552":"Non-Local Block","553":"Projection Discriminator","554":"Off-Diagonal Orthogonal Regularization","555":"BigGAN","556":"Fast R-CNN","557":"DynamicConv","558":"Random Scaling","559":"PixelShuffle","560":"SRGAN Residual Block","561":"VGG Loss","562":"SRGAN","563":"Groupwise Point Convolution","564":"ShuffleNet V2 Downsampling Block","565":"ShuffleNet v2","566":"RAM","567":"Spatial Feature Transform","568":"Deep Boltzmann Machine","569":"CenterPoint","570":"Sharpness-Aware Minimization","571":"Grid R-CNN","572":"1D CNN","573":"Switch FFN","574":"Switch Transformer","575":"BAM","576":"GIN","577":"ProGAN","578":"FairMOT","579":"JLA","580":"LINE","581":"ArcFace","582":"AdaSmooth","583":"AdaDelta","584":"NODE","585":"CrossViT","586":"FIERCE","587":"CRN","588":"Dynamic Memory Network","589":"Nesterov Accelerated Gradient","590":"Relative Position Encodings","591":"Global-Local Attention","592":"ETC","593":"Inception-ResNet-v2-A","594":"Inception-ResNet-v2 Reduction-B","595":"Inception-ResNet-v2-B","596":"Inception-ResNet-v2-C","597":"Inception-ResNet-v2","598":"Deep Belief Network","599":"NNCF","600":"UNet++","601":"SegNet","602":"RAE","603":"VirTex","604":"Gumbel Softmax","605":"Beta-VAE","606":"Channel-wise Soft Attention","607":"Selective Kernel Convolution","608":"Selective Kernel","609":"SCN","610":"MDL","611":"CNN BiLSTM","612":"ASPP","613":"DeepLabv3","614":"Adaptive Loss","615":"CharacterBERT","616":"SRN","617":"VERSE","618":"SNIPER","619":"Apollo","620":"REM","621":"ResNet-RS","622":"ResNet-D","623":"3D ResNet-RS","624":"Shifted Softplus","625":"SchNet","626":"FastGCN","627":"MAS","628":"Rotary Embeddings","629":"SRM","630":"Multi-Head Linear Attention","631":"Skip-gram Word2Vec","632":"PReLU-Net","633":"Epsilon Greedy Exploration","634":"Barlow Twins","635":"Adversarial Color Enhancement","636":"Selective Search","637":"mLSTM","638":"LSGAN","639":"RandWire","640":"DeepLab","641":"CascadePSP","642":"DGCNN","643":"Residual Normal Distribution","644":"NVAE Encoder Residual Cell","645":"NVAE Generative Residual Cell","646":"NVAE","647":"DEQ","648":"CCT","649":"CvT","650":"DAC","651":"ARCH","652":"Pyramid Pooling Module","653":"PSPNet","654":"KnowPrompt","655":"MoNet","656":"ELU","657":"Attention Free Transformer","658":"Stochastic Weight Averaging","659":"DenseNAS-C","660":"DenseNAS-B","661":"DenseNAS-A","662":"DenseNAS","663":"ARMA","664":"MSPFN","665":"Fast_BAT","666":"DNAS","667":"DropPath","668":"ProxylessNAS","669":"Jukebox","670":"Auxiliary Batch 
Normalization","671":"A2C","672":"LAMA","673":"Adaptive Masking","674":"Scale Aggregation Block","675":"ScaleNet","676":"KAF","677":"ENIGMA","678":"TNT","679":"YOLOX","680":"GShard","681":"GCNII","682":"Highway networks","683":"SpineNet","684":"SNGAN","685":"AutoGAN","686":"Gradient Checkpointing","687":"Fast AutoAugment","688":"VL-BERT","689":"Gradient Sparsification","690":"RReLU","691":"FSAF","692":"1-bit LAMB","693":"1-bit Adam","694":"PIoU Loss","695":"N-step Returns","696":"Revision Network","697":"Drafting Network","698":"LapStyle","699":"SNIP","700":"Mix-FFN","701":"SegFormer","702":"MobileBERT","703":"Distributional Generalization","704":"Macaw","705":"RandAugment","706":"GridMask","707":"ConvTasNet","708":"SepFormer","709":"Mogrifier LSTM","710":"LipGAN","711":"Euclidean Norm Regularization","712":"Latent Optimisation","713":"CS-GAN","714":"LOGAN","715":"BigGAN-deep","716":"CondConv","717":"Cascade Mask R-CNN","718":"PolarMask","719":"InfoGAN","720":"AutoML-Zero","721":"Self-adaptive Training","722":"OSCAR","723":"PixelCNN","724":"Sarsa","725":"ScatNet","726":"Sparse R-CNN","727":"DPG","728":"Object Dropout","729":"Noisy Student","730":"Poincar\u00e9 Embeddings","731":"EEND","732":"GrowNet","733":"Contextualized Topic Models","734":"TimeSformer","735":"HS-ResNet","736":"Hierarchical-Split Block","737":"Ghost Module","738":"Ghost Bottleneck","739":"GhostNet","740":"Bi-attention","741":"Guided Anchoring","742":"RetinaNet-RS","743":"Linear Warmup","744":"CTRL","745":"Cross-Scale Non-Local Attention","746":"Contextual Residual Aggregation","747":"BLIP","748":"Accumulating Eligibility Trace","749":"TD Lambda","750":"TD-Gammon","751":"SEAM","752":"VIME","753":"TDN","754":"BASNet","755":"GAN Feature Matching","756":"Syntax Heat Parse Tree","757":"U2-Net","758":"WGAN GP","759":"Orthogonal Regularization","760":"Spektral","761":"Zero-padded Shortcut Connection","762":"Pyramidal Residual Unit","763":"Pyramidal Bottleneck Residual Unit","764":"PyramidNet","765":"HiFi-GAN","766":"SRU","767":"GPipe","768":"Packed Levitated Markers","769":"Dueling Network","770":"UNITER","771":"ViP-DeepLab","772":"GMVAE","773":"Adaptive Dropout","774":"DANet","775":"Vision-aided GAN","776":"Spatial-Reduction Attention","777":"BigBird","778":"PVT","779":"Reversible Residual Block","780":"LSH Attention","781":"Reformer","782":"DAU-ConvNet","783":"DU-GAN","784":"QHM","785":"QHAdam","786":"MelGAN Residual Block","787":"Window-based Discriminator","788":"Location Sensitive Attention","789":"Zoneout","790":"MelGAN","791":"Tacotron 2","792":"Expected Sarsa","793":"GGS-NNs","794":"Polya-Gamma Augmentation","795":"Cross-View Training","796":"GRoIE","797":"R2D2","798":"Triplet Attention","799":"3DIS","800":"GAIL","801":"Concatenation Affinity","802":"Embedded Dot Product Affinity","803":"Embedded Gaussian Affinity","804":"AugMix","805":"TransferQA","806":"Aggregated Learning","807":"SRS","808":"SGPCS","809":"CBNet","810":"Neighborhood Attention","811":"Spatial Attention-Guided Mask","812":"MCKERNEL","813":"DIME","814":"WenLan","815":"Graph Contrastive Coding","816":"EGT","817":"Single-path NAS","818":"ACTKR","819":"Multi Loss ( BCE Loss + Focal Loss ) + Dice Loss","820":"Quick Attention","821":"Serf","822":"SEED RL","823":"MPRNet","824":"DeltaConv","825":"RepPoints","826":"MeRL","827":"Label Quality Model","828":"TernaryBERT","829":"Ternary Weight Splitting","830":"BinaryBERT","831":"RESCAL","832":"Linformer","833":"Blended Diffusion","834":"ALQ and 
AMQ","835":"Rendezvous","836":"CHM","837":"GEOMANCER","838":"TWEC","839":"PISA","840":"OHEM","841":"EdgeBoxes","842":"PRNet+","843":"Semantic Reasoning Network","844":"NAM","845":"Dense Contrastive Learning","846":"ProphetNet","847":"Neural Turing Machine","848":"Content-based Attention","849":"Metropolis Hastings","850":"Recurrent Dropout","851":"TPN","852":"IFBlock","853":"IFNet","854":"RIFE","855":"Supervised Contrastive Loss","856":"TaLK Convolution","857":"scSE","858":"Channel Squeeze and Spatial Excitation","859":"Concurrent Spatial and Channel Squeeze & Excitation","860":"F2DNet","861":"CANINE","862":"POTO","863":"SPADE","864":"Teacher-Tutor-Student Knowledge Distillation","865":"Gradual Self-Training","866":"Bridge-net","867":"Softsign Activation","868":"DV3 Convolution Block","869":"DV3 Attention Block","870":"ClariNet","871":"ALI","872":"LV-ViT","873":"MnasNet","874":"STN","875":"MatrixNet","876":"LXMERT","877":"ViLBERT","878":"PowerSGD","879":"LCC","880":"NAS-FPN","881":"AmoebaNet","882":"Precise RoI Pooling","883":"IoU-Net","884":"VSF","885":"ShuffleNet Block","886":"ShuffleNet","887":"HITNet","888":"SDAE","889":"EfficientNetV2","890":"Gated Convolution","891":"SNAIL","892":"RPDet","893":"LGCL","894":"SANet","895":"HANet","896":"Split Attention","897":"Adaptive Feature Pooling","898":"PANet","899":"K-Net","900":"CMCL","901":"UNIMO","902":"Attention Mesh","903":"Re-Attention Module","904":"DeepViT","905":"Noisy Linear Layer","906":"Rainbow DQN","907":"ControlVAE","908":"SSTDA","909":"DIoU-NMS","910":"RFB","911":"Polynomial Rate Decay","912":"CSPResNeXt Block","913":"CSPResNeXt","914":"XGPT","915":"SyncBN","916":"Squared ReLU","917":"Multi-DConv-Head Attention","918":"Primer","919":"ReLIC","920":"Levenshtein Transformer","921":"NIMA","922":"FORK","923":"A2U","924":"TAPAS","925":"DeepCluster","926":"Boost-GNN","927":"DeepSIM","928":"Hamburger","929":"Masked Convolution","930":"Compact Global Descriptor","931":"FMix","932":"Multiple Random Window Discriminator","933":"Conditional DBlock","934":"DBlock","935":"GBlock","936":"GAN-TTS","937":"RevNet","938":"RAHP","939":"GPFL","940":"Aging Evolution","941":"Mechanism Transfer","942":"YOLOv1","943":"Collaborative Distillation","944":"Eligibility Trace","945":"GCNet","946":"Global Context Block","947":"CSL","948":"RegNetY","949":"SEER","950":"SCARF","951":"MaxUp","952":"DCLS","953":"PAR Transformer","954":"QuantTree","955":"PGNet","956":"NesT","957":"CTAB-GAN","958":"TridentNet Block","959":"TridentNet","960":"Generalized Mean Pooling","961":"DELG","962":"IMGEP","963":"LPM","964":"Talking-Heads Attention","965":"SimpleNet","966":"Parallax","967":"RotNet","968":"Inception v2","969":"KIP","970":"Snapshot Ensembles","971":"Anycost GAN","972":"Mesh-TensorFlow","973":"CenterTrack","974":"D4PG","975":"T-D","976":"Synthesizer","977":"Phish","978":"GPT-Neo","979":"PPMC","980":"DistanceNet","981":"Visformer","982":"Fawkes","983":"Anti-Alias Downsampling","984":"Big-Little Module","985":"Assemble-ResNet","986":"Ape-X","987":"SCAN-clustering","988":"CARAFE","989":"HBMP","990":"CBoW Word2Vec","991":"Focal Transformers","992":"DeLighT Block","993":"DExTra","994":"DeLighT","995":"Composite Fields","996":"PolarNet","997":"SCNet","998":"GLN","999":"Magnification Prior Contrastive Similarity","1000":"Two-Way Dense Layer","1001":"PeleeNet","1002":"FLAVA","1003":"LayerDrop","1004":"IPBI","1005":"CoordConv","1006":"SM3","1007":"FastPitch","1008":"AlphaFold","1009":"Denoising Score Matching","1010":"TanhExp","1011":"Pattern-Exploiting 
Training","1012":"RE-NET","1013":"CGNN","1014":"PCIDA","1015":"CIDA","1016":"modReLU","1017":"Unitary RNN","1018":"Targeted Dropout","1019":"GaAN","1020":"TURL","1021":"Style-based Recalibration Module","1022":"GLIDE","1023":"TAM","1024":"Deactivable Skip Connection","1025":"Weight excitation","1026":"Neural Probabilistic Language Model","1027":"ShakeDrop","1028":"DualCL","1029":"SKNet","1030":"PIRL","1031":"Bilateral Grid","1032":"Conditional Instance Normalization","1033":"Slot Attention","1034":"Fixup Initialization","1035":"RFB Net","1036":"Branch attention","1037":"PMLM","1038":"CRISS","1039":"Span-Based Dynamic Convolution","1040":"Mixed Attention Block","1041":"ConvBERT","1042":"Mirror-BERT","1043":"Self-Calibrated Convolutions","1044":"GAP-Layer","1045":"CT-Layer","1046":"Gated Convolution Network","1047":"DGI","1048":"PGHI","1049":"Point-wise Spatial Attention","1050":"CheXNet","1051":"Visual Parsing","1052":"FiLM Module","1053":"WaveGrad DBlock","1054":"WaveGrad UBlock","1055":"WaveGrad","1056":"Deflation","1057":"GTrXL","1058":"CoBERL","1059":"Ensemble Clustering","1060":"Network Dissection","1061":"CenterMask","1062":"Octave Convolution","1063":"AGCN","1064":"SSKD","1065":"CeiT","1066":"DEXTR","1067":"ExtremeNet","1068":"PCA Whitening","1069":"MultiGrain","1070":"ZCA Whitening","1071":"Active Convolution","1072":"FEFM","1073":"SETR","1074":"Blink Communication","1075":"CRF-RNN","1076":"HGS","1077":"HyperTree MetaModel","1078":"E2EAdaptiveDistTraining","1079":"PVTv2","1080":"SimCLRv2","1081":"Parrot","1082":"RevSilo","1083":"BiGAN","1084":"NEAT","1085":"DBGAN","1086":"Universal Transformer","1087":"Multiplicative Attention","1088":"Hierarchical Softmax","1089":"Augmented SBERT","1090":"DSGN","1091":"ReZero","1092":"Concrete Dropout","1093":"PREDATOR","1094":"ERU","1095":"Cyclical Learning Rate Policy","1096":"ESPNetv2","1097":"Strided EESP","1098":"Depthwise Dilated Separable Convolution","1099":"EESP","1100":"MUSIQ","1101":"NADAM","1102":"VL-T5","1103":"WaveGlow","1104":"CornerNet-Squeeze Hourglass Module","1105":"Depthwise Fire Module","1106":"CornerNet-Saccade","1107":"CornerNet-Squeeze","1108":"CornerNet-Squeeze Hourglass","1109":"MSGAN","1110":"DPT","1111":"CReLU","1112":"BigBiGAN","1113":"Patch Merger","1114":"wav2vec-U","1115":"Gradient Normalization","1116":"Sparsemax","1117":"ACGPN","1118":"Style Transfer Module","1119":"ResNeSt","1120":"AdvProp","1121":"PELU","1122":"Glow-TTS","1123":"Fishr","1124":"Source Hypothesis Transfer","1125":"Batch Nuclear-norm Maximization","1126":"GER","1127":"YOLOP","1128":"VisualBERT","1129":"Content-Conditioned Style Encoder","1130":"COCO-FUNIT","1131":"Accordion","1132":"SFAM","1133":"PonderNet","1134":"VideoBERT","1135":"Movement Pruning","1136":"Fraternal Dropout","1137":"TabNet","1138":"TAPEX","1139":"test_method","1140":"Dynamic Keypoint Head","1141":"FCPose","1142":"HRank","1143":"PocketNet","1144":"PSANet","1145":"MixText","1146":"Minibatch Discrimination","1147":"Multiscale Dilated Convolution Block","1148":"IAN","1149":"Holographic Reduced Representation","1150":"TinyNet","1151":"FGA","1152":"CuBERT","1153":"HyperDenseNet","1154":"Soft Actor-Critic (Autotuned Temperature)","1155":"ALBEF","1156":"ViLT","1157":"MEI","1158":"PNAS","1159":"Large-scale spectral clustering","1160":"CrossTransformers","1161":"SASA","1162":"InstaBoost","1163":"IB-BERT","1164":"DeeBERT","1165":"PAA","1166":"NAS-FCOS","1167":"Seesaw Loss","1168":"Ape-X 
DQN","1169":"Social-STGCNN","1170":"MODNet","1171":"YellowFin","1172":"RGCN","1173":"PointAugment","1174":"Neural Cache","1175":"ParaNet Convolution Block","1176":"ParaNet","1177":"Cluster-GCN","1178":"TABBIE","1179":"ASVI","1180":"Tofu","1181":"FFMv1","1182":"FFMv2","1183":"MLFPN","1184":"M2Det","1185":"SimAug","1186":"STDC","1187":"NoisyNet-DQN","1188":"AdaBound","1189":"One-Shot Aggregation","1190":"IoU-Balanced Sampling","1191":"AutoSync","1192":"PanNet","1193":"SCCL","1194":"Ape-X DPG","1195":"NT-ASGD","1196":"RGA","1197":"GALA","1198":"Bottleneck Transformer Block","1199":"Bottleneck Transformer","1200":"NetAdapt","1201":"SmeLU","1202":"Denoised Smoothing","1203":"IoU-guided NMS","1204":"Nystr\u00f6mformer","1205":"Panoptic FPN","1206":"PointRend","1207":"VLMo","1208":"SGDW","1209":"BP-Transformer","1210":"Unigram Segmentation","1211":"ComiRec","1212":"Accuracy-Robustness Area (ARA)","1213":"SIG","1214":"Dilated Bottleneck with Projection Block","1215":"Dilated Bottleneck Block","1216":"DetNet","1217":"CondInst","1218":"Involution","1219":"4D A*","1220":"VSGNet","1221":"DynaBERT","1222":"Hard Sigmoid","1223":"Blue River Controls","1224":"Batchboost","1225":"ASFF","1226":"MFEC","1227":"BezierAlign","1228":"ABCNet","1229":"BIDeN","1230":"Balanced L1 Loss","1231":"Balanced Feature Pyramid","1232":"Libra R-CNN","1233":"Adaptively Sparse Transformer","1234":"Attention-augmented Convolution","1235":"ALiBi","1236":"XLSR","1237":"DropAttack","1238":"DCN-V2","1239":"Siamese U-Net","1240":"Bort","1241":"ProxylessNet-Mobile","1242":"ProxylessNet-CPU","1243":"ProxylessNet-GPU","1244":"BlendMask","1245":"Models Genesis","1246":"QRNN","1247":"WEGL","1248":"DROID-SLAM","1249":"SimVLM","1250":"RFP","1251":"TuckER-RP","1252":"RESCAL-RP","1253":"CP-N3-RP","1254":"CP-N3","1255":"ComplEx-N3-RP","1256":"ComplEx-N3","1257":"CCNet","1258":"Child-Tuning","1259":"DALL\u00b7E 2","1260":"SAINT","1261":"AdaHessian","1262":"SFT","1263":"DRPNN","1264":"Z-PNN","1265":"MoGA-C","1266":"MoGA-B","1267":"MoGA-A","1268":"Channel & Spatial attention","1269":"Temporal Jittering","1270":"AutoGL","1271":"Channel-wise Cross Attention","1272":"Channel-wise Cross Fusion Transformer","1273":"UCTransNet","1274":"SOHO","1275":"AVSlowFast","1276":"DropPathway","1277":"WRQE","1278":"Flow Alignment Module","1279":"Singular Value Clipping","1280":"TGAN","1281":"Deep Voice 3","1282":"XGrad-CAM","1283":"Virtual Data Augmentation","1284":"reSGLD","1285":"Hierarchical MTL","1286":"Matrix NMS","1287":"Parametric UMAP","1288":"CoaT","1289":"G-GLN Neuron","1290":"G-GLN","1291":"GBO","1292":"Fast-OCR","1293":"CDCC-NET","1294":"Fast-YOLOv4-SmallObj","1295":"MPNet","1296":"Voxel RoI Pooling","1297":"Voxel R-CNN","1298":"NoisyNet-A3C","1299":"NoisyNet-Dueling","1300":"HaloNet","1301":"Meta Pseudo Labels","1302":"RegNetX","1303":"Dynamic SmoothL1 Loss","1304":"Dynamic R-CNN","1305":"SABL","1306":"CAG","1307":"Soft Split and Soft Composition","1308":"FuseFormer Block","1309":"FuseFormer","1310":"ILVR","1311":"TorchBeast","1312":"MFR","1313":"DFDNet","1314":"Single-Headed Attention","1315":"Chinchilla","1316":"ESACL","1317":"Energy Based Process","1318":"Filter Response Normalization","1319":"Deformable Kernel","1320":"Scatter Connection","1321":"AlphaStar","1322":"TSDAE","1323":"TE2Rules","1324":"Peer-attention","1325":"PermuteFormer","1326":"FLICA","1327":"DSelect-k","1328":"ALIS","1329":"RoIWarp","1330":"OverFeat","1331":"SPP-Net","1332":"BasicVSR","1333":"SAFRAN","1334":"DOLG","1335":"CayleyNet","1336":"Siren","1337":"Mask Scoring 
R-CNN","1338":"ARShoe","1339":"IRN","1340":"Chinese Pre-trained Unbalanced Transformer","1341":"Adaptive NMS","1342":"NPID","1343":"Perceiver IO","1344":"LightConv","1345":"VC R-CNN","1346":"L-GCN","1347":"Contrastive Multiview Coding","1348":"UFLoss","1349":"3D Dynamic Scene Graph","1350":"NICE-SLAM","1351":"RealFormer","1352":"CGMM","1353":"GPSA","1354":"ConViT","1355":"ZoomNet","1356":"VisTR","1357":"ISPL","1358":"HDCGAN","1359":"ALAE","1360":"StyleALAE","1361":"DeepLabv2","1362":"All-Attention Layer","1363":"Fastformer","1364":"SIRM","1365":"T-Fixup","1366":"RandomRotate","1367":"SCNN_UNet_ConvLSTM","1368":"MacBERT","1369":"Meena","1370":"SymmNet","1371":"Hopfield Layer","1372":"Discriminative Adversarial Search","1373":"GATv2","1374":"Bilateral Guided Aggregation Layer","1375":"BiSeNet V2","1376":"Protagonist Antagonist Induced Regret Environment Design","1377":"SqueezeBERT","1378":"SIFA","1379":"Subformer","1380":"Categorical Modularity","1381":"Deformable RoI Pooling","1382":"K3M","1383":"AutoTinyBERT","1384":"MiVOS","1385":"Attribute2Font","1386":"CP N3","1387":"DiffAugment","1388":"ReInfoSelect","1389":"TaBERT","1390":"DetNAS","1391":"Cosine Power Annealing","1392":"Policy Similarity Metric","1393":"DASPP","1394":"LiteSeg","1395":"NFR","1396":"VQ-VAE-2","1397":"Dual Softmax Loss","1398":"CAMoE","1399":"Vokenization","1400":"EMQAP","1401":"ORN","1402":"PP-YOLO","1403":"GNS","1404":"HTCN","1405":"Attentional Liquid Warping GAN","1406":"AttLWB","1407":"mBARTHez","1408":"Local Patch Interaction","1409":"Cross-Covariance Attention","1410":"XCiT Layer","1411":"XCiT","1412":"BiGG","1413":"PP-YOLOv2","1414":"DualGCN","1415":"Chimera","1416":"GRLIA","1417":"GreedyNAS-C","1418":"GreedyNAS-B","1419":"GreedyNAS-A","1420":"GreedyNAS","1421":"Class-MLP","1422":"NeuralRecon","1423":"Residual SRM","1424":"ECANet","1425":"ECA-Net","1426":"IPA-GNN","1427":"Spatial Group-wise Enhance","1428":"Factorized Dense Synthesized Attention","1429":"Factorized Random Synthesized Attention","1430":"Random Synthesized Attention","1431":"Dense Synthesized Attention","1432":"Online Normalization","1433":"SaBN","1434":"PASE+","1435":"CTAL","1436":"Switchable Normalization","1437":"VATT","1438":"GNNCL","1439":"AEDA","1440":"CodeSLAM","1441":"MODERN","1442":"DARTS Max-W","1443":"Differentiable Hyperparameter Search","1444":"CPC v2","1445":"DMAGE","1446":"PixLoc","1447":"Adaptive Span Transformer","1448":"Harm-Net","1449":"Harmonic Block","1450":"TrOCR","1451":"uNetXST","1452":"End-To-End Memory Network","1453":"PointASNL","1454":"KOVA","1455":"CurricularFace","1456":"gSDE","1457":"BTmPG","1458":"DAFNe","1459":"TLC","1460":"Feedback Memory","1461":"Feedback Transformer","1462":"LightAutoML","1463":"I-BERT","1464":"State-Aware Tracker","1465":"UCNet","1466":"Self-Adjusting Smooth L1 Loss","1467":"RetinaMask","1468":"NeuroTactic","1469":"ShapeConv","1470":"Symbolic rule learning","1471":"SC-GPT","1472":"VoVNet","1473":"MinCutPool","1474":"CGRU","1475":"Hermite Activation","1476":"FastSpeech 2s","1477":"LocalViT","1478":"k-Sparse Autoencoder","1479":"Make-A-Scene","1480":"pixel2style2pixel","1481":"Temporally Consistent Spatial Augmentation","1482":"CVRL","1483":"Symbolic Deep 
Learning","1484":"GradDrop","1485":"FoveaBox","1486":"Co-Correcting","1487":"GSoP-Net","1488":"SpecGAN","1489":"LRNet","1490":"HardELiSH","1491":"ELiSH","1492":"GroupDNet","1493":"Lookahead","1494":"MeshGraphNet","1495":"HMGNN","1496":"FPG","1497":"CutBlur","1498":"STAC","1499":"CentripetalNet","1500":"EMEA","1501":"AutoInt","1502":"AdaShift","1503":"AdaMod","1504":"InPlace-ABN","1505":"TResNet","1506":"GFSA","1507":"LayoutReader","1508":"MAD Learning","1509":"TD-VAE","1510":"SVPG","1511":"Viewmaker Network","1512":"SRU++","1513":"Fast Minimum-Norm Attack","1514":"ChebNet","1515":"EvoNorms","1516":"FreeAnchor","1517":"SoftPool","1518":"MotionNet","1519":"LTLS","1520":"CV-MIM","1521":"DetNASNet","1522":"ResNeXt-Elastic","1523":"DenseNet-Elastic","1524":"Elastic Dense Block","1525":"Elastic ResNeXt Block","1526":"ALDA","1527":"SSFG regularization","1528":"BiDet","1529":"Attentive Normalization","1530":"MuVER","1531":"Bi3D","1532":"Prioritized Sweeping","1533":"PWIL","1534":"AutoSmart","1535":"SMOT","1536":"GCNFN","1537":"Disp R-CNN","1538":"NLSIG","1539":"PixelRNN","1540":"TaxoExpan","1541":"MobileDet","1542":"TILDEv2","1543":"ClusterFit","1544":"L2M","1545":"Partition Filter Network","1546":"Local Importance-based Pooling","1547":"DeepIR","1548":"myGym","1549":"Noise2Fast","1550":"ReCo","1551":"ERNIE-GEN","1552":"Fast Sample Re-Weighting","1553":"G3D","1554":"Virtual Batch Normalization","1555":"PAFNet","1556":"EdgeFlow","1557":"Recurrent Entity Network","1558":"Fisher-BRC","1559":"EfficientUNet++","1560":"LAPGAN","1561":"PSFR-GAN","1562":"DiCENet","1563":"DimFuse","1564":"DimConv","1565":"DiCE Unit","1566":"Pixel-BERT","1567":"Implicit PointRend","1568":"ClipBERT","1569":"VocGAN","1570":"LSUV Initialization","1571":"Agglomerative Contextual Decomposition","1572":"CTracker","1573":"Computation Redistribution","1574":"Sample Redistribution","1575":"TinaFace","1576":"Compressed Memory","1577":"Compressive Transformer","1578":"KNN and IOU based verification","1579":"SMA","1580":"PresGAN","1581":"DDParser","1582":"Trans-Encoder","1583":"Seq2Edits","1584":"AdaSqrt","1585":"GraphSAINT","1586":"DeCLUTR","1587":"SETSe","1588":"Varifocal Loss","1589":"VFNet","1590":"U-Net GAN","1591":"MaskFlownet","1592":"BAGUA","1593":"BytePS","1594":"HEGCN","1595":"Big-Little Net","1596":"MoViNet","1597":"GANformer","1598":"VQSVD","1599":"Temporal ROIAlign","1600":"Neural adjoint","1601":"SynaNN","1602":"SCARLET","1603":"SCARLET-NAS","1604":"SimAdapter","1605":"SMITH","1606":"LeViT Attention Block","1607":"LeVIT","1608":"ClassSR","1609":"SlowMo","1610":"Mixture Normalization","1611":"HiSD","1612":"ADELE","1613":"MDPO","1614":"GCT","1615":"Crossbow","1616":"PNA","1617":"GHM-R","1618":"GHM-C","1619":"Charformer","1620":"GBST","1621":"Gradient-Based Subword Tokenization","1622":"SRDC","1623":"TraDeS","1624":"BatchChannel Normalization","1625":"Sscs","1626":"Funnel Transformer","1627":"Mixer Layer","1628":"Point-GNN","1629":"Hi-LANDER","1630":"CSPDenseNet-Elastic","1631":"CSPDenseNet","1632":"CSPPeleeNet","1633":"Exact Fusion Model","1634":"Deformable ConvNets","1635":"DMA","1636":"EsViT","1637":"AUCO ResNet","1638":"Cosine Normalization","1639":"Deep LSTM Reader","1640":"OSA (identity mapping + eSE)","1641":"Effective Squeeze-and-Excitation Block","1642":"VoVNetV2","1643":"NPID++","1644":"StyleMapGAN","1645":"ZeRO","1646":"MT-PET","1647":"Polyak Averaging","1648":"SpreadsheetCoder","1649":"FastSGT","1650":"ALDEN","1651":"TabTransformer","1652":"StruBERT","1653":"GFP-GAN","1654":"ConvMLP","1655":"Bayesian 
REX","1656":"Pipelined Backpropagation","1657":"MADGRAD","1658":"STraTA","1659":"Informative Sample Mining Network","1660":"Florence","1661":"DNN2LR","1662":"SKEP","1663":"MHMA","1664":"Class Activation Guided Attention Mechanism","1665":"DAEL","1666":"Temporal Distribution Matching","1667":"Temporal Distribution Characterization","1668":"AdaRNN","1669":"Pointer Sentinel-LSTM","1670":"Shape Adaptor","1671":"U-RNNs","1672":"NormFormer","1673":"Canvas Method","1674":"ParamCrop","1675":"MXMNet","1676":"STA-LSTM","1677":"Spatial & Temporal Attention","1678":"StyleSwin","1679":"Kaleido-BERT","1680":"H3DNet","1681":"FLAVR","1682":"Unified VLP","1683":"DVD-GAN DBlock","1684":"DVD-GAN GBlock","1685":"TSRUc","1686":"TSRUp","1687":"TSRUs","1688":"TrIVD-GAN","1689":"AdaGPR","1690":"WaveVAE","1691":"Triplet Entropy Loss","1692":"Sandwich Transformer","1693":"Shuffle-T","1694":"AutoDropout","1695":"NUQSGD","1696":"Grammatical evolution + Q-learning","1697":"PAUSE","1698":"DVD-GAN","1699":"InterBERT","1700":"Colorization Transformer","1701":"ReasonBERT","1702":"nnFormer","1703":"PipeDream","1704":"Boom Layer","1705":"SHA-RNN","1706":"CPVT","1707":"Deformable Position-Sensitive RoI Pooling","1708":"DistDGL","1709":"CT3D","1710":"FastMoE","1711":"RPM-Net","1712":"Random Grayscale","1713":"LMOT","1714":"DouZero","1715":"CoVA","1716":"Generalized Focal Loss","1717":"OODformer","1718":"FcaNet","1719":"Probabilistic Anchor Assignment","1720":"VPSNet","1721":"SongNet","1722":"Panoptic-PolarNet","1723":"LFME","1724":"MoBY"},"description":{"0":"**Average Pooling** is a pooling operation that calculates the average value for patches of a feature map, and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer. It adds a small amount of translation invariance - meaning translating the image by a small amount does not significantly affect the values of most pooled outputs. It extracts features more smoothly than [Max Pooling](https:\/\/paperswithcode.com\/method\/max-pooling), whereas max pooling extracts more pronounced features like edges.\r\n\r\nImage Source: [here](https:\/\/www.researchgate.net\/figure\/Illustration-of-Max-Pooling-and-Average-Pooling-Figure-2-above-shows-an-example-of-max_fig2_333593451)","1":"A **1 x 1 Convolution** is a [convolution](https:\/\/paperswithcode.com\/method\/convolution) with some special properties in that it can be used for dimensionality reduction, efficient low dimensional embeddings, and applying non-linearity after convolutions. It maps an input pixel with all its channels to an output pixel which can be squeezed to a desired output depth. It can be viewed as an [MLP](https:\/\/paperswithcode.com\/method\/feedforward-network) looking at a particular pixel location.\r\n\r\nImage Credit: [http:\/\/deeplearning.ai](http:\/\/deeplearning.ai)","2":"**Global Average Pooling** is a pooling operation designed to replace fully connected layers in classical CNNs. The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the [softmax](https:\/\/paperswithcode.com\/method\/softmax) layer. 
\r\n\r\nOne advantage of global [average pooling](https:\/\/paperswithcode.com\/method\/average-pooling) over the fully connected layers is that it is more native to the [convolution](https:\/\/paperswithcode.com\/method\/convolution) structure by enforcing correspondences between feature maps and categories. Thus the feature maps can be easily interpreted as category confidence maps. Another advantage is that there is no parameter to optimize in the global average pooling, thus overfitting is avoided at this layer. Furthermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input.","3":"**Batch Normalization** aims to reduce internal covariate shift, and in doing so aims to accelerate the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows for use of much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for [Dropout](https:\/\/paperswithcode.com\/method\/dropout).\r\n\r\nWe apply a batch normalization layer as follows for a minibatch $\\mathcal{B}$:\r\n\r\n$$ \\mu\\_{\\mathcal{B}} = \\frac{1}{m}\\sum^{m}\\_{i=1}x\\_{i} $$\r\n\r\n$$ \\sigma^{2}\\_{\\mathcal{B}} = \\frac{1}{m}\\sum^{m}\\_{i=1}\\left(x\\_{i}-\\mu\\_{\\mathcal{B}}\\right)^{2} $$\r\n\r\n$$ \\hat{x}\\_{i} = \\frac{x\\_{i} - \\mu\\_{\\mathcal{B}}}{\\sqrt{\\sigma^{2}\\_{\\mathcal{B}}+\\epsilon}} $$\r\n\r\n$$ y\\_{i} = \\gamma\\hat{x}\\_{i} + \\beta = \\text{BN}\\_{\\gamma, \\beta}\\left(x\\_{i}\\right) $$\r\n\r\nWhere $\\gamma$ and $\\beta$ are learnable parameters.","4":"**Rectified Linear Units**, or **ReLUs**, are a type of activation function that are linear in the positive dimension, but zero in the negative dimension. The kink in the function is the source of the non-linearity. Linearity in the positive dimension has the attractive property that it prevents saturation of gradients (contrast with [sigmoid activations](https:\/\/paperswithcode.com\/method\/sigmoid-activation)), although for half of the real line its gradient is zero.\r\n\r\n$$ f\\left(x\\right) = \\max\\left(0, x\\right) $$","5":"**Kaiming Initialization**, or **He Initialization**, is an initialization method for neural networks that takes into account the non-linearity of activation functions, such as [ReLU](https:\/\/paperswithcode.com\/method\/relu) activations.\r\n\r\nA proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially. Using a derivation, they work out that the condition to stop this happening is:\r\n\r\n$$\\frac{1}{2}n\\_{l}\\text{Var}\\left[w\\_{l}\\right] = 1 $$\r\n\r\nThis implies an initialization scheme of:\r\n\r\n$$ w\\_{l} \\sim \\mathcal{N}\\left(0, 2\/n\\_{l}\\right)$$\r\n\r\nThat is, a zero-centered Gaussian with standard deviation of $\\sqrt{2\/{n}\\_{l}}$ (variance shown in equation above). Biases are initialized at $0$.","6":"**Residual Connections** are a type of skip-connection that learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. 
\r\n\r\nFormally, denoting the desired underlying mapping as $\\mathcal{H}({x})$, we let the stacked nonlinear layers fit another mapping of $\\mathcal{F}({x}):=\\mathcal{H}({x})-{x}$. The original mapping is recast into $\\mathcal{F}({x})+{x}$.\r\n\r\nThe intuition is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.","7":"**Max Pooling** is a pooling operation that calculates the maximum value for patches of a feature map, and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer. It adds a small amount of translation invariance - meaning translating the image by a small amount does not significantly affect the values of most pooled outputs.\r\n\r\nImage Source: [here](https:\/\/computersciencewiki.org\/index.php\/File:MaxpoolSample2.png)","8":"**Residual Blocks** are skip-connection blocks that learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. They were introduced as part of the [ResNet](https:\/\/paperswithcode.com\/method\/resnet) architecture.\r\n \r\nFormally, denoting the desired underlying mapping as $\\mathcal{H}({x})$, we let the stacked nonlinear layers fit another mapping of $\\mathcal{F}({x}):=\\mathcal{H}({x})-{x}$. The original mapping is recast into $\\mathcal{F}({x})+{x}$. The $\\mathcal{F}({x})$ acts like a residual, hence the name 'residual block'.\r\n\r\nThe intuition is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers. Having skip connections allows the network to more easily learn identity-like mappings.\r\n\r\nNote that in practice, [Bottleneck Residual Blocks](https:\/\/paperswithcode.com\/method\/bottleneck-residual-block) are used for deeper ResNets, such as ResNet-50 and ResNet-101, as these bottleneck blocks are less computationally intensive.","9":"A **Bottleneck Residual Block** is a variant of the [residual block](https:\/\/paperswithcode.com\/method\/residual-block) that utilises 1x1 convolutions to create a bottleneck. The use of a bottleneck reduces the number of parameters and matrix multiplications. The idea is to make residual blocks as thin as possible to increase depth and have fewer parameters. They were introduced as part of the [ResNet](https:\/\/paperswithcode.com\/method\/resnet) architecture, and are used as part of deeper ResNets such as ResNet-50 and ResNet-101.","10":"**Residual Networks**, or **ResNets**, learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. Instead of hoping each few stacked layers directly fit a desired underlying mapping, residual nets let these layers fit a residual mapping. They stack [residual blocks](https:\/\/paperswithcode.com\/method\/residual-block) on top of each other to form networks: e.g. a ResNet-50 has fifty layers using these blocks. \r\n\r\nFormally, denoting the desired underlying mapping as $\\mathcal{H}(x)$, we let the stacked nonlinear layers fit another mapping of $\\mathcal{F}(x):=\\mathcal{H}(x)-x$. 
The original mapping is recast into $\\mathcal{F}(x)+x$.\r\n\r\nThere is empirical evidence that these types of network are easier to optimize, and can gain accuracy from considerably increased depth.","11":"A **convolution** is a type of matrix operation, consisting of a kernel, a small matrix of weights, that slides over input data performing element-wise multiplication with the part of the input it is on, then summing the results into an output.\r\n\r\nIntuitively, a convolution allows for weight sharing - reducing the number of effective parameters - and image translation (allowing for the same feature to be detected in different parts of the input space).\r\n\r\nImage Source: [https:\/\/arxiv.org\/pdf\/1603.07285.pdf](https:\/\/arxiv.org\/pdf\/1603.07285.pdf)","12":"An **Autoencoder** is a bottleneck architecture that turns a high-dimensional input into a latent low-dimensional code (encoder), and then performs a reconstruction of the input with this latent code (the decoder).\r\n\r\nImage: [Michael Massi](https:\/\/en.wikipedia.org\/wiki\/Autoencoder#\/media\/File:Autoencoder_schema.png)","13":"**Principal Component Analysis (PCA)** is an unsupervised method primarily used for dimensionality reduction within machine learning. PCA is calculated via a singular value decomposition (SVD) of the design matrix, or alternatively, by calculating the covariance matrix of the data and performing eigenvalue decomposition on the covariance matrix. The results of PCA provide a low-dimensional picture of the structure of the data and the leading (uncorrelated) latent factors determining variation in the data.\r\n\r\nImage Source: [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Principal_component_analysis#\/media\/File:GaussianScatterPCA.svg)","14":"**Q-Learning** is an off-policy temporal difference control algorithm:\r\n\r\n$$Q\\left(S\\_{t}, A\\_{t}\\right) \\leftarrow Q\\left(S\\_{t}, A\\_{t}\\right) + \\alpha\\left[R_{t+1} + \\gamma\\max\\_{a}Q\\left(S\\_{t+1}, a\\right) - Q\\left(S\\_{t}, A\\_{t}\\right)\\right] $$\r\n\r\nThe learned action-value function $Q$ directly approximates $q\\_{*}$, the optimal action-value function, independent of the policy being followed.\r\n\r\nSource: Sutton and Barto, Reinforcement Learning, 2nd Edition","15":"Causal inference is the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect. The main difference between causal inference and inference of association is that the former analyzes the response of the effect variable when the cause is changed.","16":"**Sigmoid Activations** are a type of activation function for neural networks:\r\n\r\n$$f\\left(x\\right) = \\frac{1}{\\left(1+\\exp\\left(-x\\right)\\right)}$$\r\n\r\nSome drawbacks of this activation that have been noted in the literature are: sharp damp gradients during backpropagation from deeper hidden layers to inputs, gradient saturation, and slow convergence.","17":"**Tanh Activation** is an activation function used for neural networks:\r\n\r\n$$f\\left(x\\right) = \\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$\r\n\r\nHistorically, the tanh function became preferred over the [sigmoid function](https:\/\/paperswithcode.com\/method\/sigmoid-activation) as it gave better performance for multi-layer neural networks. 
But it did not solve the vanishing gradient problem that sigmoids suffered, which was tackled more effectively with the introduction of [ReLU](https:\/\/paperswithcode.com\/method\/relu) activations.\r\n\r\nImage Source: [Junxi Feng](https:\/\/www.researchgate.net\/profile\/Junxi_Feng)","18":"An **LSTM** is a type of [recurrent neural network](https:\/\/paperswithcode.com\/methods\/category\/recurrent-neural-networks) that addresses the vanishing gradient problem in vanilla RNNs through additional cells, input and output gates. Intuitively, vanishing gradients are solved through additional *additive* components, and forget gate activations, that allow the gradients to flow through the network without vanishing as quickly.\r\n\r\n(Image Source [here](https:\/\/medium.com\/datadriveninvestor\/how-do-lstm-networks-solve-the-problem-of-vanishing-gradients-a6784971a577))\r\n\r\n(Introduced by Hochreiter and Schmidhuber)","19":"**Weight Decay**, or **$L_{2}$ Regularization**, is a regularization technique applied to the weights of a neural network. We minimize a loss function comprising both the primary loss function and a penalty on the $L\\_{2}$ Norm of the weights:\r\n\r\n$$L\\_{new}\\left(w\\right) = L\\_{original}\\left(w\\right) + \\lambda{w^{T}w}$$\r\n\r\nwhere $\\lambda$ is a value determining the strength of the penalty (encouraging smaller weights). \r\n\r\nWeight decay can be incorporated directly into the weight update rule, rather than just implicitly by defining it through the objective function. Often weight decay refers to the implementation where we specify it directly in the weight update rule (whereas L2 regularization is usually the implementation which is specified in the objective function).\r\n\r\nImage Source: Deep Learning, Goodfellow et al","20":"**WordPiece** is a subword segmentation algorithm used in natural language processing. The vocabulary is initialized with individual characters in the language, then the most frequent combinations of symbols in the vocabulary are iteratively added to the vocabulary. The process is:\r\n\r\n1. Initialize the word unit inventory with all the characters in the text.\r\n2. Build a language model on the training data using the inventory from 1.\r\n3. Generate a new word unit by combining two units out of the current word inventory to increment the word unit inventory by one. Choose the new word unit out of all the possible ones that increases the likelihood on the training data the most when added to the model.\r\n4. Go to 2 until a predefined limit of word units is reached or the likelihood increase falls below a certain threshold.\r\n\r\nText: [Source](https:\/\/stackoverflow.com\/questions\/55382596\/how-is-wordpiece-tokenization-helpful-to-effectively-deal-with-rare-words-proble\/55416944#55416944)\r\n\r\nImage: WordPiece as used in [BERT](https:\/\/paperswithcode.com\/method\/bert)","21":"The **Softmax** output function transforms a previous layer's output into a vector of probabilities. It is commonly used for multiclass classification. Given an input vector $x$ and a weighting vector $w$ we have:\r\n\r\n$$ P(y=j \\mid{x}) = \\frac{e^{x^{T}w_{j}}}{\\sum^{K}_{k=1}e^{x^{T}w_{k}}} $$","22":"**Dense Connections**, or **Fully Connected Connections**, are a type of layer in a deep neural network that use a linear operation where every input is connected to every output by a weight. 
This means there are $n\\_{\\text{inputs}}*n\\_{\\text{outputs}}$ parameters, which can lead to a lot of parameters for a sizeable network.\r\n\r\n$$h\\_{l} = g\\left(\\textbf{W}^{T}h\\_{l-1}\\right)$$\r\n\r\nwhere $g$ is an activation function.\r\n\r\nImage Source: Deep Learning by Goodfellow, Bengio and Courville","23":"**Scaled dot-product attention** is an attention mechanism where the dot products are scaled down by $\\sqrt{d_k}$. Formally we have a query $Q$, a key $K$ and a value $V$ and calculate the attention as:\r\n\r\n$$ {\\text{Attention}}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^{T}}{\\sqrt{d_k}}\\right)V $$\r\n\r\nIf we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean $0$ and variance $1$, then their dot product, $q \\cdot k = \\sum_{i=1}^{d_k} q_ik_i$, has mean $0$ and variance $d_k$. Since we would prefer these values to have variance $1$, we divide by $\\sqrt{d_k}$.","24":"**Linear Warmup With Linear Decay** is a learning rate schedule in which we increase the learning rate linearly for $n$ updates and then linearly decay afterwards.","25":"**Dropout** is a regularization technique for neural networks that drops a unit (along with connections) at training time with a specified probability $p$ (a common value is $p=0.5$). At test time, all units are present, but with weights scaled by $p$ (i.e. $w$ becomes $pw$).\r\n\r\nThe idea is to prevent co-adaptation, where the neural network becomes too reliant on particular connections, as this could be symptomatic of overfitting. Intuitively, dropout can be thought of as creating an implicit ensemble of neural networks.","26":"Unlike [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization), **Layer Normalization** directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any new dependencies between training cases. It works well for [RNNs](https:\/\/paperswithcode.com\/methods\/category\/recurrent-neural-networks) and improves both the training time and the generalization performance of several existing RNN models. More recently, it has been used with [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers) models.\r\n\r\nWe compute the layer normalization statistics over all the hidden units in the same layer as follows:\r\n\r\n$$ \\mu^{l} = \\frac{1}{H}\\sum^{H}\\_{i=1}a\\_{i}^{l} $$\r\n\r\n$$ \\sigma^{l} = \\sqrt{\\frac{1}{H}\\sum^{H}\\_{i=1}\\left(a\\_{i}^{l}-\\mu^{l}\\right)^{2}} $$\r\n\r\nwhere $H$ denotes the number of hidden units in a layer. Under layer normalization, all the hidden units in a layer share the same normalization terms $\\mu$ and $\\sigma$, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of the mini-batch and it can be used in the pure online regime with batch size 1.","27":"The **Gaussian Error Linear Unit**, or **GELU**, is an activation function. The GELU activation function is $x\\Phi(x)$, where $\\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their percentile, rather than gates inputs by their sign as in [ReLUs](https:\/\/paperswithcode.com\/method\/relu) ($x\\mathbf{1}_{x>0}$). 
Consequently the GELU can be thought of as a smoother ReLU.\r\n\r\n$$\\text{GELU}\\left(x\\right) = x{P}\\left(X\\leq{x}\\right) = x\\Phi\\left(x\\right) = x \\cdot \\frac{1}{2}\\left[1 + \\text{erf}(x\/\\sqrt{2})\\right],$$\r\nif $X\\sim \\mathcal{N}(0,1)$.\r\n\r\nOne can approximate the GELU with\r\n$0.5x\\left(1+\\tanh\\left[\\sqrt{2\/\\pi}\\left(x + 0.044715x^{3}\\right)\\right]\\right)$ or $x\\sigma\\left(1.702x\\right),$\r\nbut PyTorch's exact implementation is sufficiently fast such that these approximations may be unnecessary. (See also the [SiLU](https:\/\/paperswithcode.com\/method\/silu) $x\\sigma(x)$ which was also coined in the paper that introduced the GELU.)\r\n\r\nGELUs are used in [GPT-3](https:\/\/paperswithcode.com\/method\/gpt-3), [BERT](https:\/\/paperswithcode.com\/method\/bert), and most other Transformers.","28":"**Adam** is an adaptive learning rate optimization algorithm that utilises both momentum and scaling, combining the benefits of [RMSProp](https:\/\/paperswithcode.com\/method\/rmsprop) and [SGD with Momentum](https:\/\/paperswithcode.com\/method\/sgd-with-momentum). The optimizer is designed to be appropriate for non-stationary objectives and problems with very noisy and\/or sparse gradients. \r\n\r\nThe weight updates are performed as:\r\n\r\n$$ w_{t} = w_{t-1} - \\eta\\frac{\\hat{m}\\_{t}}{\\sqrt{\\hat{v}\\_{t}} + \\epsilon} $$\r\n\r\nwith\r\n\r\n$$ \\hat{m}\\_{t} = \\frac{m_{t}}{1-\\beta^{t}_{1}} $$\r\n\r\n$$ \\hat{v}\\_{t} = \\frac{v_{t}}{1-\\beta^{t}_{2}} $$\r\n\r\n$$ m_{t} = \\beta_{1}m_{t-1} + (1-\\beta_{1})g_{t} $$\r\n\r\n$$ v_{t} = \\beta_{2}v_{t-1} + (1-\\beta_{2})g_{t}^{2} $$\r\n\r\n\r\n$ \\eta $ is the step size\/learning rate, around 1e-3 in the original paper. $ \\epsilon $ is a small number, typically 1e-8 or 1e-10, to prevent dividing by zero. $ \\beta_{1} $ and $ \\beta_{2} $ are forgetting parameters, with typical values 0.9 and 0.999, respectively.","29":"**Multi-head Attention** is a module for attention mechanisms which runs through an attention mechanism several times in parallel. The independent attention outputs are then concatenated and linearly transformed into the expected dimension. Intuitively, multiple attention heads allow for attending to parts of the sequence differently (e.g. longer-term dependencies versus shorter-term dependencies). \r\n\r\n$$ \\text{MultiHead}\\left(\\textbf{Q}, \\textbf{K}, \\textbf{V}\\right) = \\left[\\text{head}\\_{1},\\dots,\\text{head}\\_{h}\\right]\\textbf{W}_{0}$$\r\n\r\n$$\\text{where} \\text{ head}\\_{i} = \\text{Attention} \\left(\\textbf{Q}\\textbf{W}\\_{i}^{Q}, \\textbf{K}\\textbf{W}\\_{i}^{K}, \\textbf{V}\\textbf{W}\\_{i}^{V} \\right) $$\r\n\r\nAbove $\\textbf{W}$ are all learnable parameter matrices.\r\n\r\nNote that [scaled dot-product attention](https:\/\/paperswithcode.com\/method\/scaled) is most commonly used in this module, although in principle it can be swapped out for other types of attention mechanism.\r\n\r\nSource: [Lilian Weng](https:\/\/lilianweng.github.io\/lil-log\/2018\/06\/24\/attention-attention.html#a-family-of-attention-mechanisms)","30":"**Attention Dropout** is a type of [dropout](https:\/\/paperswithcode.com\/method\/dropout) used in attention-based architectures, where elements are randomly dropped out of the [softmax](https:\/\/paperswithcode.com\/method\/softmax) in the attention equation. 
For example, for scaled-dot product attention, we would drop elements from the first term:\r\n\r\n$$ {\\text{Attention}}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^{T}}{\\sqrt{d_k}}\\right)V $$","31":"**BERT**, or Bidirectional Encoder Representations from Transformers, improves upon standard [Transformers](http:\/\/paperswithcode.com\/method\/transformer) by removing the unidirectionality constraint by using a *masked language model* (MLM) pre-training objective. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, BERT uses a *next sentence prediction* task that jointly pre-trains text-pair representations. \r\n\r\nThere are two steps in BERT: *pre-training* and *fine-tuning*. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they\r\nare initialized with the same pre-trained parameters.","32":"A **GAN**, or **Generative Adversarial Network**, is a generative model that simultaneously trains\r\ntwo models: a generative model $G$ that captures the data distribution, and a discriminative model $D$ that estimates the\r\nprobability that a sample came from the training data rather than $G$.\r\n\r\nThe training procedure for $G$ is to maximize the probability of $D$ making\r\na mistake. This framework corresponds to a minimax two-player game. In the\r\nspace of arbitrary functions $G$ and $D$, a unique solution exists, with $G$\r\nrecovering the training data distribution and $D$ equal to $\\frac{1}{2}$\r\neverywhere. In the case where $G$ and $D$ are defined by multilayer perceptrons,\r\nthe entire system can be trained with backpropagation. \r\n\r\n(Image Source: [here](http:\/\/www.kdnuggets.com\/2017\/01\/generative-adversarial-networks-hot-topic-machine-learning.html))","33":"**Greedy Policy Search** (GPS) is a simple algorithm that learns a policy for test-time data augmentation based on the predictive performance on a validation set. GPS starts with an empty policy and builds it in an iterative fashion. Each step selects a sub-policy that provides the largest improvement in calibrated log-likelihood of ensemble predictions and adds it to the current policy.","34":"**Softplus** is an activation function $f\\left(x\\right) = \\log\\left(1+\\exp\\left(x\\right)\\right)$. 
It can be viewed as a smooth version of [ReLU](https:\/\/paperswithcode.com\/method\/relu).","35":"**Mish** is an activation function for neural networks which can be defined as:\r\n\r\n$$ f\\left(x\\right) = x\\cdot\\tanh{\\text{softplus}\\left(x\\right)}$$\r\n\r\nwhere\r\n\r\n$$\\text{softplus}\\left(x\\right) = \\ln\\left(1+e^{x}\\right)$$\r\n\r\n(Compare with functionally similar previously proposed activation functions such as the [GELU](https:\/\/paperswithcode.com\/method\/silu) $x\\Phi(x)$ and the [SiLU](https:\/\/paperswithcode.com\/method\/silu) $x\\sigma(x)$.)","36":"A **Bidirectional GRU**, or **BiGRU**, is a sequence processing model that consists of two [GRUs](https:\/\/paperswithcode.com\/method\/gru). one taking the input in a forward direction, and the other in a backwards direction. It is a bidirectional recurrent neural network with only the input and forget gates.\r\n\r\nImage Source: *Rana R (2016). Gated Recurrent Unit (GRU) for Emotion Classification from Noisy Speech.*","37":"A **Memory Network** provides a memory component that can be read from and written to with the inference capabilities of a neural network model. The motivation is that many neural networks lack a long-term memory component, and their existing memory component encoded by states and weights is too small and not compartmentalized enough to accurately remember facts from the past (RNNs for example, have difficult memorizing and doing tasks like copying). \r\n\r\nA memory network consists of a memory $\\textbf{m}$ (an array of objects indexed by $\\textbf{m}\\_{i}$ and four potentially learned components:\r\n\r\n- Input feature map $I$ - feature representation of the data input.\r\n- Generalization $G$ - updates old memories given the new input.\r\n- Output feature map $O$ - produces new feature map given $I$ and $G$.\r\n- Response $R$ - converts output into the desired response. \r\n\r\nGiven an input $x$ (e.g., an input character, word or sentence depending on the granularity chosen, an image or an audio signal) the flow of the model is as follows:\r\n\r\n1. Convert $x$ to an internal feature representation $I\\left(x\\right)$.\r\n2. Update memories $m\\_{i}$ given the new input: $m\\_{i} = G\\left(m\\_{i}, I\\left(x\\right), m\\right)$, $\\forall{i}$.\r\n3. Compute output features $o$ given the new input and the memory: $o = O\\left(I\\left(x\\right), m\\right)$.\r\n4. Finally, decode output features $o$ to give the final response: $r = R\\left(o\\right)$.\r\n\r\nThis process is applied at both train and test time, if there is a distinction between such phases, that\r\nis, memories are also stored at test time, but the model parameters of $I$, $G$, $O$ and $R$ are not updated. Memory networks cover a wide class of possible implementations. The components $I$, $G$, $O$ and $R$ can potentially use any existing ideas from the machine learning literature.\r\n\r\nImage Source: [Adrian Colyer](https:\/\/blog.acolyer.org\/2016\/03\/10\/memory-networks\/)","38":"A **Gated Recurrent Unit**, or **GRU**, is a type of recurrent neural network. It is similar to an [LSTM](https:\/\/paperswithcode.com\/method\/lstm), but only has two gates - a reset gate and an update gate - and notably lacks an output gate. 
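A small NumPy sketch of the Softplus and Mish activations defined above (illustrative only):

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), written in a numerically stable form
    return np.logaddexp(0.0, x)

def mish(x):
    # f(x) = x * tanh(softplus(x))
    return x * np.tanh(softplus(x))

x = np.linspace(-4, 4, 9)
print(softplus(x))   # smooth approximation of ReLU
print(mish(x))
```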
Fewer parameters means GRUs are generally easier\/faster to train than their LSTM counterparts.\r\n\r\nImage Source: [here](https:\/\/www.google.com\/url?sa=i&url=https%3A%2F%2Fcommons.wikimedia.org%2Fwiki%2FFile%3AGated_Recurrent_Unit%2C_type_1.svg&psig=AOvVaw3EmNX8QXC5hvyxeenmJIUn&ust=1590332062671000&source=images&cd=vfe&ved=0CA0QjhxqFwoTCMiev9-eyukCFQAAAAAdAAAAABAR)","39":"A **Concatenated Skip Connection** is a type of skip connection that seeks to reuse features by concatenating them to new layers, allowing more information to be retained from previous layers of the network. This contrasts with say, residual connections, where element-wise summation is used instead to incorporate information from previous layers. This type of skip connection is prominently used in DenseNets (and also Inception networks), which the Figure to the right illustrates.","40":"**U-Net** is an architecture for semantic segmentation. It consists of a contracting path and an expansive path. The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit ([ReLU](https:\/\/paperswithcode.com\/method\/relu)) and a 2x2 [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels. Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 [convolution](https:\/\/paperswithcode.com\/method\/convolution) (\u201cup-convolution\u201d) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer a [1x1 convolution](https:\/\/paperswithcode.com\/method\/1x1-convolution) is used to map each 64-component feature vector to the desired number of classes. In total the network has 23 convolutional layers.\r\n\r\n[Original MATLAB Code](https:\/\/lmb.informatik.uni-freiburg.de\/people\/ronneber\/u-net\/u-net-release-2015-10-02.tar.gz)","41":"**PatchGAN** is a type of discriminator for generative adversarial networks which only penalizes structure at the scale of local image patches. The PatchGAN discriminator tries to classify if each $N \\times N$ patch in an image is real or fake. This discriminator is run convolutionally across the image, averaging all responses to provide the ultimate output of $D$. Such a discriminator effectively models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. It can be understood as a type of texture\/style loss.","42":"**Instance Normalization** (also known as contrast normalization) is a normalization layer where:\r\n\r\n$$\r\n y_{tijk} = \\frac{x_{tijk} - \\mu_{ti}}{\\sqrt{\\sigma_{ti}^2 + \\epsilon}},\r\n \\quad\r\n \\mu_{ti} = \\frac{1}{HW}\\sum_{l=1}^W \\sum_{m=1}^H x_{tilm},\r\n \\quad\r\n \\sigma_{ti}^2 = \\frac{1}{HW}\\sum_{l=1}^W \\sum_{m=1}^H (x_{tilm} - mu_{ti})^2.\r\n$$\r\n\r\nThis prevents instance-specific mean and covariance shift simplifying the learning process. 
Intuitively, the normalization process allows us to remove instance-specific contrast information from the content image in a task like image stylization, which simplifies generation.","43":"**Leaky Rectified Linear Unit**, or **Leaky ReLU**, is a type of activation function based on a [ReLU](https:\/\/paperswithcode.com\/method\/relu), but it has a small slope for negative values instead of a flat slope. The slope coefficient is determined before training, i.e. it is not learnt during training. This type of activation function is popular in tasks where we may suffer from sparse gradients, for example training generative adversarial networks.","44":"**GAN Least Squares Loss** is a least squares loss function for generative adversarial networks. Minimizing this objective function is equivalent to minimizing the Pearson $\\chi^{2}$ divergence. The objective function (here for [LSGAN](https:\/\/paperswithcode.com\/method\/lsgan)) can be defined as:\r\n\r\n$$ \\min\\_{D}V\\_{LS}\\left(D\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{x} \\sim p\\_{data}\\left(\\mathbf{x}\\right)}\\left[\\left(D\\left(\\mathbf{x}\\right) - b\\right)^{2}\\right] + \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z}\\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - a\\right)^{2}\\right] $$\r\n\r\n$$ \\min\\_{G}V\\_{LS}\\left(G\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z} \\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - c\\right)^{2}\\right] $$\r\n\r\nwhere $a$ and $b$ are the labels for fake data and real data and $c$ denotes the value that $G$ wants $D$ to believe for fake data.","45":"**Cycle Consistency Loss** is a type of loss used for generative adversarial networks that perform unpaired image-to-image translation. It was introduced with the [CycleGAN](https:\/\/paperswithcode.com\/method\/cyclegan) architecture. For two domains $X$ and $Y$, we want to learn a mapping $G : X \\rightarrow Y$ and $F: Y \\rightarrow X$. We want to enforce the intuition that these mappings should be reverses of each other and that both mappings should be bijections. Cycle Consistency Loss encourages $F\\left(G\\left(x\\right)\\right) \\approx x$ and $G\\left(F\\left(y\\right)\\right) \\approx y$. It reduces the space of possible mapping functions by enforcing forward and backwards consistency:\r\n\r\n$$ \\mathcal{L}\\_{cyc}\\left(G, F\\right) = \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[||F\\left(G\\left(x\\right)\\right) - x||\\_{1}\\right] + \\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[||G\\left(F\\left(y\\right)\\right) - y||\\_{1}\\right] $$","46":"**CycleGAN**, or **Cycle-Consistent GAN**, is a type of generative adversarial network for unpaired image-to-image translation. For two domains $X$ and $Y$, CycleGAN learns a mapping $G : X \\rightarrow Y$ and $F: Y \\rightarrow X$. The novelty lies in trying to enforce the intuition that these mappings should be reverses of each other and that both mappings should be bijections. This is achieved through a [cycle consistency loss](https:\/\/paperswithcode.com\/method\/cycle-consistency-loss) that encourages $F\\left(G\\left(x\\right)\\right) \\approx x$ and $G\\left(F\\left(y\\right)\\right) \\approx y$.
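A hedged PyTorch sketch of the Cycle Consistency Loss entry above; `G` and `F` stand for any pair of image-to-image generators and are only stubbed out here:

```python
import torch

def cycle_consistency_loss(G, F, real_x, real_y):
    """L1 forward and backward cycle terms: F(G(x)) ~ x and G(F(y)) ~ y."""
    forward_cycle = torch.mean(torch.abs(F(G(real_x)) - real_x))
    backward_cycle = torch.mean(torch.abs(G(F(real_y)) - real_y))
    return forward_cycle + backward_cycle

# Toy usage with identity "generators" just to exercise the function;
# in CycleGAN-style training this term is added to the adversarial losses
# with a weighting coefficient lambda.
G = F = torch.nn.Identity()
x, y = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
print(cycle_consistency_loss(G, F, x, y))   # exactly 0 for identity mappings
```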
Combining this loss with the adversarial losses on $X$ and $Y$ yields the full objective for unpaired image-to-image translation.\r\n\r\nFor the mapping $G : X \\rightarrow Y$ and its discriminator $D\\_{Y}$ we have the objective:\r\n\r\n$$ \\mathcal{L}\\_{GAN}\\left(G, D\\_{Y}, X, Y\\right) =\\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[\\log D\\_{Y}\\left(y\\right)\\right] + \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[log(1 \u2212 D\\_{Y}\\left(G\\left(x\\right)\\right)\\right] $$\r\n\r\nwhere $G$ tries to generate images $G\\left(x\\right)$ that look similar to images from domain $Y$, while $D\\_{Y}$ tries to discriminate between translated samples $G\\left(x\\right)$ and real samples $y$. A similar loss is postulated for the mapping $F: Y \\rightarrow X$ and its discriminator $D\\_{X}$.\r\n\r\nThe Cycle Consistency Loss reduces the space of possible mapping functions by enforcing forward and backwards consistency:\r\n\r\n$$ \\mathcal{L}\\_{cyc}\\left(G, F\\right) = \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[||F\\left(G\\left(x\\right)\\right) - x||\\_{1}\\right] + \\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[||G\\left(F\\left(y\\right)\\right) - y||\\_{1}\\right] $$\r\n\r\nThe full objective is:\r\n\r\n$$ \\mathcal{L}\\_{GAN}\\left(G, F, D\\_{X}, D\\_{Y}\\right) = \\mathcal{L}\\_{GAN}\\left(G, D\\_{Y}, X, Y\\right) + \\mathcal{L}\\_{GAN}\\left(F, D\\_{X}, X, Y\\right) + \\lambda\\mathcal{L}\\_{cyc}\\left(G, F\\right) $$\r\n\r\nWhere we aim to solve:\r\n\r\n$$ G^{\\*}, F^{\\*} = \\arg \\min\\_{G, F} \\max\\_{D\\_{X}, D\\_{Y}} \\mathcal{L}\\_{GAN}\\left(G, F, D\\_{X}, D\\_{Y}\\right) $$\r\n\r\nFor the original architecture the authors use:\r\n\r\n- two stride-2 convolutions, several residual blocks, and two fractionally strided convolutions with stride $\\frac{1}{2}$.\r\n- [instance normalization](https:\/\/paperswithcode.com\/method\/instance-normalization)\r\n- PatchGANs for the discriminator\r\n- Least Square Loss for the [GAN](https:\/\/paperswithcode.com\/method\/gan) objectives.","47":"**Mixup** is a data augmentation technique that that generates a weighted combinations of random image pairs from the training data. Given two images and their ground truth labels: $\\left(x\\_{i}, y\\_{i}\\right), \\left(x\\_{j}, y\\_{j}\\right)$, a synthetic training example $\\left(\\hat{x}, \\hat{y}\\right)$ is generated as:\r\n\r\n$$ \\hat{x} = \\lambda{x\\_{i}} + \\left(1 \u2212 \\lambda\\right){x\\_{j}} $$\r\n$$ \\hat{y} = \\lambda{y\\_{i}} + \\left(1 \u2212 \\lambda\\right){y\\_{j}} $$\r\n\r\nwhere $\\lambda \\sim \\text{Beta}\\left(\\alpha = 0.2\\right)$ is independently sampled for each augmented example.","48":"**Entropy Regularization** is a type of regularization used in [reinforcement learning](https:\/\/paperswithcode.com\/methods\/area\/reinforcement-learning). For on-policy policy gradient based methods like [A3C](https:\/\/paperswithcode.com\/method\/a3c), the same mutual reinforcement behaviour leads to a highly-peaked $\\pi\\left(a\\mid{s}\\right)$ towards a few actions or action sequences, since it is easier for the actor and critic to overoptimise to a small portion of the environment. To reduce this problem, entropy regularization adds an entropy term to the loss to promote action diversity:\r\n\r\n$$H(X) = -\\sum\\pi\\left(x\\right)\\log\\left(\\pi\\left(x\\right)\\right) $$\r\n\r\nImage Credit: Wikipedia","49":"**Proximal Policy Optimization**, or **PPO**, is a policy gradient method for reinforcement learning. 
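A minimal NumPy sketch of the Mixup augmentation described above; the one-hot labels and the symmetric Beta parameters are illustrative assumptions:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two examples and their one-hot labels with lambda ~ Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    x_hat = lam * x1 + (1 - lam) * x2
    y_hat = lam * y1 + (1 - lam) * y2
    return x_hat, y_hat

# Example with 32x32 RGB inputs and 10-class one-hot labels.
x1, x2 = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
y1, y2 = np.eye(10)[3], np.eye(10)[7]
x_hat, y_hat = mixup(x1, y1, x2, y2)
```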
The motivation was to have an algorithm with the data efficiency and reliable performance of [TRPO](https:\/\/paperswithcode.com\/method\/trpo), while using only first-order optimization. \r\n\r\nLet $r\\_{t}\\left(\\theta\\right)$ denote the probability ratio $r\\_{t}\\left(\\theta\\right) = \\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}$, so $r\\left(\\theta\\_{old}\\right) = 1$. TRPO maximizes a \u201csurrogate\u201d objective:\r\n\r\n$$ L^{\\text{CPI}}\\left({\\theta}\\right) = \\hat{\\mathbb{E}}\\_{t}\\left[\\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}\\hat{A}\\_{t}\\right] = \\hat{\\mathbb{E}}\\_{t}\\left[r\\_{t}\\left(\\theta\\right)\\hat{A}\\_{t}\\right] $$\r\n\r\nWhere $CPI$ refers to conservative policy iteration. Without a constraint, maximization of $L^{CPI}$ would lead to an excessively large policy update; hence, PPO modifies the objective to penalize changes to the policy that move $r\\_{t}\\left(\\theta\\right)$ away from 1:\r\n\r\n$$ J^{\\text{CLIP}}\\left({\\theta}\\right) = \\hat{\\mathbb{E}}\\_{t}\\left[\\min\\left(r\\_{t}\\left(\\theta\\right)\\hat{A}\\_{t}, \\text{clip}\\left(r\\_{t}\\left(\\theta\\right), 1-\\epsilon, 1+\\epsilon\\right)\\hat{A}\\_{t}\\right)\\right] $$\r\n\r\nwhere $\\epsilon$ is a hyperparameter, say, $\\epsilon = 0.2$. The motivation for this objective is as follows. The first term inside the min is $L^{CPI}$. The second term, $\\text{clip}\\left(r\\_{t}\\left(\\theta\\right), 1-\\epsilon, 1+\\epsilon\\right)\\hat{A}\\_{t}$, modifies the surrogate\r\nobjective by clipping the probability ratio, which removes the incentive for moving $r\\_{t}$ outside of the interval $\\left[1 \u2212 \\epsilon, 1 + \\epsilon\\right]$. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse. \r\n\r\nOne detail to note is that when we apply PPO for a network where we have shared parameters for actor and critic functions, we typically add to the objective function an error term on value estimation and an entropy term to encourage exploration.","50":"A **Region Proposal Network**, or **RPN**, is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals. RPN and algorithms like [Fast R-CNN](https:\/\/paperswithcode.com\/method\/fast-r-cnn) can be merged into a single network by sharing their convolutional features - using the recently popular terminology of neural networks with attention mechanisms, the RPN component tells the unified network where to look.\r\n\r\nRPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. RPNs use anchor boxes that serve as references at multiple scales and aspect ratios. The scheme can be thought of as a pyramid of regression references, which avoids enumerating images or filters of multiple scales or aspect ratios.","51":"**Region of Interest Align**, or **RoIAlign**, is an operation for extracting a small feature map from each RoI in detection and segmentation based tasks. 
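A hedged PyTorch sketch of the clipped surrogate objective $J^{\text{CLIP}}$ from the PPO entry above; the log-probabilities and advantage estimates are assumed to come from the rest of a full PPO implementation:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate objective (to be minimised by gradient descent)."""
    ratio = torch.exp(logp_new - logp_old)               # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))    # maximise J^CLIP

# Toy usage with random placeholders for a batch of 64 transitions.
logp_new = torch.randn(64, requires_grad=True)
logp_old = logp_new.detach() + 0.05 * torch.randn(64)
adv = torch.randn(64)
loss = ppo_clip_loss(logp_new, logp_old, adv)
loss.backward()
```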
It removes the harsh quantization of [RoI Pool](https:\/\/paperswithcode.com\/method\/roi-pooling), properly *aligning* the extracted features with the input. To avoid any quantization of the RoI boundaries or bins (using $x\/16$ instead of $[x\/16]$), RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and the result is then aggregated (using max or average).","52":"**Mask R-CNN** extends [Faster R-CNN](http:\/\/paperswithcode.com\/method\/faster-r-cnn) to solve instance segmentation tasks. It achieves this by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. In principle, Mask R-CNN is an intuitive extension of Faster [R-CNN](https:\/\/paperswithcode.com\/method\/r-cnn), but constructing the mask branch properly is critical for good results. \r\n\r\nMost importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is evident in how [RoIPool](http:\/\/paperswithcode.com\/method\/roi-pooling), the *de facto* core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, Mask R-CNN utilises a simple, quantization-free layer, called [RoIAlign](http:\/\/paperswithcode.com\/method\/roi-align), that faithfully preserves exact spatial locations. \r\n\r\nSecondly, Mask R-CNN *decouples* mask and class prediction: it predicts a binary mask for each class independently, without competition among classes, and relies on the network's RoI classification branch to predict the category. In contrast, an [FCN](http:\/\/paperswithcode.com\/method\/fcn) usually perform per-pixel multi-class categorization, which couples segmentation and classification.","53":"**Absolute Position Encodings** are a type of position embeddings for [[Transformer](https:\/\/paperswithcode.com\/method\/transformer)-based models] where positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d\\_{model}$ as the embeddings, so that the two can be summed. In the original implementation, sine and cosine functions of different frequencies are used:\r\n\r\n$$ \\text{PE}\\left(pos, 2i\\right) = \\sin\\left(pos\/10000^{2i\/d\\_{model}}\\right) $$\r\n\r\n$$ \\text{PE}\\left(pos, 2i+1\\right) = \\cos\\left(pos\/10000^{2i\/d\\_{model}}\\right) $$\r\n\r\nwhere $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\\pi$ to $10000 \\dot 2\\pi$. 
This function was chosen because the authors hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $\\text{PE}\\_{pos+k}$ can be represented as a linear function of $\\text{PE}\\_{pos}$.\r\n\r\nImage Source: [D2L.ai](https:\/\/d2l.ai\/chapter_attention-mechanisms\/self-attention-and-positional-encoding.html)","54":"**Position-Wise Feed-Forward Layer** is a type of [feedforward layer](https:\/\/www.paperswithcode.com\/method\/category\/feedforwad-networks) consisting of two [dense layers](https:\/\/www.paperswithcode.com\/method\/dense-connections) that applies to the last dimension, which means the same dense layers are used for each position item in the sequence, so called position-wise.","55":"**Label Smoothing** is a regularization technique that introduces noise for the labels. This accounts for the fact that datasets may have mistakes in them, so maximizing the likelihood of $\\log{p}\\left(y\\mid{x}\\right)$ directly can be harmful. Assume for a small constant $\\epsilon$, the training set label $y$ is correct with probability $1-\\epsilon$ and incorrect otherwise. Label Smoothing regularizes a model based on a [softmax](https:\/\/paperswithcode.com\/method\/softmax) with $k$ output values by replacing the hard $0$ and $1$ classification targets with targets of $\\frac{\\epsilon}{k-1}$ and $1-\\epsilon$ respectively.\r\n\r\nSource: Deep Learning, Goodfellow et al\r\n\r\nImage Source: [When Does Label Smoothing Help?](https:\/\/arxiv.org\/abs\/1906.02629)","56":"**Byte Pair Encoding**, or **BPE**, is a subword segmentation algorithm that encodes rare and unknown words as sequences of subword units. The intuition is that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).\r\n\r\n[Lei Mao](https:\/\/leimao.github.io\/blog\/Byte-Pair-Encoding\/) has a detailed blog post that explains how this works.","57":"A **Transformer** is a model architecture that eschews recurrence and instead relies entirely on an [attention mechanism](https:\/\/paperswithcode.com\/methods\/category\/attention-mechanisms-1) to draw global dependencies between input and output. Before Transformers, the dominant sequence transduction models were based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The Transformer also employs an encoder and decoder, but removing recurrence in favor of [attention mechanisms](https:\/\/paperswithcode.com\/methods\/category\/attention-mechanisms-1) allows for significantly more parallelization than methods like [RNNs](https:\/\/paperswithcode.com\/methods\/category\/recurrent-neural-networks) and [CNNs](https:\/\/paperswithcode.com\/methods\/category\/convolutional-neural-networks).","58":"Spectral clustering has attracted increasing attention due to\r\nthe promising ability in dealing with nonlinearly separable datasets [15], [16]. In spectral clustering, the spectrum of the graph Laplacian is used to reveal the cluster structure. The spectral clustering algorithm mainly consists of two steps: 1) constructs the low dimensional embedded representation of the data based on the eigenvectors of the graph Laplacian, 2) applies k-means on the constructed low dimensional data to obtain the clustering result. 
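A small NumPy sketch of the label-smoothing targets described in the Label Smoothing entry above (the $\frac{\epsilon}{k-1}$ / $1-\epsilon$ scheme; the function name is illustrative):

```python
import numpy as np

def smooth_labels(true_class, k, eps=0.1):
    """Replace a hard one-hot target with eps/(k-1) off-target and 1-eps on-target."""
    target = np.full(k, eps / (k - 1))
    target[true_class] = 1.0 - eps
    return target

print(smooth_labels(true_class=2, k=5, eps=0.1))
# [0.025 0.025 0.9   0.025 0.025]
```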
Thus,","59":"Please enter a description about the method here","60":"**Depthwise Convolution** is a type of convolution where we apply a single convolutional filter for each input channel. In the regular 2D [convolution](https:\/\/paperswithcode.com\/method\/convolution) performed over multiple input channels, the filter is as deep as the input and lets us freely mix channels to generate each element in the output. In contrast, depthwise convolutions keep each channel separate. To summarize the steps, we:\r\n\r\n1. Split the input and filter into channels.\r\n2. We convolve each input with the respective filter.\r\n3. We stack the convolved outputs together.\r\n\r\nImage Credit: [Chi-Feng Wang](https:\/\/towardsdatascience.com\/a-basic-introduction-to-separable-convolutions-b99ec3102728)","61":"**Pointwise Convolution** is a type of [convolution](https:\/\/paperswithcode.com\/method\/convolution) that uses a 1x1 kernel: a kernel that iterates through every single point. This kernel has a depth of however many channels the input image has. It can be used in conjunction with [depthwise convolutions](https:\/\/paperswithcode.com\/method\/depthwise-convolution) to produce an efficient class of convolutions known as [depthwise-separable convolutions](https:\/\/paperswithcode.com\/method\/depthwise-separable-convolution).\r\n\r\nImage Credit: [Chi-Feng Wang](https:\/\/towardsdatascience.com\/a-basic-introduction-to-separable-convolutions-b99ec3102728)","62":"While [standard convolution](https:\/\/paperswithcode.com\/method\/convolution) performs the channelwise and spatial-wise computation in one step, **Depthwise Separable Convolution** splits the computation into two steps: [depthwise convolution](https:\/\/paperswithcode.com\/method\/depthwise-convolution) applies a single convolutional filter per each input channel and [pointwise convolution](https:\/\/paperswithcode.com\/method\/pointwise-convolution) is used to create a linear combination of the output of the depthwise convolution. The comparison of standard convolution and depthwise separable convolution is shown to the right.\r\n\r\nCredit: [Depthwise Convolution Is All You Need for Learning Multiple Visual Domains](https:\/\/paperswithcode.com\/paper\/depthwise-convolution-is-all-you-need-for)","63":"An **Inverted Residual Block**, sometimes called an **MBConv Block**, is a type of residual block used for image models that uses an inverted structure for efficiency reasons. It was originally proposed for the [MobileNetV2](https:\/\/paperswithcode.com\/method\/mobilenetv2) CNN architecture. It has since been reused for several mobile-optimized CNNs.\r\n\r\nA traditional [Residual Block](https:\/\/paperswithcode.com\/method\/residual-block) has a wide -> narrow -> wide structure with the number of channels. The input has a high number of channels, which are compressed with a [1x1 convolution](https:\/\/paperswithcode.com\/method\/1x1-convolution). The number of channels is then increased again with a 1x1 [convolution](https:\/\/paperswithcode.com\/method\/convolution) so input and output can be added. \r\n\r\nIn contrast, an Inverted Residual Block follows a narrow -> wide -> narrow approach, hence the inversion. 
We first widen with a 1x1 convolution, then use a 3x3 [depthwise convolution](https:\/\/paperswithcode.com\/method\/depthwise-convolution) (which greatly reduces the number of parameters), then we use a 1x1 convolution to reduce the number of channels so input and output can be added.","64":"**MobileNetV2** is a convolutional neural network architecture that seeks to perform well on mobile devices. It is based on an inverted residual structure where the residual connections are between the bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. As a whole, the architecture of MobileNetV2 contains the initial fully [convolution](https:\/\/paperswithcode.com\/method\/convolution) layer with 32 filters, followed by 19 residual bottleneck layers.","65":"**LIME**, or **Local Interpretable Model-Agnostic Explanations**, is an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model. It modifies a single data sample by tweaking the feature values and observes the resulting impact on the output. It performs the role of an \"explainer\" to explain predictions from each data sample. The output of LIME is a set of explanations representing the contribution of each feature to a prediction for a single sample, which is a form of local interpretability.\r\n\r\nInterpretable models in LIME can be, for instance, [linear regression](https:\/\/paperswithcode.com\/method\/linear-regression) or decision trees, which are trained on small perturbations (e.g. adding noise, removing words, hiding parts of the image) of the original model to provide a good local approximation.","66":"A **Bidirectional LSTM**, or **biLSTM**, is a sequence processing model that consists of two LSTMs: one taking the input in a forward direction, and the other in a backwards direction. BiLSTMs effectively increase the amount of information available to the network, improving the context available to the algorithm (e.g. knowing what words immediately follow *and* precede a word in a sentence).\r\n\r\nImage Source: Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks, Cornegruta et al","67":"**Logistic Regression**, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.\r\n\r\nSource: [scikit-learn](https:\/\/scikit-learn.org\/stable\/modules\/linear_model.html#logistic-regression)\r\n\r\nImage: [Michaelg2015](https:\/\/commons.wikimedia.org\/wiki\/User:Michaelg2015)","68":"Diffusion models generate samples by gradually\r\nremoving noise from a signal, and their training objective can be expressed as a reweighted variational lower-bound (https:\/\/arxiv.org\/abs\/2006.11239).","69":"Based on the understanding that the flat local minima of the empirical risk cause the model to generalize better. 
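A simplified PyTorch sketch of the inverted residual (MBConv-style) block described above; the normalisation, ReLU6 placement and expansion factor follow the MobileNetV2 pattern, but the details here are illustrative rather than a faithful reproduction:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),    # 1x1 expand (narrow -> wide)
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),          # 3x3 depthwise convolution
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),    # 1x1 project (wide -> narrow), linear
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)   # residual connection between the narrow ends

x = torch.randn(1, 24, 56, 56)
y = InvertedResidual(24)(x)        # same shape as x
```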
Adversarial Model Perturbation (AMP) improves generalization via minimizing the **AMP loss**, which is obtained from the empirical risk by applying the **worst** norm-bounded perturbation on each point in the parameter space.","70":"The goal of **Triplet loss**, in the context of Siamese Networks, is to maximize the joint probability among all score-pairs i.e. the product of all probabilities. By using its negative logarithm, we can get the loss formulation as follows:\r\n\r\n$$\r\nL\\_{t}\\left(\\mathcal{V}\\_{p}, \\mathcal{V}\\_{n}\\right)=-\\frac{1}{M N} \\sum\\_{i}^{M} \\sum\\_{j}^{N} \\log \\operatorname{prob}\\left(v p\\_{i}, v n\\_{j}\\right)\r\n$$\r\n\r\nwhere the balance weight $1\/MN$ is used to keep the loss with the same scale for different number of instance sets.","71":"**A3C**, **Asynchronous Advantage Actor Critic**, is a policy gradient algorithm in reinforcement learning that maintains a policy $\\pi\\left(a\\_{t}\\mid{s}\\_{t}; \\theta\\right)$ and an estimate of the value\r\nfunction $V\\left(s\\_{t}; \\theta\\_{v}\\right)$. It operates in the forward view and uses a mix of $n$-step returns to update both the policy and the value-function. The policy and the value function are updated after every $t\\_{\\text{max}}$ actions or when a terminal state is reached. The update performed by the algorithm can be seen as $\\nabla\\_{\\theta{'}}\\log\\pi\\left(a\\_{t}\\mid{s\\_{t}}; \\theta{'}\\right)A\\left(s\\_{t}, a\\_{t}; \\theta, \\theta\\_{v}\\right)$ where $A\\left(s\\_{t}, a\\_{t}; \\theta, \\theta\\_{v}\\right)$ is an estimate of the advantage function given by:\r\n\r\n$$\\sum^{k-1}\\_{i=0}\\gamma^{i}r\\_{t+i} + \\gamma^{k}V\\left(s\\_{t+k}; \\theta\\_{v}\\right) - V\\left(s\\_{t}; \\theta\\_{v}\\right)$$\r\n\r\nwhere $k$ can vary from state to state and is upper-bounded by $t\\_{max}$.\r\n\r\nThe critics in A3C learn the value function while multiple actors are trained in parallel and get synced with global parameters every so often. The gradients are accumulated as part of training for stability - this is like parallelized stochastic gradient descent.\r\n\r\nNote that while the parameters $\\theta$ of the policy and $\\theta\\_{v}$ of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one [softmax](https:\/\/paperswithcode.com\/method\/softmax) output for the policy $\\pi\\left(a\\_{t}\\mid{s}\\_{t}; \\theta\\right)$ and one linear output for the value function $V\\left(s\\_{t}; \\theta\\_{v}\\right)$, with all non-output layers shared.","72":"**SHAP**, or **SHapley Additive exPlanations**, is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. Shapley values are approximating using Kernel SHAP, which uses a weighting kernel for the approximation, and DeepSHAP, which uses DeepLift to approximate them.","73":"A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. 
Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\r\nSource: [Distilling the Knowledge in a Neural Network](https:\/\/arxiv.org\/abs\/1503.02531)","74":"**Cosine Annealing** is a type of learning rate schedule that has the effect of starting with a large learning rate that is relatively rapidly decreased to a minimum value before being increased rapidly again. The resetting of the learning rate acts like a simulated restart of the learning process and the re-use of good weights as the starting point of the restart is referred to as a \"warm restart\" in contrast to a \"cold restart\" where a new set of small random numbers may be used as a starting point.\r\n\r\n$$\\eta\\_{t} = \\eta\\_{min}^{i} + \\frac{1}{2}\\left(\\eta\\_{max}^{i}-\\eta\\_{min}^{i}\\right)\\left(1+\\cos\\left(\\frac{T\\_{cur}}{T\\_{i}}\\pi\\right)\\right)\r\n$$\r\n\r\nWhere where $\\eta\\_{min}^{i}$ and $ \\eta\\_{max}^{i}$ are ranges for the learning rate, and $T\\_{cur}$ account for how many epochs have been performed since the last restart.\r\n\r\nText Source: [Jason Brownlee](https:\/\/machinelearningmastery.com\/snapshot-ensemble-deep-learning-neural-network\/)\r\n\r\nImage Source: [Gao Huang](https:\/\/www.researchgate.net\/figure\/Training-loss-of-100-layer-DenseNet-on-CIFAR10-using-standard-learning-rate-blue-and-M_fig2_315765130)","75":"**Adaptive Input Embeddings** extend the [adaptive softmax](https:\/\/paperswithcode.com\/method\/adaptive-softmax) to input word representations. The factorization assigns more capacity to frequent words and reduces the capacity for less frequent words with the benefit of reducing overfitting to rare words.","76":"**Linear Warmup With Cosine Annealing** is a learning rate schedule where we increase the learning rate linearly for $n$ updates and then anneal according to a cosine schedule afterwards.","77":"The **Squeeze-and-Excitation Block** is an architectural unit designed to improve the representational power of a network by enabling it to perform dynamic channel-wise feature recalibration. The process is:\r\n\r\n- The block has a convolutional block as an input.\r\n- Each channel is \"squeezed\" into a single numeric value using [average pooling](https:\/\/paperswithcode.com\/method\/average-pooling).\r\n- A dense layer followed by a [ReLU](https:\/\/paperswithcode.com\/method\/relu) adds non-linearity and output channel complexity is reduced by a ratio.\r\n- Another dense layer followed by a sigmoid gives each channel a smooth gating function.\r\n- Finally, we weight each feature map of the convolutional block based on the side network; the \"excitation\".","78":"**Variational Dropout** is a regularization technique based on [dropout](https:\/\/paperswithcode.com\/method\/dropout), but uses a variational inference grounded approach. 
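A small Python sketch of the cosine annealing schedule above (warm restarts omitted; the function name and defaults are illustrative):

```python
import math

def cosine_annealing_lr(t_cur, t_max, eta_min=0.0, eta_max=0.1):
    """Learning rate after t_cur of t_max epochs since the last (re)start."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_max))

schedule = [cosine_annealing_lr(t, t_max=100) for t in range(101)]
# Starts at eta_max (0.1) and decays smoothly to eta_min (0.0) at t_cur == t_max.
```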
In Variational Dropout, we repeat the same dropout mask at each time step for both inputs, outputs, and recurrent layers (drop the same network units at each time step). This is in contrast to ordinary Dropout where different dropout masks are sampled at each time step for the inputs and outputs alone.","79":"**Adaptive Softmax** is a speedup technique for the computation of probability distributions over words. The adaptive [softmax](https:\/\/paperswithcode.com\/method\/softmax) is inspired by the class-based [hierarchical softmax](https:\/\/paperswithcode.com\/method\/hierarchical-softmax), where the word classes are built to minimize the computation time. Adaptive softmax achieves efficiency by explicitly taking into account the computation time of matrix-multiplication on parallel systems and combining it with a few important observations, namely keeping a shortlist of frequent words in the root node\r\nand reducing the capacity of rare words.","80":"**Discriminative Fine-Tuning** is a fine-tuning strategy that is used for [ULMFiT](https:\/\/paperswithcode.com\/method\/ulmfit) type models. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with different learning rates. For context, the regular stochastic gradient descent ([SGD](https:\/\/paperswithcode.com\/method\/sgd)) update of a model\u2019s parameters $\\theta$ at time step $t$ looks like the following (Ruder, 2016):\r\n\r\n$$ \\theta\\_{t} = \\theta\\_{t-1} \u2212 \\eta\\cdot\\nabla\\_{\\theta}J\\left(\\theta\\right)$$\r\n\r\nwhere $\\eta$ is the learning rate and $\\nabla\\_{\\theta}J\\left(\\theta\\right)$ is the gradient with regard to the model\u2019s objective function. For discriminative fine-tuning, we split the parameters $\\theta$ into {$\\theta\\_{1}, \\ldots, \\theta\\_{L}$} where $\\theta\\_{l}$ contains the parameters of the model at the $l$-th layer and $L$ is the number of layers of the model. Similarly, we obtain {$\\eta\\_{1}, \\ldots, \\eta\\_{L}$} where $\\theta\\_{l}$ where $\\eta\\_{l}$ is the learning rate of the $l$-th layer. The SGD update with discriminative finetuning is then:\r\n\r\n$$ \\theta\\_{t}^{l} = \\theta\\_{t-1}^{l} - \\eta^{l}\\cdot\\nabla\\_{\\theta^{l}}J\\left(\\theta\\right) $$\r\n\r\nThe authors find that empirically it worked well to first choose the learning rate $\\eta^{L}$ of the last layer by fine-tuning only the last layer and using $\\eta^{l-1}=\\eta^{l}\/2.6$ as the learning rate for lower layers.","81":"**Transformer-XL** (meaning extra long) is a [Transformer](https:\/\/paperswithcode.com\/method\/transformer) architecture that introduces the notion of recurrence to the deep self-attention network. Instead of computing the hidden states from scratch for each new segment, Transformer-XL reuses the hidden states obtained in previous segments. The reused hidden states serve as memory for the current segment, which builds up a recurrent connection between the segments. As a result, modeling very long-term dependency becomes possible because information can be propagated through the recurrent connections. 
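A hedged PyTorch sketch of discriminative fine-tuning as described above: one optimizer parameter group per layer, with the learning rate divided by 2.6 for each lower layer. The three-layer model is only a stand-in:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.Linear(64, 32), nn.Linear(32, 2))

base_lr = 1e-3                 # eta^L, chosen for the top layer
layers = list(model.children())
param_groups = []
for depth_from_top, layer in enumerate(reversed(layers)):
    param_groups.append({
        "params": layer.parameters(),
        "lr": base_lr / (2.6 ** depth_from_top),   # eta^{l-1} = eta^l / 2.6
    })

optimizer = torch.optim.SGD(param_groups, lr=base_lr)
# Top layer trains at 1e-3, the next at about 3.8e-4, the bottom at about 1.5e-4.
```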
As an additional contribution, the Transformer-XL uses a new relative positional encoding formulation that generalizes to attention lengths longer than the one observed during training.","82":"A **SENet** is a convolutional neural network architecture that employs squeeze-and-excitation blocks to enable the network to perform dynamic channel-wise feature recalibration.","83":"**GPT-2** is a [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers) architecture that was notable for its size (1.5 billion parameters) on its release. The model is pretrained on a WebText dataset - text from 45 million website links. It largely follows the previous [GPT](https:\/\/paperswithcode.com\/method\/gpt) architecture with some modifications:\r\n\r\n- [Layer normalization](https:\/\/paperswithcode.com\/method\/layer-normalization) is moved to the input of each sub-block, similar to a\r\npre-activation residual network and an additional layer normalization was added after the final self-attention block. \r\n\r\n- A modified initialization which accounts for the accumulation on the residual path with model depth\r\nis used. Weights of residual layers are scaled at initialization by a factor of $1\/\\sqrt{N}$ where $N$ is the number of residual layers. \r\n\r\n- The vocabulary is expanded to 50,257. The context size is expanded from 512 to 1024 tokens and\r\na larger batch size of 512 is used.","84":"**Stochastic Gradient Descent** is an iterative optimization technique that uses minibatches of data to form an expectation of the gradient, rather than the full gradient using all available data. That is for weights $w$ and a loss function $L$ we have:\r\n\r\n$$ w\\_{t+1} = w\\_{t} - \\eta\\hat{\\nabla}\\_{w}{L(w\\_{t})} $$\r\n\r\nWhere $\\eta$ is a learning rate. SGD reduces redundancy compared to batch gradient descent - which recomputes gradients for similar examples before each parameter update - so it is usually much faster.\r\n\r\n(Image Source: [here](http:\/\/rasbt.github.io\/mlxtend\/user_guide\/general_concepts\/gradient-optimization\/))","85":"A **Dilated Causal Convolution** is a [causal convolution](https:\/\/paperswithcode.com\/method\/causal-convolution) where the filter is applied over an area larger than its length by skipping input values with a certain step. A dilated causal [convolution](https:\/\/paperswithcode.com\/method\/convolution) effectively allows the network to have very large receptive fields with just a few layers.","86":"**Causal convolutions** are a type of [convolution](https:\/\/paperswithcode.com\/method\/convolution) used for temporal data which ensures the model cannot violate the ordering in which we model the data: the prediction $p(x_{t+1} | x_{1}, \\ldots, x_{t})$ emitted by the model at timestep $t$ cannot depend on any of the future timesteps $x_{t+1}, x_{t+2}, \\ldots, x_{T}$. For images, the equivalent of a causal convolution is a [masked convolution](https:\/\/paperswithcode.com\/method\/masked-convolution) which can be implemented by constructing a mask tensor and doing an element-wise multiplication of this mask with the convolution kernel before applying it. For 1-D data such as audio one can more easily implement this by shifting the output of a normal convolution by a few timesteps.","87":"**Affine Coupling** is a method for implementing a normalizing flow (where we stack a sequence of invertible bijective transformation functions). Affine coupling is one of these bijective transformation functions. 
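A PyTorch sketch of a 1-D causal convolution as described above, implemented by left-padding the sequence so that the output at time $t$ never sees future timesteps; with `dilation > 1` it also illustrates the dilated causal case (names are illustrative):

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        # Pad only on the left so position t depends on inputs <= t.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # left-pad the time dimension
        return self.conv(x)

x = torch.randn(1, 16, 100)
y = CausalConv1d(16, kernel_size=3, dilation=2)(x)   # same length, causal and dilated
```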
Specifically, it is an example of a reversible transformation where the forward function, the reverse function and the log-determinant are computationally efficient. For the forward function, we split the input dimension into two parts:\r\n\r\n$$ \\mathbf{x}\\_{a}, \\mathbf{x}\\_{b} = \\text{split}\\left(\\mathbf{x}\\right) $$\r\n\r\nThe second part stays the same $\\mathbf{x}\\_{b} = \\mathbf{y}\\_{b}$, while the first part $\\mathbf{x}\\_{a}$ undergoes an affine transformation, where the parameters for this transformation are learnt using the second part $\\mathbf{x}\\_{b}$ being put through a neural network. Together we have:\r\n\r\n$$ \\left(\\log{\\mathbf{s}, \\mathbf{t}}\\right) = \\text{NN}\\left(\\mathbf{x}\\_{b}\\right) $$\r\n\r\n$$ \\mathbf{s} = \\exp\\left(\\log{\\mathbf{s}}\\right) $$\r\n\r\n$$ \\mathbf{y}\\_{a} = \\mathbf{s} \\odot \\mathbf{x}\\_{a} + \\mathbf{t} $$\r\n\r\n$$ \\mathbf{y}\\_{b} = \\mathbf{x}\\_{b} $$\r\n\r\n$$ \\mathbf{y} = \\text{concat}\\left(\\mathbf{y}\\_{a}, \\mathbf{y}\\_{b}\\right) $$\r\n\r\nImage: [GLOW](https:\/\/paperswithcode.com\/method\/glow)","88":"**Normalizing Flows** are a method for constructing complex distributions by transforming a\r\nprobability density through a series of invertible mappings. By repeatedly applying the rule for change of variables, the initial density \u2018flows\u2019 through the sequence of invertible mappings. At the end of this sequence we obtain a valid probability distribution and hence this type of flow is referred to as a normalizing flow.\r\n\r\nIn the case of finite flows, the basic rule for the transformation of densities considers an invertible, smooth mapping $f : \\mathbb{R}^{d} \\rightarrow \\mathbb{R}^{d}$ with inverse $f^{-1} = g$, i.e. the composition $g \\cdot f\\left(z\\right) = z$. If we use this mapping to transform a random variable $z$ with distribution $q\\left(z\\right)$, the resulting random variable $z' = f\\left(z\\right)$ has a distribution:\r\n\r\n$$ q\\left(\\mathbf{z}'\\right) = q\\left(\\mathbf{z}\\right)\\bigl\\vert{\\text{det}}\\frac{\\delta{f}^{-1}}{\\delta{\\mathbf{z'}}}\\bigr\\vert = q\\left(\\mathbf{z}\\right)\\bigl\\vert{\\text{det}}\\frac{\\delta{f}}{\\delta{\\mathbf{z}}}\\bigr\\vert ^{-1} $$\r\n\f\r\nwhere the last equality can be seen by applying the chain rule (inverse function theorem) and is a property of Jacobians of invertible functions. We can construct arbitrarily complex densities by composing several simple maps and successively applying the above equation. The density $q\\_{K}\\left(\\mathbf{z}\\right)$ obtained by successively transforming a random variable $z\\_{0}$ with distribution $q\\_{0}$ through a chain of $K$ transformations $f\\_{k}$ is:\r\n\r\n$$ z\\_{K} = f\\_{K} \\cdot \\dots \\cdot f\\_{2} \\cdot f\\_{1}\\left(z\\_{0}\\right) $$\r\n\r\n$$ \\ln{q}\\_{K}\\left(z\\_{K}\\right) = \\ln{q}\\_{0}\\left(z\\_{0}\\right) \u2212 \\sum^{K}\\_{k=1}\\ln\\vert\\det\\frac{\\delta{f\\_{k}}}{\\delta{\\mathbf{z\\_{k-1}}}}\\vert $$\r\n\f\r\nThe path traversed by the random variables $z\\_{k} = f\\_{k}\\left(z\\_{k-1}\\right)$ with initial distribution $q\\_{0}\\left(z\\_{0}\\right)$ is called the flow and the path formed by the successive distributions $q\\_{k}$ is a normalizing flow.","89":"**NICE**, or **Non-Linear Independent Components Estimation** is a framework for modeling complex high-dimensional densities. It is based on the idea that a good representation is one in which the data has a distribution that is easy to model. 
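A NumPy sketch of the affine coupling transform above; `toy_nn` stands in for the neural network that produces $(\log \mathbf{s}, \mathbf{t})$ from $\mathbf{x}_b$ and is purely illustrative:

```python
import numpy as np

def toy_nn(xb):
    """Stand-in for the conditioner network: returns (log_s, t)."""
    return 0.1 * xb, 0.5 * xb

def affine_coupling_forward(x, nn=toy_nn):
    xa, xb = np.split(x, 2)
    log_s, t = nn(xb)
    ya = np.exp(log_s) * xa + t                       # y_a = s * x_a + t
    return np.concatenate([ya, xb]), np.sum(log_s)    # log|det J| = sum(log s)

def affine_coupling_inverse(y, nn=toy_nn):
    ya, yb = np.split(y, 2)
    log_s, t = nn(yb)              # y_b == x_b, so the same parameters are recovered
    xa = (ya - t) * np.exp(-log_s)
    return np.concatenate([xa, yb])

x = np.random.randn(8)
y, logdet = affine_coupling_forward(x)
assert np.allclose(affine_coupling_inverse(y), x)
```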
For this purpose, a non-linear deterministic transformation of the data is learned that maps it to a latent space so as to make the transformed data conform to a factorized distribution, i.e., resulting in independent latent variables. The transformation is parameterised so that computing the determinant of the Jacobian and inverse Jacobian is trivial, yet it maintains the ability to learn complex non-linear transformations, via a composition of simple building blocks, each based on a deep neural network. The training criterion is simply the exact log-likelihood. The transformation used in NICE is the [affine coupling](https:\/\/paperswithcode.com\/method\/affine-coupling) layer without the scale term, known as additive coupling layer:\r\n\r\n$$ y\\_{I\\_{2}} = x\\_{I\\_{2}} + m\\left(x\\_{I\\_{1}}\\right) $$\r\n\r\n$$ x\\_{I\\_{2}} = y\\_{I\\_{2}} + m\\left(y\\_{I\\_{1}}\\right) $$","90":"**MAML**, or **Model-Agnostic Meta-Learning**, is a model and task-agnostic algorithm for meta-learning that trains a model\u2019s parameters such that a small number of gradient updates will lead to fast learning on a new task.\r\n\r\nConsider a model represented by a parametrized function $f\\_{\\theta}$ with parameters $\\theta$. When adapting to a new task $\\mathcal{T}\\_{i}$, the model\u2019s parameters $\\theta$ become $\\theta'\\_{i}$. With MAML, the updated parameter vector $\\theta'\\_{i}$ is computed using one or more gradient descent updates on task $\\mathcal{T}\\_{i}$. For example, when using one gradient update,\r\n\r\n$$ \\theta'\\_{i} = \\theta - \\alpha\\nabla\\_{\\theta}\\mathcal{L}\\_{\\mathcal{T}\\_{i}}\\left(f\\_{\\theta}\\right) $$\r\n\r\nThe step size $\\alpha$ may be fixed as a hyperparameter or metalearned. The model parameters are trained by optimizing for the performance of $f\\_{\\theta'\\_{i}}$ with respect to $\\theta$ across tasks sampled from $p\\left(\\mathcal{T}\\_{i}\\right)$. More concretely the meta-objective is as follows:\r\n\r\n$$ \\min\\_{\\theta} \\sum\\_{\\mathcal{T}\\_{i} \\sim p\\left(\\mathcal{T}\\right)} \\mathcal{L}\\_{\\mathcal{T\\_{i}}}\\left(f\\_{\\theta'\\_{i}}\\right) = \\sum\\_{\\mathcal{T}\\_{i} \\sim p\\left(\\mathcal{T}\\right)} \\mathcal{L}\\_{\\mathcal{T\\_{i}}}\\left(f\\_{\\theta - \\alpha\\nabla\\_{\\theta}\\mathcal{L}\\_{\\mathcal{T}\\_{i}}\\left(f\\_{\\theta}\\right)}\\right) $$\r\n\r\nNote that the meta-optimization is performed over the model parameters $\\theta$, whereas the objective is computed using the updated model parameters $\\theta'$. In effect MAML aims to optimize the model parameters such that one or a small number of gradient steps on a new task will produce maximally effective behavior on that task. The meta-optimization across tasks is performed via stochastic gradient descent ([SGD](https:\/\/paperswithcode.com\/method\/sgd)), such that the model parameters $\\theta$ are updated as follows:\r\n\r\n$$ \\theta \\leftarrow \\theta - \\beta\\nabla\\_{\\theta} \\sum\\_{\\mathcal{T}\\_{i} \\sim p\\left(\\mathcal{T}\\right)} \\mathcal{L}\\_{\\mathcal{T\\_{i}}}\\left(f\\_{\\theta'\\_{i}}\\right)$$\r\n\r\nwhere $\\beta$ is the meta step size.","91":"**LAMB** is a a layerwise adaptive large batch optimization technique. It provides a strategy for adapting the learning rate in large batch settings. 
LAMB uses [Adam](https:\/\/paperswithcode.com\/method\/adam) as the base algorithm and then forms an update as:\r\n\r\n$$r\\_{t} = \\frac{m\\_{t}}{\\sqrt{v\\_{t}} + \\epsilon}$$\r\n$$x\\_{t+1}^{\\left(i\\right)} = x\\_{t}^{\\left(i\\right)} - \\eta\\_{t}\\frac{\\phi\\left(|| x\\_{t}^{\\left(i\\right)} ||\\right)}{|| r\\_{t}^{\\left(i\\right)}+\\lambda{x\\_{t}^{\\left(i\\right)}} || }\\left(r\\_{t}^{\\left(i\\right)}+\\lambda{x\\_{t}^{\\left(i\\right)}}\\right) $$\r\n\r\nUnlike [LARS](https:\/\/paperswithcode.com\/method\/lars), the adaptivity of LAMB is two-fold: (i) per dimension normalization with respect to the square root of the second moment used in Adam and (ii) layerwise normalization obtained due to layerwise adaptivity.","92":"**RoBERTa** is an extension of [BERT](https:\/\/paperswithcode.com\/method\/bert) with changes to the pretraining procedure. The modifications include: \r\n\r\n- training the model longer, with bigger batches, over more data\r\n- removing the next sentence prediction objective\r\n- training on longer sequences\r\n- dynamically changing the masking pattern applied to the training data. The authors also collect a large new dataset ($\\text{CC-News}$) of comparable size to other privately used datasets, to better control for training set size effects","93":"**ALBERT** is a [Transformer](https:\/\/paperswithcode.com\/method\/transformer) architecture based on [BERT](https:\/\/paperswithcode.com\/method\/bert) but with far fewer parameters. It achieves this through two parameter reduction techniques. The first is a factorized embedding parameterization. By decomposing the large vocabulary embedding matrix into two small matrices, the size of the hidden layers is separated from the size of the vocabulary embedding. This makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings. The second technique is cross-layer parameter sharing. This technique prevents the number of parameters from growing with the depth of the network. \r\n\r\nAdditionally, ALBERT utilises a self-supervised loss for sentence-order prediction (SOP). SOP primarily focuses on inter-sentence coherence and is designed to address the ineffectiveness of the next sentence prediction (NSP) loss proposed in the original BERT.","94":"**Differentiable Architecture Search** (**DARTS**) is a method for efficient architecture search. The search space is made continuous so that the architecture can be optimized with respect to its validation set performance through gradient descent.","95":"Class activation maps can be used to interpret the prediction decision made by the convolutional neural network (CNN).\r\n\r\nImage source: [Learning Deep Features for Discriminative Localization](https:\/\/paperswithcode.com\/paper\/learning-deep-features-for-discriminative)","96":"A **Support Vector Machine**, or **SVM**, is a non-parametric supervised learning model. For non-linear classification and regression, SVMs utilise the kernel trick to map inputs to high-dimensional feature spaces. SVMs construct a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier. 
The figure to the right shows the decision function for a linearly separable problem, with three samples on the margin boundaries, called \u201csupport vectors\u201d. \r\n\r\nSource: [scikit-learn](https:\/\/scikit-learn.org\/stable\/modules\/svm.html)","97":"A **DQN**, or Deep Q-Network, approximates a state-value function in a [Q-Learning](https:\/\/paperswithcode.com\/method\/q-learning) framework with a neural network. In the Atari Games case, they take in several frames of the game as an input and output state values for each action as an output. \r\n\r\nIt is usually used in conjunction with [Experience Replay](https:\/\/paperswithcode.com\/method\/experience-replay), for storing the episode steps in memory for off-policy learning, where samples are drawn from the replay memory at random. Additionally, the Q-Network is usually optimized towards a frozen target network that is periodically updated with the latest weights every $k$ steps (where $k$ is a hyperparameter). The latter makes training more stable by preventing short-term oscillations from a moving target. The former tackles autocorrelation that would occur from on-line learning, and having a replay memory makes the problem more like a supervised learning problem.\r\n\r\nImage Source: [here](https:\/\/www.researchgate.net\/publication\/319643003_Autonomous_Quadrotor_Landing_using_Deep_Reinforcement_Learning)","98":"Temporal attention can be seen as a dynamic time selection mechanism determining when to pay attention, and is thus usually used for video processing.","99":"**fastText** embeddings exploit subword information to construct word embeddings. Representations are learnt of character $n$-grams, and words represented as the sum of the $n$-gram vectors. This extends the word2vec type models with subword information. This helps the embeddings understand suffixes and prefixes. Once a word is represented using character $n$-grams, a skipgram model is trained to learn the embeddings.","100":"**Seq2Seq**, or **Sequence To Sequence**, is a model used in sequence prediction tasks, such as language modelling and machine translation. The idea is to use one [LSTM](https:\/\/paperswithcode.com\/method\/lstm), the *encoder*, to read the input sequence one timestep at a time, to obtain a large fixed dimensional vector representation (a context vector), and then to use another LSTM, the *decoder*, to extract the output sequence\r\nfrom that vector. The second LSTM is essentially a recurrent neural network language model except that it is conditioned on the input sequence.\r\n\r\n(Note that this page refers to the original seq2seq not general sequence-to-sequence models)","101":"**Local Response Normalization** is a normalization layer that implements the idea of lateral inhibition. Lateral inhibition is a concept in neurobiology that refers to the phenomenon of an excited neuron inhibiting its neighbours: this leads to a peak in the form of a local maximum, creating contrast in that area and increasing sensory perception. 
In practice, we can either normalize within the same channel or normalize across channels when we apply LRN to convolutional neural networks.\r\n\r\n$$ b_{c} = a_{c}\\left(k + \\frac{\\alpha}{n}\\sum_{c'=\\max(0, c-n\/2)}^{\\min(N-1,c+n\/2)}a_{c'}^2\\right)^{-\\beta} $$\r\n\r\nWhere the size is the number of neighbouring channels used for normalization, $\\alpha$ is multiplicative factor, $\\beta$ an exponent and $k$ an additive factor","102":"A **Grouped Convolution** uses a group of convolutions - multiple kernels per layer - resulting in multiple channel outputs per layer. This leads to wider networks helping a network learn a varied set of low level and high level features. The original motivation of using Grouped Convolutions in [AlexNet](https:\/\/paperswithcode.com\/method\/alexnet) was to distribute the model over multiple GPUs as an engineering compromise. But later, with models such as [ResNeXt](https:\/\/paperswithcode.com\/method\/resnext), it was shown this module could be used to improve classification accuracy. Specifically by exposing a new dimension through grouped convolutions, *cardinality* (the size of set of transformations), we can increase accuracy by increasing it.","103":"**AlexNet** is a classic convolutional neural network architecture. It consists of convolutions, [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) and dense layers as the basic building blocks. Grouped convolutions are used in order to fit the model across two GPUs.","104":"**DLA**, or **Deep Layer Aggregation**, iteratively and hierarchically merges the feature hierarchy across layers in neural networks to make networks with better accuracy and fewer parameters. \r\n\r\nIn iterative deep aggregation (IDA), aggregation begins at the shallowest, smallest scale and then iteratively merges deeper,\r\nlarger scales. In this way shallow features are refined as\r\nthey are propagated through different stages of aggregation.\r\n\r\nIn hierarchical deep aggregation (HDA), blocks and stages\r\nin a tree are merged to preserve and combine feature channels. With\r\nHDA shallower and deeper layers are combined to learn\r\nricher combinations that span more of the feature hierarchy.\r\nWhile IDA effectively combines stages, it is insufficient\r\nfor fusing the many blocks of a network, as it is still only\r\nsequential.","105":"**Center Pooling** is a pooling technique for object detection that aims to capture richer and more recognizable visual patterns. The geometric centers of objects do not necessarily convey very recognizable visual patterns (e.g., the human head contains strong visual patterns, but the center keypoint is often in the middle of the human body). \r\n\r\nThe detailed process of center pooling is as follows: the backbone outputs a feature map, and to determine if a pixel in the feature map is a center keypoint, we need to find the maximum value in its both horizontal and vertical directions and add them together. By doing this, center pooling helps the better detection of center keypoints.","106":"**Cascade Corner Pooling** is a pooling layer for object detection that builds upon the [corner pooling](https:\/\/paperswithcode.com\/method\/corner-pooling) operation. Corners are often outside the objects, which lacks local appearance features. [CornerNet](https:\/\/paperswithcode.com\/method\/cornernet) uses corner pooling to address this issue, where we find the maximum values on the boundary directions so as to determine corners. 
However, it makes corners sensitive to the edges. To address this problem, we need to let corners see the visual patterns of objects. Cascade corner pooling first looks along a boundary to find a boundary maximum value, then looks inside along the location of the boundary maximum value to find an internal maximum value, and finally, add the two maximum values together. By doing this, the corners obtain both the the boundary information and the visual patterns of objects.","107":"**CenterNet** is a one-stage object detector that detects each object as a triplet, rather than a pair, of keypoints. It utilizes two customized modules named [cascade corner pooling](https:\/\/paperswithcode.com\/method\/cascade-corner-pooling) and [center pooling](https:\/\/paperswithcode.com\/method\/center-pooling), which play the roles of enriching information collected by both top-left and bottom-right corners and providing more recognizable information at the central regions, respectively. The intuition is that, if a predicted bounding box has a high IoU with the ground-truth box, then the probability that the center keypoint in its central region is predicted as the same class is high, and vice versa. Thus, during inference, after a proposal is generated as a pair of corner keypoints, we determine if the proposal is indeed an object by checking if there is a center keypoint of the same class falling within its central region.","108":"Please enter a description about the method here","109":"**Region of Interest Pooling**, or **RoIPool**, is an operation for extracting a small feature map (e.g., $7\u00d77$) from each RoI in detection and segmentation based tasks. Features are extracted from each candidate box, and thereafter in models like [Fast R-CNN](https:\/\/paperswithcode.com\/method\/fast-r-cnn), are then classified and bounding box regression performed.\r\n\r\nThe actual scaling to, e.g., $7\u00d77$, occurs by dividing the region proposal into equally sized sections, finding the largest value in each section, and then copying these max values to the output buffer. In essence, **RoIPool** is [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) on a discrete grid based on a box.\r\n\r\nImage Source: [Joyce Xu](https:\/\/towardsdatascience.com\/deep-learning-for-object-detection-a-comprehensive-review-73930816d8d9)","110":"**Faster R-CNN** is an object detection model that improves on [Fast R-CNN](https:\/\/paperswithcode.com\/method\/fast-r-cnn) by utilising a region proposal network ([RPN](https:\/\/paperswithcode.com\/method\/rpn)) with the CNN model. The RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. It is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by [Fast R-CNN](https:\/\/paperswithcode.com\/method\/fast-r-cnn) for detection. RPN and Fast [R-CNN](https:\/\/paperswithcode.com\/method\/r-cnn) are merged into a single network by sharing their convolutional features: the RPN component tells the unified network where to look.\r\n\r\nAs a whole, Faster R-CNN consists of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions.","111":"BYOL (Bootstrap Your Own Latent) is a new approach to self-supervised learning. 
BYOL\u2019s goal is to learn a representation $y_\u03b8$ which can then be used for downstream tasks. BYOL uses two neural networks to learn: the online and target networks. The online network is defined by a set of weights $\u03b8$ and is comprised of three stages: an encoder $f_\u03b8$, a projector $g_\u03b8$ and a predictor $q_\u03b8$. The target network has the same architecture\r\nas the online network, but uses a different set of weights $\u03be$. The target network provides the regression\r\ntargets to train the online network, and its parameters $\u03be$ are an exponential moving average of the\r\nonline parameters $\u03b8$.\r\n\r\nGiven the architecture diagram on the right, BYOL minimizes a similarity loss between $q_\u03b8(z_\u03b8)$ and $sg(z'{_\u03be})$, where $\u03b8$ are the trained weights, $\u03be$ are an exponential moving average of $\u03b8$ and $sg$ means stop-gradient. At the end of training, everything but $f_\u03b8$ is discarded, and $y_\u03b8$ is used as the image representation.\r\n\r\nSource: [Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning](https:\/\/paperswithcode.com\/paper\/bootstrap-your-own-latent-a-new-approach-to-1)\r\n\r\nImage credit: [Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning](https:\/\/paperswithcode.com\/paper\/bootstrap-your-own-latent-a-new-approach-to-1)","112":"**ReLU6** is a modification of the [rectified linear unit](https:\/\/paperswithcode.com\/method\/relu) where we limit the activation to a maximum size of $6$. This is due to increased robustness when used with low-precision computation.\r\n\r\nImage Credit: [PyTorch](https:\/\/pytorch.org\/docs\/master\/generated\/torch.nn.ReLU6.html)","113":"**node2vec** is a framework for learning graph embeddings for nodes in graphs. Node2vec maximizes a likelihood objective over mappings which preserve neighbourhood distances in higher dimensional spaces. From an algorithm design perspective, node2vec exploits the freedom to define neighbourhoods for nodes and provide an explanation for the effect of the choice of neighborhood on the learned representations. \r\n\r\nFor each node, node2vec simulates biased random walks based on an efficient network-aware search strategy and the nodes appearing in the random walk define neighbourhoods. The search strategy accounts for the relative influence nodes exert in a network. It also generalizes prior work alluding to naive search strategies by providing flexibility in exploring neighborhoods.","114":"A Graph Convolutional Network, or GCN, is an approach for semi-supervised learning on graph-structured data. It is based on an efficient variant of convolutional neural networks which operate directly on graphs.\r\n\r\nImage source: [Semi-Supervised Classification with Graph Convolutional Networks](https:\/\/arxiv.org\/pdf\/1609.02907v4.pdf)","115":"Gravity is a kinematic approach to optimization based on gradients.","116":"A **Relativistic GAN** is a type of generative adversarial network. It has a relativistic discriminator which estimates the probability that the given real data is more realistic than a randomly sampled fake data. The idea is to endow GANs with the property that the probability of real data being real ($D\\left(x\\_{r}\\right)$) should decrease as the probability of fake data being real ($D\\left(x\\_{f}\\right)$) increases.\r\n\r\nWith a standard [GAN](https:\/\/paperswithcode.com\/method\/gan), we can achieve this as follows. 
The standard GAN discriminator can be defined, in term of the non-transformed layer $C\\left(x\\right)$, as $D\\left(x\\right) = \\text{sigmoid}\\left(C\\left(x\\right)\\right)$. A simple way to make discriminator relativistic - having the output of $D$ depend on both real and fake data - is to sample from real\/fake data pairs $\\tilde{x} = \\left(x\\_{r}, x\\_{f}\\right)$ and define it as $D\\left(\\tilde{x}\\right) = \\text{sigmoid}\\left(C\\left(x\\_{r}\\right) \u2212 C\\left(x\\_{f}\\right)\\right)$. The modification can be interpreted as: the discriminator estimates the probability\r\nthat the given real data is more realistic than a randomly sampled fake data.\r\n\r\nMore generally a Relativistic GAN can be interpreted as having a discriminator of the form $a\\left(C\\left(x\\_{r}\\right)\u2212C\\left(x\\_{f}\\right)\\right)$, where $a$ is the activation function, to be relativistic.","117":"**Wasserstein GAN**, or **WGAN**, is a type of generative adversarial network that minimizes an approximation of the Earth-Mover's distance (EM) rather than the Jensen-Shannon divergence as in the original [GAN](https:\/\/paperswithcode.com\/method\/gan) formulation. It leads to more stable training than original GANs with less evidence of mode collapse, as well as meaningful curves that can be used for debugging and searching hyperparameters.","118":"The **alternating direction method of multipliers** (**ADMM**) is an algorithm that solves convex optimization problems by breaking them into smaller pieces, each of which are then easier to handle. It takes the form of a decomposition-coordination procedure, in which the solutions to small\r\nlocal subproblems are coordinated to find a solution to a large global problem. ADMM can be viewed as an attempt to blend the benefits of dual decomposition and augmented Lagrangian methods for constrained optimization. It turns out to be equivalent or closely related to many other algorithms\r\nas well, such as Douglas-Rachford splitting from numerical analysis, Spingarn\u2019s method of partial inverses, Dykstra\u2019s alternating projections method, Bregman iterative algorithms for l1 problems in signal processing, proximal methods, and many others.\r\n\r\nText Source: [https:\/\/stanford.edu\/~boyd\/papers\/pdf\/admm_distr_stats.pdf](https:\/\/stanford.edu\/~boyd\/papers\/pdf\/admm_distr_stats.pdf)\r\n\r\nImage Source: [here](https:\/\/www.slideshare.net\/derekcypang\/alternating-direction)","119":"**Natural Gradient Descent** is an approximate second-order optimisation method. It has an interpretation as optimizing over a Riemannian manifold using an intrinsic distance metric, which implies the updates are invariant to transformations such as whitening. 
By using the positive semi-definite (PSD) Gauss-Newton matrix to approximate the (possibly negative definite) Hessian, NGD can often work better than exact second-order methods.\r\n\r\nGiven the gradient of $z$, $g = \\frac{\\delta{f}\\left(z\\right)}{\\delta{z}}$, NGD computes the update as:\r\n\r\n$$\\Delta{z} = \\alpha{F}^{\u22121}g$$\r\n\r\nwhere the Fisher information matrix $F$ is defined as:\r\n\r\n$$ F = \\mathbb{E}\\_{p\\left(t\\mid{z}\\right)}\\left[\\nabla\\ln{p}\\left(t\\mid{z}\\right)\\nabla\\ln{p}\\left(t\\mid{z}\\right)^{T}\\right] $$\r\n\r\nThe log-likelihood function $\\ln{p}\\left(t\\mid{z}\\right)$ typically corresponds to commonly used error functions such as the cross entropy loss.\r\n\r\nSource: [LOGAN](https:\/\/paperswithcode.com\/method\/logan)\r\n\r\nImage: [Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks\r\n](https:\/\/arxiv.org\/abs\/1905.10961)","120":"An **autoencoder** is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal \u201cnoise\u201d. Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. \r\n\r\nExtracted from: [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Autoencoder)\r\n\r\nImage source: [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Autoencoder#\/media\/File:Autoencoder_schema.png)","121":"**Embeddings from Language Models**, or **ELMo**, is a type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.\r\n\r\nA biLM combines both a forward and backward LM. ELMo jointly maximizes the log likelihood of the forward and backward directions. To add ELMo to a supervised model, we freeze the weights of the biLM and then concatenate the ELMo vector $\\textbf{ELMO}^{task}_k$ with $\\textbf{x}_k$ and pass the ELMO enhanced representation $[\\textbf{x}_k; \\textbf{ELMO}^{task}_k]$ into the task RNN. Here $\\textbf{x}_k$ is a context-independent token representation for each token position. \r\n\r\nImage Source: [here](https:\/\/medium.com\/@duyanhnguyen_38925\/create-a-strong-text-classification-with-the-help-from-elmo-e90809ba29da)","122":"A **Variational Autoencoder** is a type of likelihood-based generative model. It consists of an encoder, that takes in data $x$ as input and transforms this into a latent representation $z$, and a decoder, that takes a latent representation $z$ and returns a reconstruction $\\hat{x}$. Inference is performed via variational inference to approximate the posterior of the model.","123":"**Adaptive Training Sample Selection**, or **ATSS**, is a method to automatically select positive and negative samples according to statistical characteristics of object. It bridges the gap between anchor-based and anchor-free detectors. \r\n\r\nFor each ground-truth box $g$ on the image, we first find out its candidate positive samples. 
As described in Line $3$ to $6$, on each pyramid level, we select $k$ anchor boxes whose center are closest to the center of $g$ based on L2 distance. Supposing there are $\\mathcal{L}$ feature pyramid levels, the ground-truth box $g$ will have $k\\times\\mathcal{L}$ candidate positive samples. After that, we compute the IoU between these candidates and the ground-truth $g$ as $\\mathcal{D}_g$ in Line $7$, whose mean and standard deviation are computed as $m_g$ and $v_g$ in Line $8$ and Line $9$. With these statistics, the IoU threshold for this ground-truth $g$ is obtained as $t_g=m_g+v_g$ in Line $10$. Finally, we select these candidates whose IoU are greater than or equal to the threshold $t_g$ as final positive samples in Line $11$ to $15$. \r\n\r\nNotably ATSS also limits the positive samples' center to the ground-truth box as shown in Line $12$. Besides, if an anchor box is assigned to multiple ground-truth boxes, the one with the highest IoU will be selected. The rest are negative samples.","124":"A **Focal Loss** function addresses class imbalance during training in tasks like object detection. Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard misclassified examples. It is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. \r\n\r\nFormally, the Focal Loss adds a factor $(1 - p\\_{t})^\\gamma$ to the standard cross entropy criterion. Setting $\\gamma>0$ reduces the relative loss for well-classified examples ($p\\_{t}>.5$), putting more focus on hard, misclassified examples. Here there is tunable *focusing* parameter $\\gamma \\ge 0$. \r\n\r\n$$ {\\text{FL}(p\\_{t}) = - (1 - p\\_{t})^\\gamma \\log\\left(p\\_{t}\\right)} $$","125":"**Xception** is a convolutional neural network architecture that relies solely on [depthwise separable convolution](https:\/\/paperswithcode.com\/method\/depthwise-separable-convolution) layers.","126":"**Spectral Normalization** is a normalization technique used for generative adversarial networks, used to stabilize training of the discriminator. Spectral normalization has the convenient property that the Lipschitz constant is the only hyper-parameter to be tuned.\r\n\r\nIt controls the Lipschitz constant of the discriminator $f$ by constraining the spectral norm of each layer $g : \\textbf{h}\\_{in} \\rightarrow \\textbf{h}_{out}$. The Lipschitz norm $\\Vert{g}\\Vert\\_{\\text{Lip}}$ is equal to $\\sup\\_{\\textbf{h}}\\sigma\\left(\\nabla{g}\\left(\\textbf{h}\\right)\\right)$, where $\\sigma\\left(a\\right)$ is the spectral norm of the matrix $A$ ($L\\_{2}$ matrix norm of $A$):\r\n\r\n$$ \\sigma\\left(a\\right) = \\max\\_{\\textbf{h}:\\textbf{h}\\neq{0}}\\frac{\\Vert{A\\textbf{h}}\\Vert\\_{2}}{\\Vert\\textbf{h}\\Vert\\_{2}} = \\max\\_{\\Vert\\textbf{h}\\Vert\\_{2}\\leq{1}}{\\Vert{A\\textbf{h}}\\Vert\\_{2}} $$\r\n\r\nwhich is equivalent to the largest singular value of $A$. Therefore for a [linear layer](https:\/\/paperswithcode.com\/method\/linear-layer) $g\\left(\\textbf{h}\\right) = W\\textbf{h}$ the norm is given by $\\Vert{g}\\Vert\\_{\\text{Lip}} = \\sup\\_{\\textbf{h}}\\sigma\\left(\\nabla{g}\\left(\\textbf{h}\\right)\\right) = \\sup\\_{\\textbf{h}}\\sigma\\left(W\\right) = \\sigma\\left(W\\right) $. 
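In practice the spectral norm $\\sigma\\left(W\\right)$ is usually estimated with a few steps of power iteration rather than a full SVD; a rough NumPy sketch of that estimate (an illustrative implementation choice, not prescribed by the description above):

```python
# Rough sketch: approximate sigma(W), the largest singular value of a weight
# matrix, with power iteration, and compare against an exact SVD.
import numpy as np

def spectral_norm(W, n_iter=20, eps=1e-12):
    u = np.random.randn(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v) + eps
        u = W @ v
        u /= np.linalg.norm(u) + eps
    return float(u @ W @ v)  # approximately sigma(W)

W = np.random.randn(64, 128)
print(spectral_norm(W), np.linalg.svd(W, compute_uv=False)[0])  # should roughly agree
```

Dividing $W$ by this estimate gives the normalized weight defined next.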
Spectral normalization normalizes the spectral norm of the weight matrix $W$ so it satisfies the Lipschitz constraint $\\sigma\\left(W\\right) = 1$:\r\n\r\n$$ \\bar{W}\\_{\\text{SN}}\\left(W\\right) = W \/ \\sigma\\left(W\\right) $$","127":"A **Feature Pyramid Network**, or **FPN**, is a feature extractor that takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. This process is independent of the backbone convolutional architectures. It therefore acts as a generic solution for building feature pyramids inside deep convolutional networks to be used in tasks like object detection.\r\n\r\nThe construction of the pyramid involves a bottom-up pathway and a top-down pathway.\r\n\r\nThe bottom-up pathway is the feedforward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. For the feature\r\npyramid, one pyramid level is defined for each stage. The output of the last layer of each stage is used as a reference set of feature maps. For [ResNets](https:\/\/paperswithcode.com\/method\/resnet) we use the feature activations output by each stage\u2019s last [residual block](https:\/\/paperswithcode.com\/method\/residual-block). \r\n\r\nThe top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.","128":"**Hybrid Task Cascade**, or **HTC**, is a framework for cascading in instance segmentation. It differs from [Cascade Mask R-CNN](https:\/\/paperswithcode.com\/method\/cascade-mask-r-cnn) in two important aspects: (1) instead of performing cascaded refinement on the two tasks of detection and segmentation separately, it interweaves them for a joint multi-stage processing; (2) it adopts a fully convolutional branch to provide spatial context, which can help distinguish hard foreground from cluttered background.","129":"**$k$-Nearest Neighbors** is a non-parametric algorithm for classification and regression. It is a type of instance-based learning as it does not attempt to construct a general internal model, but simply stores instances of the training data. Prediction is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.\r\n\r\nSource of Description and Image: [scikit-learn](https:\/\/scikit-learn.org\/stable\/modules\/neighbors.html#classification)","130":"**GloVe Embeddings** are a type of word embedding that encode the co-occurrence probability ratio between two words as vector differences. 
GloVe uses a weighted least squares objective $J$ that minimizes the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences:\r\n\r\n$$ J=\\sum\\_{i, j=1}^{V}f\\left(\ud835\udc4b\\_{i j}\\right)(w^{T}\\_{i}\\tilde{w}_{j} + b\\_{i} + \\tilde{b}\\_{j} - \\log{\ud835\udc4b}\\_{ij})^{2} $$\r\n\r\nwhere $w\\_{i}$ and $b\\_{i}$ are the word vector and bias respectively of word $i$, $\\tilde{w}_{j}$ and $b\\_{j}$ are the context word vector and bias respectively of word $j$, $X\\_{ij}$ is the number of times word $i$ occurs in the context of word $j$, and $f$ is a weighting function that assigns lower weights to rare and frequent co-occurrences.","131":"**BLANC** is an automatic estimation approach for document summary quality. The goal is to measure the functional performance of a summary with an objective, reproducible, and fully automated method. BLANC achieves this by measuring the performance boost gained by a pre-trained language model with access to a document summary while carrying out its language understanding task on the document's text.","132":"NeRF represents a scene with learned, continuous volumetric radiance field $F_\\theta$ defined over a bounded 3D volume. In a NeRF, $F_\\theta$ is a multilayer perceptron (MLP) that takes as input a 3D position $x = (x, y, z)$ and unit-norm viewing direction $d = (dx, dy, dz)$, and produces as output a density $\\sigma$ and color $c = (r, g, b)$. The weights of the multilayer perceptron that parameterize $F_\\theta$ are optimized so as to encode the radiance field of the scene. Volume rendering is used to compute the color of a single pixel.","133":"**Gaussian Processes** are non-parametric models for approximating functions. They rely upon a measure of similarity between points (the kernel function) to predict the value for an unseen point from training data. The models are fully probabilistic so uncertainty bounds are baked in with the model.\r\n\r\nImage Source: Gaussian Processes for Machine Learning, C. E. Rasmussen & C. K. I. Williams","134":"Documents often exhibit various forms of degradation, which make it hard to be read and substantially deteriorate the\r\nperformance of an OCR system. In this paper, we propose an effective end-to-end framework named Document Enhancement\r\nGenerative Adversarial Networks (DE-GAN) that uses the conditional GANs (cGANs) to restore severely degraded document images.\r\nTo the best of our knowledge, this practice has not been studied within the context of generative adversarial deep networks. We\r\ndemonstrate that, in different tasks (document clean up, binarization, deblurring and watermark removal), DE-GAN can produce an\r\nenhanced version of the degraded document with a high quality. In addition, our approach provides consistent improvements compared to state-of-the-art methods over the widely used DIBCO 2013, DIBCO 2017 and H-DIBCO 2018 datasets, proving its ability to restore a degraded document image to its ideal condition. The obtained results on a wide variety of degradation reveal the flexibility of the proposed model to be exploited in other document enhancement problems.","135":"**k-Means Clustering** is a clustering algorithm that divides a training set into $k$ different clusters of examples that are near each other. 
It works by initializing $k$ different centroids {$\\mu\\left(1\\right),\\ldots,\\mu\\left(k\\right)$} to different values, then alternating between two steps until convergence:\r\n\r\n(i) each training example is assigned to cluster $i$ where $i$ is the index of the nearest centroid $\\mu^{(i)}$\r\n\r\n(ii) each centroid $\\mu^{(i)}$ is updated to the mean of all training examples $x^{(j)}$ assigned to cluster $i$.\r\n\r\nText Source: Deep Learning, Goodfellow et al\r\n\r\nImage Source: [scikit-learn](https:\/\/paperswithcode.com\/method\/yolov2 was not here; see original link)","136":"**Darknet-19** is a convolutional neural network that is used as the backbone of [YOLOv2](https:\/\/paperswithcode.com\/method\/yolov2). Similar to the [VGG](https:\/\/paperswithcode.com\/method\/vgg) models it mostly uses $3 \\times 3$ filters and doubles the number of channels after every pooling step. Following the work on Network in Network (NIN) it uses [global average pooling](https:\/\/paperswithcode.com\/method\/global-average-pooling) to make predictions as well as $1 \\times 1$ filters to compress the feature representation between $3 \\times 3$ convolutions. [Batch Normalization](https:\/\/paperswithcode.com\/method\/batch-normalization) is used to stabilize training, speed up convergence, and regularize the model.","137":"**Darknet-53** is a convolutional neural network that acts as a backbone for the [YOLOv3](https:\/\/paperswithcode.com\/method\/yolov3) object detection approach. The improvements upon its predecessor [Darknet-19](https:\/\/paperswithcode.com\/method\/darknet-19) include the use of residual connections, as well as more layers.","138":"**YOLOv3** is a real-time, single-stage object detection model that builds on [YOLOv2](https:\/\/paperswithcode.com\/method\/yolov2) with several improvements. Improvements include the use of a new backbone network, [Darknet-53](https:\/\/paperswithcode.com\/method\/darknet-53) that utilises residual connections, or in the words of the author, \"those newfangled residual network stuff\", as well as some improvements to the bounding box prediction step, and use of three different scales from which to extract features (similar to an [FPN](https:\/\/paperswithcode.com\/method\/fpn)).","139":"**YOLOv2**, or [**YOLO9000**](https:\/\/www.youtube.com\/watch?v=QsDDXSmGJZA), is a single-stage real-time object detection model. It improves upon [YOLOv1](https:\/\/paperswithcode.com\/method\/yolov1) in several ways, including the use of [Darknet-19](https:\/\/paperswithcode.com\/method\/darknet-19) as a backbone, [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization), a high-resolution classifier, and anchor boxes to predict bounding boxes, among other changes.","140":"A **3D Convolution** is a type of [convolution](https:\/\/paperswithcode.com\/method\/convolution) where the kernel slides in 3 dimensions as opposed to 2 dimensions with 2D convolutions. One example use case is medical imaging where a model is constructed using 3D image slices. Additionally, video data has a temporal dimension that images lack, making it suitable for this module. \r\n\r\nImage: Lung nodule detection based on 3D convolutional neural networks, Fan et al","141":"**Early Stopping** is a regularization technique for deep neural networks that stops training when parameter updates no longer yield improvements on a validation set. 
In essence, we store and update the current best parameters during training, and when parameter updates no longer yield an improvement (after a set number of iterations) we stop training and use the last best parameters. It works as a regularizer by restricting the optimization procedure to a smaller volume of parameter space.\r\n\r\nImage Source: [Ramazan Gen\u00e7ay](https:\/\/www.researchgate.net\/figure\/Early-stopping-based-on-cross-validation_fig1_3302948)","142":"A **ResNeXt Block** is a type of [residual block](https:\/\/paperswithcode.com\/method\/residual-block) used as part of the [ResNeXt](https:\/\/paperswithcode.com\/method\/resnext) CNN architecture. It uses a \"split-transform-merge\" strategy (branched paths within a single module) similar to an [Inception module](https:\/\/paperswithcode.com\/method\/inception-module), i.e. it aggregates a set of transformations. Compared to a Residual Block, it exposes a new dimension, *cardinality* (size of set of transformations) $C$, as an essential factor in addition to depth and width. \r\n\r\nFormally, a set of aggregated transformations can be represented as: $\\mathcal{F}(x)=\\sum_{i=1}^{C}\\mathcal{T}_i(x)$, where $\\mathcal{T}_i(x)$ can be an arbitrary function. Analogous to a simple neuron, $\\mathcal{T}_i$ should project $x$ into an (optionally low-dimensional) embedding and then transform it.","143":"A **ResNeXt** repeats a building block that aggregates a set of transformations with the same topology. Compared to a [ResNet](https:\/\/paperswithcode.com\/method\/resnet), it exposes a new dimension, *cardinality* (the size of the set of transformations) $C$, as an essential factor in addition to the dimensions of depth and width. \r\n\r\nFormally, a set of aggregated transformations can be represented as: $\\mathcal{F}(x)=\\sum_{i=1}^{C}\\mathcal{T}_i(x)$, where $\\mathcal{T}_i(x)$ can be an arbitrary function. Analogous to a simple neuron, $\\mathcal{T}_i$ should project $x$ into an (optionally low-dimensional) embedding and then transform it.","144":"**Linear Regression** is a method for modelling a relationship between a dependent variable and independent variables. These models can be fit with numerous approaches. The most common is *least squares*, where we minimize the mean square error between the predicted values $\\hat{y} = \\textbf{X}\\hat{\\beta}$ and actual values $y$: $\\left(y-\\textbf{X}\\beta\\right)^{2}$.\r\n\r\nWe can also define the problem in probabilistic terms as a generalized linear model (GLM) where the pdf is a Gaussian distribution, and then perform maximum likelihood estimation to estimate $\\hat{\\beta}$.\r\n\r\nImage Source: [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Linear_regression)","145":"**Swish** is an activation function, $f(x) = x \\cdot \\text{sigmoid}(\\beta x)$, where $\\beta$ is a learnable parameter. 
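A minimal NumPy sketch of this definition (with $\\beta$ fixed to a constant here purely for illustration):

```python
# Minimal sketch of Swish: f(x) = x * sigmoid(beta * x).
import numpy as np

def swish(x, beta=1.0):
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

print(swish(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))
```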
Nearly all implementations do not use the learnable parameter $\\beta$, in which case the activation function is $x\\sigma(x)$ (\"Swish-1\").\r\n\r\nThe function $x\\sigma(x)$ is exactly the [SiLU](https:\/\/paperswithcode.com\/method\/silu), which was introduced by other authors before the swish.\r\nSee [Gaussian Error Linear Units](https:\/\/arxiv.org\/abs\/1606.08415) ([GELUs](https:\/\/paperswithcode.com\/method\/gelu)) where the SiLU (Sigmoid Linear Unit) was originally coined, and see [Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning](https:\/\/arxiv.org\/abs\/1702.03118) and [Swish: a Self-Gated Activation Function](https:\/\/arxiv.org\/abs\/1710.05941v1) where the same activation function was experimented with later.","146":"**RMSProp** is an unpublished adaptive learning rate optimizer [proposed by Geoff Hinton](http:\/\/www.cs.toronto.edu\/~tijmen\/csc321\/slides\/lecture_slides_lec6.pdf). The motivation is that the magnitude of gradients can differ for different weights, and can change during learning, making it hard to choose a single global learning rate. RMSProp tackles this by keeping a moving average of the squared gradient and adjusting the weight updates by this magnitude. The gradient updates are performed as:\r\n\r\n$$E\\left[g^{2}\\right]\\_{t} = \\gamma E\\left[g^{2}\\right]\\_{t-1} + \\left(1 - \\gamma\\right) g^{2}\\_{t}$$\r\n\r\n$$\\theta\\_{t+1} = \\theta\\_{t} - \\frac{\\eta}{\\sqrt{E\\left[g^{2}\\right]\\_{t} + \\epsilon}}g\\_{t}$$\r\n\r\nHinton suggests $\\gamma=0.9$, with a good default for $\\eta$ as $0.001$.\r\n\r\nImage: [Alec Radford](https:\/\/twitter.com\/alecrad)","147":"**EfficientNet** is a convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth\/width\/resolution using a *compound coefficient*. Unlike conventional practice that arbitrary scales these factors, the EfficientNet scaling method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. For example, if we want to use $2^N$ times more computational resources, then we can simply increase the network depth by $\\alpha ^ N$, width by $\\beta ^ N$, and image size by $\\gamma ^ N$, where $\\alpha, \\beta, \\gamma$ are constant coefficients determined by a small grid search on the original small model. EfficientNet uses a compound coefficient $\\phi$ to uniformly scales network width, depth, and resolution in a principled way.\r\n\r\nThe compound scaling method is justified by the intuition that if the input image is bigger, then the network needs more layers to increase the receptive field and more channels to capture more fine-grained patterns on the bigger image.\r\n\r\nThe base EfficientNet-B0 network is based on the inverted bottleneck residual blocks of [MobileNetV2](https:\/\/paperswithcode.com\/method\/mobilenetv2), in addition to squeeze-and-excitation blocks.\r\n\r\n EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.","148":"We propose to theoretically and empirically examine the effect of incorporating weighting schemes into walk-aggregating GNNs. To this end, we propose a simple, interpretable, and end-to-end supervised GNN model, called AWARE (Attentive Walk-Aggregating GRaph Neural NEtwork), for graph-level prediction. 
AWARE aggregates the walk information by means of weighting schemes at distinct levels (vertex-, walk-, and graph-level) in a principled manner. By virtue of the incorporated weighting schemes at these different levels, AWARE can emphasize the information important for prediction while diminishing the irrelevant ones\u2014leading to representations that can improve learning performance.","149":"**Trust Region Policy Optimization**, or **TRPO**, is a policy gradient method in reinforcement learning that avoids parameter updates that change the policy too much with a KL divergence constraint on the size of the policy update at each iteration.\r\n\r\nTake the case of off-policy reinforcement learning, where the policy $\\beta$ for collecting trajectories on rollout workers is different from the policy $\\pi$ to optimize for. The objective function in an off-policy model measures the total advantage over the state visitation distribution and actions, while the mismatch between the training data distribution and the true policy state distribution is compensated with an importance sampling estimator:\r\n\r\n$$ J\\left(\\theta\\right) = \\sum\\_{s\\in{S}}p^{\\pi\\_{\\theta\\_{old}}}\\sum\\_{a\\in\\mathcal{A}}\\left(\\pi\\_{\\theta}\\left(a\\mid{s}\\right)\\hat{A}\\_{\\theta\\_{old}}\\left(s, a\\right)\\right) $$\r\n\r\n$$ J\\left(\\theta\\right) = \\sum\\_{s\\in{S}}p^{\\pi\\_{\\theta\\_{old}}}\\sum\\_{a\\in\\mathcal{A}}\\left(\\beta\\left(a\\mid{s}\\right)\\frac{\\pi\\_{\\theta}\\left(a\\mid{s}\\right)}{\\beta\\left(a\\mid{s}\\right)}\\hat{A}\\_{\\theta\\_{old}}\\left(s, a\\right)\\right) $$\r\n\r\n$$ J\\left(\\theta\\right) = \\mathbb{E}\\_{s\\sim{p}^{\\pi\\_{\\theta\\_{old}}}, a\\sim{\\beta}} \\left(\\frac{\\pi\\_{\\theta}\\left(a\\mid{s}\\right)}{\\beta\\left(a\\mid{s}\\right)}\\hat{A}\\_{\\theta\\_{old}}\\left(s, a\\right)\\right)$$\r\n\r\nWhen training on policy, theoretically the policy for collecting data is same as the policy that we want to optimize. However, when rollout workers and optimizers are running in parallel asynchronously, the behavior policy can get stale. TRPO considers this subtle difference: It labels the behavior policy as $\\pi\\_{\\theta\\_{old}}\\left(a\\mid{s}\\right)$ and thus the objective function becomes:\r\n\r\n$$ J\\left(\\theta\\right) = \\mathbb{E}\\_{s\\sim{p}^{\\pi\\_{\\theta\\_{old}}}, a\\sim{\\pi\\_{\\theta\\_{old}}}} \\left(\\frac{\\pi\\_{\\theta}\\left(a\\mid{s}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\mid{s}\\right)}\\hat{A}\\_{\\theta\\_{old}}\\left(s, a\\right)\\right)$$\r\n\r\nTRPO aims to maximize the objective function $J\\left(\\theta\\right)$ subject to a trust region constraint which enforces the distance between old and new policies measured by KL-divergence to be small enough, within a parameter $\\delta$:\r\n\r\n$$ \\mathbb{E}\\_{s\\sim{p}^{\\pi\\_{\\theta\\_{old}}}} \\left[D\\_{KL}\\left(\\pi\\_{\\theta\\_{old}}\\left(.\\mid{s}\\right)\\mid\\mid\\pi\\_{\\theta}\\left(.\\mid{s}\\right)\\right)\\right] \\leq \\delta$$","150":"A **Linear Layer** is a projection $\\mathbf{XW + b}$.","151":"A capsule is an activation vector that basically executes on its inputs some complex internal\r\ncomputations. Length of these activation vectors signifies the\r\nprobability of availability of a feature. Furthermore, the condition\r\nof the recognized element is encoded as the direction in which\r\nthe vector is pointing. 
In traditional, CNN uses Max pooling for\r\ninvariance activities of neurons, which is nothing except a minor\r\nchange in input and the neurons of output signal will remains\r\nsame.","152":"**FAVOR+**, or **Fast Attention Via Positive Orthogonal Random Features**, is an efficient attention mechanism used in the [Performer](https:\/\/paperswithcode.com\/method\/performer) architecture which leverages approaches such as kernel methods and random features approximation for approximating [softmax](https:\/\/paperswithcode.com\/method\/softmax) and Gaussian kernels. \r\n\r\nFAVOR+ works for attention blocks using matrices $\\mathbf{A} \\in \\mathbb{R}^{L\u00d7L}$ of the form $\\mathbf{A}(i, j) = K(\\mathbf{q}\\_{i}^{T}, \\mathbf{k}\\_{j}^{T})$, with $\\mathbf{q}\\_{i}\/\\mathbf{k}\\_{j}$ standing for the $i^{th}\/j^{th}$ query\/key row-vector in $\\mathbf{Q}\/\\mathbf{K}$ and kernel $K : \\mathbb{R}^{d } \u00d7 \\mathbb{R}^{d} \\rightarrow \\mathbb{R}\\_{+}$ defined for the (usually randomized) mapping: $\\phi : \\mathbb{R}^{d } \u2192 \\mathbb{R}^{r}\\_{+}$ (for some $r > 0$) as:\r\n\r\n$$K(\\mathbf{x}, \\mathbf{y}) = E[\\phi(\\mathbf{x})^{T}\\phi(\\mathbf{y})] $$\r\n\r\nWe call $\\phi(\\mathbf{u})$ a random feature map for $\\mathbf{u} \\in \\mathbb{R}^{d}$ . For $\\mathbf{Q}^{'}, \\mathbf{K}^{'} \\in \\mathbb{R}^{L \\times r}$ with rows given as $\\phi(\\mathbf{q}\\_{i}^{T})^{T}$ and $\\phi(\\mathbf{k}\\_{i}^{T})^{T}$ respectively, this leads directly to the efficient attention mechanism of the form:\r\n\r\n$$ \\hat{Att\\_{\\leftrightarrow}}\\left(\\mathbf{Q}, \\mathbf{K}, \\mathbf{V}\\right) = \\hat{\\mathbf{D}}^{-1}(\\mathbf{Q^{'}}((\\mathbf{K^{'}})^{T}\\mathbf{V}))$$\r\n\r\nwhere\r\n\r\n$$\\mathbf{\\hat{D}} = \\text{diag}(\\mathbf{Q^{'}}((\\mathbf{K^{'}})\\mathbf{1}\\_{L})) $$\r\n\r\nThe above scheme constitutes the [FA](https:\/\/paperswithcode.com\/method\/dfa)-part of the FAVOR+ mechanism. The other parts are achieved by:\r\n\r\n- The R part : The softmax kernel is approximated though trigonometric functions, in the form of a regularized softmax-kernel SMREG, that employs positive random features (PRFs).\r\n- The OR+ part : To reduce the variance of the estimator, so we can use a smaller number of random features, different samples are entangled to be exactly orthogonal using the Gram-Schmidt orthogonalization procedure.\r\n\r\nThe details are quite technical, so it is recommended you read the paper for further information on these steps.","153":"**Performer** is a [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers) architectures which can estimate regular ([softmax](https:\/\/paperswithcode.com\/method\/softmax)) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. To approximate softmax attention-kernels, Performers use a Fast Attention Via positive Orthogonal Random features approach (FAVOR+), leveraging new methods for approximating softmax and Gaussian kernels.","154":"A **Graph Convolutional Network**, or **GCN**, is an approach for semi-supervised learning on graph-structured data. 
It is based on an efficient variant of [convolutional neural networks](https:\/\/paperswithcode.com\/methods\/category\/convolutional-neural-networks) which operate directly on graphs. The choice of convolutional architecture is motivated via a localized first-order approximation of spectral graph convolutions. The model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes.","155":"**DistilBERT** is a small, fast, cheap and light [Transformer](https:\/\/paperswithcode.com\/method\/transformer) model based on the [BERT](https:\/\/paperswithcode.com\/method\/bert) architecture. Knowledge distillation is performed during the pre-training phase to reduce the size of a BERT model by 40%. To leverage the inductive biases learned by larger models during pre-training, the authors introduce a triple loss combining language modeling, distillation and cosine-distance losses.","156":"**XLM** is a [Transformer](https:\/\/paperswithcode.com\/method\/transformer) based architecture that is pre-trained using one of three language modelling objectives:\r\n\r\n1. Causal Language Modeling - models the probability of a word given the previous words in a sentence.\r\n2. Masked Language Modeling - the masked language modeling objective of [BERT](https:\/\/paperswithcode.com\/method\/bert).\r\n3. Translation Language Modeling - a (new) translation language modeling objective for improving cross-lingual pre-training.\r\n\r\nThe authors find that both the CLM and MLM approaches provide strong cross-lingual features that can be used for pretraining models.","157":"**Weight Demodulation** is an alternative to [adaptive instance normalization](https:\/\/paperswithcode.com\/method\/adaptive-instance-normalization) for use in generative adversarial networks; specifically, it is introduced in [StyleGAN2](https:\/\/paperswithcode.com\/method\/stylegan2). The purpose of [instance normalization](https:\/\/paperswithcode.com\/method\/instance-normalization) is to remove the effect of $s$ - the scales of the feature maps - from the statistics of the [convolution](https:\/\/paperswithcode.com\/method\/convolution)\u2019s output feature maps. Weight demodulation tries to achieve this goal more directly. Assuming that input activations are i.i.d. random variables with unit standard deviation, after modulation and convolution the output activations have a standard deviation of:\r\n\r\n$$ \\sigma\\_{j} = \\sqrt{{\\sum\\_{i,k}w\\_{ijk}'}^{2}} $$\r\n\r\ni.e., the outputs are scaled by the $L\\_{2}$ norm of the corresponding weights. The subsequent normalization aims to restore the outputs back to unit standard deviation. This can be achieved if we scale (\u201cdemodulate\u201d) each output feature map $j$ by $1\/\\sigma\\_{j}$. Alternatively, we can again bake this into the convolution weights:\r\n\r\n$$ w''\\_{ijk} = w'\\_{ijk} \/ \\sqrt{{\\sum\\_{i, k}w'\\_{ijk}}^{2} + \\epsilon} $$\r\n\r\nwhere $\\epsilon$ is a small constant to avoid numerical issues.","158":"**$R\\_{1}$ Regularization** is a regularization technique and gradient penalty for training [generative adversarial networks](https:\/\/paperswithcode.com\/methods\/category\/generative-adversarial-networks). 
It penalizes the discriminator from deviating from the Nash Equilibrium via penalizing the gradient on real data alone: when the generator distribution produces the true data distribution and the discriminator is equal to 0 on the data manifold, the gradient penalty ensures that the discriminator cannot create a non-zero gradient orthogonal to the data manifold without suffering a loss in the [GAN](https:\/\/paperswithcode.com\/method\/gan) game.\r\n\r\nThis leads to the following regularization term:\r\n\r\n$$ R\\_{1}\\left(\\psi\\right) = \\frac{\\gamma}{2}E\\_{p\\_{D}\\left(x\\right)}\\left[||\\nabla{D\\_{\\psi}\\left(x\\right)}||^{2}\\right] $$","159":"**Path Length Regularization** is a type of regularization for [generative adversarial networks](https:\/\/paperswithcode.com\/methods\/category\/generative-adversarial-networks) that encourages good conditioning in the mapping from latent codes to images. The idea is to encourage that a fixed-size step in the latent space $\\mathcal{W}$ results in a non-zero, fixed-magnitude change in the image.\r\n\r\nWe can measure the deviation from this ideal empirically by stepping into random directions in the image space and observing the corresponding $\\mathbf{w}$ gradients. These gradients should have close to an equal length regardless of $\\mathbf{w}$ or the image-space direction, indicating that the mapping from the latent space to image space is well-conditioned.\r\n\r\nAt a single $\\mathbf{w} \\in \\mathcal{W}$ the local metric scaling properties of the generator mapping $g\\left(\\mathbf{w}\\right) : \\mathcal{W} \\rightarrow \\mathcal{Y}$ are captured by the Jacobian matrix $\\mathbf{J\\_{w}} = \\delta{g}\\left(\\mathbf{w}\\right)\/\\delta{\\mathbf{w}}$. Motivated by the desire to preserve the expected lengths of vectors regardless of the direction, we formulate the regularizer as:\r\n\r\n$$ \\mathbb{E}\\_{\\mathbf{w},\\mathbf{y} \\sim \\mathcal{N}\\left(0, \\mathbf{I}\\right)} \\left(||\\mathbf{J}^{\\mathbf{T}}\\_{\\mathbf{w}}\\mathbf{y}||\\_{2} - a\\right)^{2} $$\r\n\r\nwhere $y$ are random images with normally distributed pixel intensities, and $w \\sim f\\left(z\\right)$, where $z$ are normally distributed. \r\n\r\nTo avoid explicit computation of the Jacobian matrix, we use the identity $\\mathbf{J}^{\\mathbf{T}}\\_{\\mathbf{w}}\\mathbf{y} = \\nabla\\_{\\mathbf{w}}\\left(g\\left(\\mathbf{w}\\right)\u00b7y\\right)$, which is efficiently computable using standard backpropagation. The constant $a$ is set dynamically during optimization as the long-running exponential moving average of the lengths $||\\mathbf{J}^{\\mathbf{T}}\\_{\\mathbf{w}}\\mathbf{y}||\\_{2}$, allowing the optimization to find a suitable global scale by itself.\r\n\r\nThe authors note that they find that path length regularization leads to more reliable and consistently behaving models, making architecture exploration easier. They also observe that the smoother generator is significantly easier to invert.","160":"**StyleGAN2** is a generative adversarial network that builds on [StyleGAN](https:\/\/paperswithcode.com\/method\/stylegan) with several improvements. First, [adaptive instance normalization](https:\/\/paperswithcode.com\/method\/adaptive-instance-normalization) is redesigned and replaced with a normalization technique called [weight demodulation](https:\/\/paperswithcode.com\/method\/weight-demodulation). 
Secondly, an improved training scheme upon progressively growing is introduced, which achieves the same goal - training starts by focusing on low-resolution images and then progressively shifts focus to higher and higher resolutions - without changing the network topology during training. Additionally, new types of regularization like lazy regularization and [path length regularization](https:\/\/paperswithcode.com\/method\/path-length-regularization) are proposed.","161":"A **Denoising Autoencoder** is a modification on the [autoencoder](https:\/\/paperswithcode.com\/method\/autoencoder) to prevent the network learning the identity function. Specifically, if the autoencoder is too big, then it can just learn the data, so the output equals the input, and does not perform any useful representation learning or dimensionality reduction. Denoising autoencoders solve this problem by corrupting the input data on purpose, adding noise or masking some of the input values.\r\n\r\nImage Credit: [Kumar et al](https:\/\/www.semanticscholar.org\/paper\/Static-hand-gesture-recognition-using-stacked-Kumar-Nandi\/5191ddf3f0841c89ba9ee592a2f6c33e4a40d4bf)","162":"**Deformable convolutions** add 2D offsets to the regular grid sampling locations in the standard [convolution](https:\/\/paperswithcode.com\/method\/convolution). It enables free form deformation of the sampling grid. The offsets are learned from the preceding feature maps, via additional convolutional layers. Thus, the deformation is conditioned on the input features in a local, dense, and adaptive manner.","163":"**Scaled Exponential Linear Units**, or **SELUs**, are activation functions that induce self-normalizing properties.\r\n\r\nThe SELU activation function is given by \r\n\r\n$$f\\left(x\\right) = \\lambda{x} \\text{ if } x \\geq{0}$$\r\n$$f\\left(x\\right) = \\lambda{\\alpha\\left(\\exp\\left(x\\right) -1 \\right)} \\text{ if } x < 0 $$\r\n\r\nwith $\\alpha \\approx 1.6733$ and $\\lambda \\approx 1.0507$.","164":"**Self-normalizing neural networks** (**SNNs**) are a type of neural architecture that aim to enable high-level abstract representations. While batch normalization requires explicit normalization, neuron activations of SNNs automatically converge towards zero mean and unit variance. The activation function of SNNs are \u201cscaled exponential linear units\u201d (SELUs), which induce self-normalizing properties. Using the Banach fixed point theorem, it's possible to prove that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance \u2014 even under the presence of noise and perturbations. 
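As a toy numerical check of this behaviour (not the formal proof), one can push random inputs through many SELU layers with zero-mean Gaussian weights whose variance is the reciprocal of the fan-in (an illustrative weight scaling assumed here) and watch the activation statistics:

```python
# Toy check: activations stay near zero mean and unit variance across many
# SELU layers when weights are Gaussian with variance 1/fan_in (illustrative).
import numpy as np

alpha, lam = 1.6733, 1.0507

def selu(x):
    return lam * np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 256))
for _ in range(32):
    W = rng.standard_normal((256, 256)) / np.sqrt(256)
    x = selu(x @ W)
print(x.mean(), x.std())  # both should remain close to 0 and 1
```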
This convergence property of SNNs allows to (1) train deep networks with many layers, (2) employ strong regularization schemes, and (3) to make learning highly robust.","165":"**InfoNCE**, where NCE stands for Noise-Contrastive Estimation, is a type of contrastive loss function used for [self-supervised learning](https:\/\/paperswithcode.com\/methods\/category\/self-supervised-learning).\r\n\r\nGiven a set $X = ${$x\\_{1}, \\dots, x\\_{N}$} of $N$ random samples containing one positive sample from $p\\left(x\\_{t+k}|c\\_{t}\\right)$ and $N \u2212 1$ negative samples from the 'proposal' distribution $p\\left(x\\_{t+k}\\right)$, we optimize:\r\n\r\n$$ \\mathcal{L}\\_{N} = - \\mathbb{E}\\_{X}\\left[\\log\\frac{f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right)}{\\sum\\_{x\\_{j}\\in{X}}f\\_{k}\\left(x\\_{j}, c\\_{t}\\right)}\\right] $$\r\n\r\nOptimizing this loss will result in $f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right)$ estimating the density ratio, which is:\r\n\r\n$$ f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right) \\propto \\frac{p\\left(x\\_{t+k}|c\\_{t}\\right)}{p\\left(x\\_{t+k}\\right)} $$","166":"**Contrastive Predictive Coding (CPC)** learns self-supervised representations by predicting the future in latent space by using powerful autoregressive models. The model uses a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful\r\nto predict future samples.\r\n\r\nFirst, a non-linear encoder $g\\_{enc}$ maps the input sequence of observations $x\\_{t}$ to a sequence of latent representations $z\\_{t} = g\\_{enc}\\left(x\\_{t}\\right)$, potentially with a lower temporal resolution. Next, an autoregressive model $g\\_{ar}$ summarizes all $z\\leq{t}$ in the latent space and produces a context latent representation $c\\_{t} = g\\_{ar}\\left(z\\leq{t}\\right)$.\r\n\r\nA density ratio is modelled which preserves the mutual information between $x\\_{t+k}$ and $c\\_{t}$ as follows:\r\n\r\n$$ f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right) \\propto \\frac{p\\left(x\\_{t+k}|c\\_{t}\\right)}{p\\left(x\\_{t+k}\\right)} $$\r\n\r\nwhere $\\propto$ stands for \u2019proportional to\u2019 (i.e. up to a multiplicative constant). Note that the density ratio $f$ can be unnormalized (does not have to integrate to 1). The authors use a simple log-bilinear model:\r\n\r\n$$ f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right) = \\exp\\left(z^{T}\\_{t+k}W\\_{k}c\\_{t}\\right) $$\r\n\r\nAny type of autoencoder and autoregressive can be used. 
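For intuition, a toy NumPy sketch of the InfoNCE objective with a log-bilinear critic as above (the batch construction, shapes and random inputs are illustrative assumptions, not the authors' setup):

```python
# Toy sketch of InfoNCE with a log-bilinear critic f_k(x, c) = exp(z^T W_k c):
# each context c_t is scored against its own future latent (the positive) and
# the other latents in the batch (the negatives).
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 16
z_future = rng.standard_normal((N, d))   # encoded future samples z_{t+k}
c = rng.standard_normal((N, d))          # context representations c_t
W_k = 0.1 * rng.standard_normal((d, d))  # critic weights (scaled to keep scores moderate)

scores = z_future @ W_k @ c.T            # scores[i, j] = z_i^T W_k c_j
log_softmax = scores - np.log(np.exp(scores).sum(axis=0, keepdims=True))
loss = -np.mean(np.diag(log_softmax))    # positive pairs lie on the diagonal
print(loss)
```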
An example the authors opt for is strided convolutional layers with residual blocks and GRUs.\r\n\r\nThe autoencoder and autoregressive models are trained to minimize an [InfoNCE](https:\/\/paperswithcode.com\/method\/infonce) loss (see components).","167":"**Inception-A** is an image model block used in the [Inception-v4](https:\/\/paperswithcode.com\/method\/inception-v4) architecture.","168":"**Reduction-A** is an image model block used in the [Inception-v4](https:\/\/paperswithcode.com\/method\/inception-v4) architecture.","169":"**Inception-B** is an image model block used in the [Inception-v4](https:\/\/paperswithcode.com\/method\/inception-v4) architecture.","170":"**Reduction-B** is an image model block used in the [Inception-v4](https:\/\/paperswithcode.com\/method\/inception-v4) architecture.","171":"**Inception-C** is an image model block used in the [Inception-v4](https:\/\/paperswithcode.com\/method\/inception-v4) architecture.","172":"**Inception-v4** is a convolutional neural network architecture that builds on previous iterations of the Inception family by simplifying the architecture and using more inception modules than [Inception-v3](https:\/\/paperswithcode.com\/method\/inception-v3).","173":"Dynamic Time Warping (DTW) [1] is one of well-known distance measures between a pairwise of time series. The main idea of DTW is to compute the distance from the matching of similar elements between time series. It uses the dynamic programming technique to find the optimal temporal matching between elements of two time series.\r\n\r\nFor instance, similarities in walking could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio, and graphics data \u2014 indeed, any data that can be turned into a linear sequence can be analyzed with DTW. A well known application has been automatic speech recognition, to cope with different speaking speeds. Other applications include speaker recognition and online signature recognition. It can also be used in partial shape matching application.\r\n\r\nIn general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restriction and rules:\r\n\r\n1. Every index from the first sequence must be matched with one or more indices from the other sequence, and vice versa\r\n2. The first index from the first sequence must be matched with the first index from the other sequence (but it does not have to be its only match)\r\n3. The last index from the first sequence must be matched with the last index from the other sequence (but it does not have to be its only match)\r\n4. The mapping of the indices from the first sequence to indices from the other sequence must be monotonically increasing, and vice versa, i.e. if j>i are indices from the first sequence, then there must not be two indices l>k in the other sequence, such that index i is matched with index l and index j is matched with index k, and vice versa.\r\n\r\n[1] Sakoe, Hiroaki, and Seibi Chiba. \"Dynamic programming algorithm optimization for spoken word recognition.\" IEEE transactions on acoustics, speech, and signal processing 26, no. 1 (1978): 43-49.","174":"**HRNet**, or **High-Resolution Net**, is a general purpose convolutional neural network for tasks like semantic segmentation, object detection and image classification. 
It is able to maintain high resolution representations through the whole process. We start from a high-resolution [convolution](https:\/\/paperswithcode.com\/method\/convolution) stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several ($4$ in the paper) stages and\r\nthe $n$th stage contains $n$ streams corresponding to $n$ resolutions. The authors conduct repeated multi-resolution fusions by exchanging the information across the parallel streams over and over.","175":"A neural network model to automatically capture trends in time-series data for improved prediction\/forecasting performance","176":"In the setting of multi-target regression, base boosting permits us to incorporate prior knowledge into the learning mechanism of gradient boosting (or Newton boosting, etc.). Namely, from the vantage of statistics, base boosting is a way of building the following additive expansion in a set of elementary basis functions:\r\n\\begin{equation}\r\nh_{j}(X ; \\{ \\alpha_{j}, \\theta_{j} \\}) = X_{j} + \\sum_{k=1}^{K_{j}} \\alpha_{j,k} b(X ; \\theta_{j,k}),\r\n\\end{equation}\r\nwhere \r\n$X$ is an example from the domain $\\mathcal{X},$\r\n$\\{\\alpha_{j}, \\theta_{j}\\} = \\{\\alpha_{j,1},\\dots, \\alpha_{j,K_{j}},\\theta_{j,1},\\dots,\\theta_{j,K_{j}}\\}$ collects the expansion coefficients and parameter sets,\r\n$X_{j}$ is the image of $X$ under the $j$th coordinate function (a prediction from a user-specified model),\r\n$K_{j}$ is the number of basis functions in the linear sum,\r\n$b(X; \\theta_{j,k})$ is a real-valued function of the example $X,$ characterized by a parameter set $\\theta_{j,k}.$\r\n\r\nThe aforementioned additive expansion differs from the \r\n[standard additive expansion](https:\/\/projecteuclid.org\/download\/pdf_1\/euclid.aos\/1013203451):\r\n\\begin{equation}\r\nh_{j}(X ; \\{ \\alpha_{j}, \\theta_{j}\\}) = \\alpha_{j, 0} + \\sum_{k=1}^{K_{j}} \\alpha_{j,k} b(X ; \\theta_{j,k}),\r\n\\end{equation}\r\nas it replaces the constant offset value $\\alpha_{j, 0}$ with a prediction from a user-specified model. In essence, this modification permits us to incorporate prior knowledge into the for loop of gradient boosting, as the for loop proceeds to build the linear sum by computing residuals that depend upon predictions from the user-specified model instead of the optimal constant model: $\\mbox{argmin} \\sum_{i=1}^{m_{train}} \\ell_{j}(Y_{j}^{(i)}, c),$ where $m_{train}$ denotes the number of training examples, $\\ell_{j}$ denotes a single-target loss function, and $c \\in \\mathbb{R}$ denotes a real number, e.g, $\\mbox{argmin} \\sum_{i=1}^{m_{train}} (Y_{j}^{(i)} - c)^{2} = \\frac{\\sum_{i=1}^{m_{train}} Y_{j}^{(i)}}{m_{train}}.$","177":"mBERT","178":"**AlphaZero** is a reinforcement learning agent for playing board games such as Go, chess, and shogi. ","179":"A **Sparse Transformer** is a [Transformer](https:\/\/paperswithcode.com\/method\/transformer) based architecture which utilises sparse factorizations of the attention matrix to reduce time\/memory to $O(n \\sqrt{n})$. 
Other changes to the Transformer architecture include: (a) a restructured [residual block](https:\/\/paperswithcode.com\/method\/residual-block) and weight initialization, (b) A set of sparse attention kernels which efficiently compute subsets of the attention matrix, (c) recomputation of attention weights during the backwards pass to reduce memory usage","180":"**Jigsaw** is a self-supervision approach that relies on jigsaw-like puzzles as the pretext task in order to learn image representations.","181":"**Population Based Training**, or **PBT**, is an optimization method for finding parameters and hyperparameters, and extends upon parallel search methods and sequential optimisation methods.\r\nIt leverages information sharing across a population of concurrently running optimisation processes, and allows for online propagation\/transfer of parameters and hyperparameters between members of the population based on their performance. Furthermore, unlike most other adaptation schemes, the method is capable of performing online adaptation of hyperparameters -- which can be particularly important in problems with highly non-stationary learning dynamics, such as reinforcement learning settings. PBT is decentralised and asynchronous, although it could also be executed semi-serially or with partial synchrony if there is a binding budget constraint.","182":"**Population Based Augmentation**, or **PBA**, is a data augmentation strategy (PBA), which generates nonstationary augmentation policy schedules instead of a fixed augmentation policy. In PBA we consider the augmentation policy search problem as a special case of hyperparameter schedule learning. It leverages [Population Based Training](https:\/\/paperswithcode.com\/method\/population-based-training) (PBT), a hyperparameter search algorithm which\r\noptimizes the parameters of a network jointly with their hyperparameters to maximize performance. The output of PBT is not an optimal hyperparameter configuration but rather a trained model and schedule of hyperparameters. \r\n\r\nIn PBA, we are only interested in the learned schedule and discard the child model result (similar to [AutoAugment](https:\/\/paperswithcode.com\/method\/autoaugment)). This learned augmentation schedule can then be used to improve the training of different (i.e., larger and costlier to train) models on the same dataset.\r\n\r\nPBT executes as follows. To start, a fixed population of models are randomly initialized and trained in parallel. At certain intervals, an \u201cexploit-and-explore\u201d procedure is applied to the worse performing population members, where the model clones the weights of a better performing model (i.e., exploitation) and then perturbs the hyperparameters of the cloned model to search in the hyperparameter space (i.e., exploration). Because the weights of the models are cloned and never reinitialized, the total computation required is the computation to train a single model times the population size.","183":"**AutoAugment** is an automated approach to find data augmentation policies from data. It formulates the problem of finding the best augmentation policy as a discrete search problem. It consists of two components: a search algorithm and a search space. \r\n\r\nAt a high level, the search algorithm (implemented as a controller RNN) samples a data augmentation policy $S$, which has information about what image processing operation to use, the probability of using the operation in each batch, and the magnitude of the operation. 
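To make the structure of a sampled policy concrete, a sub-policy can be represented as a list of (operation, probability, magnitude) triples. The following is a minimal sketch of applying one such sub-policy to a PIL image; the operation set and magnitude scalings are illustrative assumptions, not the paper's exact ranges.

```python
import random
from PIL import Image, ImageOps, ImageEnhance

# Each entry maps an operation name to a callable of (image, magnitude).
OPS = {
    "AutoContrast": lambda img, m: ImageOps.autocontrast(img),
    "Rotate":       lambda img, m: img.rotate(3 * m),                        # magnitude -> degrees (assumed scale)
    "Solarize":     lambda img, m: ImageOps.solarize(img, max(0, 255 - 26 * m)),
    "Contrast":     lambda img, m: ImageEnhance.Contrast(img).enhance(0.1 + 0.18 * m),
}

def apply_sub_policy(img: Image.Image, sub_policy):
    """Apply one sampled sub-policy, e.g. [("Rotate", 0.7, 2), ("Solarize", 0.3, 8)]."""
    for name, prob, magnitude in sub_policy:
        if random.random() < prob:          # each operation fires with its sampled probability
            img = OPS[name](img, magnitude)
    return img
```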
The policy $S$ is used to train a neural network with a fixed architecture, whose validation accuracy $R$ is sent back to update the controller. Since $R$ is not differentiable, the controller will be updated by policy gradient methods. \r\n\r\nThe operations used are from PIL, a popular Python image library: all functions in PIL that accept an image as input and output an image. It additionally uses two other augmentation techniques: [Cutout](https:\/\/paperswithcode.com\/method\/cutout) and SamplePairing. The operations searched over are ShearX\/Y, TranslateX\/Y, Rotate, AutoContrast, Invert, Equalize, Solarize, Posterize, Contrast, Color, Brightness, Sharpness, Cutout and Sample Pairing.","184":"**Temporal Graph Network**, or **TGN**, is a framework for deep learning on dynamic graphs represented as sequences of timed events. The memory (state) of the model at time $t$ consists of a vector $\\mathbf{s}_i(t)$ for each node $i$ the model has seen so far. The memory of a node is updated after an event (e.g. interaction with another node or node-wise change), and its purpose is to represent the node's history in a compressed format. Thanks to this specific module, TGNs have the capability to memorize long term dependencies for each node in the graph. When a new node is encountered, its memory is initialized as the zero vector, and it is then updated for each event involving the node, even after the model has finished training.","185":"A **HyperNetwork** is a network that generates weights for a main network. The behavior of the main network is the same with any usual neural network: it learns to map some raw inputs to their desired targets; whereas the hypernetwork takes a set of inputs that contain information about the structure of the weights and generates the weight for that layer.","186":"A **Gated Linear Unit**, or **GLU** computes:\r\n\r\n$$ \\text{GLU}\\left(a, b\\right) = a\\otimes \\sigma\\left(b\\right) $$\r\n\r\nIt is used in natural language processing architectures, for example the [Gated CNN](https:\/\/paperswithcode.com\/method\/gated-convolution-network), because here $b$ is the gate that control what information from $a$ is passed up to the following layer. Intuitively, for a language modeling task, the gating mechanism allows selection of words or features that are important for predicting the next word. The GLU also has non-linear capabilities, but has a linear path for the gradient so diminishes the vanishing gradient problem.","187":"**Adafactor** is a stochastic optimization method based on [Adam](https:\/\/paperswithcode.com\/method\/adam) that reduces memory usage while retaining the empirical benefits of adaptivity. This is achieved through maintaining a factored representation of the squared gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, we are able to reconstruct a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an $n \\times m$ matrix, this reduces the memory requirements from $O(n m)$ to $O(n + m)$. \r\n\r\nInstead of defining the optimization algorithm in terms of absolute step sizes {$\\alpha_t$}$\\_{t=1}^T$, the authors define the optimization algorithm in terms of relative step sizes {$\\rho_t$}$\\_{t=1}^T$, which get multiplied by the scale of the parameters. 
The scale of a parameter vector or matrix is defined as the root-mean-square of its components, lower-bounded by a small constant $\\epsilon_2$. The reason for this lower bound is to allow zero-initialized parameters to escape 0. \r\n\r\nProposed hyperparameters are: $\\epsilon\\_{1} = 10^{-30}$, $\\epsilon\\_{2} = 10^{-3}$, $d=1$, $\\rho\\_{t} = \\min\\left(10^{-2}, \\frac{1}{\\sqrt{t}}\\right)$, $\\hat{\\beta}\\_{2\\_{t}} = 1 - t^{-0.8}$.","188":"**Inverse Square Root** is a learning rate schedule 1 \/ $\\sqrt{\\max\\left(n, k\\right)}$ where\r\n$n$ is the current training iteration and $k$ is the number of warm-up steps. This sets a constant learning rate for the first $k$ steps, then decays the learning rate proportionally to the inverse square root of the step number until pre-training is over.","189":"**SentencePiece** is a subword tokenizer and detokenizer for natural language processing. It performs subword segmentation, supporting the byte-pair-encoding ([BPE](https:\/\/paperswithcode.com\/method\/bpe)) algorithm and unigram language model, and then converts this text into an id sequence, guaranteeing perfect reproducibility of the normalization and subword segmentation.","190":"**T5**, or **Text-to-Text Transfer Transformer**, is a [Transformer](https:\/\/paperswithcode.com\/method\/transformer) based architecture that uses a text-to-text approach. Every task \u2013 including translation, question answering, and classification \u2013 is cast as feeding the model text as input and training it to generate some target text. This allows for the use of the same model, loss function, hyperparameters, etc. across a diverse set of tasks. The changes compared to [BERT](https:\/\/paperswithcode.com\/method\/bert) include:\r\n\r\n- adding a *causal* decoder to the bidirectional architecture.\r\n- replacing the fill-in-the-blank cloze task with a mix of alternative pre-training tasks.","191":"**Morphence** is an approach for adversarial defense that shifts the defense landscape by making a model a moving target against adversarial examples. By regularly moving the decision function of a model, Morphence makes it significantly challenging for repeated or correlated attacks to succeed. Morphence deploys a pool of models generated from a base model in a manner that introduces sufficient randomness when it responds to prediction queries. To ensure repeated or correlated attacks fail, the deployed pool of models automatically expires after a query budget is reached and the model pool is replaced by a new model pool generated in advance.","192":"**MDETR** is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. It utilizes a [transformer](https:\/\/paperswithcode.com\/method\/transformer)-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. The network is pre-trained on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. The network is then fine-tuned on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation.","193":"A **Position-Sensitive RoI Pooling layer** aggregates the outputs of the last convolutional layer and generates scores for each RoI. 
Unlike [RoI Pooling](https:\/\/paperswithcode.com\/method\/roi-pooling), PS RoI Pooling conducts selective pooling, and each of the $k$ \u00d7 $k$ bin aggregates responses from only one score map out of the bank of $k$ \u00d7 $k$ score maps. With end-to-end training, this RoI layer shepherds the last convolutional layer to learn specialized position-sensitive score maps.","194":"**Region-based Fully Convolutional Networks**, or **R-FCNs**, are a type of region-based object detector. In contrast to previous region-based object detectors such as Fast\/[Faster R-CNN](https:\/\/paperswithcode.com\/method\/faster-r-cnn) that apply a costly per-region subnetwork hundreds of times, R-FCN is fully convolutional with almost all computation shared on the entire image.\r\n\r\nTo achieve this, R-FCN utilises position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection.","195":"In the ALIGN method, visual and language representations are jointly trained from noisy image alt-text data. The image and text encoders are learned via contrastive loss (formulated as normalized softmax) that pushes the embeddings of the matched image-text pair together and pushing those of non-matched image-text pair apart. The model learns to align visual and language representations of the image and text pairs using the contrastive loss. The representations can be used for vision-only or vision-language task transfer. Without any fine-tuning, ALIGN powers zero-shot visual classification and cross-modal search including image-to-text search, text-to image search and even search with joint image+text queries.","196":"**Dilated Convolutions** are a type of [convolution](https:\/\/paperswithcode.com\/method\/convolution) that \u201cinflate\u201d the kernel by inserting holes between the kernel elements. An additional parameter $l$ (dilation rate) indicates how much the kernel is widened. There are usually $l-1$ spaces inserted between kernel elements. \r\n\r\nNote that concept has existed in past literature under different names, for instance the *algorithme a trous*, an algorithm for wavelet decomposition (Holschneider et al., 1987; Shensa, 1992).","197":"**Fast Voxel Query** is a module used in the [Voxel Transformer](https:\/\/paperswithcode.com\/method\/votr) 3D object detection model implementation of self-attention, specifically Local and Dilated Attention. For each querying index $v\\_{i}$, an attending voxel index $v\\_{j}$ is determined by Local and Dilated Attention. Then we can lookup the non-empty index $j$ in the hash table with hashed $v\\_{j}$ as the key. Finally, the non-empty index $j$ is used to gather the attending feature $f\\_{j}$ from $\\mathcal{F}$ for [multi-head attention](https:\/\/paperswithcode.com\/method\/multi-head-attention).","198":"**VoTr** is a [Transformer](https:\/\/paperswithcode.com\/method\/transformer)-based 3D backbone for 3D object detection from point clouds. It contains a series of sparse and submanifold voxel modules. Submanifold voxel modules perform multi-head self-attention strictly on the non-empty voxels, while sparse voxel modules can extract voxel features at empty locations. Long-range relationships between voxels are captured via self-attention.\r\n\r\nGiven the fact that non-empty voxels are naturally sparse but numerous, directly applying standard Transformer on voxels is non-trivial. 
To this end, VoTr uses a sparse voxel module and a submanifold voxel module, which can operate on the empty and non-empty voxel positions effectively. To further enlarge the attention range while maintaining comparable computational overhead to the convolutional counterparts, two attention mechanisms are used for [multi-head attention](https:\/\/paperswithcode.com\/method\/multi-head-attention) in those two modules: Local Attention and Dilated Attention. Furthermore [Fast Voxel Query](https:\/\/paperswithcode.com\/method\/fast-voxel-query) is used to accelerate the querying process in multi-head attention.","199":"Inspired by the success of ResNet,\r\nWang et al. proposed\r\nthe very deep convolutional residual attention network (RAN) by \r\ncombining an attention mechanism with residual connections. \r\n\r\nEach attention module stacked in a residual attention network \r\ncan be divided into a mask branch and a trunk branch. \r\nThe trunk branch processes features,\r\nand can be implemented by any state-of-the-art structure\r\nincluding a pre-activation residual unit and an inception block.\r\nThe mask branch uses a bottom-up top-down structure\r\nto learn a mask of the same size that \r\nsoftly weights output features from the trunk branch. \r\nA sigmoid layer normalizes the output to $[0,1]$ after two $1\\times 1$ convolution layers. Overall the residual attention mechanism can be written as\r\n\r\n\\begin{align}\r\ns &= \\sigma(Conv_{2}^{1\\times 1}(Conv_{1}^{1\\times 1}( h_\\text{up}(h_\\text{down}(X))))) \r\n\\end{align}\r\n\r\n\\begin{align}\r\nX_{out} &= s f(X) + f(X)\r\n\\end{align}\r\nwhere $h_\\text{up}$ is a bottom-up structure, \r\nusing max-pooling several times after residual units\r\nto increase the receptive field, while\r\n$h_\\text{down}$ is the top-down part using \r\nlinear interpolation to keep the output size the \r\nsame as the input feature map. \r\nThere are also skip-connections between the two parts,\r\nwhich are omitted from the formulation.\r\n$f$ represents the trunk branch\r\nwhich can be any state-of-the-art structure.\r\n\r\nInside each attention module, a\r\nbottom-up top-down feedforward structure models\r\nboth spatial and cross-channel dependencies, \r\n leading to a consistent performance improvement. \r\nResidual attention can be incorporated into\r\nany deep network structure in an end-to-end training fashion.\r\nHowever, the proposed bottom-up top-down structure fails to leverage global spatial information. \r\nFurthermore, directly predicting a 3D attention map has high computational cost.","200":"There are at least eight notable examples of models from the literature that can be described using the **Message Passing Neural Networks** (**MPNN**) framework. For simplicity we describe MPNNs which operate on undirected graphs $G$ with node features $x_{v}$ and edge features $e_{vw}$. It is trivial to extend the formalism to directed multigraphs. The forward pass has two phases, a message passing phase and a readout phase. The message passing phase runs for $T$ time steps and is defined in terms of message functions $M_{t}$ and vertex update functions $U_{t}$. During the message passing phase, hidden states $h_{v}^{t}$ at each node in the graph are updated based on messages $m_{v}^{t+1}$ according to\r\n$$\r\nm_{v}^{t+1} = \\sum_{w \\in N(v)} M_{t}(h_{v}^{t}, h_{w}^{t}, e_{vw})\r\n$$\r\n$$\r\nh_{v}^{t+1} = U_{t}(h_{v}^{t}, m_{v}^{t+1})\r\n$$\r\nwhere in the sum, $N(v)$ denotes the neighbors of $v$ in graph $G$. 
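A minimal sketch of this message-passing step is given below; the adjacency-list graph representation and the stand-in callables for $M_{t}$ and $U_{t}$ (e.g. small MLPs) are assumptions for illustration, not any particular published MPNN.

```python
import torch

def message_passing_step(h, edge_feats, neighbors, M_t, U_t):
    """One MPNN step: m_v = sum_{w in N(v)} M_t(h_v, h_w, e_vw); h_v <- U_t(h_v, m_v).

    h         : (num_nodes, d) hidden states h_v^t
    edge_feats: dict mapping (v, w) -> edge feature tensor e_vw
    neighbors : dict mapping v -> list of neighbour indices N(v) (assumed non-empty)
    M_t, U_t  : learned message and vertex-update functions
    """
    messages = []
    for v in range(h.size(0)):
        m_v = sum(M_t(h[v], h[w], edge_feats[(v, w)]) for w in neighbors[v])
        messages.append(m_v)
    m = torch.stack(messages)   # (num_nodes, d_msg), the aggregated messages m_v^{t+1}
    return U_t(h, m)            # the updated states h_v^{t+1}
```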
The readout phase computes a feature vector for the whole graph using some readout function $R$ according to\r\n$$\r\n\\hat{y} = R(\\\\{ h_{v}^{T} | v \\in G \\\\})\r\n$$\r\nThe message functions $M_{t}$, vertex update functions $U_{t}$, and readout function $R$ are all learned differentiable functions. $R$ operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism.","201":"Genetic Algorithms are search algorithms that mimic Darwinian biological evolution in order to select and propagate better solutions.","202":"**DeepWalk** learns embeddings (social representations) of a graph's vertices, by modeling a stream of short random walks. Social representations are latent features of the vertices that capture neighborhood similarity and community membership. These latent representations encode social relations in a continuous vector space with a relatively small number of dimensions. It generalizes neural language models to process a special language composed of a set of randomly-generated walks. \r\n\r\nThe goal is to learn a latent representation, not only a probability distribution of node co-occurrences, and so as to introduce a mapping function $\\Phi \\colon v \\in V \\mapsto \\mathbb{R}^{|V|\\times d}$.\r\nThis mapping $\\Phi$ represents the latent social representation associated with each vertex $v$ in the graph. In practice, $\\Phi$ is represented by a $|V| \\times d$ matrix of free parameters.","203":"GraphSAGE is a general inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data.\r\n\r\nImage from: [Inductive Representation Learning on Large Graphs](https:\/\/arxiv.org\/pdf\/1706.02216v4.pdf)","204":"The **Vision Transformer**, or **ViT**, is a model for image classification that employs a [Transformer](https:\/\/paperswithcode.com\/method\/transformer)-like architecture over patches of the image. An image is split into fixed-size patches, each of them are then linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard [Transformer](https:\/\/paperswithcode.com\/method\/transformer) encoder. In order to perform classification, the standard approach of adding an extra learnable \u201cclassification token\u201d to the sequence is used.","205":"**VGG** is a classical convolutional neural network architecture. It was based on an analysis of how to increase the depth of such networks. The network utilises small 3 x 3 filters. Otherwise the network is characterized by its simplicity: the only other components being pooling layers and a fully connected layer.\r\n\r\nImage: [Davi Frossard](https:\/\/www.cs.toronto.edu\/frossard\/post\/vgg16\/)","206":"**ELECTRA** is a [transformer](https:\/\/paperswithcode.com\/method\/transformer) with a new pre-training approach which trains two transformer models: the generator and the discriminator. The generator replaces tokens in the sequence - trained as a masked language model - and the discriminator (the ELECTRA contribution) attempts to identify which tokens are replaced by the generator in the sequence. This pre-training task is called replaced token detection, and is a replacement for masking the input.","207":"**Axial Attention** is a simple generalization of self-attention that naturally aligns with the multiple dimensions of the tensors in both the encoding and the decoding settings. 
It was first proposed in [CCNet](https:\/\/paperswithcode.com\/method\/ccnet) [1] named as criss-cross attention, which harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Ho et al [2] extents CCNet to process multi-dimensional data. The proposed structure of the layers allows for the vast majority of the context to be computed in parallel during decoding without introducing any independence assumptions. It serves as the basic building block for developing self-attention-based autoregressive models for high-dimensional data tensors, e.g., Axial Transformers. It has been applied in [AlphaFold](https:\/\/paperswithcode.com\/method\/alphafold) [3] for interpreting protein sequences.\r\n\r\n[1] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, Wenyu Liu. CCNet: Criss-Cross Attention for Semantic Segmentation. ICCV, 2019.\r\n\r\n[2] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, Tim Salimans. arXiv:1912.12180\r\n\r\n[3] Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, \u017d\u00eddek A, Potapenko A, Bridgland A. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Jul 15:1-1.","208":"**Deformable Attention Module** is an attention module used in the [Deformable DETR](https:\/\/paperswithcode.com\/method\/deformable-detr) architecture, which seeks to overcome one issue base [Transformer attention](https:\/\/paperswithcode.com\/method\/scaled) in that it looks over all possible spatial locations. Inspired by [deformable convolution](https:\/\/paperswithcode.com\/method\/deformable-convolution), the deformable attention module only attends to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. By assigning only a small fixed number of keys for each query, the issues of convergence and feature spatial resolution can be mitigated.\r\n\r\nGiven an input feature map $x \\in \\mathbb{R}^{C \\times H \\times W}$, let $q$ index a query element with content feature $\\mathbf{z}\\_{q}$ and a 2-d reference point $\\mathbf{p}\\_{q}$, the deformable attention feature is calculated by:\r\n\r\n$$ \\text{DeformAttn}\\left(\\mathbf{z}\\_{q}, \\mathbf{p}\\_{q}, \\mathbf{x}\\right)=\\sum\\_{m=1}^{M} \\mathbf{W}\\_{m}\\left[\\sum\\_{k=1}^{K} A\\_{m q k} \\cdot \\mathbf{W}\\_{m}^{\\prime} \\mathbf{x}\\left(\\mathbf{p}\\_{q}+\\Delta \\mathbf{p}\\_{m q k}\\right)\\right]\r\n$$\r\n\r\nwhere $m$ indexes the attention head, $k$ indexes the sampled keys, and $K$ is the total sampled key number $(K \\ll H W) . \\Delta p_{m q k}$ and $A_{m q k}$ denote the sampling offset and attention weight of the $k^{\\text {th }}$ sampling point in the $m^{\\text {th }}$ attention head, respectively. The scalar attention weight $A_{m q k}$ lies in the range $[0,1]$, normalized by $\\sum_{k=1}^{K} A_{m q k}=1 . \\Delta \\mathbf{p}_{m q k} \\in \\mathbb{R}^{2}$ are of 2-d real numbers with unconstrained range. As $p\\_{q}+\\Delta p\\_{m q k}$ is fractional, bilinear interpolation is applied as in Dai et al. (2017) in computing $\\mathbf{x}\\left(\\mathbf{p}\\_{q}+\\Delta \\mathbf{p}\\_{m q k}\\right)$. 
Both $\\Delta \\mathbf{p}\\_{m q k}$ and $A\\_{m q k}$ are obtained via linear projection over the query feature $z\\_{q} .$ In implementation, the query feature $z\\_{q}$ is fed to a linear projection operator of $3 M K$ channels, where the first $2 M K$ channels encode the sampling offsets $\\Delta p\\_{m q k}$, and the remaining $M K$ channels are fed to a softmax operator to obtain the attention weights $A\\_{m q k}$.","209":"A **Feedforward Network**, or a **Multilayer Perceptron (MLP)**, is a neural network with solely densely connected layers. This is the classic neural network architecture of the literature. It consists of inputs $x$ passed through units $h$ (of which there can be many layers) to predict a target $y$. Activation functions are generally chosen to be non-linear to allow for flexible functional approximation.\r\n\r\nImage Source: Deep Learning, Goodfellow et al","210":"**Deformable DETR** is an object detection method that aims mitigates the slow convergence and high complexity issues of [DETR](https:\/\/www.paperswithcode.com\/method\/detr). It combines the best of the sparse spatial sampling of [deformable convolution](https:\/\/paperswithcode.com\/method\/deformable-convolution), and the relation modeling capability of [Transformers](https:\/\/paperswithcode.com\/methods\/category\/transformers). Specifically, it introduces a \r\n deformable attention module, which attends to a small set of sampling locations as a pre-filter for prominent key elements out of all the feature map pixels. The module can be naturally extended to aggregating multi-scale features, without the help of [FPN](https:\/\/paperswithcode.com\/method\/fpn).","211":"**Detr**, or **Detection Transformer**, is a set-based object detector using a [Transformer](https:\/\/paperswithcode.com\/method\/transformer) on top of a convolutional backbone. It uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries, and additionally attends to the encoder output. We pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a detection (class\r\nand bounding box) or a \u201cno object\u201d class.","212":"**MoCo**, or **Momentum Contrast**, is a self-supervised learning algorithm with a contrastive loss. \r\n\r\nContrastive loss methods can be thought of as building dynamic dictionaries. The \"keys\" (tokens) in the dictionary are sampled from data (e.g., images or patches) and are represented by an encoder network. Unsupervised learning trains encoders to perform dictionary look-up: an encoded \u201cquery\u201d should be similar to its matching key and dissimilar to others. Learning is formulated as minimizing a contrastive loss. \r\n\r\nMoCo can be viewed as a way to build large and consistent dictionaries for unsupervised learning with a contrastive loss. In MoCo, we maintain the dictionary as a queue of data samples: the encoded representations of the current mini-batch are enqueued, and the oldest are dequeued. The queue decouples the dictionary size from the mini-batch size, allowing it to be large. 
Moreover, as the dictionary keys come from the preceding several mini-batches, a slowly progressing key encoder, implemented as a momentum-based moving average of the query encoder, is proposed to maintain consistency.","213":"**Target Policy Smoothing** is a regularization strategy for the value function in reinforcement learning. Deterministic policies can overfit to narrow peaks in the value estimate, making them highly susceptible to functional approximation error, increasing the variance of the target. To reduce this variance, target policy smoothing adds a small amount of random noise to the target policy and averages over mini-batches - approximating a [SARSA](https:\/\/paperswithcode.com\/method\/sarsa)-like expectation\/integral.\r\n\r\nThe modified target update is:\r\n\r\n$$ y = r + \\gamma{Q}\\_{\\theta'}\\left(s', \\pi\\_{\\theta'}\\left(s'\\right) + \\epsilon \\right) $$\r\n\r\n$$ \\epsilon \\sim \\text{clip}\\left(\\mathcal{N}\\left(0, \\sigma\\right), -c, c \\right) $$\r\n\r\nwhere the added noise is clipped to keep the target close to the original action. The outcome is an algorithm reminiscent of [Expected SARSA](https:\/\/paperswithcode.com\/method\/expected-sarsa), where the value estimate is instead learned off-policy and the noise added to the target policy is chosen independently of the exploration policy. The value estimate learned is with respect to a noisy policy defined by the parameter $\\sigma$.","214":"**Clipped Double Q-learning** is a variant on [Double Q-learning](https:\/\/paperswithcode.com\/method\/double-q-learning) that upper-bounds the less biased Q estimate $Q\\_{\\theta\\_{2}}$ by the biased estimate $Q\\_{\\theta\\_{1}}$. This is equivalent to taking the minimum of the two estimates, resulting in the following target update:\r\n\r\n$$ y\\_{1} = r + \\gamma\\min\\_{i=1,2}Q\\_{\\theta'\\_{i}}\\left(s', \\pi\\_{\\phi\\_{1}}\\left(s'\\right)\\right) $$\r\n\r\nThe motivation for this extension is that vanilla double [Q-learning](https:\/\/paperswithcode.com\/method\/q-learning) is sometimes ineffective if the target and current networks are too similar, e.g. with a slow-changing policy in an actor-critic framework.","215":"**Experience Replay** is a replay memory technique used in reinforcement learning where we store the agent\u2019s experiences at each time-step, $e\\_{t} = \\left(s\\_{t}, a\\_{t}, r\\_{t}, s\\_{t+1}\\right)$ in a data-set $D = e\\_{1}, \\cdots, e\\_{N}$ , pooled over many episodes into a replay memory. We then usually sample the memory randomly for a minibatch of experience, and use this to learn off-policy, as with Deep Q-Networks. This tackles the problem of autocorrelation leading to unstable training, by making the problem more like a supervised learning problem.\r\n\r\nImage Credit: [Hands-On Reinforcement Learning with Python, Sudharsan Ravichandiran](https:\/\/subscription.packtpub.com\/book\/big_data_and_business_intelligence\/9781788836524)","216":"**DDPG**, or **Deep Deterministic Policy Gradient**, is an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. It combines the actor-critic approach with insights from [DQNs](https:\/\/paperswithcode.com\/method\/dqn): in particular, the insights that 1) the network is trained off-policy with samples from a replay buffer to minimize correlations between samples, and 2) the network is trained with a target Q network to give consistent targets during temporal difference backups. 
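Combining the two targets defined above (target policy smoothing and clipped double Q-learning), a minimal sketch of the resulting backup, as used in TD3-style algorithms, might look as follows; the network callables, noise scale, and clipping bound are assumptions for the example.

```python
import torch

def smoothed_clipped_double_q_target(r, s_next, gamma, actor_target, q1_target, q2_target,
                                     sigma=0.2, c=0.5, done=None):
    """y = r + gamma * min_i Q'_i(s', pi'(s') + clip(N(0, sigma), -c, c))."""
    with torch.no_grad():
        a_next = actor_target(s_next)
        noise = torch.clamp(sigma * torch.randn_like(a_next), -c, c)   # target policy smoothing
        a_next = a_next + noise
        q_next = torch.min(q1_target(s_next, a_next),                  # clipped double Q-learning
                           q2_target(s_next, a_next))
        not_done = 1.0 if done is None else (1.0 - done)
        return r + gamma * not_done * q_next
```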
DDPG makes use of the same ideas along with [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization).","217":"**TD3** builds on the [DDPG](https:\/\/paperswithcode.com\/method\/ddpg) algorithm for reinforcement learning, with a couple of modifications aimed at tackling overestimation bias with the value function. In particular, it utilises [clipped double Q-learning](https:\/\/paperswithcode.com\/method\/clipped-double-q-learning), delayed update of target and policy networks, and [target policy smoothing](https:\/\/paperswithcode.com\/method\/target-policy-smoothing) (which is similar to a [SARSA](https:\/\/paperswithcode.com\/method\/sarsa) based update; a safer update, as they provide higher value to actions resistant to perturbations).","218":"**Switchable Atrous Convolution (SAC)** softly switches the convolutional computation between different atrous rates and gathers the results using switch functions. The switch functions are spatially dependent, i.e., each location of the feature map might have different switches to control the outputs of SAC. To use SAC in a detector, we convert all the standard 3x3 convolutional layers in the bottom-up backbone to SAC.","219":"**Fixed Factorized Attention** is a factorized attention pattern where specific cells summarize previous locations and propagate that information to all future cells. It was proposed as part of the [Sparse Transformer](https:\/\/paperswithcode.com\/method\/sparse-transformer) architecture.\r\n\r\n\r\nA self-attention layer maps a matrix of input embeddings $X$ to an output matrix and is parameterized by a connectivity pattern $S = \\text{set}\\left(S\\_{1}, \\dots, S\\_{n}\\right)$, where $S\\_{i}$ denotes the set of indices of the input vectors to which the $i$th output vector attends. The output vector is a weighted sum of transformations of the input vectors:\r\n\r\n$$ \\text{Attend}\\left(X, S\\right) = \\left(a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right)\\right)\\_{i\\in\\text{set}\\left(1,\\dots,n\\right)}$$\r\n\r\n$$ a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right) = \\text{softmax}\\left(\\frac{\\left(W\\_{q}\\mathbf{x}\\_{i}\\right)K^{T}\\_{S\\_{i}}}{\\sqrt{d}}\\right)V\\_{S\\_{i}} $$\r\n\r\n$$ K\\_{Si} = \\left(W\\_{k}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\n$$ V\\_{Si} = \\left(W\\_{v}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\nHere $W\\_{q}$, $W\\_{k}$, and $W\\_{v}$ represent the weight matrices which transform a given $x\\_{i}$ into a query, key, or value, and $d$ is the inner dimension of the queries and keys. The output at each position is a sum of the values weighted by the scaled dot-product similarity of the keys and queries.\r\n\r\nFull self-attention for autoregressive models defines $S\\_{i} = \\text{set}\\left(j : j \\leq i\\right)$, allowing every element to attend to all previous positions and its own position.\r\n\r\nFactorized self-attention instead has $p$ separate attention heads, where the $m$th head defines a subset of the indices $A\\_{i}^{(m)} \u2282 \\text{set}\\left(j : j \\leq i\\right)$ and lets $S\\_{i} = A\\_{i}^{(m)}$. 
The goal with the Sparse [Transformer](https:\/\/paperswithcode.com\/method\/transformer) was to find efficient choices for the subset $A$.\r\n\r\nFormally for Fixed Factorized Attention, $A^{(1)}\\_{i} = ${$j : \\left(\\lfloor{j\/l\\rfloor}=\\lfloor{i\/l\\rfloor}\\right)$}, where the brackets denote the floor operation, and $A^{(2)}\\_{i} = ${$j : j \\mod l \\in ${$t, t+1, \\ldots, l$}}, where $t=l-c$ and $c$ is a hyperparameter. The $i$-th output vector of the attention head attends to all input vectors either from $A^{(1)}\\_{i}$ or $A^{(2)}\\_{i}$. This pattern can be visualized in the figure to the right.\r\n\r\nIf the stride is 128 and $c = 8$, then all future positions greater than 128 can attend to positions 120-128, all positions greater than 256 can attend to 248-256, and so forth. \r\n\r\nA fixed-attention pattern with $c = 1$ limits the expressivity of the network significantly, as many representations in the network are only used for one block whereas a small number of locations are used by all blocks. The authors found choosing $c \\in ${$8, 16, 32$} for typical values of $l \\in\r\n{128, 256}$ performs well, although this increases the computational cost of this method by $c$ in comparison to the [strided attention](https:\/\/paperswithcode.com\/method\/strided-attention).\r\n\r\nAdditionally, the authors found that when using multiple heads, having them attend to distinct subblocks of length $c$ within the block of size $l$ was preferable to having them attend to the same subblock.","220":"**Strided Attention** is a factorized attention pattern that has one head attend to the previous\r\n$l$ locations, and the other head attend to every $l$th location, where $l$ is the stride and chosen to be close to $\\sqrt{n}$. It was proposed as part of the [Sparse Transformer](https:\/\/paperswithcode.com\/method\/sparse-transformer) architecture.\r\n\r\nA self-attention layer maps a matrix of input embeddings $X$ to an output matrix and is parameterized by a connectivity pattern $S = \\text{set}\\left(S\\_{1}, \\dots, S\\_{n}\\right)$, where $S\\_{i}$ denotes the set of indices of the input vectors to which the $i$th output vector attends. The output vector is a weighted sum of transformations of the input vectors:\r\n\r\n$$ \\text{Attend}\\left(X, S\\right) = \\left(a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right)\\right)\\_{i\\in\\text{set}\\left(1,\\dots,n\\right)}$$\r\n\r\n$$ a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right) = \\text{softmax}\\left(\\frac{\\left(W\\_{q}\\mathbf{x}\\_{i}\\right)K^{T}\\_{S\\_{i}}}{\\sqrt{d}}\\right)V\\_{S\\_{i}} $$\r\n\r\n$$ K\\_{Si} = \\left(W\\_{k}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\n$$ V\\_{Si} = \\left(W\\_{v}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\nHere $W\\_{q}$, $W\\_{k}$, and $W\\_{v}$ represent the weight matrices which transform a given $x\\_{i}$ into a query, key, or value, and $d$ is the inner dimension of the queries and keys. The output at each position is a sum of the values weighted by the scaled dot-product similarity of the keys and queries.\r\n\r\nFull self-attention for autoregressive models defines $S\\_{i} = \\text{set}\\left(j : j \\leq i\\right)$, allowing every element to attend to all previous positions and its own position.\r\n\r\nFactorized self-attention instead has $p$ separate attention heads, where the $m$th head defines a subset of the indices $A\\_{i}^{(m)} \u2282 \\text{set}\\left(j : j \\leq i\\right)$ and lets $S\\_{i} = A\\_{i}^{(m)}$. 
The goal with the Sparse [Transformer](https:\/\/paperswithcode.com\/method\/transformer) was to find efficient choices for the subset $A$.\r\n\r\nFormally for Strided Attention, $A^{(1)}\\_{i} = ${$t, t + 1, ..., i$} for $t = \\max\\left(0, i \u2212 l\\right)$, and $A^{(2)}\\_{i} = ${$j : (i \u2212 j) \\mod l = 0$}. The $i$-th output vector of the attention head attends to all input vectors either from $A^{(1)}\\_{i}$ or $A^{(2)}\\_{i}$. This pattern can be visualized in the figure to the right.\r\n\r\nThis formulation is convenient if the data naturally has a structure that aligns with the stride, like images or some types of music. For data without a periodic structure, like text, however, the authors find that the network can fail to properly route information with the strided pattern, as spatial coordinates for an element do not necessarily correlate with the positions where the element may be most relevant in the future.","221":"**GPT-3** is an autoregressive [transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers) model with 175 billion\r\nparameters. It uses the same architecture\/model as [GPT-2](https:\/\/paperswithcode.com\/method\/gpt-2), including the modified initialization, pre-normalization, and reversible tokenization, with the exception that GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the [transformer](https:\/\/paperswithcode.com\/method\/transformer), similar to the [Sparse Transformer](https:\/\/paperswithcode.com\/method\/sparse-transformer).","222":"The **ENet Initial Block** is an image model block used in the [ENet](https:\/\/paperswithcode.com\/method\/enet) semantic segmentation architecture. [Max Pooling](https:\/\/paperswithcode.com\/method\/max-pooling) is performed with non-overlapping 2 \u00d7 2 windows, and the [convolution](https:\/\/paperswithcode.com\/method\/convolution) has 13 filters, which sums up to 16 feature maps after concatenation. This is heavily inspired by Inception Modules.","223":"**ENet Dilated Bottleneck** is an image model block used in the [ENet](https:\/\/paperswithcode.com\/method\/enet) semantic segmentation architecture. It is the same as a regular [ENet Bottleneck](https:\/\/paperswithcode.com\/method\/enet-bottleneck) but employs dilated convolutions instead.","224":"**SpatialDropout** is a type of [dropout](https:\/\/paperswithcode.com\/method\/dropout) for convolutional networks. For a given [convolution](https:\/\/paperswithcode.com\/method\/convolution) feature tensor of size $n\\_{\\text{feats}}$\u00d7height\u00d7width, we perform only $n\\_{\\text{feats}}$ dropout\r\ntrials and extend the dropout value across the entire feature map. Therefore, adjacent pixels in the dropped-out feature\r\nmap are either all 0 (dropped-out) or all active as illustrated in the figure to the right.","225":"A **Parametric Rectified Linear Unit**, or **PReLU**, is an activation function that generalizes the traditional rectified unit with a slope for negative values. Formally:\r\n\r\n$$f\\left(y\\_{i}\\right) = y\\_{i} \\text{ if } y\\_{i} \\ge 0$$\r\n$$f\\left(y\\_{i}\\right) = a\\_{i}y\\_{i} \\text{ if } y\\_{i} \\leq 0$$\r\n\r\nThe intuition is that different layers may require different types of nonlinearity. Indeed the authors find in experiments with convolutional neural networks that PReLus for the initial layer have more positive slopes, i.e. closer to linear. 
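A minimal sketch of the activation defined above, with one learnable slope per channel (PyTorch's built-in `nn.PReLU` provides equivalent behaviour):

```python
import torch
import torch.nn as nn

class PReLU(nn.Module):
    """f(y) = y for y >= 0, a * y otherwise, with a learned per-channel slope a."""

    def __init__(self, num_channels, init=0.25):
        super().__init__()
        self.a = nn.Parameter(torch.full((num_channels,), init))

    def forward(self, y):                       # y: (N, C, H, W)
        a = self.a.view(1, -1, 1, 1)            # broadcast the slope over spatial dimensions
        return torch.where(y >= 0, y, a * y)
```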
Since the filters of the first layers are Gabor-like filters such as edge or texture detectors, this shows a circumstance where positive and negative responses of filters are respected. In contrast the authors find deeper layers have smaller coefficients, suggesting the model becomes more discriminative at later layers (while it wants to retain more information at earlier layers).","226":"**ENet Bottleneck** is an image model block used in the [ENet](https:\/\/paperswithcode.com\/method\/enet) semantic segmentation architecture. Each block consists of three convolutional layers: a 1 \u00d7 1 projection that reduces the dimensionality, a main convolutional layer, and a 1 \u00d7 1 expansion. We place [Batch Normalization](https:\/\/paperswithcode.com\/method\/batch-normalization) and [PReLU](https:\/\/paperswithcode.com\/method\/prelu) between all convolutions. If the bottleneck is downsampling, a [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) layer is added to the main branch.\r\nAlso, the first 1 \u00d7 1 projection is replaced with a 2 \u00d7 2 [convolution](https:\/\/paperswithcode.com\/method\/convolution) with stride 2 in both dimensions. We zero pad the activations, to match the number of feature maps.","227":"**ENet** is a semantic segmentation architecture which utilises a compact encoder-decoder architecture. Some design choices include:\r\n\r\n1. Using the [SegNet](https:\/\/paperswithcode.com\/method\/segnet) approach to downsampling y saving indices of elements chosen in max\r\npooling layers, and using them to produce sparse upsampled maps in the decoder.\r\n2. Early downsampling to optimize the early stages of the network and reduce the cost of processing large input frames. The first two blocks of ENet heavily reduce the input size, and use only a small set of feature maps. \r\n3. Using PReLUs as an activation function\r\n4. Using dilated convolutions \r\n5. Using Spatial [Dropout](https:\/\/paperswithcode.com\/method\/dropout)","228":"**R-CNN**, or **Regions with CNN Features**, is an object detection model that uses high-capacity CNNs to bottom-up region proposals in order to localize and segment objects. It uses [selective search](https:\/\/paperswithcode.com\/method\/selective-search) to identify a number of bounding-box object region candidates (\u201cregions of interest\u201d), and then extracts features from each region independently for classification.","229":"**PyTorch DDP** (Distributed Data Parallel) is a distributed data parallel implementation for PyTorch. To guarantee mathematical equivalence, all replicas start from the same initial values for model parameters and synchronize gradients to keep parameters consistent across training iterations. To minimize the intrusiveness, the implementation exposes the same forward API as the user model, allowing applications to seamlessly replace subsequent occurrences of a user model with the distributed data parallel model object with no additional code changes. Several techniques are integrated into the design to deliver high-performance training, including bucketing gradients, overlapping communication with computation, and skipping synchronization.","230":"**Random Search** replaces the exhaustive enumeration of all combinations by selecting them randomly. This can be simply applied to the discrete setting described above, but also generalizes to continuous and mixed spaces. 
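A minimal illustrative sketch of random search over a mixed discrete\/continuous space follows; the search space, sampling priors, and `evaluate` callback are assumptions for the example.

```python
import math
import random

def random_search(evaluate, n_trials=50, seed=0):
    """Sample hyperparameter configurations independently at random and keep the best one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, -math.inf
    for _ in range(n_trials):
        cfg = {
            "lr": 10 ** rng.uniform(-5, -1),                 # continuous, log-uniform prior
            "batch_size": rng.choice([32, 64, 128, 256]),    # discrete
            "dropout": rng.uniform(0.0, 0.5),                # continuous
        }
        score = evaluate(cfg)                                # e.g. validation accuracy
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```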
It can outperform Grid Search, especially when only a small number of hyperparameters affects the final performance of the machine learning algorithm. In this case, the optimization problem is said to have a low intrinsic dimensionality. Random Search is also embarrassingly parallel, and additionally allows the inclusion of prior knowledge by specifying the distribution from which to sample.\r\n\r\n\r\nExtracted from [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Hyperparameter_optimization#Random_search)\r\n\r\nSource [Paper](https:\/\/dl.acm.org\/doi\/10.5555\/2188385.2188395)\r\n\r\nImage Source: [BERGSTRA AND BENGIO](https:\/\/dl.acm.org\/doi\/pdf\/10.5555\/2188385.2188395)","231":"**SimCSE** is a contrastive learning framework for generating sentence embeddings. It utilizes an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard [dropout](https:\/\/paperswithcode.com\/method\/dropout) used as noise. The authors find that dropout acts as minimal \u201cdata augmentation\u201d of hidden representations, while removing it leads to a representation collapse. Afterwards, a supervised approach is used, which incorporates annotated pairs from natural language inference datasets into the contrastive framework, by using \u201centailment\u201d pairs as positives and \u201ccontradiction\u201d pairs as hard negatives.","232":"CARLA is an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support development, training, and validation of autonomous urban driving systems. In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. \r\n\r\nSource: [Dosovitskiy et al.](https:\/\/arxiv.org\/pdf\/1711.03938v1.pdf)\r\n\r\nImage source: [Dosovitskiy et al.](https:\/\/arxiv.org\/pdf\/1711.03938v1.pdf)","233":"**GPT** is a [Transformer](https:\/\/paperswithcode.com\/method\/transformer)-based architecture and training procedure for natural language processing tasks. Training follows a two-stage procedure. First, a language modeling objective is used on\r\nthe unlabeled data to learn the initial parameters of a neural network model. Subsequently, these parameters are adapted to a target task using the corresponding supervised objective.","234":"**Sliding Window Attention** is an attention pattern for attention-based models. It was proposed as part of the [Longformer](https:\/\/paperswithcode.com\/method\/longformer) architecture. It is motivated by the fact that non-sparse attention in the original [Transformer](https:\/\/paperswithcode.com\/method\/transformer) formulation has a [self-attention component](https:\/\/paperswithcode.com\/method\/scaled) with $O\\left(n^{2}\\right)$ time and memory complexity where $n$ is the input sequence length and thus, is not efficient to scale to long inputs. Given the importance of local context, the sliding window attention pattern employs a fixed-size window attention surrounding each token. Using multiple stacked layers of such windowed attention results in a large receptive field, where top layers have access to all input locations and have the capacity to build representations that incorporate information across the entire input. \r\n\r\nMore formally, in this attention pattern, given a fixed window size $w$, each token attends to $\\frac{1}{2}w$ tokens on each side. 
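As a sketch, the band of allowed positions can be expressed as a boolean mask (a naive dense construction for illustration, not the memory-efficient Longformer kernels):

```python
import torch

def sliding_window_mask(n, w):
    """Boolean (n, n) mask: True where token i may attend to token j, i.e. |i - j| <= w // 2."""
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= w // 2

# Example: apply to attention scores before the softmax
# scores = scores.masked_fill(~sliding_window_mask(n, w), float("-inf"))
```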
The computation complexity of this pattern is $O\\left(n\u00d7w\\right)$,\r\nwhich scales linearly with input sequence length $n$. To make this attention pattern efficient, $w$ should be small compared with $n$. But a model with typical multiple stacked transformers will have a large receptive field. This is analogous to CNNs where stacking layers of small kernels leads to high level features that are built from a large portion of the input (receptive field)\r\n\r\nIn this case, with a transformer of $l$ layers, the receptive field size is $l \u00d7 w$ (assuming\r\n$w$ is fixed for all layers). Depending on the application, it might be helpful to use different values of $w$ for each layer to balance between efficiency and model representation capacity.","235":"**Dilated Sliding Window Attention** is an attention pattern for attention-based models. It was proposed as part of the [Longformer](https:\/\/paperswithcode.com\/method\/longformer) architecture. It is motivated by the fact that non-sparse attention in the original [Transformer](https:\/\/paperswithcode.com\/method\/transformer) formulation has a [self-attention component](https:\/\/paperswithcode.com\/method\/scaled) with $O\\left(n^{2}\\right)$ time and memory complexity where $n$ is the input sequence length and thus, is not efficient to scale to long inputs. \r\n\r\nCompared to a [Sliding Window Attention](https:\/\/paperswithcode.com\/method\/sliding-window-attention) pattern, we can further increase the receptive field without increasing computation by making the sliding window \"dilated\". This is analogous to [dilated CNNs](https:\/\/paperswithcode.com\/method\/dilated-convolution) where the window has gaps of size dilation $d$. Assuming a fixed $d$ and $w$ for all layers, the receptive field is $l \u00d7 d \u00d7 w$, which can reach tens of thousands of tokens even for small values of $d$.","236":"**Global and Sliding Window Attention** is an attention pattern for attention-based models. It is motivated by the fact that non-sparse attention in the original [Transformer](https:\/\/paperswithcode.com\/method\/transformer) formulation has a [self-attention component](https:\/\/paperswithcode.com\/method\/scaled) with $O\\left(n^{2}\\right)$ time and memory complexity where $n$ is the input sequence length and thus, is not efficient to scale to long inputs. \r\n\r\nSince [windowed](https:\/\/paperswithcode.com\/method\/sliding-window-attention) and [dilated](https:\/\/paperswithcode.com\/method\/dilated-sliding-window-attention) attention patterns are not flexible enough to learn task-specific representations, the authors of the [Longformer](https:\/\/paperswithcode.com\/method\/longformer) add \u201cglobal attention\u201d on few pre-selected input locations. This attention is operation symmetric: that is, a token with a global attention attends to all tokens across the sequence, and all tokens in the sequence attend to it. The Figure to the right shows an example of a sliding window attention with global attention at a few tokens at custom locations. For the example of classification, global attention is used for the [CLS] token, while in the example of Question Answering, global attention is provided on all question tokens.","237":"**AdamW** is a stochastic optimization method that modifies the typical implementation of weight decay in [Adam](https:\/\/paperswithcode.com\/method\/adam), by decoupling [weight decay](https:\/\/paperswithcode.com\/method\/weight-decay) from the gradient update. 
To see this, $L\\_{2}$ regularization in Adam is usually implemented with the below modification where $w\\_{t}$ is the rate of the weight decay at time $t$:\r\n\r\n$$ g\\_{t} = \\nabla{f\\left(\\theta\\_{t}\\right)} + w\\_{t}\\theta\\_{t}$$\r\n\r\nwhile AdamW adjusts the weight decay term to appear in the gradient update:\r\n\r\n$$ \\theta\\_{t+1, i} = \\theta\\_{t, i} - \\eta\\left(\\frac{1}{\\sqrt{\\hat{v}\\_{t} + \\epsilon}}\\cdot{\\hat{m}\\_{t}} + w\\_{t, i}\\theta\\_{t, i}\\right), \\forall{t}$$","238":"**Longformer** is a modified [Transformer](https:\/\/paperswithcode.com\/method\/transformer) architecture. Traditional [Transformer-based models](https:\/\/paperswithcode.com\/methods\/category\/transformers) are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this, **Longformer** uses an attention pattern that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. The attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention.\r\n\r\nThe attention patterns utilised include: [sliding window attention](https:\/\/paperswithcode.com\/method\/sliding-window-attention), [dilated sliding window attention](https:\/\/paperswithcode.com\/method\/dilated-sliding-window-attention) and global + sliding window. These can be viewed in the components section of this page.","239":"Class of methods in Bayesian Statistics where the posterior distribution is approximated over a rejection scheme on simulations because the likelihood function is intractable.\r\n\r\nDifferent parameters get sampled and simulated. Then a distance function is calculated to measure the quality of the simulation compared to data from real observations. Only simulations that fall below a certain threshold get accepted.\r\n\r\nImage source: [Kulkarni et al.](https:\/\/www.umass.edu\/nanofabrics\/sites\/default\/files\/PDF_0.pdf)","240":"**Discrete Cosine Transform (DCT)** is an orthogonal transformation method that decomposes an\r\nimage to its spatial frequency spectrum. It expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. It is used a lot in compression tasks, e..g image compression where for example high-frequency components can be discarded. It is a type of Fourier-related Transform, similar to discrete fourier transforms (DFTs), but only using real numbers.\r\n\r\nImage Credit: [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Discrete_cosine_transform#\/media\/File:Example_dft_dct.svg)","241":"Train a convolutional neural network to generate the contents of an arbitrary image region conditioned on its surroundings.","242":"**CutMix** is an image data augmentation strategy. Instead of simply removing pixels as in [Cutout](https:\/\/paperswithcode.com\/method\/cutout), we replace the removed regions with a patch from another image. The ground truth labels are also mixed proportionally to the number of pixels of combined images. The added patches further enhance localization ability by requiring the model to identify the object from a partial view.","243":"**Stochastic Depth** aims to shrink the depth of a network during training, while\r\nkeeping it unchanged during testing. 
This is achieved by randomly dropping entire [ResBlocks](https:\/\/paperswithcode.com\/method\/residual-block) during training and bypassing their transformations through skip connections. \r\n\r\nLet $b\\_{l} \\in$ {$0, 1$} denote a Bernoulli random variable, which indicates whether the $l$th ResBlock is active ($b\\_{l} = 1$) or inactive ($b\\_{l} = 0$). Further, let us denote the \u201csurvival\u201d probability of ResBlock $l$ as $p\\_{l} = \\text{Pr}\\left(b\\_{l} = 1\\right)$. With this definition we can bypass the $l$th ResBlock by multiplying its function $f\\_{l}$ with $b\\_{l}$ and we extend the update rule to:\r\n\r\n$$ H\\_{l} = \\text{ReLU}\\left(b\\_{l}f\\_{l}\\left(H\\_{l-1}\\right) + \\text{id}\\left(H\\_{l-1}\\right)\\right) $$\r\n\r\nIf $b\\_{l} = 1$, this reduces to the original [ResNet](https:\/\/paperswithcode.com\/method\/resnet) update and this ResBlock remains unchanged. If $b\\_{l} = 0$, the ResBlock reduces to the identity function, $H\\_{l} = \\text{id}\\left((H\\_{l}\u22121\\right)$.","244":"The **Swin Transformer** is a type of [Vision Transformer](https:\/\/paperswithcode.com\/method\/vision-transformer). It builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red). It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks. In contrast, previous vision Transformers produce feature maps of a single low resolution and have quadratic computation complexity to input image size due to computation of self-attention globally.","245":"**Submanifold Convolution (SC)** is a spatially sparse [convolution](https:\/\/paperswithcode.com\/method\/convolution) operation used for tasks with sparse data like semantic segmentation of 3D point clouds. An SC convolution computes the set of active sites in the same way as a regular convolution: it looks for the presence of any active sites in its receptive field of size $f^{d}$. If the input has size $l$ then the output will have size $\\left(l \u2212 f + s\\right)\/s$. Unlike a regular convolution, an SC convolution discards the ground state for non-active sites by assuming that the input from those sites is zero. For more details see the [paper](https:\/\/paperswithcode.com\/paper\/3d-semantic-segmentation-with-submanifold), or the official code [here](https:\/\/github.com\/facebookresearch\/SparseConvNet).","246":"**PULSE** is a self-supervised photo upsampling algorithm. Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the downscaling loss, which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, the authors aim to restrict the search space to guarantee realistic outputs.","247":"**CodeBERT** is a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. 
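A hedged PyTorch sketch of the Stochastic Depth rule above, $H_{l} = \text{ReLU}\left(b_{l}f_{l}\left(H_{l-1}\right) + \text{id}\left(H_{l-1}\right)\right)$, where `f_l` stands for the residual branch of the $l$th ResBlock; scaling the branch by $p_{l}$ at test time is the expectation calibration used in the original paper.

```python
import torch
import torch.nn.functional as F

def stochastic_depth_block(h_prev, f_l, p_l: float, training: bool = True):
    """Randomly bypass the residual branch with survival probability p_l."""
    if training:
        b_l = (torch.rand(()) < p_l).float()       # Bernoulli survival indicator b_l
        return F.relu(b_l * f_l(h_prev) + h_prev)  # b_l = 0 reduces the block to the identity
    return F.relu(p_l * f_l(h_prev) + h_prev)      # expectation-calibrated at test time

branch = torch.nn.Conv2d(8, 8, kernel_size=3, padding=1)   # stand-in residual branch
out = stochastic_depth_block(torch.randn(2, 8, 16, 16), branch, p_l=0.8)
```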
CodeBERT is developed with a [Transformer](https:\/\/paperswithcode.com\/method\/transformer)-based neural architecture, and is trained with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables the utilization of both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators.","248":"**WaveRNN** is a single-layer recurrent neural network for audio generation that is designed efficiently predict 16-bit raw audio samples.\r\n\r\nThe overall computation in the WaveRNN is as follows (biases omitted for brevity):\r\n\r\n$$ \\mathbf{x}\\_{t} = \\left[\\mathbf{c}\\_{t\u22121},\\mathbf{f}\\_{t\u22121}, \\mathbf{c}\\_{t}\\right] $$\r\n\r\n$$ \\mathbf{u}\\_{t} = \\sigma\\left(\\mathbf{R}\\_{u}\\mathbf{h}\\_{t-1} + \\mathbf{I}^{*}\\_{u}\\mathbf{x}\\_{t}\\right) $$\r\n\r\n$$ \\mathbf{r}\\_{t} = \\sigma\\left(\\mathbf{R}\\_{r}\\mathbf{h}\\_{t-1} + \\mathbf{I}^{*}\\_{r}\\mathbf{x}\\_{t}\\right) $$\r\n\r\n$$ \\mathbf{e}\\_{t} = \\tau\\left(\\mathbf{r}\\_{t} \\odot \\left(\\mathbf{R}\\_{e}\\mathbf{h}\\_{t-1}\\right) + \\mathbf{I}^{*}\\_{e}\\mathbf{x}\\_{t} \\right) $$\r\n\r\n$$ \\mathbf{h}\\_{t} = \\mathbf{u}\\_{t} \\cdot \\mathbf{h}\\_{t-1} + \\left(1-\\mathbf{u}\\_{t}\\right) \\cdot \\mathbf{e}\\_{t} $$\r\n\r\n$$ \\mathbf{y}\\_{c}, \\mathbf{y}\\_{f} = \\text{split}\\left(\\mathbf{h}\\_{t}\\right) $$\r\n\r\n$$ P\\left(\\mathbf{c}\\_{t}\\right) = \\text{softmax}\\left(\\mathbf{O}\\_{2}\\text{relu}\\left(\\mathbf{O}\\_{1}\\mathbf{y}\\_{c}\\right)\\right) $$\r\n\r\n$$ P\\left(\\mathbf{f}\\_{t}\\right) = \\text{softmax}\\left(\\mathbf{O}\\_{4}\\text{relu}\\left(\\mathbf{O}\\_{3}\\mathbf{y}\\_{f}\\right)\\right) $$\r\n\r\nwhere the $*$ indicates a masked matrix whereby the last coarse input $\\mathbf{c}\\_{t}$ is only connected to the fine part of the states $\\mathbf{u}\\_{t}$, $\\mathbf{r}\\_{t}$, $\\mathbf{e}\\_{t}$ and $\\mathbf{h}\\_{t}$ and thus only affects the fine output $\\mathbf{y}\\_{f}$. The coarse and fine parts $\\mathbf{c}\\_{t}$ and $\\mathbf{f}\\_{t}$ are encoded as scalars in $\\left[0, 255\\right]$ and scaled to the interval $\\left[\u22121, 1\\right]$. The matrix $\\mathbf{R}$ formed from the matrices $\\mathbf{R}\\_{u}$, $\\mathbf{R}\\_{r}$, $\\mathbf{R}\\_{e}$ is computed as a single matrix-vector product to produce the contributions to all three gates $\\mathbf{u}\\_{t}$, $mathbf{r}\\_{t}$ and $\\mathbf{e}\\_{t}$ (a variant of the [GRU cell](https:\/\/paperswithcode.com\/method\/gru). $\\sigma$ and $\\tau$ are the standard sigmoid and tanh non-linearities.\r\n\r\nEach part feeds into a [softmax](https:\/\/paperswithcode.com\/method\/softmax) layer over the corresponding 8 bits and the prediction of the 8 fine bits is conditioned on the 8 coarse bits. The resulting Dual Softmax layer allows for efficient prediction of 16-bit samples using two small output spaces (2 8 values each) instead of a single large output space (with 2 16 values).","249":"**Mixture of Logistic Distributions (MoL)** is a type of output function, and an alternative to a [softmax](https:\/\/paperswithcode.com\/method\/softmax) layer. 
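The WaveRNN dual softmax above factorizes each 16-bit sample into an 8-bit coarse part and an 8-bit fine part, so two output distributions over $2^{8} = 256$ values replace a single distribution over $2^{16} = 65{,}536$ values. A small sketch of that encoding, including the scaling of byte values to $\left[-1, 1\right]$ for the network input (helper names are illustrative):

```python
import numpy as np

def split_sample(x16: np.ndarray):
    """Split unsigned 16-bit samples into coarse (high byte) and fine (low byte) parts."""
    coarse = x16 // 256
    fine = x16 % 256
    return coarse, fine

def to_unit_interval(byte_vals: np.ndarray) -> np.ndarray:
    """Map byte values in [0, 255] to the interval [-1, 1]."""
    return byte_vals.astype(np.float32) / 127.5 - 1.0

x = np.array([0, 32768, 65535], dtype=np.uint16)
c, f = split_sample(x)
assert np.all(c.astype(np.uint32) * 256 + f == x.astype(np.uint32))
```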
Discretized logistic mixture likelihood is used in [PixelCNN](https:\/\/paperswithcode.com\/method\/pixelcnn)++ and [WaveNet](https:\/\/paperswithcode.com\/method\/wavenet) to predict discrete values.\r\n\r\nImage Credit: [Hao Gao](https:\/\/medium.com\/@smallfishbigsea\/an-explanation-of-discretized-logistic-mixture-likelihood-bdfe531751f0)","250":"**WaveNet** is an audio generative model based on the [PixelCNN](https:\/\/paperswithcode.com\/method\/pixelcnn) architecture. In order to deal with long-range temporal dependencies needed for raw audio generation, architectures are developed based on dilated causal convolutions, which exhibit very large receptive fields.\r\n\r\nThe joint probability of a waveform $\\vec{x} = \\{ x_1, \\dots, x_T \\}$ is factorised as a product of conditional probabilities as follows:\r\n\r\n$$p\\left(\\vec{x}\\right) = \\prod_{t=1}^{T} p\\left(x_t \\mid x_1, \\dots ,x_{t-1}\\right)$$\r\n\r\nEach audio sample $x_t$ is therefore conditioned on the samples at all previous timesteps.","251":"**NT-Xent**, or **Normalized Temperature-scaled Cross Entropy Loss**, is a loss function. Let $\\text{sim}\\left(\\mathbf{u}, \\mathbf{v}\\right) = \\mathbf{u}^{T}\\mathbf{v}\/||\\mathbf{u}|| ||\\mathbf{v}||$ denote the cosine similarity between two vectors $\\mathbf{u}$ and $\\mathbf{v}$. Then the loss function for a positive pair of examples $\\left(i, j\\right)$ is :\r\n\r\n$$ \\mathbb{l}\\_{i,j} = -\\log\\frac{\\exp\\left(\\text{sim}\\left(\\mathbf{z}\\_{i}, \\mathbf{z}\\_{j}\\right)\/\\tau\\right)}{\\sum^{2N}\\_{k=1}\\mathcal{1}\\_{[k\\neq{i}]}\\exp\\left(\\text{sim}\\left(\\mathbf{z}\\_{i}, \\mathbf{z}\\_{k}\\right)\/\\tau\\right)}$$\r\n\r\nwhere $\\mathcal{1}\\_{[k\\neq{i}]} \\in ${$0, 1$} is an indicator function evaluating to $1$ iff $k\\neq{i}$ and $\\tau$ denotes a temperature parameter. The final loss is computed across all positive pairs, both $\\left(i, j\\right)$ and $\\left(j, i\\right)$, in a mini-batch.\r\n\r\nSource: [SimCLR](https:\/\/paperswithcode.com\/method\/simclr)","252":"**Random Gaussian Blur** is an image data augmentation technique where we randomly blur the image using a Gaussian distribution.\r\n\r\nImage Source: [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Gaussian_blur)","253":"**RandomResizedCrop** is a type of image data augmentation where a crop of random size of the original size and a random aspect ratio of the original aspect ratio is made. This crop is finally resized to given size.\r\n\r\nImage Credit: [Apache MXNet](https:\/\/mxnet.apache.org\/versions\/1.5.0\/tutorials\/gluon\/data_augmentation.html)","254":"**ColorJitter** is a type of image data augmentation where we randomly change the brightness, contrast and saturation of an image.\r\n\r\nImage Credit: [Apache MXNet](https:\/\/mxnet.apache.org\/versions\/1.5.0\/tutorials\/gluon\/data_augmentation.html)","255":"**SimCLR** is a framework for contrastive learning of visual representations. It learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space. It consists of:\r\n\r\n- A stochastic data augmentation module that transforms any given data example randomly resulting in two correlated views of the same example, denoted $\\mathbf{\\tilde{x}\\_{i}}$ and $\\mathbf{\\tilde{x}\\_{j}}$, which is considered a positive pair. 
SimCLR sequentially applies three simple augmentations: random cropping followed by resize back to the original size, random color distortions, and [random Gaussian blur](https:\/\/paperswithcode.com\/method\/random-gaussian-blur). The authors find random crop and color distortion is crucial to achieve good performance.\r\n\r\n- A neural network base encoder $f\\left(\u00b7\\right)$ that extracts representation vectors from augmented data examples. The framework allows various choices of the network architecture without any constraints. The authors opt for simplicity and adopt [ResNet](https:\/\/paperswithcode.com\/method\/resnet) to obtain $h\\_{i} = f\\left(\\mathbf{\\tilde{x}}\\_{i}\\right) = \\text{ResNet}\\left(\\mathbf{\\tilde{x}}\\_{i}\\right)$ where $h\\_{i} \\in \\mathbb{R}^{d}$ is the output after the [average pooling](https:\/\/paperswithcode.com\/method\/average-pooling) layer.\r\n\r\n- A small neural network projection head $g\\left(\u00b7\\right)$ that maps representations to the space where contrastive loss is applied. Authors use a MLP with one hidden layer to obtain $z\\_{i} = g\\left(h\\_{i}\\right) = W^{(2)}\\sigma\\left(W^{(1)}h\\_{i}\\right)$ where $\\sigma$ is a [ReLU](https:\/\/paperswithcode.com\/method\/relu) nonlinearity. The authors find it beneficial to define the contrastive loss on $z\\_{i}$\u2019s rather than $h\\_{i}$\u2019s.\r\n\r\n- A contrastive loss function defined for a contrastive prediction task. Given a set {$\\mathbf{\\tilde{x}}\\_{k}$} including a positive pair of examples $\\mathbf{\\tilde{x}}\\_{i}$ and $\\mathbf{\\tilde{x}\\_{j}}$ , the contrastive prediction task aims to identify $\\mathbf{\\tilde{x}}\\_{j}$ in {$\\mathbf{\\tilde{x}}\\_{k}$}$\\_{k\\neq{i}}$ for a given $\\mathbf{\\tilde{x}}\\_{i}$.\r\n\r\nA minibatch of $N$ examples is randomly sampled and the contrastive prediction task is defined on pairs of augmented examples derived from the minibatch, resulting in $2N$ data points. Negative examples are not sampled explicitly. Instead, given a positive pair, the other $2(N \u2212 1)$ augmented examples within a minibatch are treated as negative examples. A [NT-Xent](https:\/\/paperswithcode.com\/method\/nt-xent) (the normalized\r\ntemperature-scaled cross entropy loss) loss function is used (see components).","256":"**DCGAN**, or **Deep Convolutional GAN**, is a generative adversarial network architecture. It uses a couple of guidelines, in particular:\r\n\r\n- Replacing any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).\r\n- Using batchnorm in both the generator and the discriminator.\r\n- Removing fully connected hidden layers for deeper architectures.\r\n- Using [ReLU](https:\/\/paperswithcode.com\/method\/relu) activation in generator for all layers except for the output, which uses tanh.\r\n- Using LeakyReLU activation in the discriminator for all layer.","257":"**Domain Adaptive Neighborhood Clustering via Entropy Optimization (DANCE)** is a self-supervised clustering method that harnesses the cluster structure of the target domain using self-supervision. This is done with a neighborhood clustering technique that self-supervises feature learning in the target. At the same time, useful source features and class boundaries are preserved and adapted with a partial domain alignment loss that the authors refer to as entropy separation loss. 
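The NT-Xent loss that SimCLR uses (defined a few entries above) can be written compactly as a cross-entropy over temperature-scaled cosine similarities. Below is a hedged PyTorch sketch; the batch layout, in which rows $2k$ and $2k+1$ of `z` are the two augmented views of example $k$, is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def nt_xent(z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z: (2N, d) projections; rows 2k and 2k+1 form a positive pair."""
    z = F.normalize(z, dim=1)                  # dot products become cosine similarities
    sim = z @ z.t() / tau                      # (2N, 2N) temperature-scaled similarities
    sim.fill_diagonal_(float("-inf"))          # the k != i indicator: drop self-similarity
    positives = torch.arange(z.size(0), device=z.device) ^ 1   # 2k <-> 2k+1
    return F.cross_entropy(sim, positives)     # averaged over all 2N positive pairs

loss = nt_xent(torch.randn(8, 128))
```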
This loss allows the model to either match each target example with the source, or reject it as unknown.","258":"**Linear discriminant analysis** (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.\r\n\r\nExtracted from [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Linear_discriminant_analysis)\r\n\r\n**Source**:\r\n\r\nPaper: [Linear Discriminant Analysis: A Detailed Tutorial](https:\/\/dx.doi.org\/10.3233\/AIC-170729)\r\n\r\nPublic version: [Linear Discriminant Analysis: A Detailed Tutorial](https:\/\/usir.salford.ac.uk\/id\/eprint\/52074\/)","259":"**Conditional Random Fields** or **CRFs** are a type of probabilistic graph model that take neighboring sample context into account for tasks like classification. Prediction is modeled as a graphical model, which implements dependencies between the predictions. Graph choice depends on the application, for example linear chain CRFs are popular in natural language processing, whereas in image-based tasks, the graph would connect to neighboring locations in an image to enforce that they have similar predictions.\r\n\r\nImage Credit: [Charles Sutton and Andrew McCallum, An Introduction to Conditional Random Fields](https:\/\/homepages.inf.ed.ac.uk\/csutton\/publications\/crftut-fnt.pdf)","260":"**SGD with Momentum** is a stochastic optimization method that adds a momentum term to regular stochastic gradient descent:\r\n\r\n$$v\\_{t} = \\gamma{v}\\_{t-1} + \\eta\\nabla\\_{\\theta}J\\left(\\theta\\right)$$\r\n$$\\theta\\_{t} = \\theta\\_{t-1} - v\\_{t} $$\r\n\r\nA typical value for $\\gamma$ is $0.9$. The momentum name comes from an analogy to physics, such as ball accelerating down a slope. 
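A minimal NumPy sketch of the two SGD-with-momentum equations above, with a toy quadratic objective as a usage example (function and variable names are illustrative):

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr=0.01, gamma=0.9):
    """v_t = gamma * v_{t-1} + lr * grad ;  theta_t = theta_{t-1} - v_t."""
    velocity = gamma * velocity + lr * grad
    return theta - velocity, velocity

theta, v = np.ones(3), np.zeros(3)
for _ in range(100):
    grad = 2.0 * theta                        # gradient of the toy objective ||theta||^2
    theta, v = sgd_momentum_step(theta, grad, v)
print(theta)                                  # approaches the minimum at zero
```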
In the case of weight updates, we can think of the weights as a particle traveling through parameter space which incurs acceleration from the gradient of the loss.\r\n\r\nImage Source: [Juan Du](https:\/\/www.researchgate.net\/figure\/The-compare-of-the-SGD-algorithms-with-and-without-momentum-Take-Task-1-as-example-The_fig1_333469047)","261":"**Demon Adam** is a stochastic optimizer where the [Demon](https:\/\/paperswithcode.com\/method\/demon) momentum rule is applied to the [Adam](https:\/\/paperswithcode.com\/method\/adam) optimizer.\r\n\r\n$$ \\beta\\_{t} = \\beta\\_{init}\\cdot\\frac{\\left(1-\\frac{t}{T}\\right)}{\\left(1-\\beta\\_{init}\\right) + \\beta\\_{init}\\left(1-\\frac{t}{T}\\right)} $$\r\n\r\n$$ m\\_{t, i} = g\\_{t, i} + \\beta\\_{t}m\\_{t-1, i} $$\r\n\r\n$$ v\\_{t+1} = \\beta\\_{2}v\\_{t} + \\left(1-\\beta\\_{2}\\right)g^{2}\\_{t} $$\r\n\r\n$$ \\theta_{t} = \\theta_{t-1} - \\eta\\frac{\\hat{m}\\_{t}}{\\sqrt{\\hat{v}\\_{t}} + \\epsilon} $$","262":"**Demon CM**, or **SGD with Momentum and Demon**, is the [Demon](https:\/\/paperswithcode.com\/method\/demon) momentum rule applied to [SGD with momentum](https:\/\/paperswithcode.com\/method\/sgd-with-momentum).\r\n\r\n$$ \\beta\\_{t} = \\beta\\_{init}\\cdot\\frac{\\left(1-\\frac{t}{T}\\right)}{\\left(1-\\beta\\_{init}\\right) + \\beta\\_{init}\\left(1-\\frac{t}{T}\\right)} $$\r\n\r\n$$ \\theta\\_{t+1} = \\theta\\_{t} - \\eta{g}\\_{t} + \\beta\\_{t}v\\_{t} $$\r\n\r\n$$ v\\_{t+1} = \\beta\\_{t}{v\\_{t}} - \\eta{g\\_{t}} $$","263":"**Decaying Momentum**, or **Demon**, is a stochastic optimizer motivated by decaying the total contribution of a gradient to all future updates. By decaying the momentum parameter, the total contribution of a gradient to all future updates is decayed. A particular gradient term $g\\_{t}$ contributes a total of $\\eta\\sum\\_{i}\\beta^{i}$ of its \"energy\" to all future gradient updates, and this results in the geometric sum, $\\sum^{\\infty}\\_{i=1}\\beta^{i} = \\beta\\sum^{\\infty}\\_{i=0}\\beta^{i} = \\frac{\\beta}{\\left(1-\\beta\\right)}$. Decaying this sum results in the Demon algorithm. Letting $\\beta\\_{init}$ be the initial $\\beta$; then at the current step $t$ with total $T$ steps, the decay routine is given by solving the below for $\\beta\\_{t}$:\r\n\r\n$$ \\frac{\\beta\\_{t}}{\\left(1-\\beta\\_{t}\\right)} = \\left(1-t\/T\\right)\\beta\\_{init}\/\\left(1-\\beta\\_{init}\\right)$$\r\n\r\nWhere $\\left(1-t\/T\\right)$ refers to the proportion of iterations remaining. Note that Demon typically requires no hyperparameter tuning as it is usually decayed to $0$ or a small negative value at time \r\n$T$. Improved performance is observed by delaying the decaying. Demon can be applied to any gradient descent algorithm with a momentum parameter.","264":"**Feature Intertwiner** is an object detection module that leverages the features from a more reliable set to help guide the feature learning of another less reliable set. The mutual learning process helps two sets to have closer distance within the cluster in each class. 
The intertwiner is applied on the object detection task, where a historical buffer is proposed to address the sample missing problem during one mini-batch and the optimal transport (OT) theory is introduced to enforce the similarity among the two sets.","265":"One difficulty that arises with optimization of deep neural networks is that large parameter gradients can lead an [SGD](https:\/\/paperswithcode.com\/method\/sgd) optimizer to update the parameters strongly into a region where the loss function is much greater, effectively undoing much of the work that was needed to get to the current solution.\r\n\r\n**Gradient Clipping** clips the size of the gradients to ensure optimization performs more reasonably near sharp areas of the loss surface. It can be performed in a number of ways. One option is to simply clip the parameter gradient element-wise before a parameter update. Another option is to clip the norm ||$\\textbf{g}$|| of the gradient $\\textbf{g}$ before a parameter update:\r\n\r\n$$\\text{ if } ||\\textbf{g}|| > v \\text{ then } \\textbf{g} \\leftarrow \\frac{\\textbf{g}{v}}{||\\textbf{g}||}$$\r\n\r\nwhere $v$ is a norm threshold.\r\n\r\nSource: Deep Learning, Goodfellow et al\r\n\r\nImage Source: [Pascanu et al](https:\/\/arxiv.org\/pdf\/1211.5063.pdf)","266":"**Non Maximum Suppression** is a computer vision method that selects a single entity out of many overlapping entities (for example bounding boxes in object detection). The criteria is usually discarding entities that are below a given probability bound. With remaining entities we repeatedly pick the entity with the highest probability, output that as the prediction, and discard any remaining box where a $\\text{IoU} \\geq 0.5$ with the box output in the previous step.\r\n\r\nImage Credit: [Martin Kersner](https:\/\/github.com\/martinkersner\/non-maximum-suppression-cpp)","267":"**RandomHorizontalFlip** is a type of image data augmentation which horizontally flips a given image with a given probability.\r\n\r\nImage Credit: [Apache MXNet](https:\/\/mxnet.apache.org\/versions\/1.5.0\/tutorials\/gluon\/data_augmentation.html)","268":"**PointNet** provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. It directly takes point clouds as input and outputs either class labels for the entire input or per point segment\/part labels for each point of the input.\r\n\r\nSource: [Qi et al.](https:\/\/arxiv.org\/pdf\/1612.00593v2.pdf)\r\n\r\nImage source: [Qi et al.](https:\/\/arxiv.org\/pdf\/1612.00593v2.pdf)","269":"**SSD** is a single-stage object detection method that discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. \r\n\r\nThe fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. 
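A hedged NumPy sketch of the greedy Non Maximum Suppression procedure described above: discard low-probability boxes, repeatedly keep the highest-scoring remaining box, and drop every other box whose IoU with it is at least the threshold (0.5 in the description).

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, score_thresh=0.05, iou_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(-scores)
    order = order[scores[order] > score_thresh]      # discard low-probability boxes first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep
```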
Improvements over competing single-stage methods include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales.","270":"**Fully Convolutional Networks**, or **FCNs**, are an architecture used mainly for semantic segmentation. They employ solely locally connected layers, such as [convolution](https:\/\/paperswithcode.com\/method\/convolution), pooling and upsampling. Avoiding the use of dense layers means less parameters (making the networks faster to train). It also means an FCN can work for variable image sizes given all connections are local.\r\n\r\nThe network consists of a downsampling path, used to extract and interpret the context, and an upsampling path, which allows for localization. \r\n\r\nFCNs also employ skip connections to recover the fine-grained spatial information lost in the downsampling path.","271":"**Wide Residual Networks** are a variant on [ResNets](https:\/\/paperswithcode.com\/method\/resnet) where we decrease depth and increase the width of residual networks. This is achieved through the use of wide residual blocks.","272":"**Auxiliary Classifiers** are type of architectural component that seek to improve the convergence of very deep networks. They are classifier heads we attach to layers before the end of the network. The motivation is to push useful gradients to the lower layers to make them immediately useful and improve the convergence during training by combatting the vanishing gradient problem. They are notably used in the Inception family of convolutional neural networks.","273":"In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task\/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows the instruction-based learning in both pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with the recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performances on uni-modal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Our code and models are publicly available at https:\/\/github.com\/OFA-Sys\/OFA.","274":"**ConvLSTM** is a type of recurrent neural network for spatio-temporal prediction that has convolutional structures in both the input-to-state and state-to-state transitions. The ConvLSTM determines the future state of a certain cell in the grid by the inputs and past states of its local neighbors. This can easily be achieved by using a [convolution](https:\/\/paperswithcode.com\/method\/convolution) operator in the state-to-state and input-to-state transitions (see Figure). 
The key equations of ConvLSTM are shown below, where $\u2217$ denotes the convolution operator and $\\odot$ the Hadamard product:\r\n\r\n$$ i\\_{t} = \\sigma\\left(W\\_{xi} \u2217 X\\_{t} + W\\_{hi} \u2217 H\\_{t\u22121} + W\\_{ci} \\odot \\mathcal{C}\\_{t\u22121} + b\\_{i}\\right) $$\r\n\r\n$$ f\\_{t} = \\sigma\\left(W\\_{xf} \u2217 X\\_{t} + W\\_{hf} \u2217 H\\_{t\u22121} + W\\_{cf} \\odot \\mathcal{C}\\_{t\u22121} + b\\_{f}\\right) $$\r\n\r\n$$ \\mathcal{C}\\_{t} = f\\_{t} \\odot \\mathcal{C}\\_{t\u22121} + i\\_{t} \\odot \\text{tanh}\\left(W\\_{xc} \u2217 X\\_{t} + W\\_{hc} \u2217 \\mathcal{H}\\_{t\u22121} + b\\_{c}\\right) $$\r\n\r\n$$ o\\_{t} = \\sigma\\left(W\\_{xo} \u2217 X\\_{t} + W\\_{ho} \u2217 \\mathcal{H}\\_{t\u22121} + W\\_{co} \\odot \\mathcal{C}\\_{t} + b\\_{o}\\right) $$\r\n\r\n$$ \\mathcal{H}\\_{t} = o\\_{t} \\odot \\text{tanh}\\left(C\\_{t}\\right) $$\r\n\r\nIf we view the states as the hidden representations of moving objects, a ConvLSTM with a larger transitional kernel should be able to capture faster motions while one with a smaller kernel can capture slower motions. \r\n\r\nTo ensure that the states have the same number of rows and same number of columns as the inputs, padding is needed before applying the convolution operation. Here, padding of the hidden states on the boundary points can be viewed as using the state of the outside world for calculation. Usually, before the first input comes, we initialize all the states of the [LSTM](https:\/\/paperswithcode.com\/method\/lstm) to zero which corresponds to \"total ignorance\" of the future.","275":"**mBART** is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the [BART objective](https:\/\/paperswithcode.com\/method\/bart). The input texts are noised by masking phrases and permuting sentences, and a single [Transformer model](https:\/\/paperswithcode.com\/method\/transformer) is learned to recover the texts. Different from other pre-training approaches for machine translation, mBART pre-trains a complete autoregressive [Seq2Seq](https:\/\/paperswithcode.com\/method\/seq2seq) model. mBART is trained once for all languages, providing a set of parameters that can be fine-tuned for any of the language pairs in both supervised and unsupervised settings, without any task-specific or language-specific modifications or initialization schemes.","276":"An **Hourglass Module** is an image block module used mainly for pose estimation tasks. The design of the hourglass is motivated by the need to capture information at every scale. While local evidence is essential for identifying features like faces and hands, a final pose estimate requires a coherent understanding of the full body. The person\u2019s orientation, the arrangement of their limbs, and the relationships of adjacent joints are among the many cues that are best recognized at different scales in the image. The hourglass is a simple, minimal design that has the capacity to capture all of these features and bring them together to output pixel-wise predictions.\r\n\r\nThe network must have some mechanism to effectively process and consolidate features across scales. The Hourglass uses a single pipeline with skip layers to preserve spatial information at each resolution. 
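A compact PyTorch sketch of the ConvLSTM gating equations above, computing all gates with a single convolution over the concatenated input and hidden state; the Hadamard "peephole" terms $W_{ci}, W_{cf}, W_{co}$ are omitted here, which is a simplification of the equations as written.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: gates are convolutions, so each grid location's
    next state depends on its local spatial neighbourhood."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2          # 'same' padding keeps the states at H x W
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size, padding=pad)

    def forward(self, x, state):
        h, c = state                    # hidden and cell state, each (B, hid_ch, H, W)
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c

cell = ConvLSTMCell(in_ch=1, hid_ch=16)
x = torch.randn(2, 1, 32, 32)
h = c = torch.zeros(2, 16, 32, 32)      # zero initialization: "total ignorance" of the future
h, c = cell(x, (h, c))
```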
The network reaches its lowest resolution at 4x4 pixels allowing smaller spatial filters to be applied that compare features across the entire space of the image.\r\n\r\nThe hourglass is set up as follows: Convolutional and [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) layers are used to process features down to a very low resolution. At each max pooling step, the network branches off and applies more convolutions at the original pre-pooled resolution. After reaching the lowest resolution, the network begins the top-down sequence of upsampling and combination of features across scales. To bring together information across two adjacent resolutions, we do nearest neighbor upsampling of the lower resolution followed by an elementwise addition of the two sets of features. The topology of the hourglass is symmetric, so for every layer present on the way down there is a corresponding layer going up.\r\n\r\nAfter reaching the output resolution of the network, two consecutive rounds of 1x1 convolutions are applied to produce the final network predictions. The output of the network is a set of heatmaps where for a given [heatmap](https:\/\/paperswithcode.com\/method\/heatmap) the network predicts the probability of a joint\u2019s presence at each and every pixel.","277":"**Stacked Hourglass Networks** are a type of convolutional neural network for pose estimation. They are based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.","278":"Image Scale Augmentation is an augmentation technique where we randomly pick the short size of a image within a dimension range. One use case of this augmentation technique is in object detectiont asks.","279":"**EfficientDet** is a type of object detection model, which utilizes several optimization and backbone tweaks, such as the use of a [BiFPN](https:\/\/paperswithcode.com\/method\/bifpn), and a compound scaling method that uniformly scales the resolution,depth and width for all backbones, feature networks and box\/class prediction networks at the same time.","280":"A **BiFPN**, or **Weighted Bi-directional Feature Pyramid Network**, is a type of feature pyramid network which allows easy and fast multi-scale feature fusion. It incorporates the multi-level feature fusion idea from [FPN](https:\/\/paperswithcode.com\/method\/fpn), [PANet](https:\/\/paperswithcode.com\/method\/panet) and [NAS-FPN](https:\/\/paperswithcode.com\/method\/nas-fpn) that enables information to flow in both the top-down and bottom-up directions, while using regular and efficient connections. It also utilizes a fast normalized fusion technique. Traditional approaches usually treat all features input to the FPN equally, even those with different resolutions. However, input features at different resolutions often have unequal contributions to the output features. Thus, the BiFPN adds an additional weight for each input feature allowing the network to learn the importance of each. All regular convolutions are also replaced with less expensive depthwise separable convolutions.\r\n\r\nComparing with PANet, PANet added an extra bottom-up path for information flow at the expense of more computational cost. 
Whereas BiFPN optimizes these cross-scale connections by removing nodes with a single input edge, adding an extra edge from the original input to output node if they are on the same level, and treating each bidirectional path as one feature network layer (repeating it several times for more high-level future fusion).","281":"A **Siamese Network** consists of twin networks which accept distinct inputs but are joined by an energy function at the top. This function computes a metric between the highest level feature representation on each side. The parameters between the twin networks are tied. [Weight tying](https:\/\/paperswithcode.com\/method\/weight-tying) guarantees that two extremely similar images are not mapped by each network to very different locations in feature space because each network computes the same function. The network is symmetric, so that whenever we present two distinct images to the twin networks, the top conjoining layer will compute the same metric as if we were to we present the same two images but to the opposite twins.\r\n\r\nIntuitively instead of trying to classify inputs, a siamese network learns to differentiate between inputs, learning their similarity. The loss function used is usually a form of contrastive loss.\r\n\r\nSource: [Koch et al](https:\/\/www.cs.cmu.edu\/~rsalakhu\/papers\/oneshot1.pdf)","282":"**Routed Attention** is an attention pattern proposed as part of the [Routing Transformer](https:\/\/paperswithcode.com\/method\/routing-transformer) architecture. Each attention module\r\nconsiders a clustering of the space: the current timestep only attends to context belonging to the same cluster. In other word, the current time-step query is routed to a limited number of context through its cluster assignment. This can be contrasted with [strided](https:\/\/paperswithcode.com\/method\/strided-attention) attention patterns and those proposed with the [Sparse Transformer](https:\/\/paperswithcode.com\/method\/sparse-transformer).\r\n\r\nIn the image to the right, the rows represent the outputs while the columns represent the inputs. The different colors represent cluster memberships for the output token.","283":"The **Routing Transformer** is a [Transformer](https:\/\/paperswithcode.com\/method\/transformer) that endows self-attention with a sparse routing module based on online k-means. Each attention module considers a clustering of the space: the current timestep only attends to context belonging to the same cluster. In other word, the current time-step query is routed to a limited number of context through its cluster assignment.","284":"**Weight Normalization** is a normalization method for training neural networks. It is inspired by [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization), but it is a deterministic method that does not share batch normalization's property of adding noise to the gradients. It reparameterizes each weight vector $\\textbf{w}$ in terms of a parameter vector $\\textbf{v}$ and a scalar parameter $g$ and to perform stochastic gradient descent with respect to those parameters instead. Weight vectors are expressed in terms of the new parameters using:\r\n\r\n$$ \\textbf{w} = \\frac{g}{\\Vert\\\\textbf{v}\\Vert}\\textbf{v}$$\r\n\r\nwhere $\\textbf{v}$ is a $k$-dimensional vector, $g$ is a scalar, and $\\Vert\\textbf{v}\\Vert$ denotes the Euclidean norm of $\\textbf{v}$. 
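A minimal sketch of this reparameterization (PyTorch exposes the same idea as `torch.nn.utils.weight_norm`, applied per output unit of a layer):

```python
import numpy as np

def weight_norm(v: np.ndarray, g: float) -> np.ndarray:
    """w = (g / ||v||) * v, so the Euclidean norm of w equals g by construction."""
    return g / np.linalg.norm(v) * v

w = weight_norm(np.random.randn(5), g=2.0)
assert np.isclose(np.linalg.norm(w), 2.0)
```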
This reparameterization has the effect of fixing the Euclidean norm of the weight vector $\\textbf{w}$: we now have $\\Vert\\textbf{w}\\Vert = g$, independent of the parameters $\\textbf{v}$.","285":"Network On Network (NON) is practical tabular data classification model based on deep neural network to provide accurate predictions. Various deep methods have been proposed and promising progress has been made. However, most of them use operations like neural network and factorization machines to fuse the embeddings of different features directly, and linearly combine the outputs of those operations to get the final prediction. As a result, the intra-field information and the non-linear interactions between those operations (e.g. neural network and factorization machines) are ignored. Intra-field information is the information that features inside each field belong to the same field. NON is proposed to take full advantage of intra-field information and non-linear interactions. It consists of three components: field-wise network at the bottom to capture the intra-field information, across field network in the middle to choose suitable operations data-drivenly, and operation fusion network on the top to fuse outputs of the chosen operations deeply","286":"**mt5** is a multilingual variant of [T5](https:\/\/paperswithcode.com\/method\/t5) that was pre-trained on a new Common Crawl-based dataset covering $101$ languages.","287":"**BART** is a [denoising autoencoder](https:\/\/paperswithcode.com\/method\/denoising-autoencoder) for pretraining sequence-to-sequence models. It is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard [Transformer](https:\/\/paperswithcode.com\/method\/transformer)-based neural machine translation architecture. It uses a standard seq2seq\/NMT architecture with a bidirectional encoder (like [BERT](https:\/\/paperswithcode.com\/method\/bert)) and a left-to-right decoder (like [GPT](https:\/\/paperswithcode.com\/method\/gpt)). This means the encoder's attention mask is fully visible, like BERT, and the decoder's attention mask is causal, like [GPT2](https:\/\/paperswithcode.com\/method\/gpt-2).","288":"**Prioritized Experience Replay** is a type of [experience replay](https:\/\/paperswithcode.com\/method\/experience-replay) in reinforcement learning where we In more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error. This prioritization can lead to a loss of diversity, which is alleviated with stochastic prioritization, and introduce bias, which can be corrected with importance sampling.\r\n\r\nThe stochastic sampling method interpolates between pure greedy prioritization and uniform random sampling. The probability of being sampled is ensured to be monotonic in a transition's priority, while guaranteeing a non-zero probability even for the lowest-priority transition. Concretely, define the probability of sampling transition $i$ as\r\n\r\n$$P(i) = \\frac{p_i^{\\alpha}}{\\sum_k p_k^{\\alpha}}$$\r\n\r\nwhere $p_i > 0$ is the priority of transition $i$. The exponent $\\alpha$ determines how much prioritization is used, with $\\alpha=0$ corresponding to the uniform case.\r\n\r\nPrioritized replay introduces bias because it changes this distribution in an uncontrolled fashion, and therefore changes the solution that the estimates will converge to. 
We can correct this bias by using\r\nimportance-sampling (IS) weights:\r\n\r\n$$ w\\_{i} = \\left(\\frac{1}{N}\\cdot\\frac{1}{P\\left(i\\right)}\\right)^{\\beta} $$\r\n\r\nthat fully compensates for the non-uniform probabilities $P\\left(i\\right)$ if $\\beta = 1$. These weights can be folded into the [Q-learning](https:\/\/paperswithcode.com\/method\/q-learning) update by using $w\\_{i}\\delta\\_{i}$ instead of $\\delta\\_{i}$ - weighted IS rather than ordinary IS. For stability reasons, we always normalize weights by $1\/\\max\\_{i}w\\_{i}$ so\r\nthat they only scale the update downwards.\r\n\r\nThe two types of prioritization are proportional based, where $p\\_{i} = |\\delta\\_{i}| + \\epsilon$ and rank-based, where $p\\_{i} = \\frac{1}{\\text{rank}\\left(i\\right)}$, the latter where $\\text{rank}\\left(i\\right)$ is the rank of transition $i$ when the replay memory is sorted according to |$\\delta\\_{i}$|, For proportional based, hyperparameters used were $\\alpha = 0.7$, $\\beta\\_{0} = 0.5$. For the rank-based variant, hyperparameters used were $\\alpha = 0.6$, $\\beta\\_{0} = 0.4$.","289":"**Monte-Carlo Tree Search** is a planning algorithm that accumulates value estimates obtained from Monte Carlo simulations in order to successively direct simulations towards more highly-rewarded trajectories. We execute MCTS after encountering each new state to select an agent's action for that state: it is executed again to select the action for the next state. Each execution is an iterative process that simulates many trajectories starting from the current state to the terminal state. The core idea is to successively focus multiple simulations starting at the current state by extending the initial portions of trajectories that have received high evaluations from earlier simulations.\r\n\r\nSource: Sutton and Barto, Reinforcement Learning (2nd Edition)\r\n\r\nImage Credit: [Chaslot et al](https:\/\/www.aaai.org\/Papers\/AIIDE\/2008\/AIIDE08-036.pdf)","290":"**MuZero** is a model-based reinforcement learning algorithm. It builds upon [AlphaZero](https:\/\/paperswithcode.com\/method\/alphazero)'s search and search-based policy iteration algorithms, but incorporates a learned model into the training procedure. \r\n\r\nThe main idea of the algorithm is to predict those aspects of the future that are directly relevant for planning. The model receives the observation (e.g. an image of the Go board or the Atari screen) as an\r\ninput and transforms it into a hidden state. The hidden state is then updated iteratively by a recurrent process that receives the previous hidden state and a hypothetical next action. At every one of these steps the model predicts the policy (e.g. the move to play), value function (e.g. the predicted winner), and immediate reward (e.g. the points scored by playing a move). The model is trained end-to-end, with the sole objective of accurately estimating these three important quantities, so as to match the improved estimates of policy and value generated by search as well as the observed reward. \r\n\r\nThere is no direct constraint or requirement for the hidden state to capture all information necessary to reconstruct the original observation, drastically reducing the amount of information the model has to maintain and predict; nor is there any requirement for the hidden state to match the unknown, true state of the environment; nor any other constraints on the semantics of state. 
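A NumPy sketch of proportional prioritization as described above: priorities $p_{i} = |\delta_{i}| + \epsilon$, sampling probabilities $P(i) = p_{i}^{\alpha}/\sum_{k}p_{k}^{\alpha}$, and importance-sampling weights normalized so they only scale updates downwards (normalizing by the maximum within the sampled batch, as done here, is a common simplification).

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.7, beta=0.5, eps=1e-5):
    """Proportional prioritized replay: sample indices and their IS weights."""
    p = np.abs(td_errors) + eps                  # priorities p_i = |delta_i| + eps
    probs = p ** alpha / np.sum(p ** alpha)      # P(i) = p_i^alpha / sum_k p_k^alpha
    idx = np.random.choice(len(p), size=batch_size, p=probs)
    n = len(p)
    weights = (1.0 / (n * probs[idx])) ** beta   # w_i = (1 / (N * P(i)))^beta
    weights /= weights.max()                     # only scale the update downwards
    return idx, weights

idx, w = per_sample(np.random.randn(1000), batch_size=32)
```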
Instead, the hidden states are free to represent state in whatever way is relevant to predicting current and future values and policies. Intuitively, the agent can invent, internally, the rules or dynamics that lead to most accurate planning.","291":"Diffusion-convolutional neural networks (DCNN) is a model for graph-structured data. Through the introduction of a diffusion-convolution operation, diffusion-based representations can be learned from graph structured data and used as an effective basis for node classification.\r\n\r\nDescription and image from: [Diffusion-Convolutional Neural Networks](https:\/\/arxiv.org\/pdf\/1511.02136.pdf)","292":"**Double Q-learning** is an off-policy reinforcement learning algorithm that utilises double estimation to counteract overestimation problems with traditional Q-learning. \r\n\r\nThe max operator in standard [Q-learning](https:\/\/paperswithcode.com\/method\/q-learning) and [DQN](https:\/\/paperswithcode.com\/method\/dqn) uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation, which is the idea behind Double Q-learning:\r\n\r\n$$ Y^{Q}\\_{t} = R\\_{t+1} + \\gamma{Q}\\left(S\\_{t+1}, \\arg\\max\\_{a}Q\\left(S\\_{t+1}, a; \\mathbb{\\theta}\\_{t}\\right);\\mathbb{\\theta}\\_{t}\\right) $$\r\n\r\nThe Double Q-learning error can then be written as:\r\n\r\n$$ Y^{DoubleQ}\\_{t} = R\\_{t+1} + \\gamma{Q}\\left(S\\_{t+1}, \\arg\\max\\_{a}Q\\left(S\\_{t+1}, a; \\mathbb{\\theta}\\_{t}\\right);\\mathbb{\\theta}^{'}\\_{t}\\right) $$\r\n\r\nHere the selection of the action in the $\\arg\\max$ is still due to the online weights $\\theta\\_{t}$. But we use a second set of weights $\\mathbb{\\theta}^{'}\\_{t}$ to fairly evaluate the value of this policy.\r\n\r\nSource: [Deep Reinforcement Learning with Double Q-learning](https:\/\/paperswithcode.com\/paper\/deep-reinforcement-learning-with-double-q)","293":"A **Double Deep Q-Network**, or **Double DQN** utilises [Double Q-learning](https:\/\/paperswithcode.com\/method\/double-q-learning) to reduce overestimation by decomposing the max operation in the target into action selection and action evaluation. We evaluate the greedy policy according to the online network, but we use the target network to estimate its value. The update is the same as for [DQN](https:\/\/paperswithcode.com\/method\/dqn), but replacing the target $Y^{DQN}\\_{t}$ with:\r\n\r\n$$ Y^{DoubleDQN}\\_{t} = R\\_{t+1}+\\gamma{Q}\\left(S\\_{t+1}, \\arg\\max\\_{a}Q\\left(S\\_{t+1}, a; \\theta\\_{t}\\right);\\theta\\_{t}^{-}\\right) $$\r\n\r\nCompared to the original formulation of Double [Q-Learning](https:\/\/paperswithcode.com\/method\/q-learning), in Double DQN the weights of the second network $\\theta^{'}\\_{t}$ are replaced with the weights of the target network $\\theta\\_{t}^{-}$ for the evaluation of the current greedy policy.","294":"**Graphic Mutual Information**, or **GMI**, measures the correlation between input graphs and high-level hidden representations. GMI generalizes the idea of conventional mutual information computations from vector space to the graph domain where measuring mutual information from two aspects of node features and topological structure is indispensable. 
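A hedged PyTorch sketch of the Double DQN target above; `online_q` and `target_q` stand for the online and target networks (any modules mapping a batch of states to per-action Q-values), and `dones` is a 0/1 float mask for terminal transitions.

```python
import torch

@torch.no_grad()
def double_dqn_target(online_q, target_q, next_states, rewards, dones, gamma=0.99):
    """Y_t = r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    next_actions = online_q(next_states).argmax(dim=1, keepdim=True)        # action selection
    next_values = target_q(next_states).gather(1, next_actions).squeeze(1)  # action evaluation
    return rewards + gamma * (1.0 - dones) * next_values
```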
GMI exhibits several benefits: First, it is invariant to the isomorphic transformation of input graphs---an inevitable constraint in many existing graph representation learning algorithms; Besides, it can be efficiently estimated and maximized by current mutual information estimation methods such as MINE.","295":"**Self-Adversarial Negative Sampling** is a negative sampling technique used for methods like [word embeddings](https:\/\/paperswithcode.com\/methods\/category\/word-embeddings) and [knowledge graph embeddings](https:\/\/paperswithcode.com\/methods\/category\/graph-embeddings). The traditional negative sampling loss from word2vec for optimizing distance-based models be written as:\r\n\r\n$$ L = \u2212\\log\\sigma\\left(\\gamma \u2212 d\\_{r}\\left(\\mathbf{h}, \\mathbf{t}\\right)\\right) \u2212 \\sum^{n}\\_{i=1}\\frac{1}{k}\\log\\sigma\\left(d\\_{r}\\left(\\mathbf{h}^{'}\\_{i}, \\mathbf{t}^{'}\\_{i}\\right) - \\gamma\\right) $$\r\n\r\nwhere $\\gamma$ is a fixed margin, $\\sigma$ is the sigmoid function, and $\\left(\\mathbf{h}^{'}\\_{i}, r, \\mathbf{t}^{'}\\_{i}\\right)$ is the $i$-th negative triplet. \r\n\r\nThe negative sampling loss samples the negative triplets in a uniform way. Such a uniform negative sampling suffers the problem of inefficiency since many samples are obviously false as training goes on, which does not provide any meaningful information. Therefore, the authors propose an approach called self-adversarial negative sampling, which samples negative triples according to the current embedding model. Specifically, we sample negative triples from the following distribution:\r\n\r\n$$ p\\left(h^{'}\\_{j}, r, t^{'}\\_{j} | \\text{set}\\left(h\\_{i}, r\\_{i}, t\\_{i} \\right) \\right) = \\frac{\\exp\\alpha{f}\\_{r}\\left(\\mathbf{h}^{'}\\_{j}, \\mathbf{t}^{'}\\_{j}\\right)}{\\sum\\_{i=1}\\exp\\alpha{f}\\_{r}\\left(\\mathbf{h}^{'}\\_{i}, \\mathbf{t}^{'}\\_{i}\\right)} $$\r\n\r\nwhere $\\alpha$ is the temperature of sampling. Moreover, since the sampling procedure may be costly, the authors treat the above probability as the weight of the negative sample. Therefore, the final negative sampling loss with self-adversarial training takes the following form:\r\n\r\n$$ L = \u2212\\log\\sigma\\left(\\gamma \u2212 d\\_{r}\\left(\\mathbf{h}, \\mathbf{t}\\right)\\right) \u2212 \\sum^{n}\\_{i=1}p\\left(h^{'}\\_{i}, r, t^{'}\\_{i}\\right)\\log\\sigma\\left(d\\_{r}\\left(\\mathbf{h}^{'}\\_{i}, \\mathbf{t}^{'}\\_{i}\\right) - \\gamma\\right) $$","296":"**RotatE** is a method for generating graph embeddings which is able to model and infer various relation patterns including: symmetry\/antisymmetry, inversion, and composition. Specifically, the RotatE model defines each relation as a rotation from the source entity to the target entity in the complex vector space. The RotatE model is trained using a [self-adversarial negative sampling](https:\/\/paperswithcode.com\/method\/self-adversarial-negative-sampling) technique.","297":"**VEGA** is an AutoML framework that is compatible and optimized for multiple hardware platforms. It integrates various modules of AutoML, including [Neural Architecture Search](https:\/\/paperswithcode.com\/method\/neural-architecture-search) (NAS), Hyperparameter Optimization (HPO), Auto Data Augmentation, Model Compression, and Fully Train. 
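A NumPy sketch of the final self-adversarial loss above for a single positive triple and its negatives. The score $f_{r}$ is taken here to be the negative of the distance $d_{r}$, an assumption consistent with distance-based models such as RotatE; as in the paper, the weights are treated as constants of the current model rather than being sampled from.

```python
import numpy as np

def self_adversarial_loss(pos_dist, neg_dists, gamma=12.0, alpha=1.0):
    """Distance-based loss with self-adversarial weighting of the negative terms."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    weights = np.exp(-alpha * neg_dists)
    weights = weights / weights.sum()                 # p_i: softmax of alpha * f_r over negatives
    pos_term = -np.log(sigmoid(gamma - pos_dist))
    neg_term = -np.sum(weights * np.log(sigmoid(neg_dists - gamma)))
    return pos_term + neg_term

print(self_adversarial_loss(pos_dist=2.0, neg_dists=np.array([8.0, 10.0, 3.0])))
```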
To support a variety of search algorithms and tasks, it involves a fine-grained search space and a description language to enable easy adaptation to different search algorithms and tasks.","298":"Procrustes","299":"An **Inception Module** is an image model block that aims to approximate an optimal local sparse structure in a CNN. Put simply, it allows for us to use multiple types of filter size, instead of being restricted to a single filter size, in a single image block, which we then concatenate and pass onto the next layer.","300":"**GoogLeNet** is a type of convolutional neural network based on the [Inception](https:\/\/paperswithcode.com\/method\/inception-module) architecture. It utilises Inception modules, which allow the network to choose between multiple convolutional filter sizes in each block. An Inception network stacks these modules on top of each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid.","301":"**Retrace** is an off-policy Q-value estimation algorithm which has guaranteed convergence for a target and behaviour policy $\\left(\\pi, \\beta\\right)$. With off-policy rollout for TD learning, we must use importance sampling for the update:\r\n\r\n$$ \\Delta{Q}^{\\text{imp}}\\left(S\\_{t}, A\\_{t}\\right) = \\gamma^{t}\\prod\\_{1\\leq{\\tau}\\leq{t}}\\frac{\\pi\\left(A\\_{\\tau}\\mid{S\\_{\\tau}}\\right)}{\\beta\\left(A\\_{\\tau}\\mid{S\\_{\\tau}}\\right)}\\delta\\_{t} $$\r\n\r\nThis product term can lead to high variance, so Retrace modifies $\\Delta{Q}$ to have importance weights truncated by no more than a constant $c$:\r\n\r\n$$ \\Delta{Q}^{\\text{imp}}\\left(S\\_{t}, A\\_{t}\\right) = \\gamma^{t}\\prod\\_{1\\leq{\\tau}\\leq{t}}\\min\\left(c, \\frac{\\pi\\left(A\\_{\\tau}\\mid{S\\_{\\tau}}\\right)}{\\beta\\left(A\\_{\\tau}\\mid{S\\_{\\tau}}\\right)}\\right)\\delta\\_{t} $$","302":"A **Stochastic Dueling Network**, or **SDN**, is an architecture for learning a value function $V$. The SDN learns both $V$ and $Q$ off-policy while maintaining consistency between the two estimates. At each time step it outputs a stochastic estimate of $Q$ and a deterministic estimate of $V$.","303":"**ACER**, or **Actor Critic with Experience Replay**, is an actor-critic deep reinforcement learning agent with [experience replay](https:\/\/paperswithcode.com\/method\/experience-replay). It can be seen as an off-policy extension of [A3C](https:\/\/paperswithcode.com\/method\/a3c), where the off-policy estimator is made feasible by:\r\n\r\n- Using [Retrace](https:\/\/paperswithcode.com\/method\/retrace) Q-value estimation.\r\n- Using truncated importance sampling with bias correction.\r\n- Using a trust region policy optimization method.\r\n- Using a [stochastic dueling network](https:\/\/paperswithcode.com\/method\/stochastic-dueling-network) architecture.","304":"**Additive Attention**, also known as **Bahdanau Attention**, uses a one-hidden layer feed-forward network to calculate the attention alignment score:\r\n\r\n$$f_{att}\\left(\\textbf{h}_{i}, \\textbf{s}\\_{j}\\right) = v\\_{a}^{T}\\tanh\\left(\\textbf{W}\\_{a}\\left[\\textbf{h}\\_{i};\\textbf{s}\\_{j}\\right]\\right)$$\r\n\r\nwhere $\\textbf{v}\\_{a}$ and $\\textbf{W}\\_{a}$ are learned attention parameters. Here $\\textbf{h}$ refers to the hidden states for the encoder, and $\\textbf{s}$ is the hidden states for the decoder. The function above is thus a type of alignment score function. 
We can use a matrix of alignment scores to show the correlation between source and target words, as the Figure to the right shows.\r\n\r\nWithin a neural network, once we have the alignment scores, we calculate the final scores using a [softmax](https:\/\/paperswithcode.com\/method\/softmax) function of these alignment scores (ensuring it sums to 1).","305":"**Pointer Networks** tackle problems where input and output data are sequential data, but can't be solved by seq2seq type models because discrete categories of output elements depend on the variable input size (and are not decided in advance).\r\n\r\nA Pointer Network learns the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. They solve the problem of variable size output dictionaries using [additive attention](https:\/\/paperswithcode.com\/method\/additive-attention). But instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, Pointer Networks use attention as a pointer to select a member of the input sequence as the output. \r\n\r\nPointer-Nets can be used to learn approximate solutions to challenging geometric problems such as finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem.","306":"**FBNet Block** is an image model block used in the [FBNet](https:\/\/paperswithcode.com\/method\/fbnet) architectures discovered through [DNAS](https:\/\/paperswithcode.com\/method\/dnas) [neural architecture search](https:\/\/paperswithcode.com\/method\/neural-architecture-search). The basic building blocks employed are [depthwise convolutions](https:\/\/paperswithcode.com\/method\/depthwise-convolution) and a [residual connection](https:\/\/paperswithcode.com\/method\/residual-connection).","307":"**FBNet** is a type of convolutional neural architectures discovered through [DNAS](https:\/\/paperswithcode.com\/method\/dnas) [neural architecture search](https:\/\/paperswithcode.com\/method\/neural-architecture-search). It utilises a basic type of image model block inspired by [MobileNetv2](https:\/\/paperswithcode.com\/method\/mobilenetv2) that utilises depthwise convolutions and an inverted residual structure (see components).","308":"**Cascade R-CNN** is an object detection architecture that seeks to address problems with degrading performance with increased IoU thresholds (due to overfitting during training and inference-time mismatch between IoUs for which detector is optimal and the inputs). It is a multi-stage extension of the [R-CNN](https:\/\/paperswithcode.com\/method\/r-cnn), where detector stages deeper into the cascade are sequentially more selective against close false positives. The cascade of R-CNN stages are trained sequentially, using the output of one stage to train the next. This is motivated by the observation that the output IoU of a regressor is almost invariably better than the input IoU. \r\n\r\nCascade R-CNN does not aim to mine hard negatives. Instead, by adjusting bounding boxes, each stage aims to find a good set of close false positives for training the next stage. When operating in this manner, a sequence of detectors adapted to increasingly higher IoUs can beat the overfitting problem, and thus be effectively trained. At inference, the same cascade procedure is applied. 
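A hedged PyTorch sketch of Additive (Bahdanau) Attention as defined above: the one-hidden-layer scoring network $v_{a}^{T}\tanh\left(W_{a}\left[h_{i};s_{j}\right]\right)$ followed by a softmax over the encoder positions; tensor shapes and the returned context vector are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """score(h_i, s_j) = v_a^T tanh(W_a [h_i ; s_j]); weights = softmax over encoder steps i."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_a = nn.Linear(enc_dim + dec_dim, attn_dim, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (T, enc_dim) encoder hidden states h_i
        # dec_state:  (dec_dim,)   current decoder hidden state s_j
        s = dec_state.unsqueeze(0).expand(enc_states.size(0), -1)
        scores = self.v_a(torch.tanh(self.W_a(torch.cat([enc_states, s], dim=-1))))
        weights = torch.softmax(scores.squeeze(-1), dim=0)    # alignment weights, sum to 1
        context = weights @ enc_states                        # weighted sum of the h_i
        return context, weights

attn = AdditiveAttention(enc_dim=32, dec_dim=32, attn_dim=16)
ctx, w = attn(torch.randn(10, 32), torch.randn(32))
```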
The progressively improved hypotheses are better matched to the increasing detector quality at each stage.","309":"A **Dense Block** is a module used in convolutional neural networks that connects *all layers* (with matching feature-map sizes) directly with each other. It was originally proposed as part of the [DenseNet](https:\/\/paperswithcode.com\/method\/densenet) architecture. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers. In contrast to [ResNets](https:\/\/paperswithcode.com\/method\/resnet), we never combine features through summation before they are passed into a layer; instead, we combine features by concatenating them. Hence, the $\\ell^{th}$ layer has $\\ell$ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all $L-\\ell$ subsequent layers. This introduces $\\frac{L(L+1)}{2}$ connections in an $L$-layer network, instead of just $L$, as in traditional architectures: \"dense connectivity\".","310":"A **DenseNet** is a type of convolutional neural network that utilises [dense connections](https:\/\/paperswithcode.com\/method\/dense-connections) between layers, through [Dense Blocks](http:\/\/www.paperswithcode.com\/method\/dense-block), where we connect *all layers* (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers.","311":"**Grid Sensitive** is a trick for object detection introduced by [YOLOv4](https:\/\/paperswithcode.com\/method\/yolov4). When we decode the coordinate of the bounding box center $x$ and $y$, in original [YOLOv3](https:\/\/paperswithcode.com\/method\/yolov3), we can get them by\r\n\r\n$$\r\n\\begin{aligned}\r\n&x=s \\cdot\\left(g\\_{x}+\\sigma\\left(p\\_{x}\\right)\\right) \\\\\r\n&y=s \\cdot\\left(g\\_{y}+\\sigma\\left(p\\_{y}\\right)\\right)\r\n\\end{aligned}\r\n$$\r\n\r\nwhere $\\sigma$ is the sigmoid function, $g\\_{x}$ and $g\\_{y}$ are integers and $s$ is a scale factor. Obviously, $x$ and $y$ cannot be exactly equal to $s \\cdot g\\_{x}$ or $s \\cdot\\left(g\\_{x}+1\\right)$. This makes it difficult to predict the centres of bounding boxes that just located on the grid boundary. We can address this problem, by changing the equation to\r\n\r\n$$\r\n\\begin{aligned}\r\n&x=s \\cdot\\left(g\\_{x}+\\alpha \\cdot \\sigma\\left(p\\_{x}\\right)-(\\alpha-1) \/ 2\\right) \\\\\r\n&y=s \\cdot\\left(g\\_{y}+\\alpha \\cdot \\sigma\\left(p\\_{y}\\right)-(\\alpha-1) \/ 2\\right)\r\n\\end{aligned}\r\n$$\r\n\r\nThis makes it easier for the model to predict bounding box center exactly located on the grid boundary. The FLOPs added by Grid Sensitive are really small, and can be totally ignored.","312":"**Bottom-up Path Augmentation** is a feature extraction technique that seeks to shorten the information path and enhance a feature pyramid with accurate localization signals existing in low-levels. This is based on the fact that high response to edges or instance parts is a strong indicator to accurately localize instances. \r\n\r\nEach building block takes a higher resolution feature map $N\\_{i}$ and a coarser map $P\\_{i+1}$ through lateral connection and generates the new feature map $N\\_{i+1}$ Each feature map $N\\_{i}$ first goes through a $3 \\times 3$ convolutional layer with stride $2$ to reduce the spatial size. 
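A small sketch of the Grid Sensitive decoding above; $\alpha = 1$ recovers the original YOLOv3 rule, while the slightly larger value used here ($\alpha = 1.05$, an illustrative choice) stretches the sigmoid so the predicted center can land exactly on a grid boundary.

```python
import numpy as np

def decode_center(p_x, p_y, g_x, g_y, s, alpha=1.05):
    """Grid Sensitive center decoding: x = s * (g_x + alpha * sigmoid(p_x) - (alpha - 1) / 2)."""
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    x = s * (g_x + alpha * sigmoid(p_x) - (alpha - 1) / 2)
    y = s * (g_y + alpha * sigmoid(p_y) - (alpha - 1) / 2)
    return x, y

# With alpha > 1 the offset ranges slightly beyond (0, 1), so centers on grid
# boundaries no longer require the sigmoid to saturate at 0 or 1.
print(decode_center(p_x=-10.0, p_y=10.0, g_x=3, g_y=3, s=32))
```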
Then each element of feature map $P\\_{i+1}$ and the down-sampled map are added through lateral connection. The fused feature map is then processed by another $3 \\times 3$ convolutional layer to generate $N\\_{i+1}$ for following sub-networks. This is an iterative process and terminates after approaching $P\\_{5}$. In these building blocks, we consistently use channel 256 of feature maps. The feature grid for each proposal is then pooled from new feature maps, i.e., {$N\\_{2}$, $N\\_{3}$, $N\\_{4}$, $N\\_{5}$}.","313":"** Spatial Pyramid Pooling (SPP)** is a pooling layer that removes the fixed-size constraint of the network, i.e. a CNN does not require a fixed-size input image. Specifically, we add an SPP layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully-connected layers (or other classifiers). In other words, we perform some information aggregation at a deeper stage of the network hierarchy (between convolutional layers and fully-connected layers) to avoid the need for cropping or warping at the beginning.","314":"**PAFPN** is a feature pyramid module used in Path Aggregation networks ([PANet](https:\/\/paperswithcode.com\/method\/panet)) that combines FPNs with [bottom-up path augmentation](https:\/\/paperswithcode.com\/method\/bottom-up-path-augmentation), which shortens the information path between lower layers and topmost feature.","315":"A **Spatial Attention Module** is a module for spatial attention in convolutional neural networks. It generates a spatial attention map by utilizing the inter-spatial relationship of features. Different from the [channel attention](https:\/\/paperswithcode.com\/method\/channel-attention-module), the spatial attention focuses on where is an informative part, which is complementary to the channel attention. To compute the spatial attention, we first apply average-pooling and max-pooling operations along the channel axis and concatenate them to generate an efficient feature descriptor. On the concatenated feature descriptor, we apply a [convolution](https:\/\/paperswithcode.com\/method\/convolution) layer to generate a spatial attention map $\\textbf{M}\\_{s}\\left(F\\right) \\in \\mathcal{R}^{H\u00d7W}$ which encodes where to emphasize or suppress. \r\n\r\nWe aggregate channel information of a feature map by using two pooling operations, generating two 2D maps: $\\mathbf{F}^{s}\\_{avg} \\in \\mathbb{R}^{1\\times{H}\\times{W}}$ and $\\mathbf{F}^{s}\\_{max} \\in \\mathbb{R}^{1\\times{H}\\times{W}}$. Each denotes average-pooled features and max-pooled features across the channel. Those are then concatenated and convolved by a standard convolution layer, producing the 2D spatial attention map. In short, the spatial attention is computed as:\r\n\r\n$$ \\textbf{M}\\_{s}\\left(F\\right) = \\sigma\\left(f^{7x7}\\left(\\left[\\text{AvgPool}\\left(F\\right);\\text{MaxPool}\\left(F\\right)\\right]\\right)\\right) $$\r\n\r\n$$ \\textbf{M}\\_{s}\\left(F\\right) = \\sigma\\left(f^{7x7}\\left(\\left[\\mathbf{F}^{s}\\_{avg};\\mathbf{F}^{s}\\_{max} \\right]\\right)\\right) $$\r\n\r\nwhere $\\sigma$ denotes the sigmoid function and $f^{7\u00d77}$ represents a convolution operation with the filter size of 7 \u00d7 7.","316":"**DropBlock** is a structured form of [dropout](https:\/\/paperswithcode.com\/method\/dropout) directed at regularizing convolutional networks. In DropBlock, units in a contiguous region of a feature map are dropped together. 
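As an illustration, here is a minimal PyTorch-style sketch of that idea (the function name, the `gamma` heuristic and the odd `block_size` assumption are ours, not from the paper): it seeds random positions, expands each seed into a square block via max pooling, zeroes those blocks out and rescales the surviving activations.

```python
import torch
import torch.nn.functional as F

def drop_block(x, block_size=5, drop_prob=0.1, training=True):
    """Drop contiguous block_size x block_size regions of a (N, C, H, W) feature map."""
    if not training or drop_prob == 0.0:
        return x
    n, c, h, w = x.shape
    # Rough heuristic for the per-position seed probability so that roughly
    # drop_prob of the activations end up inside a dropped block.
    gamma = drop_prob * h * w / (block_size ** 2) / ((h - block_size + 1) * (w - block_size + 1))
    seeds = (torch.rand(n, c, h, w, device=x.device) < gamma).float()
    # Grow every seed into a block_size x block_size block (assumes odd block_size).
    mask = 1.0 - F.max_pool2d(seeds, kernel_size=block_size, stride=1, padding=block_size // 2)
    # Rescale so the expected magnitude of the activations is preserved.
    return x * mask * mask.numel() / mask.sum().clamp(min=1.0)
```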
As DropBlock discards features in a correlated area, the networks must look elsewhere for evidence to fit the data.","317":"**CSPDarknet53** is a convolutional neural network and backbone for object detection that uses [DarkNet-53](https:\/\/paperswithcode.com\/method\/darknet-53). It employs a CSPNet strategy to partition the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network. \r\n\r\nThis CNN is used as the backbone for [YOLOv4](https:\/\/paperswithcode.com\/method\/yolov4).","318":"**YOLOv4** is a one-stage object detection model that improves on [YOLOv3](https:\/\/paperswithcode.com\/method\/yolov3) with several bags of tricks and modules introduced in the literature. The components section below details the tricks and modules used.","319":"The **Lovasz-Softmax loss** is a loss function for multiclass semantic segmentation that incorporates the [softmax](https:\/\/paperswithcode.com\/method\/softmax) operation in the Lovasz extension. The Lovasz extension is a means by which we can achieve direct optimization of the mean intersection-over-union loss in neural networks.","320":"**Inception-v3 Module** is an image block used in the [Inception-v3](https:\/\/paperswithcode.com\/method\/inception-v3) architecture. This module is used on the coarsest (8 \u00d7 8) grids to promote high dimensional representations.","321":"**Inception-v3** is a convolutional neural network architecture from the Inception family that makes several improvements including using [Label Smoothing](https:\/\/paperswithcode.com\/method\/label-smoothing), Factorized 7 x 7 convolutions, and the use of an auxiliary classifier to propagate label information lower down the network (along with the use of [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization) for layers in the side head).","322":"**Step Decay** is a learning rate schedule that drops the learning rate by a factor every few epochs, where the number of epochs is a hyperparameter.\r\n\r\nImage Credit: [Suki Lau](https:\/\/towardsdatascience.com\/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1)","323":"**MobileNet** is a type of convolutional neural network designed for mobile and embedded vision applications. It is based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks that can have low latency for mobile and embedded devices.","324":"A **Wide Residual Block** is a type of [residual block](https:\/\/paperswithcode.com\/method\/residual-block) that utilises two conv 3x3 layers (with [dropout](https:\/\/paperswithcode.com\/method\/dropout)). This is wider than other variants of residual blocks (for instance [bottleneck residual blocks](https:\/\/paperswithcode.com\/method\/bottleneck-residual-block)). It was proposed as part of the [WideResNet](https:\/\/paperswithcode.com\/method\/wideresnet) CNN architecture.","325":"A **Channel Attention Module** is a module for channel-based attention in convolutional neural networks. We produce a channel attention map by exploiting the inter-channel relationship of features. As each channel of a feature map is considered as a feature detector, channel attention focuses on \u2018what\u2019 is meaningful given an input image. To compute the channel attention efficiently, we squeeze the spatial dimension of the input feature map. 
\r\n\r\nWe first aggregate spatial information of a feature map by using both average-pooling and max-pooling operations, generating two different spatial context descriptors: $\\mathbf{F}^{c}\\_{avg}$ and $\\mathbf{F}^{c}\\_{max}$, which denote average-pooled features and max-pooled features respectively. \r\n\r\nBoth descriptors are then forwarded to a shared network to produce our channel attention map $\\mathbf{M}\\_{c} \\in \\mathbb{R}^{C\\times{1}\\times{1}}$. Here $C$ is the number of channels. The shared network is composed of multi-layer perceptron (MLP) with one hidden layer. To reduce parameter overhead, the hidden activation size is set to $\\mathbb{R}^{C\/r\u00d71\u00d71}$, where $r$ is the reduction ratio. After the shared network is applied to each descriptor, we merge the output feature vectors using element-wise summation. In short, the channel attention is computed as:\r\n\r\n$$ \\mathbf{M\\_{c}}\\left(\\mathbf{F}\\right) = \\sigma\\left(\\text{MLP}\\left(\\text{AvgPool}\\left(\\mathbf{F}\\right)\\right)+\\text{MLP}\\left(\\text{MaxPool}\\left(\\mathbf{F}\\right)\\right)\\right) $$\r\n\r\n$$ \\mathbf{M\\_{c}}\\left(\\mathbf{F}\\right) = \\sigma\\left(\\mathbf{W\\_{1}}\\left(\\mathbf{W\\_{0}}\\left(\\mathbf{F}^{c}\\_{avg}\\right)\\right) +\\mathbf{W\\_{1}}\\left(\\mathbf{W\\_{0}}\\left(\\mathbf{F}^{c}\\_{max}\\right)\\right)\\right) $$\r\n\r\nwhere $\\sigma$ denotes the sigmoid function, $\\mathbf{W}\\_{0} \\in \\mathbb{R}^{C\/r\\times{C}}$, and $\\mathbf{W}\\_{1} \\in \\mathbb{R}^{C\\times{C\/r}}$. Note that the MLP weights, $\\mathbf{W}\\_{0}$ and $\\mathbf{W}\\_{1}$, are shared for both inputs and the [ReLU](https:\/\/paperswithcode.com\/method\/relu) activation function is followed by $\\mathbf{W}\\_{0}$.\r\n\r\nNote that the channel attention module with just [average pooling](https:\/\/paperswithcode.com\/method\/average-pooling) is the same as the [Squeeze-and-Excitation Module](https:\/\/paperswithcode.com\/method\/squeeze-and-excitation-block).","326":"**Convolutional Block Attention Module (CBAM)** is an attention module for convolutional neural networks. Given an intermediate feature map, the module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement.\r\n\r\nGiven an intermediate feature map $\\mathbf{F} \\in \\mathbb{R}^{C\u00d7H\u00d7W}$ as input, CBAM sequentially infers a 1D channel attention map $\\mathbf{M}\\_{c} \\in \\mathbb{R}^{C\u00d71\u00d71}$ and a 2D spatial attention map $\\mathbf{M}\\_{s} \\in \\mathbb{R}^{1\u00d7H\u00d7W}$. The overall attention process can be summarized as:\r\n\r\n$$ \\mathbf{F}' = \\mathbf{M}\\_{c}\\left(\\mathbf{F}\\right) \\otimes \\mathbf{F} $$\r\n\r\n$$ \\mathbf{F}'' = \\mathbf{M}\\_{s}\\left(\\mathbf{F'}\\right) \\otimes \\mathbf{F'} $$\r\n\r\nDuring multiplication, the attention values are broadcasted (copied) accordingly: channel attention values are broadcasted along the spatial dimension, and vice versa. $\\mathbf{F}''$ is the final refined\r\noutput.","327":"**FFB6D** is a full flow bidirectional fusion network for 6D pose estimation of known objects from a single RGBD image. Unlike previous works that extract the RGB and point cloud features independently and fuse them in the final stage, FFB6D builds bidirectional fusion modules as communication bridges in the full flow of the two networks. 
In this way, the two networks can obtain complementary information from each other and learn representations containing rich appearance and geometry information of the scene.","328":"**Iterative Pseudo-Labeling** (IPL) is a semi-supervised algorithm for speech recognition which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data.","329":"**Random Erasing** is a data augmentation method for training convolutional neural networks (CNNs) that randomly selects a rectangular region in an image and erases its pixels with random values. In this process, training images with various levels of occlusion are generated, which reduces the risk of over-fitting and makes the model robust to occlusion. Random Erasing is free of parameter learning, easy to implement, and can be integrated with most CNN-based recognition models. Random Erasing is complementary to commonly used data augmentation techniques such as random cropping and flipping, and can be applied in various vision tasks, such as image classification, object detection, and semantic segmentation.","330":"A **Spatial Transformer** is an image model block that explicitly allows the spatial manipulation of data within a [convolutional neural network](https:\/\/paperswithcode.com\/methods\/category\/convolutional-neural-networks). It gives CNNs the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. Unlike pooling layers, where the receptive fields are fixed and local, the spatial transformer module is a dynamic mechanism that can actively spatially transform an image (or a feature map) by producing an appropriate transformation for each input sample. The transformation is then performed on the entire feature map (non-locally) and can include scaling, cropping, rotations, as well as non-rigid deformations.\r\n\r\nThe architecture is shown in the Figure to the right. The input feature map $U$ is passed to a localisation network which regresses the transformation parameters $\\theta$. The regular spatial grid $G$ over $V$ is transformed to the sampling grid $T\\_{\\theta}\\left(G\\right)$, which is applied to $U$, producing the warped output feature map $V$. The combination of the localisation network and sampling mechanism defines a spatial transformer.","331":"**Adaptive Instance Normalization** is a normalization method that aligns the mean and variance of the content features with those of the style features. \r\n\r\n[Instance Normalization](https:\/\/paperswithcode.com\/method\/instance-normalization) normalizes the input to a single style specified by the affine parameters. Adaptive Instance Normalization extends this idea. In AdaIN, we receive a content input $x$ and a style input $y$, and we simply align the channel-wise mean and variance of $x$ to match those of $y$. Unlike [Batch Normalization](https:\/\/paperswithcode.com\/method\/batch-normalization), Instance Normalization or [Conditional Instance Normalization](https:\/\/paperswithcode.com\/method\/conditional-instance-normalization), AdaIN has no learnable affine parameters. 
Instead, it adaptively computes the affine parameters from the style input:\r\n\r\n$$\r\n\\textrm{AdaIN}(x, y)= \\sigma(y)\\left(\\frac{x-\\mu(x)}{\\sigma(x)}\\right)+\\mu(y)\r\n$$","332":"**StyleGAN** is a type of generative adversarial network. It uses an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature; in particular, the use of [adaptive instance normalization](https:\/\/paperswithcode.com\/method\/adaptive-instance-normalization). Otherwise it follows Progressive [GAN](https:\/\/paperswithcode.com\/method\/gan) in using a progressively growing training regime. Another quirk is that it generates images from a fixed value tensor rather than from stochastically generated latent variables as in regular GANs. The stochastically generated latent variables are instead used as style vectors in the adaptive [instance normalization](https:\/\/paperswithcode.com\/method\/instance-normalization) at each resolution after being transformed by an 8-layer [feedforward network](https:\/\/paperswithcode.com\/method\/feedforward-network). Lastly, it employs a form of regularization called mixing regularization, which mixes two style latent variables during training.","333":"**Exponential Decay** is a learning rate schedule where we decay the learning rate with more iterations using an exponential function:\r\n\r\n$$ \\text{lr} = \\text{lr}\\_{0}\\exp\\left(-kt\\right) $$\r\n\r\nImage Credit: [Suki Lau](https:\/\/towardsdatascience.com\/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1)","334":"**Restricted Boltzmann Machines**, or **RBMs**, are two-layer generative neural networks that learn a probability distribution over the inputs. They are a special class of Boltzmann Machine in that they have a restricted number of connections between visible and hidden units. Every node in the visible layer is connected to every node in the hidden layer, but no nodes in the same group are connected. RBMs are usually trained using the contrastive divergence learning procedure.\r\n\r\nImage Source: [here](https:\/\/medium.com\/datatype\/restricted-boltzmann-machine-a-complete-analysis-part-1-introduction-model-formulation-1a4404873b3)","335":"The **Maxout Unit** is a generalization of the [ReLU](https:\/\/paperswithcode.com\/method\/relu) and the [leaky ReLU](https:\/\/paperswithcode.com\/method\/leaky-relu) functions. It is a piecewise linear function that returns the maximum of the inputs, designed to be used in conjunction with [dropout](https:\/\/paperswithcode.com\/method\/dropout). Both ReLU and leaky ReLU are special cases of Maxout. \r\n\r\n$$f\\left(x\\right) = \\max\\left(w^{T}\\_{1}x + b\\_{1}, w^{T}\\_{2}x + b\\_{2}\\right)$$\r\n\r\nThe main drawback of Maxout is that it is computationally expensive as it doubles the number of parameters for each neuron.","336":"**Contrastive Language-Image Pre-training** (**CLIP**), consisting of a simplified version of ConVIRT trained from scratch, is an efficient method of image representation learning from natural language supervision. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset\u2019s classes. \r\n\r\nFor pre-training, CLIP is trained to predict which of the $N \\times N$ possible (image, text) pairings across a batch actually occurred. 
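For the Maxout unit defined a few entries above, here is a minimal PyTorch-style sketch of a two-piece Maxout layer (module name and sizes are ours):

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Two-piece Maxout: f(x) = max(w1^T x + b1, w2^T x + b2)."""
    def __init__(self, in_features, out_features, pieces=2):
        super().__init__()
        # One affine map per piece; the parameter count grows with `pieces`.
        self.linear = nn.Linear(in_features, out_features * pieces)
        self.out_features, self.pieces = out_features, pieces

    def forward(self, x):
        z = self.linear(x)                                        # (batch, out * pieces)
        z = z.view(*x.shape[:-1], self.out_features, self.pieces)
        return z.max(dim=-1).values                               # maximum over the pieces

y = Maxout(16, 8)(torch.randn(4, 16))   # shape (4, 8)
```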
CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ real pairs in the batch while minimizing the cosine similarity of the embeddings of the $N^2 - N$ incorrect pairings. A symmetric cross entropy loss is optimized over these similarity scores. \r\n\r\nImage credit: [Learning Transferable Visual Models From Natural Language Supervision](https:\/\/arxiv.org\/pdf\/2103.00020.pdf)","337":"A **Laplacian Pyramid** is a linear invertible image representation consisting of a set of band-pass\r\nimages, spaced an octave apart, plus a low-frequency residual. Formally, let $d\\left(.\\right)$ be a downsampling operation which blurs and decimates a $j \\times j$ image $I$, so that $d\\left(I\\right)$ is a new image of size $j\/2 \\times j\/2$. Also, let $u\\left(.\\right)$ be an upsampling operator which smooths and expands $I$ to be twice the size, so $u\\left(I\\right)$ is a new image of size $2j \\times 2j$. We first build a Gaussian pyramid $G\\left(I\\right) = \\left[I\\_{0}, I\\_{1}, \\dots, I\\_{K}\\right]$, where\r\n$I\\_{0} = I$ and $I\\_{k}$ is $k$ repeated applications of $d\\left(.\\right)$ to $I$. $K$ is the number of levels in the pyramid, selected so that the final level has very small spatial extent ($\\leq 8 \\times 8$ pixels).\r\n\r\nThe coefficients $h\\_{k}$ at each level $k$ of the Laplacian pyramid $L\\left(I\\right)$ are constructed by taking the difference between adjacent levels in the Gaussian pyramid, upsampling the smaller one with $u\\left(.\\right)$ so that the sizes are compatible:\r\n\r\n$$ h\\_{k} = \\mathcal{L}\\_{k}\\left(I\\right) = G\\_{k}\\left(I\\right) \u2212 u\\left(G\\_{k+1}\\left(I\\right)\\right) = I\\_{k} \u2212 u\\left(I\\_{k+1}\\right) $$\r\n\r\nIntuitively, each level captures image structure present at a particular scale. The final level of the\r\nLaplacian pyramid $h\\_{K}$ is not a difference image, but a low-frequency residual equal to the final\r\nGaussian pyramid level, i.e. $h\\_{K} = I\\_{K}$. Reconstruction from the Laplacian pyramid coefficients\r\n$\\left[h\\_{1}, \\dots, h\\_{K}\\right]$ is performed using the backward recurrence:\r\n\r\n$$ I\\_{k} = u\\left(I\\_{k+1}\\right) + h\\_{k} $$\r\n\r\nwhich is started with $I\\_{K} = h\\_{K}$ and the reconstructed image being $I = I\\_{0}$. In other words, starting at the coarsest level, we repeatedly upsample and add the difference image $h\\_{k}$ at the next finer level until we get back to the full resolution image.\r\n\r\nSource: [LAPGAN](https:\/\/paperswithcode.com\/method\/lapgan)\r\n\r\nImage: [Design of FIR Filters for Fast Multiscale Directional Filter Banks](https:\/\/www.researchgate.net\/figure\/Relationship-between-Gaussian-and-Laplacian-Pyramids_fig2_275038450)","338":"**AccoMontage** is a model for accompaniment arrangement, a type of music generation task involving intertwined constraints of melody, harmony, texture, and music structure. AccoMontage generates piano accompaniments for folk\/pop songs based on a lead sheet (i.e. a melody with chord progression). It first retrieves phrase montages from a database while recombining them structurally using dynamic programming. Second, chords of the retrieved phrases are manipulated to match the lead sheet via style transfer. Lastly, the system offers controls over the generation process. 
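For the Laplacian pyramid construction and reconstruction described above, here is a minimal NumPy sketch; 2x2 block averaging stands in for the blur-and-decimate $d\left(.\right)$ and nearest-neighbour expansion for $u\left(.\right)$, and image sides are assumed to be powers of two:

```python
import numpy as np

def down(img):   # d(.): blur (2x2 block average) and decimate
    return img.reshape(img.shape[0] // 2, 2, img.shape[1] // 2, 2).mean(axis=(1, 3))

def up(img):     # u(.): expand to twice the size (nearest-neighbour)
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    gauss = [img]
    for _ in range(levels):
        gauss.append(down(gauss[-1]))
    # h_k = I_k - u(I_{k+1}); the final level is the low-frequency residual itself.
    coeffs = [gauss[k] - up(gauss[k + 1]) for k in range(levels)]
    coeffs.append(gauss[levels])
    return coeffs

def reconstruct(coeffs):
    img = coeffs[-1]                  # start from the residual, I_K = h_K
    for h in reversed(coeffs[:-1]):
        img = up(img) + h             # I_k = u(I_{k+1}) + h_k
    return img

I = np.random.rand(64, 64)
assert np.allclose(reconstruct(laplacian_pyramid(I, 3)), I)   # reconstruction is exact
```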
In contrast to pure deep learning approaches, AccoMontage uses a hybrid pathway, in which rule-based optimization and deep learning are both leveraged.","339":"**Asynchronous Interaction Aggregation**, or **AIA**, is a network that leverages different interactions to boost action detection. There are two key designs in it: one is the Interaction Aggregation structure (IA) adopting a uniform paradigm to model and integrate multiple types of interaction; the other is the Asynchronous Memory Update algorithm (AMU) that enables us to achieve better performance by modeling very long-term interaction dynamically.","340":"A **Res2Net Block** is an image model block that constructs hierarchical residual-like connections\r\nwithin one single [residual block](https:\/\/paperswithcode.com\/method\/residual-block). It was proposed as part of the [Res2Net](https:\/\/paperswithcode.com\/method\/res2net) CNN architecture.\r\n\r\nThe block represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The $3 \\times 3$ filters of $n$ channels is replaced with a set of smaller filter groups, each with $w$ channels. These smaller filter groups are connected in a hierarchical residual-like style to increase the number of scales that the output features can represent. Specifically, we divide input feature maps into several groups. A group of filters first extracts features from a group of input feature maps. Output features of the previous group are then sent to the next group of filters along with another group of input feature maps. \r\n\r\nThis process repeats several times until all input feature maps are processed. Finally, feature maps from all groups are concatenated and sent to another group of $1 \\times 1$ filters to fuse information altogether. Along with any possible path in which input features are transformed to output features, the equivalent receptive field increases whenever it passes a $3 \\times 3$ filter, resulting in many equivalent feature scales due to combination effects.\r\n\r\nOne way of thinking of these blocks is that they expose a new dimension, **scale**, alongside the existing dimensions of depth, width, and cardinality.","341":"**Res2Net** is an image model that employs a variation on bottleneck residual blocks. The motivation is to be able to represent features at multiple scales. This is achieved through a novel building block for CNNs that constructs hierarchical residual-like connections within one single [residual block](https:\/\/paperswithcode.com\/method\/residual-block).\r\nThis represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.","342":"Attention gate focuses on targeted regions while suppressing feature activations in irrelevant regions.\r\nGiven the input feature map $X$ and the gating signal $G\\in \\mathbb{R}^{C'\\times H\\times W}$ which is collected at a coarse scale and contains contextual information, the attention gate uses additive attention to obtain the gating coefficient. Both the input $X$ and the gating signal are first linearly mapped to an $\\mathbb{R}^{F\\times H\\times W}$ dimensional space, and then the output is squeezed in the channel domain to produce a spatial attention weight map $ S \\in \\mathbb{R}^{1\\times H\\times W}$. 
The overall process can be written as\r\n\\begin{align}\r\n S &= \\sigma(\\varphi(\\delta(\\phi_x(X)+\\phi_g(G))))\r\n\\end{align}\r\n\\begin{align}\r\n Y &= S X\r\n\\end{align}\r\nwhere $\\varphi$, $\\phi_x$ and $\\phi_g$ are linear transformations implemented as $1\\times 1$ convolutions. \r\n\r\nThe attention gate guides the model's attention to important regions while suppressing feature activation in unrelated areas. It substantially enhances the representational power of the model without a significant increase in computing cost or number of model parameters due to its lightweight design. It is general and modular, making it simple to use in various CNN models.","343":"**Stochastic Steady-state Embedding (SSE)** is an algorithm that can learn many steady-state algorithms over graphs. Different from graph neural network family models, SSE is trained stochastically, which only requires 1-hop information, but can capture fixed point relationships efficiently and effectively.\r\n\r\nDescription and Image from: [Learning Steady-States of Iterative Algorithms over Graphs](https:\/\/proceedings.mlr.press\/v80\/dai18a.html)","344":"**TopK Copy** is a cross-attention guided copy mechanism for entity extraction where only the Top-$k$ important attention heads are used for computing copy distributions. The motivation is that attention heads may not be equally important, and that some heads can be pruned with only a marginal decrease in overall performance. Attention probabilities produced by insignificant attention heads may be noisy. Thus, computing copy distributions without these heads could improve the model\u2019s ability to infer the importance of each token in the input document.","345":"Spatio-temporal feature extraction that measures stability. The proposed method is based on a compression algorithm named Run Length Encoding. The workflow of the method is presented below.","346":"**GAGNN**, or **Group-aware Graph Neural Network**, is a hierarchical model for nationwide city air quality forecasting. The model constructs a city graph and a city group graph to model the spatial and latent dependencies between cities, respectively. GAGNN introduces a differentiable grouping network to discover the latent dependencies among cities and generate city groups. Based on the generated city groups, a group correlation encoding module is introduced to learn the correlations between them, which can effectively capture the dependencies between city groups. After the graph construction, GAGNN implements a message passing mechanism to model the dependencies between cities and city groups.","347":"Extends the [PatchGAN](https:\/\/paperswithcode.com\/method\/patchgan) discriminator for the task of layout2image generation. The discriminator is comprised of two processing streams: one for the RGB image and one for its semantics, which are fused together at the later stages of the discriminator.","348":"**TransE** is an energy-based model that produces knowledge base embeddings. It models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. 
Relationships are represented as translations in the embedding space: if $\\left(h, \\mathcal{l}, t\\right)$ holds, the embedding of the tail entity $t$ should be close to the embedding of the head entity $h$ plus some vector that depends on the relationship $\\mathcal{l}$.","349":"**GENets**, or **GPU-Efficient Networks**, are a family of efficient models found through [neural architecture search](https:\/\/paperswithcode.com\/methods\/category\/neural-architecture-search). The search occurs over several types of convolutional block, which include [depth-wise convolutions](https:\/\/paperswithcode.com\/method\/depthwise-convolution), [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization), [ReLU](https:\/\/paperswithcode.com\/method\/relu), and an [inverted bottleneck](https:\/\/paperswithcode.com\/method\/inverted-residual-block) structure.","350":"**Hierarchical Feature Fusion (HFF)** is a feature fusion method employed in [ESP](https:\/\/paperswithcode.com\/method\/esp) and [EESP](https:\/\/paperswithcode.com\/method\/eesp) image model blocks for degridding. In the ESP module, concatenating the outputs of dilated convolutions gives the ESP module a large effective receptive field, but it introduces unwanted checkerboard or gridding artifacts. To address the gridding artifact in ESP, the feature maps obtained using kernels of different dilation rates are hierarchically added before concatenating them (HFF). This solution is simple and effective and does not increase the complexity of the ESP module.","351":"An **Efficient Spatial Pyramid (ESP)** is an image model block based on a factorization principle that decomposes a standard [convolution](https:\/\/paperswithcode.com\/method\/convolution) into two steps: (1) point-wise convolutions and (2) spatial pyramid of dilated convolutions. The point-wise convolutions help in reducing the computation, while the spatial pyramid of dilated convolutions re-samples the feature maps to learn the representations from large effective receptive field. This allows for increased efficiency compared to another image blocks like [ResNeXt](https:\/\/paperswithcode.com\/method\/resnext) blocks and Inception modules.","352":"**ESPNet** is a convolutional neural network for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a convolutional module, efficient spatial pyramid ([ESP](https:\/\/paperswithcode.com\/method\/esp)), which is efficient in terms of computation, memory, and power.","353":"Please enter a description about the method here","354":"The **MLP-Mixer** architecture (or \u201cMixer\u201d for short) is an image architecture that doesn't use convolutions or self-attention. Instead, Mixer\u2019s architecture is based entirely on multi-layer perceptrons (MLPs) that are repeatedly applied across either spatial locations or feature channels. Mixer relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar nonlinearities.\r\n\r\nIt accepts a sequence of linearly projected image patches (also referred to as tokens) shaped as a \u201cpatches \u00d7 channels\u201d table as an input, and maintains this dimensionality. Mixer makes use of two types of MLP layers: channel-mixing MLPs and token-mixing MLPs. The channel-mixing MLPs allow communication between different channels; they operate on each token independently and take individual rows of the table as inputs. 
The token-mixing MLPs allow communication between different spatial locations (tokens); they operate on each channel independently and take individual columns of the table as inputs. These two types of layers are interleaved to enable interaction of both input dimensions.","355":"**Blender** is a proposal-based instance mask generation module which incorporates rich instance-level information with accurate dense pixel features. A single [convolution](https:\/\/paperswithcode.com\/method\/convolution) layer is added on top of the detection towers to produce attention masks along with each bounding box prediction. For each predicted instance, the blender crops predicted bases with its bounding box and linearly combines them according to the learned attention maps.\r\n\r\nThe inputs of the blender module are bottom-level bases $\\mathbf{B}$, the selected top-level attentions $A$ and bounding box proposals $P$. First, the [RoIPool](https:\/\/paperswithcode.com\/method\/roi-pooling) of Mask R-CNN is used to crop bases with each proposal $\\mathbf{p}\\_{d}$, and the region is then resized to a fixed-size $R \\times R$ feature map $\\mathbf{r}\\_{d}$:\r\n\r\n$$\r\n\\mathbf{r}\\_{d}=\\operatorname{RoIPool}_{R \\times R}\\left(\\mathbf{B}, \\mathbf{p}\\_{d}\\right), \\quad \\forall d \\in\\{1 \\ldots D\\}\r\n$$\r\n\r\nMore specifically, a sampling ratio of 1 is used for [RoIAlign](https:\/\/paperswithcode.com\/method\/roi-align), i.e. one bin for each sampling point. During training, ground truth boxes are used as the proposals. During inference, [FCOS](https:\/\/paperswithcode.com\/method\/fcos) prediction results are used.\r\n\r\nThe attention size $M$ is smaller than $R$. We interpolate $\\mathbf{a}\\_{d}$ from $M \\times M$ to $R \\times R$ to match the shapes of $R=\\left\\{\\mathbf{r}\\_{d} \\mid d=1 \\ldots D\\right\\}$:\r\n\r\n$$\r\n\\mathbf{a}\\_{d}^{\\prime}=\\text { interpolate }\\_{M \\times M \\rightarrow R \\times R}\\left(\\mathbf{a}\\_{d}\\right), \\quad \\forall d \\in\\{1 \\ldots D\\}\r\n$$\r\n\r\nThen $\\mathbf{a}\\_{d}^{\\prime}$ is normalized with a softmax function along the $K$ dimension to make it a set of score maps $\\mathbf{s}\\_{d}$.\r\n\r\n$$\r\n\\mathbf{s}\\_{d}=\\operatorname{softmax}\\left(\\mathbf{a}\\_{d}^{\\prime}\\right), \\quad \\forall d \\in\\{1 \\ldots D\\}\r\n$$\r\n\r\nThen we apply an element-wise product between each entry $\\mathbf{r}\\_{d}, \\mathbf{s}\\_{d}$ of the regions $R$ and scores $S$, and sum along the $K$ dimension to get our mask logit $\\mathbf{m}\\_{d}:$\r\n\r\n$$\r\n\\mathbf{m}\\_{d}=\\sum\\_{k=1}^{K} \\mathbf{s}\\_{d}^{k} \\circ \\mathbf{r}\\_{d}^{k}, \\quad \\forall d \\in\\{1 \\ldots D\\}\r\n$$\r\n\r\nwhere $k$ is the index of the basis. The mask blending process with $K=4$ is visualized in the Figure.","356":"The **Legendre Memory Unit (LMU)** is mathematically derived to orthogonalize its continuous-time history, doing so by solving $d$ coupled ordinary differential equations (ODEs), whose phase space linearly maps onto sliding windows of time via the Legendre polynomials up to degree $d-1$. It is optimal for compressing temporal information.\r\n\r\nSee the paper for the full equations.\r\n\r\nOfficial github repo: [https:\/\/github.com\/abr\/lmu](https:\/\/github.com\/abr\/lmu)","357":"**RegionViT** consists of two tokenization processes that convert an image into regional (upper path) and local tokens (lower path). 
Each tokenization is a convolution with a different patch size: the patch size of regional tokens is $28^2$ while $4^2$ is used for local tokens, with dimensions projected to $C$. This means that one regional token covers $7^2$ local tokens based on spatial locality, making the window size of a local region $7^2$. At stage 1, the two sets of tokens are passed through the proposed regional-to-local transformer encoders. However, for the later stages, to balance the computational load and to obtain feature maps at different resolutions, the approach uses a downsampling process to halve the spatial resolution while doubling the channel dimension, as in a CNN, on both regional and local tokens before going to the next stage. Finally, at the end of the network, it simply averages the remaining regional tokens as the final embedding for classification, while detection uses all local tokens at each stage since they provide more fine-grained location information. With this pyramid structure, the ViT can generate multi-scale features and hence can be easily extended to more vision applications, e.g., object detection, rather than image classification only.","358":"**Hydra** is a multi-headed neural network for model distillation with a shared body network. The shared body network learns a joint feature representation that enables each head to capture the predictive behavior of each ensemble member. Existing distillation methods often train a distillation network to imitate the prediction of a larger network. Hydra instead learns to distill the individual predictions of each ensemble member into separate light-weight head models while amortizing the computation through a shared heavy-weight body network. This retains the diversity of ensemble member predictions which is otherwise lost in knowledge distillation.","359":"**RetinaNet** is a one-stage object detection model that utilizes a [focal loss](https:\/\/paperswithcode.com\/method\/focal-loss) function to address class imbalance during training. Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard negative examples. RetinaNet is a single, unified network composed of a *backbone* network and two task-specific *subnetworks*. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs convolutional object classification on the backbone's output; the second subnet performs convolutional bounding box regression. The two subnetworks feature a simple design that the authors propose specifically for one-stage, dense detection. \r\n\r\nWe can see the motivation for focal loss by comparing with two-stage object detectors. Here class imbalance is addressed by a two-stage cascade and sampling heuristics. The proposal stage (e.g., [Selective Search](https:\/\/paperswithcode.com\/method\/selective-search), [EdgeBoxes](https:\/\/paperswithcode.com\/method\/edgeboxes), [DeepMask](https:\/\/paperswithcode.com\/method\/deepmask), [RPN](https:\/\/paperswithcode.com\/method\/rpn)) rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples. 
In the second classification stage, sampling heuristics, such as a fixed foreground-to-background ratio, or online hard example mining ([OHEM](https:\/\/paperswithcode.com\/method\/ohem)), are performed to maintain a\r\nmanageable balance between foreground and background.\r\n\r\nIn contrast, a one-stage detector must process a much larger set of candidate object locations regularly sampled across an image. To tackle this, RetinaNet uses a focal loss function, a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. \r\n\r\nFormally, the Focal Loss adds a factor $(1 - p\\_{t})^\\gamma$ to the standard cross entropy criterion. Setting $\\gamma>0$ reduces the relative loss for well-classified examples ($p\\_{t}>.5$), putting more focus on hard, misclassified examples. Here there is tunable *focusing* parameter $\\gamma \\ge 0$. \r\n\r\n$$ {\\text{FL}(p\\_{t}) = - (1 - p\\_{t})^\\gamma \\log\\left(p\\_{t}\\right)} $$","360":"**AdaGrad** is a stochastic optimization method that adapts the learning rate to the parameters. It performs smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequently occurring features. In its update rule, Adagrad modifies the general learning rate $\\eta$ at each time step $t$ for every parameter $\\theta\\_{i}$ based on the past gradients for $\\theta\\_{i}$: \r\n\r\n$$ \\theta\\_{t+1, i} = \\theta\\_{t, i} - \\frac{\\eta}{\\sqrt{G\\_{t, ii} + \\epsilon}}g\\_{t, i} $$\r\n\r\nThe benefit of AdaGrad is that it eliminates the need to manually tune the learning rate; most leave it at a default value of $0.01$. Its main weakness is the accumulation of the squared gradients in the denominator. Since every added term is positive, the accumulated sum keeps growing during training, causing the learning rate to shrink and becoming infinitesimally small.\r\n\r\nImage: [Alec Radford](https:\/\/twitter.com\/alecrad)","361":"**Enhanced Sequential Inference Model** or **ESIM** is a sequential NLI model proposed in [Enhanced LSTM for Natural Language Inference](https:\/\/www.aclweb.org\/anthology\/P17-1152) paper.","362":"**Channel Shuffle** is an operation to help information flow across feature channels in convolutional neural networks. It was used as part of the [ShuffleNet](https:\/\/paperswithcode.com\/method\/shufflenet) architecture. \r\n\r\nIf we allow a group [convolution](https:\/\/paperswithcode.com\/method\/convolution) to obtain input data from different groups, the input and output channels will be fully related. Specifically, for the feature map generated from the previous group layer, we can first divide the channels in each group into several subgroups, then feed each group in the next layer with different subgroups. \r\n\r\nThe above can be efficiently and elegantly implemented by a channel shuffle operation: suppose a convolutional layer with $g$ groups whose output has $g \\times n$ channels; we first reshape the output channel dimension into $\\left(g, n\\right)$, transposing and then flattening it back as the input of next layer. 
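A minimal PyTorch-style sketch of that reshape, transpose and flatten sequence (the function name is ours):

```python
import torch

def channel_shuffle(x, groups):
    """Shuffle the channels of a (N, C, H, W) tensor across `groups` groups."""
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.view(n, groups, c // groups, h, w)   # split channels into (g, n) as above
    x = x.transpose(1, 2).contiguous()         # swap the group and per-group dimensions
    return x.view(n, c, h, w)                  # flatten back to (N, C, H, W)

x = torch.arange(2 * 12, dtype=torch.float32).view(2, 12, 1, 1)
y = channel_shuffle(x, groups=3)   # channel order becomes 0, 4, 8, 1, 5, 9, ...
```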
Channel shuffle is also differentiable, which means it can be embedded into network structures for end-to-end training.","363":"The extremely low computational cost of lightweight CNNs constrains the depth and width of the networks, further decreasing their representational power. To address the above problem, Chen et al. proposed dynamic convolution, a novel operator design that increases representational power with negligible additional computational cost and does not change the width or depth of the network in parallel with CondConv.\r\n\r\nDynamic convolution uses $K$ parallel convolution kernels of the same size and input\/output dimensions instead of one kernel per layer. Like SE blocks, it adopts a squeeze-and-excitation mechanism to generate the attention weights for the different convolution kernels. These kernels are then aggregated dynamically by weighted summation and applied to the input feature map $X$:\r\n\\begin{align}\r\n s & = \\text{softmax} (W_{2} \\delta (W_{1}\\text{GAP}(X)))\r\n\\end{align}\r\n\\begin{align}\r\n \\text{DyConv} &= \\sum_{i=1}^{K} s_k \\text{Conv}_k \r\n\\end{align}\r\n\\begin{align}\r\n Y &= \\text{DyConv}(X)\r\n\\end{align}\r\nHere the convolutions are combined by summation of weights and biases of convolutional kernels. \r\n\r\nCompared to applying convolution to the feature map, the computational cost of squeeze-and-excitation and weighted summation is extremely low. Dynamic convolution thus provides an efficient operation to improve representational power and can be easily used as a replacement for any convolution.","364":"CR-NET is a YOLO-based model proposed for license plate character detection and recognition","365":"**Colorization** is a self-supervision approach that relies on colorization as the pretext task in order to learn image representations.","366":"**LeNet** is a classic convolutional neural network employing the use of convolutions, pooling and fully connected layers. It was used for the handwritten digit recognition task with the MNIST dataset. The architectural design served as inspiration for future networks such as [AlexNet](https:\/\/paperswithcode.com\/method\/alexnet) and [VGG](https:\/\/paperswithcode.com\/method\/vgg).","367":"Perhaps the most widely used approach to synthesizing new examples is called the Synthetic Minority Oversampling Technique, or SMOTE for short. This technique was described by Nitesh Chawla, et al. in their 2002 paper named for the technique titled \u201cSMOTE: Synthetic Minority Over-sampling Technique.\u201d\r\n\r\nSMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.","368":"OASIS is a [GAN](https:\/\/paperswithcode.com\/method\/gan)-based model to translate semantic label maps into realistic-looking images. The model builds on preceding work such as [Pix2Pix](https:\/\/paperswithcode.com\/method\/pix2pix) and SPADE. OASIS introduces the following innovations: \r\n\r\n1. The method is not dependent on the perceptual loss, which is commonly used for the semantic image synthesis task. A [VGG](https:\/\/paperswithcode.com\/method\/vgg) network trained on ImageNet is routinely employed as the perceptual loss to strongly improve the synthesis quality. The authors show that this perceptual loss also has negative effects: First, it reduces the diversity of the generated images. Second, it negatively influences the color distribution to be more biased towards ImageNet. 
OASIS eliminates the dependence on the perceptual loss by changing the common discriminator design: The OASIS discriminator segments an image into one of the real classes or an additional fake class. In doing so, it makes more efficient use of the label maps that the discriminator normally receives. This distinguishes the discriminator from the commonly used encoder-shaped discriminators, which concatenate the label maps to the input image and predict a single score per image. With the more fine-grained supervision through the loss of the OASIS discriminator, the perceptual loss is shown to become unnecessary.\r\n\r\n2. A user can generate a diverse set of images per label map by simply resampling noise. This is achieved by conditioning the [spatially-adaptive denormalization](https:\/\/arxiv.org\/abs\/1903.07291) module in each layer of the GAN generator directly on spatially replicated input noise. A side effect of this conditioning is that at inference time an image can be resampled either globally or locally (either the complete image changes or a restricted region in the image).","369":"** Sigmoid Linear Units**, or **SiLUs**, are activation functions for\r\nneural networks. The activation of the SiLU is computed by the sigmoid function multiplied by its input, or $$ x\\sigma(x).$$\r\n\r\nSee [Gaussian Error Linear Units](https:\/\/arxiv.org\/abs\/1606.08415) ([GELUs](https:\/\/paperswithcode.com\/method\/gelu)) where the SiLU was originally coined, and see [Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning](https:\/\/arxiv.org\/abs\/1702.03118) and [Swish: a Self-Gated Activation Function](https:\/\/arxiv.org\/abs\/1710.05941v1) where the SiLU was experimented with later.","370":"**Temporal Activation Regularization (TAR)** is a type of slowness regularization for [RNNs](https:\/\/paperswithcode.com\/methods\/category\/recurrent-neural-networks) that penalizes differences between states that have been explored in the past. Formally we minimize:\r\n\r\n$$\\beta{L\\_{2}}\\left(h\\_{t} - h\\_{t+1}\\right)$$\r\n\r\nwhere $L\\_{2}$ is the $L\\_{2}$ norm, $h_{t}$ is the output of the RNN at timestep $t$, and $\\beta$ is a scaling coefficient.","371":"**Activation Regularization (AR)**, or $L\\_{2}$ activation regularization, is regularization performed on activations as opposed to weights. It is usually used in conjunction with [RNNs](https:\/\/paperswithcode.com\/methods\/category\/recurrent-neural-networks). It is defined as:\r\n\r\n$$\\alpha{L}\\_{2}\\left(m\\circ{h\\_{t}}\\right) $$\r\n\r\nwhere $m$ is a [dropout](https:\/\/paperswithcode.com\/method\/dropout) mask used by later parts of the model, $L\\_{2}$ is the $L\\_{2}$ norm, and $h_{t}$ is the output of an RNN at timestep $t$, and $\\alpha$ is a scaling coefficient. \r\n\r\nWhen applied to the output of a dense layer, AR penalizes activations that are substantially away from 0, encouraging activations to remain small.","372":"**Weight Tying** improves the performance of language models by tying (sharing) the weights of the embedding and [softmax](https:\/\/paperswithcode.com\/method\/softmax) layers. This method also massively reduces the total number of parameters in the language models that it is applied to. \r\n\r\nLanguage models are typically comprised of an embedding layer, followed by a number of [Transformer](https:\/\/paperswithcode.com\/method\/transformer) or [LSTM](https:\/\/paperswithcode.com\/method\/lstm) layers, which are finally followed by a softmax layer. 
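A minimal PyTorch-style sketch of the tying itself, using a toy LSTM language model (module and names are ours); the output projection simply reuses the embedding matrix, which requires the embedding and hidden sizes to match:

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.Linear(dim, vocab_size, bias=False)
        self.decoder.weight = self.embed.weight     # weight tying: one shared matrix

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))
        return self.decoder(h)                      # logits over the vocabulary

model = TiedLM(vocab_size=1000, dim=64)
logits = model(torch.randint(0, 1000, (2, 5)))      # (2, 5, 1000)
```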
Embedding layers learn word representations, such that similar words (in meaning) are represented by vectors that are near each other (in cosine distance). [Press & Wolf, 2016] showed that the softmax matrix, in which every word also has a vector representation, also exhibits this property. This leads them to propose to share the softmax and embedding matrices, which is done today in nearly all language models. \r\n\r\nThis method was independently introduced by [Press & Wolf, 2016](https:\/\/paperswithcode.com\/paper\/using-the-output-embedding-to-improve) and [Inan et al, 2016](https:\/\/paperswithcode.com\/paper\/tying-word-vectors-and-word-classifiers-a).\r\n\r\nAdditionally, the Press & Wolf paper proposes Three-way Weight Tying, a method for NMT models in which the embedding matrix for the source language, the embedding matrix for the target language, and the softmax matrix for the target language are all tied. That method has been adopted by the Attention Is All You Need model and many other neural machine translation models.","373":"**Embedding Dropout** is equivalent to performing [dropout](https:\/\/paperswithcode.com\/method\/dropout) on the embedding matrix at a word level, where the dropout is broadcast across all the word vector\u2019s embedding. The remaining non-dropped-out word embeddings are scaled by $\\frac{1}{1-p\\_{e}}$ where $p\\_{e}$ is the probability of embedding dropout. As the dropout occurs on the embedding matrix that is used for a full forward and backward pass, this means that all occurrences of a specific word will disappear within that pass, equivalent to performing [variational dropout](https:\/\/paperswithcode.com\/method\/variational-dropout) on the connection between the one-hot embedding and the embedding lookup.\r\n\r\nSource: Merity et al, Regularizing and Optimizing [LSTM](https:\/\/paperswithcode.com\/method\/lstm) Language Models","374":"**DropConnect** generalizes [Dropout](https:\/\/paperswithcode.com\/method\/dropout) by randomly dropping the weights rather than the activations with probability $1-p$. DropConnect is similar to Dropout as it introduces dynamic sparsity within the model, but differs in that the sparsity is on the weights $W$, rather than the output vectors of a layer. In other words, the fully connected layer with DropConnect becomes a sparsely connected layer in which the connections are chosen at random during the training stage. Note that this is not equivalent to setting $W$ to be a fixed sparse matrix during training.\r\n\r\nFor a DropConnect layer, the output is given as:\r\n\r\n$$ r = a \\left(\\left(M * W\\right){v}\\right)$$\r\n\r\nHere $r$ is the output of a layer, $v$ is the input to a layer, $W$ are weight parameters, and $M$ is a binary matrix encoding the connection information where $M\\_{ij} \\sim \\text{Bernoulli}\\left(p\\right)$. Each element of the mask $M$ is drawn independently for each example during training, essentially instantiating a different connectivity for each example seen. Additionally, the biases are also masked out during training.","375":"**ASGD Weight-Dropped LSTM**, or **AWD-LSTM**, is a type of recurrent neural network that employs [DropConnect](https:\/\/paperswithcode.com\/method\/dropconnect) for regularization, as well as [NT-ASGD](https:\/\/paperswithcode.com\/method\/nt-asgd) for optimization - non-monotonically triggered averaged [SGD](https:\/\/paperswithcode.com\/method\/sgd) - which returns an average of last iterations of weights. 
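A minimal PyTorch-style sketch of a DropConnect linear layer as described above (naming is ours; at evaluation time this sketch simply uses the unmasked weights, a common simplification rather than the paper's inference-time averaging):

```python
import torch
import torch.nn as nn

class DropConnectLinear(nn.Module):
    """Linear layer whose *weights* (not activations) are randomly dropped."""
    def __init__(self, in_features, out_features, p=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.p = p   # keep probability: M_ij ~ Bernoulli(p)

    def forward(self, v):                               # v: (batch, in_features)
        if not self.training:
            return v @ self.weight.t()                  # simplification at test time
        # One independent mask per example, as in the original formulation.
        mask = (torch.rand(v.shape[0], *self.weight.shape, device=v.device) < self.p).float()
        return torch.einsum('boi,bi->bo', mask * self.weight, v)

out = DropConnectLinear(16, 8, p=0.5)(torch.randn(4, 16))   # wrap in an activation a(.) as needed
```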
Additional regularization techniques employed include variable length backpropagation sequences, [variational dropout](https:\/\/paperswithcode.com\/method\/variational-dropout), [embedding dropout](https:\/\/paperswithcode.com\/method\/embedding-dropout), [weight tying](https:\/\/paperswithcode.com\/method\/weight-tying), independent embedding\/hidden size, [activation regularization](https:\/\/paperswithcode.com\/method\/activation-regularization) and [temporal activation regularization](https:\/\/paperswithcode.com\/method\/temporal-activation-regularization).","376":"**Mixture of Softmaxes** performs $K$ different softmaxes and mixes them. The motivation is that the traditional [softmax](https:\/\/paperswithcode.com\/method\/softmax) suffers from a softmax bottleneck, i.e. the expressiveness of the conditional probability we can model is constrained by the combination of a dot product and the softmax. By using a mixture of softmaxes, we can model the conditional probability more expressively.","377":"A **Highway Layer** contains an information highway to other layers that helps with information flow. It is characterised by the use of a gating unit to help this information flow. \r\n\r\nA plain feedforward neural network typically consists of $L$ layers where the $l$th layer ($l \\in ${$1, 2, \\dots, L$}) applies a nonlinear transform $H$ (parameterized by $\\mathbf{W\\_{H,l}}$) on its input $\\mathbf{x\\_{l}}$ to produce its output $\\mathbf{y\\_{l}}$. Thus, $\\mathbf{x\\_{1}}$ is the input to the network and $\\mathbf{y\\_{L}}$ is the network\u2019s output. Omitting the layer index and biases for clarity,\r\n\r\n$$ \\mathbf{y} = H\\left(\\mathbf{x},\\mathbf{W\\_{H}}\\right) $$\r\n\r\n$H$ is usually an affine transform followed by a non-linear activation function, but in general it may take other forms. \r\n\r\nFor a [highway network](https:\/\/paperswithcode.com\/method\/highway-network), we additionally define two nonlinear transforms $T\\left(\\mathbf{x},\\mathbf{W\\_{T}}\\right)$ and $C\\left(\\mathbf{x},\\mathbf{W\\_{C}}\\right)$ such that:\r\n\r\n$$ \\mathbf{y} = H\\left(\\mathbf{x},\\mathbf{W\\_{H}}\\right)\u00b7T\\left(\\mathbf{x},\\mathbf{W\\_{T}}\\right) + \\mathbf{x}\u00b7C\\left(\\mathbf{x},\\mathbf{W\\_{C}}\\right)$$\r\n\r\nWe refer to T as the transform gate and C as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. In the original paper, the authors set $C = 1 \u2212 T$, giving:\r\n\r\n$$ \\mathbf{y} = H\\left(\\mathbf{x},\\mathbf{W\\_{H}}\\right)\u00b7T\\left(\\mathbf{x},\\mathbf{W\\_{T}}\\right) + \\mathbf{x}\u00b7\\left(1-T\\left(\\mathbf{x},\\mathbf{W\\_{T}}\\right)\\right)$$\r\n\r\nThe authors set:\r\n\r\n$$ T\\left(x\\right) = \\sigma\\left(\\mathbf{W\\_{T}}^{T}\\mathbf{x} + \\mathbf{b\\_{T}}\\right) $$\r\n\r\nImage: [Sik-Ho Tsang](https:\/\/towardsdatascience.com\/review-highway-networks-gating-function-to-highway-image-classification-5a33833797b5)","378":"A **Highway Network** is an architecture designed to ease gradient-based training of very deep networks. They allow unimpeded information flow across several layers on \"information highways\". The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. 
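A minimal PyTorch-style sketch of a single highway layer with the coupled carry gate $C = 1 - T$ (names are ours):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)   # transform H(x, W_H)
        self.T = nn.Linear(dim, dim)   # transform gate T(x, W_T)
        # Start with a negative gate bias so the layer initially carries its input.
        nn.init.constant_(self.T.bias, -2.0)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return torch.relu(self.H(x)) * t + x * (1.0 - t)

y = HighwayLayer(32)(torch.randn(4, 32))   # output has the same shape as the input
```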
Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions.","379":"**Soft Actor Critic**, or **SAC**, is an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as [Q-learning methods](https:\/\/paperswithcode.com\/method\/q-learning). [SAC](https:\/\/paperswithcode.com\/method\/sac) combines off-policy updates with a stable stochastic actor-critic formulation.\r\n\r\nThe SAC objective has a number of advantages. First, the policy is incentivized to explore more widely, while giving up on clearly unpromising avenues. Second, the policy can capture multiple modes of near-optimal behavior. In problem settings where multiple actions seem equally attractive, the policy will commit equal probability mass to those actions. Lastly, the authors present evidence that it improves learning speed over state-of-the-art methods that optimize the conventional RL objective function.","380":"ORB-SLAM2 is a complete SLAM system for monocular, stereo and RGB-D cameras, including map reuse, loop closing and relocalization capabilities. The system works in real-time on standard CPUs in a wide variety of environments, from small hand-held indoor sequences, to drones flying in industrial environments and cars driving around a city.\r\n\r\nSource: [Mur-Artal and Tardos](https:\/\/arxiv.org\/pdf\/1610.06475v2.pdf)\r\n\r\nImage source: [Mur-Artal and Tardos](https:\/\/arxiv.org\/pdf\/1610.06475v2.pdf)","381":"**SortCut Sinkhorn Attention** is a variant of [Sparse Sinkhorn Attention](https:\/\/paperswithcode.com\/method\/sparse-sinkhorn-attention) where a post-sorting truncation of the input sequence is performed, essentially performing a hard top-k operation on the input sequence blocks within the computational graph. While most attention models mainly re-weight or assign near-zero weights during training, this allows the input sequence to be explicitly and dynamically truncated. Specifically:\r\n\r\n$$ Y = \\text{Softmax}\\left(Q{\\psi\\_{S}}\\left(K\\right)^{T}\\_{\\left[:n\\right]}\\right)\\psi\\_{S}\\left(V\\right)\\_{\\left[:n\\right]} $$\r\n\r\nwhere $n$ is the SortCut budget hyperparameter.","382":"**Sparse Sinkhorn Attention** is an attention mechanism that reduces the memory complexity of the [dot-product attention mechanism](https:\/\/paperswithcode.com\/method\/scaled) and is capable of learning sparse attention outputs. It is based on the idea of differentiable sorting of internal representations within the self-attention module. SSA incorporates a meta sorting network that learns to rearrange and sort input sequences. Sinkhorn normalization is used to normalize the rows and columns of the sorting matrix. The actual SSA attention mechanism then acts on the block sorted sequences.","383":"The **Sinkhorn Transformer** is a type of [transformer](https:\/\/paperswithcode.com\/method\/transformer) that uses [Sparse Sinkhorn Attention](https:\/\/paperswithcode.com\/method\/sparse-sinkhorn-attention) as a building block. 
This component is a plug-in replacement for dense fully-connected attention (as well as local attention, and sparse attention alternatives), and allows for reduced memory complexity as well as sparse attention.","384":"**Activation Normalization** is a type of normalization used for flow-based generative models; specifically it was introduced in the [GLOW](https:\/\/paperswithcode.com\/method\/glow) architecture. An ActNorm layer performs an affine transformation of the activations using a scale and bias parameter per channel, similar to [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization). These parameters are initialized such that the post-actnorm activations per-channel have zero mean and unit variance given an initial minibatch of data. This is a form of data dependent initilization. After initialization, the scale and bias are treated as regular trainable parameters that are independent of the data.","385":"The **Invertible 1x1 Convolution** is a type of [convolution](https:\/\/paperswithcode.com\/method\/convolution) used in flow-based generative models that reverses the ordering of channels. The weight matrix is initialized as a random rotation matrix. The log-determinant of an invertible 1 \u00d7 1 convolution of a $h \\times w \\times c$ tensor $h$ with $c \\times c$ weight matrix $\\mathbf{W}$ is straightforward to compute:\r\n\r\n$$ \\log | \\text{det}\\left(\\frac{d\\text{conv2D}\\left(\\mathbf{h};\\mathbf{W}\\right)}{d\\mathbf{h}}\\right) | = h \\cdot w \\cdot \\log | \\text{det}\\left(\\mathbf{W}\\right) | $$","386":"**GLOW** is a type of flow-based generative model that is based on an invertible $1 \\times 1$ [convolution](https:\/\/paperswithcode.com\/method\/convolution). This builds on the flows introduced by [NICE](https:\/\/paperswithcode.com\/method\/nice) and [RealNVP](https:\/\/paperswithcode.com\/method\/realnvp). It consists of a series of steps of flow, combined in a multi-scale architecture; see the Figure to the right. Each step of flow consists of Act Normalization followed by an *invertible $1 \\times 1$ convolution* followed by an [affine coupling](https:\/\/paperswithcode.com\/method\/affine-coupling) layer.","387":"**MATE** is a [Transformer](https:\/\/paperswithcode.com\/method\/transformer) architecture designed to model the structure of web tables. It uses sparse attention in a way that allows heads to efficiently attend to either rows or columns in a table. Each attention head reorders the tokens by either column or row index and then applies a windowed attention mechanism. Unlike traditional self-attention, Mate scales linearly in the sequence length.","388":"**PP-OCR** is an OCR system that consists of three parts, text detection, detected boxes rectification and text recognition. The purpose of text detection is to locate the text area in the image. In PP-OCR, Differentiable Binarization (DB) is used as text detector which is based on a simple segmentation network. It integrates feature extraction and sequence modeling. It adopts the Connectionist Temporal Classification (CTC) loss to avoid the inconsistency between prediction and label.","389":"\\begin{equation}\r\nDiceLoss\\left( y,\\overline{p} \\right) = 1-\\big(\\left( 2y\\overline{p}+1 \\right) \\div \\left( y+\\overline{p}+1 \\right)\\big)\r\n\\end{equation}","390":"**Corner Pooling** is a pooling technique for object detection that seeks to better localize corners by encoding explicit prior knowledge. 
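Before the corner pooling details that follow, the Dice loss written a few entries above can be sketched in a few lines; the flattening over pixels and the tensor shapes are illustrative assumptions, while the +1 smoothing term follows the equation.

```python
import torch

def dice_loss(y, p, smooth=1.0):
    """DiceLoss(y, p) = 1 - (2*y*p + smooth) / (y + p + smooth), reduced over all pixels.
    y: binary ground-truth mask, p: predicted probabilities of the same shape."""
    y = y.flatten().float()
    p = p.flatten().float()
    intersection = (y * p).sum()
    return 1.0 - (2.0 * intersection + smooth) / (y.sum() + p.sum() + smooth)

y = (torch.rand(1, 1, 64, 64) > 0.5).float()
p = torch.rand(1, 1, 64, 64)
print(dice_loss(y, p).item())
```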
Suppose we want to determine if a pixel at location $\\left(i, j\\right)$ is a top-left corner. Let $f\\_{t}$ and $f\\_{l}$ be the feature maps that are the inputs to the top-left corner pooling layer, and let $f\\_{t\\_{ij}}$ and $f\\_{l\\_{ij}}$ be the vectors at location $\\left(i, j\\right)$ in $f\\_{t}$ and $f\\_{l}$ respectively. With $H \\times W$ feature maps, the corner pooling layer first max-pools all feature vectors between $\\left(i, j\\right)$ and $\\left(i, H\\right)$ in $f\\_{t}$ to a feature vector $t\\_{ij}$ , and max-pools all feature vectors between $\\left(i, j\\right)$ and $\\left(W, j\\right)$ in $f\\_{l}$ to a feature vector $l\\_{ij}$. Finally, it adds $t\\_{ij}$ and $l\\_{ij}$ together.","391":"**CornerNet** is an object detection model that detects an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single [convolution](https:\/\/paperswithcode.com\/method\/convolution) neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. It also utilises [corner pooling](https:\/\/paperswithcode.com\/method\/corner-pooling), a new type of pooling layer than helps the network better localize corners.","392":"Non-maximum suppression is an integral part of the object detection pipeline. First, it sorts all detection boxes on the basis of their scores. The detection box $M$ with the maximum score is selected and all other detection boxes with a significant overlap (using a pre-defined threshold)\r\nwith $M$ are suppressed. This process is recursively applied on the remaining boxes. As per the design of the algorithm, if an object lies within the predefined overlap threshold, it leads to a miss. \r\n\r\n**Soft-NMS** solves this problem by decaying the detection scores of all other objects as a continuous function of their overlap with M. Hence, no object is eliminated in this process.","393":"**VQ-VAE** is a type of variational autoencoder that uses vector quantisation to obtain a discrete latent representation. It differs from [VAEs](https:\/\/paperswithcode.com\/method\/vae) in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, ideas from vector quantisation (VQ) are incorporated. Using the VQ method allows the model to circumvent issues of posterior collapse - where the latents are ignored when they are paired with a powerful autoregressive decoder - typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes.","394":"A **Graph Attention Network (GAT)** is a neural network architecture that operates on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. 
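A minimal sketch of the Soft-NMS idea described above, using a Gaussian decay of the scores instead of hard suppression; the box format, the `sigma` value and the low-score cutoff are illustrative assumptions rather than part of the description.

```python
import torch

def box_iou(a, b):
    """IoU between each box in a (M, 4) and each box in b (N, 4), boxes as (x1, y1, x2, y2)."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Soft-NMS with a Gaussian penalty: boxes overlapping the current maximum M are
    not removed; their scores are decayed by exp(-iou^2 / sigma)."""
    boxes = boxes.float()
    scores = scores.clone().float()
    remaining = torch.arange(boxes.shape[0])
    keep = []
    while remaining.numel() > 0:
        top = scores[remaining].argmax()
        m = remaining[top]
        keep.append(int(m))
        remaining = torch.cat([remaining[:top], remaining[top + 1:]])
        if remaining.numel() == 0:
            break
        ious = box_iou(boxes[m].unsqueeze(0), boxes[remaining]).squeeze(0)
        scores[remaining] *= torch.exp(-(ious ** 2) / sigma)   # continuous decay, no elimination
        remaining = remaining[scores[remaining] > score_thresh]
    return keep
```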
By stacking layers in which nodes are able to attend over their neighborhoods\u2019 features, a GAT enables (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront.\r\n\r\nSee [here](https:\/\/docs.dgl.ai\/en\/0.4.x\/tutorials\/models\/1_gnn\/9_gat.html) for an explanation by DGL.","395":"_**Independent component analysis** (ICA) is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals._\r\n\r\n_ICA defines a generative model for the observed multivariate data, which is typically given as a large database of samples. In the model, the data variables are assumed to be linear mixtures of some unknown latent variables, and the mixing system is also unknown. The latent variables are assumed nongaussian and mutually independent, and they are called the independent components of the observed data. These independent components, also called sources or factors, can be found by ICA._\r\n\r\n_ICA is superficially related to principal component analysis and factor analysis. ICA is a much more powerful technique, however, capable of finding the underlying factors or sources when these classic methods fail completely._\r\n\r\n\r\nExtracted from (https:\/\/www.cs.helsinki.fi\/u\/ahyvarin\/whatisica.shtml)\r\n\r\n**Source papers**:\r\n\r\n[Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture](https:\/\/doi.org\/10.1016\/0165-1684(91)90079-X)\r\n\r\n[Independent component analysis, A new concept?](https:\/\/doi.org\/10.1016\/0165-1684(94)90029-9)\r\n\r\n[Independent component analysis: algorithms and applications](https:\/\/doi.org\/10.1016\/S0893-6080(00)00026-5)","396":"**RealNVP** is a generative model that utilises real-valued non-volume preserving (real NVP) transformations for density estimation. The model can perform efficient and exact inference, sampling and log-density estimation of data points.","397":"**Xavier Initialization**, or **Glorot Initialization**, is an initialization scheme for neural networks. Biases are initialized be 0 and the weights $W\\_{ij}$ at each layer are initialized as:\r\n\r\n$$ W\\_{ij} \\sim U\\left[-\\frac{1}{\\sqrt{n}}, \\frac{1}{\\sqrt{n}}\\right] $$\r\n\r\nWhere $U$ is a uniform distribution and $n$ is the size of the previous layer (number of columns in $W$).","398":"A **Spatially Separable Convolution** decomposes a [convolution](https:\/\/paperswithcode.com\/method\/convolution) into two separate operations. In regular convolution, if we have a 3 x 3 kernel then we directly convolve this with the image. We can divide a 3 x 3 kernel into a 3 x 1 kernel and a 1 x 3 kernel. Then, in spatially separable convolution, we first convolve the 3 x 1 kernel then the 1 x 3 kernel. This requires 6 instead of 9 parameters compared to regular convolution, and so it is more parameter efficient (additionally less matrix multiplications are required).\r\n\r\nImage Source: [Kunlun Bai](https:\/\/towardsdatascience.com\/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215)","399":"A **SqueezeNeXt Block** is a two-stage bottleneck module used in the [SqueezeNeXt](https:\/\/paperswithcode.com\/method\/squeezenext) architecture to reduce the number of input channels to the 3 \u00d7 3 [convolution](https:\/\/paperswithcode.com\/method\/convolution). 
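The parameter saving of the spatially separable convolution described above (6 weights instead of 9 for a 3 x 3 kernel) can be checked directly with a small PyTorch snippet; the single-channel, bias-free setup is chosen only to make the count obvious.

```python
import torch
import torch.nn as nn

# A regular 3x3 convolution vs. a 3x1 convolution followed by a 1x3 convolution
# (one input and output channel, no bias, so the parameter counts are 9 vs. 3 + 3 = 6).
regular = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=(3, 1), padding=(1, 0), bias=False),
    nn.Conv2d(1, 1, kernel_size=(1, 3), padding=(0, 1), bias=False),
)

n_regular = sum(p.numel() for p in regular.parameters())      # 9
n_separable = sum(p.numel() for p in separable.parameters())  # 6
x = torch.randn(1, 1, 32, 32)
print(n_regular, n_separable, separable(x).shape)
```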
We decompose with separable convolutions to further reduce the number of parameters (orange parts), followed by a 1 \u00d7 1 expansion module.","400":"**SqueezeNeXt** is a type of convolutional neural network that uses the [SqueezeNet](https:\/\/paperswithcode.com\/method\/squeezenet) architecture as a baseline, but makes a number of changes. First, a more aggressive channel reduction is used by incorporating a two-stage squeeze module. This significantly reduces the total number of parameters used with the 3\u00d73 convolutions. Secondly, it uses separable 3 \u00d7 3 convolutions to further reduce the model size, and removes the additional 1\u00d71 branch after the squeeze module. Thirdly, the network uses an element-wise addition skip connection similar to that of the [ResNet](https:\/\/paperswithcode.com\/method\/resnet) architecture.","401":"The **Cross-Attention** module is an attention module used in [CrossViT](https:\/\/paperswithcode.com\/method\/crossvit) for fusion of multi-scale features. The CLS token of the large branch (circle) serves as a query token to interact with the patch tokens from the small branch through attention. $f\\left(\u00b7\\right)$ and $g\\left(\u00b7\\right)$ are projections to align dimensions. The small branch follows the same procedure but swaps CLS and patch tokens from another branch.","402":"**REINFORCE** is a Monte Carlo variant of a policy gradient algorithm in reinforcement learning. The agent collects the samples of an episode using its current policy, and uses them to update the policy parameter $\\theta$. Since one full trajectory must be completed before an update is made, REINFORCE is a Monte Carlo method; and because the trajectories are generated by the current policy itself, it is an on-policy algorithm.\r\n\r\n$$ \\nabla\\_{\\theta}J\\left(\\theta\\right) = \\mathbb{E}\\_{\\pi}\\left[G\\_{t}\\nabla\\_{\\theta}\\ln\\pi\\_{\\theta}\\left(A\\_{t}\\mid{S\\_{t}}\\right)\\right]$$\r\n\r\nImage Credit: [Tingwu Wang](http:\/\/www.cs.toronto.edu\/~tingwuwang\/REINFORCE.pdf)","403":"**Cutout** is an image augmentation and regularization technique that randomly masks out square regions of the input during training, and can be used to improve the robustness and overall performance of convolutional neural networks. The main motivation for cutout comes from the problem of object occlusion, which is commonly encountered in many computer vision tasks, such as object recognition, tracking, or human pose estimation. By generating new images which simulate occluded examples, we not only better prepare the model for encounters with occlusions in the real world, but the model also learns to take more of the image context into consideration when making decisions.","404":"**Shake-Shake Regularization** aims to improve the generalization ability of multi-branch networks by replacing the standard summation of parallel branches with a stochastic affine combination. 
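A minimal sketch of the Cutout augmentation described above: a randomly centred square region of the input is zeroed out during training. The patch size, the zero fill value and the clipping at image borders are common implementation choices, not requirements (the Shake-Shake entry continues below).

```python
import torch

def cutout(img, size=16):
    """Zero out a randomly centred square patch of an image tensor (C, H, W).
    The patch is clipped at the image borders."""
    _, h, w = img.shape
    cy = torch.randint(h, (1,)).item()
    cx = torch.randint(w, (1,)).item()
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = img.clone()
    out[:, y1:y2, x1:x2] = 0.0
    return out

img = torch.rand(3, 32, 32)
print(cutout(img).shape)  # torch.Size([3, 32, 32])
```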
A typical pre-activation [ResNet](https:\/\/paperswithcode.com\/method\/resnet) with 2 residual branches would follow this equation:\r\n\r\n$$x\\_{i+1} = x\\_{i} + \\mathcal{F}\\left(x\\_{i}, \\mathcal{W}\\_{i}^{\\left(1\\right)}\\right) + \\mathcal{F}\\left(x\\_{i}, \\mathcal{W}\\_{i}^{\\left(2\\right)}\\right) $$\r\n\r\nShake-shake regularization introduces a random variable $\\alpha\\_{i}$ following a uniform distribution between 0 and 1 during training:\r\n\r\n$$x\\_{i+1} = x\\_{i} + \\alpha\\mathcal{F}\\left(x\\_{i}, \\mathcal{W}\\_{i}^{\\left(1\\right)}\\right) + \\left(1-\\alpha\\right)\\mathcal{F}\\left(x\\_{i}, \\mathcal{W}\\_{i}^{\\left(2\\right)}\\right) $$\r\n\r\nFollowing the same logic as for [dropout](https:\/\/paperswithcode.com\/method\/dropout), all $\\alpha\\_{i}$ are set to the expected value of $0.5$ at test time.","405":"The Contour Proposal Network (CPN) detects possibly overlapping objects in an image while simultaneously fitting pixel-precise closed object contours. The CPN can incorporate state of the art object detection architectures as backbone networks into a fast single-stage instance segmentation model that can be trained end-to-end.","406":"**CoOp**, or **Context Optimization**, is an automated prompt engineering method that avoids manual prompt tuning by modeling context words with continuous vectors that are end-to-end learned from data. The context could be shared among all classes or designed to be class-specific. During training, we simply minimize the prediction error using the cross-entropy loss with respect to the learnable context vectors, while keeping the pre-trained parameters fixed. The gradients can be back-propagated all the way through the text encoder, distilling the rich knowledge encoded in the parameters for learning task-relevant context.","407":"The **Griffin-Lim Algorithm (GLA)** is a phase reconstruction method based on the redundancy of the short-time Fourier transform. It promotes the consistency of a spectrogram by iterating two projections, where a spectrogram is said to be consistent when its inter-bin dependency owing to the redundancy of STFT is retained. GLA is based only on the consistency and does not take any prior knowledge about the target signal into account. \r\n\r\nThis algorithm expects to recover a complex-valued spectrogram, which is consistent and maintains the given amplitude $\\mathbf{A}$, by the following alternative projection procedure:\r\n\r\n$$ \\mathbf{X}^{[m+1]} = P\\_{\\mathcal{C}}\\left(P\\_{\\mathcal{A}}\\left(\\mathbf{X}^{[m]}\\right)\\right) $$\r\n\r\nwhere $\\mathbf{X}$ is a complex-valued spectrogram updated through the iteration, $P\\_{\\mathcal{S}}$ is the metric projection onto a set $\\mathcal{S}$, and $m$ is the iteration index. Here, $\\mathcal{C}$ is the set of consistent spectrograms, and $\\mathcal{A}$ is the set of spectrograms whose amplitude is the same as the given one. The metric projections onto these sets $\\mathcal{C}$ and $\\mathcal{A}$ are given by:\r\n\r\n$$ P\\_{\\mathcal{C}}(\\mathbf{X}) = \\mathcal{GG}^{\u2020}\\mathbf{X} $$\r\n$$ P\\_{\\mathcal{A}}(\\mathbf{X}) = \\mathbf{A} \\odot \\mathbf{X} \\oslash |\\mathbf{X}| $$\r\n\r\n\r\nwhere $\\mathcal{G}$ represents STFT, $\\mathcal{G}^{\u2020}$ is the pseudo inverse of STFT (iSTFT), $\\odot$ and $\\oslash$ are element-wise multiplication and division, respectively, and division by zero is replaced by zero. 
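A minimal sketch of the two alternating projections just defined (the optimization view of GLA continues below); it uses `torch.stft` and `torch.istft` as the STFT and its pseudo-inverse, and the FFT size, hop length, window and iteration count are illustrative assumptions.

```python
import math
import torch

def griffin_lim(A, n_fft=1024, hop=256, n_iter=60):
    """Alternate P_C (project onto consistent spectrograms via iSTFT followed by STFT)
    and P_A (restore the given magnitude A), starting from a random phase.
    A: magnitude spectrogram of shape (n_fft // 2 + 1, frames)."""
    window = torch.hann_window(n_fft)
    phase = 2 * math.pi * torch.rand_like(A)
    X = A * torch.exp(1j * phase)
    for _ in range(n_iter):
        x = torch.istft(X, n_fft, hop_length=hop, window=window)     # pseudo-inverse STFT
        X = torch.stft(x, n_fft, hop_length=hop, window=window,
                       return_complex=True)                          # P_C: consistency projection
        X = A * X / X.abs().clamp_min(1e-8)                          # P_A: keep phase, impose A
    return torch.istft(X, n_fft, hop_length=hop, window=window)
```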
GLA is obtained as an algorithm for the following optimization problem:\r\n\r\n$$ \\min\\_{\\mathbf{X}} || \\mathbf{X} - P\\_{\\mathcal{C}}\\left(\\mathbf{X}\\right) ||^{2}\\_{\\text{Fro}} \\text{ s.t. } \\mathbf{X} \\in \\mathcal{A} $$\r\n\r\nwhere $ || \u00b7 ||\\_{\\text{Fro}}$ is the Frobenius norm. This equation minimizes the energy of the inconsistent components under the constraint on amplitude which must be equal to the given one. Although GLA has been widely utilized because of its simplicity, GLA often involves many iterations until it converges to a certain spectrogram and results in low reconstruction quality. This is because the cost function only requires the consistency, and the characteristics of the target signal are not taken into account.","408":"A **Residual GRU** is a [gated recurrent unit (GRU)](https:\/\/paperswithcode.com\/method\/gru) that incorporates the idea of residual connections from [ResNets](https:\/\/paperswithcode.com\/method\/resnet).","409":"**CBHG** is a building block used in the [Tacotron](https:\/\/paperswithcode.com\/method\/tacotron) text-to-speech model. It consists of a bank of 1-D convolutional filters, followed by highway networks and a bidirectional gated recurrent unit ([BiGRU](https:\/\/paperswithcode.com\/method\/bigru)). \r\n\r\nThe module is used to extract representations from sequences. The input sequence is first\r\nconvolved with $K$ sets of 1-D convolutional filters, where the $k$-th set contains $C\\_{k}$ filters of width $k$ (i.e. $k = 1, 2, \\dots , K$). These filters explicitly model local and contextual information (akin to modeling unigrams, bigrams, up to K-grams). The [convolution](https:\/\/paperswithcode.com\/method\/convolution) outputs are stacked together and further max pooled along time to increase local invariances. A stride of 1 is used to preserve the original time resolution. The processed sequence is further passed to a few fixed-width 1-D convolutions, whose outputs are added with the original input sequence via residual connections. [Batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization) is used for all convolutional layers. The convolution outputs are fed into a multi-layer [highway network](https:\/\/paperswithcode.com\/method\/highway-network) to extract high-level features. Finally, a bidirectional [GRU](https:\/\/paperswithcode.com\/method\/gru) RNN is stacked on top to extract sequential features from both forward and backward context.","410":"**Tacotron** is an end-to-end generative text-to-speech model that takes a character sequence as input and outputs the corresponding spectrogram. The backbone of Tacotron is a seq2seq model with attention. The Figure depicts the model, which includes an encoder, an attention-based decoder, and a post-processing net. At a high-level, the model takes characters as input and produces spectrogram\r\nframes, which are then converted to waveforms.","411":"The **Self-Organizing Map (SOM)**, commonly also known as Kohonen network (Kohonen 1982, Kohonen 2001) is a computational method for the visualization and analysis of high-dimensional data, especially experimentally acquired information.\r\n\r\nExtracted from [scholarpedia](http:\/\/www.scholarpedia.org\/article\/Self-organizing_map)\r\n\r\n**Sources**:\r\n\r\nImage: [scholarpedia](http:\/\/www.scholarpedia.org\/article\/File:Somnbc.png)\r\n\r\nPaper: [Kohonen, T. Self-organized formation of topologically correct feature maps. Biol. Cybern. 
43, 59\u201369 (1982)](https:\/\/doi.org\/10.1007\/BF00337288)\r\n\r\nBook: [Self-Organizing Maps](https:\/\/doi.org\/10.1007\/978-3-642-56927-2)","412":"**XLNet** is an autoregressive [Transformer](https:\/\/paperswithcode.com\/method\/transformer) that leverages the best of both autoregressive language modeling and autoencoding while attempting to avoid their limitations. Instead of using a fixed forward or backward factorization order as in conventional autoregressive models, XLNet maximizes the expected log likelihood of a sequence w.r.t. all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both left and right. In expectation, each position learns to utilize contextual information from all positions, i.e., capturing bidirectional context.\r\n\r\nAdditionally, inspired by the latest advancements in autogressive language modeling, XLNet integrates the segment recurrence mechanism and relative encoding scheme of [Transformer-XL](https:\/\/paperswithcode.com\/method\/transformer-xl) into pretraining, which empirically improves the performance especially for tasks involving a longer text sequence.","413":"A **Connectionist Temporal Classification Loss**, or **CTC Loss**, is designed for tasks where we need alignment between sequences, but where that alignment is difficult - e.g. aligning each character to its location in an audio file. It calculates a loss between a continuous (unsegmented) time series and a target sequence. It does this by summing over the probability of possible alignments of input to target, producing a loss value which is differentiable with respect to each input node. The alignment of input to target is assumed to be \u201cmany-to-one\u201d, which limits the length of the target sequence such that it must be $\\leq$ the input length.","414":"A **Dual Path Network** block is an image model block used in convolutional neural network. The idea of this module is to enable sharing of common features while maintaining the flexibility to explore new features through dual path architectures. In this sense it combines the benefits of [ResNets](https:\/\/paperswithcode.com\/method\/resnet) and [DenseNets](https:\/\/paperswithcode.com\/method\/densenet). It was proposed as part of the [DPN](https:\/\/paperswithcode.com\/method\/dpn) CNN architecture.\r\n\r\nWe formulate such a dual path architecture as follows:\r\n\r\n$$x^{k} = \\sum\\limits\\_{t=1}^{k-1} f\\_t^{k}(h^t) \\text{,} $$\r\n\r\n$$\r\ny^{k} = \\sum\\limits\\_{t=1}^{k-1} v\\_t(h^t) = y^{k-1} + \\phi^{k-1}(y^{k-1}) \\text{,} \\\\\\\\\r\n$$\r\n\r\n$$\r\nr^{k} = x^{k} + y^{k} \\text{,} \\\\\\\\\r\n$$\r\n\r\n$$\r\nh^k = g^k \\left( r^{k} \\right) \\text{,}\r\n$$\r\n\r\nwhere $x^{k}$ and $y^{k}$ denote the extracted information at $k$-th step from individual path, $v_t(\\cdot)$ is a feature learning function as $f_t^k(\\cdot)$. The first equation refers to the densely connected path that enables exploring new features. The second equation refers to the residual path that enables common features re-usage. The third equation defines the dual path that integrates them and feeds them to the last transformation function in the last equation.","415":"A **Dual Path Network (DPN)** is a convolutional neural network which presents a new topology of connection paths internally. 
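Returning to the CTC loss described above, the sketch below shows how such a loss is typically invoked on a batch of unsegmented sequences, here with PyTorch's `torch.nn.CTCLoss`; the sizes, the blank index of 0 and the random inputs are purely illustrative (the Dual Path Network formulation continues below).

```python
import torch
import torch.nn as nn

# Toy CTC example: T=50 input steps, batch N=2, C=20 classes, index 0 reserved for blank.
T, N, C = 50, 2, 20
logits = torch.randn(T, N, C, requires_grad=True)        # stand-in for network outputs
targets = torch.randint(1, C, (N, 10))                    # target label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)   # must be <= the input lengths

ctc = nn.CTCLoss(blank=0)
loss = ctc(logits.log_softmax(dim=-1), targets, input_lengths, target_lengths)
loss.backward()   # the loss is differentiable with respect to every input step
print(loss.item())
```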
The intuition is that [ResNets](https:\/\/paperswithcode.com\/method\/resnet) enables feature re-usage while [DenseNet](https:\/\/paperswithcode.com\/method\/densenet) enables new feature exploration, and both are important for learning good representations. To enjoy the benefits from both path topologies, Dual Path Networks share common features while maintaining the flexibility to explore new features through dual path architectures. \r\n\r\nWe formulate such a dual path architecture as follows:\r\n\r\n$$x^{k} = \\sum\\limits\\_{t=1}^{k-1} f\\_t^{k}(h^t) \\text{,} $$\r\n\r\n$$\r\ny^{k} = \\sum\\limits\\_{t=1}^{k-1} v\\_t(h^t) = y^{k-1} + \\phi^{k-1}(y^{k-1}) \\text{,} \\\\\\\\\r\n$$\r\n\r\n$$\r\nr^{k} = x^{k} + y^{k} \\text{,} \\\\\\\\\r\n$$\r\n\r\n$$\r\nh^k = g^k \\left( r^{k} \\right) \\text{,}\r\n$$\r\n\r\nwhere $x^{k}$ and $y^{k}$ denote the extracted information at $k$-th step from individual path, $v_t(\\cdot)$ is a feature learning function as $f_t^k(\\cdot)$. The first equation refers to the densely connected path that enables exploring new features. The second equation refers to the residual path that enables common features re-usage. The third equation defines the dual path that integrates them and feeds them to the last transformation function in the last equation.","416":"Exit whenever the model is confident enough allowing early exiting from hidden layers","417":"Spatial pooling usually operates on a small region which limits its capability to capture long-range dependencies and focus on distant regions. To overcome this, Hou et al. proposed strip pooling, a novel pooling method capable of encoding long-range context in either horizontal or vertical spatial domains. \r\n\r\nStrip pooling has two branches for horizontal and vertical strip pooling. The horizontal strip pooling part first pools the input feature $F \\in \\mathcal{R}^{C \\times H \\times W}$ in the horizontal direction:\r\n\\begin{align}\r\ny^1 = \\text{GAP}^w (X) \r\n\\end{align}\r\nThen a 1D convolution with kernel size 3 is applied in $y$ to capture the relationship between different rows and channels. This is repeated $W$ times to make the output $y_v$ consistent with the input shape:\r\n\\begin{align}\r\n y_h = \\text{Expand}(\\text{Conv1D}(y^1))\r\n\\end{align}\r\nVertical strip pooling is performed in a similar way. Finally, the outputs of the two branches are fused using element-wise summation to produce the attention map:\r\n\\begin{align}\r\ns &= \\sigma(Conv^{1\\times 1}(y_{v} + y_{h}))\r\n\\end{align}\r\n\\begin{align}\r\nY &= s X\r\n\\end{align}\r\n\r\nThe strip pooling module (SPM) is further developed in the mixed pooling module (MPM). Both consider spatial and channel relationships to overcome the locality of convolutional neural networks. SPNet achieves state-of-the-art results for several complex semantic segmentation benchmarks.","418":"**Strip Pooling** is a pooling strategy for scene parsing which considers a long but narrow kernel, i.e., $1\\times{N}$ or $N\\times{1}$. As an alternative to global pooling, strip pooling offers two advantages. First, it deploys a long kernel shape along one spatial dimension and hence enables capturing long-range relations of isolated regions. Second, it keeps a narrow kernel shape along the other spatial dimension, which facilitates capturing local context and prevents irrelevant regions from interfering the label prediction. 
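A minimal sketch of the strip-pooling attention equations above (horizontal and vertical strip pooling, a kernel-size-3 1D convolution per branch, expansion back to the input shape, and a sigmoid-gated fusion); the exact module layout and channel handling are our own illustrative choices (the Strip Pooling entry continues below).

```python
import torch
import torch.nn as nn

class StripPoolingAttention(nn.Module):
    """Sketch of the strip pooling module: y1 = GAP_w(X), y_h = Expand(Conv1D(y1)),
    likewise for the vertical branch, then s = sigmoid(Conv1x1(y_h + y_v)) and Y = s * X."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv_v = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                        # x: (N, C, H, W)
        n, c, h, w = x.shape
        y1 = x.mean(dim=3)                       # horizontal strip pooling -> (N, C, H)
        y_h = self.conv_h(y1).unsqueeze(3).expand(n, c, h, w)
        y2 = x.mean(dim=2)                       # vertical strip pooling   -> (N, C, W)
        y_v = self.conv_v(y2).unsqueeze(2).expand(n, c, h, w)
        s = torch.sigmoid(self.fuse(y_h + y_v))  # attention map
        return s * x

x = torch.randn(2, 64, 32, 48)
print(StripPoolingAttention(64)(x).shape)
```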
Integrating such long but narrow pooling kernels enables the scene parsing networks to simultaneously aggregate both global and local context. This is essentially different from the traditional spatial pooling which collects context from a fixed square region.","419":"Please enter a description about the method here","420":"**MixConv**, or **Mixed Depthwise Convolution**, is a type of [depthwise convolution](https:\/\/paperswithcode.com\/method\/depthwise-convolution) that naturally mixes up multiple kernel sizes in a single [convolution](https:\/\/paperswithcode.com\/method\/convolution). It is based on the insight that depthwise convolution applies a single kernel size to all channels, which MixConv overcomes by combining the benefits of multiple kernel sizes. It does this by partitioning channels into groups and applying a different kernel size to each group.","421":"**MixNet** is a type of convolutional neural network discovered via AutoML that utilises MixConvs instead of regular depthwise convolutions.","422":"**CaiT**, or **Class-Attention in Image Transformers**, is a type of [vision transformer](https:\/\/paperswithcode.com\/methods\/category\/vision-transformer) with several design alterations upon the original [ViT](https:\/\/paperswithcode.com\/method\/vision-transformer). First a new layer scaling approach called [LayerScale](https:\/\/paperswithcode.com\/method\/layerscale) is used, adding a learnable diagonal matrix on output of each residual block, initialized close to (but not at) 0, which improves the training dynamics. Secondly, [class-attention layers](https:\/\/paperswithcode.com\/method\/ca) are introduced to the architecture. This creates an architecture where the transformer layers involving [self-attention](https:\/\/paperswithcode.com\/method\/scaled) between patches are explicitly separated from class-attention layers -- that are devoted to extract the content of the processed patches into a single vector so that it can be fed to a linear classifier.","423":"A **Class Attention** layer, or **CA Layer**, is an [attention mechanism](https:\/\/paperswithcode.com\/methods\/category\/attention-mechanisms-1) for [vision transformers](https:\/\/paperswithcode.com\/methods\/category\/vision-transformer) used in [CaiT](https:\/\/paperswithcode.com\/method\/cait) that aims to extract information from a set of processed patches. It is identical to a [self-attention layer](https:\/\/paperswithcode.com\/method\/scaled), except that it relies on the attention between (i) the class embedding $x_{\\text {class }}$ (initialized at CLS in the first CA) and (ii) itself plus the set of frozen patch embeddings $x_{\\text {patches }} .$ \r\n\r\nConsidering a network with $h$ heads and $p$ patches, and denoting by $d$ the embedding size, the multi-head class-attention is parameterized with several projection matrices, $W_{q}, W_{k}, W_{v}, W_{o} \\in \\mathbf{R}^{d \\times d}$, and the corresponding biases $b_{q}, b_{k}, b_{v}, b_{o} \\in \\mathbf{R}^{d} .$ With this notation, the computation of the CA residual block proceeds as follows. We first augment the patch embeddings (in matrix form) as $z=\\left[x_{\\text {class }}, x_{\\text {patches }}\\right]$. We then perform the projections:\r\n\r\n$$Q=W\\_{q} x\\_{\\text {class }}+b\\_{q}$$\r\n\r\n$$K=W\\_{k} z+b\\_{k}$$\r\n\r\n$$V=W\\_{v} z+b\\_{v}$$\r\n\r\nThe class-attention weights are given by\r\n\r\n$$\r\nA=\\operatorname{Softmax}\\left(Q . K^{T} \/ \\sqrt{d \/ h}\\right)\r\n$$\r\n\r\nwhere $Q . 
K^{T} \\in \\mathbf{R}^{h \\times 1 \\times p}$. This attention is involved in the weighted sum $A \\times V$ to produce the residual output vector\r\n\r\n$$\r\n\\operatorname{out}\\_{\\mathrm{CA}}=W\\_{o} A V+b\\_{o}\r\n$$\r\n\r\nwhich is in turn added to $x\\_{\\text {class }}$ for subsequent processing.","424":"**LayerScale** is a method used for [vision transformer](https:\/\/paperswithcode.com\/methods\/category\/vision-transformer) architectures to help improve training dynamics. It adds a learnable diagonal matrix on output of each residual block, initialized close to (but not at) 0. Adding this simple layer after each residual block improves the training dynamic, allowing for the training of deeper high-capacity image transformers that benefit from depth.\r\n\r\nSpecifically, LayerScale is a per-channel multiplication of the vector produced by each residual block, as opposed to a single scalar, see Figure (d). The objective is to group the updates of the weights associated with the same output channel. Formally, LayerScale is a multiplication by a diagonal matrix on output of each residual block. In other words:\r\n\r\n$$\r\nx\\_{l}^{\\prime} =x\\_{l}+\\operatorname{diag}\\left(\\lambda\\_{l, 1}, \\ldots, \\lambda\\_{l, d}\\right) \\times \\operatorname{SA}\\left(\\eta\\left(x\\_{l}\\right)\\right) \r\n$$\r\n\r\n$$\r\nx\\_{l+1} =x\\_{l}^{\\prime}+\\operatorname{diag}\\left(\\lambda\\_{l, 1}^{\\prime}, \\ldots, \\lambda\\_{l, d}^{\\prime}\\right) \\times \\operatorname{FFN}\\left(\\eta\\left(x\\_{l}^{\\prime}\\right)\\right)\r\n$$\r\n\r\nwhere the parameters $\\lambda\\_{l, i}$ and $\\lambda\\_{l, i}^{\\prime}$ are learnable weights. The diagonal values are all initialized to a fixed small value $\\varepsilon:$ we set it to $\\varepsilon=0.1$ until depth 18 , $\\varepsilon=10^{-5}$ for depth 24 and $\\varepsilon=10^{-6}$ for deeper networks. \r\n\r\nThis formula is akin to other [normalization](https:\/\/paperswithcode.com\/methods\/category\/normalization) strategies [ActNorm](https:\/\/paperswithcode.com\/method\/activation-normalization) or [LayerNorm](https:\/\/paperswithcode.com\/method\/layer-normalization) but executed on output of the residual block. Yet LayerScale seeks a different effect: [ActNorm](https:\/\/paperswithcode.com\/method\/activation-normalization) is a data-dependent initialization that calibrates activations so that they have zero-mean and unit variance, like [BatchNorm](https:\/\/paperswithcode.com\/method\/batch-normalization). In contrast, in LayerScale, we initialize the diagonal with small values so that the initial contribution of the residual branches to the function implemented by the transformer is small. In that respect the motivation is therefore closer to that of [ReZero](https:\/\/paperswithcode.com\/method\/rezero), [SkipInit](https:\/\/paperswithcode.com\/method\/skipinit), [Fixup](https:\/\/paperswithcode.com\/method\/fixup-initialization) and [T-Fixup](https:\/\/paperswithcode.com\/method\/t-fixup): to train closer to the identity function and let the network integrate the additional parameters progressively during the training. 
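The per-channel scaling that LayerScale adds to each residual branch is small enough to sketch directly; the module below is an illustrative reading of the learnable diagonal factor above, with the small initial value exposed as a parameter (the LayerScale discussion continues below).

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Learnable per-channel scaling of a residual branch, initialized to a small value."""
    def __init__(self, dim, init_value=1e-1):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, branch_output):        # branch_output: (..., dim)
        return self.gamma * branch_output    # equivalent to multiplying by diag(gamma)

# usage inside a residual block (self_attention and norm are assumed submodules):
#   x = x + layer_scale(self_attention(norm(x)))
```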
LayerScale offers more diversity in the optimization than just adjusting the whole layer by a single learnable scalar as in [ReZero](https:\/\/paperswithcode.com\/method\/rezero)\/[SkipInit](https:\/\/paperswithcode.com\/method\/skipinit), [Fixup](https:\/\/paperswithcode.com\/method\/fixup-initialization) and [T-Fixup](https:\/\/paperswithcode.com\/method\/t-fixup).","425":"A **Data-Efficient Image Transformer** is a type of [Vision Transformer](https:\/\/paperswithcode.com\/method\/vision-transformer) for image classification tasks. The model is trained using a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention.","426":"**Context Enhancement Module (CEM)** is a feature extraction module used in object detection (specifically, [ThunderNet](https:\/\/paperswithcode.com\/method\/thundernet)) which aims to to enlarge the receptive field. The key idea of CEM is to aggregate multi-scale local context information and global context information to generate more discriminative features. In CEM, the feature maps from three scales are merged: $C\\_{4}$, $C\\_{5}$ and $C\\_{glb}$. $C\\_{glb}$ is the global context feature vector by applying a [global average pooling](https:\/\/paperswithcode.com\/method\/global-average-pooling) on $C\\_{5}$. We then apply a 1 \u00d7 1 [convolution](https:\/\/paperswithcode.com\/method\/convolution) on each feature map to squeeze the number of channels to $\\alpha \\times p \\times p = 245$.\r\n\r\nAfterwards, $C\\_{5}$ is upsampled by 2\u00d7 and $C\\_{glb}$ is broadcast so that the spatial dimensions of the three feature maps are\r\nequal. At last, the three generated feature maps are aggregated. By leveraging both local and global context, CEM effectively enlarges the receptive field and refines the representation ability of the thin feature map. Compared with prior [FPN](https:\/\/paperswithcode.com\/method\/fpn) structures, CEM involves only two 1\u00d71 convolutions and a fc layer.","427":"**ShuffleNet V2 Block** is an image model block used in the [ShuffleNet V2](https:\/\/paperswithcode.com\/method\/shufflenet-v2) architecture, where speed is the metric optimized for (instead of indirect ones like FLOPs). It utilizes a simple operator called channel split. At the beginning of each unit, the input of $c$ feature channels are split into two branches with $c - c'$ and $c'$ channels, respectively. Following **G3**, one branch remains as identity. The other branch consists of three convolutions with the same input and output channels to satisfy **G1**. The two $1\\times1$ convolutions are no longer group-wise, unlike the original [ShuffleNet](https:\/\/paperswithcode.com\/method\/shufflenet). This is partially to follow **G2**, and partially because the split operation already produces two groups. After [convolution](https:\/\/paperswithcode.com\/method\/convolution), the two branches are concatenated. So, the number of channels keeps the same (G1). The same \u201c[channel shuffle](https:\/\/paperswithcode.com\/method\/channel-shuffle)\u201d operation as in ShuffleNet is then used to enable information communication between the two branches.\r\n\r\nThe motivation behind channel split is that alternative architectures, where pointwise group convolutions and bottleneck structures are used, lead to increased memory access cost. 
Additionally more network fragmentation with group convolutions reduces parallelism (less friendly for GPU), and the element-wise addition operation, while they have low FLOPs, have high memory access cost. Channel split is an alternative where we can maintain a large number of equally wide channels (equally wide minimizes memory access cost) without having dense convolutions or too many groups.","428":"**Spatial Attention Module (SAM)** is a feature extraction module for object detection used in [ThunderNet](https:\/\/paperswithcode.com\/method\/thundernet).\r\n\r\nThe ThunderNet SAM explicitly re-weights the feature map before RoI warping over the spatial dimensions. The key idea of SAM is to use the knowledge from [RPN](https:\/\/paperswithcode.com\/method\/rpn) to refine the feature distribution of the feature map. RPN is trained to recognize foreground regions under the supervision of ground truths. Therefore, the intermediate features in RPN can be used to distinguish foreground features from background features. SAM accepts two inputs: the intermediate feature map from RPN $\\mathcal{F}^{RPN}$ and the thin feature map from the [Context Enhancement Module](https:\/\/paperswithcode.com\/method\/context-enhancement-module) $\\mathcal{F}^{CEM}$. The output of SAM $\\mathcal{F}^{SAM}$ is defined as:\r\n\r\n$$ \\mathcal{F}^{SAM} = \\mathcal{F}^{CEM} * \\text{sigmoid}\\left(\\theta\\left(\\mathcal{F}^{RPN}\\right)\\right) $$\r\n\r\nHere $\\theta\\left(\u00b7\\right)$ is a dimension transformation to match the number of channels in both feature maps. The sigmoid function is used to constrain the values within $\\left[0, 1\\right]$. At last, $\\mathcal{F}^{CEM}$ is re-weighted by the generated feature map for better feature distribution. For computational efficiency, we simply apply a 1\u00d71 [convolution](https:\/\/paperswithcode.com\/method\/convolution) as $\\theta\\left(\u00b7\\right)$, so the computational cost of CEM is negligible. The Figure to the right shows the structure of SAM. \r\n\r\nSAM has two functions. The first one is to refine the feature distribution by strengthening foreground features and suppressing background features. The second one is to stabilize the training of RPN as SAM enables extra gradient flow from [R-CNN](https:\/\/paperswithcode.com\/method\/r-cnn) subnet to RPN. As a result, RPN receives additional supervision from RCNN subnet, which helps the training of RPN.","429":"**Position-Sensitive RoIAlign** is a positive sensitive version of [RoIAlign](https:\/\/paperswithcode.com\/method\/roi-align) - i.e. it performs selective alignment, allowing for the learning of position-sensitive region of interest aligning.","430":"**SNet** is a convolutional neural network architecture and object detection backbone used for the [ThunderNet](https:\/\/paperswithcode.com\/method\/thundernet) two-stage object detector. SNet uses ShuffleNetV2 basic blocks but replaces all 3\u00d73 depthwise convolutions with 5\u00d75 depthwise convolutions.","431":"**ThunderNet** is a two-stage object detection model. The design of ThunderNet aims at the computationally expensive structures in state-of-the-art two-stage detectors. The backbone utilises a [ShuffleNetV2](https:\/\/paperswithcode.com\/method\/shufflenet-v2) inspired network called [SNet](https:\/\/paperswithcode.com\/method\/snet) designed for object detection. 
In the detection part, ThunderNet follows the detection head design in Light-Head [R-CNN](https:\/\/paperswithcode.com\/method\/r-cnn), and further compresses the [RPN](https:\/\/paperswithcode.com\/method\/rpn) and R-CNN subnet. To eliminate the performance degradation induced by small backbones and small feature maps, ThunderNet uses two new efficient architecture blocks, [Context Enhancement Module](https:\/\/paperswithcode.com\/method\/context-enhancement-module) (CEM) and [Spatial Attention Module](https:\/\/paperswithcode.com\/method\/spatial-attention-module) (SAM). CEM combines the feature maps from multiple scales to leverage local and global context information, while SAM uses the information learned in RPN to refine the feature distribution in RoI warping.","432":"Spatial Broadcast Decoder is an architecture that aims to improve disentangling, reconstruction accuracy, and generalization to held-out regions in data space. It provides a particularly dramatic\r\nbenefit when applied to datasets with small objects.\r\n\r\nSource: [Watters et al.](https:\/\/arxiv.org\/pdf\/1901.07017v2.pdf)\r\n\r\nImage source: [Watters et al.](https:\/\/arxiv.org\/pdf\/1901.07017v2.pdf)","433":"The **Affine Operator** is an affine transformation layer introduced in the [ResMLP](https:\/\/paperswithcode.com\/method\/resmlp) architecture. This replaces [layer normalization](https:\/\/paperswithcode.com\/method\/layer-normalization), as in [Transformer based networks](https:\/\/paperswithcode.com\/methods\/category\/transformers), which is possible since in the ResMLP, there are no [self-attention layers](https:\/\/paperswithcode.com\/method\/scaled) which makes training more stable - hence allowing a more simple affine transformation.\r\n\r\nThe affine operator is defined as:\r\n\r\n$$ \\operatorname{Aff}_{\\mathbf{\\alpha}, \\mathbf{\\beta}}(\\mathbf{x})=\\operatorname{Diag}(\\mathbf{\\alpha}) \\mathbf{x}+\\mathbf{\\beta} $$\r\n\r\nwhere $\\alpha$ and $\\beta$ are learnable weight vectors. This operation only rescales and shifts the input element-wise. This operation has several advantages over other normalization operations: first, as opposed to Layer Normalization, it has no cost at inference time, since it can absorbed in the adjacent linear layer. Second, as opposed to [BatchNorm](https:\/\/paperswithcode.com\/method\/batch-normalization) and Layer Normalization, the Aff operator does not depend on batch statistics.","434":"**Residual Multi-Layer Perceptrons**, or **ResMLP**, is an architecture built entirely upon [multi-layer perceptrons](https:\/\/paperswithcode.com\/methods\/category\/feedforward-networks) for image classification. It is a simple [residual network](https:\/\/paperswithcode.com\/method\/residual-connection) that alternates (i) a [linear layer](https:\/\/paperswithcode.com\/method\/linear-layer) in which image patches interact, independently and identically across channels, and (ii) a two-layer [feed-forward network](https:\/\/paperswithcode.com\/method\/feedforward-network) in which channels interact independently per patch. At the end of the network, the patch representations are average pooled, and fed to a linear classifier.\r\n\r\n[Layer normalization](https:\/\/paperswithcode.com\/method\/layer-normalization) is replaced with a simpler [affine transformation](https:\/\/paperswithcode.com\/method\/affine-operator), thanks to the absence of self-attention layers which makes training more stable. 
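The Aff operator defined above reduces to a one-line module; the sketch below uses the alpha = 1, beta = 0 initialization described for the ResMLP pre-normalization, and everything else is illustrative (the ResMLP description continues below).

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Aff_{alpha,beta}(x) = Diag(alpha) x + beta: an element-wise rescale and shift."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))   # pre-normalization init: alpha = 1
        self.beta = nn.Parameter(torch.zeros(dim))   # beta = 0

    def forward(self, x):                            # x: (..., dim)
        return self.alpha * x + self.beta

print(Affine(128)(torch.randn(4, 196, 128)).shape)
```

Because the operator is a per-channel scale and shift with no data statistics, it can be folded into the adjacent linear layer at inference time, which is the cost advantage mentioned above.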
The affine operator is applied at the beginning (\"pre-normalization\") and end (\"post-normalization\") of each residual block. As a pre-normalization, Aff replaces LayerNorm without using channel-wise statistics. Initialization is achieved as $\\mathbf{\\alpha}=\\mathbf{1}$, and $\\mathbf{\\beta}=\\mathbf{0}$. As a post-normalization, Aff is similar to [LayerScale](https:\/\/paperswithcode.com\/method\/layerscale) and $\\mathbf{\\alpha}$ is initialized with the same small value.","435":"**Disentangled Attention Mechanism** is an attention mechanism used in the [DeBERTa](https:\/\/paperswithcode.com\/method\/deberta) architecture. Unlike [BERT](https:\/\/paperswithcode.com\/method\/bert) where each word in the input layer is represented using a vector which is the sum of its word (content) embedding and position embedding, each word in DeBERTa is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices based on their contents and relative positions, respectively. This is motivated by the observation that the attention weight of a word pair depends on not only their contents but their relative positions. For example, the dependency between the words \u201cdeep\u201d and \u201clearning\u201d is much stronger when they occur next to each other than when they occur in different sentences.","436":"**DeBERTa** is a [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers)-based neural language model that aims to improve the [BERT](https:\/\/paperswithcode.com\/method\/bert) and [RoBERTa](https:\/\/paperswithcode.com\/method\/roberta) models with two techniques: a [disentangled attention mechanism](https:\/\/paperswithcode.com\/method\/disentangled-attention-mechanism) and an enhanced mask decoder. The disentangled attention mechanism is where each word is represented unchanged using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangle matrices on their contents and relative positions. The enhanced mask decoder is used to replace the output [softmax](https:\/\/paperswithcode.com\/method\/softmax) layer to predict the masked tokens for model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve model\u2019s generalization on downstream tasks.","437":"TuckER","438":"**V-trace** is an off-policy actor-critic reinforcement learning algorithm that helps tackle the lag between when actions are generated by the actors and when the learner estimates the gradient. Consider a trajectory $\\left(x\\_{t}, a\\_{t}, r\\_{t}\\right)^{t=s+n}\\_{t=s}$ generated by the actor following some policy $\\mu$. We can define the $n$-steps V-trace target for $V\\left(x\\_{s}\\right)$, our value approximation at state $x\\_{s}$ as:\r\n\r\n$$ v\\_{s} = V\\left(x\\_{s}\\right) + \\sum^{s+n-1}\\_{t=s}\\gamma^{t-s}\\left(\\prod^{t-1}\\_{i=s}c\\_{i}\\right)\\delta\\_{t}V $$\r\n\r\nWhere $\\delta\\_{t}V = \\rho\\_{t}\\left(r\\_{t} + \\gamma{V}\\left(x\\_{t+1}\\right) - V\\left(x\\_{t}\\right)\\right)$ is a temporal difference algorithm for $V$, and $\\rho\\_{t} = \\text{min}\\left(\\bar{\\rho}, \\frac{\\pi\\left(a\\_{t}\\mid{x\\_{t}}\\right)}{\\mu\\left(a\\_{t}\\mid{x\\_{t}}\\right)}\\right)$ and $c\\_{i} = \\text{min}\\left(\\bar{c}, \\frac{\\pi\\left(a\\_{t}\\mid{x\\_{t}}\\right)}{\\mu\\left(a\\_{t}\\mid{x\\_{t}}\\right)}\\right)$ are truncated importance sampling weights. 
We assume that the truncation levels are such that $\\bar{\\rho} \\geq \\bar{c}$.","439":"**IMPALA**, or the **Importance Weighted Actor Learner Architecture**, is an off-policy actor-critic framework that decouples acting from learning and learns from experience trajectories using [V-trace](https:\/\/paperswithcode.com\/method\/v-trace). Unlike the popular [A3C](https:\/\/paperswithcode.com\/method\/a3c)-based agents, in which workers communicate gradients with respect to the parameters of the policy to a central parameter server, IMPALA actors communicate trajectories of experience (sequences of states, actions, and rewards) to a centralized learner. Since the learner in IMPALA has access to full trajectories of experience we use a GPU to perform updates on mini-batches of trajectories while aggressively parallelising all time independent operations. \r\n\r\nThis type of decoupled architecture can achieve very high throughput. However, because the policy used to generate a trajectory can lag behind the policy on the learner by several updates at the time of gradient calculation, learning becomes off-policy. The V-trace off-policy actor-critic algorithm is used to correct for this harmful discrepancy.","440":"**TayPO**, or **Taylor Expansion Policy Optimization**, refers to a set of algorithms that apply the $k$-th order Taylor expansions for policy optimization. This generalizes prior work, including [TRPO](https:\/\/paperswithcode.com\/method\/trpo) as a special case. It can be thought of unifying ideas from trust-region policy optimization and off-policy corrections. Taylor expansions share high-level similarities with both trust region policy search and off-policy corrections. To get high-level intuitions of such similarities, consider a simple 1D example of Taylor expansions. Given a sufficiently smooth real-valued function on the real line $f : \\mathbb{R} \\rightarrow \\mathbb{R}$, the $k$-th order Taylor expansion of $f\\left(x\\right)$ at $x\\_{0}$ is \r\n\r\n$$f\\_{k}\\left(x\\right) = f\\left(x\\_{0}\\right)+\\sum^{k}\\_{i=1}\\left[f^{(i)}\\left(x\\_{0}\\right)\/i!\\right]\\left(x\u2212x\\_{0}\\right)^{i}$$\r\n\r\nwhere $f^{(i)}\\left(x\\_{0}\\right)$ are the $i$-th order derivatives at $x\\_{0}$. First, a common feature shared by Taylor expansions and trust-region policy search is the inherent notion of a trust region constraint. Indeed, in order for convergence to take place, a trust-region constraint is required $|x \u2212 x\\_{0}| < R\\left(f, x\\_{0}\\right)^{1}$. Second, when using the truncation as an approximation to the original function $f\\_{K}\\left(x\\right) \\approx f\\left(x\\right)$, Taylor expansions satisfy the requirement of off-policy evaluations: evaluate target policy with behavior data. Indeed, to evaluate the truncation $f\\_{K}\\left(x\\right)$ at any $x$ (target policy), we only require the behavior policy \"data\" at $x\\_{0}$ (i.e., derivatives $f^{(i)}\\left(x\\_{0}\\right)$).","441":"**Neural Tangent Transfer**, or **NTT**, is a method for finding trainable sparse networks in a label-free manner. 
Specifically, NTT finds sparse networks whose training dynamics, as characterized by the neural tangent kernel, mimic those of dense networks in function space.","442":"**Detailed Expression Capture and Animation**, or **DECA**, is a model for 3D face reconstruction that is trained to robustly produce a UV displacement map from a low-dimensional latent representation that consists of person-specific detail parameters and generic expression parameters, while a regressor is trained to predict detail, shape, albedo, expression, pose and illumination parameters from a single image. A detail-consistency loss is used to disentangle person-specific details and expression-dependent wrinkles. This disentanglement allows us to synthesize realistic person-specific wrinkles by controlling expression parameters while keeping person-specific details unchanged.","443":"**VoiceFilter-Lite** is a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. In this architecture, the voice filtering model operates as a frame-by-frame frontend signal processor to enhance the features consumed by the speech recognizer, without reconstructing audio signals from the features. The key contributions are (1) A system to perform speech separation directly on ASR input features; (2) An asymmetric loss function to penalize oversuppression during training, to make the model harmless under various acoustic environments, (3) An adaptive suppression strength mechanism to adapt to different noise conditions.","444":"SENet pioneered channel attention. The core of SENet is a squeeze-and-excitation (SE) block which is used to collect global information, capture channel-wise relationships and improve representation ability.\r\nSE blocks are divided into two parts, a squeeze module and an excitation module. Global spatial information is collected in the squeeze module by global average pooling. The excitation module captures channel-wise relationships and outputs an attention vector by using fully-connected layers and non-linear layers (ReLU and sigmoid). Then, each channel of the input feature is scaled by multiplying the corresponding element in the attention vector. Overall, a squeeze-and-excitation block $F_\\text{se}$ (with parameter $\\theta$) which takes $X$ as input and outputs $Y$ can be formulated \r\nas:\r\n\\begin{align}\r\n s = F_\\text{se}(X, \\theta) & = \\sigma (W_{2} \\delta (W_{1}\\text{GAP}(X)))\r\n\\end{align}\r\n\\begin{align}\r\n Y = sX\r\n\\end{align}","445":"**Efficient Channel Attention** is an architectural unit based on [squeeze-and-excitation](https:\/\/paperswithcode.com\/method\/squeeze-and-excitation-block) blocks that reduces model complexity without dimensionality reduction. It was proposed as part of the [ECA-Net](https:\/\/paperswithcode.com\/method\/eca-net) CNN architecture. \r\n\r\nAfter channel-wise [global average pooling](https:\/\/paperswithcode.com\/method\/global-average-pooling) without dimensionality reduction, the ECA captures local cross-channel interaction by considering every channel and its $k$ neighbors. 
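A minimal sketch of the squeeze-and-excitation computation in the equations above (global average pooling, two fully-connected layers with ReLU and sigmoid, then per-channel rescaling); the reduction ratio of 16 is a common choice rather than something fixed by the description (the ECA entry continues below).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """s = sigmoid(W2 relu(W1 GAP(X))), Y = s * X."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                    # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))               # squeeze: global average pooling
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        return x * s[:, :, None, None]       # excite: per-channel rescaling

x = torch.randn(2, 64, 16, 16)
print(SEBlock(64)(x).shape)
```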
The ECA can be efficiently implemented by fast $1D$ [convolution](https:\/\/paperswithcode.com\/method\/convolution) of size $k$, where kernel size $k$ represents the coverage of local cross-channel interaction, i.e., how many neighbors participate in attention prediction of one channel.","446":"**FixRes** is an image scaling strategy that seeks to optimize classifier performance. It is motivated by the observation that data augmentations induce a significant discrepancy between the size of the objects seen by the classifier at train and test time: in fact, a lower train resolution improves the classification at test time! FixRes is a simple strategy to optimize the classifier performance, that employs different train and test resolutions. The calibrations are: (a) calibrating the object sizes by adjusting the crop size and (b) adjusting statistics before spatial pooling.","447":"**Weight Standardization** is a normalization technique that smooths the loss landscape by standardizing the weights in convolutional layers. Different from the previous normalization methods that focus on *activations*, WS considers the smoothing effects of *weights* more than just length-direction decoupling. Theoretically, WS reduces the Lipschitz constants of the loss and the gradients.\r\nHence, WS smooths the loss landscape and improves training.\r\n\r\nIn Weight Standardization, instead of directly optimizing the loss $\\mathcal{L}$ on the original weights $\\hat{W}$, we reparameterize the weights $\\hat{W}$ as a function of $W$, i.e. $\\hat{W}=\\text{WS}(W)$, and optimize the loss $\\mathcal{L}$ on $W$ by [SGD](https:\/\/paperswithcode.com\/method\/sgd):\r\n\r\n$$\r\n \\hat{W} = \\Big[ \\hat{W}\\_{i,j}~\\big|~ \\hat{W}\\_{i,j} = \\dfrac{W\\_{i,j} - \\mu\\_{W\\_{i,\\cdot}}}{\\sigma\\_{W\\_{i,\\cdot}+\\epsilon}}\\Big]\r\n$$\r\n\r\n$$\r\n y = \\hat{W}*x\r\n$$\r\n\r\nwhere\r\n\r\n$$\r\n \\mu_{W\\_{i,\\cdot}} = \\dfrac{1}{I}\\sum\\_{j=1}^{I}W\\_{i, j},~~\\sigma\\_{W\\_{i,\\cdot}}=\\sqrt{\\dfrac{1}{I}\\sum\\_{i=1}^I(W\\_{i,j} - \\mu\\_{W\\_{i,\\cdot}})^2}\r\n$$\r\n\r\nSimilar to [Batch Normalization](https:\/\/paperswithcode.com\/method\/batch-normalization), WS controls the first and second moments of the weights of each output channel individually in convolutional layers. Note that many initialization methods also initialize the weights in some similar ways. Different from those methods, WS standardizes the weights in a differentiable way which aims to normalize gradients during back-propagation. Note that we do not have any affine transformation on $\\hat{W}$. This is because we assume that normalization layers such as BN or [GN](https:\/\/paperswithcode.com\/method\/group-normalization) will normalize this convolutional layer again.","448":"**Group Normalization** is a normalization layer that divides channels into groups and normalizes the features within each group. GN does not exploit the batch dimension, and its computation is independent of batch sizes. In the case where the group size is 1, it is equivalent to [Instance Normalization](https:\/\/paperswithcode.com\/method\/instance-normalization).\r\n\r\nAs motivation for the method, many classical features like SIFT and HOG had *group-wise* features and involved *group-wise normalization*. 
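Referring back to the Weight Standardization equations above, a minimal sketch is a `Conv2d` whose weights are standardized per output channel on the fly before the convolution; subclassing `nn.Conv2d` and the epsilon placement are illustrative choices (the Group Normalization entry continues below).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with Weight Standardization: each output channel's weights are given
    zero mean and unit variance before the convolution (small epsilon for stability)."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

x = torch.randn(2, 16, 32, 32)
print(WSConv2d(16, 32, kernel_size=3, padding=1)(x).shape)
```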
For example, a HOG vector is the outcome of several spatial cells where each cell is represented by a normalized orientation histogram.\r\n\r\nFormally, Group Normalization is defined as:\r\n\r\n$$ \\mu\\_{i} = \\frac{1}{m}\\sum\\_{k\\in\\mathcal{S}\\_{i}}x\\_{k} $$\r\n\r\n$$ \\sigma^{2}\\_{i} = \\frac{1}{m}\\sum\\_{k\\in\\mathcal{S}\\_{i}}\\left(x\\_{k}-\\mu\\_{i}\\right)^{2} $$\r\n\r\n$$ \\hat{x}\\_{i} = \\frac{x\\_{i} - \\mu\\_{i}}{\\sqrt{\\sigma^{2}\\_{i}+\\epsilon}} $$\r\n\r\nHere $x$ is the feature computed by a layer, and $i$ is an index. Formally, a Group Norm layer computes $\\mu$ and $\\sigma$ in a set $\\mathcal{S}\\_{i}$ defined as: $\\mathcal{S}\\_{i} = ${$k \\mid k\\_{N} = i\\_{N} ,\\lfloor\\frac{k\\_{C}}{C\/G}\\rfloor = \\lfloor\\frac{i\\_{C}}{C\/G}\\rfloor $}.\r\n\r\nHere $G$ is the number of groups, which is a pre-defined hyper-parameter ($G = 32$ by default). $C\/G$ is the number of channels per group. $\\lfloor\\cdot\\rfloor$ is the floor operation, and the final term means that the indexes $i$ and $k$ are in the same group of channels, assuming each group of channels is stored in a sequential order along the $C$ axis.","449":"A scalable second order optimization algorithm for deep learning.\r\n\r\nOptimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and\/or second order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad), that along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.","450":"The **Enhanced Fusion Framework** proposes three different ideas to improve the existing MI-based BCI frameworks.\r\n\r\nImage source: [Fumanal-Idocin et al.](https:\/\/arxiv.org\/pdf\/2101.06968v1.pdf)","451":"A BCI MI framework to classify brain signals using a multimodal decision-making phase, with an additional differentiation of the signal.","452":"A BCI MI signal classification framework using fuzzy integrals.\r\n\r\nPaper: Ko, L. W., Lu, Y. C., Bustince, H., Chang, Y. C., Chang, Y., Fernandez, J., ... & Lin, C. T. (2019). Multimodal fuzzy fusion for enhancing the motor-imagery-based brain computer interface. IEEE Computational Intelligence Magazine, 14(1), 96-106.","453":"**MushroomRL** is an open-source Python library developed to simplify the process of implementing and running Reinforcement Learning (RL) experiments. 
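Stepping back to the Group Normalization formulas above, they can be checked numerically against PyTorch's built-in `nn.GroupNorm` with 32 groups and its default affine parameters (weight 1, bias 0); the tensor sizes below are arbitrary (the MushroomRL description continues below).

```python
import torch

def group_norm(x, num_groups=32, eps=1e-5):
    """Normalize each group of C/G channels over (C/G, H, W) within every sample."""
    n, c, h, w = x.shape
    x = x.view(n, num_groups, c // num_groups, h, w)
    mean = x.mean(dim=(2, 3, 4), keepdim=True)
    var = x.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
    x = (x - mean) / torch.sqrt(var + eps)
    return x.view(n, c, h, w)

x = torch.randn(2, 64, 16, 16)
print(torch.allclose(group_norm(x), torch.nn.GroupNorm(32, 64)(x), atol=1e-5))  # True
```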
The architecture of MushroomRL is built in such a way that every component of an RL problem is already provided, and most of the time users need only focus on the implementation of their own algorithms and experiments. MushroomRL comes with a strongly modular architecture that makes it easy to understand how each component is structured and how it interacts with other ones; moreover, it provides an exhaustive set of RL methodologies.","454":"A method to overcome catastrophic forgetting in neural networks during continual learning.","455":"**$L_{1}$ Regularization** is a regularization technique applied to the weights of a neural network. We minimize a loss function comprising both the primary loss function and a penalty on the $L\_{1}$ Norm of the weights:\r\n\r\n$$L\_{new}\left(w\right) = L\_{original}\left(w\right) + \lambda{||w||}\_{1}$$\r\n\r\nwhere $\lambda$ is a value determining the strength of the penalty. In contrast to [weight decay](https:\/\/paperswithcode.com\/method\/weight-decay), $L_{1}$ regularization promotes sparsity; i.e. some parameters have an optimal value of zero.\r\n\r\nImage Source: [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Regularization_(mathematics)#\/media\/File:Sparsityl1.png)","456":"A **Sparse Autoencoder** is a type of autoencoder that employs sparsity to achieve an information bottleneck. Specifically the loss function is constructed so that activations are penalized within a layer. The sparsity constraint can be imposed with [L1 regularization](https:\/\/paperswithcode.com\/method\/l1-regularization) or a KL divergence between expected average neuron activation to an ideal distribution $p$.\r\n\r\nImage: [Jeremy Jordan](https:\/\/www.jeremyjordan.me\/autoencoders\/). Read his blog post for a detailed summary of autoencoders.","457":"**FCOS** is an anchor-box free, proposal free, single-stage object detection model. By eliminating the predefined set of anchor boxes, FCOS avoids computation related to anchor boxes such as calculating overlap during training. It also avoids all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance.","458":"**AlignPS**, or **Feature-Aligned Person Search Network**, is an anchor-free framework for efficient person search. The model employs the typical architecture of an anchor-free detection model (i.e., [FCOS](https:\/\/paperswithcode.com\/method\/fcos)). An aligned feature aggregation (AFA) module is designed to make the model focus more on the re-id subtask. Specifically, AFA reshapes some building blocks of [FPN](https:\/\/paperswithcode.com\/method\/fpn) to overcome the issues of region and scale misalignment in re-id feature learning. A [deformable convolution](https:\/\/paperswithcode.com\/method\/deformable-convolution) is exploited to make the re-id embeddings adaptively aligned with the foreground regions. A feature fusion scheme is designed to better aggregate features from different FPN levels, which makes the re-id features more robust to scale variations. The training procedures of re-id and detection are also optimized to place more emphasis on generating robust re-id embeddings.","459":"**VOS** is a type of video object segmentation model consisting of two network components. The target appearance model consists of a light-weight module, which is learned during the inference stage using fast optimization techniques to predict a coarse but robust target segmentation. 
The segmentation model is exclusively trained offline, designed to process the coarse scores into high quality segmentation masks.","460":"**Local SGD** is a distributed training technique that runs [SGD](https:\/\/paperswithcode.com\/method\/sgd) independently in parallel on different workers and averages the sequences only once in a while.","461":"**ASLFeat** is a convolutional neural network for learning local features that uses deformable convolutional networks to densely estimate and apply local transformation. It also takes advantage of the inherent feature hierarchy to restore spatial resolution and low-level details for accurate keypoint localization. Finally, it uses a peakiness measurement to relate feature responses and derive more indicative detection scores.","462":"**FastSpeech2** is a text-to-speech model that aims to improve upon FastSpeech by better solving the one-to-many mapping problem in TTS, i.e., multiple speech variations corresponding to the same text. It attempts to solve this problem by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, in FastSpeech 2, we extract duration, pitch and energy from speech waveform and directly take them as conditional inputs in training and use predicted values in inference.\r\n\r\nThe encoder converts the phoneme embedding sequence into the phoneme hidden sequence, and then the variance adaptor adds different variance information such as duration, pitch and energy into the hidden sequence, finally the mel-spectrogram decoder converts the adapted hidden sequence into mel-spectrogram sequence in parallel. FastSpeech 2 uses a feed-forward [Transformer](https:\/\/paperswithcode.com\/method\/transformer) block, which is a stack of [self-attention](https:\/\/paperswithcode.com\/method\/multi-head-attention) and 1D-[convolution](https:\/\/paperswithcode.com\/method\/convolution) as in FastSpeech, as the basic structure for the encoder and mel-spectrogram decoder.","463":"Multiscale Attention ViT with Late fusion (MAVL) is a multi-modal network, trained with aligned image-text pairs, capable of performing targeted detection using human understandable natural language text queries. It utilizes multi-scale image features and uses deformable convolutions with late multi-modal fusion. The authors demonstrate excellent ability of MAVL as class-agnostic object detector when queried using general human understandable natural language command, such as \"all objects\", \"all entities\", etc.","464":"**Multiscale Vision Transformer**, or **MViT**, is a [transformer](https:\/\/paperswithcode.com\/method\/transformer) architecture for modeling visual data such as images and videos. Unlike conventional transformers, which maintain a constant channel capacity and resolution throughout the network, Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. 
This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers operating at spatially coarse resolution to model complex, high-dimensional features.","465":"**context2vec** is an unsupervised model for learning generic context embedding of wide sentential contexts, using a bidirectional [LSTM](https:\/\/paperswithcode.com\/method\/lstm). The model is trained on a large plain-text corpus to learn a neural representation that embeds entire sentential contexts and target words in the same low-dimensional space, which is optimized to reflect inter-dependencies between targets and their entire sentential context as a whole. \r\n\r\nIn contrast to word2vec, which uses context modeling mostly internally and considers the target word embeddings as its main output, the focus of context2vec is the context representation. context2vec achieves its objective by assigning similar embeddings to sentential contexts and their associated target words.","466":"**Dot-Product Attention** is an attention mechanism where the alignment score function is calculated as: \r\n\r\n$$f_{att}\left(\textbf{h}\_{i}, \textbf{s}\_{j}\right) = \textbf{h}\_{i}^{T}\textbf{s}\_{j}$$\r\n\r\nIt is equivalent to [multiplicative attention](https:\/\/paperswithcode.com\/method\/multiplicative-attention) (without a trainable weight matrix, assuming this is instead an identity matrix). Here $\textbf{h}$ refers to the hidden states for the encoder, and $\textbf{s}$ is the hidden states for the decoder. The function above is thus a type of alignment score function. \r\n\r\nWithin a neural network, once we have the alignment scores, we calculate the final scores\/weights using a [softmax](https:\/\/paperswithcode.com\/method\/softmax) function of these alignment scores (ensuring they sum to 1).","467":"**Spatial Gating Unit**, or **SGU**, is a gating unit used in the [gMLP](https:\/\/paperswithcode.com\/method\/gmlp) architecture to capture spatial interactions. To enable cross-token interactions, it is necessary for the layer $s(\cdot)$ to contain a contraction operation over the spatial dimension. The layer $s(\cdot)$ is formulated as the output of linear gating:\r\n\r\n$$\r\ns(Z)=Z \odot f\_{W, b}(Z)\r\n$$\r\n\r\nwhere $\odot$ denotes element-wise multiplication. For training stability, the authors find it critical to initialize $W$ as near-zero values and $b$ as ones, meaning that $f\_{W, b}(Z) \approx 1$ and therefore $s(Z) \approx Z$ at the beginning of training. 
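A minimal sketch of this gate in its un-split form $s(Z)=Z \odot f\_{W, b}(Z)$, with $W$ initialized near zero and $b$ initialized to ones as just described (class and argument names are illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Gate tokens with a linear projection f_{W,b} applied across the token (spatial) axis."""

    def __init__(self, seq_len: int):
        super().__init__()
        self.proj = nn.Linear(seq_len, seq_len)      # W is (n x n), b is (n,)
        nn.init.normal_(self.proj.weight, std=1e-6)  # W near zero ...
        nn.init.ones_(self.proj.bias)                # ... and b = 1, so f_{W,b}(Z) ~ 1 initially

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, n_tokens, d_channels); the gate mixes information across tokens
        gate = self.proj(z.transpose(1, 2)).transpose(1, 2)
        return z * gate                              # s(Z) ~ Z at the start of training
```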
This initialization ensures each [gMLP](https:\/\/paperswithcode.com\/method\/gmlp) block behaves like a regular [FFN](https:\/\/paperswithcode.com\/method\/gmlp) at the early stage of training, where each token is processed independently, and only gradually injects spatial information across tokens during the course of learning.\r\n\r\nThe authors find it further effective to split $Z$ into two independent parts $\\left(Z\\_{1}, Z\\_{2}\\right)$ along the channel dimension for the gating function and for the multiplicative bypass:\r\n\r\n$$\r\ns(Z)=Z\\_{1} \\odot f\\_{W, b}\\left(Z\\_{2}\\right)\r\n$$\r\n\r\nThey also normalize the input to $f\\_{W, b}$ which empirically improved the stability of large NLP models.","468":"**gMLP** is an [MLP](https:\/\/paperswithcode.com\/methods\/category\/feedforward-networks)-based alternative to [Transformers](https:\/\/paperswithcode.com\/methods\/category\/vision-transformer) without [self-attention](https:\/\/paperswithcode.com\/method\/scaled), which simply consists of channel projections and spatial projections with static parameterization. It is built out of basic MLP layers with gating. The model consists of a stack of $L$ blocks with identical size and structure. Let $X \\in \\mathbb{R}^{n \\times d}$ be the token representations with sequence length $n$ and dimension $d$. Each block is defined as:\r\n\r\n$$\r\nZ=\\sigma(X U), \\quad \\tilde{Z}=s(Z), \\quad Y=\\tilde{Z} V\r\n$$\r\n\r\nwhere $\\sigma$ is an activation function such as [GeLU](https:\/\/paperswithcode.com\/method\/gelu). $U$ and $V$ define linear projections along the channel dimension - the same as those in the FFNs of Transformers (e.g., their shapes are $768 \\times 3072$ and $3072 \\times 768$ for $\\text{BERT}_{\\text {base }}$).\r\n\r\nA key ingredient is $s(\\cdot)$, a layer which captures spatial interactions. When $s$ is an identity mapping, the above transformation degenerates to a regular FFN, where individual tokens are processed independently without any cross-token communication. One of the major focuses is therefore to design a good $s$ capable of capturing complex spatial interactions across tokens. This leads to the use of a [Spatial Gating Unit](https:\/\/www.paperswithcode.com\/method\/spatial-gating-unit) which involves a modified linear gating.\r\n\r\nThe overall block layout is inspired by [inverted bottlenecks](https:\/\/paperswithcode.com\/method\/inverted-residual-block), which define $s(\\cdot)$ as a [spatial depthwise convolution](https:\/\/paperswithcode.com\/method\/depthwise-separable-convolution). Note, unlike Transformers, gMLP does not require position embeddings because such information will be captured in $s(\\cdot)$.","469":"VLG-Net leverages recent advantages in Graph Neural Networks (GCNs) and leverages a novel multi-modality graph-based fusion method for the task of natural language video grounding.","470":"SAGA is a method in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. 
Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem.","471":"**UNETR**, or **UNet Transformer**, is a [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers)-based architecture for [medical image segmentation](https:\/\/paperswithcode.com\/task\/medical-image-segmentation) that utilizes a pure [transformer](https:\/\/paperswithcode.com\/method\/transformer) as the encoder to learn sequence representations of the input volume -- effectively capturing the global multi-scale information. The transformer encoder is directly connected to a decoder via [skip connections](https:\/\/paperswithcode.com\/methods\/category\/skip-connections) at different resolutions like a [U-Net](https:\/\/paperswithcode.com\/method\/u-net) to compute the final semantic segmentation output.","472":"A **Global Convolutional Network**, or **GCN**, is a semantic segmentation building block that utilizes a large kernel to help perform classification and localization tasks simultaneously. It can be used in a [FCN](https:\/\/paperswithcode.com\/method\/fcn)-like structure, where the [GCN](https:\/\/paperswithcode.com\/method\/gcn) is used to generate semantic score maps. Instead of directly using larger kernels or global [convolution](https:\/\/paperswithcode.com\/method\/convolution), the GCN module employs a combination of $1 \\times k + k \\times 1$ and $k \\times 1 + 1 \\times k$ convolutions, which enables [dense connections](https:\/\/paperswithcode.com\/method\/dense-connections) within a large\r\n$k\\times{k}$ region in the feature map","473":"FixMatch is an algorithm that first generates pseudo-labels using the model's predictions on weakly-augmented unlabeled images. For a given image, the pseudo-label is only retained if the model produces a high-confidence prediction. The model is then trained to predict the pseudo-label when fed a strongly-augmented version of the same image.\r\n\r\nDescription from: [FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence](https:\/\/paperswithcode.com\/paper\/fixmatch-simplifying-semi-supervised-learning)\r\n\r\nImage credit: [FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence](https:\/\/paperswithcode.com\/paper\/fixmatch-simplifying-semi-supervised-learning)","474":"**Retriever-Augmented Generation**, or **RAG**, is a type of language generation model that combines pre-trained parametric and non-parametric memory for language generation. Specifically, the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. For query $x$, Maximum Inner Product Search (MIPS) is used to find the top-K documents $z\\_{i}$. For final prediction $y$, we treat $z$ as a latent variable and marginalize over seq2seq predictions given different documents.","475":"**Location-based Attention** is an attention mechanism in which the alignment scores are computed from solely the target hidden state $\\mathbf{h}\\_{t}$ as follows:\r\n\r\n$$ \\mathbf{a}\\_{t} = \\text{softmax}(\\mathbf{W}\\_{a}\\mathbf{h}_{t}) $$","476":"**CoVe**, or **Contextualized Word Vectors**, uses a deep [LSTM](https:\/\/paperswithcode.com\/method\/lstm) encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors. $\\text{CoVe}$ word embeddings are therefore a function of the entire input sequence. 
These word embeddings can then be used in downstream tasks by concatenating them with $\\text{GloVe}$ embeddings:\r\n\r\n$$ v = \\left[\\text{GloVe}\\left(x\\right), \\text{CoVe}\\left(x\\right)\\right]$$\r\n\r\nand then feeding these in as features for the task-specific models.","477":"**Graph Self-Attention (GSA)** is a self-attention module used in the [BP-Transformer](https:\/\/paperswithcode.com\/method\/bp-transformer) architecture, and is based on the [graph attentional layer](https:\/\/paperswithcode.com\/method\/graph-attentional-layer).\r\n\r\nFor a given node $u$, we update its representation according to its neighbour nodes, formulated as $\\mathbf{h}\\_{u} \\leftarrow \\text{GSA}\\left(\\mathcal{G}, \\mathbf{h}^{u}\\right)$.\r\n\r\nLet $\\mathbf{A}\\left(u\\right)$ denote the set of the neighbour nodes of $u$ in $\\mathcal{G}$, $\\text{GSA}\\left(\\mathcal{G}, \\mathbf{h}^{u}\\right)$ is detailed as follows:\r\n\r\n$$ \\mathbf{A}^{u} = \\text{concat}\\left(\\{\\mathbf{h}\\_{v} | v \\in \\mathcal{A}\\left(u\\right)\\}\\right) $$\r\n\r\n$$ \\mathbf{Q}^{u}\\_{i} = \\mathbf{H}\\_{k}\\mathbf{W}^{Q}\\_{i},\\mathbf{K}\\_{i}^{u} = \\mathbf{A}^{u}\\mathbf{W}^{K}\\_{i},\\mathbf{V}^{u}\\_{i} = \\mathbf{A}^{u}\\mathbf{W}\\_{i}^{V} $$\r\n\r\n$$ \\text{head}^{u}\\_{i} = \\text{softmax}\\left(\\frac{\\mathbf{Q}^{u}\\_{i}\\mathbf{K}\\_{i}^{uT}}{\\sqrt{d}}\\right)\\mathbf{V}\\_{i}^{u} $$\r\n\r\n$$ \\text{GSA}\\left(\\mathcal{G}, \\mathbf{h}^{u}\\right) = \\left[\\text{head}^{u}\\_{1}, \\dots, \\text{head}^{u}\\_{h}\\right]\\mathbf{W}^{O}$$\r\n\r\nwhere d is the dimension of h, and $\\mathbf{W}^{Q}\\_{i}$, $\\mathbf{W}^{K}\\_{i}$ and $\\mathbf{W}^{V}\\_{i}$ are trainable parameters of the $i$-th attention head.","478":"**Rectified Adam**, or **RAdam**, is a variant of the [Adam](https:\/\/paperswithcode.com\/method\/adam) stochastic optimizer that introduces a term to rectify the variance of the adaptive learning rate. It seeks to tackle the bad convergence problem suffered by Adam. The authors argue that the root cause of this behaviour is that the adaptive learning rate has undesirably large variance in the early stage of model training, due to the limited amount of training samples being used. Thus, to reduce such variance, it is better to use smaller learning rates in the first few epochs of training - which justifies the warmup heuristic. 
This heuristic motivates RAdam which rectifies the variance problem:\r\n\r\n$$g\_{t} = \nabla\_{\theta}f\_{t}\left(\theta\_{t-1}\right) $$\r\n\r\n$$v\_{t} = \beta\_{2}v\_{t-1} + \left(1-\beta\_{2}\right)g^{2}\_{t} $$\r\n\r\n$$m\_{t} = \beta\_{1}m\_{t-1} + \left(1-\beta\_{1}\right)g\_{t} $$\r\n\r\n$$ \hat{m}\_{t} = m\_{t} \/ \left(1-\beta^{t}\_{1}\right) $$\r\n\r\n$$ \rho\_{t} = \rho\_{\infty} - 2t\beta^{t}\_{2}\/\left(1-\beta^{t}\_{2}\right) $$\r\n\r\n$$\rho_{\infty} = \frac{2}{1-\beta_2} - 1$$ \r\n\r\nIf the variance is tractable, i.e. $\rho\_{t} > 4$, then:\r\n\r\n...the adaptive learning rate is computed as:\r\n\r\n$$ l\_{t} = \sqrt{\left(1-\beta^{t}\_{2}\right)\/v\_{t}}$$\r\n\r\n...the variance rectification term is calculated as:\r\n\r\n$$ r\_{t} = \sqrt{\frac{(\rho\_{t}-4)(\rho\_{t}-2)\rho\_{\infty}}{(\rho\_{\infty}-4)(\rho\_{\infty}-2)\rho\_{t}}}$$\r\n\r\n...and we update parameters with adaptive momentum:\r\n\r\n$$ \theta\_{t} = \theta\_{t-1} - \alpha\_{t}r\_{t}\hat{m}\_{t}l\_{t} $$\r\n\r\nIf the variance isn't tractable, we update instead with:\r\n\r\n$$ \theta\_{t} = \theta\_{t-1} - \alpha\_{t}\hat{m}\_{t} $$","479":"Hyperboloid Embeddings (HypE) is a novel self-supervised dynamic reasoning framework that utilizes positive first-order existential queries on a KG to learn representations of its entities and relations as hyperboloids in a Poincar\u00e9 ball. HypE models the positive first-order queries as geometrical translation (t), intersection ($\cap$), and union ($\cup$). For the problem of KG reasoning in real-world datasets, the proposed HypE model significantly outperforms the state-of-the-art results. HypE is also applied to an anomaly detection task on a popular e-commerce website product taxonomy as well as hierarchically organized web articles and demonstrates significant performance improvements compared to existing baseline methods. Finally, HypE embeddings can also be visualized in a Poincar\u00e9 ball to clearly interpret and comprehend the representation space.","480":"A generic way of representing and interpolating labels, which allows straightforward extension of any kind of [mixup](https:\/\/paperswithcode.com\/method\/mixup) to deep metric learning for a large class of loss functions.","481":"**MoCo v2** is an improved version of the [Momentum Contrast](https:\/\/paperswithcode.com\/method\/moco) self-supervised learning algorithm. Motivated by the findings presented in the [SimCLR](https:\/\/paperswithcode.com\/method\/simclr) paper, the authors:\r\n\r\n- Replace the 1-layer fully connected layer with a 2-layer MLP head with [ReLU](https:\/\/paperswithcode.com\/method\/relu) for the unsupervised training stage.\r\n- Include blur augmentation.\r\n- Use a cosine learning rate schedule.\r\n\r\nThese modifications enable MoCo to outperform the state-of-the-art SimCLR with a smaller batch size and fewer epochs.","482":"**DeepMask** is an object proposal algorithm based on a convolutional neural network. Given an input image patch, DeepMask generates a class-agnostic mask and an associated score which estimates the likelihood of the patch fully containing a centered object (without any notion of an object category). The core of the model is a ConvNet which jointly predicts the mask and the object score. 
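To make the shared-trunk design concrete, here is a hedged PyTorch-style sketch of a ConvNet that jointly predicts a class-agnostic mask and an objectness score (all layer sizes and names are made up for illustration and do not reproduce the DeepMask reference architecture):

```python
import torch.nn as nn

class JointMaskScoreNet(nn.Module):
    """Shared convolutional trunk feeding two specialized heads: mask logits and an objectness score."""

    def __init__(self, in_channels=3, feat_dim=128, mask_size=56):
        super().__init__()
        self.trunk = nn.Sequential(                  # most computation is shared by both tasks
            nn.Conv2d(in_channels, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(8), nn.Flatten())   # -> (batch, feat_dim * 8 * 8)
        self.mask_head = nn.Linear(feat_dim * 64, mask_size * mask_size)  # class-agnostic mask logits
        self.score_head = nn.Linear(feat_dim * 64, 1)                     # objectness logit

    def forward(self, patch):
        shared = self.trunk(patch)
        return self.mask_head(shared), self.score_head(shared)
```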
A large part of the network is shared between those two tasks: only the last few network layers are specialized for separately outputting a mask and score prediction.","483":"**Wide&Deep** jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization for real-world recommender systems. In summary, the wide component is a generalized linear model. The deep component is a feed-forward neural network. The deep and wide components are combined using a weighted sum of their output log odds as the prediction. This is then fed to a logistic loss function for joint training, which is done by back-propagating the gradients from the output to both the wide and deep part of the model simultaneously using mini-batch stochastic optimization. The AdaGrad optimizer is used for the deep part, while the wide part is optimized with FTRL. The combined model is illustrated in the figure (center).","484":"[Laplacian eigenvectors](https:\/\/paperswithcode.com\/paper\/laplacian-eigenmaps-and-spectral-techniques) represent a natural generalization of the [Transformer](https:\/\/paperswithcode.com\/method\/transformer) positional encodings (PE) for graphs as the eigenvectors of a discrete line (NLP graph) are the cosine and sinusoidal functions. They help encode distance-aware information (i.e., nearby nodes have similar positional features and farther nodes have dissimilar positional features).\r\n\r\nHence, Laplacian Positional Encoding (PE) is a general method to encode node positions in a graph. For each node, its Laplacian PE is the $k$ smallest non-trivial eigenvectors.","485":"This is the **Graph Transformer** method, proposed as a generalization of [Transformer](https:\/\/paperswithcode.com\/method\/transformer) Neural Network architectures, for arbitrary graphs.\r\n\r\nCompared to the original Transformer, the highlights of the presented architecture are:\r\n\r\n- The attention mechanism is a function of neighborhood connectivity for each node in the graph. \r\n- The position encoding is represented by Laplacian eigenvectors, which naturally generalize the sinusoidal positional encodings often used in NLP. \r\n- The [layer normalization](https:\/\/paperswithcode.com\/method\/layer-normalization) is replaced by a [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization) layer. \r\n- The architecture is extended to have edge representation, which can be critical to tasks with rich information on the edges, or pairwise interactions (such as bond types in molecules, or relationship types in KGs, etc.).","486":"**Local Contrast Normalization** is a type of normalization that performs local subtraction and division normalizations, enforcing a sort of local competition between adjacent features in a feature map, and between features at the same spatial location in different feature maps.","487":"**ZFNet** is a classic convolutional neural network. The design was motivated by visualizing intermediate feature layers and the operation of the classifier. Compared to [AlexNet](https:\/\/paperswithcode.com\/method\/alexnet), the filter sizes are reduced and the stride of the convolutions is reduced.","488":"**Electric** is an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over tokens that could occur in a context. 
Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context.\r\n\r\nSpecifically, like BERT, Electric also models $p\\_{\\text {data }}\\left(x\\_{t} \\mid \\mathbf{x}\\_{\\backslash t}\\right)$, but does not use masking or a softmax layer. Electric first maps the unmasked input $\\mathbf{x}=\\left[x\\_{1}, \\ldots, x\\_{n}\\right]$ into contextualized vector representations $\\mathbf{h}(\\mathbf{x})=\\left[\\mathbf{h}\\_{1}, \\ldots, \\mathbf{h}\\_{n}\\right]$ using a transformer network. The model assigns a given position $t$ an energy score\r\n\r\n$$\r\nE(\\mathbf{x})\\_{t}=\\mathbf{w}^{T} \\mathbf{h}(\\mathbf{x})\\_{t}\r\n$$\r\n\r\nusing a learned weight vector $w$. The energy function defines a distribution over the possible tokens at position $t$ as\r\n\r\n$$\r\np\\_{\\theta}\\left(x\\_{t} \\mid \\mathbf{x}_{\\backslash t}\\right)=\\exp \\left(-E(\\mathbf{x})\\_{t}\\right) \/ Z\\left(\\mathbf{x}\\_{\\backslash t}\\right) \r\n$$\r\n\r\n$$\r\n=\\frac{\\exp \\left(-E(\\mathbf{x})\\_{t}\\right)}{\\sum\\_{x^{\\prime} \\in \\mathcal{V}} \\exp \\left(-E\\left(\\operatorname{REPLACE}\\left(\\mathbf{x}, t, x^{\\prime}\\right)\\right)\\_{t}\\right)}\r\n$$\r\n\r\nwhere $\\text{REPLACE}\\left(\\mathbf{x}, t, x^{\\prime}\\right)$ denotes replacing the token at position $t$ with $x^{\\prime}$ and $\\mathcal{V}$ is the vocabulary, in practice usually word pieces. Unlike with BERT, which produces the probabilities for all possible tokens $x^{\\prime}$ using a softmax layer, a candidate $x^{\\prime}$ is passed in as input to the transformer. As a result, computing $p_{\\theta}$ is prohibitively expensive because the partition function $Z\\_{\\theta}\\left(\\mathbf{x}\\_{\\backslash t}\\right)$ requires running the transformer $|\\mathcal{V}|$ times; unlike most EBMs, the intractability of $Z\\_{\\theta}(\\mathbf{x} \\backslash t)$ is more due to the expensive scoring function rather than having a large sample space.","489":"**COLA** is a self-supervised pre-training approach for learning a general-purpose representation of audio. It is based on contrastive learning: it learns a representation which assigns high similarity to audio segments extracted from the same recording while assigning lower similarity to segments from different recordings.","490":"A **(2+1)D Convolution** is a type of [convolution](https:\/\/paperswithcode.com\/method\/convolution) used for action recognition convolutional neural networks, with a spatiotemporal volume. As opposed to applying a [3D Convolution](https:\/\/paperswithcode.com\/method\/3d-convolution) over the entire volume, which can be computationally expensive and lead to overfitting, a (2+1)D convolution splits computation into two convolutions: a spatial 2D convolution followed by a temporal 1D convolution.","491":"A **R(2+1)D** convolutional neural network is a network for action recognition that employs [R(2+1)D](https:\/\/paperswithcode.com\/method\/2-1-d-convolution) convolutions in a [ResNet](https:\/\/paperswithcode.com\/method\/resnet) inspired architecture. 
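A rough sketch of that factorization, assuming PyTorch-style 5-D inputs (module and argument names are illustrative; the R(2+1)D paper additionally sizes the intermediate channel count so that the parameter budget matches a full 3D kernel):

```python
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """(2+1)D block: a spatial (1 x k x k) convolution followed by a temporal (k x 1 x 1) convolution."""

    def __init__(self, in_channels, out_channels, k=3, mid_channels=None):
        super().__init__()
        mid_channels = mid_channels or out_channels
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, k, k), padding=(0, k // 2, k // 2))
        self.relu = nn.ReLU(inplace=True)   # extra non-linearity between the two factors
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(k, 1, 1), padding=(k // 2, 0, 0))

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        return self.temporal(self.relu(self.spatial(x)))
```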
The use of these convolutions over regular [3D Convolutions](https:\/\/paperswithcode.com\/method\/3d-convolution) reduces computational complexity, prevents overfitting, and introduces more non-linearities that allow for a better functional relationship to be modeled.","492":"**CodeT5** is a [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers)-based model for code understanding and generation based on the [T5 architecture](https:\/\/paperswithcode.com\/method\/t5). It utilizes an identifier-aware pre-training objective that considers the crucial token type information (identifiers) from code. Specifically, the denoising [Seq2Seq](https:\/\/paperswithcode.com\/method\/seq2seq) objective of T5 is extended with two identifier tagging and prediction tasks to enable the model to better leverage the token type information from programming languages, which are the identifiers assigned by developers. To improve the natural language-programming language alignment, a bimodal dual learning objective is used for a bidirectional conversion between natural language and programming language.","493":"**Lambda layers** are a building block for modeling long-range dependencies in data. They consist of long-range interactions between a query and a structured set of context elements at a reduced memory cost. Lambda layers transform each available context into a linear function, termed a lambda, which is then directly applied to the corresponding query. Whereas self-attention defines a similarity kernel between the query and the context elements, a lambda layer instead summarizes contextual information into a fixed-size linear function (i.e. a matrix), thus bypassing the need for memory-intensive attention maps.","494":"**LightGCN** is a type of [graph convolutional neural network](https:\/\/paperswithcode.com\/method\/gcn) (GCN), including only the most essential component in GCN (neighborhood aggregation) for collaborative filtering. Specifically, LightGCN learns user and item embeddings by linearly propagating them on the user-item interaction graph, and uses the weighted sum of the embeddings learned at all layers as the final embedding.","495":"**lda2vec** builds representations over both words and documents by mixing word2vec\u2019s skipgram architecture with Dirichlet-optimized sparse topic mixtures. \r\n\r\nThe Skipgram Negative-Sampling (SGNS) objective of word2vec is modified to utilize document-wide feature vectors while simultaneously learning continuous document weights loading onto topic vectors. The total loss term $L$ is the sum of the Skipgram Negative Sampling Loss (SGNS) $L^{neg}\\_{ij}$ with the addition of a Dirichlet-likelihood term over document weights, $L\\_{d}$. The loss is conducted using a context vector, $\\overrightarrow{c\\_{j}}$ , pivot word vector $\\overrightarrow{w\\_{j}}$, target word vector $\\overrightarrow{w\\_{i}}$, and negatively-sampled word vector $\\overrightarrow{w\\_{l}}$:\r\n\r\n$$ L = L^{d} + \\Sigma\\_{ij}L^{neg}\\_{ij} $$\r\n\r\n$$L^{neg}\\_{ij} = \\log\\sigma\\left(c\\_{j}\\cdot\\overrightarrow{w\\_{i}}\\right) + \\sum^{n}\\_{l=0}\\sigma\\left(-\\overrightarrow{c\\_{j}}\\cdot\\overrightarrow{w\\_{l}}\\right)$$","496":"**Deep-MAC**, or **Deep Mask-heads Above CenterNet**, is a type of anchor-free instance segmentation model based on [CenterNet](https:\/\/paperswithcode.com\/method\/centernet). 
The motivation for this new architecture is that boxes are much cheaper to annotate than masks, so the authors address the \u201cpartially supervised\u201d instance segmentation problem, where all classes have bounding box annotations but only a subset of classes have mask annotations. \r\n\r\nFor predicting bounding boxes, CenterNet outputs 3 tensors: (1) a class-specific [heatmap](https:\/\/paperswithcode.com\/method\/heatmap) which indicates the probability of the center of a bounding box being present at each location, (2) a class-agnostic 2-channel tensor indicating the height and width of the bounding box at each center pixel, and (3) since the output feature map is typically smaller than the image (stride 4 or 8), CenterNet also predicts an x and y direction offset to recover this discretization error at each center pixel.\r\n\r\nFor Deep-MAC, in parallel to the box-related prediction heads, we add a fourth pixel embedding branch $P$. For each bounding box\r\n$b$, we crop a region $P\\_{b}$ from $P$ corresponding to $b$ via [ROIAlign](https:\/\/paperswithcode.com\/method\/roi-align) which results in a 32 \u00d7 32 tensor. We then feed each $P\\_{b}$ to a mask-head. The final prediction at the end is a class-agnostic 32 \u00d7 32 tensor which we pass through a sigmoid to get per-pixel probabilities. We train this mask-head via a per-pixel cross-entropy loss averaged over all pixels and instances. During post-processing, the predicted mask is re-aligned according to the predicted box and resized to the resolution of the image. \r\n\r\nIn addition to this 32 \u00d7 32 cropped feature map, we add two inputs for improved stability of some mask-heads: (1) Instance embedding: an additional head is added to the backbone that predicts a per-pixel embedding. For each bounding box $b$ we extract its embedding from the center pixel. This embedding is tiled to a size of 32 \u00d7 32 and concatenated to the pixel embedding crop. This helps condition the mask-head on a particular instance and disambiguate it from others. (2) Coordinate Embedding: Inspired by [CoordConv](https:\/\/paperswithcode.com\/method\/coordconv), the authors add a 32 \u00d7 32 \u00d7 2 tensor holding normalized $\\left(x, y\\right)$ coordinates relative to the bounding box $b$.","497":"**Positional Encoding Generator**, or **PEG**, is a module used in the [Conditional Position Encoding](https:\/\/paperswithcode.com\/method\/conditional-positional-encoding) position embeddings. It dynamically produce the positional encodings conditioned on the local neighborhood of an input token. To condition on the local neighbors, we first reshape the flattened input sequence $X \\in \\mathbb{R}^{B \\times N \\times C}$ of DeiT back to $X^{\\prime} \\in \\mathbb{R}^{B \\times H \\times W \\times C}$ in the 2 -D image space. Then, a function (denoted by $\\mathcal{F}$ in the Figure) is repeatedly applied to the local patch in $X^{\\prime}$ to produce the conditional positional encodings $E^{B \\times H \\times W \\times C} .$ PEG can be efficiently implemented with a 2-D convolution with kernel $k(k \\geq 3)$ and $\\frac{k-1}{2}$ zero paddings. Note that the zero paddings here are important to make the model be aware of the absolute positions, and $\\mathcal{F}$ can be of various forms such as separable convolutions and many others.","498":"**Conditional Positional Encoding**, or **CPE**, is a type of positional encoding for [vision transformers](https:\/\/paperswithcode.com\/methods\/category\/vision-transformer). 
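A hedged sketch of such a PEG, using a depth-wise convolution as one simple choice of $\mathcal{F}$ (class and argument names are illustrative):

```python
import torch.nn as nn

class PEG(nn.Module):
    """Produce conditional positional encodings from the local neighborhood of each token."""

    def __init__(self, dim, k=3):
        super().__init__()
        # depth-wise k x k convolution with (k - 1) / 2 zero padding, as described above
        self.proj = nn.Conv2d(dim, dim, kernel_size=k, padding=(k - 1) // 2, groups=dim)

    def forward(self, tokens, height, width):
        # tokens: (batch, N, C) with N == height * width (a class token would be handled separately)
        b, n, c = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, c, height, width)  # back to the 2-D image space
        pos = self.proj(grid).flatten(2).transpose(1, 2)            # conditional encodings
        return tokens + pos
```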
Unlike previous fixed or learnable positional encodings, which are predefined and independent of input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE aims to generalize to the input sequences that are longer than what the model has ever seen during training. CPE can also keep the desired translation-invariance in the image classification task. CPE can be implemented with a [Position\r\nEncoding Generator](https:\/\/paperswithcode.com\/method\/positional-encoding-generator) (PEG) and incorporated into the current [Transformer framework](https:\/\/paperswithcode.com\/methods\/category\/transformers).","499":"**Global Sub-Sampled Attention**, or **GSA**, is a local [attention mechanism](https:\/\/paperswithcode.com\/methods\/category\/attention-mechanisms-1) used in the [Twins-SVT](https:\/\/paperswithcode.com\/method\/twins-svt) architecture. \r\n\r\nA single representative is used to summarize the key information for each of $m \\times n$ subwindows and the representative is used to communicate with other sub-windows (serving as the key in self-attention), which can reduce the cost to $\\mathcal{O}(m n H W d)=\\mathcal{O}\\left(\\frac{H^{2} W^{2} d}{k\\_{1} k\\_{2}}\\right)$. This is essentially equivalent to using the sub-sampled feature maps as the key in attention operations, and thus it is termed global sub-sampled attention (GSA). \r\n\r\nIf we alternatively use the [LSA](https:\/\/paperswithcode.com\/method\/locally-grouped-self-attention) and GSA like [separable convolutions](https:\/\/paperswithcode.com\/method\/depthwise-separable-convolution) (depth-wise + point-wise). The total computation cost is $\\mathcal{O}\\left(\\frac{H^{2} W^{2} d}{k\\_{1} k\\_{2}}+k\\_{1} k\\_{2} H W d\\right) .$ We have:\r\n\r\n$$\\frac{H^{2} W^{2} d}{k\\_{1} k\\_{2}}+k_{1} k_{2} H W d \\geq 2 H W d \\sqrt{H W} $$ \r\n\r\nThe minimum is obtained when $k\\_{1} \\cdot k\\_{2}=\\sqrt{H W}$. Note that $H=W=224$ is popular in classification. Without loss of generality, square sub-windows are used, i.e., $k\\_{1}=k\\_{2}$. Therefore, $k\\_{1}=k\\_{2}=15$ is close to the global minimum for $H=W=224$. However, the network is designed to include several stages with variable resolutions. Stage 1 has feature maps of $56 \\times 56$, the minimum is obtained when $k\\_{1}=k\\_{2}=\\sqrt{56} \\approx 7$. Theoretically, we can calibrate optimal $k\\_{1}$ and $k\\_{2}$ for each of the stages. For simplicity, $k\\_{1}=k\\_{2}=7$ is used everywhere. As for stages with lower resolutions, the summarizing window-size of GSA is controlled to avoid too small amount of generated keys. Specifically, the sizes of 4,2 and 1 are used for the last three stages respectively.","500":"**Locally-Grouped Self-Attention**, or **LSA**, is a local attention mechanism used in the [Twins-SVT](https:\/\/paperswithcode.com\/method\/twins-svt) architecture. Locally-grouped self-attention (LSA). Motivated by the group design in depthwise convolutions for efficient inference, we first equally divide the 2D feature maps into sub-windows, making self-attention communications only happen within each sub-window. This design also resonates with the multi-head design in self-attention, where the communications only occur within the channels of the same head. To be specific, the feature maps are divided into $m \\times n$ sub-windows. Without loss of generality, we assume $H \\% m=0$ and $W \\% n=0$. 
Each group contains $\frac{H W}{m n}$ elements, and thus the computation cost of the self-attention in this window is $\mathcal{O}\left(\frac{H^{2} W^{2}}{m^{2} n^{2}} d\right)$, and the total cost is $\mathcal{O}\left(\frac{H^{2} W^{2}}{m n} d\right)$. If we let $k\_{1}=\frac{H}{m}$ and $k\_{2}=\frac{W}{n}$, the cost can be computed as $\mathcal{O}\left(k\_{1} k\_{2} H W d\right)$, which is significantly more efficient when $k\_{1} \ll H$ and $k\_{2} \ll W$ and grows linearly with $H W$ if $k\_{1}$ and $k\_{2}$ are fixed.\r\n\r\nAlthough the locally-grouped self-attention mechanism is computationally friendly, the image is divided into non-overlapping sub-windows. Thus, we need a mechanism to communicate between different sub-windows, as in Swin. Otherwise, information would only be processed locally, which makes the receptive field small and significantly degrades performance, as shown in the experiments. This is analogous to the fact that we cannot replace all standard convolutions with depth-wise convolutions in CNNs.","501":"**Spatially Separable Self-Attention**, or **SSSA**, is an [attention module](https:\/\/paperswithcode.com\/methods\/category\/attention-modules) used in the [Twins-SVT](https:\/\/paperswithcode.com\/method\/twins-svt) architecture that aims to reduce the computational complexity of [vision transformers](https:\/\/paperswithcode.com\/methods\/category\/vision-transformer) for dense prediction tasks (given high-resolution inputs). SSSA is composed of [locally-grouped self-attention](https:\/\/paperswithcode.com\/method\/locally-grouped-self-attention) (LSA) and [global sub-sampled attention](https:\/\/paperswithcode.com\/method\/global-sub-sampled-attention) (GSA).\r\n\r\nFormally, spatially separable self-attention (SSSA) can be written as:\r\n\r\n$$\r\n\hat{\mathbf{z}}\_{i j}^{l}=\text { LSA }\left(\text { LayerNorm }\left(\mathbf{z}\_{i j}^{l-1}\right)\right)+\mathbf{z}\_{i j}^{l-1} $$\r\n\r\n$$\mathbf{z}\_{i j}^{l}=\mathrm{FFN}\left(\operatorname{LayerNorm}\left(\hat{\mathbf{z}}\_{i j}^{l}\right)\right)+\hat{\mathbf{z}}\_{i j}^{l} $$\r\n\r\n$$ \hat{\mathbf{z}}^{l+1}=\text { GSA }\left(\text { LayerNorm }\left(\mathbf{z}^{l}\right)\right)+\mathbf{z}^{l} $$\r\n\r\n$$ \mathbf{z}^{l+1}=\text { FFN }\left(\text { LayerNorm }\left(\hat{\mathbf{z}}^{l+1}\right)\right)+\hat{\mathbf{z}}^{l+1}$$\r\n\r\n$$i \in\{1,2, \ldots, m\}, j \in\{1,2, \ldots, n\}\r\n$$\r\n\r\nwhere LSA means locally-grouped self-attention within a sub-window; GSA is the global sub-sampled attention by interacting with the representative keys (generated by the sub-sampling functions) from each sub-window $\hat{\mathbf{z}}\_{i j} \in \mathbb{R}^{k\_{1} \times k\_{2} \times C}$. Both LSA and GSA have multiple heads as in the standard self-attention.","502":"**Twins-SVT** is a type of [vision transformer](https:\/\/paperswithcode.com\/methods\/category\/vision-transformer) which utilizes a [spatially separable attention mechanism](https:\/\/paperswithcode.com\/method\/spatially-separable-self-attention) (SSAM) which is composed of two types of attention operations\u2014(i) locally-grouped self-attention (LSA), and (ii) global sub-sampled attention (GSA), where LSA captures the fine-grained and short-distance information and GSA deals with the long-distance and global information. 
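As a rough illustration of the sub-sampled half of this scheme, here is a hedged sketch of GSA built from stock PyTorch layers, assuming the spatial size is divisible by the summarizing window size $k$ and the embedding width is divisible by the head count (names are illustrative); the locally-grouped half (LSA) would instead restrict standard self-attention to each sub-window:

```python
import torch.nn as nn

class GlobalSubsampledAttention(nn.Module):
    """Queries attend to one representative (sub-sampled) key/value per k x k sub-window."""

    def __init__(self, dim, num_heads=8, k=7):
        super().__init__()
        self.sub = nn.Conv2d(dim, dim, kernel_size=k, stride=k)    # one summary token per window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, height, width):
        # x: (batch, N, C) with N == height * width
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, height, width)
        kv = self.sub(grid).flatten(2).transpose(1, 2)             # (batch, N / k^2, C)
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out
```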
On top of this, it utilizes [conditional position encodings](https:\/\/paperswithcode.com\/method\/conditional-positional-encoding) as well as the architectural design of the [Pyramid Vision Transformer](https:\/\/paperswithcode.com\/method\/pvt).","503":"**Twins-PCPVT** is a type of [vision transformer](https:\/\/paperswithcode.com\/methods\/category\/vision-transformer) that combines global attention, specifically the global sub-sampled attention as proposed in [Pyramid Vision Transformer](https:\/\/paperswithcode.com\/method\/pvt), with [conditional position encodings](https:\/\/paperswithcode.com\/method\/conditional-positional-encoding) (CPE) to replace the [absolute position encodings](https:\/\/paperswithcode.com\/method\/absolute-position-encodings) used in PVT.\r\n\r\nThe [position encoding generator](https:\/\/paperswithcode.com\/method\/positional-encoding-generator) (PEG), which generates the CPE, is placed after the first encoder block of each stage. The simplest form of PEG is used, i.e., a 2D [depth-wise convolution](https:\/\/paperswithcode.com\/method\/depthwise-convolution) without [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization). For image-level classification, following [CPVT](https:\/\/paperswithcode.com\/method\/cpvt), the class token is removed and [global average pooling](https:\/\/paperswithcode.com\/method\/global-average-pooling) is used at the end of the stage. For other vision tasks, the design of PVT is followed.","504":"**Meta-augmentation** helps generate more varied tasks for a single example in meta-learning. It can be distinguished from data augmentation in classic machine learning as follows. For data augmentation in classical machine learning, the aim is to generate more varied examples, within a single task. Meta-augmentation has the exact opposite aim: we wish to generate more varied tasks,\r\nfor a single example, to force the learner to quickly learn a new task from feedback. In meta-augmentation, adding randomness discourages the base learner and model from learning trivial solutions that do not generalize to new tasks.","505":"DiffPool is a differentiable graph pooling module that can generate hierarchical representations of graphs and can be combined with various graph neural network architectures in an end-to-end fashion. 
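Concretely, one coarsening step can be sketched as follows (a hedged illustration; `gnn_embed` and `gnn_pool` are hypothetical placeholders for arbitrary GNN layers mapping an adjacency matrix and node features to new node features):

```python
import torch.nn.functional as F

def diffpool_step(adj, x, gnn_embed, gnn_pool):
    """One soft-clustering coarsening step in the spirit of DiffPool."""
    z = gnn_embed(adj, x)                       # node embeddings at this level
    s = F.softmax(gnn_pool(adj, x), dim=-1)     # soft assignment of nodes to clusters
    x_coarse = s.transpose(-2, -1) @ z          # cluster features:  S^T Z
    adj_coarse = s.transpose(-2, -1) @ adj @ s  # cluster adjacency: S^T A S
    return adj_coarse, x_coarse
```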
DiffPool learns a differentiable soft cluster assignment for nodes at each layer of a deep GNN, mapping nodes to a set of clusters, which then form the coarsened input for the next GNN layer.\r\n\r\nDescription and image from: [Hierarchical Graph Representation Learning with Differentiable Pooling](https:\/\/arxiv.org\/pdf\/1806.08804.pdf)","506":"**Wasserstein Gradient Penalty Loss**, or **WGAN-GP Loss**, is a loss used for generative adversarial networks that augments the Wasserstein loss with a gradient norm penalty for random samples $\\mathbf{\\hat{x}} \\sim \\mathbb{P}\\_{\\hat{\\mathbf{x}}}$ to achieve Lipschitz continuity:\r\n\r\n$$ L = \\mathbb{E}\\_{\\mathbf{\\hat{x}} \\sim \\mathbb{P}\\_{g}}\\left[D\\left(\\tilde{\\mathbf{x}}\\right)\\right] - \\mathbb{E}\\_{\\mathbf{x} \\sim \\mathbb{P}\\_{r}}\\left[D\\left(\\mathbf{x}\\right)\\right] + \\lambda\\mathbb{E}\\_{\\mathbf{\\hat{x}} \\sim \\mathbb{P}\\_{\\hat{\\mathbf{x}}}}\\left[\\left(||\\nabla\\_{\\tilde{\\mathbf{x}}}D\\left(\\mathbf{\\tilde{x}}\\right)||\\_{2}-1\\right)^{2}\\right]$$\r\n\r\nIt was introduced as part of the [WGAN-GP](https:\/\/paperswithcode.com\/method\/wgan-gp) overall model.","507":"**Phase Shuffle** is a technique for removing pitched noise artifacts that come from using transposed convolutions in audio generation models. Phase shuffle is an operation with hyperparameter $n$. It randomly perturbs the phase of each layer\u2019s activations by \u2212$n$ to $n$ samples before input to the next layer.\r\n\r\nIn the original application in [WaveGAN](https:\/\/paperswithcode.com\/method\/wavegan), the authors only apply phase shuffle to the discriminator, as the latent vector already provides the generator a mechanism to manipulate the phase\r\nof a resultant waveform. Intuitively speaking, phase shuffle makes the discriminator\u2019s job more challenging by requiring invariance to the phase of the input waveform.","508":"**WaveGAN** is a generative adversarial network for unsupervised synthesis of raw-waveform audio (as opposed to image-like spectrograms). \r\n\r\nThe WaveGAN architecture is based off [DCGAN](https:\/\/paperswithcode.com\/method\/dcgan). The DCGAN generator uses the [transposed convolution](https:\/\/paperswithcode.com\/method\/transposed-convolution) operation to iteratively upsample low-resolution feature maps into a high-resolution image. WaveGAN modifies this transposed [convolution](https:\/\/paperswithcode.com\/method\/convolution) operation to widen its receptive field, using a longer one-dimensional filters of length 25 instead of two-dimensional filters of size 5x5, and upsampling by a factor of 4 instead of 2 at each layer. The discriminator is modified in a similar way, using length-25 filters in one dimension and increasing stride\r\nfrom 2 to 4. These changes result in WaveGAN having the same number of parameters, numerical\r\noperations, and output dimensionality as DCGAN. An additional layer is added afterwards to allow for more audio samples. Further changes include:\r\n\r\n1. Flattening 2D convolutions into 1D (e.g. 5x5 2D conv becomes length-25 1D).\r\n2. Increasing the stride factor for all convolutions (e.g. stride 2x2 becomes stride 4).\r\n3. Removing [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization) from the generator and discriminator.\r\n4. 
Training using the [WGAN](https:\/\/paperswithcode.com\/method\/wgan)-GP strategy.","509":"**PnP**, or **Poll and Pool**, is sampling module extension for [DETR](https:\/\/paperswithcode.com\/method\/detr)-type architectures that adaptively allocates its computation spatially to be more efficient. Concretely, the PnP module abstracts the image feature map into fine foreground object feature vectors and a small number of coarse background contextual feature vectors. The [transformer](https:\/\/paperswithcode.com\/method\/transformer) models information interaction within the fine-coarse feature space and translates the features into the detection result.","510":"A **Fire Module** is a building block for convolutional neural networks, notably used as part of [SqueezeNet](https:\/\/paperswithcode.com\/method\/squeezenet). A Fire module is comprised of: a squeeze [convolution](https:\/\/paperswithcode.com\/method\/convolution) layer (which has only 1x1 filters), feeding into an expand layer that has a mix of 1x1 and 3x3 convolution filters. We expose three tunable dimensions (hyperparameters) in a Fire module: $s\\_{1x1}$, $e\\_{1x1}$, and $e\\_{3x3}$. In a Fire module, $s\\_{1x1}$ is the number of filters in the squeeze layer (all 1x1), $e\\_{1x1}$ is the number of 1x1 filters in the expand layer, and $e\\_{3x3}$ is the number of 3x3 filters in the expand layer. When we use Fire modules we set $s\\_{1x1}$ to be less than ($e\\_{1x1}$ + $e\\_{3x3}$), so the squeeze layer helps to limit the number of input channels to the 3x3 filters.","511":"**SqueezeNet** is a convolutional neural network that employs design strategies to reduce the number of parameters, notably with the use of fire modules that \"squeeze\" parameters using 1x1 convolutions.","512":"**Go-Explore** is a family of algorithms aiming to tackle two challenges with effective exploration in reinforcement learning: algorithms forgetting how to reach previously visited states (\"detachment\") and from failing to first return to a state before exploring from it (\"derailment\").\r\n\r\nTo avoid detachment, Go-Explore builds an archive of the different states it has visited in the environment, thus ensuring that states cannot be forgotten. Starting with an archive beginning with the initial state, the archive is built iteratively. In Go-Explore we:\r\n\r\n(a) Probabilistically select a state from the archive, preferring states associated with promising cells. \r\n\r\n(b) Return to the selected state, such as by restoring simulator state or by running a goal-conditioned policy. \r\n\r\n(c) Explore from that state by taking random actions or sampling from a trained policy. \r\n\r\n(d) Map every state encountered during returning and exploring to a low-dimensional cell representation. \r\n\r\n(e) Add states that map to new cells to the archive and update other archive entries.","513":"**DINO** (self-distillation with no labels) is a self-supervised learning method that directly predicts the output of a teacher network - built with a momentum encoder - by using a standard cross-entropy loss. \r\n\r\nIn the example to the right, DINO is illustrated in the case of one single pair of views $\\left(x\\_{1}, x\\_{2}\\right)$ for simplicity. The model passes two different random transformations of an input image to the student and teacher networks. Both networks have\r\nthe same architecture but different parameters. The output of the teacher network is centered with a mean computed over the batch. 
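A hedged sketch of the resulting objective for a single (student, teacher) pair of views; the temperatures, the `center` buffer and the function name are illustrative choices rather than the paper's exact settings:

```python
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution and the student distribution."""
    teacher_probs = F.softmax((teacher_out - center) / t_teacher, dim=-1).detach()  # stop-gradient on teacher
    student_logp = F.log_softmax(student_out / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()
```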
Each network outputs a $K$-dimensional feature that is normalized with a temperature [softmax](https:\/\/paperswithcode.com\/method\/softmax) over the feature dimension. Their similarity is then measured with a cross-entropy loss. A stop-gradient (sg) operator is applied on the teacher to propagate gradients only through the student. The teacher parameters are updated with an exponential moving average (ema) of the student parameters.","514":"**MoCo v3** aims to stabilize training of self-supervised ViTs. MoCo v3 is an incremental improvement of MoCo v1\/2. Two crops are used for each image under random data augmentation. They are encoded by two encoders $f_q$ and $f_k$ with output vectors $q$ and $k$. $q$ behaves like a \"query\", where the goal of learning is to retrieve the corresponding \"key\". The objective is to minimize a contrastive loss function of the following form: \r\n\r\n$$\r\n\mathcal{L_q}=-\log \frac{\exp \left(q \cdot k^{+} \/ \tau\right)}{\exp \left(q \cdot k^{+} \/ \tau\right)+\sum_{k^{-}} \exp \left(q \cdot k^{-} \/ \tau\right)}\r\n$$\r\n\r\nThis approach aims to train the Transformer in the contrastive\/Siamese paradigm. The encoder $f_q$ consists of a backbone (e.g., ResNet or ViT), a projection head, and an extra prediction head. The encoder $f_k$ has the backbone and projection head but not the prediction head. $f_k$ is updated by the moving average of $f_q$, excluding the prediction head.","515":"**AMSGrad** is a stochastic optimization method that seeks to fix a convergence issue with [Adam](https:\/\/paperswithcode.com\/method\/adam)-based optimizers. AMSGrad uses the maximum of past squared gradients $v\_{t}$ rather than the exponential average to update the parameters:\r\n\r\n$$m\_{t} = \beta\_{1}m\_{t-1} + \left(1-\beta\_{1}\right)g\_{t} $$\r\n\r\n$$v\_{t} = \beta\_{2}v\_{t-1} + \left(1-\beta\_{2}\right)g\_{t}^{2}$$\r\n\r\n$$ \hat{v}\_{t} = \max\left(\hat{v}\_{t-1}, v\_{t}\right) $$\r\n\r\n$$\theta\_{t+1} = \theta\_{t} - \frac{\eta}{\sqrt{\hat{v}_{t}} + \epsilon}m\_{t}$$","516":"Karim Hammoudi, Adnane Cabani, Bouthaina Slika, Halim Benhabiles, Fadi Dornaika and Mahmoud Melkemi. SuperpixelGridCut, SuperpixelGridMean and SuperpixelGridMix Data Augmentation, arXiv:2204.08458, 2022. https:\/\/doi.org\/10.48550\/arxiv.2204.08458","517":"As CNN features are naturally spatial, channel-wise and multi-layer, Chen et al. proposed a novel spatial and channel-wise attention-based convolutional neural network (SCA-CNN). It was designed for the task of image captioning, and uses an encoder-decoder framework where a CNN first encodes an input image into a vector and then an LSTM decodes the vector into a sequence of words. Given an input feature map $X$ and the previous time step LSTM hidden state $h_{t-1} \in \mathbb{R}^d$, a spatial attention mechanism pays more attention to the semantically useful regions, guided by the LSTM hidden state $h_{t-1}$. The spatial attention model is:\r\n\r\n\begin{align}\r\na(h_{t-1}, X) &= \tanh(Conv_1^{1 \times 1}(X) \oplus W_1 h_{t-1})\r\n\end{align}\r\n\r\n\begin{align}\r\n\Phi_s(h_{t-1}, X) &= \text{Softmax}(Conv_2^{1 \times 1}(a(h_{t-1}, X))) \r\n\end{align}\r\n\r\nwhere $\oplus$ represents addition of a matrix and a vector. 
Similarly, channel-wise attention aggregates global information first, and then computes a channel-wise attention weight vector with the hidden state $h_{t-1}$:\r\n\\begin{align}\r\nb(h_{t-1}, X) &= \\tanh((W_2\\text{GAP}(X)+b_2)\\oplus W_1h_{t-1})\r\n\\end{align}\r\n\\begin{align}\r\n\\Phi_c(h_{t-1}, X) &= \\text{Softmax}(W_3(b(h_{t-1}, X))+b_3) \r\n\\end{align}\r\nOverall, the SCA mechanism can be written in one of two ways. If channel-wise attention is applied before spatial attention, we have\r\n\\begin{align}\r\nY &= f(X,\\Phi_s(h_{t-1}, X \\Phi_c(h_{t-1}, X)), \\Phi_c(h_{t-1}, X)) \r\n\\end{align}\r\nand if spatial attention comes first:\r\n\\begin{align}\r\nY &= f(X,\\Phi_s(h_{t-1}, X), \\Phi_c(h_{t-1}, X \\Phi_s(h_{t-1}, X)))\r\n\\end{align}\r\nwhere $f(\\cdot)$ denotes the modulate function which takes the feature map $X$ and attention maps as input and then outputs the modulated feature map $Y$.\r\n\r\nUnlike previous attention mechanisms which consider each image region equally and use global spatial information to tell the network where to focus, SCA-Net leverages the semantic vector to produce the spatial attention map as well as the channel-wise attention weight vector. Being more than a powerful attention model, SCA-CNN also provides a better understanding of where and what the model should focus on during sentence generation.","518":"**Hard Swish** is a type of activation function based on [Swish](https:\/\/paperswithcode.com\/method\/swish), but replaces the computationally expensive sigmoid with a piecewise linear analogue:\r\n\r\n$$\\text{h-swish}\\left(x\\right) = x\\frac{\\text{ReLU6}\\left(x+3\\right)}{6} $$","519":"**MobileNetV3** is a convolutional neural network that is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the [NetAdapt](https:\/\/paperswithcode.com\/method\/netadapt) algorithm, and then subsequently improved through novel architecture advances. Advances include (1) complementary search techniques, (2) new efficient versions of nonlinearities practical for the mobile setting, (3) new efficient network design.\r\n\r\nThe network design includes the use of a [hard swish](https:\/\/paperswithcode.com\/method\/hard-swish) activation and squeeze-and-excitation modules in the MBConv blocks.","520":"**T2T-ViT** (Tokens-To-Token Vision Transformer) is a type of [Vision Transformer](https:\/\/paperswithcode.com\/method\/vision-transformer) which incorporates 1) a layerwise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision [transformer](https:\/\/paperswithcode.com\/method\/transformer) motivated by CNN architecture design after empirical study.","521":"**MagFace** is a category of losses for face recognition that learn a universal feature embedding whose magnitude can measure the quality of a given face. Under the new loss, it can be proven that the magnitude of the feature embedding monotonically increases if the subject is more likely to be recognized. In addition, MagFace introduces an adaptive mechanism to learn a well-structured within-class feature distributions by pulling easy samples to class centers while pushing hard samples away. 
For face recognition, MagFace helps prevent model overfitting on noisy and low-quality samples by an adaptive mechanism to learn well-structured within-class feature distributions -- by pulling easy samples to class centers while pushing hard samples away.","522":"**Pix2Pix** is a conditional image-to-image translation architecture that uses a conditional [GAN](https:\/\/paperswithcode.com\/method\/gan) objective combined with a reconstruction loss. The conditional GAN objective for observed images $x$, output images $y$ and the random noise vector $z$ is:\r\n\r\n$$ \\mathcal{L}\\_{cGAN}\\left(G, D\\right) =\\mathbb{E}\\_{x,y}\\left[\\log D\\left(x, y\\right)\\right]+\r\n\\mathbb{E}\\_{x,z}\\left[log(1 \u2212 D\\left(x, G\\left(x, z\\right)\\right)\\right] $$\r\n\r\nWe augment this with a reconstruction term:\r\n\r\n$$ \\mathcal{L}\\_{L1}\\left(G\\right) = \\mathbb{E}\\_{x,y,z}\\left[||y - G\\left(x, z\\right)||\\_{1}\\right] $$\r\n\r\nand we get the final objective as:\r\n\r\n$$ G^{*} = \\arg\\min\\_{G}\\max\\_{D}\\mathcal{L}\\_{cGAN}\\left(G, D\\right) + \\lambda\\mathcal{L}\\_{L1}\\left(G\\right) $$\r\n\r\nThe architectures employed for the generator and discriminator closely follow [DCGAN](https:\/\/paperswithcode.com\/method\/dcgan), with a few modifications:\r\n\r\n- Concatenated skip connections are used to \"shuttle\" low-level information between the input and output, similar to a [U-Net](https:\/\/paperswithcode.com\/method\/u-net).\r\n- The use of a [PatchGAN](https:\/\/paperswithcode.com\/method\/patchgan) discriminator that only penalizes structure at the scale of patches.","523":"**Layer-wise Adaptive Rate Scaling**, or **LARS**, is a large batch optimization technique. There are two notable differences between LARS and other adaptive algorithms such as [Adam](https:\/\/paperswithcode.com\/method\/adam) or [RMSProp](https:\/\/paperswithcode.com\/method\/rmsprop): first, LARS uses a separate learning rate for each layer and not for each weight. And second, the magnitude of the update is controlled with respect to the weight norm for better control of training speed.\r\n\r\n$$m\\_{t} = \\beta\\_{1}m\\_{t-1} + \\left(1-\\beta\\_{1}\\right)\\left(g\\_{t} + \\lambda{x\\_{t}}\\right)$$\r\n$$x\\_{t+1}^{\\left(i\\right)} = x\\_{t}^{\\left(i\\right)} - \\eta\\_{t}\\frac{\\phi\\left(|| x\\_{t}^{\\left(i\\right)} ||\\right)}{|| m\\_{t}^{\\left(i\\right)} || }m\\_{t}^{\\left(i\\right)} $$","524":"**SwaV**, or **Swapping Assignments Between Views**, is a self-supervised learning approach that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, it simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or views) of the same image, instead of comparing features directly as in contrastive learning. Simply put, SwaV uses a swapped prediction mechanism where we predict the cluster assignment of a view from the representation of another view.","525":"A regularization criterion that, differently from [dropout](https:\/\/paperswithcode.com\/method\/dropout) and its variants, is deterministic rather than random. It grounds on the empirical evidence that feature descriptors with larger L2-norm and highly-active nodes are strongly correlated to confident class predictions. 
Thus, the criterion guides towards dropping a percentage of the most active nodes of the descriptors, proportionally to the estimated class probability","526":"**You Only Hypothesize Once** is a local descriptor-based framework for the registration of two unaligned point clouds. The proposed descriptor achieves the rotation invariance by recent technologies of group equivariant feature learning, which brings more robustness to point density and noise. The descriptor in YOHO also has a rotation-equivariant part, which enables the estimation the registration from just one correspondence hypothesis.","527":"**Hit-Detector** is a neural architectures search algorithm that simultaneously searches all components of an object detector in an end-to-end manner. It is a hierarchical approach to mine the proper subsearch space from the large volume of operation candidates. It consists of two main procedures. First, given a large search space containing all the operation candidates, we screen out the customized sub search space suitable for each part of detector with the help of group sparsity regularization. Secondly, we search the architectures for each part within the corresponding sub search space by adopting the differentiable manner.","528":"Hou et al. proposed coordinate attention,\r\na novel attention mechanism which\r\nembeds positional information into channel attention,\r\nso that the network can focus on large important regions \r\nat little computational cost.\r\n\r\nThe coordinate attention mechanism has two consecutive steps, coordinate information embedding and coordinate attention generation. First, two spatial extents of pooling kernels encode each channel horizontally and vertically. In the second step, a shared $1\\times 1$ convolutional transformation function is applied to the concatenated outputs of the two pooling layers. Then coordinate attention splits the resulting tensor into two separate tensors to yield attention vectors with the same number of channels for horizontal and vertical coordinates of the input $X$ along. This can be written as \r\n\\begin{align}\r\n z^h &= \\text{GAP}^h(X) \r\n\\end{align}\r\n\\begin{align}\r\n z^w &= \\text{GAP}^w(X)\r\n\\end{align}\r\n\\begin{align}\r\n f &= \\delta(\\text{BN}(\\text{Conv}_1^{1\\times 1}([z^h;z^w])))\r\n\\end{align}\r\n\\begin{align}\r\n f^h, f^w &= \\text{Split}(f)\r\n\\end{align}\r\n\\begin{align}\r\n s^h &= \\sigma(\\text{Conv}_h^{1\\times 1}(f^h))\r\n\\end{align}\r\n\\begin{align}\r\n s^w &= \\sigma(\\text{Conv}_w^{1\\times 1}(f^w))\r\n\\end{align}\r\n\\begin{align}\r\n Y &= X s^h s^w\r\n\\end{align}\r\nwhere $\\text{GAP}^h$ and $\\text{GAP}^w$ denote pooling functions for vertical and horizontal coordinates, and $s^h \\in \\mathbb{R}^{C\\times 1\\times W}$ and $s^w \\in \\mathbb{R}^{C\\times H\\times 1}$ represent corresponding attention weights. \r\n\r\nUsing coordinate attention, the network can accurately obtain the position of a targeted object.\r\nThis approach has a larger receptive field than BAM and CBAM.\r\nLike an SE block, it also models cross-channel relationships, effectively enhancing the expressive power of the learned features.\r\nDue to its lightweight design and flexibility, \r\nit can be easily used in classical building blocks of mobile networks.","529":"**Neural Architecture Search (NAS)** learns a modular architecture which can be transferred from a small dataset to a large dataset. 
The method does this by reducing the problem of learning best convolutional architectures to the problem of learning a small convolutional cell. The cell can then be stacked in series to handle larger images and more complex datasets.\r\n\r\nNote that this refers to the original method referred to as NAS - there is also a broader category of methods called \"neural architecture search\".","530":"**ScheduledDropPath** is a modified version of [DropPath](https:\/\/paperswithcode.com\/method\/droppath). In DropPath, each path in the cell is stochastically dropped with some fixed probability during training. In ScheduledDropPath, each path in the cell is dropped out with a probability that is linearly increased over the course of training.","531":"**Decentralized Distributed Proximal Policy Optimization (DD-PPO)** is a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever `stale'), making it conceptually simple and easy to implement. \r\n\r\nProximal Policy Optimization, or [PPO](https:\/\/paperswithcode.com\/method\/ppo), is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of [TRPO](https:\/\/paperswithcode.com\/method\/trpo), while using only first-order optimization. \r\n\r\nLet $r\\_{t}\\left(\\theta\\right)$ denote the probability ratio $r\\_{t}\\left(\\theta\\right) = \\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}$, so $r\\left(\\theta\\_{old}\\right) = 1$. TRPO maximizes a \u201csurrogate\u201d objective:\r\n\r\n$$ L^{v}\\left({\\theta}\\right) = \\hat{\\mathbb{E}}\\_{t}\\left[\\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)})\\hat{A}\\_{t}\\right] = \\hat{\\mathbb{E}}\\_{t}\\left[r\\_{t}\\left(\\theta\\right)\\hat{A}\\_{t}\\right] $$\r\n\r\nAs a general abstraction, DD-PPO implements the following:\r\nat step $k$, worker $n$ has a copy of the parameters, $\\theta^k_n$, calculates the gradient, $\\delta \\theta^k_n$, and updates $\\theta$ via \r\n\r\n$$ \\theta^{k+1}\\_n = \\text{ParamUpdate}\\Big(\\theta^{k}\\_n, \\text{AllReduce}\\big(\\delta \\theta^k\\_1, \\ldots, \\delta \\theta^k\\_N\\big)\\Big) = \\text{ParamUpdate}\\Big(\\theta^{k}\\_n, \\frac{1}{N} \\sum_{i=1}^{N} { \\delta \\theta^k_i} \\Big) $$\r\n\r\nwhere $\\text{ParamUpdate}$ is any first-order optimization technique (e.g. gradient descent) and $\\text{AllReduce}$ performs a reduction (e.g. mean) over all copies of a variable and returns the result to all workers.\r\nDistributed DataParallel scales very well (near-linear scaling up to 32,000 GPUs), and is reasonably simple to implement (all workers synchronously running identical code).","532":"**AdaMax** is a generalisation of [Adam](https:\/\/paperswithcode.com\/method\/adam) from the $l\\_{2}$ norm to the $l\\_{\\infty}$ norm. 
Define:\r\n\r\n$$ u\\_{t} = \\beta^{\\infty}\\_{2}v\\_{t-1} + \\left(1-\\beta^{\\infty}\\_{2}\\right)|g\\_{t}|^{\\infty}$$\r\n\r\n$$ = \\max\\left(\\beta\\_{2}\\cdot{v}\\_{t-1}, |g\\_{t}|\\right)$$\r\n\r\nWe can plug into the Adam update equation by replacing $\\sqrt{\\hat{v}_{t} + \\epsilon}$ with $u\\_{t}$ to obtain the AdaMax update rule:\r\n\r\n$$ \\theta\\_{t+1} = \\theta\\_{t} - \\frac{\\eta}{u\\_{t}}\\hat{m}\\_{t} $$\r\n\r\nCommon default values are $\\eta = 0.002$ and $\\beta\\_{1}=0.9$ and $\\beta\\_{2}=0.999$.","533":"**Slanted Triangular Learning Rates (STLR)** is a learning rate schedule which first linearly increases the learning rate and then linearly decays it, which can be seen in Figure to the right. It is a modification of Triangular Learning Rates, with a short increase and a long decay period.","534":"**Universal Language Model Fine-tuning**, or **ULMFiT**, is an architecture and transfer learning method that can be applied to NLP tasks. It involves a 3-layer [AWD-LSTM](https:\/\/paperswithcode.com\/method\/awd-lstm) architecture for its representations. The training consists of three steps: 1) general language model pre-training on a Wikipedia-based text, 2) fine-tuning the language model on a target task, and 3) fine-tuning the classifier on the target task.\r\n\r\nAs different layers capture different types of information, they are fine-tuned to different extents using [discriminative fine-tuning](https:\/\/paperswithcode.com\/method\/discriminative-fine-tuning). Training is performed using [Slanted triangular learning rates](https:\/\/paperswithcode.com\/method\/slanted-triangular-learning-rates) (STLR), a learning rate scheduling strategy that first linearly increases the learning rate and then linearly decays it.\r\n\r\nFine-tuning the target classifier is achieved in ULMFiT using gradual unfreezing. Rather than fine-tuning all layers at once, which risks catastrophic forgetting, ULMFiT gradually unfreezes the model starting from the last layer (i.e., closest to the output) as this contains the least general knowledge. First the last layer is unfrozen and all unfrozen layers are fine-tuned for one epoch. Then the next group of frozen layers is unfrozen and fine-tuned and repeat, until all layers are fine-tuned until convergence at the last iteration.","535":"**PQ-Transformer**, or **PointQuad-Transformer**, is a [Transformer](https:\/\/paperswithcode.com\/method\/transformer)-based architecture that predicts 3D objects and layouts simultaneously, using point cloud inputs. Unlike existing methods that either estimate layout keypoints or edges, room layouts are directly parameterized as a set of quads. Along with the quad representation, a physical constraint loss function is used that discourages object-layout interference.\r\n\r\nGiven an input 3D point cloud of $N$ points, the point cloud feature learning backbone extracts $M$ context-aware point features of $\\left(3+C\\right)$ dimensions, through sampling and grouping. A voting module and a farthest point sampling (FPS) module are used to generate $K\\_{1}$ object proposals and $K\\_{2}$ quad proposals respectively. Then the proposals are processed by a transformer decoder to further refine proposal features. Through several feedforward layers and non-maximum suppression (NMS), the proposals become the final object bounding boxes and layout quads.","536":"A **Fractal Block** is an image model block that utilizes an expansion rule that yields a structural layout of truncated fractals. 
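For reference, a minimal NumPy sketch of the AdaMax update rule given earlier; the bias-corrected $\hat{m}_t$ follows standard Adam practice, and the small `eps` guard is an implementation convenience rather than part of the stated update.

```python
import numpy as np

def adamax_step(theta, g, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax update: theta are the parameters, g the gradient,
    m the first-moment estimate, u the infinity-norm accumulator,
    and t the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g            # Adam-style first moment
    u = np.maximum(beta2 * u, np.abs(g))       # l_inf accumulator replaces v_t
    m_hat = m / (1 - beta1 ** t)               # bias correction as in Adam
    theta = theta - lr * m_hat / (u + eps)     # eps only guards the first steps
    return theta, m, u

theta, m, u = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 4):
    grad = 2 * theta                           # toy quadratic loss ||theta||^2
    theta, m, u = adamax_step(theta, grad, m, u, t)
print(theta)
```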
For the base case where $f\\_{1}\\left(z\\right) = \\text{conv}\\left(z\\right)$ is a convolutional layer, we then have recursive fractals of the form:\r\n\r\n$$ f\\_{C+1}\\left(z\\right) = \\left[\\left(f\\_{C}\\circ{f\\_{C}}\\right)\\left(z\\right)\\right] \\oplus \\left[\\text{conv}\\left(z\\right)\\right]$$\r\n\r\nWhere $C$ is the number of columns. For the join layer (green in Figure), we use the element-wise mean rather than concatenation or addition.","537":"**FractalNet** is a type of convolutional neural network that eschews [residual connections](https:\/\/paperswithcode.com\/method\/residual-connection) in favour of a \"fractal\" design. They involve repeated application of a simple expansion rule to generate deep networks whose structural layouts are precisely truncated fractals. These networks contain interacting subpaths of different lengths, but do not include any pass-through or residual connections; every internal signal is transformed by a filter and nonlinearity before being seen by subsequent layers.","538":"A **Contractive Autoencoder** is an autoencoder that adds a penalty term to the classical reconstruction cost function. This penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. This penalty term results in a localized space contraction which in turn yields robust features on the activation layer. The penalty helps to carve a representation that better captures the local directions of variation dictated by the data, corresponding to a lower-dimensional non-linear manifold, while being more invariant to the vast majority of directions orthogonal to the manifold.","539":"**MADDPG**, or **Multi-agent DDPG**, extends [DDPG](https:\/\/paperswithcode.com\/method\/ddpg) into a multi-agent policy gradient algorithm where decentralized agents learn a centralized critic based on the observations and actions of all agents. It leads to learned policies that only use local information (i.e. their own observations) at execution time, does not assume a differentiable model of the environment dynamics or any particular structure on the communication method between agents, and is applicable not only to cooperative interaction but to competitive or mixed interaction involving both physical and communicative behavior. The critic is augmented with extra information about the policies of other agents, while the actor only has access to local information. After training is completed, only the local actors are used at execution phase, acting in a decentralized manner.","540":"**Disentangled Attribution Curves (DAC)** provide interpretations of tree ensemble methods in the form of (multivariate) feature importance curves. For a given variable, or group of variables, [DAC](https:\/\/paperswithcode.com\/method\/dac) plots the importance of a variable(s) as their value changes.\r\n\r\nThe Figure to the right shows an example. The tree depicts a decision tree which performs binary classification using two features (representing the XOR function). In this problem, knowing the value of one of the features without knowledge of the other feature yields no information - the classifier still has a 50% chance of predicting either class. As a result, DAC produces curves which assign 0 importance to either feature on its own. 
Knowing both features yields perfect information about the classifier, and thus the DAC curve for both features together correctly shows that the interaction of the features produces the model\u2019s predictions.","541":"**LayoutLMv2** is an architecture and pre-training method for document understanding. The model is pre-trained with a great number of unlabeled scanned document images from the IIT-CDIP dataset, where some images in the text-image pairs are randomly replaced with another document image to make the model learn whether the image and OCR texts are correlated or not. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully understand the relative positional relationship among different text blocks.\r\n\r\nSpecifically, an enhanced Transformer architecture is used, i.e. a multi-modal Transformer asisthe backbone of LayoutLMv2. The multi-modal Transformer accepts inputs of three modalities: text, image, and layout. The input of each modality is converted to an embedding sequence and fused by the encoder. The model establishes deep interactions within and between modalities by leveraging the powerful Transformer layers.","542":"**Manifold Mixup** is a regularization method that encourages neural networks to predict less confidently on interpolations of hidden representations. It leverages semantic interpolations as an additional training signal, obtaining neural networks with smoother decision boundaries at multiple levels of representation. As a result, neural networks trained with Manifold Mixup learn class-representations with fewer directions of variance.\r\n\r\nConsider training a deep neural network $f\\left(x\\right) = f\\_{k}\\left(g\\_{k}\\left(x\\right)\\right)$, where $g\\_{k}$ denotes the part of the neural network mapping the input data to the hidden representation at layer $k$, and $f\\_{k}$ denotes the\r\npart mapping such hidden representation to the output $f\\left(x\\right)$. Training $f$ using Manifold Mixup is performed in five steps:\r\n\r\n(1) Select a random layer $k$ from a set of eligible layers $S$ in the neural network. This set may include the input layer $g\\_{0}\\left(x\\right)$.\r\n\r\n(2) Process two random data minibatches $\\left(x, y\\right)$ and $\\left(x', y'\\right)$ as usual, until reaching layer $k$. This provides us with two intermediate minibatches $\\left(g\\_{k}\\left(x\\right), y\\right)$ and $\\left(g\\_{k}\\left(x'\\right), y'\\right)$.\r\n\r\n(3) Perform Input [Mixup](https:\/\/paperswithcode.com\/method\/mixup) on these intermediate minibatches. This produces the mixed minibatch:\r\n\r\n$$\r\n\\left(\\tilde{g}\\_{k}, \\tilde{y}\\right) = \\left(\\text{Mix}\\_{\\lambda}\\left(g\\_{k}\\left(x\\right), g\\_{k}\\left(x'\\right)\\right), \\text{Mix}\\_{\\lambda}\\left(y, y'\\right\r\n)\\right),\r\n$$\r\n\r\nwhere $\\text{Mix}\\_{\\lambda}\\left(a, b\\right) = \\lambda \\cdot a + \\left(1 \u2212 \\lambda\\right) \\cdot b$. Here, $\\left(y, y'\r\n\\right)$ are one-hot labels, and the mixing coefficient\r\n$\\lambda \\sim \\text{Beta}\\left(\\alpha, \\alpha\\right)$ as in mixup. 
For instance, $\\alpha = 1.0$ is equivalent to sampling $\\lambda \\sim U\\left(0, 1\\right)$.\r\n\r\n(4) Continue the forward pass in the network from layer $k$ until the output using the mixed minibatch $\\left(\\tilde{g}\\_{k}, \\tilde{y}\\right)$.\r\n\r\n(5) This output is used to compute the loss value and\r\ngradients that update all the parameters of the neural network.","543":"**RepVGG** is a [VGG](https:\/\/paperswithcode.com\/method\/vgg)-style convolutional architecture. It has the following advantages:\r\n\r\n- The model has a VGG-like plain (a.k.a. feed-forward) topology 1 without any branches. I.e., every layer takes\r\nthe output of its only preceding layer as input and feeds the output into its only following layer.\r\n- The model\u2019s body uses only 3 \u00d7 3 conv and [ReLU](https:\/\/paperswithcode.com\/method\/relu).\r\n- The concrete architecture (including the specific depth and layer widths) is instantiated with no automatic\r\nsearch, manual refinement, compound scaling, nor other heavy designs.","544":"**Thinned U-shape Module**, or **TUM**, is a feature extraction block used for object detection models. It was introduced as part of the [M2Det](https:\/\/paperswithcode.com\/method\/m2det) architecture. Different from [FPN](https:\/\/paperswithcode.com\/method\/fpn) and [RetinaNet](https:\/\/paperswithcode.com\/method\/retinanet), TUM adopts a thinner U-shape structure as illustrated in the Figure to the right. The encoder is a series of 3x3 [convolution](https:\/\/paperswithcode.com\/method\/convolution) layers with stride 2. And the decoder takes the outputs of these layers as its reference set of feature maps, while the original FPN chooses the output of the last layer of each stage in [ResNet](https:\/\/paperswithcode.com\/method\/resnet) backbone. \r\n\r\nIn addition, with TUM, we add [1x1 convolution](https:\/\/paperswithcode.com\/method\/1x1-convolution) layers after the upsample and element-wise sum operation at the decoder branch to enhance learning ability and keep smoothness for the features. In the context of M2Det, all of the outputs in the decoder of each TUM form the multi-scale features of the current level. As a whole, the outputs of stacked TUMs form the multi-level multi-scale features, while the front TUM mainly provides shallow-level features, the middle TUM provides medium-level features, and the back TUM provides deep-level features.","545":"The **SAGAN Self-Attention Module** is a self-attention module used in the [Self-Attention GAN](https:\/\/paperswithcode.com\/method\/sagan) architecture for image synthesis. In the module, image features from the previous hidden layer $\\textbf{x} \\in \\mathbb{R}^{C\\text{x}N}$ are first transformed into two feature spaces $\\textbf{f}$, $\\textbf{g}$ to calculate the attention, where $\\textbf{f(x) = W}\\_{\\textbf{f}}{\\textbf{x}}$, $\\textbf{g}(\\textbf{x})=\\textbf{W}\\_{\\textbf{g}}\\textbf{x}$. We then calculate:\r\n\r\n$$\\beta_{j, i} = \\frac{\\exp\\left(s_{ij}\\right)}{\\sum^{N}\\_{i=1}\\exp\\left(s_{ij}\\right)} $$\r\n\r\n$$ \\text{where } s_{ij} = \\textbf{f}(\\textbf{x}\\_{i})^{T}\\textbf{g}(\\textbf{x}\\_{i}) $$\r\n\r\nand $\\beta_{j, i}$ indicates the extent to which the model attends to the $i$th location when synthesizing the $j$th region. Here, $C$ is the number of channels and $N$ is the number of feature\r\nlocations of features from the previous hidden layer. 
The output of the attention layer is $\\textbf{o} = \\left(\\textbf{o}\\_{\\textbf{1}}, \\textbf{o}\\_{\\textbf{2}}, \\ldots, \\textbf{o}\\_{\\textbf{j}} , \\ldots, \\textbf{o}\\_{\\textbf{N}}\\right) \\in \\mathbb{R}^{C\\text{x}N}$ , where,\r\n\r\n$$ \\textbf{o}\\_{\\textbf{j}} = \\textbf{v}\\left(\\sum^{N}\\_{i=1}\\beta_{j, i}\\textbf{h}\\left(\\textbf{x}\\_{\\textbf{i}}\\right)\\right) $$\r\n\r\n$$ \\textbf{h}\\left(\\textbf{x}\\_{\\textbf{i}}\\right) = \\textbf{W}\\_{\\textbf{h}}\\textbf{x}\\_{\\textbf{i}} $$\r\n\r\n$$ \\textbf{v}\\left(\\textbf{x}\\_{\\textbf{i}}\\right) = \\textbf{W}\\_{\\textbf{v}}\\textbf{x}\\_{\\textbf{i}} $$\r\n\r\nIn the above formulation, $\\textbf{W}\\_{\\textbf{g}} \\in \\mathbb{R}^{\\bar{C}\\text{x}C}$, $\\mathbf{W}\\_{f} \\in \\mathbb{R}^{\\bar{C}\\text{x}C}$, $\\textbf{W}\\_{\\textbf{h}} \\in \\mathbb{R}^{\\bar{C}\\text{x}C}$ and $\\textbf{W}\\_{\\textbf{v}} \\in \\mathbb{R}^{C\\text{x}\\bar{C}}$ are the learned weight matrices, which are implemented as $1$\u00d7$1$ convolutions. The authors choose $\\bar{C} = C\/8$.\r\n\r\nIn addition, the module further multiplies the output of the attention layer by a scale parameter and adds back the input feature map. Therefore, the final output is given by,\r\n\r\n$$\\textbf{y}\\_{\\textbf{i}} = \\gamma\\textbf{o}\\_{\\textbf{i}} + \\textbf{x}\\_{\\textbf{i}}$$\r\n\r\nwhere $\\gamma$ is a learnable scalar and it is initialized as 0. Introducing $\\gamma$ allows the network to first rely on the cues in the local neighborhood \u2013 since this is easier \u2013 and then gradually learn to assign more weight to the non-local evidence.","546":"A **Non-Local Operation** is a component for capturing long-range dependencies with deep neural networks. It is a generalization of the classical non-local mean operation in computer vision. Intuitively a non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps. The set of positions can be in space, time, or spacetime, implying that these operations are applicable for image, sequence, and video problems.\r\n\r\nFollowing the non-local mean operation, a generic non-local operation for deep neural networks is defined as:\r\n\r\n$$ \\mathbb{y}\\_{i} = \\frac{1}{\\mathcal{C}\\left(\\mathbb{x}\\right)}\\sum\\_{\\forall{j}}f\\left(\\mathbb{x}\\_{i}, \\mathbb{x}\\_{j}\\right)g\\left(\\mathbb{x}\\_{j}\\right) $$\r\n\r\nHere $i$ is the index of an output position (in space, time, or spacetime) whose response is to be computed and $j$ is the index that enumerates all possible positions. x is the input signal (image, sequence, video; often their features) and $y$ is the output signal of the same size as $x$. A pairwise function $f$ computes a scalar (representing relationship such as affinity) between $i$ and all $j$. The unary function $g$ computes a representation of the input signal at the position $j$. The\r\nresponse is normalized by a factor $C\\left(x\\right)$.\r\n\r\nThe non-local behavior is due to the fact that all positions ($\\forall{j}$) are considered in the operation. As a comparison, a convolutional operation sums up the weighted input in a local neighborhood (e.g., $i \u2212 1 \\leq j \\leq i + 1$ in a 1D case with kernel size 3), and a recurrent operation at time $i$ is often based only on the current and the latest time steps (e.g., $j = i$ or $i \u2212 1$).\r\n\r\nThe non-local operation is also different from a fully-connected (fc) layer. 
The equation above computes responses based on relationships between different locations, whereas fc uses learned weights. In other words, the relationship between $x\\_{j}$ and $x\\_{i}$ is not a function of the input data in fc, unlike in nonlocal layers. Furthermore, the formulation in the equation above supports inputs of variable sizes, and maintains the corresponding size in the output. On the contrary, an fc layer requires a fixed-size input\/output and loses positional correspondence (e.g., that from $x\\_{i}$ to $y\\_{i}$ at the position $i$).\r\n\r\nA non-local operation is a flexible building block and can be easily used together with convolutional\/recurrent layers. It can be added into the earlier part of deep neural networks, unlike fc layers that are often used in the end. This allows us to build a richer hierarchy that combines both non-local and local information.\r\n\r\nIn terms of parameterisation, we usually parameterise $g$ as a linear embedding of the form $g\\left(x\\_{j}\\right) = W\\_{g}\\mathbb{x}\\_{j}$ , where $W\\_{g}$ is a weight matrix to be learned. This is implemented as, e.g., 1\u00d71 [convolution](https:\/\/paperswithcode.com\/method\/convolution) in space or 1\u00d71\u00d71 convolution in spacetime. For $f$ we use an affinity function, a list of which can be found [here](https:\/\/paperswithcode.com\/methods\/category\/affinity-functions).","547":"The **Truncation Trick** is a latent sampling procedure for generative adversarial networks, where we sample $z$ from a truncated normal (where values which fall outside a range are resampled to fall inside that range). \r\nThe original implementation was in [Megapixel Size Image Creation with GAN](https:\/\/paperswithcode.com\/paper\/megapixel-size-image-creation-using).\r\nIn [BigGAN](http:\/\/paperswithcode.com\/method\/biggan), the authors find this provides a boost to the Inception Score and FID.","548":"The **Self-Attention Generative Adversarial Network**, or **SAGAN**, allows for attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other.","549":"The **GAN Hinge Loss** is a hinge loss based loss function for [generative adversarial networks](https:\/\/paperswithcode.com\/methods\/category\/generative-adversarial-networks):\r\n\r\n$$ L\\_{D} = -\\mathbb{E}\\_{\\left(x, y\\right)\\sim{p}\\_{data}}\\left[\\min\\left(0, -1 + D\\left(x, y\\right)\\right)\\right] -\\mathbb{E}\\_{z\\sim{p\\_{z}}, y\\sim{p\\_{data}}}\\left[\\min\\left(0, -1 - D\\left(G\\left(z\\right), y\\right)\\right)\\right] $$\r\n\r\n$$ L\\_{G} = -\\mathbb{E}\\_{z\\sim{p\\_{z}}, y\\sim{p\\_{data}}}D\\left(G\\left(z\\right), y\\right) $$","550":"The **Two Time-scale Update Rule (TTUR)** is an update rule for generative adversarial networks trained with stochastic gradient descent. TTUR has an individual learning rate for both the discriminator and the generator. The main premise is that the discriminator converges to a local minimum when the generator is fixed. If the generator changes slowly enough, then the discriminator still converges, since the generator perturbations are small. 
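The hinge objectives defined just above translate directly into two short loss functions over raw discriminator outputs; a minimal PyTorch sketch (class conditioning on $y$ is omitted for brevity):

```python
import torch

def d_hinge_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # L_D = -E[min(0, -1 + D(x))] - E[min(0, -1 - D(G(z)))]
    #     =  E[relu(1 - D(x))]    +  E[relu(1 + D(G(z)))]
    return torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # L_G = -E[D(G(z))]
    return -d_fake.mean()

# Toy discriminator logits for a batch of real and generated samples.
print(d_hinge_loss(torch.randn(8), torch.randn(8)), g_hinge_loss(torch.randn(8)))
```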
Besides ensuring convergence, the performance may also improve since the discriminator must first learn new patterns before they are transferred to the generator. In contrast, a generator which is overly fast, drives the discriminator steadily into new regions without capturing its gathered information.","551":"**Conditional Batch Normalization (CBN)** is a class-conditional variant of [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization). The key idea is to predict the $\\gamma$ and $\\beta$ of the batch normalization from an embedding - e.g. a language embedding in VQA. CBN enables the linguistic embedding to manipulate entire feature maps by scaling them up or down, negating them, or shutting them off. CBN has also been used in [GANs](https:\/\/paperswithcode.com\/methods\/category\/generative-adversarial-networks) to allow class information to affect the batch normalization parameters.\r\n\r\nConsider a single convolutional layer with batch normalization module $\\text{BN}\\left(F\\_{i,c,h,w}|\\gamma\\_{c}, \\beta\\_{c}\\right)$ for which pretrained scalars $\\gamma\\_{c}$ and $\\beta\\_{c}$ are available. We would like to directly predict these affine scaling parameters from, e.g., a language embedding $\\mathbf{e\\_{q}}$. When starting the training procedure, these parameters must be close to the pretrained values to recover the original [ResNet](https:\/\/paperswithcode.com\/method\/resnet) model as a poor initialization could significantly deteriorate performance. Unfortunately, it is difficult to initialize a network to output the pretrained $\\gamma$ and $\\beta$. For these reasons, the authors propose to predict a change $\\delta\\beta\\_{c}$ and $\\delta\\gamma\\_{c}$ on the frozen original scalars, for which it is straightforward to initialize a neural network to produce an output with zero-mean and small variance.\r\n\r\nThe authors use a one-hidden-layer MLP to predict these deltas from a question embedding $\\mathbf{e\\_{q}}$ for all feature maps within the layer:\r\n\r\n$$\\Delta\\beta = \\text{MLP}\\left(\\mathbf{e\\_{q}}\\right)$$\r\n\r\n$$\\Delta\\gamma = \\text{MLP}\\left(\\mathbf{e\\_{q}}\\right)$$\r\n\r\nSo, given a feature map with $C$ channels, these MLPs output a vector of size $C$. We then add these predictions to the $\\beta$ and $\\gamma$ parameters:\r\n\r\n$$ \\hat{\\beta}\\_{c} = \\beta\\_{c} + \\Delta\\beta\\_{c} $$\r\n\r\n$$ \\hat{\\gamma}\\_{c} = \\gamma\\_{c} + \\Delta\\gamma\\_{c} $$\r\n\r\nFinally, these updated $\\hat{\u03b2}$ and $\\hat{\\gamma}$ are used as parameters for the batch normalization: $\\text{BN}\\left(F\\_{i,c,h,w}|\\hat{\\gamma\\_{c}}, \\hat{\\beta\\_{c}}\\right)$. The authors freeze all ResNet parameters, including $\\gamma$ and $\\beta$, during training. A ResNet consists of\r\nfour stages of computation, each subdivided in several residual blocks. In each block, the authors apply CBN to the three convolutional layers.","552":"A **Non-Local Block** is an image block module used in neural networks that wraps a [non-local operation](https:\/\/paperswithcode.com\/method\/non-local-operation). We can define a non-local block as:\r\n\r\n$$ \\mathbb{z}\\_{i} = W\\_{z}\\mathbb{y\\_{i}} + \\mathbb{x}\\_{i} $$\r\n\r\nwhere $y\\_{i}$ is the output from the non-local operation and $+ \\mathbb{x}\\_{i}$ is a [residual connection](https:\/\/paperswithcode.com\/method\/residual-connection).","553":"A **Projection Discriminator** is a type of discriminator for generative adversarial networks. 
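A minimal PyTorch sketch of a 2D non-local block as just defined, $z_i = W_z y_i + x_i$; the embedded-Gaussian (softmax dot-product) affinity and the choice of half the channels for the embeddings are assumptions here:

```python
import torch
import torch.nn as nn

class NonLocalBlock2d(nn.Module):
    """Sketch of a non-local block: z_i = W_z y_i + x_i, with
    y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j)."""
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)   # embedding for x_i
        self.phi = nn.Conv2d(channels, inter, 1)     # embedding for x_j
        self.g = nn.Conv2d(channels, inter, 1)       # W_g
        self.w_z = nn.Conv2d(inter, channels, 1)     # W_z

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        theta = self.theta(x).flatten(2)             # (B, C', N)
        phi = self.phi(x).flatten(2)                 # (B, C', N)
        g = self.g(x).flatten(2)                     # (B, C', N)
        # Softmax over j plays the role of the normalisation 1/C(x).
        attn = torch.softmax(torch.einsum('bci,bcj->bij', theta, phi), dim=-1)
        y = torch.einsum('bij,bcj->bci', attn, g).reshape(b, -1, h, w)
        return self.w_z(y) + x                       # residual connection

z = NonLocalBlock2d(64)(torch.randn(2, 64, 14, 14))
print(z.shape)  # torch.Size([2, 64, 14, 14])
```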
It is motivated by a probabilistic model in which the distribution of the conditional variable $\\textbf{y}$ given $\\textbf{x}$ is discrete or uni-modal continuous distributions.\r\n\r\nIf we look at the original solution for the loss function $\\mathcal{L}\\_{D}$ in the vanilla GANs, we can decompose it into the sum of two log-likelihood ratios:\r\n\r\n$$ f^{*}\\left(\\mathbf{x}, \\mathbf{y}\\right) = \\log\\frac{q\\left(\\mathbf{x}\\mid{\\mathbf{y}}\\right)q\\left(\\mathbf{y}\\right)}{p\\left(\\mathbf{x}\\mid{\\mathbf{y}}\\right)p\\left(\\mathbf{y}\\right)} = \\log\\frac{q\\left(\\mathbf{y}\\mid{\\mathbf{x}}\\right)}{p\\left(\\mathbf{y}\\mid{\\mathbf{x}}\\right)} + \\log\\frac{q\\left(\\mathbf{x}\\right)}{p\\left(\\mathbf{x}\\right)} = r\\left(\\mathbf{y\\mid{x}}\\right) + r\\left(\\mathbf{x}\\right) $$\r\n\r\nWe can model the log likelihood ratio $r\\left(\\mathbf{y\\mid{x}}\\right)$ and $r\\left(\\mathbf{x}\\right)$ by some parametric functions $f\\_{1}$ and $f\\_{2}$ respectively. If we make a standing assumption that $p\\left(y\\mid{x}\\right)$ and $q\\left(y\\mid{x}\\right)$ are simple distributions like those that are Gaussian or discrete log linear on the feature space, then the parametrization of the following form becomes natural:\r\n\r\n$$ f\\left(\\mathbf{x}, \\mathbf{y}; \\theta\\right) = f\\_{1}\\left(\\mathbf{x}, \\mathbf{y}; \\theta\\right) + f\\_{2}\\left(\\mathbf{x}; \\theta\\right) = \\mathbf{y}^{T}V\\phi\\left(\\mathbf{x}; \\theta\\_{\\phi}\\right) + \\psi\\left(\\phi(\\mathbf{x}; \\theta\\_{\\phi}); \\theta\\_{\\psi}\\right) $$\r\n\r\nwhere $V$ is the embedding matrix of $y$, $\\phi\\left(\u00b7, \\theta\\_{\\phi}\\right)$ is a vector output function of $x$, and $\\psi\\left(\u00b7, \\theta\\_{\\psi}\\right)$ is a scalar function of the same $\\phi\\left(\\mathbf{x}; \\theta\\_{\\phi}\\right)$ that appears in $f\\_{1}$. The learned parameters $\\theta = ${$V, \\theta\\_{\\phi}, \\theta\\_{\\psi}$} are trained to optimize the adversarial loss. This model of the discriminator is the projection.","554":"**Off-Diagonal Orthogonal Regularization** is a modified form of [orthogonal regularization](https:\/\/paperswithcode.com\/method\/orthogonal-regularization) originally used in [BigGAN](https:\/\/paperswithcode.com\/method\/biggan). The original orthogonal regularization is known to be limiting so the authors explore several variants designed to relax the constraint while still imparting the desired smoothness to the models. They opt for a modification where they remove diagonal terms from the regularization, and aim to minimize the pairwise cosine similarity between filters but does not constrain their norm:\r\n\r\n$$ R\\_{\\beta}\\left(W\\right) = \\beta|| W^{T}W \\odot \\left(\\mathbf{1}-I\\right) ||^{2}\\_{F} $$\r\n\r\nwhere $\\mathbf{1}$ denotes a matrix with all elements set to 1. The authors sweep $\\beta$ values and select $10^{\u22124}$.","555":"**BigGAN** is a type of generative adversarial network that was designed for scaling generation to high-resolution, high-fidelity images. It includes a number of incremental changes and innovations. The baseline and incremental changes are:\r\n\r\n- Using [SAGAN](https:\/\/paperswithcode.com\/method\/sagan) as a baseline with spectral norm. 
for G and D, and using [TTUR](https:\/\/paperswithcode.com\/method\/ttur).\r\n- Using a Hinge Loss [GAN](https:\/\/paperswithcode.com\/method\/gan) objective\r\n- Using class-[conditional batch normalization](https:\/\/paperswithcode.com\/method\/conditional-batch-normalization) to provide class information to G (but with a linear projection, not an MLP).\r\n- Using a [projection discriminator](https:\/\/paperswithcode.com\/method\/projection-discriminator) for D to provide class information to D.\r\n- Evaluating with an EWMA of G's weights, similar to ProGANs.\r\n\r\nThe innovations are:\r\n\r\n- Increasing batch sizes, which has a big effect on the Inception Score of the model.\r\n- Increasing the width in each layer leads to a further Inception Score improvement.\r\n- Adding skip connections from the latent variable $z$ to further layers helps performance.\r\n- A new variant of [Orthogonal Regularization](https:\/\/paperswithcode.com\/method\/orthogonal-regularization).","556":"**Fast R-CNN** is an object detection model that improves on its predecessor [R-CNN](https:\/\/paperswithcode.com\/method\/r-cnn) in a number of ways. Instead of extracting CNN features independently for each region of interest, Fast R-CNN aggregates them into a single forward pass over the image; i.e. regions of interest from the same image share computation and memory in the forward and backward passes.","557":"**DynamicConv** is a type of [convolution](https:\/\/paperswithcode.com\/method\/convolution) for sequential modelling whose kernels vary over time as a learned function of the individual time steps. It builds upon [LightConv](https:\/\/paperswithcode.com\/method\/lightconv) and takes the same form but uses a time-step dependent kernel:\r\n\r\n$$ \\text{DynamicConv}\\left(X, i, c\\right) = \\text{LightConv}\\left(X, f\\left(X\\_{i}\\right)\\_{h,:}, i, c\\right) $$","558":"**Random Scaling** is a type of image data augmentation where we randomly change the scale of the image within a specified range.","559":"**PixelShuffle** is an operation used in super-resolution models to implement efficient sub-pixel convolutions with a stride of $1\/r$. Specifically it rearranges elements in a tensor of shape $(\\*, C \\times r^2, H, W)$ to a tensor of shape $(\\*, C, H \\times r, W \\times r)$.\r\n\r\nImage Source: [Remote Sensing Single-Image Resolution Improvement Using A Deep Gradient-Aware Network with Image-Specific Enhancement](https:\/\/www.researchgate.net\/figure\/The-pixel-shuffle-layer-transforms-feature-maps-from-the-LR-domain-to-the-HR-image_fig3_339531308)","560":"**SRGAN Residual Block** is a residual block used in the [SRGAN](https:\/\/paperswithcode.com\/method\/srgan) generator for image super-resolution. It is similar to standard [residual blocks](https:\/\/paperswithcode.com\/method\/residual-block), although it uses a [PReLU](https:\/\/paperswithcode.com\/method\/prelu) activation function to help training (preventing sparse gradients during [GAN](https:\/\/paperswithcode.com\/method\/gan) training).","561":"**VGG Loss** is a type of content loss introduced in the [Perceptual Losses for Real-Time Style Transfer and Super-Resolution](https:\/\/paperswithcode.com\/paper\/perceptual-losses-for-real-time-style) super-resolution and style transfer framework. It is an alternative to pixel-wise losses; VGG Loss attempts to be closer to perceptual similarity. 
The [VGG](https:\/\/paperswithcode.com\/method\/vgg) loss is based on the [ReLU](https:\/\/paperswithcode.com\/method\/relu) activation layers of the pre-trained 19 layer VGG network. With $\\phi\\_{i,j}$ we indicate the feature map obtained by the $j$-th [convolution](https:\/\/paperswithcode.com\/method\/convolution) (after activation) before the $i$-th maxpooling layer within the VGG19 network, which we consider given. We then define the VGG loss as the euclidean distance between the feature representations of a reconstructed image $G\\_{\\theta\\_{G}}\\left(I^{LR}\\right)$ and the reference image $I^{HR}$:\r\n\r\n$$ l\\_{VGG\/i.j} = \\frac{1}{W\\_{i,j}H\\_{i,j}}\\sum\\_{x=1}^{W\\_{i,j}}\\sum\\_{y=1}^{H\\_{i,j}}\\left(\\phi\\_{i,j}\\left(I^{HR}\\right)\\_{x, y} - \\phi\\_{i,j}\\left(G\\_{\\theta\\_{G}}\\left(I^{LR}\\right)\\right)\\_{x, y}\\right)^{2}$$ \r\n\r\nHere $W\\_{i,j}$ and $H\\_{i,j}$ describe the dimensions of the respective feature maps within the VGG network.","562":"**SRGAN** is a generative adversarial network for single image super-resolution. It uses a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes the solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, the authors use a content loss motivated by perceptual similarity instead of similarity in pixel space. The actual networks - depicted in the Figure to the right - consist mainly of residual blocks for feature extraction.\r\n\r\nFormally we write the perceptual loss function as a weighted sum of a ([VGG](https:\/\/paperswithcode.com\/method\/vgg)) content loss $l^{SR}\\_{X}$ and an adversarial loss component $l^{SR}\\_{Gen}$:\r\n\r\n$$ l^{SR} = l^{SR}\\_{X} + 10^{-3}l^{SR}\\_{Gen} $$","563":"A **Groupwise Point Convolution** is a type of [convolution](https:\/\/paperswithcode.com\/method\/convolution) where we apply a [point convolution](https:\/\/paperswithcode.com\/method\/pointwise-convolution) groupwise (using different set of convolution filter groups).\r\n\r\nImage Credit: [Chi-Feng Wang](https:\/\/towardsdatascience.com\/a-basic-introduction-to-separable-convolutions-b99ec3102728)","564":"**ShuffleNet V2 Downsampling Block** is a block for spatial downsampling used in the [ShuffleNet V2](https:\/\/paperswithcode.com\/method\/shufflenet-v2) architecture. Unlike the regular [ShuffleNet](https:\/\/paperswithcode.com\/method\/shufflenet) V2 block, the channel split operator is removed so the number of output channels is doubled.","565":"**ShuffleNet v2** is a convolutional neural network optimized for a direct metric (speed) rather than indirect metrics like FLOPs. It builds upon [ShuffleNet v1](https:\/\/paperswithcode.com\/method\/shufflenet), which utilised pointwise group convolutions, bottleneck-like structures, and a [channel shuffle](https:\/\/paperswithcode.com\/method\/channel-shuffle) operation. Differences are shown in the Figure to the right, including a new channel split operation and moving the channel shuffle operation further down the block.","566":"RAM adopts RNNs and reinforcement learning (RL) to make the network learn where to pay attention.","567":"**Spatial Feature Transform**, or **SFT**, is a layer that generates affine transformation parameters for spatial-wise feature modulation, and was originally proposed within the context of image super-resolution. 
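A hedged PyTorch sketch of the VGG content loss described earlier, using torchvision's VGG-19; slicing the feature extractor at index 36 (the activation $\phi_{5,4}$) and omitting ImageNet input normalisation are simplifications of the general formula:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen feature extractor up to the last ReLU before the 5th max-pool
# in torchvision's layer ordering (roughly phi_{5,4}).
vgg_features = vgg19(weights='IMAGENET1K_V1').features[:36].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def vgg_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    # Mean squared error between phi(I_HR) and phi(G(I_LR)), averaged over
    # all feature-map elements (the W_{i,j} H_{i,j} normalisation above).
    return F.mse_loss(vgg_features(sr), vgg_features(hr))

loss = vgg_loss(torch.rand(1, 3, 96, 96), torch.rand(1, 3, 96, 96))
print(loss)
```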
A Spatial Feature Transform (SFT) layer learns a mapping function $\\mathcal{M}$ that outputs a modulation parameter pair $(\\mathbf{\\gamma}, \\mathbf{\\beta})$ based on some prior condition $\\Psi$. The learned parameter pair adaptively influences the outputs by applying an affine transformation spatially to each intermediate feature maps in an SR network. During testing, only a single forward pass is needed to generate the HR image given the LR input and segmentation probability maps.\r\n\r\nMore precisely, the prior $\\Psi$ is modeled by a pair of affine transformation parameters $(\\mathbf{\\gamma}, \\mathbf{\\beta})$ through a mapping function $\\mathcal{M}: \\Psi \\mapsto(\\mathbf{\\gamma}, \\mathbf{\\beta})$. Consequently,\r\n\r\n$$\r\n\\hat{\\mathbf{y}}=G_{\\mathbf{\\theta}}(\\mathbf{x} \\mid \\mathbf{\\gamma}, \\mathbf{\\beta}), \\quad(\\mathbf{\\gamma}, \\mathbf{\\beta})=\\mathcal{M}(\\Psi)\r\n$$\r\n\r\nAfter obtaining $(\\mathbf{\\gamma}, \\mathbf{\\beta})$ from conditions, the transformation is carried out by scaling and shifting feature maps of a specific layer:\r\n\r\n$$\r\n\\operatorname{SFT}(\\mathbf{F} \\mid \\mathbf{\\gamma}, \\mathbf{\\beta})=\\mathbf{\\gamma} \\odot \\mathbf{F}+\\mathbf{\\beta}\r\n$$\r\n\r\nwhere $\\mathbf{F}$ denotes the feature maps, whose dimension is the same as $\\gamma$ and $\\mathbf{\\beta}$, and $\\odot$ is referred to element-wise multiplication, i.e., Hadamard product. Since the spatial dimensions are preserved, the SFT layer not only performs feature-wise manipulation but also spatial-wise transformation.","568":"A **Deep Boltzmann Machine (DBM)** is a three-layer generative model. It is similar to a [Deep Belief Network](https:\/\/paperswithcode.com\/method\/deep-belief-network), but instead allows bidirectional connections in the bottom layers. Its energy function is as an extension of the energy function of the RBM:\r\n\r\n$$ E\\left(v, h\\right) = -\\sum^{i}\\_{i}v\\_{i}b\\_{i} - \\sum^{N}\\_{n=1}\\sum_{k}h\\_{n,k}b\\_{n,k}-\\sum\\_{i, k}v\\_{i}w\\_{ik}h\\_{k} - \\sum^{N-1}\\_{n=1}\\sum\\_{k,l}h\\_{n,k}w\\_{n, k, l}h\\_{n+1, l}$$\r\n\r\nfor a DBM with $N$ hidden layers.\r\n\r\nSource: [On the Origin of Deep Learning](https:\/\/arxiv.org\/pdf\/1702.07800.pdf)","569":"**CenterPoint** is a two-stage 3D detector that finds centers of objects and their properties using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation and velocity. In a second-stage, it refines these estimates using additional point features on the object. CenterPoint uses a standard Lidar-based backbone network, i.e., VoxelNet or PointPillars, to build a representation of the input point-cloud. CenterPoint predicts the relative offset (velocity) of objects between consecutive frames, which are then linked up greedily -- so in Centerpoint, 3D object tracking simplifies to greedy closest-point matching.","570":"**Sharpness-Aware Minimization**, or **SAM**, is a procedure that improves model generalization by simultaneously minimizing loss value and loss sharpness. 
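A minimal PyTorch sketch of the SFT modulation above; modelling the mapping $\mathcal{M}$ with two small $1\times1$-convolution branches, as well as the hidden width, are assumptions here:

```python
import torch
import torch.nn as nn

class SFTLayer(nn.Module):
    """Sketch of a Spatial Feature Transform layer:
    SFT(F | gamma, beta) = gamma * F + beta (element-wise)."""
    def __init__(self, feat_channels: int, cond_channels: int, hidden: int = 32):
        super().__init__()
        self.gamma = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 1), nn.LeakyReLU(0.1),
            nn.Conv2d(hidden, feat_channels, 1))
        self.beta = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 1), nn.LeakyReLU(0.1),
            nn.Conv2d(hidden, feat_channels, 1))

    def forward(self, feat: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # cond encodes the prior Psi (e.g. segmentation probability maps) and
        # shares the spatial size of feat, so the modulation stays spatial-wise.
        return self.gamma(cond) * feat + self.beta(cond)

out = SFTLayer(64, 8)(torch.randn(1, 64, 32, 32), torch.randn(1, 8, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```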
SAM functions by seeking parameters that lie in neighborhoods having uniformly low loss value (rather than parameters that only themselves have low loss value).","571":"**Grid R-CNN** is an object detection framework, where the traditional regression\r\nformulation is replaced by a grid point guided localization mechanism.\r\n\r\nGrid R-CNN divides the object bounding box region into grids and employs a fully convolutional network ([FCN](https:\/\/paperswithcode.com\/method\/fcn)) to predict the locations of grid points. Owing to the position-sensitive property of a fully convolutional architecture, Grid R-CNN maintains explicit spatial information and grid point locations can be obtained at the pixel level. When a certain number of grid points at specified locations are known, the corresponding bounding box is definitely determined. Guided by the grid points, Grid R-CNN can determine a more accurate object bounding box than regression methods, which lack the guidance of explicit spatial information.","572":"1D Convolutional Neural Networks are similar to the well-known and more established 2D Convolutional Neural Networks. They are mainly used on text and other 1D signals.","573":"A **Switch FFN** is a sparse layer that operates independently on tokens within an input sequence. It is shown in the blue block in the figure. We diagram two tokens ($x\\_{1}$ = \u201cMore\u201d and $x\\_{2}$ = \u201cParameters\u201d below) being routed (solid lines) across four FFN experts, where the router independently routes each token. The switch FFN layer returns the output of the selected FFN multiplied by the router gate value (dotted-line).","574":"**Switch Transformer** is a sparsely-activated expert [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers) model that aims to simplify and improve over Mixture of Experts. Through distillation of sparse pre-trained and specialized fine-tuned models into small dense models, it reduces the model size by up to 99% while preserving 30% of the quality gains of the large sparse teacher. It also uses selective precision training that enables training with lower bfloat16 precision, as well as an initialization scheme that allows for scaling to a larger number of experts, and also increased regularization that improves sparse model fine-tuning and multi-task training.","575":"Park et al. proposed the bottleneck attention module (BAM), aiming\r\nto efficiently improve the representational capability of networks. \r\nIt uses dilated convolution to enlarge the receptive field of the spatial attention sub-module, and builds a bottleneck structure as suggested by ResNet to save computational cost.\r\n\r\nFor a given input feature map $X$, BAM infers the channel attention $s_c \\in \\mathbb{R}^C$ and spatial attention $s_s\\in \\mathbb{R}^{H\\times W}$ in two parallel streams, then sums the two attention maps after resizing both branch outputs to $\\mathbb{R}^{C\\times H \\times W}$. The channel attention branch, like an SE block, applies global average pooling to the feature map to aggregate global information, and then uses an MLP with channel dimensionality reduction. In order to utilize contextual information effectively, the spatial attention branch combines a bottleneck structure and dilated convolutions. 
Overall, BAM can be written as\r\n\\begin{align}\r\n s_c &= \\text{BN}(W_2(W_1\\text{GAP}(X)+b_1)+b_2)\r\n\\end{align}\r\n\r\n\\begin{align}\r\n s_s &= BN(Conv_2^{1 \\times 1}(DC_2^{3\\times 3}(DC_1^{3 \\times 3}(Conv_1^{1 \\times 1}(X))))) \r\n\\end{align}\r\n\\begin{align}\r\n s &= \\sigma(\\text{Expand}(s_s)+\\text{Expand}(s_c)) \r\n\\end{align}\r\n\\begin{align}\r\n Y &= s X+X\r\n\\end{align}\r\nwhere $W_i$, $b_i$ denote weights and biases of fully connected layers respectively, $Conv_{1}^{1\\times 1}$ and $Conv_{2}^{1\\times 1}$ are convolution layers used for channel reduction. $DC_i^{3\\times 3}$ denotes a dilated convolution with $3\\times 3$ kernel, applied to utilize contextual information effectively. $\\text{Expand}$ expands the attention maps $s_s$ and $s_c$ to $\\mathbb{R}^{C\\times H\\times W}$.\r\n\r\nBAM can emphasize or suppress features in both spatial and channel dimensions, as well as improving the representational power. Dimensional reduction applied to both channel and spatial attention branches enables it to be integrated with any convolutional neural network with little extra computational cost. However, although dilated convolutions enlarge the receptive field effectively, it still fails to capture long-range contextual information as well as encoding cross-domain relationships.","576":"Per the authors, Graph Isomorphism Network (GIN) generalizes the WL test and hence achieves maximum discriminative power among GNNs.","577":"**ProGAN**, or **Progressively Growing GAN**, is a generative adversarial network that utilises a progressively growing training approach. The idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses.","578":"**FairMOT** is a model for multi-object tracking which consists of two homogeneous branches to predict pixel-wise objectness scores and re-ID features. The achieved fairness between the tasks is used to achieve high levels of detection and tracking accuracy. The detection branch is implemented in an anchor-free style which estimates object centers and sizes represented as position-aware measurement maps. Similarly, the re-ID branch estimates a re-ID feature for each pixel to characterize the object centered at the pixel. Note that the two branches are completely homogeneous which essentially differs from the previous methods which perform detection and re-ID in a cascaded style. It is also worth noting that FairMOT operates on high-resolution feature maps of strides four while the previous anchor-based methods operate on feature maps of stride 32. The elimination of anchors as well as the use of high-resolution feature maps better aligns re-ID features to object centers which significantly improves the tracking accuracy.","579":"**JLA**, or **Joint Learning Architecture**, is an approach for multiple object tracking and trajectory forecasting. It jointly trains a tracking and trajectory forecasting model, and the trajectory forecasts are used for short-term motion estimates in lieu of linear motion prediction methods such as the Kalman filter. It uses a [FairMOT](https:\/\/paperswithcode.com\/method\/fairmot) model as the base model because this architecture already performs detection and tracking. A forecasting branch is added to the network and is trained end-to-end. 
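A minimal PyTorch sketch of the BAM equations above; the reduction ratio $r=16$, the dilation $d=4$, and the ReLU nonlinearities interleaved in both branches are conventional choices that the equations do not spell out:

```python
import torch
import torch.nn as nn

class BAM(nn.Module):
    """Sketch of the Bottleneck Attention Module: Y = sigma(Expand(s_s) + Expand(s_c)) * X + X."""
    def __init__(self, channels: int, r: int = 16, d: int = 4):
        super().__init__()
        mid = channels // r
        # Channel branch: GAP -> MLP (W_1, W_2) -> BN over the channel logits.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, mid), nn.ReLU(),
            nn.Linear(mid, channels), nn.BatchNorm1d(channels))
        # Spatial branch: 1x1 reduce -> two dilated 3x3 convs -> 1x1 -> BN.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.Conv2d(mid, mid, 3, padding=d, dilation=d), nn.ReLU(),
            nn.Conv2d(mid, mid, 3, padding=d, dilation=d), nn.ReLU(),
            nn.Conv2d(mid, 1, 1), nn.BatchNorm2d(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s_c = self.channel(x)[:, :, None, None]   # (B, C, 1, 1)
        s_s = self.spatial(x)                      # (B, 1, H, W)
        s = torch.sigmoid(s_c + s_s)               # broadcasting performs the Expand
        return s * x + x                           # Y = s * X + X

y = BAM(64)(torch.randn(4, 64, 32, 32))
print(y.shape)  # torch.Size([4, 64, 32, 32])
```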
[FairMOT](https:\/\/paperswithcode.com\/method\/fairmot) consist of a backbone network utilizing [Deep Layer Aggregation](https:\/\/www.paperswithcode.com\/method\/dla), an object detection head, and a reID head.","580":"LINE is a novel network embedding method which is suitable for arbitrary types of information networks: undirected, directed, and\/or weighted. The method optimizes a carefully designed objective function that preserves both the local and global network structures.\r\n\r\nSource: [Tang et al.](https:\/\/arxiv.org\/pdf\/1503.03578v1.pdf)\r\n\r\nImage source: [Tang et al.](https:\/\/arxiv.org\/pdf\/1503.03578v1.pdf)","581":"**ArcFace**, or **Additive Angular Margin Loss**, is a loss function used in face recognition tasks. The [softmax](https:\/\/paperswithcode.com\/method\/softmax) is traditionally used in these tasks. However, the softmax loss function does not explicitly optimise the feature embedding to enforce higher similarity for intraclass samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations. \r\n\r\nThe ArcFace loss transforms the logits $W^{T}\\_{j}x\\_{i} = || W\\_{j} || \\text{ } || x\\_{i} || \\cos\\theta\\_{j}$,\r\nwhere $\\theta\\_{j}$ is the angle between the weight $W\\_{j}$ and the feature $x\\_{i}$. The individual weight $ || W\\_{j} || = 1$ is fixed by $l\\_{2}$ normalization. The embedding feature $ ||x\\_{i} ||$ is fixed by $l\\_{2}$ normalization and re-scaled to $s$. The normalisation step on features and weights makes the predictions only depend on the angle between the feature and the weight. The learned embedding\r\nfeatures are thus distributed on a hypersphere with a radius of $s$. Finally, an additive angular margin penalty $m$ is added between $x\\_{i}$ and $W\\_{y\\_{i}}$ to simultaneously enhance the intra-class compactness and inter-class discrepancy. Since the proposed additive angular margin penalty is\r\nequal to the geodesic distance margin penalty in the normalised hypersphere, the method is named ArcFace:\r\n\r\n$$ L\\_{3} = -\\frac{1}{N}\\sum^{N}\\_{i=1}\\log\\frac{e^{s\\left(\\cos\\left(\\theta\\_{y\\_{i}} + m\\right)\\right)}}{e^{s\\left(\\cos\\left(\\theta\\_{y\\_{i}} + m\\right)\\right)} + \\sum^{n}\\_{j=1, j \\neq y\\_{i}}e^{s\\cos\\theta\\_{j}}} $$\r\n\r\nThe authors select face images from 8 different identities containing enough samples (around 1,500 images\/class) to train 2-D feature embedding networks with the softmax and ArcFace loss, respectively. As the Figure shows, the softmax loss provides roughly separable feature embedding\r\nbut produces noticeable ambiguity in decision boundaries, while the proposed ArcFace loss can obviously enforce a more evident gap between the nearest classes.\r\n\r\nOther alternatives to enforce intra-class compactness and inter-class distance include [Supervised Contrastive Learning](https:\/\/arxiv.org\/abs\/2004.11362).","582":"**AdaSmooth** is a stochastic optimization technique that allows for per-dimension learning rate method for [SGD](https:\/\/paperswithcode.com\/method\/sgd). It is an extension of [Adagrad](https:\/\/paperswithcode.com\/method\/adagrad) and [AdaDelta](https:\/\/paperswithcode.com\/method\/adadelta) that seek to reduce its aggressive, monotonically decreasing learning rate. 
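A minimal PyTorch sketch of the ArcFace logit transform described above; the scale $s=64$ and margin $m=0.5$ are typical values, and the clamp inside the arccosine is only a numerical-stability convenience:

```python
import torch
import torch.nn.functional as F

def arcface_logits(features: torch.Tensor, weight: torch.Tensor,
                   labels: torch.Tensor, s: float = 64.0, m: float = 0.5) -> torch.Tensor:
    """features: (B, D) embeddings, weight: (num_classes, D) class weights,
    labels: (B,) ground-truth class indices."""
    # cos(theta_j) from l2-normalised features and class weights.
    cosine = F.normalize(features) @ F.normalize(weight).t()
    theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
    # Add the angular margin m only to the target-class angle, then re-scale.
    target = F.one_hot(labels, weight.size(0)).bool()
    return s * torch.cos(torch.where(target, theta + m, theta))

labels = torch.randint(0, 10, (8,))
logits = arcface_logits(torch.randn(8, 128), torch.randn(10, 128), labels)
loss = F.cross_entropy(logits, labels)   # fed into a standard softmax cross-entropy
```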
Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to a fixed size $w$ while AdaSmooth adaptively selects the size of the window.\r\n\r\nGiven the window size $M$, the effective ratio is calculated by \r\n\r\n$$e_t = \\frac{s_t}{n_t}= \\frac{| x_t - x_{t-M}|}{\\sum_{i=0}^{M-1} | x_{t-i} - x_{t-1-i}|}\\\\\r\n= \\frac{| \\sum_{i=0}^{M-1} \\Delta x_{t-1-i}|}{\\sum_{i=0}^{M-1} | \\Delta x_{t-1-i}|}.$$\r\n\r\nGiven the effective ratio, the scaled smoothing constant is obtained by:\r\n\r\n$$c_t = ( \\rho_2- \\rho_1) \\times e_t + (1-\\rho_2),$$\r\n\r\nThe running average $E\\left[g^{2}\\right]\\_{t}$ at time step $t$ then depends only on the previous average and current gradient:\r\n\r\n$$ E\\left[g^{2}\\right]\\_{t} = c_t^2 \\odot g_{t}^2 + \\left(1-c_t^2 \\right)\\odot E[g^2]_{t-1} $$\r\n\r\nUsually $\\rho_1$ is set to around $0.5$ and $\\rho_2$ is set to around 0.99. The update step the follows:\r\n\r\n$$ \\Delta x_t = -\\frac{\\eta}{\\sqrt{E\\left[g^{2}\\right]\\_{t} + \\epsilon}} \\odot g_{t}, $$\r\n\r\nwhich is incorporated into the final update:\r\n\r\n$$x_{t+1} = x_{t} + \\Delta x_t.$$\r\n\r\nThe main advantage of AdaSmooth is its faster convergence rate and insensitivity to hyperparameters.","583":"**AdaDelta** is a stochastic optimization technique that allows for per-dimension learning rate method for [SGD](https:\/\/paperswithcode.com\/method\/sgd). It is an extension of [Adagrad](https:\/\/paperswithcode.com\/method\/adagrad) that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to a fixed size $w$.\r\n\r\nInstead of inefficiently storing $w$ previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average $E\\left[g^{2}\\right]\\_{t}$ at time step $t$ then depends only on the previous average and current gradient:\r\n\r\n$$E\\left[g^{2}\\right]\\_{t} = \\gamma{E}\\left[g^{2}\\right]\\_{t-1} + \\left(1-\\gamma\\right)g^{2}\\_{t}$$\r\n\r\nUsually $\\gamma$ is set to around $0.9$. Rewriting SGD updates in terms of the parameter update vector:\r\n\r\n$$ \\Delta\\theta_{t} = -\\eta\\cdot{g\\_{t, i}}$$\r\n$$\\theta\\_{t+1} = \\theta\\_{t} + \\Delta\\theta_{t}$$\r\n\r\nAdaDelta takes the form:\r\n\r\n$$ \\Delta\\theta_{t} = -\\frac{\\eta}{\\sqrt{E\\left[g^{2}\\right]\\_{t} + \\epsilon}}g_{t} $$\r\n\r\nThe main advantage of AdaDelta is that we do not need to set a default learning rate.","584":"**Neural Oblivious Decision Ensembles (NODE)** is a tabular data architecture that consists of differentiable\r\noblivious decision trees (ODT) that are trained end-to-end by backpropagation. \r\n\r\nThe core building block is a Neural Oblivious Decision Ensemble (NODE) layer. The layer is composed of $m$ differentiable oblivious decision trees (ODTs) of equal depth $d$. As an input, all $m$ trees get a common vector $x \\in \\mathbb{R}^{n}$, containing $n$ numeric features. Below we describe a design of a single differentiable ODT.\r\n\r\nIn its essence, an ODT is a decision table that splits the data along $d$ splitting features and compares each feature to a learned threshold. Then, the tree returns one of the $2^{d}$ possible responses, corresponding to the comparisons result. 
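As a side note to the AdaSmooth entry above, here is a compact NumPy sketch of its element-wise update; the window size, the rho values, the learning rate, and the epsilon are assumptions.

```python
import numpy as np

def adasmooth_step(x, grad, deltas, E_g2, M=10, rho1=0.5, rho2=0.99, lr=1.0, eps=1e-6):
    """deltas: list of the last parameter updates; E_g2: running average of squared gradients."""
    window = deltas[-M:]
    if window:
        # effective ratio e_t = |sum of updates| / sum of |updates| over the window
        e = np.abs(sum(window)) / (np.sum([np.abs(d) for d in window], axis=0) + eps)
    else:
        e = np.zeros_like(x)
    c = (rho2 - rho1) * e + (1.0 - rho2)          # scaled smoothing constant c_t
    E_g2 = c**2 * grad**2 + (1.0 - c**2) * E_g2   # running average E[g^2]_t
    delta = -lr / np.sqrt(E_g2 + eps) * grad      # update step
    deltas.append(delta)
    return x + delta, deltas, E_g2
```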
Therefore, each ODT is completely determined by its splitting features $f \\in \\mathbb{R}^{d}$, splitting thresholds $b \\in \\mathbb{R}^{d}$ and a $d$-dimensional tensor of responses $R \\in \\mathbb{R}^{\\underbrace{2 \\times 2 \\times \\cdots \\times 2}_{d}}$. In this notation, the tree output is defined as:\r\n\r\n$$\r\nh(x)=R\\left[\\mathbb{1}\\left(f\\_{1}(x)-b_{1}\\right), \\ldots, \\mathbb{1}\\left(f\\_{d}(x)-b\\_{d}\\right)\\right]\r\n$$\r\nwhere $\\mathbb{1}(\\cdot)$ denotes the Heaviside function.","585":"**CrossViT** is a type of [vision transformer](https:\/\/paperswithcode.com\/method\/vision-transformer) that uses a dual-branch architecture to extract multi-scale feature representations for image classification. The architecture combines image patches (i.e. tokens in a [transformer](https:\/\/paperswithcode.com\/method\/transformer)) of different sizes to produce stronger visual features for image classification. It processes small and large patch tokens with two separate branches of different computational complexities and these tokens are fused together multiple times to complement each other.\r\n\r\nFusion is achieved by an efficient [cross-attention module](https:\/\/paperswithcode.com\/method\/cross-attention-module), in which each transformer branch creates a non-patch token as an agent to exchange information with the other branch by attention. This allows for linear-time generation of the attention map in fusion instead of quadratic time otherwise.","586":"FIERCE is an entropic regularization on the **feature** space.","587":"**Conditional Relation Network**, or **CRN**, is a building block to construct more sophisticated structures for representation and reasoning over video. CRN takes as input an array of tensorial objects and a conditioning feature, and computes an array of encoded output objects. Model building becomes a simple exercise of replication, rearrangement and stacking of these reusable units for diverse modalities and contextual information. This design thus supports high-order relational and multi-step reasoning.","588":"A **Dynamic Memory Network** is a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers. Questions trigger an iterative attention process which allows the model to condition its attention on the inputs and the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers. \r\n\r\nThe DMN consists of a number of modules:\r\n\r\n- Input Module: The input module encodes raw text inputs from the task into distributed vector representations. The input takes forms like a sentence, a long story, a movie review and so on.\r\n- Question Module: The question module encodes the question of the task into a distributed\r\nvector representation. For question answering, the question may be a sentence such as \"Where did the author first fly?\". The representation is fed into the episodic memory module, and forms the basis, or initial state, upon which the episodic memory module iterates.\r\n- Episodic Memory Module: Given a collection of input representations, the episodic memory module chooses which parts of the inputs to focus on through the attention mechanism. It then produces a \u201cmemory\u201d vector representation taking into account the question as well as the previous memory. Each iteration provides the module with newly relevant information about the input.
In other words,\r\nthe module has the ability to retrieve new information, in the form of input representations, which were thought to be irrelevant in previous iterations.\r\n- Answer Module: The answer module generates an answer from the final memory vector of the memory module.","589":"**Nesterov Accelerated Gradient** is a momentum-based [SGD](https:\/\/paperswithcode.com\/method\/sgd) optimizer that \"looks ahead\" to where the parameters will be to calculate the gradient **ex post** rather than **ex ante**:\r\n\r\n$$ v\\_{t} = \\gamma{v}\\_{t-1} + \\eta\\nabla\\_{\\theta}J\\left(\\theta-\\gamma{v\\_{t-1}}\\right) $$\r\n$$\\theta\\_{t} = \\theta\\_{t-1} + v\\_{t}$$\r\n\r\nLike SGD with momentum $\\gamma$ is usually set to $0.9$.\r\n\r\nThe intuition is that the [standard momentum](https:\/\/paperswithcode.com\/method\/sgd-with-momentum) method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient. In contrast Nesterov momentum first makes a big jump in the direction of the previous accumulated gradient and then measures the gradient where it ends up and makes a correction. The idea being that it is better to correct a mistake after you have made it. \r\n\r\nImage Source: [Geoff Hinton lecture notes](http:\/\/www.cs.toronto.edu\/~tijmen\/csc321\/slides\/lecture_slides_lec6.pdf)","590":"**Relative Position Encodings** are a type of position embeddings for [Transformer-based models](https:\/\/paperswithcode.com\/methods\/category\/transformers) that attempts to exploit pairwise, relative positional information. Relative positional information is supplied to the model on two levels: values and keys. This becomes apparent in the two modified self-attention equations shown below. First, relative positional information is supplied to the model as an additional component to the keys\r\n\r\n$$ e\\_{ij} = \\frac{x\\_{i}W^{Q}\\left(x\\_{j}W^{K} + a^{K}\\_{ij}\\right)^{T}}{\\sqrt{d\\_{z}}} $$\r\n\r\nHere $a$ is an edge representation for the inputs $x\\_{i}$ and $x\\_{j}$. The [softmax](https:\/\/paperswithcode.com\/method\/softmax) operation remains unchanged from vanilla self-attention. Then relative positional information is supplied again as a sub-component of the values matrix:\r\n\r\n$$ z\\_{i} = \\sum^{n}\\_{j=1}\\alpha\\_{ij}\\left(x\\_{j}W^{V} + a\\_{ij}^{V}\\right)$$\r\n\r\nIn other words, instead of simply combining semantic embeddings with absolute positional ones, relative positional information is added to keys and values on the fly during attention calculation.\r\n\r\nSource: [Jake Tae](https:\/\/jaketae.github.io\/study\/relative-positional-encoding\/)\r\n\r\nImage Source: [Relative Positional Encoding for Transformers with Linear Complexity](https:\/\/www.youtube.com\/watch?v=qajudaEHuq8","591":"**Global-Local Attention** is a type of attention mechanism used in the [ETC](https:\/\/paperswithcode.com\/method\/etc) architecture. ETC receives two separate input sequences: the global input $x^{g} = (x^{g}\\_{1}, \\dots, x^{g}\\_{n\\_{g}})$ and the long input $x^{l} = (x^{l}\\_{1}, \\dots x^{l}\\_{n\\_{l}})$. Typically, the long input contains the input a [standard Transformer](https:\/\/paperswithcode.com\/method\/transformer) would receive, while the global input contains a much smaller number of auxiliary tokens ($n\\_{g} \\ll n\\_{l}$). Attention is then split into four separate pieces: global-to-global (g2g), global-tolong (g2l), long-to-global (l2g), and long-to-long (l2l). 
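As a brief aside to the relative position encodings entry above, here is a small NumPy sketch of self-attention with learned relative-position edge embeddings added to keys and values; the shapes, the clipping distance `k`, and the softmax helper are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_self_attention(X, Wq, Wk, Wv, aK, aV, k=4):
    """aK, aV: edge embeddings of shape (2k + 1, d_z), indexed by clipped relative distance."""
    n = X.shape[0]
    d_z = Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # relative distance j - i, clipped to [-k, k] and shifted to index aK / aV
    idx = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -k, k) + k
    e = np.einsum('id,ijd->ij', Q, K[None, :, :] + aK[idx]) / np.sqrt(d_z)   # keys + a^K
    alpha = softmax(e, axis=-1)
    return np.einsum('ij,ijd->id', alpha, V[None, :, :] + aV[idx])           # values + a^V
```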
Attention in the l2l piece (the most computationally expensive piece) is restricted to a fixed radius $r \\ll n\\_{l}$. To compensate for this limited attention span, the tokens in the global input have unrestricted attention, and thus long input tokens can transfer information to each other through global input tokens. Accordingly, g2g, g2l, and l2g pieces of attention are unrestricted.","592":"**Extended Transformer Construction**, or **ETC**, is an extension of the [Transformer](https:\/\/paperswithcode.com\/method\/transformer) architecture with a new attention mechanism that extends the original in two main ways: (1) it allows scaling up the input length from 512 to several thousands; and (2) it can ingesting structured inputs instead of just linear sequences. The key ideas that enable ETC to achieve these are a new [global-local attention mechanism](https:\/\/paperswithcode.com\/method\/global-local-attention), coupled with [relative position encodings](https:\/\/paperswithcode.com\/method\/relative-position-encodings). ETC also allows lifting weights from existing [BERT](https:\/\/paperswithcode.com\/method\/bert) models, saving computational resources while training.","593":"**Inception-ResNet-v2-A** is an image model block for a 35 x 35 grid used in the [Inception-ResNet-v2](https:\/\/paperswithcode.com\/method\/inception-resnet-v2) architecture.","594":"**Inception-ResNet-v2 Reduction-B** is an image model block used in the [Inception-ResNet-v2](https:\/\/paperswithcode.com\/method\/inception-resnet-v2) architecture.","595":"**Inception-ResNet-v2-B** is an image model block for a 17 x 17 grid used in the [Inception-ResNet-v2](https:\/\/paperswithcode.com\/method\/inception-resnet-v2) architecture. It largely follows the idea of Inception modules - and grouped convolutions - but also includes residual connections.","596":"**Inception-ResNet-v2-C** is an image model block for an 8 x 8 grid used in the [Inception-ResNet-v2](https:\/\/paperswithcode.com\/method\/inception-resnet-v2) architecture. It largely follows the idea of Inception modules - and grouped convolutions - but also includes residual connections.","597":"**Inception-ResNet-v2** is a convolutional neural architecture that builds on the Inception family of architectures but incorporates [residual connections](https:\/\/paperswithcode.com\/method\/residual-connection) (replacing the filter concatenation stage of the Inception architecture).","598":"A **Deep Belief Network (DBN)** is a multi-layer generative graphical model. DBNs have bi-directional connections ([RBM](https:\/\/paperswithcode.com\/method\/restricted-boltzmann-machine)-type connections) on the top layer while the bottom layers only have top-down connections. They are trained using layerwise pre-training. Pre-training occurs by training the network component by component bottom up: treating the first two layers as an RBM and training, then treating the second layer and third layer as another RBM and training for those parameters.\r\n\r\nSource: [Origins of Deep Learning](https:\/\/arxiv.org\/pdf\/1702.07800.pdf)\r\n\r\nImage Source: [Wikipedia](https:\/\/en.wikipedia.org\/wiki\/Deep_belief_network)","599":"**Neural Network Compression Framework**, or **NNCF**, is a Python-based framework for neural network compression with fine-tuning. It leverages recent advances of various network compression methods and implements some of them, namely quantization, sparsity, filter pruning and binarization. 
These methods allow producing more hardware-friendly models that can be efficiently run on general-purpose hardware computation units (CPU, GPU) or specialized deep learning accelerators.","600":"UNet++ is an architecture for semantic segmentation based on the [U-Net](https:\/\/paperswithcode.com\/method\/u-net). Through the use of densely connected nested decoder sub-networks, it enhances extracted feature processing and was reported by its authors to outperform the U-Net in [Electron Microscopy (EM)](https:\/\/imagej.net\/events\/isbi-2012-segmentation-challenge), [Cell](https:\/\/acsjournals.onlinelibrary.wiley.com\/doi\/full\/10.1002\/cncy.21576), [Nuclei](https:\/\/www.kaggle.com\/c\/data-science-bowl-2018), [Brain Tumor](https:\/\/paperswithcode.com\/dataset\/brats-2013-1), [Liver](https:\/\/paperswithcode.com\/dataset\/lits17) and [Lung Nodule](https:\/\/paperswithcode.com\/dataset\/lidc-idri) medical image segmentation tasks.","601":"**SegNet** is a semantic segmentation model. This core trainable segmentation architecture consists of an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the\r\nVGG16 network. The role of the decoder network is to map the low-resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower-resolution input feature maps. Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to\r\nperform non-linear upsampling.","602":"This method introduces several regularization schemes that can be applied to an Autoencoder. To make the model generative, *ex-post* density estimation is proposed, which consists of fitting a mixture of Gaussians to the training data embeddings after the model is trained.","603":"**VirTex**, or **Visual representations from Textual annotations**, is a pretraining approach using semantically dense captions to learn visual representations. First, a ConvNet and [Transformer](https:\/\/paperswithcode.com\/method\/transformer) are jointly trained from scratch to generate natural language captions for images. Then, the learned features are transferred to downstream visual recognition tasks.","604":"**Gumbel-Softmax** is a continuous distribution that has the property that it can be smoothly annealed into a categorical distribution, and whose parameter gradients can be easily computed via the reparameterization trick.","605":"**Beta-VAE** is a type of variational autoencoder that seeks to discover disentangled latent factors. It modifies [VAEs](https:\/\/paperswithcode.com\/method\/vae) with an adjustable hyperparameter $\\beta$ that balances latent channel capacity and independence constraints with reconstruction accuracy. The idea is to maximize the probability of generating the real data while keeping the distance between the real and estimated distributions small, under a threshold $\\epsilon$.
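A minimal NumPy sketch of sampling from the Gumbel-Softmax distribution described above; the temperature value is an assumption, and lower temperatures give samples closer to one-hot.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature=0.5, rng=None):
    rng = rng or np.random.default_rng()
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    y = (logits + g) / temperature
    y = y - y.max()                            # numerical stability
    return np.exp(y) / np.exp(y).sum()         # relaxed (differentiable) one-hot sample

# Example: a relaxed sample over four categories.
sample = gumbel_softmax_sample(np.log(np.array([0.1, 0.2, 0.3, 0.4])))
```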
We can use the Karush-Kuhn-Tucker (KKT) conditions to write this as a single equation:\r\n\r\n$$ \\mathcal{F}\\left(\\theta, \\phi, \\beta; \\mathbf{x}, \\mathbf{z}\\right) = \\mathbb{E}\\_{q\\_{\\phi}\\left(\\mathbf{z}|\\mathbf{x}\\right)}\\left[\\log{p}\\_{\\theta}\\left(\\mathbf{x}\\mid\\mathbf{z}\\right)\\right] - \\beta\\left[D\\_{KL}\\left(q\\_{\\phi}\\left(\\mathbf{z}\\mid\\mathbf{x}\\right)||p\\left(\\mathbf{z}\\right)\\right) - \\epsilon\\right]$$\r\n\r\nwhere the KKT multiplier $\\beta$ is the regularization coefficient that constrains the capacity of the latent channel $\\mathbf{z}$ and puts implicit independence pressure on the learnt posterior due to the isotropic nature of the Gaussian prior $p\\left(\\mathbf{z}\\right)$.\r\n\r\nWe write this again using the complementary slackness assumption to get the Beta-VAE formulation:\r\n\r\n$$ \\mathcal{F}\\left(\\theta, \\phi, \\beta; \\mathbf{x}, \\mathbf{z}\\right) \\geq \\mathcal{L}\\left(\\theta, \\phi, \\beta; \\mathbf{x}, \\mathbf{z}\\right) = \\mathbb{E}\\_{q\\_{\\phi}\\left(\\mathbf{z}|\\mathbf{x}\\right)}\\left[\\log{p}\\_{\\theta}\\left(\\mathbf{x}\\mid\\mathbf{z}\\right)\\right] - \\beta{D}\\_{KL}\\left(q\\_{\\phi}\\left(\\mathbf{z}\\mid\\mathbf{x}\\right)||p\\left(\\mathbf{z}\\right)\\right)$$","606":"**Channel-wise Soft Attention** is an attention mechanism in computer vision that assigns \"soft\" attention weights for each channel $c$. In soft channel-wise attention, the alignment weights are learned and placed \"softly\" over each channel. This contrasts with hard attention, which selects only one channel to attend to at a time.\r\n\r\nImage: [Xu et al](http:\/\/proceedings.mlr.press\/v37\/xuc15.pdf)","607":"A **Selective Kernel Convolution** is a [convolution](https:\/\/paperswithcode.com\/method\/convolution) that enables neurons to adaptively adjust their receptive field (RF) sizes among multiple kernels with different kernel sizes. Specifically, the SK convolution has three operators \u2013 Split, Fuse and Select. Multiple branches with different kernel sizes are fused using\r\n[softmax](https:\/\/paperswithcode.com\/method\/softmax) attention that is guided by the information in these branches. Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer.","608":"A **Selective Kernel** unit is a bottleneck block consisting of a sequence of 1\u00d71 [convolution](https:\/\/paperswithcode.com\/method\/convolution), SK convolution and 1\u00d71 convolution. It was proposed as part of the [SKNet](https:\/\/paperswithcode.com\/method\/sknet) CNN architecture. In general, all the large kernel convolutions in the original bottleneck blocks in [ResNeXt](https:\/\/paperswithcode.com\/method\/resnext) are replaced by the proposed SK convolutions, enabling the network to choose appropriate receptive field sizes in an adaptive manner. \r\n\r\nIn SK units, there are three important hyper-parameters which determine the final settings of SK convolutions: the number of paths $M$ that determines the number of choices of different kernels to be aggregated, the group number $G$ that controls the cardinality of each path, and the reduction ratio $r$ that controls the number of parameters in the fuse operator.
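Returning to the Beta-VAE lower bound above, a short PyTorch-style sketch of the training objective is given below, assuming a diagonal Gaussian posterior, a unit Gaussian prior, and a Bernoulli decoder; the value of `beta` is an assumption.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x_recon_logits, x, mu, logvar, beta=4.0):
    # Reconstruction term: negative E_q[log p(x|z)] for a Bernoulli decoder
    recon = F.binary_cross_entropy_with_logits(x_recon_logits, x, reduction='sum')
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl   # minimizing this maximizes the lower bound L
```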
One typical setting of SK convolutions $\\text{SK}\\left[M, G, r\\right]$ is $\\text{SK}\\left[2, 32, 16\\right]$.","609":"**Self-Cure Network**, or **SCN**, is a method for suppressing uncertainties in large-scale facial expression recognition, preventing deep networks from overfitting uncertain facial images. Specifically, SCN suppresses the uncertainty from two different aspects: 1) a self-attention mechanism over each mini-batch to weight each training sample with a ranking regularization, and 2) a careful relabeling mechanism to modify the labels of these samples in the lowest-ranked group.","610":"**Minimum Description Length** provides a criterion for the selection of models, regardless of their complexity, without the restrictive assumption that the data form a sample from a 'true' distribution.\r\n\r\nExtracted from [scholarpedia](http:\/\/scholarpedia.org\/article\/Minimum_description_length)\r\n\r\n**Source**:\r\n\r\nPaper: [J. Rissanen (1978) Modeling by the shortest data description. Automatica 14, 465-471](https:\/\/doi.org\/10.1016\/0005-1098(78)90005-5)\r\n\r\nBook: [P. D. Gr\u00fcnwald (2007) The Minimum Description Length Principle, MIT Press, June 2007, 570 pages](https:\/\/ieeexplore.ieee.org\/servlet\/opac?bknumber=6267274)","611":"A **CNN BiLSTM** is a hybrid bidirectional [LSTM](https:\/\/paperswithcode.com\/method\/lstm) and CNN architecture. In the original formulation applied to named entity recognition, it learns both character-level and word-level features. The CNN component is used to induce the character-level features. For each word, the model employs a [convolution](https:\/\/paperswithcode.com\/method\/convolution) and a [max pooling](https:\/\/paperswithcode.com\/method\/max-pooling) layer to extract a new feature vector from the per-character feature vectors such as character embeddings and (optionally) character type.","612":"**Atrous Spatial Pyramid Pooling (ASPP)** is a semantic segmentation module for resampling a given feature layer at multiple rates prior to [convolution](https:\/\/paperswithcode.com\/method\/convolution). This amounts to probing the original image with multiple filters that have complementary effective fields of view, thus capturing objects as well as useful image context at multiple scales. Rather than actually resampling features, the mapping is implemented using multiple parallel atrous convolutional layers with different sampling rates.","613":"**DeepLabv3** is a semantic segmentation architecture that improves upon [DeepLabv2](https:\/\/paperswithcode.com\/method\/deeplabv2) with several modifications. To handle the problem of segmenting objects at multiple scales, modules are designed which employ atrous [convolution](https:\/\/paperswithcode.com\/method\/convolution) in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, the Atrous [Spatial Pyramid Pooling](https:\/\/paperswithcode.com\/method\/spatial-pyramid-pooling) module from DeepLabv2 is augmented with image-level features that encode global context, further boosting performance. \r\n\r\nThe changes to the ASPP module are that the authors apply [global average pooling](https:\/\/paperswithcode.com\/method\/global-average-pooling) on the last feature map of the model, feed the resulting image-level features to a 1 \u00d7 1 convolution with 256 filters (and [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization)), and then bilinearly upsample the feature to the desired spatial dimension.
In the\r\nend, the improved [ASPP](https:\/\/paperswithcode.com\/method\/aspp) consists of (a) one 1\u00d71 convolution and three 3 \u00d7 3 convolutions with rates = (6, 12, 18) when output stride = 16 (all with 256 filters and batch normalization), and (b) the image-level features.\r\n\r\nAnother interesting difference is that DenseCRF post-processing from DeepLabv2 is no longer needed.","614":"The Robust Loss is a generalization of the Cauchy\/Lorentzian, Geman-McClure, Welsch\/Leclerc, generalized Charbonnier, Charbonnier\/pseudo-Huber\/L1-L2, and L2 loss functions. By introducing robustness as a continuous parameter, the loss function allows algorithms built around robust loss minimization to be generalized, which improves performance on basic vision tasks such as registration and clustering. Interpreting the loss as the negative log of a univariate density yields a general probability distribution that includes normal and Cauchy distributions as special cases. This probabilistic interpretation enables the training of neural networks in which the robustness of the loss automatically adapts itself during training, which improves performance on learning-based tasks such as generative image synthesis and unsupervised monocular depth estimation, without requiring any manual parameter tuning.","615":"CharacterBERT is a variant of [BERT](https:\/\/paperswithcode.com\/method\/bert) that **drops the wordpiece system** and **replaces it with a CharacterCNN module** just like the one [ELMo](https:\/\/paperswithcode.com\/method\/elmo) uses to produce its first layer representation. This allows CharacterBERT to represent any input token without splitting it into wordpieces. Moreover, this frees BERT from the burden of a domain-specific wordpiece vocabulary which may not be suited to your domain of interest (e.g. medical domain). Finally, it allows the model to be more robust to noisy inputs.","616":"**Stable Rank Normalization (SRN)** is a weight-normalization scheme which minimizes the\r\nstable rank of a linear operator. It simultaneously controls the Lipschitz constant and the stable rank of a linear operator. Stable rank is a softer version of the rank operator and is defined as the squared ratio of the Frobenius norm to the spectral norm.","617":"VERtex Similarity Embeddings (VERSE) is a simple, versatile, and memory-efficient method that derives graph embeddings explicitly calibrated to preserve the distributions of a selected vertex-to-vertex similarity measure. VERSE learns such embeddings by training a single-layer neural network.\r\n\r\nSource: [Tsitsulin et al.](https:\/\/arxiv.org\/pdf\/1803.04742v1.pdf)\r\n\r\nImage source: [Tsitsulin et al.](https:\/\/arxiv.org\/pdf\/1803.04742v1.pdf)","618":"**SNIPER** is a multi-scale training approach for instance-level recognition tasks like object detection and instance-level segmentation. Instead of processing all pixels in an image pyramid, SNIPER selectively processes context regions around the ground-truth objects (a.k.a chips). This can help to speed up multi-scale training as it operates on low-resolution chips. 
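A condensed PyTorch-style sketch of the improved ASPP head described above for DeepLabv3; the final 1x1 projection after concatenation and the module names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPPSketch(nn.Module):
    """One 1x1 conv, three 3x3 atrous convs (rates 6, 12, 18), plus image pooling; 256 filters + BN."""
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        def branch(k, d):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=(k // 2) * d, dilation=d, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList([branch(1, 1)] + [branch(3, d) for d in rates])
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)  # assumption

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        img = F.interpolate(self.image_pool(x), size=(h, w), mode='bilinear',
                            align_corners=False)   # bilinear upsampling of image-level features
        return self.project(torch.cat(feats + [img], dim=1))
```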
Due to its memory-efficient design, SNIPER can benefit from [Batch Normalization](https:\/\/paperswithcode.com\/method\/batch-normalization) during training and it makes larger batch-sizes possible for instance-level recognition tasks on a single GPU.","619":"Please enter a description about the method here","620":"Random Ensemble Mixture (REM) is an easy to implement extension of [DQN](https:\/\/paperswithcode.com\/method\/dqn) inspired by [Dropout](https:\/\/paperswithcode.com\/method\/dropout). The key intuition behind REM is that if one has access to multiple estimates of Q-values, then a weighted combination of the Q-value estimates is also an estimate for Q-values. Accordingly, in each training step, REM randomly combines multiple Q-value estimates and uses this random combination for robust training.","621":"**ResNet-RS** is a family of [ResNet](https:\/\/paperswithcode.com\/method\/resnet) architectures that are 1.7x faster than [EfficientNets](https:\/\/paperswithcode.com\/method\/efficientnet) on TPUs, while achieving similar accuracies on ImageNet. The authors propose two new scaling strategies: (1) scale model depth in regimes where overfitting can occur (width scaling is preferable otherwise); (2) increase image resolution more slowly than previously recommended.\r\n\r\nAdditional improvements include the use of a [cosine learning rate schedule](https:\/\/paperswithcode.com\/method\/cosine-annealing), [label smoothing](https:\/\/paperswithcode.com\/method\/label-smoothing), [stochastic depth](https:\/\/paperswithcode.com\/method\/stochastic-depth), [RandAugment](https:\/\/paperswithcode.com\/method\/randaugment), decreased [weight decay](https:\/\/paperswithcode.com\/method\/weight-decay), [squeeze-and-excitation](https:\/\/paperswithcode.com\/method\/squeeze-and-excitation-block) and the use of the [ResNet-D](https:\/\/paperswithcode.com\/method\/resnet-d) architecture.","622":"**ResNet-D** is a modification on the [ResNet](https:\/\/paperswithcode.com\/method\/resnet) architecture that utilises an [average pooling](https:\/\/paperswithcode.com\/method\/average-pooling) tweak for downsampling. The motivation is that in the unmodified ResNet, the 1 \u00d7 1 [convolution](https:\/\/paperswithcode.com\/method\/convolution) for the downsampling block ignores 3\/4 of input feature maps, so this is modified so no information will be ignored","623":"**3D ResNet-RS** is an architecture and scaling strategy for 3D ResNets for video recognition. The key additions are:\r\n\r\n- **3D ResNet-D stem**: The [ResNet-D](https:\/\/paperswithcode.com\/method\/resnet-d) stem is adapted to 3D inputs by using three consecutive [3D convolutional layers](https:\/\/paperswithcode.com\/method\/3d-convolution). The first convolutional layer employs a temporal kernel size of 5 while the remaining two convolutional layers employ a temporal kernel size of 1.\r\n\r\n- **3D Squeeze-and-Excitation**: [Squeeze-and-Excite](https:\/\/paperswithcode.com\/method\/squeeze-and-excitation-block) is adapted to spatio-temporal inputs by using a 3D [global average pooling](https:\/\/paperswithcode.com\/method\/global-average-pooling) operation for the squeeze operation. 
An SE ratio of 0.25 is applied in each 3D bottleneck block for all experiments.\r\n\r\n- **Self-gating**: A self-gating module is used in each 3D bottleneck block after the SE module.","624":"**Shifted Softplus** is an activation function ${\\rm ssp}(x) = \\ln( 0.5 e^{x} + 0.5 )$, which [SchNet](https:\/\/paperswithcode.com\/method\/schnet) employs as non-linearity throughout the network in order to obtain a smooth potential energy surface. The shifting ensures that ${\\rm ssp}(0) = 0$ and improves the convergence of the network. This activation function shows similarity to ELUs, while having infinite order of continuity.","625":"**SchNet** is an end-to-end deep neural network architecture based on continuous-filter convolutions. It follows the deep tensor neural network framework, i.e. atom-wise representations are constructed by starting from embedding vectors that characterize the atom type before introducing the configuration of the system by a series of interaction blocks.","626":"FastGCN is a fast improvement of the GCN model proposed by Kipf & Welling (2016a) for learning graph embeddings. It generalizes transductive training to an inductive manner and also addresses the memory bottleneck issue of GCN caused by recursive expansion of neighborhoods. The crucial ingredient is a sampling scheme in the reformulation of the loss and the gradient, well justified through an alternative view of graph convolutions in the form of integral transforms of embedding functions.\r\n\r\nDescription and image from: [FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling](https:\/\/arxiv.org\/pdf\/1801.10247.pdf)","627":"**MAS** is an optimizer that mixes [ADAM](https:\/\/paperswithcode.com\/method\/adam) and [SGD](https:\/\/paperswithcode.com\/method\/sgd).","628":"**Rotary Position Embedding**, or **RoPE**, is a type of position embedding which encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency in the self-attention formulation. Notably, RoPE comes with valuable properties such as the flexibility to expand to any sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping linear self-attention with relative position encoding.","629":"SRM combines style transfer with an attention mechanism. Its main contribution is style pooling, which utilizes both the mean and standard deviation of the input features to improve its capability to capture global information. It also adopts a lightweight channel-wise fully-connected (CFC) layer, in place of the original fully-connected layer, to reduce the computational requirements.\r\nGiven an input feature map $X \\in \\mathbb{R}^{C \\times H \\times W}$, SRM first collects global information by using style pooling ($\\text{SP}(\\cdot)$), which combines global average pooling and global standard deviation pooling. \r\nThen a channel-wise fully connected ($\\text{CFC}(\\cdot)$) layer (i.e. fully connected per channel), batch normalization $\\text{BN}$ and a sigmoid function $\\sigma$ are used to provide the attention vector. Finally, as in an SE block, the input features are multiplied by the attention vector.
Overall, an SRM can be written as:\r\n\\begin{align}\r\n s = F_\\text{srm}(X, \\theta) & = \\sigma (\\text{BN}(\\text{CFC}(\\text{SP}(X))))\r\n\\end{align}\r\n\\begin{align}\r\n Y & = s X\r\n\\end{align}\r\nThe SRM block improves both squeeze and excitation modules, yet can be added after each residual unit like an SE block.","630":"**Multi-Head Linear Attention** is a type of linear multi-head self-attention module, proposed with the [Linformer](https:\/\/paperswithcode.com\/method\/linformer) architecture. The main idea is to add two linear projection matrices $E\\_{i}, F\\_{i} \\in \\mathbb{R}^{n\\times{k}}$ when computing key and value. We first project the original $\\left(n \\times d\\right)$-dimensional key and value layers $KW\\_{i}^{K}$ and $VW\\_{i}^{V}$ into $\\left(k\\times{d}\\right)$-dimensional projected key and value layers. We then compute a $\\left(n\\times{k}\\right)$ dimensional context mapping $\\bar{P}$ using scaled-dot product attention:\r\n\r\n$$ \\bar{\\text{head}\\_{i}} = \\text{Attention}\\left(QW^{Q}\\_{i}, E\\_{i}KW\\_{i}^{K}, F\\_{i}VW\\_{i}^{V}\\right) $$\r\n\r\n$$ \\bar{\\text{head}\\_{i}} = \\text{softmax}\\left(\\frac{QW^{Q}\\_{i}\\left(E\\_{i}KW\\_{i}^{K}\\right)^{T}}{\\sqrt{d\\_{k}}}\\right) \\cdot F\\_{i}VW\\_{i}^{V} $$\r\n\r\nFinally, we compute context embeddings for each head using $\\bar{P} \\cdot \\left(F\\_{i}{V}W\\_{i}^{V}\\right)$.","631":"**Skip-gram Word2Vec** is an architecture for computing word embeddings. Instead of using surrounding words to predict the center word, as with CBow Word2Vec, Skip-gram Word2Vec uses the central word to predict the surrounding words.\r\n\r\nThe skip-gram objective function sums the log probabilities of the surrounding $n$ words to the left and right of the target word $w\\_{t}$ to produce the following objective:\r\n\r\n$$J\\_\\theta = \\frac{1}{T}\\sum^{T}\\_{t=1}\\sum\\_{-n\\leq{j}\\leq{n}, \\neq{0}}\\log{p}\\left(w\\_{j+1}\\mid{w\\_{t}}\\right)$$","632":"**PReLU-Net** is a type of convolutional neural network that utilises parameterized ReLUs for its activation function. It also uses a robust initialization scheme - afterwards known as [Kaiming Initialization](https:\/\/paperswithcode.com\/method\/he-initialization) - that accounts for non-linear activation functions.","633":"**$\\epsilon$-Greedy Exploration** is an exploration strategy in reinforcement learning that takes an exploratory action with probability $\\epsilon$ and a greedy action with probability $1-\\epsilon$. It tackles the exploration-exploitation tradeoff with reinforcement learning algorithms: the desire to explore the state space with the desire to seek an optimal policy. Despite its simplicity, it is still commonly used as an behaviour policy $\\pi$ in several state-of-the-art reinforcement learning models.\r\n\r\nImage Credit: [Robin van Embden](https:\/\/cran.r-project.org\/web\/packages\/contextual\/vignettes\/sutton_barto.html)","634":"**Barlow Twins** is a self-supervised learning method that applies redundancy-reduction \u2014 a principle first proposed in neuroscience \u2014 to self supervised learning. The objective function measures the cross-correlation matrix between the embeddings of two identical networks fed with distorted versions of a batch of samples, and tries to make this matrix close to the identity. This causes the embedding vectors of distorted version of a sample to be similar, while minimizing the redundancy between the components of these vectors. 
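A minimal PyTorch-style sketch of the Barlow Twins objective just described: the cross-correlation matrix between the embeddings of two distorted views is pushed towards the identity. The off-diagonal weight `lam` is an assumption.

```python
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-9):
    n, d = z_a.shape
    # standardize each embedding dimension over the batch
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + eps)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + eps)
    c = (z_a.T @ z_b) / n                                   # d x d cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()          # push diagonal towards 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # push off-diagonal towards 0
    return on_diag + lam * off_diag
```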
Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly it benefits from very high-dimensional output vectors.","635":"**Adversarial Color Enhancement** is an approach to generating unrestricted adversarial images by optimizing a color filter via gradient descent.","636":"**Selective Search** is a region proposal algorithm for object detection tasks. It starts by over-segmenting the image based on intensity of the pixels using a graph-based segmentation method by Felzenszwalb and Huttenlocher. Selective Search then takes these oversegments as initial input and performs the following steps\r\n\r\n1. Add all bounding boxes corresponding to segmented parts to the list of regional proposals\r\n2. Group adjacent segments based on similarity\r\n3. Go to step 1\r\n\r\nAt each iteration, larger segments are formed and added to the list of region proposals. Hence we create region proposals from smaller segments to larger segments in a bottom-up approach. This is what we mean by computing \u201chierarchical\u201d segmentations using Felzenszwalb and Huttenlocher\u2019s oversegments.","637":"A **Multiplicative LSTM (mLSTM)** is a recurrent neural network architecture for sequence modelling that combines the long short-term memory ([LSTM](https:\/\/paperswithcode.com\/method\/lstm)) and multiplicative recurrent neural network ([mRNN](https:\/\/paperswithcode.com\/method\/mrnn)) architectures. The mRNN and LSTM architectures can be combined by adding connections from the mRNN\u2019s intermediate state $m\\_{t}$ to each gating units in the LSTM.","638":"**LSGAN**, or **Least Squares GAN**, is a type of generative adversarial network that adopts the least squares loss function for the discriminator. Minimizing the objective function of LSGAN yields minimizing the Pearson $\\chi^{2}$ divergence. The objective function can be defined as:\r\n\r\n$$ \\min\\_{D}V\\_{LSGAN}\\left(D\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{x} \\sim p\\_{data}\\left(\\mathbf{x}\\right)}\\left[\\left(D\\left(\\mathbf{x}\\right) - b\\right)^{2}\\right] + \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z}\\sim p\\_{data}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - a\\right)^{2}\\right] $$\r\n\r\n$$ \\min\\_{G}V\\_{LSGAN}\\left(G\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z} \\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - c\\right)^{2}\\right] $$\r\n\r\nwhere $a$ and $b$ are the labels for fake data and real data and $c$ denotes the value that $G$ wants $D$ to believe for fake data.","639":"**RandWire** is a type of convolutional neural network that arise from randomly\r\nwired neural networks that are sampled from stochastic network generators, in which a human-designed random\r\nprocess defines generation.","640":"**DeepLab** is a semantic segmentation architecture. First, the input image goes through the network with the use of dilated convolutions. Then the output from the network is bilinearly interpolated and goes through the fully connected [CRF](https:\/\/paperswithcode.com\/method\/crf) to fine tune the result we obtain the final predictions.","641":"**CascadePSP** is a general segmentation refinement model that refines any given segmentation from low to high resolution. The model takes as input an initial mask that can be an output of any algorithm to provide a rough object location. 
Then the CascadePSP will output a refined mask. The model is designed in a cascade fashion that generates refined segmentation in a coarse-to-fine manner. Coarse outputs from the early levels predict object structure which will be used as input to the latter levels to refine boundary details.","642":"DGCNN involves neural networks that read the graphs directly and learn a classification function. There are two main challenges: 1) how to extract useful features characterizing the rich information encoded in a graph for classification purpose, and 2) how to sequentially read a graph in a meaningful and consistent order. To address the first challenge, we design a localized graph convolution model and show its connection with two graph kernels. To address the second challenge, we design a novel SortPooling layer which sorts graph vertices in a consistent order so that traditional neural networks can be trained on the graphs.\r\n\r\nDescription and image from: [An End-to-End Deep Learning Architecture for Graph Classification](https:\/\/muhanzhang.github.io\/papers\/AAAI_2018_DGCNN.pdf)","643":"**Residual Normal Distributions** are used to help the optimization of VAEs, preventing optimization from entering an unstable region. This can happen due to sharp gradients caused in situations where the encoder and decoder produce distributions far away from each other. The residual distribution parameterizes $q\\left(\\mathbf{z}|\\mathbf{x}\\right)$ relative to $p\\left(\\mathbf{z}\\right)$. Let $p\\left(z^{i}\\_{l}|\\mathbf{z}\\_{ 0$$\r\n$$\\alpha\\left(\\exp\\left(x\\right) \u2212 1\\right) \\text{ if } x \\leq 0$$","657":"**Attention Free Transformer**, or **AFT**, is an efficient variant of a [multi-head attention module](https:\/\/paperswithcode.com\/method\/multi-head-attention) that eschews [dot product self attention](https:\/\/paperswithcode.com\/method\/scaled). In an AFT layer, the key and value are first combined with a set of learned position biases, the result of which is multiplied with the query in an element-wise fashion. This new operation has a memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible to both large input and model sizes.\r\n\r\nGiven the input $X$, AFT first linearly transforms them into $Q=X W^{Q}, K=X W^{K}, V=X W^{V}$, then performs following operation:\r\n\r\n$$\r\nY=f(X) ; Y\\_{t}=\\sigma\\_{q}\\left(Q\\_{t}\\right) \\odot \\frac{\\sum\\_{t^{\\prime}=1}^{T} \\exp \\left(K\\_{t^{\\prime}}+w\\_{t, t^{\\prime}}\\right) \\odot V\\_{t^{\\prime}}}{\\sum\\_{t^{\\prime}=1}^{T} \\exp \\left(K\\_{t^{\\prime}}+w\\_{t, t^{\\prime}}\\right)}\r\n$$\r\n\r\nwhere $\\odot$ is the element-wise product; $\\sigma\\_{q}$ is the nonlinearity applied to the query with default being sigmoid; $w \\in R^{T \\times T}$ is the learned pair-wise position biases.\r\n\r\nExplained in words, for each target position $t$, AFT performs a weighted average of values, the result of which is combined with the query with element-wise multiplication. In particular, the weighting is simply composed of the keys and a set of learned pair-wise position biases. This provides the immediate advantage of not needing to compute and store the expensive attention matrix, while maintaining the global interactions between query and values as MHA does.","658":"**Stochastic Weight Averaging** is an optimization procedure that averages multiple points along the trajectory of [SGD](https:\/\/paperswithcode.com\/method\/sgd), with a cyclical or constant learning rate. 
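As an aside to the Attention Free Transformer entry above, here is a small NumPy sketch of the AFT operation; the toy shapes are assumptions and numerical stabilization of the exponentials is omitted.

```python
import numpy as np

def aft_full(Q, K, V, w):
    """Q, K, V: (T, d) projections; w: (T, T) learned pairwise position biases."""
    # weights[t, s, :] = exp(K[s] + w[t, s]), broadcast over the feature dimension
    weights = np.exp(K[None, :, :] + w[:, :, None])
    num = np.einsum('tsd,sd->td', weights, V)     # weighted sum of values per position t
    den = weights.sum(axis=1)                     # normalizer per position t
    gate = 1.0 / (1.0 + np.exp(-Q))               # sigma_q(Q), element-wise sigmoid gate
    return gate * (num / den)
```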
On the one hand it averages weights, but it also has the property that, with a cyclical or constant learning rate, SGD proposals are approximately sampling from the loss surface of the network, leading to stochastic weights and helping to discover broader optima.","659":"**DenseNAS-C** is a mobile convolutional neural network discovered through the [DenseNAS](https:\/\/paperswithcode.com\/method\/densenas) [neural architecture search](https:\/\/paperswithcode.com\/method\/neural-architecture-search) method. The basic building block is MBConvs, or inverted bottleneck residuals, from the [MobileNet](https:\/\/paperswithcode.com\/method\/mobilenetv2) architectures.","660":"**DenseNAS-B** is a mobile convolutional neural network discovered through the [DenseNAS](https:\/\/paperswithcode.com\/method\/densenas) [neural architecture search](https:\/\/paperswithcode.com\/method\/neural-architecture-search) method. The basic building block is MBConvs, or inverted bottleneck residuals, from the [MobileNet](https:\/\/paperswithcode.com\/method\/mobilenetv2) architectures.","661":"**DenseNAS-A** is a mobile convolutional neural network discovered through the [DenseNAS](https:\/\/paperswithcode.com\/method\/densenas) [neural architecture search](https:\/\/paperswithcode.com\/method\/neural-architecture-search) method. The basic building block is MBConvs, or inverted bottleneck residuals, from the MobileNet architectures.","662":"**DenseNAS** is a [neural architecture search](https:\/\/paperswithcode.com\/method\/neural-architecture-search) method that utilises a densely connected search space. The search space is represented as a dense super network, which is built upon designed routing blocks. In the super network, routing blocks are densely connected and we search for the best path between them to derive the final architecture. A chained cost estimation algorithm is used to approximate the model cost during the search.","663":"The ARMA GNN layer implements a rational graph filter with a recursive approximation.","664":"**Multi-scale Progressive Fusion Network** (MSFPN) is a neural network representation for single image deraining. It aims to exploit the correlated information of rain streaks across scales for single image deraining. \r\n\r\nSpecifically, we first generate the Gaussian pyramid rain images using Gaussian kernels to down-sample the original rain image in sequence. A coarse-fusion module (CFM) is designed to capture the global texture information from these multi-scale rain images through recurrent calculation (Conv-[LSTM](https:\/\/paperswithcode.com\/method\/lstm)), thus enabling the network to cooperatively represent the target rain streak using similar counterparts from global feature space. Meanwhile, the representation of the high-resolution pyramid layer is guided by previous outputs as well as all low-resolution pyramid layers. A finefusion module (FFM) is followed to further integrate these correlated information from different scales. By using the channel attention mechanism, the network not only discriminatively learns the scale-specific knowledge from all preceding pyramid layers, but also reduces the feature redundancy effectively. Moreover, multiple FFMs can be cascaded to form a progressive multi-scale fusion. 
Finally, a reconstruction module (RM) is appended to aggregate the coarse and fine rain information extracted respectively from CFM and FFM for learning the residual rain image, which is the approximation of real rain streak distribution.","665":"Fast-BAT is a new method for accelerated adversarial training.","666":"**DNAS**, or **Differentiable Neural Architecture Search**, uses gradient-based methods to optimize ConvNet architectures, avoiding enumerating and training individual architectures separately as in previous methods. DNAS allows us to explore a layer-wise search space where we can choose a different block for each layer of the network. DNAS represents the search space by a super net whose operators execute stochastically. It relaxes the problem of finding the optimal architecture to find a distribution that yields the optimal architecture. By using the [Gumbel Softmax](https:\/\/paperswithcode.com\/method\/gumbel-softmax) technique, it is possible to directly train the architecture distribution using gradient-based optimization such as [SGD](https:\/\/paperswithcode.com\/method\/sgd).\r\n\r\nThe loss used to train the stochastic super net consists of both the cross-entropy loss that leads to better accuracy and the latency loss that penalizes the network's latency on a target device. To estimate the latency of an architecture, the latency of each operator in the search space is measured and a lookup table model is used to compute the overall latency by adding up the latency of each operator. Using this model allows for estimation of the latency of architectures in an enormous search space. More importantly, it makes the latency differentiable with respect to layer-wise block choices.","667":"Just as [dropout](https:\/\/paperswithcode.com\/method\/dropout) prevents co-adaptation of activations, **DropPath** prevents co-adaptation of parallel paths in networks such as [FractalNets](https:\/\/paperswithcode.com\/method\/fractalnet) by randomly dropping operands of the join layers. This\r\ndiscourages the network from using one input path as an anchor and another as a corrective term (a\r\nconfiguration that, if not prevented, is prone to overfitting). Two sampling strategies are:\r\n\r\n- **Local**: a join drops each input with fixed probability, but we make sure at least one survives.\r\n- **Global**: a single path is selected for the entire network. We restrict this path to be a single\r\ncolumn, thereby promoting individual columns as independently strong predictors.","668":"**ProxylessNAS** directly learns neural network architectures on the target task and target hardware without any proxy task. Additional contributions include:\r\n\r\n- Using a new path-level pruning perspective for [neural architecture search](https:\/\/paperswithcode.com\/method\/neural-architecture-search), showing a close connection between NAS and model compression. Memory consumption is saved by one order of magnitude by using path-level binarization.\r\n- Using a novel gradient-based approach (latency regularization loss) for handling hardware objectives (e.g. latency). Given different hardware platforms: CPU\/GPU\/Mobile, ProxylessNAS enables hardware-aware neural network specialization that\u2019s exactly optimized for the target hardware.","669":"**Jukebox** is a model that generates music with singing in the raw audio domain. 
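As an aside to the DropPath entry above, a toy NumPy sketch of the local sampling strategy: each input to a join is dropped with a fixed probability, at least one input always survives, and the survivors are averaged. The drop probability is an assumption.

```python
import numpy as np

def local_drop_path_join(paths, p_drop=0.15, rng=None):
    rng = rng or np.random.default_rng()
    keep = rng.random(len(paths)) >= p_drop
    if not keep.any():                           # ensure at least one path survives
        keep[rng.integers(len(paths))] = True
    kept = [p for p, k in zip(paths, keep) if k]
    return sum(kept) / len(kept)                 # join = mean over surviving paths

# Example: join three parallel path outputs of the same shape.
out = local_drop_path_join([np.ones((2, 8)), 2 * np.ones((2, 8)), 3 * np.ones((2, 8))])
```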
It tackles the long context of raw audio using a multi-scale [VQ-VAE](https:\/\/paperswithcode.com\/method\/vq-vae) to compress it to discrete codes, and modeling those using [autoregressive Transformers](https:\/\/paperswithcode.com\/methods\/category\/autoregressive-transformers). It can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.\r\n\r\nThree separate VQ-VAE models are trained with different temporal resolutions. At each level, the input audio is segmented and encoded into latent vectors $\\mathbf{h}\\_{t}$, which are then quantized to the closest codebook vectors $\\mathbf{e}\\_{z\\_{t}}$. The code $z\\_{t}$ is a discrete representation of the audio that we later train our prior on. The decoder takes the sequence of codebook vectors and reconstructs the audio. The top level learns the highest degree of abstraction, since it is encoding longer audio per token while keeping the codebook size the same. Audio can be reconstructed using the codes at any one of the abstraction levels, where the least abstract bottom-level codes result in the highest-quality audio.","670":"**Auxiliary Batch Normalization** is a type of regularization used in adversarial training schemes. The idea is that adversarial examples should have a separate [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization) components to the clean examples, as they have different underlying statistics.","671":"**A2C**, or **Advantage Actor Critic**, is a synchronous version of the [A3C](https:\/\/paperswithcode.com\/method\/a3c) policy gradient method. As an alternative to the asynchronous implementation of A3C, A2C is a synchronous, deterministic implementation that waits for each actor to finish its segment of experience before updating, averaging over all of the actors. This more effectively uses GPUs due to larger batch sizes.\r\n\r\nImage Credit: [OpenAI Baselines](https:\/\/openai.com\/blog\/baselines-acktr-a2c\/)","672":"**Low-Rank Factorization-based Multi-head Attention Mechanism**, or **LAMA**, is a type of attention module that uses low-rank factorization to reduce computational complexity. It uses low-rank bilinear pooling to construct a structured sentence representation that attends to multiple aspects of a sentence.","673":"**Adaptive Masking** is a type of attention mechanism that allows a model to learn its own context size to attend over. For each head in [Multi-Head Attention](https:\/\/paperswithcode.com\/method\/multi-head-attention), a masking function is added to control for the span of the attention. A masking function is a non-increasing function that maps a\r\ndistance to a value in $\\left[0, 1\\right]$. Adaptive masking takes the following soft masking function $m\\_{z}$ parametrized by a real value $z$ in $\\left[0, S\\right]$:\r\n\r\n$$ m\\_{z}\\left(x\\right) = \\min\\left[\\max\\left[\\frac{1}{R}\\left(R+z-x\\right), 0\\right], 1\\right] $$\r\n\r\nwhere $R$ is a hyper-parameter that controls its softness. The shape of this piecewise function as a function of the distance. This soft masking function is inspired by [Jernite et al. (2017)](https:\/\/arxiv.org\/abs\/1611.06188). 
The attention weights from are then computed on the masked span:\r\n\r\n$$ a\\_{tr} = \\frac{m\\_{z}\\left(t-r\\right)\\exp\\left(s\\_{tr}\\right)}{\\sum^{t-1}\\_{q=t-S}m\\_{z}\\left(t-q\\right)\\exp\\left(s\\_{tq}\\right)}$$\r\n\r\nA $\\mathcal{l}\\_{1}$ penalization is added on the parameters $z\\_{i}$ for each attention head $i$ of the model to the loss function:\r\n\r\n$$ L = - \\log{P}\\left(w\\_{1}, \\dots, w\\_{T}\\right) + \\frac{\\lambda}{M}\\sum\\_{i}z\\_{i} $$\r\n\r\nwhere $\\lambda > 0$ is the regularization hyperparameter, and $M$ is the number of heads in each\r\nlayer. This formulation is differentiable in the parameters $z\\_{i}$, and learnt jointly with the rest of the model.","674":"A **Scale Aggregation Block** concatenates feature maps at a wide range of scales. Feature maps for each scale are generated by a stack of downsampling, [convolution](https:\/\/paperswithcode.com\/method\/convolution) and upsampling operations. The proposed scale aggregation block is a standard computational module which readily replaces any given transformation $\\mathbf{Y}=\\mathbf{T}(\\mathbf{X})$, where $\\mathbf{X}\\in \\mathbb{R}^{H\\times W\\times C}$, $\\mathbf{Y}\\in \\mathbb{R}^{H\\times W\\times C_o}$ with $C$ and $C_o$ being the input and output channel number respectively. $\\mathbf{T}$ is any operator such as a convolution layer or a series of convolution layers. Assume we have $L$ scales. Each scale $l$ is generated by sequentially conducting a downsampling $\\mathbf{D}_l$, a transformation $\\mathbf{T}_l$ and an unsampling operator $\\mathbf{U}_l$:\r\n\r\n$$\r\n\\mathbf{X}^{'}_l=\\mathbf{D}_l(\\mathbf{X}),\r\n\\label{eq:eq_d}\r\n$$\r\n\r\n$$\r\n\\mathbf{Y}^{'}_l=\\mathbf{T}_l(\\mathbf{X}^{'}_l),\r\n\\label{eq:eq_tl}\r\n$$\r\n\r\n$$\r\n\\mathbf{Y}_l=\\mathbf{U}_l(\\mathbf{Y}^{'}_l),\r\n\\label{eq:eq_u}\r\n$$\r\n\r\nwhere $\\mathbf{X}^{'}_l\\in \\mathbb{R}^{H_l\\times W_l\\times C}$,\r\n$\\mathbf{Y}^{'}_l\\in \\mathbb{R}^{H_l\\times W_l\\times C_l}$, and\r\n$\\mathbf{Y}_l\\in \\mathbb{R}^{H\\times W\\times C_l}$.\r\nNotably, $\\mathbf{T}_l$ has the similar structure as $\\mathbf{T}$.\r\nWe can concatenate all $L$ scales together, getting\r\n\r\n$$\r\n\\mathbf{Y}^{'}=\\Vert^L_1\\mathbf{U}_l(\\mathbf{T}_l(\\mathbf{D}_l(\\mathbf{X}))),\r\n\\label{eq:eq_all}\r\n$$\r\n\r\nwhere $\\Vert$ indicates concatenating feature maps along the channel dimension, and $\\mathbf{Y}^{'} \\in \\mathbb{R}^{H\\times W\\times \\sum^L_1 C_l}$ is the final output feature maps of the scale aggregation block.\r\n\r\nIn the reference implementation, the downsampling $\\mathbf{D}_l$ with factor $s$ is implemented by a max pool layer with $s\\times s$ kernel size and $s$ stride. The upsampling $\\mathbf{U}_l$ is implemented by resizing with the nearest neighbor interpolation.","675":"**ScaleNet**, or a **Scale Aggregation Network**, is a type of convolutional neural network which learns a neuron allocation for aggregating multi-scale information in different building blocks of a deep network. The most informative output neurons in each block are preserved while others are discarded, and thus neurons for multiple scales are competitively and adaptively allocated. The scale aggregation (SA) block concatenates feature maps at a wide range of scales. 
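A brief NumPy sketch of the adaptive-span soft mask and the masked attention weights defined above; here `R` and `z` are treated as given scalars, and the epsilon guard is an assumption.

```python
import numpy as np

def soft_mask(distance, z, R=32):
    # m_z(x) = min(max((R + z - x) / R, 0), 1)
    return np.clip((R + z - distance) / R, 0.0, 1.0)

def masked_attention_weights(scores, t, z, R=32, eps=1e-9):
    """scores[r] holds s_{t r} for past positions r = 0 .. t-1."""
    r = np.arange(len(scores))
    m = soft_mask(t - r, z, R)
    w = m * np.exp(scores - scores.max())   # stabilized exponentials
    return w / (w.sum() + eps)              # a_{t r}
```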
Feature maps for each scale are generated by a stack of downsampling, [convolution](https:\/\/paperswithcode.com\/method\/convolution) and upsampling operations.","676":"A **Kernel Activation Function** is a non-parametric activation function defined as a one-dimensional kernel approximator:\r\n\r\n$$ f(s) = \\sum_{i=1}^D \\alpha_i \\kappa( s, d_i) $$\r\n\r\nwhere:\r\n\r\n1. The dictionary of the kernel elements $d_0, \\ldots, d_D$ is fixed by sampling the $x$-axis with a uniform step around 0.\r\n2. The user selects the kernel function (e.g., Gaussian, [ReLU](https:\/\/paperswithcode.com\/method\/relu), [Softplus](https:\/\/paperswithcode.com\/method\/softplus)) and the number of kernel elements $D$ as a hyper-parameter. A larger dictionary leads to more expressive activation functions and a larger number of trainable parameters.\r\n3. The linear coefficients are adapted independently at every neuron via standard back-propagation.\r\n\r\nIn addition, the linear coefficients can be initialized using kernel ridge regression to behave similarly to a known function in the beginning of the optimization process.","677":"**ENIGMA** is an evaluation framework for dialog systems based on Pearson and Spearman's rank correlations between the estimated rewards and the true rewards. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation, making automatic evaluations feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies for collecting the experience data (see details in Section 2), which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors.","678":"[Transformer](https:\/\/paperswithcode.com\/method\/transformer) is a type of self-attention-based neural networks originally applied for NLP tasks. Recently, pure transformer-based models are proposed to solve computer vision problems. These visual transformers usually view an image as a sequence of patches while they ignore the intrinsic structure information inside each patch. In this paper, we propose a novel Transformer-iN-Transformer (TNT) model for modeling both patch-level and pixel-level representation. In each TNT block, an outer transformer block is utilized to process patch embeddings, and an inner transformer block extracts local features from pixel embeddings. The pixel-level feature is projected to the space of patch embedding by a linear transformation layer and then added into the patch. By stacking the TNT blocks, we build the TNT model for image recognition.\r\n\r\nImage source: [Han et al.](https:\/\/arxiv.org\/pdf\/2103.00112v1.pdf)","679":"**YOLOX** is a single-stage object detector that makes several modifications to [YOLOv3](https:\/\/paperswithcode.com\/method\/yolov3) with a [DarkNet53](https:\/\/www.paperswithcode.com\/method\/darknet53) backbone. Specifically, YOLO\u2019s head is replaced with a decoupled one. For each level of [FPN](https:\/\/paperswithcode.com\/method\/fpn) feature, we first adopt a 1 \u00d7 1 conv layer to reduce the feature channel to 256 and then add two parallel branches with two 3 \u00d7 3 conv layers each for classification and regression tasks respectively.\r\n\r\nAdditional changes include adding Mosaic and [MixUp](https:\/\/paperswithcode.com\/method\/mixup) into the augmentation strategies to boost YOLOX\u2019s performance. The anchor mechanism is also removed so YOLOX is anchor-free. 
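A simplified sketch of this decoupled head for a single FPN level; the SiLU activations and the separate objectness output are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Per-FPN-level head: 1x1 conv to 256 channels, then two parallel
    branches of two 3x3 convs each for classification and regression."""

    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, 256, kernel_size=1)

        def branch():
            return nn.Sequential(
                nn.Conv2d(256, 256, 3, padding=1), nn.SiLU(),
                nn.Conv2d(256, 256, 3, padding=1), nn.SiLU(),
            )

        self.cls_branch, self.reg_branch = branch(), branch()
        self.cls_out = nn.Conv2d(256, num_classes, 1)  # class scores
        self.box_out = nn.Conv2d(256, 4, 1)            # box offsets
        self.obj_out = nn.Conv2d(256, 1, 1)            # objectness

    def forward(self, feat):
        x = self.stem(feat)
        cls_feat, reg_feat = self.cls_branch(x), self.reg_branch(x)
        return self.cls_out(cls_feat), self.box_out(reg_feat), self.obj_out(reg_feat)

head = DecoupledHead(in_channels=512, num_classes=80)
cls, box, obj = head(torch.randn(1, 512, 20, 20))
print(cls.shape, box.shape, obj.shape)
```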
Lastly, SimOTA is used for label assignment: label assignment is formulated as an optimal transport problem and solved approximately via a top-k strategy.","680":"**GShard** is an intra-layer parallelism method for distributed training. It consists of a set of simple APIs for annotations, and a compiler extension in XLA for automatic parallelization.","681":"**GCNII** is an extension of [Graph Convolutional Networks](https:\/\/www.paperswithcode.com\/method\/gcn) with two new techniques, initial residual and identity mapping, to tackle the problem of oversmoothing -- where stacking more layers and adding non-linearity tends to degrade performance. At each layer, initial residual constructs a skip connection from the input layer, while identity mapping adds an identity matrix to the weight matrix.","682":"","683":"**SpineNet** is a convolutional neural network backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by [Neural Architecture Search](https:\/\/paperswithcode.com\/method\/neural-architecture-search).","684":"**SNGAN**, or **Spectrally Normalised GAN**, is a type of generative adversarial network that uses [spectral normalization](https:\/\/paperswithcode.com\/method\/spectral-normalization), a type of [weight normalization](https:\/\/paperswithcode.com\/method\/weight-normalization), to stabilise the training of the discriminator.","685":"**AutoGAN** is a preliminary study on introducing [neural architecture search](https:\/\/paperswithcode.com\/method\/neural-architecture-search) (NAS), which has witnessed prevailing success in image classification and (very recently) segmentation tasks, to generative adversarial networks (GANs). The marriage of NAS and GANs faces its own unique challenges. The search space covers generator architectural variations, and an RNN controller guides the search, with parameter sharing and dynamic-resetting to accelerate the process. Inception score is adopted as the reward, and a multi-level search strategy is introduced to perform NAS in a progressive way.","686":"**Gradient Checkpointing** is a method used for reducing the memory footprint when training deep neural networks, at the cost of a small increase in computation time.","687":"**Fast AutoAugment** is an image data augmentation algorithm that finds effective augmentation policies via a search strategy based on density matching, motivated by Bayesian DA. The strategy is to improve the generalization performance of a given network by learning augmentation policies which treat augmented data as missing data points of the training data. However, different from Bayesian DA, the proposed method recovers those missing data points by the exploitation-and-exploration of a family of inference-time augmentations via Bayesian optimization in the policy search phase. This is realized by using an efficient density matching algorithm that does not require any back-propagation for network training for each policy evaluation.","688":"VL-BERT is pre-trained on a large-scale image-captions dataset together with a text-only corpus. The inputs to the model are either words from the input sentences or regions-of-interest (RoIs) from input images. It can be fine-tuned to fit most visual-linguistic downstream tasks.
Its backbone is a multi-layer bidirectional Transformer encoder, modified to accommodate visual contents, and new type of visual feature embedding to the input feature embeddings. VL-BERT takes both visual and linguistic elements as input, represented as RoIs in images and subwords in input sentences. Four different types of embeddings are used to represent each input: token embedding, visual feature embedding, segment embedding, and sequence position embedding. VL-BERT is pre-trained using Conceptual Captions and text-only datasets. Two pre-training tasks are used: masked language modeling with visual clues, and masked RoI classification with linguistic clues.","689":"**Gradient Sparsification** is a technique for distributed training that sparsifies stochastic gradients to reduce the communication cost, with minor increase in the number of iterations. The key idea behind our sparsification technique is to drop some coordinates of the stochastic gradient and appropriately amplify the remaining coordinates to ensure the unbiasedness of the sparsified stochastic gradient. The sparsification approach can significantly reduce the coding length of the stochastic gradient and only slightly increase the variance of the stochastic gradient.","690":"**Randomized Leaky Rectified Linear Units**, or **RReLU**, are an activation function that randomly samples the negative slope for activation values. It was first proposed and used in the Kaggle NDSB Competition. During training, $a\\_{ji}$ is a random number sampled from a uniform distribution $U\\left(l, u\\right)$. Formally:\r\n\r\n$$ y\\_{ji} = x\\_{ji} \\text{ if } x\\_{ji} \\geq{0} $$\r\n$$ y\\_{ji} = a\\_{ji}x\\_{ji} \\text{ if } x\\_{ji} < 0 $$\r\n\r\nwhere\r\n\r\n$$\\alpha\\_{ji} \\sim U\\left(l, u\\right), l < u \\text{ and } l, u \\in \\left[0,1\\right)$$\r\n\r\nIn the test phase, we take average of all the $a\\_{ji}$ in training similar to [dropout](https:\/\/paperswithcode.com\/method\/dropout), and thus set $a\\_{ji}$ to $\\frac{l+u}{2}$ to get a deterministic result. As suggested by the NDSB competition winner, $a\\_{ji}$ is sampled from $U\\left(3, 8\\right)$. \r\n\r\nAt test time, we use:\r\n\r\n$$ y\\_{ji} = \\frac{x\\_{ji}}{\\frac{l+u}{2}} $$","691":"**FSAF**, or Feature Selective Anchor-Free, is a building block for single-shot object detectors. It can be plugged into single-shot detectors with feature pyramid structure. The FSAF module addresses two limitations brought up by the conventional anchor-based detection: 1) heuristic-guided feature selection; 2) overlap-based anchor sampling. The general concept of the FSAF module is online feature selection applied to the training of multi-level anchor-free branches. Specifically, an anchor-free branch is attached to each level of the feature pyramid, allowing box encoding and decoding in the anchor-free manner at an arbitrary level. During training, we dynamically assign each instance to the most suitable feature level. At the time of inference, the FSAF module can work jointly with anchor-based branches by outputting predictions in parallel. We instantiate this concept with simple implementations of anchor-free branches and online feature selection strategy\r\n\r\nThe general concept is presented in the Figure to the right. An anchor-free branch is built per level of feature pyramid, independent to the anchor-based branch. Similar to the anchor-based branch, it consists of a classification subnet and a regression subnet (not shown in figure). 
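A minimal sketch of one such anchor-free branch attached to a single pyramid level (the layer counts and channel sizes are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class AnchorFreeBranch(nn.Module):
    """Classification subnet (K class maps) and regression subnet
    (4 box-offset maps) applied to one feature pyramid level."""

    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()

        def subnet(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(in_channels, out_channels, 3, padding=1),
            )

        self.cls_subnet = subnet(num_classes)  # per-location class scores
        self.reg_subnet = subnet(4)            # per-location distances to box sides

    def forward(self, pyramid_feature):
        return self.cls_subnet(pyramid_feature), self.reg_subnet(pyramid_feature)

branch = AnchorFreeBranch()
cls_map, box_map = branch(torch.randn(1, 256, 50, 50))
print(cls_map.shape, box_map.shape)
```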
An instance can be assigned to arbitrary level of the anchor-free branch. During training, we dynamically select the most suitable level of feature for each instance based on the instance content instead of just the size of instance box. The selected level of feature then learns to detect the assigned instances. At inference, the FSAF module can run independently or jointly with anchor-based branches. The FSAF module is agnostic to the backbone network and can be applied to single-shot detectors with a structure of feature pyramid. Additionally, the instantiation of anchor-free branches and online feature selection can be various.","692":"**1-bit LAMB** is a communication-efficient stochastic optimization technique which introduces a novel way to support adaptive layerwise learning rates even when communication is compressed. Learning from the insights behind [1-bit Adam](https:\/\/paperswithcode.com\/method\/1-bit-adam), it is a a 2-stage algorithm which uses [LAMB](https:\/\/paperswithcode.com\/method\/lamb) (warmup stage) to \u201cpre-condition\u201d a communication compressed momentum SGD algorithm (compression stage). At compression stage where original LAMB algorithm cannot be used to update the layerwise learning rates, 1-bit LAMB employs a novel way to adaptively scale layerwise learning rates based on information from both warmup and compression stages. As a result, 1-bit LAMB is able to achieve large batch optimization (LAMB)\u2019s convergence speed under compressed communication.\r\n\r\nThere are two major differences between 1-bit LAMB and the original LAMB:\r\n\r\n- During compression stage, 1-bit LAMB updates the layerwise learning rate based on a novel \u201creconstructed gradient\u201d based on the compressed momentum. This makes 1-bit LAMB compatible with error compensation and be able to keep track of the training dynamic under compression.\r\n- 1-bit LAMB also introduces extra stabilized soft thresholds when updating layerwise learning rate at compression stage, which makes training more stable under compression.","693":"**1-bit Adam** is a [stochastic optimization](https:\/\/paperswithcode.com\/methods\/category\/stochastic-optimization) technique that is a variant of [ADAM](https:\/\/paperswithcode.com\/method\/adam) with error-compensated 1-bit compression, based on finding that Adam's variance term becomes stable at an early stage. First vanilla Adam is used for a few epochs as a warm-up. After the warm-up stage, the compression stage starts and we stop updating the variance term $\\mathbf{v}$ and use it as a fixed precondition. At the compression stage, we communicate based on the momentum applied with error-compensated 1-bit compression. The momentums are quantized into 1-bit representation (the sign of each element). Accompanying the vector, a scaling factor is computed as $\\frac{\\text { magnitude of compensated gradient }}{\\text { magnitude of quantized gradient }}$. This scaling factor ensures that the compressed momentum has the same magnitude as the uncompressed momentum. This 1-bit compression could reduce the communication cost by $97 \\%$ and $94 \\%$ compared to the original float 32 and float 16 training, respectively.","694":"**PIoU Loss** is a loss function for oriented object detection which is formulated to exploit both the angle and IoU for accurate oriented bounding box regression. The PIoU loss is derived from IoU metric with a pixel-wise form.","695":"**$n$-step Returns** are used for value function estimation in reinforcement learning. 
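As a concrete illustration of the return formalised just below, a short sketch that computes it from a reward sequence (the rewards, discount and bootstrap value are made up):

```python
def n_step_return(rewards, value_bootstrap, gamma):
    """R_t^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n}
                 + gamma^n * V(s_{t+n}), with n = len(rewards)."""
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r
    return ret + (gamma ** len(rewards)) * value_bootstrap

# 3-step return with gamma = 0.9 and a bootstrapped value of 2.0:
print(n_step_return([1.0, 0.0, 1.0], value_bootstrap=2.0, gamma=0.9))  # 3.268
```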
Specifically, for $n$ steps we can write the complete return as:\r\n\r\n$$ R\_{t}^{(n)} = r\_{t+1} + \gamma{r}\_{t+2} + \cdots + \gamma^{n-1}{r}\_{t+n} + \gamma^{n}V\_{t}\left(s\_{t+n}\right) $$\r\n\r\nWe can then write an $n$-step backup, in the style of TD learning, as:\r\n\r\n$$ \Delta{V}\_{t}\left(s\_{t}\right) = \alpha\left[R\_{t}^{(n)} - V\_{t}\left(s\_{t}\right)\right] $$\r\n\r\nMulti-step returns often lead to faster learning with suitably tuned $n$.\r\n\r\nImage Credit: Sutton and Barto, Reinforcement Learning","696":"**Revision Network** is a style transfer module that revises the rough stylized image by generating a residual details image $r\_{cs}$; the final stylized image is generated by combining $r\_{cs}$ with the rough stylized image $\bar{x}\_{cs}$. This procedure ensures that the distribution of global style patterns in $\bar{x}\_{cs}$ is properly kept. Meanwhile, learning to revise local style patterns with the residual details image is easier for the Revision Network.\r\n\r\nAs shown in the Figure, the Revision Network is designed as a simple yet effective encoder-decoder architecture, with only one down-sampling and one up-sampling layer. Further, a [patch discriminator](https:\/\/paperswithcode.com\/method\/patchgan) is used to help the Revision Network capture fine patch textures in an adversarial learning setting. The patch discriminator $D$ is defined following SinGAN, where $D$ has 5 convolution layers and 32 hidden channels. A relatively shallow $D$ is chosen to (1) avoid overfitting, since there is only one style image, and (2) control the receptive field to ensure $D$ can only capture local patterns.","697":"**Drafting Network** is a style transfer module designed to transfer global style patterns in low resolution, since global patterns can be transferred more easily in low resolution due to the larger receptive field and fewer local details. To achieve single style transfer, earlier work trained an encoder-decoder module, where only the content image is used as input. To better combine the style feature and the content feature, the Drafting Network adopts the [AdaIN module](https:\/\/paperswithcode.com\/method\/adaptive-instance-normalization).\r\n\r\nThe architecture of the Drafting Network is shown in the Figure, which includes an encoder, several AdaIN modules and a decoder. (1) The encoder is a pre-trained [VGG](https:\/\/paperswithcode.com\/method\/vgg)-19 network, which is fixed during training. Given $\bar{x}\_{c}$ and $\bar{x}\_{s}$, the VGG encoder extracts features at multiple granularities at the 2_1, 3_1 and 4_1 layers. (2) Then, feature modulation between the content and style features is applied using AdaIN modules after the 2_1, 3_1 and 4_1 layers, respectively. (3) Finally, at each granularity of the decoder, the corresponding feature from the AdaIN module is merged via a [skip-connection](https:\/\/paperswithcode.com\/methods\/category\/skip-connections). Skip-connections after AdaIN modules at both low and high levels are leveraged to help preserve content structure, especially for the low-resolution image.","698":"**LapStyle**, or **Laplacian Pyramid Network**, is a feed-forward style transfer method.
It uses a [Drafting Network](https:\/\/paperswithcode.com\/method\/drafting-network) to transfer global style patterns in low-resolution, and adopts higher resolution [Revision Networks](https:\/\/paperswithcode.com\/method\/revision-network) to revise local styles in a pyramid manner according to outputs of multi-level Laplacian filtering of the content image. Higher resolution details can be generated by stacking Revision Networks with multiple Laplacian pyramid levels. The final stylized image is obtained by aggregating outputs of all pyramid levels.\r\n\r\nSpecifically, we first generate image pyramid $\\left\\(\\bar{x}\\_{c}, r\\_{c}\\right\\)$ from content image $x\\_{c}$ with the help of Laplacian filter. Rough low-resolution stylized image are then generated by the Drafting Network. Then the Revision Network generates stylized detail image in high resolution. Then the final stylized image is generated by aggregating the outputs pyramid. $L, C$ and $A$ in an image represent Laplacian, concatenate and aggregation operation separately.","699":"**SNIP**, or **Scale Normalization for Image Pyramids**, is a multi-scale training scheme that selectively back-propagates the gradients of object instances of different sizes as a function of the image scale. SNIP is a modified version of MST where only the object instances that have a resolution close to the pre-training dataset, which is typically 224x224, are used for training the detector. In multi-scale training (MST), each image is observed at different resolutions therefore, at a high resolution (like 1400x2000) large objects are hard to classify and at a low resolution (like 480x800) small objects are hard to classify. Fortunately, each object instance appears at several different scales and some of those appearances fall in the desired scale range. In order to eliminate extreme scale objects, either too large or too small, training is only performed on objects that fall in the desired scale range and the remainder are simply ignored during back-propagation. Effectively, SNIP uses all the object instances during training, which helps capture all the variations in appearance and\r\npose, while reducing the domain-shift in the scale-space for the pre-trained network.","700":"**Mix-FFN** is a feedforward layer used in the [SegFormer](https:\/\/paperswithcode.com\/method\/segformer) architecture. [ViT](https:\/\/www.paperswithcode.com\/method\/vision-transformer) uses [positional encoding](https:\/\/paperswithcode.com\/methods\/category\/position-embeddings) (PE) to introduce the location information. However, the resolution of $\\mathrm{PE}$ is fixed. Therefore, when the test resolution is different from the training one, the positional code needs to be interpolated and this often leads to dropped accuracy. To alleviate this problem, [CPVT](https:\/\/www.paperswithcode.com\/method\/cpvt) uses $3 \\times 3$ Conv together with the PE to implement a data-driven PE. The authors of Mix-FFN argue that positional encoding is actually not necessary for semantic segmentation. Instead, they use Mix-FFN which considers the effect of zero padding to leak location information, by directly using a $3 \\times 3$ Conv in the feed-forward network (FFN). 
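A rough PyTorch sketch of this block; the exact formulation is given in the equation that follows, and the depthwise form of the $3 \times 3$ convolution and the hidden sizes here are assumptions:

```python
import torch
import torch.nn as nn

class MixFFN(nn.Module):
    """MLP -> 3x3 conv over the token grid (leaks position via zero padding)
    -> GELU -> MLP, with a residual connection around the whole block."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        # Depthwise 3x3 convolution applied to tokens reshaped to an H x W map.
        self.conv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, dim) tokens from the self-attention module, N = H * W
        h = self.fc1(x)
        B, N, C = h.shape
        h = h.transpose(1, 2).reshape(B, C, H, W)   # tokens -> 2-D feature map
        h = self.conv(h)
        h = h.flatten(2).transpose(1, 2)            # back to tokens
        h = self.act(h)
        return self.fc2(h) + x                      # residual with x_in

ffn = MixFFN(dim=64, hidden_dim=256)
print(ffn(torch.randn(2, 16 * 16, 64), H=16, W=16).shape)
```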
Mix-FFN can be formulated as:\r\n\r\n$$\r\n\\mathbf{x}\\_{\\text {out }}=\\operatorname{MLP}\\left(\\operatorname{GELU}\\left(\\operatorname{Conv}\\_{3 \\times 3}\\left(\\operatorname{MLP}\\left(\\mathbf{x}\\_{i n}\\right)\\right)\\right)\\right)+\\mathbf{x}\\_{i n}\r\n$$\r\n\r\nwhere $\\mathbf{x}\\_{i n}$ is the feature from a self-attention module. Mix-FFN mixes a $3 \\times 3$ convolution and an MLP into each FFN.","701":"**SegFormer** is a [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers)-based framework for semantic segmentation that unifies Transformers with lightweight [multilayer perceptron](https:\/\/paperswithcode.com\/method\/feedforward-network) (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes which leads to decreased performance when the testing resolution differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, and thus combining both local attention and global attention to render powerful representations.","702":"**MobileBERT** is a type of inverted-bottleneck [BERT](https:\/\/paperswithcode.com\/method\/bert) that compresses and accelerates the popular BERT model. MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. It is trained by layer-to-layer imitating the inverted bottleneck BERT.","703":"**Distributional Generalization** is a type of generalization that roughly states that outputs of a classifier at train and test time are close as distributions, as opposed to close in just their average error. This behavior is not captured by classical generalization, which would only consider the average error and not the distribution of errors over the input domain.","704":"**Macaw** is a generative question-answering (QA) system that is built on UnifiedQA, itself built on [T5](https:\/\/paperswithcode.com\/method\/t5). Macaw has three interesting features. First, it often produces high-quality answers to questions far outside the domain it was trained on, sometimes surprisingly so. Second, Macaw allows different permutations (\u201can gles\u201d) of inputs and outputs to be used. For example, we can give it a question and get an answer; or give it an answer and get a question; or give it a question and answer and get a set of multiple-choice (MC) options for that question. This multi-angle QA capability allows versatility in the way Macaw can be used, include recursively using outputs as new inputs to the system. Finally, Macaw also generates explanations as an optional output (or even input) element.","705":"**RandAugment** is an automated data augmentation method. The search space for data augmentation has 2 interpretable hyperparameter $N$ and $M$. $N$ is the number of augmentation transformations to apply sequentially, and $M$ is the magnitude for all the transformations. 
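A minimal sketch of how $N$ and $M$ are used; the parameter-free uniform sampling it implements is described next, and the toy transformations are placeholders for real image operations:

```python
import random

def rand_augment(image, transforms, n, m):
    """Apply n transformations chosen uniformly at random, all at magnitude m."""
    for op in random.choices(list(transforms), k=n):  # each op has probability 1/K
        image = transforms[op](image, m)
    return image

# Toy "transformations" acting on a number standing in for an image.
toy_ops = {
    "brightness": lambda img, m: img + m,
    "contrast":   lambda img, m: img * (1 + 0.1 * m),
    "identity":   lambda img, m: img,
}
print(rand_augment(1.0, toy_ops, n=2, m=3))
```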
To reduce the parameter space but still maintain image diversity, learned policies and probabilities for applying each transformation are replaced with a parameter-free procedure of always selecting a transformation with uniform probability $\\frac{1}{K}$. Here $K$ is the number of transformation options. So given $N$ transformations for a training image, RandAugment may thus express $KN$ potential policies.\r\n\r\nTransformations applied include identity transformation, autoContrast, equalize, rotation, solarixation, colorjittering, posterizing, changing contrast, changing brightness, changing sharpness, shear-x, shear-y, translate-x, translate-y.","706":"**GridMask** is a data augmentation method that randomly removes some pixels of an input image. Unlike other methods, the region that the algorithm removes is neither a continuous region nor random pixels in dropout. Instead, the algorithm removes a region with disconnected pixel sets, as shown in the Figure.\r\n\r\nWe express the setting as\r\n\r\n$$\r\n\\tilde{\\mathbf{x}}=\\mathbf{x} \\times M\r\n$$\r\n\r\nwhere $\\mathbf{x} \\in R^{H \\times W \\times C}$ represents the input image, $M \\in$ $\\{0,1\\}^{H \\times W}$ is the binary mask that stores pixels to be removed, and $\\tilde{\\mathbf{x}} \\in R^{H \\times W \\times C}$ is the result produced by the algorithm. For the binary mask $M$, if $M_{i, j}=1$ we keep pixel $(i, j)$ in the input image; otherwise we remove it. GridMask is applied after the image normalization operation.\r\n\r\nThe shape of $M$ looks like a grid, as shown in the Figure . Four numbers $\\left(r, d, \\delta_{x}, \\delta_{y}\\right)$ are used to represent a unique $M$. Every mask is formed by tiling the units. $r$ is the ratio of the shorter gray edge in a unit. $d$ is the length of one unit. $\\delta\\_{x}$ and $\\delta\\_{y}$ are the distances between the first intact unit and boundary of the image.","707":"Combines learned time-frequency representation with a masker architecture based on 1D [dilated convolution](https:\/\/paperswithcode.com\/method\/dilated-convolution).","708":"**SepFormer** is [Transformer](https:\/\/paperswithcode.com\/methods\/category\/transformers)-based neural network for speech separation. The SepFormer learns short and long-term dependencies with a multi-scale approach that employs transformers. It is mainly composed of multi-head attention and feed-forward layers. A dual-path framework (introduced by DPRNN) is adopted and [RNNs](https:\/\/paperswithcode.com\/methods\/category\/recurrent-neural-networks) are replaced with a multiscale pipeline composed of transformers that learn both short and long-term dependencies. The dual-path framework enables the mitigation of the quadratic complexity of transformers, as transformers in the dual-path framework process smaller chunks.\r\n\r\nThe model is based on the learned-domain masking approach and employs an encoder, a decoder, and a masking network, as shown in the figure. The encoder is fully convolutional, while the decoder employs two Transformers embedded inside the dual-path processing block. The decoder finally reconstructs the separated signals in the time domain by using the masks predicted by the masking network.","709":"The **Mogrifier LSTM** is an extension to the [LSTM](https:\/\/paperswithcode.com\/method\/lstm) where the LSTM\u2019s input $\\mathbf{x}$ is gated conditioned on the output of the previous step $\\mathbf{h}\\_{prev}$. 
Next, the gated input is used in a similar manner to gate the output of the\r\nprevious time step. After a couple of rounds of this mutual gating, the last updated $\\mathbf{x}$ and $\\mathbf{h}\\_{prev}$ are fed to an LSTM. \r\n\r\nIn detail, the Mogrifier is an LSTM where two inputs $\\mathbf{x}$ and $\\mathbf{h}\\_{prev}$ modulate one another in an alternating fashion before the usual LSTM computation takes place. That is: $ \\text{Mogrify}\\left(\\mathbf{x}, \\mathbf{c}\\_{prev}, \\mathbf{h}\\_{prev}\\right) = \\text{LSTM}\\left(\\mathbf{x}^{\u2191}, \\mathbf{c}\\_{prev}, \\mathbf{h}^{\u2191}\\_{prev}\\right)$ where the modulated inputs $\\mathbf{x}^{\u2191}$ and $\\mathbf{h}^{\u2191}\\_{prev}$ are defined as the highest indexed $\\mathbf{x}^{i}$ and $\\mathbf{h}^{i}\\_{prev}$, respectively, from the interleaved sequences:\r\n\r\n$$ \\mathbf{x}^{i} = 2\\sigma\\left(\\mathbf{Q}^{i}\\mathbf{h}^{i\u22121}\\_{prev}\\right) \\odot x^{i-2} \\text{ for odd } i \\in \\left[1 \\dots r\\right] $$\r\n\r\n$$ \\mathbf{h}^{i}\\_{prev} = 2\\sigma\\left(\\mathbf{R}^{i}\\mathbf{x}^{i-1}\\right) \\odot \\mathbf{h}^{i-2}\\_{prev} \\text{ for even } i \\in \\left[1 \\dots r\\right] $$\r\n\r\nwith $\\mathbf{x}^{-1} = \\mathbf{x}$ and $\\mathbf{h}^{0}\\_{prev} = \\mathbf{h}\\_{prev}$. The number of \"rounds\", $r \\in \\mathbb{N}$, is a hyperparameter; $r = 0$ recovers the LSTM. Multiplication with the constant 2 ensures that randomly initialized $\\mathbf{Q}^{i}$, $\\mathbf{R}^{i}$ matrices result in transformations close to identity. To reduce the number of additional model parameters, we typically factorize the $\\mathbf{Q}^{i}$, $\\mathbf{R}^{i}$ matrices as products of low-rank matrices: $\\mathbf{Q}^{i}$ =\r\n$\\mathbf{Q}^{i}\\_{left}\\mathbf{Q}^{i}\\_{right}$ with $\\mathbf{Q}^{i} \\in \\mathbb{R}^{m\\times{n}}$, $\\mathbf{Q}^{i}\\_{left} \\in \\mathbb{R}^{m\\times{k}}$, $\\mathbf{Q}^{i}\\_{right} \\in \\mathbb{R}^{k\\times{n}}$, where $k < \\min\\left(m, n\\right)$ is the rank.","710":"**LipGAN** is a generative adversarial network for generating realistic talking faces conditioned on translated speech. It employs an adversary that measures the extent of lip synchronization in the frames generated by the generator. The system is capable of handling faces in random poses without the need for realignment to a template pose. LipGAN is a fully self-supervised approach that learns a phoneme-viseme mapping, making it language independent.","711":"**Euclidean Norm Regularization** is a regularization step used in [generative adversarial networks](https:\/\/paperswithcode.com\/methods\/category\/generative-adversarial-networks), and is typically added to both the generator and discriminator losses:\r\n\r\n$$ R\\_{z} = w\\_{r} \\cdot ||\\Delta{z}||^{2}\\_{2} $$\r\n\r\nwhere the scalar weight $w\\_{r}$ is a parameter.\r\n\r\nImage: [LOGAN](https:\/\/paperswithcode.com\/method\/logan)","712":"**Latent Optimisation** is a technique used for generative adversarial networks to refine the sample quality of $z$. Specifically, it exploits knowledge from the discriminator $D$ to refine the latent source $z$. Intuitively, the gradient $\\nabla\\_{z}f\\left(z\\right) = \\delta{f}\\left(z\\right)\\delta{z}$ points in the direction that better satisfies the discriminator $D$, which implies better samples. 
Therefore, instead of using the randomly sampled $z \\sim p\\left(z\\right)$, we uses the optimised latent:\r\n\r\n$$ \\Delta{z} = \\alpha\\frac{\\delta{f}\\left(z\\right)}{\\delta{z}} $$\r\n\r\n$$ z' = z + \\Delta{z} $$\r\n\r\nSource: [LOGAN](https:\/\/paperswithcode.com\/method\/logan)\r\n.","713":"**CS-GAN** is a type of generative adversarial network that uses a form of deep compressed sensing, and [latent optimisation](https:\/\/paperswithcode.com\/method\/latent-optimisation), to improve the quality of generated samples.","714":"**LOGAN** is a generative adversarial network that uses a latent optimization approach using [natural gradient descent](https:\/\/paperswithcode.com\/method\/natural-gradient-descent) (NGD). For the Fisher matrix in NGD, the authors use the empirical Fisher $F'$ with Tikhonov damping:\r\n\r\n$$ F' = g \\cdot g^{T} + \\beta{I} $$\r\n\r\nThey also use Euclidian Norm regularization for the optimization step.\r\n\r\nFor LOGAN's base architecture, [BigGAN-deep](https:\/\/paperswithcode.com\/method\/biggan-deep) is used with a few modifications: increasing the size of the latent source from $186$ to $256$, to compensate the randomness of the source lost\r\nwhen optimising $z$. 2, using the uniform distribution $U\\left(\u22121, 1\\right)$ instead of the standard normal distribution $N\\left(0, 1\\right)$ for $p\\left(z\\right)$ to be consistent with the clipping operation, using leaky [ReLU](https:\/\/paperswithcode.com\/method\/relu) (with the slope of 0.2 for the negative part) instead of ReLU as the non-linearity for smoother gradient flow for $\\frac{\\delta{f}\\left(z\\right)}{\\delta{z}}$ .","715":"**BigGAN-deep** is a deeper version (4x) of [BigGAN](https:\/\/paperswithcode.com\/method\/biggan). The main difference is a slightly differently designed [residual block](https:\/\/paperswithcode.com\/method\/residual-block). Here the $z$ vector is concatenated with the conditional vector without splitting it into chunks. It is also based on residual blocks with bottlenecks. BigGAN-deep uses a different strategy than BigGAN aimed at preserving identity throughout the skip connections. In G, where the number of channels needs to be reduced, BigGAN-deep simply retains the first group of channels and drop the rest to produce the required number of channels. In D, where the number of channels should be increased, BigGAN-deep passes the input channels unperturbed, and concatenates them with the remaining channels produced by a 1 \u00d7 1 [convolution](https:\/\/paperswithcode.com\/method\/convolution). As far as the\r\nnetwork configuration is concerned, the discriminator is an exact reflection of the generator. \r\n\r\nThere are two blocks at each resolution (BigGAN uses one), and as a result BigGAN-deep is four times\r\ndeeper than BigGAN. Despite their increased depth, the BigGAN-deep models have significantly\r\nfewer parameters mainly due to the bottleneck structure of their residual blocks.","716":"**CondConv**, or **Conditionally Parameterized Convolutions**, are a type of [convolution](https:\/\/paperswithcode.com\/method\/convolution) which learn specialized convolutional kernels for each example. In particular, we parameterize the convolutional kernels in a CondConv layer as a linear combination of $n$ experts $(\\alpha_1 W_1 + \\ldots + \\alpha_n W_n) * x$, where $\\alpha_1, \\ldots, \\alpha_n$ are functions of the input learned through gradient descent. To efficiently increase the capacity of a CondConv layer, developers can increase the number of experts. 
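A simplified sketch of a CondConv layer, assuming a global-average-pool followed by a linear layer and a sigmoid as the routing function that produces the $\alpha_i$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondConv2d(nn.Module):
    """Per-example kernel = sum_i alpha_i(x) * W_i over n experts."""

    def __init__(self, in_ch, out_ch, kernel_size, num_experts=4):
        super().__init__()
        self.experts = nn.Parameter(
            torch.randn(num_experts, out_ch, in_ch, kernel_size, kernel_size) * 0.01
        )
        self.routing = nn.Linear(in_ch, num_experts)  # alpha_i from pooled input

    def forward(self, x):
        b = x.size(0)
        alphas = torch.sigmoid(self.routing(x.mean(dim=(2, 3))))  # (B, num_experts)
        # Mix expert kernels once per example, then run a grouped convolution so
        # each example in the batch uses its own combined kernel.
        kernels = torch.einsum("be,eoihw->boihw", alphas, self.experts)
        out_ch, in_ch, k, _ = kernels.shape[1:]
        x = x.reshape(1, b * in_ch, *x.shape[2:])
        kernels = kernels.reshape(b * out_ch, in_ch, k, k)
        out = F.conv2d(x, kernels, padding=k // 2, groups=b)
        return out.reshape(b, out_ch, *out.shape[2:])

layer = CondConv2d(in_ch=16, out_ch=32, kernel_size=3, num_experts=4)
print(layer(torch.randn(8, 16, 24, 24)).shape)  # torch.Size([8, 32, 24, 24])
```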
This can be more computationally efficient than increasing the size of the convolutional kernel itself, because the convolutional kernel is applied at many different positions within the input, while the experts are combined only once per input.","717":"**Cascade Mask R-CNN** extends [Cascade R-CNN](https:\/\/paperswithcode.com\/method\/cascade-r-cnn) to instance segmentation, by adding a\r\nmask head to the cascade.\r\n\r\nIn the [Mask R-CNN](https:\/\/paperswithcode.com\/method\/mask-r-cnn), the segmentation branch is inserted in parallel to the detection branch. However, the Cascade [R-CNN](https:\/\/paperswithcode.com\/method\/r-cnn) has multiple detection branches. This raises the questions of 1) where to add the segmentation branch and 2) how many segmentation branches to add. The authors consider three strategies for mask prediction in the Cascade R-CNN. The first two strategies address the first question, adding a single mask prediction head at either the first or last stage of the Cascade R-CNN. Since the instances used to train the segmentation branch are the positives of the detection branch, their number varies in these two strategies. Placing the segmentation head later on the cascade leads to more examples. However, because segmentation is a pixel-wise operation, a large number of highly overlapping instances is not necessarily as helpful as for object detection, which is a patch-based operation. The third strategy addresses the second question, adding a segmentation branch to each\r\ncascade stage. This maximizes the diversity of samples used to learn the mask prediction task. \r\n\r\nAt inference time, all three strategies predict the segmentation masks on the patches produced by the final object detection stage, irrespective of the cascade stage on which the segmentation mask is implemented and how many segmentation branches there are.","718":"**PolarMask** is an anchor-box free and single-shot instance segmentation method. Specifically, PolarMask takes an image as input and predicts the distance from a sampled positive location (ie a candidate object's center) with respect to the object's contour at each angle, and then assembles the predicted points to produce the final mask. There are several benefits to the system: (1) The polar representation unifies instance segmentation (masks) and object detection (bounding boxes) into a single framework (2) Two modules are designed (i.e. soft polar centerness and polar IoU loss) to sample high-quality center examples and optimize polar contour regression, making the performance of PolarMask does not depend on the bounding box prediction results and more efficient in training. (3) PolarMask is fully convolutional and can be embedded into most off-the-shelf detection methods.","719":"**InfoGAN** is a type of generative adversarial network that modifies the [GAN](https:\/\/paperswithcode.com\/method\/gan) objective to\r\nencourage it to learn interpretable and meaningful representations. 
This is done by maximizing the\r\nmutual information between a fixed small subset of the GAN\u2019s noise variables and the observations.\r\n\r\nFormally, InfoGAN is defined as a minimax game with a variational regularization of mutual information and the hyperparameter $\\lambda$:\r\n\r\n$$ \\min\\_{G, Q}\\max\\_{D}V\\_{INFOGAN}\\left(D, G, Q\\right) = V\\left(D, G\\right) - \\lambda{L}\\_{I}\\left(G, Q\\right) $$\r\n\r\nWhere $Q$ is an auxiliary distribution that approximates the posterior $P\\left(c\\mid{x}\\right)$ - the probability of the latent code $c$ given the data $x$ - and $L\\_{I}$ is the variational lower bound of the mutual information between the latent code and the observations.\r\n\r\nIn the practical implementation, there is another fully-connected layer to output parameters for the conditional distribution $Q$ (negligible computation ontop of regular GAN structures). Q is represented with a [softmax](https:\/\/paperswithcode.com\/method\/softmax) non-linearity for a categorical latent code. For a continuous latent code, the authors assume a factored Gaussian.","720":"**AutoML-Zero** is an AutoML technique that aims to search a fine-grained space simultaneously for the model, optimization procedure, initialization, and so on, permitting much less human-design and even allowing the discovery of non-neural network algorithms. It represents ML algorithms as computer programs comprised of three component functions, Setup, Predict, and Learn, that performs initialization, prediction and learning. The instructions in these functions apply basic mathematical operations on a small memory. The operation and memory addresses used by each instruction are free parameters in the search space, as is the size of the component functions. While this reduces expert design, the consequent sparsity means that [random search](https:\/\/paperswithcode.com\/method\/random-search) cannot make enough progress. To overcome this difficulty, the authors use small proxy tasks and migration techniques to build an optimized infrastructure capable of searching through 10,000 models\/second\/cpu core.\r\n\r\nEvolutionary methods can find solutions in the AutoML-Zero search space despite its enormous\r\nsize and sparsity. The authors show that by randomly modifying the programs and periodically selecting the best performing ones on given tasks\/datasets, AutoML-Zero discovers reasonable algorithms. They start from empty programs and using data labeled by \u201cteacher\u201d neural networks with random weights, and demonstrate evolution can discover neural networks trained by gradient descent. Following this, they minimize bias toward known algorithms by switching to binary classification tasks extracted from CIFAR-10 and allowing a larger set of possible operations. This discovers interesting techniques like multiplicative interactions, normalized gradient and weight averaging. Finally, they show it is possible for evolution to adapt the algorithm to the type of task provided. For example, [dropout](https:\/\/paperswithcode.com\/method\/dropout)-like operations emerge when the task needs regularization and learning rate decay appears when the task requires faster convergence.","721":"**Self-adaptive Training** is a training algorithm that dynamically corrects problematic training labels by model predictions to improve generalization of deep learning for potentially corrupted training data. Accumulated predictions are used to augment the training dynamics. 
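A small sketch of one such accumulation step, assuming the exponential-moving-average scheme described next (the momentum value and tensor shapes are arbitrary):

```python
import torch

def update_training_targets(targets, probs, momentum=0.9):
    """Blend the running soft targets with the model's predicted probabilities.

    targets : (N, C) running soft labels, initialised from the (possibly noisy)
              one-hot ground truth
    probs   : (N, C) softmax outputs of the model for the same examples
    """
    return momentum * targets + (1.0 - momentum) * probs

# One noisy-labelled example with 3 classes: the target drifts toward the
# model's accumulated prediction over repeated updates.
t = torch.tensor([[1.0, 0.0, 0.0]])
p = torch.tensor([[0.1, 0.8, 0.1]])
for _ in range(5):
    t = update_training_targets(t, p)
print(t)
```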
The use of an exponential-moving-average scheme alleviates the instability issue of model predictions, smooths out the training target during the training process and enables the algorithm to completely change the training labels if necessary.","722":"OSCAR is a new learning method that uses object tags detected in images as anchor points to ease the learning of image-text alignment. The model take a triple as input (word-tag-region) and pre-trained with two losses (masked token loss over words and tags, and a contrastive loss between tags and others). OSCAR represents an image-text pair into semantic space via dictionary lookup. Object tags are used as anchor points to align image regions with word embeddings of pre-trained language models. The model is then fine-tuned for understanding and generation tasks.","723":"A **PixelCNN** is a generative model that uses autoregressive connections to model images pixel by pixel, decomposing the joint image distribution as a product of conditionals. PixelCNNs are much faster to train than [PixelRNNs](https:\/\/paperswithcode.com\/method\/pixelrnn) because convolutions are inherently easier to parallelize; given the vast number of pixels present in large image datasets this is an important advantage.","724":"**Sarsa** is an on-policy TD control algorithm:\r\n\r\n$$Q\\left(S\\_{t}, A\\_{t}\\right) \\leftarrow Q\\left(S\\_{t}, A\\_{t}\\right) + \\alpha\\left[R_{t+1} + \\gamma{Q}\\left(S\\_{t+1}, A\\_{t+1}\\right) - Q\\left(S\\_{t}, A\\_{t}\\right)\\right] $$\r\n\r\nThis update is done after every transition from a nonterminal state $S\\_{t}$. if $S\\_{t+1}$ is terminal, then $Q\\left(S\\_{t+1}, A\\_{t+1}\\right)$ is defined as zero.\r\n\r\nTo design an on-policy control algorithm using Sarsa, we estimate $q\\_{\\pi}$ for a behaviour policy $\\pi$ and then change $\\pi$ towards greediness with respect to $q\\_{\\pi}$.\r\n\r\nSource: Sutton and Barto, Reinforcement Learning, 2nd Edition","725":"A wavelet **scattering transform** computes a translation invariant representation, which is stable to deformation, using a deep [convolution](https:\/\/paperswithcode.com\/method\/convolution) network architecture. It computes non-linear invariants with modulus and averaging pooling functions. It helps to eliminate the image variability due to translation and is stable to deformations. \r\n\r\nImage source: [Bruna and Mallat](https:\/\/arxiv.org\/pdf\/1203.1513v2.pdf)","726":"**Sparse R-CNN** is a purely sparse method for object detection in images, without object positional candidates enumerating\r\non all(dense) image grids nor object queries interacting with global(dense) image feature.\r\n\r\nAs shown in the Figure, object candidates are given with a fixed small set of learnable bounding boxes represented by 4-d coordinate. For the example of the COCO dataset, 100 boxes and 400 parameters are needed in total, rather than the predicted ones from hundreds of thousands of candidates in a Region Proposal Network ([RPN](https:\/\/paperswithcode.com\/method\/rpn)). These sparse candidates are used as proposal boxes to extract the feature of Region of Interest (RoI) by [RoIPool](https:\/\/paperswithcode.com\/method\/roi-pooling) or [RoIAlign](https:\/\/paperswithcode.com\/method\/roi-align).","727":"**Deterministic Policy Gradient**, or **DPG**, is a policy gradient method for reinforcement learning. 
Instead of the policy function $\\pi\\left(.\\mid{s}\\right)$ being modeled as a probability distribution, DPG considers and calculates gradients for a deterministic policy $a = \\mu\\_{theta}\\left(s\\right)$.","728":"Object Dropout is a technique that perturbs object features in an image for [noisy student](https:\/\/paperswithcode.com\/method\/noisy-student) training. It performs at par with standard data augmentation techniques while being significantly faster than the latter to implement.","729":"**Noisy Student Training** is a semi-supervised learning approach. It extends the idea of self-training\r\nand distillation with the use of equal-or-larger student models and noise added to the student during learning. It has three main steps: \r\n\r\n1. train a teacher model on labeled images\r\n2. use the teacher to generate pseudo labels on unlabeled images\r\n3. train a student model on the combination of labeled images and pseudo labeled images. \r\n\r\nThe algorithm is iterated a few times by treating the student as a teacher to relabel the unlabeled data and training a new student.\r\n\r\nNoisy Student Training seeks to improve on self-training and distillation in two ways. First, it makes the student larger than, or at least equal to, the teacher so the student can better learn from a larger dataset. Second, it adds noise to the student so the noised student is forced to learn harder from the pseudo labels. To noise the student, it uses input noise such as [RandAugment](https:\/\/paperswithcode.com\/method\/randaugment) data augmentation, and model noise such as [dropout](https:\/\/paperswithcode.com\/method\/dropout) and [stochastic depth](https:\/\/paperswithcode.com\/method\/stochastic-depth) during training.","730":"**Poincar\u00e9 Embeddings** learn hierarchical representations of symbolic data by embedding them into hyperbolic space -- or more precisely into an $n$-dimensional Poincar\u00e9 ball. Due to the underlying hyperbolic geometry, this allows for learning of parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. Embeddings are learnt based on\r\nRiemannian optimization.","731":"**End-to-End Neural Diarization** is a neural network for speaker diarization in which a neural network directly outputs speaker diarization results given a multi-speaker recording. To realize such an end-to-end model, the speaker diarization problem is formulated as a multi-label classification problem and a permutation-free objective function is introduced to directly minimize diarization errors. The EEND method can explicitly handle speaker overlaps during training and inference. Just by feeding multi-speaker recordings with corresponding speaker segment labels, the model can be adapted to real conversations.","732":"**GrowNet** is a novel approach to combine the power of gradient boosting to incrementally build complex deep neural networks out of shallow components. It introduces a versatile framework that can readily be adapted for a diverse range of machine learning tasks in a wide variety of domains.","733":"Contextualized Topic Models are based on the Neural-ProdLDA variational autoencoding approach by Srivastava and Sutton (2017). \r\n\r\nThis approach trains an encoding neural network to map pre-trained contextualized word embeddings (e.g., [BERT](https:\/\/paperswithcode.com\/method\/bert)) to latent representations. 
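A condensed sketch of this encoder together with the sampling and reconstruction step described next; the layer sizes, the 768-dimensional sentence embedding, and the softmax over topics are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ContextualizedTopicModel(nn.Module):
    """Contextual document embedding -> (mu, log_var) -> sampled topic vector
    -> decoder that reconstructs the document bag-of-words."""

    def __init__(self, embedding_dim=768, num_topics=50, vocab_size=2000):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(embedding_dim, 100), nn.Softplus())
        self.to_mu = nn.Linear(100, num_topics)
        self.to_log_var = nn.Linear(100, num_topics)
        self.decoder = nn.Linear(num_topics, vocab_size)  # topic-word weights

    def forward(self, doc_embedding):
        h = self.encoder(doc_embedding)
        mu, log_var = self.to_mu(h), self.to_log_var(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterise
        theta = torch.softmax(z, dim=-1)                          # topic proportions
        return torch.log_softmax(self.decoder(theta), dim=-1), mu, log_var

model = ContextualizedTopicModel()
recon, mu, log_var = model(torch.randn(4, 768))
print(recon.shape)  # per-document log word distribution: torch.Size([4, 2000])
```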
Those latent representations are sampled variationally from a Gaussian distribution $N(\\mu, \\sigma^2)$ and passed to a decoder network that has to reconstruct the document bag-of-word representation.","734":"**TimeSformer** is a [convolution](https:\/\/paperswithcode.com\/method\/convolution)-free approach to video classification built exclusively on self-attention over space and time. It adapts the standard [Transformer](https:\/\/paperswithcode.com\/method\/transformer) architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Specifically, the method adapts the image model [[Vision Transformer](https:\/\/paperswithcode.com\/method\/vision-transformer)](https\/\/www.paperswithcode.com\/method\/vision-transformer) (ViT) to video by extending the self-attention mechanism from the image space to the space-time 3D volume. As in ViT, each patch is linearly mapped into an embedding and augmented with positional information. This makes it possible to interpret the resulting sequence of vector","735":"**HS-ResNet** is a [convolutional neural network](https:\/\/paperswithcode.com\/methods\/category\/convolutional-neural-networks) that employs [Hierarchical-Split Block](https:\/\/paperswithcode.com\/method\/hierarchical-split-block) as its central building block within a [ResNet](https:\/\/paperswithcode.com\/method\/resnet)-like architecture.","736":"**Hierarchical-Split Block** is a representational block for multi-scale feature representations. It contains many hierarchical split and concatenate connections within one single [residual block](https:\/\/paperswithcode.com\/methods\/category\/skip-connection-blocks). \r\n\r\nSpecifically, ordinary feature maps in deep neural networks are split into $s$ groups, each with $w$ channels. As shown in the Figure, only the first group of filters can be straightly connected to next layer. The second group of feature maps are sent to a convolution of $3 \\times 3$ filters to extract features firstly, then the output feature maps are split into two sub-groups in the channel dimension. One sub-group of feature maps straightly connected to next layer, while the other sub-group is concatenated with the next group of input feature maps in the channel dimension. The concatenated feature maps are operated by a set of $3 \\times 3$ convolutional filters. This process repeats several times until the rest of input feature maps are processed. Finally, features maps from all input groups are concatenated and sent to another layer of $1 \\times 1$ filters to rebuild the features.","737":"A **Ghost Module** is an image block for convolutional neural network that aims to generate more features by using fewer parameters. Specifically, an ordinary convolutional layer in deep neural networks is split into two parts. The first part involves ordinary convolutions but their total number is controlled. Given the intrinsic feature maps from the first part, a series of simple linear operations are applied for generating more feature maps. \r\n\r\nGiven the widely existing redundancy in intermediate feature maps calculated by mainstream CNNs, ghost modules aim to reduce them. 
In practice, given the input data $X\\in\\mathbb{R}^{c\\times h\\times w}$, where $c$ is the number of input channels and $h$ and $w$ are the height and width of the input data, respectively, the operation of an arbitrary convolutional layer for producing $n$ feature maps can be formulated as\r\n\r\n$$\r\nY = X*f+b,\r\n$$\r\n\r\nwhere $*$ is the [convolution](https:\/\/paperswithcode.com\/method\/convolution) operation, $b$ is the bias term, $Y\\in\\mathbb{R}^{h'\\times w'\\times n}$ is the output feature map with $n$ channels, and $f\\in\\mathbb{R}^{c\\times k\\times k \\times n}$ is the convolution filters in this layer. In addition, $h'$ and $w'$ are the height and width of the output data, and $k\\times k$ is the kernel size of convolution filters $f$, respectively. During this convolution procedure, the required number of FLOPs can be calculated as $n\\cdot h'\\cdot w'\\cdot c\\cdot k\\cdot k$, which is often as large as hundreds of thousands since the number of filters $n$ and the channel number $c$ are generally very large (e.g. 256 or 512).\r\n\r\nHere, the number of parameters (in $f$ and $b$) to be optimized is explicitly determined by the dimensions of input and output feature maps. The output feature maps of convolutional layers often contain much redundancy, and some of them could be similar with each other. We point out that it is unnecessary to generate these redundant feature maps one by one with large number of FLOPs and parameters. Suppose that the output feature maps are *ghosts* of a handful of intrinsic feature maps with some cheap transformations. These intrinsic feature maps are often of smaller size and produced by ordinary convolution filters. Specifically, $m$ intrinsic feature maps $Y'\\in\\mathbb{R}^{h'\\times w'\\times m}$ are generated using a primary convolution:\r\n\r\n$$\r\nY' = X*f',\r\n$$\r\n\r\nwhere $f'\\in\\mathbb{R}^{c\\times k\\times k \\times m}$ is the utilized filters, $m\\leq n$ and the bias term is omitted for simplicity. The hyper-parameters such as filter size, stride, padding, are the same as those in the ordinary convolution to keep the spatial size (ie $h'$ and $w'$) of the output feature maps consistent. To further obtain the desired $n$ feature maps, we apply a series of cheap linear operations on each intrinsic feature in $Y'$ to generate $s$ ghost features according to the following function:\r\n\r\n$$\r\ny_{ij} = \\Phi_{i,j}(y'_i),\\quad \\forall\\; i = 1,...,m,\\;\\; j = 1,...,s,\r\n$$\r\n\r\nwhere $y'\\_i$ is the $i$-th intrinsic feature map in $Y'$, $\\Phi\\_{i,j}$ in the above function is the $j$-th (except the last one) linear operation for generating the $j$-th ghost feature map $y_{ij}$, that is to say, $y'\\_i$ can have one or more ghost feature maps $\\{y\\_{ij}\\}\\_{j=1}^{s}$. The last $\\Phi\\_{i,s}$ is the identity mapping for preserving the intrinsic feature maps. we can obtain $n=m\\cdot s$ feature maps $Y=[y\\_{11},y\\_{12},\\cdots,y\\_{ms}]$ as the output data of a Ghost module. Note that the linear operations $\\Phi$ operate on each channel whose computational cost is much less than the ordinary convolution. 
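A compact PyTorch sketch of the module just described, assuming depthwise convolutions as the cheap linear operations $\Phi$ and $s = 2$:

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Primary convolution produces m intrinsic maps; cheap per-channel
    (depthwise) operations generate (s - 1) ghost maps per intrinsic map;
    the identity keeps the intrinsic maps, giving n = m * s outputs."""

    def __init__(self, in_ch, out_ch, ratio=2, kernel_size=1, cheap_kernel=3):
        super().__init__()
        m = out_ch // ratio                      # number of intrinsic feature maps
        self.primary = nn.Conv2d(in_ch, m, kernel_size, padding=kernel_size // 2)
        self.cheap = nn.Conv2d(m, m * (ratio - 1), cheap_kernel,
                               padding=cheap_kernel // 2, groups=m)

    def forward(self, x):
        intrinsic = self.primary(x)              # Y' = X * f'
        ghosts = self.cheap(intrinsic)           # Phi applied channel-wise
        return torch.cat([intrinsic, ghosts], dim=1)

ghost = GhostModule(in_ch=16, out_ch=32, ratio=2)
print(ghost(torch.randn(1, 16, 56, 56)).shape)   # torch.Size([1, 32, 56, 56])
```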
In practice, there could be several different linear operations in a Ghost module, eg $3\\times 3$ and $5\\times5$ linear kernels, which will be analyzed in the experiment part.","738":"A **Ghost BottleNeck** is a skip connection block, similar to the basic [residual block](https:\/\/paperswithcode.com\/method\/residual-block) in [ResNet](https:\/\/paperswithcode.com\/method\/resnet) in which several convolutional layers and shortcuts are integrated, but stacks [Ghost Modules](https:\/\/paperswithcode.com\/method\/ghost-module) instead (two stacked Ghost modules). It was proposed as part of the [GhostNet](https:\/\/paperswithcode.com\/method\/ghostnet) CNN architecture.\r\n\r\nThe first Ghost module acts as an expansion layer increasing the number of channels. The ratio between the number of the output channels and that of the input is referred to as the *expansion ratio*. The second Ghost module reduces the number of channels to match the shortcut path. Then the shortcut is connected between the inputs and the outputs of these two Ghost modules. The [batch normalization](https:\/\/paperswithcode.com\/method\/batch-normalization) (BN) and [ReLU](https:\/\/paperswithcode.com\/method\/relu) nonlinearity are applied after each layer, except that ReLU is not used after the second Ghost module as suggested by [MobileNetV2](https:\/\/paperswithcode.com\/method\/mobilenetv2). The Ghost bottleneck described above is for stride=1. As for the case where stride=2, the shortcut path is implemented by a downsampling layer and a [depthwise convolution](https:\/\/paperswithcode.com\/method\/depthwise-convolution) with stride=2 is inserted between the two Ghost modules. In practice, the primary [convolution](https:\/\/paperswithcode.com\/method\/convolution) in Ghost module here is [pointwise convolution](https:\/\/paperswithcode.com\/method\/pointwise-convolution) for its efficiency.","739":"A **GhostNet** is a type of convolutional neural network that is built using Ghost modules, which aim to generate more features by using fewer parameters (allowing for greater efficiency). \r\n\r\nGhostNet mainly consists of a stack of Ghost bottlenecks with the Ghost modules as the building block. The first layer is a standard convolutional layer with 16 filters, then a series of Ghost bottlenecks with gradually increased channels follow. These Ghost bottlenecks are grouped into different stages according to the sizes of their input feature maps. All the Ghost bottlenecks are applied with stride=1 except that the last one in each stage is with stride=2. At last a [global average pooling](https:\/\/paperswithcode.com\/method\/global-average-pooling) and a convolutional layer are utilized to transform the feature maps to a 1280-dimensional feature vector for final classification. The squeeze and excite (SE) module is also applied to the residual layer in some ghost bottlenecks. \r\n\r\nIn contrast to [MobileNetV3](https:\/\/paperswithcode.com\/method\/mobilenetv3), GhostNet does not use [hard-swish](https:\/\/paperswithcode.com\/method\/hard-swish) nonlinearity function due to its large latency.","740":"Bi-attention employs the attention-in-attention (AiA) mechanism to capture second-order statistical information: the outer point-wise channel attention vectors are computed from the output of the inner channel attention.","741":"**Guided Anchoring** is an anchoring scheme for object detection which leverages semantic features to guide the anchoring. 
Guided Anchoring is motivated by the observation that objects are not distributed evenly over the image. The scale of an object is also closely related to the image content, its location, and the geometry of the scene. Following this intuition, the method generates sparse anchors in two steps: first identifying sub-regions that may contain objects and then determining the shapes at different locations.","742":"**RetinaNet-RS** is an object detection model produced through a model scaling method based on changing the input resolution and [ResNet](https:\/\/paperswithcode.com\/method\/resnet) backbone depth. For [RetinaNet](https:\/\/paperswithcode.com\/method\/retinanet), the input resolution is scaled up from 512 to 768 and the ResNet backbone depth from 50 to 152. As RetinaNet performs dense one-stage object detection, the authors find that scaling up the input resolution leads to large-resolution feature maps and hence more anchors to process. This results in higher-capacity dense prediction heads and expensive NMS. Scaling stops at input resolution 768 \u00d7 768 for RetinaNet.","743":"**Linear Warmup** is a learning rate schedule where the learning rate is increased linearly from a low value to a target value, after which it is held constant. This reduces volatility in the early stages of training.\r\n\r\nImage Credit: [Chengwei Zhang](https:\/\/www.dlology.com\/about-me\/)","744":"**CTRL** is a conditional [transformer](https:\/\/paperswithcode.com\/method\/transformer) language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence.","745":"**Cross-Scale Non-Local Attention**, or **CS-NL**, is a non-local attention module for image super-resolution deep networks. It learns to mine long-range dependencies between LR features and larger-scale HR patches within the same feature map. Specifically, suppose we are conducting $s$-scale super-resolution with the module. Given a feature map $X$ of spatial size $(W, H)$, we first bilinearly downsample it to $Y$ with scale $s$, and match the $p\\times p$ patches in $X$ with the downsampled $p \\times p$ candidates in $Y$ to obtain the [softmax](https:\/\/paperswithcode.com\/method\/softmax) matching score. Finally, we conduct deconvolution on the score by weighted adding of the patches of size $\\left(sp, sp\\right)$ extracted from $X$. The resulting $Z$ of size $(sW, sH)$ is $s$ times super-resolved relative to $X$.","746":"**Contextual Residual Aggregation**, or **CRA**, is a module for image inpainting. It can produce high-frequency residuals for missing contents by weighted aggregation of residuals from contextual patches, thus only requiring a low-resolution prediction from the network. Specifically, it involves a neural network that predicts a low-resolution inpainted result and up-samples it to yield a large blurry image. We then produce the high-frequency residuals for in-hole patches by aggregating weighted high-frequency residuals from contextual patches. Finally, we add the aggregated residuals to the large blurry image to obtain a sharp result.","747":"Vision-Language Pre-training (VLP) has advanced performance on many vision-language tasks. 
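As a brief aside before returning to vision-language pre-training, here is a minimal sketch of the Linear Warmup rule described above; the function name and arguments are illustrative.

```python
def linear_warmup(step, warmup_steps, base_lr, init_lr=0.0):
    """Increase the learning rate linearly from init_lr to base_lr over
    warmup_steps, then hold it constant at base_lr."""
    if step < warmup_steps:
        return init_lr + (base_lr - init_lr) * step / warmup_steps
    return base_lr

# Example: with warmup_steps=1000 and base_lr=1e-3, step 500 gives 5e-4.
```

In PyTorch, the same rule can be expressed as a multiplier on the base learning rate and passed to `torch.optim.lr_scheduler.LambdaLR`.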
However, most existing vision-language pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https:\/\/github.com\/salesforce\/BLIP.","748":"An **Accumulating Eligibility Trace** is a type of [eligibility trace](https:\/\/paperswithcode.com\/method\/eligibility-trace) where the trace increments in an accumulative way. For the memory vector $\\textbf{e}\\_{t} \\in \\mathbb{R}^{b}$, with $\\textbf{e}\\_{t} \\geq \\textbf{0}$:\r\n\r\n$$\\mathbf{e\\_{0}} = \\textbf{0}$$\r\n\r\n$$\\textbf{e}\\_{t} = \\nabla{\\hat{v}}\\left(S\\_{t}, \\mathbf{\\theta}\\_{t}\\right) + \\gamma\\lambda\\textbf{e}\\_{t-1}$$","749":"**TD($\\lambda$)** is a generalisation of **TD** reinforcement learning algorithms that employs an [eligibility trace](https:\/\/paperswithcode.com\/method\/eligibility-trace) with decay parameter $\\lambda$ and $\\lambda$-weighted returns. The eligibility trace vector is initialized to zero at the beginning of the episode; it is incremented on each time step by the value gradient and then fades away by $\\gamma\\lambda$:\r\n\r\n$$ \\textbf{z}\\_{-1} = \\mathbf{0} $$\r\n$$ \\textbf{z}\\_{t} = \\gamma\\lambda\\textbf{z}\\_{t-1} + \\nabla\\hat{v}\\left(S\\_{t}, \\mathbf{w}\\_{t}\\right), 0 \\leq t \\leq T$$\r\n\r\nThe eligibility trace keeps track of which components of the weight vector contribute to recent state valuations. Here $\\nabla\\hat{v}\\left(S\\_{t}, \\mathbf{w}\\_{t}\\right)$ is the gradient of the value estimate (the feature vector in the linear case).\r\n\r\nThe TD error for state-value prediction is:\r\n\r\n$$ \\delta\\_{t} = R\\_{t+1} + \\gamma\\hat{v}\\left(S\\_{t+1}, \\mathbf{w}\\_{t}\\right) - \\hat{v}\\left(S\\_{t}, \\mathbf{w}\\_{t}\\right) $$\r\n\r\nIn **TD($\\lambda$)**, the weight vector is updated on each step proportionally to the scalar TD error and the vector eligibility trace:\r\n\r\n$$ \\mathbf{w}\\_{t+1} = \\mathbf{w}\\_{t} + \\alpha\\delta\\_{t}\\mathbf{z}\\_{t} $$\r\n\r\nSource: Sutton and Barto, Reinforcement Learning, 2nd Edition","750":"**TD-Gammon** is a game-learning architecture for playing backgammon. It involves the use of a $TD\\left(\\lambda\\right)$ learning algorithm and a feedforward neural network.\r\n\r\nCredit: [Temporal Difference Learning and TD-Gammon](https:\/\/cling.csd.uwo.ca\/cs346a\/extra\/tdgammon.pdf)","751":"**Self-supervised Equivariant Attention Mechanism**, or **SEAM**, is an attention mechanism for weakly supervised semantic segmentation. SEAM applies consistency regularization on CAMs from various transformed images to provide self-supervision for network learning. 
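Before continuing with SEAM, here is a minimal sketch of one episode of the TD($\lambda$) update described above, assuming linear value-function approximation (so the gradient is simply the feature vector) and a hypothetical gym-style environment whose `step` returns `(state, reward, done)`; the `features` function and the random placeholder policy are also assumptions.

```python
import numpy as np

def td_lambda_episode(env, features, w, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of semi-gradient TD(lambda) with a linear value function
    v_hat(s, w) = w . features(s), so grad v_hat = features(s)."""
    z = np.zeros_like(w)                 # eligibility trace, z_{-1} = 0
    s = env.reset()
    done = False
    while not done:
        a = env.action_space.sample()    # placeholder policy (assumed interface)
        s_next, r, done = env.step(a)    # assumed (state, reward, done) interface
        x = features(s)
        z = gamma * lam * z + x          # fade old credit, add current gradient
        v_next = 0.0 if done else w @ features(s_next)
        delta = r + gamma * v_next - w @ x          # TD error
        w = w + alpha * delta * z                   # update along the trace
        s = s_next
    return w
```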
To further improve the network prediction consistency, SEAM introduces the pixel correlation module (PCM), which captures context appearance information for each pixel and revises the original CAMs using learned affinity attention maps. SEAM is implemented as a [siamese network](https:\/\/paperswithcode.com\/method\/siamese-network) with an equivariant cross regularization (ECR) loss, which regularizes the original CAMs and the revised CAMs on different branches.","752":"**VIME**, or **Value Imputation and Mask Estimation**, is a self- and semi-supervised learning framework for tabular data. It consists of a pretext task of estimating mask vectors from corrupted tabular data, in addition to the reconstruction pretext task for self-supervised learning.","753":"**TDN**, or **Temporal Difference Network**, is an action recognition model that aims to capture multi-scale temporal information. To fully capture temporal information over the entire video, the TDN is established with a two-level difference modeling paradigm. Specifically, for local motion modeling, the temporal difference over consecutive frames is used to supply 2D CNNs with finer motion patterns, while for global motion modeling, the temporal difference across segments is incorporated to capture long-range structure for motion feature excitation.","754":"**BASNet**, or **Boundary-Aware Segmentation Network**, is an image segmentation architecture that combines a predict-refine architecture with a hybrid loss for highly accurate image segmentation. The predict-refine architecture consists of a densely supervised encoder-decoder network and a residual refinement module, which are respectively used to predict and refine a segmentation probability map. The hybrid loss is a combination of the binary cross entropy, structural similarity and intersection-over-union losses, which guide the network to learn three-level (i.e., pixel-, patch- and map-level) hierarchical representations.","755":"**Feature Matching** is a regularizing objective for a generator in [generative adversarial networks](https:\/\/paperswithcode.com\/methods\/category\/generative-adversarial-networks) that prevents it from overtraining on the current discriminator. Instead of directly maximizing the output of the discriminator, the new objective requires the generator to generate data that matches the statistics of the real data, where we use the discriminator only to specify the statistics that we think are worth matching. Specifically, we train the generator to match the expected value of the features on an intermediate layer of the discriminator. This is a natural choice of statistics for the generator to match, since by training the discriminator we ask it to find those features that are most discriminative of real data versus data generated by the current model.\r\n\r\nLetting $\\mathbf{f}\\left(\\mathbf{x}\\right)$ denote activations on an intermediate layer of the discriminator, our new objective for the generator is defined as: $ ||\\mathbb{E}\\_{x\\sim p\\_{data}} \\mathbf{f}\\left(\\mathbf{x}\\right) - \\mathbb{E}\\_{\\mathbf{z}\\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\mathbf{f}\\left(G\\left(\\mathbf{z}\\right)\\right)||^{2}\\_{2} $. The discriminator, and hence $\\mathbf{f}\\left(\\mathbf{x}\\right)$, is trained as with vanilla GANs. 
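A minimal sketch of this feature-matching objective, approximating the two expectations with batch means; the helper names (`D.features`, `G`) in the usage comment are illustrative, not a specific library API.

```python
import torch

def feature_matching_loss(f_real, f_fake):
    """Match the batch means of intermediate discriminator features on real
    and generated data: squared L2 distance between the two mean vectors."""
    # f_real, f_fake: (batch, feature_dim) activations from an intermediate
    # discriminator layer for real samples x and generated samples G(z).
    return (f_real.mean(dim=0) - f_fake.mean(dim=0)).pow(2).sum()

# Typical use inside a generator training step (names are illustrative):
#   f_real = D.features(x_real).detach()   # statistics of the real data
#   f_fake = D.features(G(z))
#   g_loss = feature_matching_loss(f_real, f_fake)
```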
As with regular [GAN](https:\/\/paperswithcode.com\/method\/gan) training, the objective has a fixed point where $G$ exactly matches the distribution of the training data.","756":"Syntax Heat Parse Trees are heatmaps over parse trees, similar to [\"heat trees\"](https:\/\/doi.org\/10.1371\/journal.pcbi.1005404) in biology.","757":"**U2-Net** is a two-level nested U-structure architecture that is designed for salient object detection (SOD). The architecture allows the network to go deeper and attain high resolution without significantly increasing the memory and computation cost. This is achieved with a nested U-structure: on the bottom level, a novel ReSidual U-block (RSU) module extracts intra-stage multi-scale features without degrading the feature map resolution; on the top level, there is a [U-Net](https:\/\/paperswithcode.com\/method\/u-net)-like structure in which each stage is filled by an RSU block.","758":"**Wasserstein GAN + Gradient Penalty**, or **WGAN-GP**, is a generative adversarial network that uses the Wasserstein loss formulation plus a gradient norm penalty to achieve Lipschitz continuity.\r\n\r\nThe original [WGAN](https:\/\/paperswithcode.com\/method\/wgan) uses weight clipping to achieve 1-Lipschitz functions, but this can lead to undesirable behaviour by creating pathological value surfaces and capacity underuse, as well as gradient explosion or vanishing without careful tuning of the weight clipping parameter $c$.\r\n\r\nA gradient penalty is a soft version of the Lipschitz constraint, which follows from the fact that functions are 1-Lipschitz iff their gradients are of norm at most 1 everywhere. The squared difference from norm 1 is used as the gradient penalty.","759":"**Orthogonal Regularization** is a regularization technique for convolutional neural networks, introduced with generative modelling as the task in mind. Orthogonality is argued to be a desirable quality in ConvNet filters, partially because multiplication by an orthogonal matrix leaves the norm of the original matrix unchanged. This property is valuable in deep or recurrent networks, where repeated matrix multiplication can result in signals vanishing or exploding. To try to maintain orthogonality throughout training, Orthogonal Regularization encourages weights to be orthogonal by pushing them towards the nearest orthogonal manifold. The objective function is augmented with the cost:\r\n\r\n$$ \\mathcal{L}\\_{ortho} = \\sum\\left(|WW^{T} - I|\\right) $$\r\n\r\nwhere $\\sum$ indicates a sum across all filter banks, $W$ is a filter bank, and $I$ is the identity matrix.","760":"**Spektral** is an open-source Python library for building graph neural networks with TensorFlow and the Keras application programming interface. Spektral implements a large set of methods for deep learning on graphs, including message-passing and pooling operators, as well as utilities for processing graphs and loading popular benchmark datasets.","761":"A **Zero-padded Shortcut Connection** is a type of [residual connection](https:\/\/paperswithcode.com\/method\/residual-connection) used in the [PyramidNet](https:\/\/paperswithcode.com\/method\/pyramidnet) architecture. For PyramidNets, identity mapping alone cannot be used for a shortcut because the feature map dimension differs among individual residual units. Therefore, only a zero-padded shortcut or a projection shortcut can be used for all the residual units. 
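As a short aside on the WGAN-GP penalty above, before returning to PyramidNet's shortcut connections: below is a minimal sketch of the gradient penalty term. It assumes image-shaped (NCHW) inputs and a hypothetical `discriminator` callable; the penalty coefficient defaults to 10, the value used in the original paper.

```python
import torch

def gradient_penalty(discriminator, real, fake, lambda_gp=10.0):
    """Penalize (||grad_{x_hat} D(x_hat)||_2 - 1)^2 at random interpolates
    x_hat between real and generated samples (minimal WGAN-GP sketch)."""
    batch_size = real.size(0)
    eps = torch.rand(batch_size, 1, 1, 1, device=real.device)   # per-sample mix
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_out = discriminator(x_hat)
    grads = torch.autograd.grad(outputs=d_out.sum(), inputs=x_hat,
                                create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```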
A projection shortcut, however, can hamper information propagation and lead to optimization problems, especially for very deep networks. The zero-padded shortcut, on the other hand, avoids the overfitting problem because it introduces no additional parameters.","762":"A **Pyramidal Residual Unit** is a type of residual unit where the number of channels gradually increases as a function of the depth at which the layer occurs, similar to a pyramid whose shape gradually widens from the top downwards. It was introduced as part of the [PyramidNet](https:\/\/paperswithcode.com\/method\/pyramidnet) architecture.","763":"A **Pyramidal Bottleneck Residual Unit** is a type of residual unit where the number of channels gradually increases as a function of the depth at which the layer occurs, similar to a pyramid whose shape gradually widens from the top downwards. It also contains a bottleneck built from 1x1 convolutions. It was introduced as part of the [PyramidNet](https:\/\/paperswithcode.com\/method\/pyramidnet) architecture.","764":"A **PyramidNet** is a type of convolutional network whose key idea is to increase the feature map dimension gradually, instead of increasing it sharply at each residual unit that performs downsampling. In addition, the network architecture works as a mixture of both plain and residual networks by using zero-padded identity-mapping shortcut connections when increasing the feature map dimension.","765":"**HiFi-GAN** is a generative adversarial network for speech synthesis. HiFi-GAN consists of one generator and two discriminators: multi-scale and multi-period discriminators. The generator and discriminators are trained adversarially, along with two additional losses for improving training stability and model performance.\r\n\r\nThe generator is a fully convolutional neural network. It takes a mel-spectrogram as input and upsamples it through transposed convolutions until the length of the output sequence matches the temporal resolution of raw waveforms. Every [transposed convolution](https:\/\/paperswithcode.com\/method\/transposed-convolution) is followed by a multi-receptive field fusion (MRF) module.\r\n\r\nFor the discriminator, a multi-period discriminator (MPD) is used, consisting of several sub-discriminators, each handling a portion of the periodic signals of the input audio. Additionally, to capture consecutive patterns and long-term dependencies, the multi-scale discriminator (MSD) proposed in [MelGAN](https:\/\/paperswithcode.com\/method\/melgan) is used, which consecutively evaluates audio samples at different levels.","766":"**SRU**, or **Simple Recurrent Unit**, is a recurrent neural unit with a light form of recurrence. SRU exhibits the same level of parallelism as [convolution](https:\/\/paperswithcode.com\/method\/convolution) and [feed-forward nets](https:\/\/paperswithcode.com\/methods\/category\/feedforward-networks). This is achieved by balancing sequential dependence and independence: while the state computation of SRU is time-dependent, each state dimension is independent. This simplification enables CUDA-level optimizations that parallelize the computation across hidden dimensions and time steps, effectively using the full capacity of modern GPUs. \r\n\r\nSRU also replaces the use of convolutions (i.e., n-gram filters), as in [QRNN](https:\/\/paperswithcode.com\/method\/qrnn) and KNN, with more recurrent connections. 
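Returning briefly to the zero-padded shortcut before the SRU discussion continues: here is a minimal sketch of the channel-growth part of that shortcut (the case where the residual branch has more channels than the identity path). The function name is illustrative, and spatial downsampling is not handled.

```python
import torch
import torch.nn.functional as F

def zero_padded_shortcut(x, out_channels):
    """Identity shortcut whose channel dimension is grown to out_channels
    by appending zero feature maps (no extra parameters)."""
    in_channels = x.size(1)
    if out_channels == in_channels:
        return x
    pad_channels = out_channels - in_channels
    # F.pad pads dimensions from the last backwards for an NCHW tensor:
    # (W_left, W_right, H_top, H_bottom, C_front, C_back)
    return F.pad(x, (0, 0, 0, 0, 0, pad_channels))
```

Because the extra channels are simply zeros, the shortcut adds no parameters, which is exactly what distinguishes it from a projection shortcut.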
Replacing the convolutions with recurrence retains modeling capacity while using less computation (and fewer hyper-parameters). Additionally, SRU improves the training of deep recurrent models by employing [highway connections](https:\/\/paperswithcode.com\/method\/highway-layer) and a parameter initialization scheme tailored for gradient propagation in deep architectures.\r\n\r\nA single layer of SRU involves the following computation:\r\n\r\n$$\r\n\\mathbf{f}\\_{t} =\\sigma\\left(\\mathbf{W}\\_{f} \\mathbf{x}\\_{t}+\\mathbf{v}\\_{f} \\odot \\mathbf{c}\\_{t-1}+\\mathbf{b}\\_{f}\\right) \r\n$$\r\n\r\n$$\r\n\\mathbf{c}\\_{t} =\\mathbf{f}\\_{t} \\odot \\mathbf{c}\\_{t-1}+\\left(1-\\mathbf{f}\\_{t}\\right) \\odot\\left(\\mathbf{W} \\mathbf{x}\\_{t}\\right) \r\n$$\r\n\r\n$$\r\n\\mathbf{r}\\_{t} =\\sigma\\left(\\mathbf{W}\\_{r} \\mathbf{x}\\_{t}+\\mathbf{v}\\_{r} \\odot \\mathbf{c}\\_{t-1}+\\mathbf{b}\\_{r}\\right) \r\n$$\r\n\r\n$$\r\n\\mathbf{h}\\_{t} =\\mathbf{r}\\_{t} \\odot \\mathbf{c}\\_{t}+\\left(1-\\mathbf{r}\\_{t}\\right) \\odot \\mathbf{x}\\_{t}\r\n$$\r\n\r\nwhere $\\mathbf{W}, \\mathbf{W}\\_{f}$ and $\\mathbf{W}\\_{r}$ are parameter matrices and $\\mathbf{v}\\_{f}, \\mathbf{v}\\_{r}, \\mathbf{b}\\_{f}$ and $\\mathbf{b}\\_{r}$ are parameter vectors to be learnt during training. The complete architecture decomposes into two sub-components: a light recurrence and a highway network.\r\n\r\nThe light recurrence component successively reads the input vectors $\\mathbf{x}\\_{t}$ and computes the sequence of states $\\mathbf{c}\\_{t}$ capturing sequential information. The computation resembles other recurrent networks such as [LSTM](https:\/\/paperswithcode.com\/method\/lstm), [GRU](https:\/\/paperswithcode.com\/method\/gru) and RAN. Specifically, a forget gate $\\mathbf{f}\\_{t}$ controls the information flow, and the state vector $\\mathbf{c}\\_{t}$ is determined by adaptively averaging the previous state $\\mathbf{c}\\_{t-1}$ and the current observation $\\mathbf{W} \\mathbf{x}\\_{t}$ according to $\\mathbf{f}\\_{t}$.","767":"**GPipe** is a distributed model parallel method for neural networks. With GPipe, each model can be specified as a sequence of layers, and consecutive groups of layers can be partitioned into cells. Each cell is then placed on a separate accelerator. Based on this partitioned setup, batch splitting is applied. A mini-batch of training examples is split into smaller micro-batches, and the execution of each set of micro-batches is pipelined over the cells. Synchronous mini-batch gradient descent is applied for training, where gradients are accumulated across all micro-batches in a mini-batch and applied at the end of the mini-batch.","768":"**Packed Levitated Markers**, or **PL-Marker**, is a span representation approach for [named entity recognition](https:\/\/paperswithcode.com\/task\/named-entity-recognition-ner) that considers the dependencies between spans (pairs) by strategically packing the markers in the encoder. A pair of levitated markers, emphasizing a span, consists of a start marker and an end marker, which share the same position embeddings with the span\u2019s start and end tokens, respectively. In addition, both levitated markers adopt a restricted attention; that is, they are visible to each other, but not to the text tokens or other pairs of markers. 
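Before continuing with PL-Marker, here is a minimal single-time-step sketch of the SRU equations above; tensor shapes are illustrative, and the highway term assumes the input and hidden dimensions match.

```python
import torch

def sru_step(x_t, c_prev, W, W_f, W_r, v_f, v_r, b_f, b_r):
    """One SRU time step (per-dimension recurrence).
    x_t: (d,), c_prev: (d,), W / W_f / W_r: (d, d), v_* and b_*: (d,)."""
    f_t = torch.sigmoid(W_f @ x_t + v_f * c_prev + b_f)    # forget gate
    c_t = f_t * c_prev + (1 - f_t) * (W @ x_t)             # light recurrence
    r_t = torch.sigmoid(W_r @ x_t + v_r * c_prev + b_r)    # highway (reset) gate
    h_t = r_t * c_t + (1 - r_t) * x_t                      # highway output
    return h_t, c_t
```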
Based on the above features, the levitated markers do not affect the attended context of the original text tokens, which allows us to flexibly pack a series of related spans with their levitated markers in the encoding phase and thus model their dependencies.","769":"A **Dueling Network** is a type of Q-Network that has two streams to separately estimate the (scalar) state value and the advantages for each action. Both streams share a common convolutional feature learning module. The two streams are combined via a special aggregating layer to produce an estimate of the state-action value function $Q$.\r\n\r\nThe last module uses the following mapping:\r\n\r\n$$ Q\\left(s, a; \\theta, \\alpha, \\beta\\right) =V\\left(s; \\theta, \\beta\\right) + \\left(A\\left(s, a; \\theta, \\alpha\\right) - \\frac{1}{|\\mathcal{A}|}\\sum\\_{a'}A\\left(s, a'; \\theta, \\alpha\\right)\\right) $$\r\n\r\nThis formulation is chosen for identifiability: subtracting the maximum advantage would force zero advantage for the chosen action, but an average operator is used instead to increase the stability of the optimization.","770":"**UNITER**, or **UNiversal Image-TExt Representation**, is a large-scale pre-trained model for joint multimodal embedding. It is pre-trained using four image-text datasets: COCO, Visual Genome, Conceptual Captions, and SBU Captions. It can power heterogeneous downstream V+L tasks with joint multimodal embeddings. \r\nUNITER takes the visual regions of the image and the textual tokens of the sentence as inputs. A Faster R-CNN is used in the Image Embedder to extract the visual features of each region, and a Text Embedder is used to tokenize the input sentence into WordPieces. \r\n\r\nIt proposes Word-Region Alignment (WRA) via Optimal Transport to provide more fine-grained alignment between word tokens and image regions, which is effective in calculating the minimum cost of transporting the contextualized image embeddings to word embeddings and vice versa. \r\n\r\nFour pre-training tasks were designed for this model: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). The model differs from previous models in that it uses conditional masking on the pre-training tasks.","771":"**ViP-DeepLab** is a model for depth-aware video panoptic segmentation. It extends Panoptic-[DeepLab](https:\/\/paperswithcode.com\/method\/deeplab) by adding a depth prediction head to perform monocular depth estimation and a next-frame instance branch which regresses to the object centers in frame $t$ for frame $t + 1$. This allows the model to jointly perform video panoptic segmentation and monocular depth estimation.","772":"**GMVAE**, or **Gaussian Mixture Variational Autoencoder**, is a stochastic regularization layer for [transformers](https:\/\/paperswithcode.com\/methods\/category\/transformers). A GMVAE layer is trained using a 700-dimensional internal representation of the first MLP layer. For every output from the first MLP layer, the GMVAE layer first computes a latent low-dimensional representation, sampling from the GMVAE posterior distribution, and then provides at the output a reconstruction sampled from a generative model.","773":"**Adaptive Dropout** is a regularization technique that extends dropout by allowing the dropout probability to be different for different units. 
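Before continuing with adaptive dropout, here is a minimal sketch of the dueling aggregation above as a PyTorch head; the layer sizes and names are illustrative, and the shared convolutional feature extractor is assumed to be defined elsewhere.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Shared features feed a scalar value stream V(s) and an advantage
    stream A(s, a); they are combined with the mean-subtracted formula."""
    def __init__(self, feature_dim, num_actions, hidden=128):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, num_actions))

    def forward(self, features):
        v = self.value(features)                    # (batch, 1)
        a = self.advantage(features)                # (batch, |A|)
        return v + a - a.mean(dim=1, keepdim=True)  # Q(s, a) estimate
```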
The intuition is that there may be hidden units that can individually make confident predictions for the presence or absence of an important feature or combination of features. [Dropout](https:\/\/paperswithcode.com\/method\/dropout) will ignore this confidence and drop the unit out 50% of the time. \r\n\r\nDenote the activity of unit $j$ in a deep neural network by $a\\_{j}$ and assume that its inputs are $\\{a\\_{i}: i < j\\}$. In dropout, $a\\_{j}$ is randomly set to zero with probability 0.5. Let $m\\_{j}$ be a binary variable that is used to mask the activity $a\\_{j}$, so that its value is:\r\n\r\n$$ a\\_{j} = m\\_{j}g \\left( \\sum\\_{i: i