Bibliography
[1]
Large‐Scale Data for Multiple‐View Stereopsis
[2]
Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations
[3]
Third Time's the Charm? Image and Video Editing with StyleGAN3
[4]
The Illustrated Transformer
[5]
Flamingo: A Visual Language Model for Few-Shot Learning
[6]
[7]
Data-scalable Hessian preconditioning for distributed parameter PDE-constrained inverse problems
[8]
Neural Point-Based Graphics
[9]
Large-DiT-ImageNet: Large Diffusion Transformer for ImageNet Generation
[10]
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
[11]
[12]
Computer Vision vs Machine Vision: What's the Difference?
[13]
[14]
Wasserstein GAN
[15]
Partitioning Gated Mechanisms for Graph Generation
[16]
ViViT: A Video Vision Transformer
[17]
BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video
[18]
Structured Denoising Diffusion Models in Discrete State-Spaces
[19]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
[20]
Multiple Object Recognition with Visual Attention
[21]
[22]
Neural Machine Translation by Jointly Learning to Align and Translate
[23]
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
[24]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
[25]
On a Structural Similarity Index Approach for Floating-Point Data
[26]
Delving Deeper into Convolutional Networks for Learning Video Representations
[27]
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
[28]
Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields
[29]
Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields
[30]
Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields
[31]
Network Dissection: Quantifying Interpretability of Deep Visual Representations
[32]
Losses Explained: Contrastive Loss
[33]
XMem++: Production-Level Video Segmentation from Few Annotated Frames
[34]
Revisiting ResNets: Improved Training and Scaling Strategies
[35]
[36]
Representation learning: A review and new perspectives
[37]
Interactive Visualization of Stable Diffusion Image Embeddings
[38]
Random Search for Hyper-Parameter Optimization
[39]
Random Search for Hyper-Parameter Optimization
[40]
MultiGrain: a Unified Image Embedding for Classes and Instances
[41]
Is Space-Time Attention All You Need for Video Understanding?
[42]
[43]
FlexiViT: One Model for All Patch Sizes
[44]
NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior
[45]
Demystifying MMD GANs
[46]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
[47]
YOLOv4: Optimal Speed and Accuracy of Object Detection
[48]
Perception Encoder: The best visual embeddings are not at the output of the network
[49]
Token Merging: Your ViT But Faster
[50]
Food-101 -- Mining Discriminative Components with Random Forests
[51]
Human-Level Concept Learning Through Probabilistic Program Induction
[52]
Large Scale GAN Training for High Fidelity Natural Image Synthesis
[53]
High-Performance Large-Scale Image Recognition Without Normalization
[54]
High-Performance Large-Scale Image Recognition Without Normalization
[55]
Neural Photo Editing with Introspective Adversarial Networks
[56]
Model-based three-dimensional interpretations of two-dimensional images
[57]
InstructPix2Pix: Learning to Follow Image Editing Instructions
[58]
Language Models are Few-Shot Learners
[59]
Language Models are Few-Shot Learners
[60]
Gender shades: Intersectional accuracy disparities in commercial gender classification
[61]
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware
[62]
Align-DETR: Enhancing End-to-end Object Detection with Aligned Loss
[63]
A computational approach to edge detection
[64]
End-to-End Object Detection with Transformers
[65]
SAM 3: Segment Anything with Concepts
[66]
[67]
Towards Evaluating the Robustness of Neural Networks
[68]
Universal and Transferable Adversarial Attacks on Aligned Language Models
[69]
Deep Clustering for Unsupervised Learning of Visual Features
[70]
Emerging Properties in Self-Supervised Vision Transformers
[71]
Emerging Properties in Self-Supervised Vision Transformers
[72]
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
[73]
Unsupervised Pre-Training of Image Features on Non-Curated Data
[74]
[75]
Listen, Attend and Spell
[76]
ShapeNet: An Information-Rich 3D Model Repository
[77]
MaskGIT: Masked Generative Image Transformer
[78]
Rethinking the Faster R-CNN Architecture for Temporal Action Localization
[79]
MVSNeRF: Fast Generalizable Radiance Field Reconstruction from Multi-View Stereo
[80]
TensoRF: Tensorial Radiance Fields
[81]
Adaptive Image Transformer for One-Shot Object Detection
[82]
[83]
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
[84]
Generative Pretraining From Pixels
[85]
Generative Pretraining from Pixels
[86]
Fantasia3D: Disentangling Geometry and Appearance for High-Quality Text-to-3D Content Creation
[87]
Flow Matching on General Geometries
[88]
Neural Ordinary Differential Equations
[89]
A Simple Framework for Contrastive Learning of Visual Representations
[90]
Big Self-Supervised Models are Strong Semi-Supervised Learners
[91]
On Self Modulation for Generative Adversarial Networks
[92]
[93]
Exploring Simple Siamese Representation Learning
[94]
An Empirical Study of Training Self-Supervised Vision Transformers
[95]
An Empirical Study of Training Self-Supervised Vision Transformers
[96]
Improved Baselines with Momentum Contrastive Learning
[97]
UNITER: Learning Universal Image-Text Representations
[98]
Deep Bundle-Adjusting Generalizable Neural Radiance Fields
[99]
Per-Pixel Classification is Not All You Need for Semantic Segmentation
[100]
Masked-Attention Mask Transformer for Universal Image Segmentation
[101]
YOLO-World: Real-Time Open-Vocabulary Object Detection
[102]
TALLFormer: Temporal Action Localization with a Long-memory Transformer
[103]
Putting the Object Back into Video Object Segmentation
[104]
BING: Binarized normed gradients for objectness estimation at 300fps
[105]
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
[106]
[107]
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
[108]
3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation
[109]
Adversarial Video Generation on Complex Datasets
[110]
Fast and accurate deep network learning by exponential linear units (elus)
[111]
Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
[112]
Distribution Matching Losses Can Hallucinate Features in Medical Image Translation
[113]
AutoAugment: Learning Augmentation Policies from Data
[114]
RandAugment: Practical Automated Data Augmentation with a Reduced Search Space
[115]
ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes
[116]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
[118]
Objaverse: A Universe of Annotated 3D Objects
[119]
[120]
Exploring Nearest Neighbor Approaches for Image Captioning
[121]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[122]
Improved Regularization of Convolutional Neural Networks with Cutout
[123]
Diffusion Models Beat GANs on Image Synthesis
[124]
Diffusion Models Beat GANs on Image Synthesis
[125]
[126]
MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
[127]
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
[128]
ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks
[129]
RepVGG: Making VGG-style ConvNets Great Again
[130]
Density estimation using Real NVP
[131]
Unsupervised Visual Representation Learning by Context Prediction
[132]
Large Scale Adversarial Representation Learning
[133]
[134]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[135]
PaLM-E: An Embodied Multimodal Language Model
[136]
[137]
A learned representation for artistic style
[138]
Equivariant Neural Rendering
[139]
With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations
[140]
Introduction to 3D Gaussian Splatting
[141]
Texture Synthesis by Non‑Parametric Sampling
[142]
Depth Map Prediction from a Single Image using a Multi-Scale Deep Network
[143]
The power of depth for feedforward neural networks
[144]
Understanding Region of Interest - Part 2 (RoI Align)
[145]
How Well Do Self-Supervised Models Transfer?
[146]
Whitening for Self-Supervised Representation Learning
[147]
Neural Scene Representation and Rendering
[148]
KiloNeuS: A Versatile Neural Implicit Surface Representation for Real-Time Rendering
[149]
Taming Transformers for High-Resolution Image Synthesis
[150]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
[151]
[152]
Multiscale Vision Transformers
[153]
A Point Set Generation Network for 3D Object Reconstruction from a Single Image
[154]
Fluid: Scaling Autoregressive Text-to-Image Generative Models with Continuous Tokens
[155]
X3D: Expanding Architectures for Efficient Video Recognition
[156]
What Have We Learned from Deep Representations for Action Recognition?
[157]
A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
[158]
Deep Insights into Convolutional Networks for Video Recognition
[159]
Masked Autoencoders As Spatiotemporal Learners
[160]
SlowFast Networks for Video Recognition
[161]
[162]
Plenoxels: Radiance Fields without Neural Networks
[163]
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
[164]
COLMAP-Free 3D Gaussian Splatting
[165]
Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position
[166]
DataComp: In search of the next generation of multimodal datasets
[167]
NeRF-Editing: Geometry Editing of Neural Radiance Fields
[168]
TALL: Temporal Activity Localization via Language Query
[169]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
[170]
Discrete Flow Matching
[171]
Discrete Flow Matching
[172]
Image Style Transfer Using Convolutional Neural Networks
[173]
Texture Synthesis Using Convolutional Neural Networks
[174]
Gemini: A Family of Highly Capable Multimodal Models
[175]
Consistency Models Made Easy
[176]
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation
[177]
GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment
[178]
Unsupervised Representation Learning by Predicting Image Rotations
[179]
ImageBind: One Embedding Space to Bind Them All
[180]
OmniMAE: Single Model Masked Pretraining on Images and Videos
[181]
Fast R-CNN
[182]
Rich feature hierarchies for accurate object detection and semantic segmentation
[183]
Mesh R-CNN
[184]
Understanding the difficulty of training deep feedforward neural networks
[185]
Multimodal Neurons in Artificial Neural Networks
[186]
Explaining and harnessing adversarial examples
[187]
Explaining and Harnessing Adversarial Examples
[188]
Knowledge Distillation: A Survey
[189]
Fractional Max-Pooling
[190]
Submanifold Sparse Convolutional Networks
[191]
FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models
[192]
Ego4D: Around the World in 3,000 Hours of Egocentric Video
[193]
DRAW: A Recurrent Neural Network For Image Generation
[194]
Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
[195]
Implicit Geometric Regularization for Learning Shapes
[196]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
[197]
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
[198]
Non-Autoregressive Neural Machine Translation
[199]
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
[200]
SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering
[201]
Improved Training of Wasserstein GANs
[202]
On Calibration of Modern Neural Networks
[203]
Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning
[204]
LVIS: A Dataset for Large Vocabulary Instance Segmentation
[205]
ODinW: Evaluating and Harnessing Object Detection in the Wild
[206]
Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks
[207]
[208]
GhostNet: More Features from Cheap Operations
[209]
Memory-augmented Dense Predictive Coding for Video Representation Learning
[210]
Efficient Diffusion Training via Min-SNR Weighting Strategy
[211]
Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions
[212]
A combined corner and edge detector
[213]
LoRA+: Efficient Low Rank Adaptation of Large Models
[214]
Rethinking ImageNet Pre-training
[215]
[216]
Delving deep into rectifiers: Surpassing human-level performance on imagenet classification
[217]
Identity mappings in deep residual networks
[218]
[219]
Masked Autoencoders Are Scalable Vision Learners
[220]
Momentum Contrast for Unsupervised Visual Representation Learning
[221]
Bag of Tricks for Image Classification with Convolutional Neural Networks
[222]
Deep Blending for Free-Viewpoint Image-Based Rendering
[223]
Data-Efficient Image Recognition with Contrastive Predictive Coding
[224]
Gaussian Error Linear Units (GELUs)
[225]
Rotary Position Embedding for Vision Transformer
[226]
Prompt-to-Prompt Image Editing with Cross Attention Control
[227]
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
[228]
Distilling the Knowledge in a Neural Network
[229]
[230]
[231]
[232]
Denoising Diffusion Probabilistic Models
[233]
Classifier-Free Diffusion Guidance
[234]
Cascaded Diffusion Models for High Fidelity Image Generation
[235]
MGAN: Training generative adversarial nets with multiple generators
[236]
[237]
Generator Matching: Generative Modeling with Arbitrary Markov Processes
[238]
Generator Matching: Generative modeling with arbitrary Markov processes
[239]
[240]
[241]
Universal Language Model Fine-tuning for Text Classification
[242]
One-Shot Object Detection with Co-Attention and Co-Excitation
[243]
LoRA: Low-Rank Adaptation of Large Language Models
[244]
Squeeze-and-Excitation Networks
[245]
AdCo: Adversarial Contrast for Efficient Learning of Unsupervised Representations from Self-Trained Negative Adversaries
[246]
SiamMask: A Framework for Fast Online Object Tracking and Segmentation
[247]
Tri-MipRF: Tri-Mip Representation for Efficient Anti-Aliasing Neural Radiance Fields
[248]
ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation
[249]
DiffuseST: Unleashing the Capability of the Diffusion Model for Style Transfer
[250]
PromptCap: Prompt-Guided Task-Aware Image Captioning
[251]
DAC
[252]
Deep Networks with Stochastic Depth
[253]
Densely Connected Convolutional Networks
[254]
ALLoRA: Adaptive Learning Rate Mitigates LoRA Fatal Flaws
[255]
T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
[256]
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
[257]
Real-Time Object Detection Meets DINOv3
[258]
Improving Transformer Optimization Through Better Initialization
[259]
Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization
[260]
Receptive fields of single neurones in the cat's striate cortex
[261]
StyleGAN and StyleGAN2: Learn to generate and control images
[262]
All About Normalization
[263]
[264]
DEIMv2
[265]
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
[266]
Image-to-Image Translation with Conditional Adversarial Networks
[267]
Space-Time Correspondence as a Contrastive Random Walk
[268]
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
[269]
Zero-Shot Text-Guided Object Generation with Dream Fields
[270]
OneFormer: One Transformer To Rule Universal Image Segmentation
[271]
NeRFshop: Interactive Editing of Neural Radiance Fields
[272]
Categorical Reparameterization with Gumbel-Softmax
[273]
Large scale multi-view stereopsis evaluation
[274]
[275]
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
[276]
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
[277]
THUMOS Challenge: Action Recognition with a Large Number of Classes
[278]
GeoNeRF: Generalizing NeRF with Geometry Priors
[279]
Perceptual Losses for Real-Time Style Transfer and Super-Resolution
[280]
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
[281]
Inferring and Executing Programs for Visual Reasoning
[282]
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
[283]
MDETR: Modulated Detection for End-to-End Multi-Modal Understanding
[284]
Scaling up GANs for Text-to-Image Synthesis
[285]
Deep Visual-Semantic Alignments for Generating Image Descriptions
[286]
Visualizing and Understanding Recurrent Networks
[287]
[288]
Large-scale video classification with convolutional neural networks
[289]
A Style-Based Generator Architecture for Generative Adversarial Networks
[290]
Alias-Free Generative Adversarial Networks
[291]
Analyzing and Improving the Image Quality of StyleGAN
[292]
Elucidating the Design Space of Diffusion-Based Generative Models
[293]
Progressive Growing of GANs for Improved Quality, Stability, and Variation
[294]
The Kinetics Human Action Video Dataset
[295]
Transformer Architecture: The Positional Encoding
[296]
Rethinking Positional Encoding in Language Pre-training
[297]
Segment Anything in High Quality
[298]
Gaussian Surfels for Real-Time Rendering of Point Clouds
[299]
3D Gaussian Splatting for Real-Time Radiance Field Rendering
[300]
LERF: Language Embedded Radiance Fields
[301]
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
[302]
Geometry Score: A Method for Comparing Generative Adversarial Networks
[303]
Flow Matching: A Unified Framework for Generative Models
[304]
Auto-Encoding Variational Bayes
[305]
Adam: A Method for Stochastic Optimization
[306]
Glow: Generative Flow with Invertible 1x1 Convolutions
[307]
PointRend: Image Segmentation as Rendering
[308]
Segment Anything
[309]
Segment Anything
[310]
Self-normalizing neural networks
[311]
Tanks and temples: benchmarking large-scale scene reconstruction
[312]
ABC: A Big CAD Model Dataset For Geometric Deep Learning
[313]
Exponential convergence rates for batch normalization: The power of length-direction decoupling in non-convex optimization
[314]
3D Object Representations for Fine-Grained Categorization
[315]
Dense-Captioning Events in Videos
[316]
Quantizing deep convolutional networks for efficient inference: A whitepaper
[317]
Learning Multiple Layers of Features from Tiny Images
[318]
ImageNet classification with deep convolutional neural networks
[319]
Deep Convolutional Inverse Graphics Network
[320]
Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution
[321]
[322]
F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
[323]
FindIt: Generalized Localization with Natural Language Queries
[324]
MAST: A Memory-Augmented Self-Supervised Tracker
[325]
Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation
[326]
Gradient-Based Learning Applied to Document Recognition
[327]
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
[328]
Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer
[329]
From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation
[330]
Compressive Visual Representations
[331]
NT-Xent Loss: Normalized Temperature-Scaled Cross-Entropy Loss
[332]
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
[333]
Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
[334]
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
[335]
LLaVA-OneVision: Easy Visual Task Transfer
[336]
Language-Driven Semantic Segmentation
[337]
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
[338]
DN-DETR: Accelerate DETR Training by Introducing Query DeNoising
[339]
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
[340]
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation
[341]
Visualizing the Loss Landscape of Neural Nets
[342]
Align Before Fuse: Vision and Language Representation Learning With Momentum Distillation
[343]
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
[344]
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
[345]
VideoChat: Chat-Centric Video Understanding
[346]
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
[347]
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
[348]
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
[349]
Grounded Language-Image Pre-Training
[350]
Autoregressive Image Generation without Vector Quantization
[351]
OSCAR: Object-Semantics Aligned Pre-training for Vision-Language Tasks
[352]
TEA: Temporal Excitation and Aggregation for Action Recognition
[353]
Learnable Fourier Features for Multi-dimensional Spatial Positional Encoding
[354]
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
[355]
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
[356]
Detailed 2D-3D Joint Representation for Human-Object Interaction
[357]
Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes
[358]
[359]
Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning
[360]
BARF: Bundle-Adjusting Neural Radiance Fields
[361]
Magic3D: High-Resolution Text-to-3D Content Creation
[362]
Learning Salient Boundary Feature for Anchor-Free Temporal Action Localization
[363]
TSM: Temporal Shift Module for Efficient Video Understanding
[364]
One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer
[365]
SDXL-Lightning: Progressive Adversarial Diffusion Distillation
[366]
BMN: Boundary-Matching Network for Temporal Action Proposal Generation
[367]
[368]
Focal Loss for Dense Object Detection
[369]
[370]
DETR Doesn't Need Multi-Scale or Locality Design
[371]
PacGAN: The power of two samples in generative adversarial networks
[372]
Flow Matching Guide and Code
[373]
Flow Matching for Generative Modeling
[374]
[375]
World Model on Million-Length Video And Language With Blockwise RingAttention
[376]
[377]
Visual Instruction Tuning
[378]
[379]
Neural Sparse Voxel Fields
[380]
One for All: Video Conversation Is Feasible Without Video Instruction Tuning
[381]
DoRA: Weight-Decomposed Low-Rank Adaptation
[382]
DAB-DETR: Dynamic Anchor Boxes Are Better Queries for DETR
[383]
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding
[384]
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
[385]
NeUDF: Learning Neural Unsigned Distance Fields with Volume Rendering
[386]
Large-Margin Softmax Loss for Convolutional Neural Networks
[387]
SphereFace: Deep Hypersphere Embedding for Face Recognition
[388]
End-to-End Temporal Action Detection with Transformer
[389]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
[390]
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation
[391]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
[392]
Neural Rays for Occlusion-Aware Image-Based Rendering
[393]
TS2-Net: Token Shift and Selection Transformer for Video-Language Pretraining
[394]
Swin Transformer V2: Scaling Up Capacity and Resolution
[395]
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
[396]
Video Swin Transformer
[397]
TAM: Temporal Adaptive Module for Video Recognition
[398]
TAM: Temporal Adaptive Module for Video Recognition
[399]
TEINet: Towards an Efficient Architecture for Video Recognition
[400]
A ConvNet for the 2020s
[401]
Neural Volumes: Learning Dynamic Renderable Volumes from Images
[402]
Fully Convolutional Networks for Semantic Segmentation
[403]
Marching cubes: A high resolution 3D surface construction algorithm
[404]
Decoupled Weight Decay Regularization
[405]
Three-dimensional object recognition from single two-dimensional images
[406]
[407]
DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps
[408]
Omnimatte: Associating Objects and Their Effects in Video
[409]
OmnimatteRF: Dampened Global Transport for Layered Neural Rendering
[410]
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
[411]
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
[412]
Are GANs Created Equal? A Large-Scale Study
[413]
Diffusion Models from Scratch
[414]
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval and Captioning
[415]
Valley: Video Assistant with Large Language model Enhanced abilitY
[416]
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
[417]
Understanding the Effective Receptive Field in Deep Convolutional Neural Networks
[418]
RT-DETRv2
[419]
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
[420]
[421]
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
[422]
Visualizing Data using t-SNE
[423]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
[424]
Towards Deep Learning Models Resistant to Adversarial Attacks
[425]
Gated recurrent unit (GRU)-based deep learning method for spectrum estimation and inverse modeling in plasmonic devices
[426]
Mega-NeRF: Scalable Construction of Large-Scale Neural Radiance Fields
[427]
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
[428]
TIPS: Text-Image Pretraining with Spatial awareness
[429]
Least Squares Generative Adversarial Networks
[430]
Vision: A Computational Investigation into the Human Representation and Processing of Visual Information
[431]
NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections
[432]
[433]
Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences
[434]
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
[435]
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models
[436]
Which Training Methods for GANs do actually Converge?
[437]
Occupancy Networks: Learning 3D Reconstruction in Function Space
[438]
[439]
SAM 2: Segment Anything in Images and Videos
[440]
Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures
[441]
Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines
[442]
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[443]
RawNeRF: Neural Radiance Fields from Noisy Raw Images
[444]
Scaling Open-Vocabulary Object Detection
[445]
Scaling Open-Vocabulary Object Detection
[446]
Simple Open-Vocabulary Object Detection with Vision Transformers
[447]
Perceptrons: An Introduction to Computational Geometry
[448]
Conditional Generative Adversarial Nets
[449]
Self-Supervised Learning of Pretext-Invariant Representations
[450]
Representation Learning via Invariant Causal Mechanisms
[451]
Spectral Normalization for Generative Adversarial Networks
[452]
PartNet: A Large-Scale Benchmark for Fine-Grained and Hierarchical Part-Level 3D Object Understanding
[453]
Null-text Inversion for Editing Real Images using Guided Diffusion Models
[454]
[455]
Inceptionism: Going Deeper into Neural Networks
[456]
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
[457]
Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
[458]
SILC: Improving Vision Language Pretraining with Self-Distillation
[459]
Deep Double Descent: Where Bigger Models and More Data Hurt
[460]
Video Transformer Network
[461]
Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned By Each Neuron in Deep Neural Networks
[462]
Bellman Optimal Stepsize Straightening of Flow-Matching Models
[463]
Improved Denoising Diffusion Probabilistic Models
[464]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
[465]
Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision
[466]
RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs
[467]
Understanding SSIM
[468]
Learning Deconvolution Network for Semantic Segmentation
[469]
UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction
[470]
Video Object Segmentation using Space-Time Memory Networks
[471]
Attention U-Net: Learning Where to Look for the Pancreas
[472]
Pixel Recurrent Neural Networks
[473]
Representation Learning with Contrastive Predictive Coding
[474]
Neural Discrete Representation Learning
[475]
Conditional Image Generation with PixelCNN Decoders
[476]
GPT-4V(ision) System Card
[477]
DINOv2: Learning Robust Visual Features without Supervision
[478]
NMS Strikes Back: Suppressing Overconfident Incorrect Queries in DETR
[479]
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization
[480]
VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples
[481]
DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation
[482]
HyperNeRF: A Higher-Dimensional Representation for Topologically Varying Neural Radiance Fields
[483]
Nerfies: Deformable Neural Radiance Fields
[484]
Semantic Image Synthesis with Spatially-Adaptive Normalization
[485]
On the Difficulty of Training Recurrent Neural Networks
[486]
[487]
ROI Pool and Align: PyTorch Implementation
[488]
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
[489]
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
[490]
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
[491]
Causality
[492]
Scalable Diffusion Models with Transformers
[493]
FiLM: Visual Reasoning with a General Conditioning Layer
[494]
[495]
A Self-Supervised Descriptor for Image Copy Detection
[496]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
[497]
Why and when can deep—but not shallow—networks avoid the curse of dimensionality: A review
[498]
Acceleration of Stochastic Approximation by Averaging
[499]
The 2017 DAVIS Challenge on Video Object Segmentation
[500]
Multisample Flow Matching: Straightening Flows with Minibatch Couplings
[501]
DreamFusion: Text-to-3D using 2D Diffusion
[502]
D-NeRF: Neural Radiance Fields for Dynamic Scenes
[503]
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
[504]
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
[505]
PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies
[506]
Spatiotemporal Contrastive Video Representation Learning
[507]
MobileNetV4 -- Universal Models for the Mobile Ecosystem
[508]
Temporal Context Aggregation Network for Temporal Action Proposal Refinement
[509]
Qwen2.5 Technical Report
[510]
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
[511]
Language Models are Unsupervised Multitask Learners
[512]
Learning Transferable Visual Models From Natural Language Supervision
[513]
Learning Transferable Visual Models From Natural Language Supervision
[514]
Designing Network Design Spaces
[515]
Designing Network Design Spaces
[516]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
[517]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[518]
On the Spectral Bias of Neural Networks
[519]
Searching for activation functions
[520]
Swish: a self-gated activation function
[521]
Stand-Alone Self-Attention in Vision Models
[522]
On the Frequency Bias of Coordinate-MLPs
[523]
DALL-E: Creating Images from Text Descriptions
[524]
Hierarchical Text-Conditional Image Generation with CLIP Latents
[525]
Zero-Shot Text-to-Image Generation
[526]
Vision Transformers for Dense Prediction
[527]
Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer
[528]
Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks
[529]
SAM 2: Segment Anything in Images and Videos
[530]
Generating Diverse High-Fidelity Images with VQ-VAE-2
[531]
Broaden Your Views for Self-Supervised Video Learning
[532]
YOLO9000: better, faster, stronger
[533]
YOLOv3: An incremental improvement
[534]
You Only Look Once: Unified, Real-Time Object Detection
[535]
Generative Adversarial Text to Image Synthesis
[536]
Learning What and Where to Draw: Generative Adversarial What‑Where Networks
[537]
CO3D: Common Objects in 3D for Few-Shot View Synthesis
[538]
Faster R-CNN
[539]
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
[540]
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
[541]
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
[542]
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis
[543]
Learning with Average Precision: Training Image Retrieval with a Listwise Loss
[544]
Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression
[545]
BYOL works even without batch statistics
[546]
Machine Perception of Three-Dimensional Solids
[547]
SASSL: Enhancing Self-Supervised Learning via Neural Style Transfer
[548]
High-Resolution Image Synthesis with Latent Diffusion Models
[549]
U-Net: Convolutional Networks for Biomedical Image Segmentation
[550]
The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain
[551]
SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
[552]
"GrabCut": interactive foreground extraction using iterated graph cuts
[553]
Principles of Mathematical Analysis
[554]
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
[555]
Learning representations by back-propagating errors
[556]
Logo Synthesis and Manipulation with Clustered Generative Adversarial Networks
[557]
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
[558]
Assessing Generative Models via Precision and Recall
[559]
Progressive Distillation for Fast Sampling of Diffusion Models
[560]
Improved Techniques for Training GANs
[561]
PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications
[562]
[563]
CLIP-Mesh: Generating Textured Meshes from Text Using Pretrained Image-Text Models
[564]
How Does Batch Normalization Help Optimization?
[565]
How Does Batch Normalization Help Optimization?
[566]
StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets
[567]
Adversarial Diffusion Distillation
[568]
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
[569]
StyleGAN-T: Unlocking the Power of GANs with Transformer Backbones
[570]
Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks
[571]
Structure-From-Motion Revisited
[572]
FaceNet: A unified embedding for face recognition and clustering
[573]
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
[574]
LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models
[575]
A Basic Introduction to Separable Convolutions
[576]
A Comparison and Evaluation of Multi‐View Stereo Reconstruction Algorithms
[577]
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
[578]
Prompting Large Language Models with Answer Heuristics for Knowledge-Based VQA
[579]
Transition Matching: Scalable and Flexible Generative Modeling
[580]
Self-Attention with Relative Position Representations
[581]
Fast Transformer Decoding: One Write-Head is All You Need
[582]
[583]
[584]
Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
[585]
Pixels, voxels, and views: A study of shape representations for single view 3D object shape prediction
[586]
Unifying and Boosting Gradient-Based Training-Free Neural Architecture Search
[587]
An Illusion of Equivalence: Revisiting Low-Rank Adaptation and Full Fine-Tuning
[588]
Understanding Camera Calibration and Intrinsics
[589]
DINOv3
[590]
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
[591]
Two-Stream Convolutional Networks for Action Recognition in Videos
[592]
Very Deep Convolutional Networks for Large-Scale Image Recognition
[593]
Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations
[594]
GEOMetrics: Exploiting Geometric Structure for Graph-Encoded Objects
[595]
Cyclical Learning Rates for Training Neural Networks
[596]
Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
[598]
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
[599]
Convolutions and Backpropagation
[600]
Denoising Diffusion Implicit Models
[601]
Generative Modeling by Estimating Gradients of the Data Distribution
[602]
Consistency Models
[603]
Consistency Models
[604]
From Signal to Noise: Understanding Diffusion Models
[605]
Score-Based Generative Modeling through Stochastic Differential Equations
[606]
Striving for Simplicity: The All Convolutional Net
[607]
Dropout: a simple way to prevent neural networks from overfitting
[608]
Training Very Deep Networks
[609]
Stable Diffusion Image Variations (Official Release)
[610]
Stable Diffusion unCLIP
[611]
How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
[612]
L2 Regularization and Batch Normalization: How They Interact
[613]
Direct Voxel Grid Optimization: Super-fast Convergence for Radiance Fields Reconstruction
[614]
Improved Direct Voxel Grid Optimization for Radiance Fields Reconstruction
[615]
Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling
[616]
Sequence to sequence learning with neural networks
[617]
Going deeper with convolutions
[618]
[619]
VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning
[620]
EfficientNetV2: Smaller Models and Faster Training
[621]
[622]
MnasNet: Platform-Aware Neural Architecture Search for Mobile
[623]
Block-NeRF: Scalable Large Scene Neural View Synthesis
[624]
Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains
[625]
Nerfacto: A Fast Hash-Grid NeRF Baseline
[626]
Asynchronous Interaction Aggregation for Action Detection
[627]
DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation
[628]
DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis
[629]
Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs
[630]
Single-View to Multi-View: Reconstructing Unseen Views with a Convolutional Network
[631]
What Do Single-view 3D Reconstruction Networks Learn?
[632]
DeepFloyd IF: A Cascaded Diffusion Model for Text-to-Image Synthesis
[633]
Higher Accuracy on Vision Models with EfficientNet-Lite
[634]
Benefits of depth in neural networks
[635]
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
[636]
Understanding self-supervised learning dynamics without contrastive pairs
[637]
FCOS: Fully Convolutional One-Stage Object Detection
[638]
Breaking the "Object" in Video Object Segmentation
[639]
MLP-Mixer: An all-MLP Architecture for Vision
[640]
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?
[641]
TrajectoryNet: Learning Continuous Dynamics for Optimal Transport
[642]
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
[643]
DeepPose: Human Pose Estimation via Deep Neural Networks
[644]
DeiT III: Revenge of the ViT
[645]
Fixing the train-test resolution discrepancy
[646]
Going deeper with Image Transformers
[647]
Training Data-Efficient Image Transformers & Distillation Through Attention
[648]
On Adaptive Attacks to Adversarial Example Defenses
[649]
Video Classification with Channel-Separated Convolutional Networks
[650]
A Closer Look at Spatiotemporal Convolutions for Action Recognition
[651]
[652]
GRF: Learning a General Radiance Field for 3D Representation and Rendering
[653]
SPARF: Neural Radiance Fields from Sparse and Noisy Poses
[654]
Review: Group Normalization (GN) for Image Classification
[655]
Review --- BYOL: Bootstrap Your Own Latent a New Approach to Self-Supervised Learning
[656]
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
[657]
Active Data Curation Effectively Distills Large-Scale Multimodal Models
[658]
Selective Search for Object Recognition
[659]
Data Augmentation using Ultralytics YOLO
[660]
Instance Normalization: The Missing Ingredient for Fast Stylization
[661]
DyLoRA: Parameter-Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation
[662]
Flow Matching and Generative Modeling: Deep Dive and Code Walkthrough
[663]
[664]
Attention is all you need
[665]
Residual Networks Behave Like Ensembles of Relatively Shallow Networks
[666]
Optimal Transport: Old and New
[667]
[668]
[669]
Show and tell: A neural image caption generator
[670]
[671]
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
[672]
The Caltech-UCSD Birds-200-2011 Dataset
[673]
Diffusion Model Alignment Using Direct Preference Optimization
[674]
LocCa: Visual Pretraining with Location-aware Captioners
[675]
Regularization of Neural Networks using DropConnect
[676]
All in One: Exploring Unified Video-Language Pre-Training
[677]
IBRNet: Learning Multi-View Image-Based Rendering
[678]
Additive Margin Softmax for Face Verification
[679]
CosFace: Large Margin Cosine Loss for Deep Face Recognition
[680]
Video Modeling with Correlation Networks
[681]
GIT: A Generative Image-to-Text Transformer for Vision and Language
[682]
[683]
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks
[684]
TDN: Temporal Difference Networks for Efficient Action Recognition
[685]
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
[686]
Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images
[687]
BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields
[688]
NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction
[689]
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
[690]
BEVT: BERT Pretraining of Video Transformers
[691]
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning
[692]
Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation
[693]
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
[694]
Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks
[695]
Contrastive Learning With Stronger Augmentations
[696]
Videos as Space-Time Region Graphs
[697]
Non-local Neural Networks
[698]
Non-local Neural Networks
[699]
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
[700]
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
[701]
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
[702]
GS-IR: 3D Gaussian Splatting for Inverse Rendering
[703]
NeuS2: Fast Learning of Neural Implicit Surfaces for Multi-View Reconstruction
[704]
HF-NeuS: Improved Surface Reconstruction Using High-Frequency Details Robust to Noise
[705]
PET-NeuS: Positional Encoding Tri-Planes for Neural Surfaces
[706]
Dynamic Graph CNN for Learning on Point Clouds
[707]
Diffusion-GAN: Training GANs with Diffusion
[708]
StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models
[709]
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
[710]
Masked Feature Prediction for Self-Supervised Visual Pre-Training
[711]
Fast Texture Synthesis using Tree‑structured Vector Quantization
[712]
Pixel2Mesh++: Multi-View 3D Mesh Generation via Deformation
[713]
LongVLM: Efficient Long Video Understanding via Large Language Models
[714]
Aliasing --- Wikipedia, The Free Encyclopedia
[715]
Sine and cosine --- Wikipedia, The Free Encyclopedia
[716]
Simple statistical gradient-following algorithms for connectionist reinforcement learning
[717]
An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories
[718]
Roofline: an insightful visual performance model for multicore architectures
[719]
ProGAN – How NVIDIA Generated Images of Unprecedented Quality
[720]
[721]
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
[722]
Robust fine-tuning of zero-shot models
[723]
MeMViT: Memory-Augmented Multiscale Vision Transformers for Efficient Long-Term Video Recognition
[724]
Multi-Scale Feature Aggregation for Spatio-Temporal Action Detection
[725]
4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
[726]
[727]
Point Transformer V2: Grouped Vector Attention and Partition-based Pooling
[728]
3D ShapeNets: A Deep Representation for Volumetric Shapes
[729]
Unsupervised Deep Embedding for Clustering Analysis
[730]
Aggregated Residual Transformations for Deep Neural Networks
[731]
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
[732]
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
[733]
Boundary-Sensitive Pre-Training for Temporal Localization in Videos
[734]
G-TAD: Sub-Graph Localization for Temporal Action Detection
[735]
[736]
Point-NeRF: Point-based Neural Radiance Fields
[737]
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
[738]
Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models
[739]
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
[740]
Sliced Wasserstein Generative Models
[741]
LanguageBind: Extending Video-Language Pretraining to Multiple Modalities
[742]
Multiview Transformers for Video Recognition
[743]
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
[744]
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
[745]
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
[746]
Video Instance Segmentation
[747]
BasicTAD: An astounding RGB-Only baseline for temporal action detection
[748]
[749]
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection
[750]
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment
[751]
FILIP: Fine-grained Interactive Language-Image Pre-Training
[752]
Efficient DETR: Improving End-to-End Object Detector with Dense Prior
[753]
Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance
[754]
Volume Rendering of Neural Implicit Surfaces
[755]
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
[756]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
[757]
Overfitting vs. Underfitting
[758]
Generative Adversarial Network in Medical Imaging: A Review
[759]
One-step Diffusion with Distribution Matching Distillation
[760]
Understanding Neural Networks Through Deep Visualization
[761]
Large batch training of convolutional networks
[762]
pixelNeRF: Neural Radiance Fields from One or Few Images
[763]
PlenOctrees for Real-Time Rendering of Neural Radiance Fields
[764]
CoCa: Contrastive Captioners are Image-Text Foundation Models
[765]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
[766]
Vector-quantized Image Modeling with Improved VQGAN
[767]
GSDF: 3DGS Meets SDF for Improved Neural Rendering
[768]
Florence: A New Foundation Model for Computer Vision
[769]
V4D: 4D Convolutional Neural Networks for Video-level Representation Learning
[770]
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
[771]
DiracNets: Training Very Deep Neural Networks Without Skip-Connections
[772]
Understanding Batch Normalization Backpropagation
[773]
Open-V
[774]
Barlow Twins: Self-Supervised Learning via Redundancy Reduction
[775]
[776]
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
[777]
MERLOT: Multimodal Neural Script Knowledge Models
[778]
S4L: Self-Supervised Semi-Supervised Learning
[779]
LiT: Zero-Shot Transfer With Locked-Image Text Tuning
[780]
Sigmoid Loss for Language Image Pre-Training
[781]
Root Mean Square Layer Normalization
[782]
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
[783]
ActionFormer: Localizing Moments of Actions with Transformers
[784]
Self-Attention Generative Adversarial Networks
[785]
StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks
[786]
StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks
[787]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
[788]
ResNeSt: Split-Attention Networks
[789]
GLIPv2: Unifying Localization and Vision-Language Understanding
[790]
Fixup Initialization: Residual Learning Without Normalization
[791]
mixup: Beyond Empirical Risk Minimization
[792]
NeRF++: Analyzing and Improving Neural Radiance Fields
[793]
Shape and Motion under Varying Illumination: Unifying Structure from Motion, Photometric Stereo, and Multi-view Stereo
[794]
Adding Conditional Control to Text-to-Image Diffusion Models
[795]
VinVL: Making Visual Representations Matter in Vision-Language Models
[796]
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
[797]
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
[798]
Making Convolutional Networks Shift-Invariant Again
[799]
[800]
The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
[801]
[802]
Recognize Anything: A Strong Image Tagging Model
[803]
Inversion-Based Style Transfer with Diffusion Models
[804]
Road Extraction by Deep Residual U-Net
[805]
Multimodal Chain-of-Thought Reasoning in Language Models
[806]
HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
[807]
Point Transformer
[808]
Pyramid Scene Parsing Network
[809]
TubeR: Tubelet Transformer for Video Action Detection
[810]
VideoPrism: A Foundational Visual Encoder for Video Understanding
[811]
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
[812]
Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head
[813]
Learning Video Representations from Large Language Models
[814]
Classifier Guidance and Classifier Free Guidance: Code Examples
[815]
RegionCLIP: Region-Based Language-Image Pretraining
[816]
Learning Deep Features for Discriminative Localization
[817]
[818]
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
[819]
iBOT: Image BERT Pre-Training with Online Tokenizer
[820]
Conditional Prompt Learning for Vision-Language Models
[821]
Unified Vision-Language Pre-Training for Image Captioning and VQA
[822]
Unet++: A Nested U-Net Architecture for Medical Image Segmentation
[823]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
[824]
Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
[825]
DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis
[826]
Enriching Local and Global Contexts for Temporal Action Detection
[827]
Deformable DETR: Deformable Transformers for End-to-End Object Detection
[828]
Transfusion: Cross-modal Diffusion Models for Reference-based Image Editing and Generation
[829]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
[830]
Edge boxes: Locating object proposals from edges
[831]
Neural Architecture Search with Reinforcement Learning
[832]
Learning Transferable Architectures for Scalable Image Recognition
[833]
Generalized Decoding for Pixel, Image, and Language