Bibliography
[1]
Untitled
[2]
[3]
[4]
Third Time's the Charm? Image and Video Editing with StyleGAN3
[5]
[6]
[7]
[8]
[9]
Neural Point-Based Graphics
[10]
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
[11]
[12]
Computer Vision vs Machine Vision: What's the Difference?
[13]
[14]
Wasserstein GAN
[15]
Partitioning Gated Mechanisms for Graph Generation
[16]
ViViT: A Video Vision Transformer
[17]
[18]
[19]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
[20]
[21]
[22]
Neural Machine Translation by Jointly Learning to Align and Translate
[23]
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
[24]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
[25]
[26]
On a Structural Similarity Index Approach for Floating-Point Data
[27]
[28]
[29]
Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields
[30]
Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields
[31]
Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields
[32]
Network Dissection: Quantifying Interpretability of Deep Visual Representations
[33]
[34]
[35]
Revisiting ResNets: Improved Training and Scaling Strategies
[36]
[37]
[38]
Interactive Visualization of Stable Diffusion Image Embeddings
[39]
Random Search for Hyper-Parameter Optimization
[40]
Random Search for Hyper-Parameter Optimization
[41]
[42]
Is Space-Time Attention All You Need for Video Understanding?
[43]
FlexiViT: One Model for All Patch Sizes
[44]
NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior
[45]
Demystifying MMD GANs
[46]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
[47]
[48]
Perception Encoder: The best visual embeddings are not at the output of the network
[49]
Token Merging: Your ViT But Faster
[50]
Food-101 -- Mining Discriminative Components with Random Forests
[51]
[52]
[53]
High-Performance Large-Scale Image Recognition Without Normalization
[54]
High-Performance Large-Scale Image Recognition Without Normalization
[55]
[56]
[57]
InstructPix2Pix: Learning to Follow Image Editing Instructions
[58]
Language Models are Few-Shot Learners
[59]
Language Models are Few-Shot Learners
[60]
[61]
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware
[62]
Align-DETR: Enhancing End-to-end Object Detection with Aligned Loss
[63]
[64]
End-to-End Object Detection with Transformers
[65]
SAM 3: Segment Anything with Concepts
[66]
[67]
[68]
Universal and Transferable Adversarial Attacks on Aligned Language Models
[69]
[70]
[71]
[72]
[73]
Unsupervised Pre-Training of Image Features on Non-Curated Data
[74]
[75]
[76]
ShapeNet: An Information-Rich 3D Model Repository
[77]
MaskGIT: Masked Generative Image Transformer
[78]
Rethinking the Faster R-CNN Architecture for Temporal Action Localization
[79]
MVSNeRF: Fast Generalizable Radiance Field Reconstruction from Multi-View Stereo
[80]
TensoRF: Tensorial Radiance Fields
[81]
[82]
[83]
[84]
[85]
[86]
Flow Matching on General Geometries
[87]
Neural Ordinary Differential Equations
[88]
A Simple Framework for Contrastive Learning of Visual Representations
[89]
[90]
[91]
[92]
[93]
An Empirical Study of Training Self-Supervised Vision Transformers
[94]
An Empirical Study of Training Self-Supervised Vision Transformers
[95]
Improved Baselines with Momentum Contrastive Learning
[96]
UNITER: Learning Universal Image-Text Representations
[97]
Deep Bundle-Adjusting Generalizable Neural Radiance Fields
[98]
Per-Pixel Classification is Not All You Need for Semantic Segmentation
[99]
Masked-Attention Mask Transformer for Universal Image Segmentation
[100]
YOLO-World: Real-Time Open-Vocabulary Object Detection
[101]
[102]
[103]
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
[104]
[105]
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
[106]
[107]
Adversarial Video Generation on Complex Datasets
[108]
[109]
Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
[110]
[111]
[112]
[113]
[114]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
[115]
[117]
[118]
[119]
[120]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[121]
Improved Regularization of Convolutional Neural Networks with Cutout
[122]
Diffusion Models Beat GANs on Image Synthesis
[123]
[124]
[125]
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
[126]
[127]
RepVGG: Making VGG-style ConvNets Great Again
[128]
Density estimation using Real NVP
[129]
[130]
[131]
[132]
[133]
[134]
[135]
[136]
[137]
[138]
[139]
Introduction to 3D Gaussian Splatting
[140]
[141]
Depth Map Prediction from a Single Image using a Multi-Scale Deep Network
[142]
[143]
Understanding Region of Interest - Part 2 (RoI Align)
[144]
[145]
[146]
[147]
[148]
Taming Transformers for High-Resolution Image Synthesis
[149]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
[150]
[151]
[152]
[153]
[154]
What Have We Learned from Deep Representations for Action Recognition?
[155]
[156]
Deep Insights into Convolutional Networks for Video Recognition
[157]
[158]
[159]
[160]
Plenoxels: Radiance Fields without Neural Networks
[161]
[162]
[163]
[164]
DataComp: In search of the next generation of multimodal datasets
[165]
[166]
[167]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
[168]
Discrete Flow Matching
[169]
[170]
Texture Synthesis Using Convolutional Neural Networks
[171]
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation
[172]
[173]
[174]
[175]
Fast R-CNN
[176]
[177]
Mesh R-CNN
[178]
[179]
[180]
[181]
[182]
Knowledge Distillation: A Survey
[183]
Fractional Max-Pooling
[184]
Submanifold Sparse Convolutional Networks
[185]
FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models
[186]
Ego4D: Around the World in 3,000 Hours of Egocentric Video
[187]
[188]
[189]
Implicit Geometric Regularization for Learning Shapes
[190]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
[191]
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
[192]
[193]
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
[194]
Improved Training of Wasserstein GANs
[195]
[196]
[197]
ODinW: Evaluating and Harnessing Object Detection in the Wild
[198]
[199]
[200]
GhostNet: More Features from Cheap Operations
[201]
[202]
Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions
[203]
[204]
LoRA+: Efficient Low Rank Adaptation of Large Models
[205]
Rethinking ImageNet Pre-training
[206]
[207]
[208]
[209]
[210]
[211]
[212]
Bag of Tricks for Image Classification with Convolutional Neural Networks
[213]
[214]
Data-Efficient Image Recognition with Contrastive Predictive Coding
[215]
[216]
Rotary Position Embedding for Vision Transformer
[217]
Prompt-to-Prompt Image Editing with Cross Attention Control
[218]
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
[219]
Distilling the Knowledge in a Neural Network
[220]
[221]
[222]
[223]
Denoising Diffusion Probabilistic Models
[224]
Classifier-Free Diffusion Guidance
[225]
Cascaded Diffusion Models for High Fidelity Image Generation
[226]
MGAN: Training generative adversarial nets with multiple generators
[227]
[228]
Generator Matching: Generative Modeling with Arbitrary Markov Processes
[229]
[230]
[231]
[232]
One-Shot Object Detection with Co-Attention and Co-Excitation
[233]
LoRA: Low-Rank Adaptation of Large Language Models
[234]
Squeeze-and-Excitation Networks
[235]
[236]
SiamMask: A Framework for Fast Online Object Tracking and Segmentation
[237]
Tri-MipRF: Tri-Mip Representation for Efficient Anti-Aliasing Neural Radiance Fields
[238]
ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation
[239]
[240]
[241]
DAC
[242]
Deep Networks with Stochastic Depth
[243]
Densely Connected Convolutional Networks
[244]
[245]
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
[246]
Real-Time Object Detection Meets DINOv3
[247]
Improving Transformer Optimization Through Better Initialization
[248]
Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization
[249]
[250]
[251]
All About Normalization
[252]
[253]
DEIMv2
[254]
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
[255]
[256]
[257]
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
[258]
Zero-Shot Text-Guided Object Generation with Dream Fields
[259]
OneFormer: One Transformer To Rule Universal Image Segmentation
[260]
NeRFshop: Interactive Editing of Neural Radiance Fields
[261]
Categorical Reparameterization with Gumbel-Softmax
[262]
[263]
[264]
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
[265]
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
[266]
[267]
GeoNeRF: Generalizing NeRF with Geometry Priors
[268]
[269]
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
[270]
Inferring and Executing Programs for Visual Reasoning
[271]
[272]
[273]
Scaling up GANs for Text-to-Image Synthesis
[274]
[275]
Visualizing and Understanding Recurrent Networks
[276]
[277]
[278]
A Style-Based Generator Architecture for Generative Adversarial Networks
[279]
Alias-Free Generative Adversarial Networks
[280]
Analyzing and Improving the Image Quality of StyleGAN
[281]
Progressive Growing of GANs for Improved Quality, Stability, and Variation
[282]
The Kinetics Human Action Video Dataset
[283]
Transformer Architecture: The Positional Encoding
[284]
[285]
Segment Anything in High Quality
[286]
[287]
3D Gaussian Splatting for Real-Time Radiance Field Rendering
[288]
LERF: Language Embedded Radiance Fields
[289]
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
[290]
Geometry Score: A Method for Comparing Generative Adversarial Networks
[291]
Flow Matching: A Unified Framework for Generative Models
[292]
Auto-Encoding Variational Bayes
[293]
Adam: A Method for Stochastic Optimization
[294]
Glow: Generative Flow with Invertible 1x1 Convolutions
[295]
PointRend: Image Segmentation as Rendering
[296]
[297]
Segment Anything
[298]
[299]
Tanks and temples: benchmarking large-scale scene reconstruction
[300]
[301]
[302]
Optimal Flow Matching: Learning Straight Trajectories in Just One Step
[303]
3D Object Representations for Fine-Grained Categorization
[304]
[305]
Quantizing deep convolutional networks for efficient inference: A whitepaper
[306]
[307]
[308]
Deep Convolutional Inverse Graphics Network
[309]
Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution
[310]
[311]
F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
[312]
FindIt: Generalized Localization with Natural Language Queries
[313]
[314]
[315]
Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation
[316]
[317]
[318]
Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer
[319]
Compressive Visual Representations
[320]
[321]
[322]
[323]
[324]
[325]
LLaVA-OneVision: Easy Visual Task Transfer
[326]
Language-Driven Semantic Segmentation
[327]
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
[328]
[329]
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
[330]
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation
[331]
Visualizing the Loss Landscape of Neural Nets
[332]
Align Before Fuse: Vision and Language Representation Learning With Momentum Distillation
[333]
[334]
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
[335]
VideoChat: Chat-Centric Video Understanding
[336]
[337]
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
[338]
Grounded Language-Image Pre-Training
[339]
[340]
[341]
OSCAR: Object-Semantics Aligned Pre-training for Vision-Language Tasks
[342]
[343]
Learnable Fourier Features for Multi-dimensional Spatial Positional Encoding
[344]
[345]
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
[346]
[347]
[348]
[349]
Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes
[350]
[351]
Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning
[352]
BARF: Bundle-Adjusting Neural Radiance Fields
[353]
[354]
[355]
One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer
[356]
[357]
[358]
[359]
[360]
Focal Loss for Dense Object Detection
[361]
DETR Doesn't Need Multi-Scale or Locality Design
[362]
PacGAN: The power of two samples in generative adversarial networks
[363]
Flow Matching Guide and Code
[364]
Flow Matching for Generative Modeling
[365]
[366]
World Model on Million-Length Video And Language With Blockwise RingAttention
[367]
[368]
Visual Instruction Tuning
[369]
[370]
Neural Sparse Voxel Fields
[371]
[372]
[373]
DoRA: Weight-Decomposed Low-Rank Adaptation
[374]
[375]
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding
[376]
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
[377]
[378]
[379]
SphereFace: Deep Hypersphere Embedding for Face Recognition
[380]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
[381]
[382]
Neural Rays for Occlusion-Aware Image-Based Rendering
[383]
[384]
[385]
[386]
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
[387]
TAM: Temporal Adaptive Module for Video Recognition
[388]
A ConvNet for the 2020s
[389]
Neural Volumes: Learning Dynamic Renderable Volumes from Images
[390]
[391]
Marching cubes: A high resolution 3D surface construction algorithm
[392]
Decoupled Weight Decay Regularization
[393]
[394]
[395]
DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps
[396]
Omnimatte: Associating Objects and Their Effects in Video
[397]
OmnimatteRF: Robust Omnimatte with 3D Background Modeling
[398]
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
[399]
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
[400]
Are GANs Created Equal? A Large-Scale Study
[401]
High-fidelity image generation with fewer labels
[402]
Diffusion Models from Scratch
[403]
[404]
Valley: Video Assistant with Large Language model Enhanced abilitY
[405]
Understanding the Effective Receptive Field in Deep Convolutional Neural Networks
[406]
RT-DETRv2
[407]
[408]
[409]
Visualizing Data using t-SNE
[410]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
[411]
[412]
Gated recurrent unit (GRU)-based deep learning method for spectrum estimation and inverse modeling in plasmonic devices
[413]
Mega-NeRF: Scalable Construction of Large-Scale Neural Radiance Fields
[414]
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
[415]
TIPS: Text-Image Pretraining with Spatial awareness
[416]
[417]
[418]
NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections
[419]
[420]
[421]
[422]
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
[423]
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models
[424]
[425]
[426]
[427]
Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures
[428]
Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines
[429]
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[430]
RawNeRF: Neural Radiance Fields from Noisy Raw Images
[431]
Scaling Open-Vocabulary Object Detection
[432]
Scaling Open-Vocabulary Object Detection
[433]
Simple Open-Vocabulary Object Detection with Vision Transformers
[434]
[435]
Conditional Generative Adversarial Nets
[436]
Self-Supervised Learning of Pretext-Invariant Representations
[437]
Representation Learning via Invariant Causal Mechanisms
[438]
[439]
[440]
[441]
Inceptionism: Going Deeper into Neural Networks
[442]
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
[443]
Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
[444]
SILC: Improving Vision Language Pretraining with Self-Distillation
[445]
Deep Double Descent: Where Bigger Models and More Data Hurt
[446]
Video Transformer Network
[447]
Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned By Each Neuron in Deep Neural Networks
[448]
Bellman Optimal Stepsize Straightening of Flow-Matching Models
[449]
Improved Denoising Diffusion Probabilistic Models
[450]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
[451]
[452]
[453]
Understanding SSIM
[454]
[455]
UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction
[456]
[457]
Attention U-Net: Learning Where to Look for the Pancreas
[458]
Pixel Recurrent Neural Networks
[459]
Representation Learning with Contrastive Predictive Coding
[460]
Neural Discrete Representation Learning
[461]
Conditional Image Generation with PixelCNN Decoders
[462]
GPT-4V(ision) System Card
[463]
DINOv2: Learning Robust Visual Features without Supervision
[464]
[465]
[466]
[467]
DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation
[468]
HyperNeRF: A Higher-Dimensional Representation for Topologically Varying Neural Radiance Fields
[469]
Nerfies: Deformable Neural Radiance Fields
[470]
[471]
[472]
[473]
ROI Pool and Align: PyTorch Implementation
[474]
[475]
[476]
[477]
[478]
Scalable Diffusion Models with Transformers
[479]
FiLM: Visual Reasoning with a General Conditioning Layer
[480]
[481]
A Self-Supervised Descriptor for Image Copy Detection
[482]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
[483]
[484]
Acceleration of Stochastic Approximation by Averaging
[485]
[486]
Multisample Flow Matching: Straightening Flows with Minibatch Couplings
[487]
DreamFusion: Text-to-3D using 2D Diffusion
[488]
D-NeRF: Neural Radiance Fields for Dynamic Scenes
[489]
[490]
[491]
[492]
[493]
MobileNetV4 -- Universal Models for the Mobile Ecosystem
[494]
Qwen2.5 Technical Report
[495]
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
[496]
[497]
[498]
Learning Transferable Visual Models From Natural Language Supervision
[499]
Designing Network Design Spaces
[500]
Designing Network Design Spaces
[501]
[502]
[503]
[504]
[505]
Stand-Alone Self-Attention in Vision Models
[506]
[507]
DALL-E: Creating Images from Text Descriptions
[508]
Hierarchical Text-Conditional Image Generation with CLIP Latents
[509]
Zero-Shot Text-to-Image Generation
[510]
[511]
[512]
Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks
[513]
SAM 2: Segment Anything in Images and Videos
[514]
Generating Diverse High-Fidelity Images with VQ-VAE-2
[515]
Broaden Your Views for Self-Supervised Video Learning
[516]
[517]
[518]
You Only Look Once: Unified, Real-Time Object Detection
[519]
[520]
[521]
[522]
[523]
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
[524]
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
[525]
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
[526]
[527]
Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression
[528]
[529]
[530]
SASSL: Enhancing Self-Supervised Learning via Neural Style Transfer
[531]
High-Resolution Image Synthesis with Latent Diffusion Models
[532]
[533]
[534]
[535]
"GrabCut": interactive foreground extraction using iterated graph cuts
[536]
Principles of Mathematical Analysis
[537]
[538]
[539]
[540]
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
[541]
Assessing Generative Models via Precision and Recall
[542]
Progressive Distillation for Fast Sampling of Diffusion Models
[543]
Improved Techniques for Training GANs
[544]
[545]
[546]
[547]
[548]
CLIP-Mesh: Generating Textured Meshes from Text Using Pretrained Image-Text Models
[549]
How Does Batch Normalization Help Optimization?
[550]
How Does Batch Normalization Help Optimization?
[551]
StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets
[552]
[553]
[554]
FaceNet: A unified embedding for face recognition and clustering
[555]
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
[556]
[557]
A Basic Introduction to Separable Convolutions
[558]
[559]
[560]
[561]
[562]
Fast Transformer Decoding: One Write-Head is All You Need
[563]
[564]
Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
[565]
Pixels, voxels, and views: A study of shape representations for single view 3D object shape prediction
[566]
Unifying and Boosting Gradient-Based Training-Free Neural Architecture Search
[567]
[568]
Understanding Camera Calibration and Intrinsics
[569]
DINOv3
[570]
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
[571]
[572]
Very Deep Convolutional Networks for Large-Scale Image Recognition
[573]
Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations
[574]
GEOMetrics: Exploiting Geometric Structure for Graph-Encoded Objects
[575]
Cyclical Learning Rates for Training Neural Networks
[576]
Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
[578]
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
[579]
Convolutions and Backpropagation
[580]
Denoising Diffusion Implicit Models
[581]
[582]
[583]
Score-Based Generative Modeling through Stochastic Differential Equations
[584]
Striving for Simplicity: The All Convolutional Net
[585]
[586]
[587]
[588]
[589]
How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
[590]
L2 Regularization and Batch Normalization: How They Interact
[591]
Direct Voxel Grid Optimization: Super-fast Convergence for Radiance Fields Reconstruction
[592]
Improved Direct Voxel Grid Optimization for Radiance Fields Reconstruction
[593]
Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling
[594]
[595]
[596]
[597]
[598]
[599]
EfficientNetV2: Smaller Models and Faster Training
[600]
[601]
MnasNet: Platform-Aware Neural Architecture Search for Mobile
[602]
Block-NeRF: Scalable Large Scene Neural View Synthesis
[603]
Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains
[604]
[605]
Nerfacto: A Fast Hash-Grid NeRF Baseline
[606]
[607]
[608]
DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis
[609]
[610]
[611]
What Do Single-view 3D Reconstruction Networks Learn?
[612]
DeepFloyd IF: A Cascaded Diffusion Model for Text-to-Image Synthesis
[613]
Higher Accuracy on Vision Models with EfficientNet-Lite
[614]
[615]
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
[616]
[617]
FCOS: Fully Convolutional One-Stage Object Detection
[618]
[619]
MLP-Mixer: An all-MLP Architecture for Vision
[620]
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?
[621]
TrajectoryNet: Learning Continuous Dynamics for Optimal Transport
[622]
VideoMAE
[623]
[624]
[625]
DeiT III: Revenge of the ViT
[626]
Fixing the train-test resolution discrepancy
[627]
Going deeper with Image Transformers
[628]
On Adaptive Attacks to Adversarial Example Defenses
[629]
[630]
A Closer Look at Spatiotemporal Convolutions for Action Recognition
[631]
[632]
GRF: Learning a General Radiance Field for 3D Representation and Rendering
[633]
SPARF: Neural Radiance Fields from Sparse and Noisy Poses
[634]
Review: Group Normalization (GN) for Image Classification
[635]
[636]
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
[637]
Active Data Curation Effectively Distills Large-Scale Multimodal Models
[638]
Selective Search for Object Recognition
[639]
Data Augmentation using Ultralytics YOLO
[640]
Instance Normalization: The Missing Ingredient for Fast Stylization
[641]
[642]
Flow Matching and Generative Modeling: Deep Dive and Code Walkthrough
[643]
[644]
[645]
Residual Networks Behave Like Ensembles of Relatively Shallow Networks
[646]
Optimal Transport: Old and New
[647]
[648]
[649]
[650]
[651]
[652]
LocCa: Visual Pretraining with Location-aware Captioners
[653]
Regularization of Neural Networks using DropConnect
[654]
IBRNet: Learning Multi-View Image-Based Rendering
[655]
Additive Margin Softmax for Face Verification
[656]
[657]
Video Modeling with Correlation Networks
[658]
[659]
[660]
[661]
[662]
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks
[663]
[664]
VideoMAE
[665]
Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images
[666]
BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields
[667]
[668]
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
[669]
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning
[670]
[671]
InternV
[672]
Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks
[673]
[674]
Videos as Space-Time Region Graphs
[675]
Non-local Neural Networks
[676]
[677]
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
[678]
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
[679]
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
[680]
[681]
[682]
[683]
[684]
[685]
StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models
[686]
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
[687]
[688]
[689]
Pixel2Mesh++: Multi-View 3D Mesh Generation via Deformation
[690]
[691]
Aliasing --- Wikipedia
[692]
Sine and cosine --- Wikipedia
[693]
[694]
[695]
Roofline: an insightful visual performance model for multicore architectures
[696]
[697]
[698]
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
[699]
Robust fine-tuning of zero-shot models
[700]
MeMViT: Memory-Augmented Multiscale Vision Transformers for Efficient Long-Term Video Recognition
[701]
[702]
[703]
[704]
[705]
3D ShapeNets: A Deep Representation for Volumetric Shapes
[706]
[707]
Unsupervised Deep Embedding for Clustering Analysis
[708]
Aggregated Residual Transformations for Deep Neural Networks
[709]
[710]
[711]
[712]
[713]
[714]
Point-NeRF: Point-based Neural Radiance Fields
[715]
[716]
Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models
[717]
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
[718]
Sliced Wasserstein Generative Models
[719]
LanguageBind: Extending Video-Language Pretraining to Multiple Modalities
[720]
[721]
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
[722]
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
[723]
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
[724]
Consistency Flow Matching: Defining Straight Flows with Velocity Consistency
[725]
[726]
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection
[727]
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment
[728]
FILIP: Fine-grained Interactive Language-Image Pre-Training
[729]
[730]
Efficient DETR: Improving End-to-End Object Detector with Dense Prior
[731]
Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance
[732]
Volume Rendering of Neural Implicit Surfaces
[733]
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
[734]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
[735]
Overfitting vs. Underfitting
[736]
[737]
Understanding Neural Networks Through Deep Visualization
[738]
Large batch training of convolutional networks
[739]
pixelNeRF: Neural Radiance Fields from One or Few Images
[740]
PlenOctrees for Real-Time Rendering of Neural Radiance Fields
[741]
[742]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
[743]
Vector-quantized Image Modeling with Improved VQGAN
[744]
[745]
Florence: A New Foundation Model for Computer Vision
[746]
[747]
[748]
[749]
Understanding Batch Normalization Backpropagation
[750]
[751]
[752]
[753]
[754]
[755]
[756]
[757]
[758]
Sigmoid Loss for Language Image Pre-Training
[759]
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
[760]
[761]
[762]
[763]
[764]
Self-Attention Generative Adversarial Networks
[765]
[766]
[767]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
[768]
ResNeSt: Split-Attention Networks
[769]
GLIPv2: Unifying Localization and Vision-Language Understanding
[770]
Fixup Initialization: Residual Learning Without Normalization
[771]
mixup: Beyond Empirical Risk Minimization
[772]
NeRF++: Analyzing and Improving Neural Radiance Fields
[773]
[774]
VinVL: Making Visual Representations Matter in Vision-Language Models
[775]
[776]
Making Convolutional Networks Shift-Invariant Again
[777]
[778]
The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
[779]
[780]
[781]
[782]
Recognize Anything: A Strong Image Tagging Model
[783]
Inversion-Based Style Transfer with Diffusion Models
[784]
[785]
[786]
Multimodal Chain-of-Thought Reasoning in Language Models
[787]
[788]
[789]
[790]
Pyramid Scene Parsing Network
[791]
VideoPrism: A Foundational Visual Encoder for Video Understanding
[792]
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
[793]
Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head
[794]
Learning Video Representations from Large Language Models
[795]
RegionCLIP: Region-based Language-Image Pretraining
[796]
[797]
[798]
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
[799]
iBOT: Image BERT Pre-Training with Online Tokenizer
[800]
[801]
Unified Vision-Language Pre-Training for Image Captioning and VQA
[802]
[803]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
[804]
[805]
[806]
[807]
[808]
Deformable DETR: Deformable Transformers for End-to-End Object Detection
[809]
Transfusion: Cross-modal Diffusion Models for Reference-based Image Editing and Generation
[810]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
[811]
[812]
[813]
[814]
Generalized Decoding for Pixel, Image, and Language