In the Twitter thread the article mentions, LeCun makes his claim only for "high-resolution" images, and the article assumes 1024x1024 falls under that category. To me, 1024x1024 is not "high-resolution," so that assumption seems flawed.
I currently use ConvNeXt for image classification at a size of 4096x2048 (which definitely counts as "high-resolution"). For my use case, it would never be practical to use ViTs. I can't downscale the resolution because extremely fine details need to be preserved.
I don't think LeCun's comment was a "knee-jerk reaction" as the article claims.
ConvNeXT's architecture contains an AdaptiveAvgPool2d layer: https://github.com/pytorch/vision/blob/5f03dc524bdb7529bb4f2...
This means that you can split your image into tiles, process each tile individually, average the results, apply a final classification layer to the average, and get exactly the same result. For reference, see the demonstration below.
You could of course do exactly the same thing with a vision transformer instead of a convolutional neural network.
That being said, architecture is wildly overemphasized in my opinion. Data is everything.
import torch
import torchvision.models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torchvision.models.convnext_small()
model.to(device)
model.eval()  # disable stochastic depth so both code paths are deterministic

tile_size, image_size = 32, 224  # note that 32 divides 224 evenly
image = torch.randn((1, 3, image_size, image_size), device=device)

with torch.inference_mode():
    # Process the image as usual: features -> AdaptiveAvgPool2d -> classifier
    x_expected = model(image)
    # Process the image as tiles (for-loops for educational purposes;
    # use .view and .permute instead for performance)
    features = [
        model.features(image[:, :, y:y + tile_size, x:x + tile_size])
        for y in range(0, image_size, tile_size)
        for x in range(0, image_size, tile_size)]
    # Averaging the per-tile features stands in for the global average pool
    x_tiled = model.classifier(sum(features) / len(features))

print(f"Mean squared error: {(x_tiled - x_expected).pow(2).mean().item():.20f}")
As someone who's done a fair bit of architecture work -- both are important! Making it either/or is a very silly thing; each is the limiting factor for the other, and there are no two ways about it.
Also, for classification, MaxPooling is often far superior: you can learn an average smoothing filter in your convolutions beforehand, in a data-dependent manner, so that Nyquist sampling concerns are properly handled.
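For illustration, a minimal sketch of swapping the global pooling in the torchvision ConvNeXt used above -- avgpool is the stock attribute; whether max pooling actually helps is of course task- and data-dependent, and the swapped model would need (re)training:

import torch
import torchvision.models

model = torchvision.models.convnext_small()
# Hypothetical variant: replace the global average pool with a global max pool
model.avgpool = torch.nn.AdaptiveMaxPool2d(1)
model.eval()

with torch.inference_mode():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])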
Also, please use smoothed crossentropy for image classification (generally speaking, unless maybe the data is hilariously large); MSE won't nearly cut it!
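In PyTorch this is just the label_smoothing argument of the built-in cross-entropy loss; a minimal sketch with an assumed smoothing factor of 0.1:

import torch

criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothed cross-entropy
logits = torch.randn(8, 1000)             # batch of 8, 1000 classes
targets = torch.randint(0, 1000, (8,))    # integer class labels
loss = criterion(logits, targets)
print(loss.item())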
But that being said, adaptive pooling certainly is great when doing classification. Something to note is that batching does become an issue at a certain point -- as do certain other fine-grained details if you're simply going to average everything down to one single vector (IIUC).
Curious what kind of classification problem requires full 4096x2048 images -- couldn't you feed multiple 512x512 overlapping crops instead?
Interesting. Can you run your images through a segmentation model first and then only classify the interesting boxes?
LeCun's technical assessments have been borne out over a lot of years. The likely next step in scaling vision transformers is to treat the image as a MIP pyramid and use the transformer to adaptively sample out of it. It requires RL to train (tricky), but it would decouple the compute footprint from the input size.
As someone who has worked in computer vision ML for nearly a decade, this sounds like a terrible idea.
You don't remotely need RL for this use case. Image resolution pyramids are pretty normal, though, and handling them well/efficiently is the big thing. Using RL for this would be like trying to use graphene to make a computer screen because it's new and flashy and everyone's talking about it. RL is inherently very sample-inefficient, and is there to approximate things when you don't have certain well-defined informative components -- which we have in computer vision in spades. Crossentropy losses (and the like) are (generally, IME/IMO) what RL losses try to approximate, only on a much larger (and more poorly-defined) scale.
Please mark speculation as such -- I've seen people read confident statements like this and spend a lot of time/man-hours on them (because they seem plausible). It is not a bad idea from a creativity standpoint, but practically it is most certainly not the way to go about it.
(That being said, you can try dynamic sparsity approaches; they have some painful tradeoffs that generally don't scale, but no way in Illinois do you need RL for that.)
Really appreciated the post, very insightful. We also use ViTs for some of our models and find that, between model compilation and hyperparameter tuning, we are able to get sub-second evaluation of images on commodity hardware while maintaining high precision and recall.
> You don't need very high resolution
Yes, you do. Also, 1024x1024 is not high resolution.
An example is segmenting basic 1920x1080 (FHD) video at 60 Hz.
The article basically argues that you would expect to get similarly good results with subsampling in practice, e.g. no need to process at 1920x1080 when you can do 960x540. Separately, you can break down many problems into smaller tiles and get similar-quality results without the compute overhead of a high-res ViT.
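As a concrete sketch of the subsampling step (assuming a batched image tensor; the 1920x1080 to 960x540 sizes are just the example from the comment above):

import torch
import torch.nn.functional as F

frames = torch.randn(1, 3, 1080, 1920)  # e.g. a batch of FHD frames
# Antialiased bilinear downsampling to a quarter of the pixel count before the model
small = F.interpolate(frames, size=(540, 960), mode="bilinear", antialias=True)
print(small.shape)  # torch.Size([1, 3, 540, 960])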
> text in photos, phone screens, diagrams and charts, 448px² is enough
Not in the graph you provided as an example.
It has this note at the bottom:
"Note that I chose an unusually long chart to exemplify an extreme case of aspect ratio stretching. Still, 512px² is enough.
This is two_col_40643 from ChartQA validation set. Original resolution: 800x1556."
But yeah, ultimately which resolution you need depends on the image content, and if you need to squeeze out every bit of accuracy, processing at the original resolution is unavoidable.
It's enough, especially if you select one of the sharper resampling options like Lanczos, but at 512px it is sure a lot easier for a human to read.
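For reference, a minimal resize sketch with Pillow, assuming a local file chart.png (hypothetical path); LANCZOS is the sharper filter being referred to, and 512x512 matches the stretched-square case quoted above:

from PIL import Image

img = Image.open("chart.png")  # hypothetical input chart image
# Downscale to a 512px square; LANCZOS keeps thin lines and small text sharper
small = img.resize((512, 512), resample=Image.Resampling.LANCZOS)
small.save("chart_512.png")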