Technology

AI Model Speeds Up High-Resolution Computer Vision

An autonomous car must recognize the objects it encounters quickly and precisely, from an idling delivery truck parked at the corner to a cyclist whizzing toward an approaching intersection.

To accomplish this, the vehicle might use a powerful computer vision model to classify every pixel in a high-resolution image of the scene, so it does not lose sight of objects that would be obscured in a lower-quality image. But this task, known as semantic segmentation, is complex and requires a huge amount of computation when the image has high resolution.
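
To make “classify every pixel” concrete, here is a minimal sketch of what a segmentation output looks like; the shapes and class names are hypothetical and not taken from the researchers’ code:

```python
import numpy as np

# Hypothetical per-pixel class scores (logits) for a tiny image:
# shape (num_classes, height, width). A real segmentation model would
# produce these from a full high-resolution camera frame.
num_classes, height, width = 3, 4, 6   # e.g. road, vehicle, cyclist
rng = np.random.default_rng(0)
logits = rng.standard_normal((num_classes, height, width))

# Semantic segmentation assigns every pixel the class with the highest
# score, producing a (height, width) label map -- one decision per pixel.
label_map = logits.argmax(axis=0)
print(label_map.shape)  # (4, 6)
```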

Researchers from MIT, the MIT-IBM Watson AI Lab, and other institutions have created a more efficient computer vision model that vastly reduces the computational complexity of this task. Their model can perform semantic segmentation accurately in real time on a device with limited hardware resources, such as the on-board computers that enable an autonomous vehicle to make split-second decisions.

Recent state-of-the-art semantic segmentation models directly learn the interaction between each pair of pixels in an image, so their computations grow quadratically as image resolution increases. Because of this, while these models are accurate, they are too slow to process high-resolution images in real time on an edge device such as a sensor or mobile phone.

The MIT researchers created a new building block for semantic segmentation models that achieves the same capabilities as these state-of-the-art models, but with only linear computational complexity and hardware-efficient operations.

As a result, when deployed on a mobile device, the new model series for high-resolution computer vision runs up to nine times faster than prior models. Importantly, this new model series achieved the same or better accuracy than these alternatives.

This technique could be used not only to help autonomous vehicles make decisions in real time, but also to improve the efficiency of other high-resolution computer vision tasks, such as medical image segmentation.

“While traditional vision transformers have been used by researchers for a long time and produce amazing results, we want people to also pay attention to the efficiency of these models. Our work demonstrates that it is possible to drastically reduce the computation so this real-time image segmentation can occur locally on a device,” says Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and senior author of the paper describing the new model.

Han is joined on the paper by lead author Han Cai, an EECS graduate student; Junyan Li, an undergraduate student at Zhejiang University; Muyan Hu, an undergraduate student at Tsinghua University; and Chuang Gan, a principal research staff member at the MIT-IBM Watson AI Lab. The findings will be presented at the International Conference on Computer Vision, which will be held in Paris from October 2-6. The paper is also available on the arXiv preprint server.

A simplified solution: Categorizing every pixel in a high-resolution image that may contain millions of pixels is a difficult task for a machine-learning model. A powerful new type of model, known as a vision transformer, has recently been used effectively for it.

Transformers were originally developed for natural language processing. In that context, they encode each word in a sentence as a token and then build an attention map that captures each token’s relationships with every other token. When making a prediction, the model uses this attention map to understand context.

Using a similar approach, a vision transformer chops an image into patches of pixels and encodes each small patch into a token before generating an attention map. The model creates this attention map using a similarity function that directly learns the interaction between each pair of pixels. In doing so, the model gains a global receptive field, which means it can access all the relevant parts of the image.

But because a high-resolution image may contain millions of pixels, chunked into thousands of patches, the attention map quickly becomes enormous. As a result, the amount of computation grows quadratically as the image resolution increases.
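
As a rough illustration of where that quadratic growth comes from, the sketch below implements generic softmax self-attention in NumPy; it is not the researchers’ code, and the sizes are illustrative. The explicit N x N attention map is the bottleneck:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard self-attention: materializes an explicit N x N attention
    map, so compute and memory grow quadratically with the token count N."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # (N, N) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over each row
    return attn @ V                               # (N, d) output

# A 1024x1024 image split into 16x16 patches yields N = 4096 tokens, so the
# attention map alone holds 4096**2 (about 16.8 million) entries; doubling
# the resolution quadruples N and makes the map 16 times larger.
N, d = 1024, 64                                   # kept small for the demo
Q, K, V = (np.random.randn(N, d) for _ in range(3))
out = softmax_attention(Q, K, V)                  # out.shape == (1024, 64)
```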

In their new model series, called EfficientViT, the MIT researchers used a simpler mechanism to build the attention map, replacing the nonlinear similarity function with a linear one. This lets them reorder the operations to cut total calculations without changing functionality or losing the global receptive field. With their model, the amount of computation needed for a prediction grows linearly as the image resolution increases.
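
The sketch below shows the general linear-attention trick in the same NumPy style, using a ReLU feature map as one common choice of linear similarity; it is a generic formulation, not necessarily EfficientViT’s exact design. Because the similarity is a plain dot product, the matrix products can be reordered so that no N x N map is ever formed:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Linear attention: swap the softmax similarity for a simple linear one
    (here a ReLU feature map) and reorder the products. Computing K^T V
    first costs O(N * d^2), versus O(N^2 * d) for an explicit attention map."""
    Qf, Kf = np.maximum(Q, 0.0), np.maximum(K, 0.0)  # non-negative features
    KV = Kf.T @ V                                    # (d, d) -- no N x N term
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T         # (N, 1) row normalizer
    return (Qf @ KV) / (Z + eps)                     # (N, d) output

N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
out = linear_attention(Q, K, V)                      # same shape, linear cost in N
```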

“However, there is no such thing as a free lunch. Linear attention only captures global context about the image, leaving out local information, which reduces accuracy,” Han explains.

To compensate for this loss of accuracy, the researchers included two extra components in their model, each of which adds only a small amount of computation.

One of these elements helps the model capture local feature interactions, masking the linear function’s weakness in local information extraction. The second, a multiscale learning module, helps the model recognize both large and small objects.
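
As a purely hypothetical illustration of those two ideas, and not the authors’ actual modules, the PyTorch sketch below pairs cheap depthwise convolutions, which restore local pixel interactions, with parallel branches at several kernel sizes, which expose features at multiple scales:

```python
import torch
import torch.nn as nn

class LocalMultiScaleBlock(nn.Module):
    """Hypothetical sketch of the two lightweight additions described above:
    depthwise convolutions recover the local interactions that linear
    attention misses, and branches with different kernel sizes let the
    model see both small and large objects."""
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes            # depthwise convs are cheap
        )
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch sees the same feature map with a different receptive field.
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(multi_scale)    # residual keeps the global features

features = torch.randn(1, 32, 64, 64)        # (batch, channels, height, width)
out = LocalMultiScaleBlock(32)(features)     # same spatial size: (1, 32, 64, 64)
```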

“The most critical part here is that we need to carefully balance performance and efficiency,” Cai explains.

They designed EfficientViT with a hardware-friendly architecture so it could run on a variety of devices, such as virtual reality headsets or the edge computers on autonomous vehicles. Their model could also be applied to other computer vision tasks, such as image classification.

Streamlining semantic segmentation: When they tested their model on semantic segmentation datasets, they found that it performed up to nine times faster on an Nvidia graphics processing unit (GPU) than other popular vision transformer models, with the same or better accuracy.

“Now we can have the best of both worlds,” Han says, “and reduce the computing to make it fast enough to run on mobile and cloud devices.”

Based on these findings, the researchers intend to employ this technique to accelerate generative machine-learning models, such as those used to produce new images. They also intend to keep expanding EfficientViT for other vision tasks.

“Efficient transformer models, pioneered by Professor Song Han’s team, now form the backbone of cutting-edge techniques in diverse computer vision tasks, including detection and segmentation,” says Lu Tian, senior director of AI algorithms at AMD, Inc., who was not involved in the research. “Their research not only demonstrates transformer efficiency and capability but also reveals their enormous potential for real-world applications, such as improving image quality in video games.”

“Model compression and light-weight model design are critical research topics in the context of large foundation models for efficient AI computing. Professor Song Han’s lab has done great work compressing and accelerating modern deep learning models, notably vision transformers,” says Jay Jackson, Oracle’s global vice president of artificial intelligence and machine learning, who was not involved in this research. “Oracle Cloud Infrastructure has been assisting his team in moving this important line of research toward efficient and green AI.”