A new method for improving the accuracy of computer vision models
The approach presented here uses synthetic data to improve the accuracy of AI models that recognize images.
For a machine learning model to diagnose diseases in medical images, it must first be trained to do so. Training an image classification model usually requires a huge dataset: millions of example images of the same kind. And this is where the problems arise.
Using real medical images is not always ethical: it can invade patients' privacy, violate copyright, or yield a dataset biased against a particular racial or ethnic group. To minimize such risks, one can forgo real images and use image generation programs instead, building a synthetic dataset for training an image classification model. However, these methods are limited, because writing image generation programs that produce effective training data usually requires manual expert work.
Researchers from the Massachusetts Institute of Technology, the MIT-IBM Watson AI Lab and elsewhere analyzed the problems that arise in generating image datasets and proposed a different solution. Instead of developing a custom image generation program for a particular training task, they assembled a large collection of basic image generation programs from code publicly available on the Internet.
Their collection consisted of 21,000 different programs capable of creating images of simple textures and colors. The programs were small, usually taking up only a few lines of code. The researchers did not modify these programs and used them as-is to generate a set of images.
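The article does not show the programs themselves, but as a rough illustration (not the researchers' actual code), a few lines of Python are already enough to produce a simple procedural color texture of the kind described:

```python
import numpy as np
from PIL import Image

def procedural_texture(seed: int, size: int = 256) -> Image.Image:
    """Generate a simple sinusoidal color texture; illustrative only."""
    rng = np.random.default_rng(seed)
    fx, fy, phase = rng.uniform(0.01, 0.2, size=3)   # random frequencies and phase
    y, x = np.mgrid[0:size, 0:size]
    r = np.sin(fx * x + phase)                       # each channel is a wave pattern
    g = np.sin(fy * y + 2 * phase)
    b = np.sin(fx * x + fy * y)
    img = np.stack([r, g, b], axis=-1)
    img = ((img + 1) / 2 * 255).astype(np.uint8)     # map [-1, 1] to [0, 255]
    return Image.fromarray(img)

procedural_texture(seed=0).save("texture_0.png")
```

Varying the seed yields a different abstract texture each time, which hints at how a large collection of such tiny programs can produce a diverse image dataset.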
They used this dataset to train a computer vision model. In tests, models trained on this dataset classified images more accurately than other synthetically trained models, although they still fell short of models trained on real data. The researchers also found that increasing the number of image generation programs in the collection raised the model's performance, making higher accuracy achievable.
It turned out that using many programs that require no extra curation actually works better than using a small set of programs that need additional processing. Data certainly matter, but this experiment showed that good results can be achieved without real data as well.
This research invites a rethink of the pre-training process. Machine learning models are usually pre-trained: they are first trained on one dataset, where they learn parameters that can then be reused to solve other problems.
For example, a model designed to classify X-ray images may first be pre-trained on a huge dataset of synthetically generated images, and only then fine-tuned on a much smaller dataset of real X-rays to perform its actual task. The problem with this method is that the synthetic images must match certain properties of the real ones, which in turn requires additional work on the programs that generate them. This complicates the training process.
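A minimal sketch of this two-stage workflow, assuming a PyTorch setup; the loaders, class counts and hyperparameters below are illustrative placeholders, not details from the study:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Dummy stand-ins so the sketch runs; real data loaders would replace these.
synthetic_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,)))]
xray_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,)))]

# Stage 1: pre-train on synthetic images (the label scheme is illustrative).
model = resnet18(num_classes=1000)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
for images, labels in synthetic_loader:
    optimizer.zero_grad()
    criterion(model(images), labels).backward()
    optimizer.step()

# Stage 2: swap the head and fine-tune on a small real X-ray dataset.
model.fc = nn.Linear(model.fc.in_features, 2)   # e.g. healthy vs. pathology
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
for images, labels in xray_loader:
    optimizer.zero_grad()
    criterion(model(images), labels).backward()
    optimizer.step()
```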
Instead, the researchers from the Watson AI Lab used simple image generation programs, gathered in large numbers from the Internet. The programs had to generate images quickly, so the scientists chose ones written in a simple programming language and consisting of just a few snippets of code. The requirements for the generated images were also simple: they only needed to look like abstract art.
These programs ran so fast that there was no need to prepare a set of images in advance: the programs generated images on the fly, and the model trained on them immediately. This greatly simplifies the process.
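A minimal sketch of such on-the-fly training data, again in Python; the wave-texture generator below is a hypothetical stand-in for one of the collected programs:

```python
import numpy as np
import torch

def generator_image(rng: np.random.Generator, size: int = 64) -> torch.Tensor:
    """Hypothetical stand-in for one collected program: a random wave texture."""
    fx, fy = rng.uniform(0.05, 0.5, size=2)       # random spatial frequencies
    y, x = np.mgrid[0:size, 0:size]
    img = np.stack([np.sin(fx * x), np.sin(fy * y), np.sin(fx * x + fy * y)])
    return torch.from_numpy(img.astype(np.float32))

rng = np.random.default_rng(0)
for step in range(100):
    # A fresh batch is generated at every step; nothing is stored on disk.
    batch = torch.stack([generator_image(rng) for _ in range(32)])
    # ... run one training step of the vision model on `batch` here ...
```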
The scientists used their vast collection of image generation programs to pre-train computer vision models for both supervised and unsupervised image classification tasks. In supervised learning, the image data are labeled; in unsupervised learning, the model learns to classify images without labels.
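The article does not spell out the training objectives, so the following is a hedged sketch rather than the paper's exact recipe: a plausible supervised setup is to treat the index of the generating program as the class label, while an unsupervised setup can use a contrastive, SimCLR-style objective on two augmented views of the same image:

```python
import torch
import torch.nn.functional as F

# Supervised variant (assumption): the generating program's index is the label.
def supervised_loss(logits: torch.Tensor, program_ids: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(logits, program_ids)

# Unsupervised variant: a simplified NT-Xent-style contrastive loss between
# embeddings z1, z2 of two augmented views of the same batch of images.
def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                    # pairwise cosine similarities
    targets = torch.arange(z1.size(0))            # matching views are positives
    return F.cross_entropy(logits, targets)
```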
When they compared their pre-trained models to modern computer vision models that were pre-trained using synthetic data, their models were more accurate, placing images in the correct categories more often. Although accuracy levels were still lower than those of models trained on real data, this method reduced the performance gap between models trained on real data and models trained on synthetic data by 38 percent.
The research also shows that performance scales logarithmically with the number of generative programs: collect more programs, and the model performs even better. The researchers thus see a clear way to scale their approach further.
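To illustrate what "scales logarithmically" means here (the form is illustrative; no coefficients are reported in the article), the trend can be written as:

```latex
\mathrm{accuracy}(N) \approx \alpha + \beta \log N
```

where N is the number of generative programs and α, β are fitted constants, so each doubling of the program count buys a roughly constant accuracy gain.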
To determine which factors affect model accuracy, the researchers pre-trained with each image generation program separately. They found that the more diverse the set of images a program generated, the better the model performed. They also observed that color images filling the entire canvas were best at improving model performance.
This approach to pre-training proved quite successful. The researchers plan to apply their methods to other types of data, such as multimodal data combining text and images. They also want to continue exploring ways to improve image classification performance.
More details about the study are available in the original article.