Researchers find a way to train machine learning models without real image data
Before a machine learning model can perform a task, such as detecting cancer in medical images, the model must be trained. When training image classification models, the model is usually shown millions of example images collected in a large dataset. But is real image data actually necessary for this process?
Computer systems that use artificial intelligence to interpret images and help doctors make diagnoses are increasingly being used in medicine. They do this by comparing the new images with existing image data. In the process, the machine "learns" continuously. However, machine learning based on images has its pitfalls.
Copyright can prevent machine learning
Indeed, using real image data to train machine learning models can raise practical and ethical issues: The images could violate copyright laws, infringe on people's privacy, or be biased toward a particular racial or ethnic group. To avoid these pitfalls, researchers can use image generation programs to create synthetic data for model training. However, these techniques have limited application, because expert knowledge is often required to design an image generation program that can produce effective training data.
Researchers at MIT, the MIT-IBM Watson AI Lab and other institutes therefore took a different approach. Instead of developing customized image generation programs for a specific training task, they collected a dataset of 21,000 publicly available programs from the Internet. Then they used this large collection of basic image generation programs to train a computer vision model. These programs generate different images that represent simple colors and textures. The researchers did not edit or modify the programs, each of which consists of only a few lines of code.
Image programs as a valid replacement
The models they trained with this large dataset of programs classified images more accurately than other synthetically trained models. And although their models performed worse than those trained with real data, the researchers showed that increasing the number of image programs in the dataset also increased the model's performance, pointing the way to higher accuracy.
"It turns out that using many uncurated programs is actually better than using a small set of programs that need to be manipulated by humans. Data is important, but we've shown that you can get pretty far without real data," says Manel Baradad, an electrical engineering and computer science (EECS) doctoral student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of the research paper describing the technique.
Rethinking the pre-training
Machine learning models are typically pre-trained, meaning they are first trained on one dataset to develop parameters that can then be used to accomplish another task. A model for classifying X-ray images might be pre-trained on a huge dataset of synthetically generated images before being fine-tuned on a much smaller dataset of real X-ray images for its actual task.
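The pre-train-then-fine-tune pattern described above can be illustrated with a deliberately tiny, hypothetical example. Here a one-parameter linear model is first fit on plentiful synthetic data from a related task, then adapted with just a few "real" examples; all the functions and numbers are invented for illustration and are not taken from the paper:

```python
def sgd(w, data, lr=0.1, epochs=100):
    """Fit y ≈ w * x by least-squares stochastic gradient descent."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x  # gradient of (w*x - y)^2 w.r.t. w
    return w

# Pre-training: plentiful synthetic data from a *related* task (y = 2x).
synthetic = [(x / 10, 2 * (x / 10)) for x in range(10)]
w = sgd(0.0, synthetic)  # w ends up near 2.0

# Fine-tuning: only a handful of "real" examples from the actual task (y = 2.5x).
# Starting from the pre-trained w, a small learning rate suffices.
real = [(1.0, 2.5), (2.0, 5.0)]
w = sgd(w, real, lr=0.01, epochs=50)  # w moves close to 2.5
```

The point of the sketch is the workflow, not the model: the pre-training stage does the bulk of the learning on cheap data, and the scarce real data only needs to nudge the parameters the rest of the way.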
The researchers had previously shown that they could use a handful of image generation programs to create synthetic data for pretraining the model, but the programs had to be carefully designed so that the synthetic images matched certain properties of the real images. This made it difficult to extend the technique. The new work instead used an enormous dataset of uncurated image generation programs.
Machine learning with "artificially" generated images
The researchers began by assembling a collection of 21,000 image-generating programs from the Internet. All of the programs are written in a simple programming language and consist of just a few snippets of code, so they generate images quickly. "These programs were designed by developers around the world to create images that have some of the characteristics we are interested in. They create images that look almost like abstract art," Baradad explains.
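The article does not reproduce any of the 21,000 programs, but a hypothetical sketch in the same spirit, a few lines of Python that render a colorful abstract texture deterministically from a random seed, might look like this:

```python
import math
import random

def generate_image(seed, width=32, height=32):
    """Hypothetical few-line procedural generator: renders a colored
    sine-interference texture determined entirely by the seed."""
    rng = random.Random(seed)
    fx, fy = rng.uniform(0.1, 1.0), rng.uniform(0.1, 1.0)
    phase = rng.uniform(0.0, 2.0 * math.pi)
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            v = math.sin(fx * x + phase) * math.cos(fy * y)  # v in [-1, 1]
            # Map v to an RGB triple to get a colorful abstract pattern.
            r = int(127 * (v + 1))
            g = int(255 * abs(v))
            row.append((r, g, 255 - r))
        image.append(row)
    return image

img = generate_image(seed=0)  # a 32x32 grid of RGB tuples
```

Each seed yields a different texture, so a single program like this can emit an unbounded stream of distinct training images.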
These simple programs can be run so quickly that the researchers did not have to create images in advance to train the model. The researchers found that they could generate images and train the model simultaneously, streamlining the process. They used their huge dataset of image generation programs to pre-train computer vision models for both supervised and unsupervised image classification tasks. In supervised learning, image data is tagged with labels, while in unsupervised learning, the model learns to categorize images without labels.
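Generating images on demand rather than rendering a dataset to disk can be sketched as an infinite Python generator. The two stand-in programs below are invented placeholders for the downloaded image-generation programs, not the researchers' actual code:

```python
import itertools
import random

# Hypothetical stand-ins for two downloaded image-generation programs;
# each maps a random source to a tiny 8x8 grayscale image.
def stripes(rng):
    return [[(x + y) % 2 * 255 for x in range(8)] for y in range(8)]

def noise(rng):
    return [[rng.randrange(256) for _ in range(8)] for _ in range(8)]

PROGRAMS = [stripes, noise]

def training_stream(seed=0):
    """Yield (image, program_index) pairs on demand, so no image
    dataset ever has to be rendered ahead of training."""
    rng = random.Random(seed)
    while True:
        idx = rng.randrange(len(PROGRAMS))
        yield PROGRAMS[idx](rng), idx

# Draw a small batch lazily, exactly as a training loop would.
batch = list(itertools.islice(training_stream(), 4))
```

Because the stream is lazy, the cost of "building the dataset" disappears: images exist only for the moment the training step consumes them.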
Accuracy improvement
When they compared their pre-trained models to modern computer vision models pre-trained with synthetic data, their models were more accurate, i.e., they assigned images to the correct categories more often. While accuracy was still lower than models trained with real data, their technique narrowed the performance gap between models trained with real data and those trained with synthetic data by 38 percent.
"Importantly, we show that performance scales logarithmically with the number of programs collected. We don't reach performance saturation, which means that if we collect more programs, the model would perform even better. So there is an opportunity to extend our approach," Baradad says.
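Logarithmic scaling means each tenfold increase in the number of programs buys roughly the same fixed accuracy gain. The coefficients in this sketch are invented for demonstration and are not results from the paper:

```python
import math

def projected_accuracy(num_programs, a=0.40, b=0.05):
    """Illustrative-only logarithmic scaling curve: accuracy grows as
    a + b * log10(N). The values of a and b here are hypothetical."""
    return a + b * math.log10(num_programs)

# Every 10x increase in program count adds the same gain (b), with no
# saturation point built into the curve.
gain = projected_accuracy(210_000) - projected_accuracy(21_000)
```

Under such a curve there is no plateau: collecting more programs keeps helping, which is what the quoted "no saturation" observation refers to.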
The researchers also used each image-generating program for pre-training to determine the factors that contribute to the model's accuracy. They found that the model performed better when a program generated a greater variety of images. They also found that colorful images with scenes filling the entire canvas improved the model's performance the most.
Having demonstrated the success of this pre-training approach, the researchers now want to extend their technique to other types of data, such as multimodal data containing text and images. They also want to keep looking for ways to improve image classification performance.
Source: Techexplore.com