
With FlexOlmo, people can train AI without handing over their data. They can even remove their contribution after the model is complete.
Data is the lifeblood of modern AI, but people are increasingly wary of sharing their information with model builders. A new architecture could get around the problem by letting data owners control how training data is used even after a model has been built.
The impressive capabilities of today’s leading AI models are the result of an enormous data-scraping operation that hoovered up vast amounts of publicly available information. This has raised thorny questions around consent and whether people have been properly compensated for the use of their data. And data owners are increasingly looking for ways to protect their data from AI companies.
A new architecture from researchers at the Allen Institute for AI (Ai2) called FlexOlmo presents a potential workaround. FlexOlmo allows models to be trained on private datasets without owners ever having to share the raw data. It also lets owners remove their data, or limit its use, after training has finished.
“FlexOlmo opens the door to a new paradigm of collaborative AI development,” the Ai2 researchers wrote in a blog post describing the new approach. “Data owners who want to contribute to the open, shared language model ecosystem but are hesitant to share raw data or commit permanently can now participate on their own terms.”
The team developed the new architecture to solve several problems with the existing approach to model training. Currently, data owners must make a one-time and essentially irreversible decision about whether or not to include their information in a training dataset. Once this data has been publicly shared, there's little prospect of controlling who uses it. And if a model is trained on certain data, there's no way to remove it later, short of completely retraining the model. Given the cost of cutting-edge training runs, few model developers are likely to agree to this.
FlexOlmo gets around this by allowing each data owner to train a separate model on their own data. These models are then merged to create a shared model, building on a popular approach called “mixture of experts” (MoE), in which multiple smaller expert models are trained on specific tasks. A routing model is then trained to decide which experts to engage to solve specific problems.
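To make the mechanism concrete, here is a minimal sketch of a mixture-of-experts layer in PyTorch: several expert networks plus a learned router that scores them for each token and sends the token to its top-scoring experts. The module names, sizes, and top-2 routing below are illustrative assumptions, not FlexOlmo's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a router picks top_k experts per token."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
            )
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # one score per expert, per token
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        scores = self.router(x)                             # (batch, seq, n_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)  # best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., k] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Route a small batch of token embeddings through four experts.
layer = SimpleMoELayer(d_model=64, d_hidden=256, n_experts=4)
tokens = torch.randn(2, 10, 64)
print(layer(tokens).shape)  # torch.Size([2, 10, 64])
```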
Training expert models on very different datasets is tricky, though, because the resulting models diverge too far to be merged effectively. To solve this, FlexOlmo provides a shared public model pre-trained on publicly available data. Each data owner who wants to contribute to a project creates two copies of this model and trains them side by side on their private dataset, effectively creating a two-expert MoE model.
While one of these copies trains on the new data, the parameters of the other are frozen so its values don't change during training. Training the two jointly teaches the first copy to coordinate with the frozen version of the public model, known as the "anchor." This means all privately trained experts can coordinate with the shared public model, making it possible to merge them into one large MoE model.
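Here is a rough sketch of what that anchored training could look like, assuming simple feed-forward "experts" and a mean-squared-error stand-in for the real language-modeling loss (this is an illustration of the idea, not Ai2's code): one copy of the public expert is frozen as the anchor, while the second copy and a small two-way router are trained on the owner's private data.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 64
# Stand-in for the shared public model's expert module.
public_expert = nn.Sequential(nn.Linear(d_model, 256), nn.GELU(), nn.Linear(256, d_model))

anchor = copy.deepcopy(public_expert)          # frozen copy: the "anchor"
for p in anchor.parameters():
    p.requires_grad_(False)                    # its values never change during training

private_expert = copy.deepcopy(public_expert)  # this copy learns from the private data
router = nn.Linear(d_model, 2)                 # chooses between anchor and private expert

optimizer = torch.optim.AdamW(
    list(private_expert.parameters()) + list(router.parameters()), lr=1e-4
)

def two_expert_forward(x: torch.Tensor) -> torch.Tensor:
    # Soft two-expert mixture of the frozen anchor and the trainable private expert.
    w = F.softmax(router(x), dim=-1)           # (batch, seq, 2)
    return w[..., 0:1] * anchor(x) + w[..., 1:2] * private_expert(x)

# One illustrative training step on stand-in "private" data.
x, target = torch.randn(4, 16, d_model), torch.randn(4, 16, d_model)
loss = F.mse_loss(two_expert_forward(x), target)
loss.backward()
optimizer.step()
```

Because the anchor's weights stay identical to the public model's, the privately trained expert and its router remain compatible with experts trained by other owners against the same anchor.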
When the researchers merged several privately trained expert models with the pre-trained public model, the combined model achieved significantly higher performance than the public model alone. Crucially, the approach means data owners never need to share their raw data with anyone; they can decide which kinds of tasks their expert should contribute to, and they can even remove their expert from the shared model.
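Opting out can then be as simple as dropping the owner's expert from the merged model. One plausible way to do that at inference time (an assumption about the mechanism, not a procedure described by Ai2) is to mask the withdrawn expert's router scores so no token is ever sent to it:

```python
import torch

router_logits = torch.randn(2, 10, 4)                  # router scores for 4 experts
withdrawn = torch.tensor([False, False, True, False])  # the owner of expert 2 opts out
router_logits = router_logits.masked_fill(withdrawn, float("-inf"))
weights, indices = router_logits.topk(2, dim=-1)       # expert 2 can never be selected
```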
The researchers say the approach could be particularly useful for applications involving sensitive private data, such as information in healthcare or government, by allowing a range of organizations to pool their resources without surrendering control of their datasets.
There is a chance that attackers could extract sensitive data from the shared model, the team admits, but their experiments showed the risk is low. And the approach can be combined with privacy-preserving training techniques like "differential privacy" to provide more concrete protection.
The technique might be too cumbersome for many model developers, who care more about performance than about the concerns of data owners. But it could be a powerful new way to open up datasets that have been locked away over security or privacy worries.
This article was originally published at Singularity Hub.