Data Transparency in Open Source AI: Protecting Sensitive Datasets
The Open Source Initiative (OSI) is running a blog series to introduce some of the people who have been actively involved in the Open Source AI Definition (OSAID) co-design process. The co-design methodology allows for the integration of diverging perspectives into one just, cohesive and feasible standard. Support and contributions from a significant and broad group of stakeholders are imperative to the Open Source process and have proven to bring diverse issues to light, deliver swift outputs and garner community buy-in.
This series features the voices of the volunteers who have helped shape and are shaping the Definition.
Meet Tarunima Prabhakar
I am the research lead and co-founder at Tattle, a civic tech organization that builds citizen-centric tools and datasets to respond to inaccurate and harmful content. My broad research interests are at the intersection of technology, policy and global development. Prior to starting Tattle, I worked as a research fellow at the Center for Long-Term Cybersecurity at UC Berkeley, studying the deployment of behavioral credit-scoring algorithms toward financial inclusion goals in the global majority. I’ve also been fortunate to work on award-winning ICTD and data-driven development projects with stellar non-profits. My career working in low-resource environments has turned me into an ardent advocate for Open Source development and citizen science movements.
Protecting Sensitive Datasets
I recently gave a lightning talk at IndiaFOSS about Uli, a project to co-design solutions to online gendered abuse in Indian languages. As part of this project, we’re building and maintaining datasets that are useful for machine learning models that detect abuse. The talk highlighted the care that must go into choosing a license for sensitive data, and why open datasets in Open Source AI should be carefully considered.
With the Uli project, we created a dataset annotated by gender rights activists and researchers who speak Hindi, Tamil and Indian English. We then fine-tuned Twitter’s XLM-RoBERTa model on it to detect gendered abuse and deployed the result as a browser plugin. When activated, the Uli plugin redacts abusive tweets from a person’s feed. Another dataset we created is a list of slur words in the three languages that might be used to target people. Such a list is useful not only for the Uli plugin (these words are redacted from web pages when the plugin is installed) but also for any platform that needs to moderate conversations in these languages. At the time of the plugin’s launch, we chose to license both datasets under an Open Data License (ODL). The model is hosted on Hugging Face and the code is available on GitHub.
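As a rough illustration of that setup, the sketch below shows how a fine-tuned XLM-RoBERTa classifier hosted on Hugging Face could be queried to flag and redact an abusive post. The model ID, label names and threshold here are placeholders and assumptions for illustration, not Uli’s actual values; the real plugin code is available on GitHub.

```python
# Minimal sketch (not the Uli plugin's actual code): query a fine-tuned
# XLM-RoBERTa abuse classifier hosted on Hugging Face and redact flagged text.
from transformers import pipeline

# "example-org/uli-abuse-detection" is a hypothetical model ID, used only
# to illustrate the call; substitute the real checkpoint name.
classifier = pipeline(
    "text-classification",
    model="example-org/uli-abuse-detection",
)

def redact_if_abusive(text: str, threshold: float = 0.8) -> str:
    """Return a redaction placeholder when the classifier flags the text."""
    result = classifier(text, truncation=True)[0]
    # Label names depend on how the model was trained; these are assumed.
    is_abusive = result["label"].lower() in {"abusive", "label_1"}
    if is_abusive and result["score"] >= threshold:
        return "[redacted by Uli]"
    return text

print(redact_if_abusive("an example tweet"))
```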
As we have continued to maintain and grow Uli, we have reconsidered how we license the data. When thinking about how to license this data, several factors come into play. First, annotating a dataset on abuse is labor-intensive and mentally exhausting, and the expert annotators should be fairly compensated for their expertise. Second, when these datasets are used by platforms for abuse detection, it creates a potential loophole—if abusive users realize the list of flagged words is public, they can change their language to evade moderation.
These concerns have led us to think carefully about how to license the data. On one end of the spectrum, we could continue to make everything open, regardless of commercial use. On the other end, we could keep all the data closed. We’ve historically operated as an Open Source organization, and every decision we make about data access also shapes how we license our machine learning models. We are trying to find a happy medium that balances the numerous concerns: recognition of effort and the effectiveness of the data on one hand, and transparency, adaptability and extensibility on the other.
As we’ve thought about different strategies for data licensing, we haven’t been sure what that would mean for the license of the machine learning models. And that’s partly because we don’t have a clear definition for what “Open Source AI” really means.
It is for this reason that we’ve closely followed the Open Source Initiative’s process for converging on a definition of Open Source AI, grounded in the four freedoms: the freedom to use, study, modify and share. Over the past year, the OSI has been iterating on the definition, and it has reached a point where it proposes the following:
- Open weights: The model weights and parameters should be open.
- Open source code: The source code used to train the system should be open.
- Open data or transparent data: Either the dataset should be open, or there should be enough detailed information for someone to recreate the dataset.
It’s important to note that the dataset doesn’t necessarily have to be open. This departure from a maximally open stance on datasets accounts for the complexity of collecting and managing the data that drives real-world ML applications. While frontier models must contend with copyright and privacy concerns, many smaller projects like ours worry about the uneven power dynamics between those creating the data and the entities using it. In our specific case, opening the data also reduces its efficacy.
But having struggled with papers that describe research or data without sharing the dataset itself, I also recognize that ‘enough detailed information’ might not be enough information to repeat, adapt or extend another group’s work. In the end, the question becomes: how much information about the dataset is enough to consider the model “open”? It’s a fine line, and not everyone is comfortable with OSI’s stance on this issue. For our project in particular, we are considering a staggered data release: older data is released under an open data license, while the newest data requires users to request access.
If you have strong opinions on this process, I encourage you to visit the OSI website and leave feedback. The OSI process is influential, and your input on open weights, open code and the specifications around data openness could shape the future of Open Source AI.
You can learn more about the participatory process behind the Uli dataset here, and about Uli and Tattle on their respective websites.
How to get involved
The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:
- Join the forum: share your comments on the drafts.
- Leave a comment on the latest draft: provide precise feedback on the text of the latest draft.
- Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
- Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
- Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.