The Open Source AI Definition RC1 is available for comments
A little over a month after v.0.0.9, we have a Release Candidate version of the Open Source AI Definition. This milestone was reached with lots of community feedback: 5 town hall meetings, several comments on the forum and on the draft, and in-person conversations at events in Austria, China, India, Ghana, and Argentina.
There are three relevant changes to the part of the definition pertaining to the “preferred form to make modifications to a machine learning system.”
The feature that will draw the most attention is the new language of Data Information. It clarifies that all the training data needs to be shared and disclosed. The updated text comes from many conversations with individuals who engaged passionately with the design process: on the forum, in person, and on hackmd. These conversations helped identify four types of data: open, public, obtainable, and unshareable, each well described in the FAQ. The legal requirements are different for each type, but all must be shared in the form that the law allows.
Two new features are equally important. RC1 clarifies that Code must be complete: enough for downstream recipients to understand how the training was done. This reinforces the importance of training, for transparency, security, and other practical reasons. Training is where innovation is happening at the moment, which is why you don't see corporations releasing their training and data-processing code. We believe that, given the current state of knowledge and practice, this is required to meaningfully fork (study and modify) AI systems.
Lastly, there is new text that explicitly acknowledges that it is admissible to require copyleft-like terms for any of the Code, Data Information, and Parameters, individually or as bundled combinations. An illustrative scenario: a consortium that owns the rights to training code and a dataset decides to distribute them as a bundle, with legal terms that tie the two together through copyleft-like provisions. This sort of legal document doesn't exist yet, but the scenario is plausible enough to deserve consideration. This is another area OSI will monitor carefully as we start reviewing these legal terms with the community.
A note about science and reproducibility
The aim of Open Source is not, and has never been, to enable reproducible software. The same is true for Open Source AI: reproducibility of AI science is not the objective. Open Source's role is merely not to be an impediment to reproducibility. In other words, one can always add more requirements on top of Open Source, just as the Reproducible Builds effort does.
Open Source means giving anyone the ability to meaningfully "fork" (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and also for everyone. This is why OSD #2 requires that the "source code" must be provided in the preferred form for making modifications. This way everyone has the same rights and ability to improve the system as the original developers, starting a virtuous cycle of innovation. Forking has the same meaning in the machine learning context as it does for software: having the rights and the ability to build a system that behaves differently from its original state. A fork might fix security issues, improve behavior, or remove bias. All of this is possible thanks to the requirements of the Open Source AI Definition.
What’s coming next
With the release candidate cycle starting today, the drafting process shifts focus: no new features, only bug fixes. We'll watch for newly raised issues, looking out for major flaws that may require significant rewrites to the text. The main focus will be on the accompanying documentation, the Checklist and the FAQ. We also realized that, in our zeal to solve the problem of data that needs to be provided but cannot be supplied by the model owner for good reasons, we had failed to make clear the basic requirement that "if you can share the data, you must." We have already made adjustments in RC1 and will be seeking views on how to express this better in an RC2.
In the weeks leading up to the 1.0 release on October 28, we'll focus on:
- Getting more endorsers to the Definition
- Continuing to collect feedback on hackmd and the forum, focusing on new, previously unraised concerns
- Preparing the artifacts necessary for the launch at All Things Open
- Iterating on the Checklist and FAQ, preparing them for deployment