Open Source AI Definition – Weekly update September 2nd

Share your thoughts about draft v0.0.9

  • @mkai added concerns about how OSI will address AI-generated content from both open and closed source models, given current legal rulings that such content cannot be copyrighted. He also suggests clarifying the difference between licenses for AI model parameters and the model itself within the Open Source AI Definition.
  • @shujisado added that while media coverage of the OSAID v0.0.9 release is encouraging, he is not supportive of the idea of an enforcement mechanism to flag false open source AI. He believes this approach differs from OSI’s traditional stance and suggests it may be a misunderstanding.
  • @jplorre added that while LINAGORA supports the proposed definition, they propose clarifying the term “equivalent system” to mean systems that produce the same outputs given identical inputs. They also suggest removing the specific reference to “tokenizers” in the definition, as it may not apply to all AI systems.
    • @shujisado agreed with the need for clarification on “equivalent system” but noted that identical outputs cannot always be guaranteed in general LLMs. He suggests that this clarification might be better suited for the checklist rather than the OSAID itself

Draft v.0.0.9 of the Open Source AI Definition is available for comments

  • @adafruit reconnects with @webmink and proposes updates to the Open Source AI Definition, including adding requirements for prompt transparency and data access during AI training. These updates aim to enhance the ability to audit, replicate, and modify AI models by providing detailed logs, documentation, and public access to prompts used during the training phase.
    • @webmink appreciates the proposal but points out that it seems specific to a single approach, suggesting that it may need broader applicability.
  • @thesteve0 criticizes the current definition, arguing that it does not grant true freedom to modify AI models because the weights, which are essential for using the model, cannot be reproduced without access to both the original data and code. He suggests that models sharing only their weights, especially when built on proprietary data, should be labeled as “open weights” rather than “open source.” He also expresses concern about the misuse of the “open source” label by some AI models, citing specific examples where the term is being abused.

Open-washing and unspoken assumptions of OSS

  • @pranesh added that it might be helpful to explicitly state that the governance of open-source AI is out of scope for OSAID, but also notes that neither the OSD nor the free software definition explicitly mention governance, so it may not be necessary.
  • @kjetilk added that while governance issues have traditionally been unspoken, this unspoken nature is a key problem that needs addressing. He suggests that OSI should explicitly declare governance out of scope to allow others to take on this responsibility.
  • @mjbommar added support for making an official statement that OSI does not intend to control governance, noting concerns that some might fear OSI is moving towards a walled governance approach. He references past regrets about not controlling the “open source” trademark as a means to combat open-washing.
  • @nick added assurance that OSI has no intention of creating a walled governance garden, reaffirming the organization’s long-standing position against such control.
  • @shujisado added that there seems to be a consensus within the OSAID process that governance is out of scope, and notes that related statements have already been moved to the FAQ section in recent versions.

Explaining the concept of Data information

  • @pranesh mentions that, from a legal perspective, the percentage of infringement matters, citing the “de minimis” doctrine and defenses like “fair use” that consider the amount and purpose of infringement. He emphasizes that copyright laws in different jurisdictions vary, and not all recognize the same defenses as in the US.
  • @mjbommar argues that the scale and nature of AI outputs make the “de minimis” defense irrelevant, especially when AI models generate significant amounts of copyrighted content. He stresses that the economic impact of AI-generated content is a key factor in determining whether it qualifies as transformative or infringes copyright.
  • @shujisado highlights that in Japan, using copyrighted works for AI training is generally treated as an exception under copyright law, a stance that is also being adopted by neighboring East Asian countries. He suggests that approaches like the EU Directive are unlikely to become mainstream in Asia.
  • @mjbommar acknowledges the global focus on US/EU laws but points out that many commonly used models are developed by Western organizations. He questions how Japan’s updated copyright laws align with international treaties like WCT/DMCA, expressing concern that they may allow practices that conflict with these agreements.
    • @shujisado responds by stating that Japan’s copyright laws, including Article 30-4, were carefully crafted to comply with international standards, such as the Berne Convention and the WIPO Copyright Treaty, ensuring that they meet the required legal frameworks.

Welcome diverse approaches to training data within a unified Open Source AI Definition

  • @arandal emphasizes the importance of the Open Source Definition (OSD) as a unifying framework that accommodates diverse approaches within the open-source community. She argues that AI models, being a combination of source code and training data, should have their diversity in handling data explicitly recognized in the Open Source AI Definition. She proposes specific text changes to the draft to clarify that while some developers may be comfortable with proprietary data, others may not, and both approaches should be supported to ensure the long-term success of open-source AI.
  • @mjbommar appreciates the spirit of Arandal’s proposal but adds that the OSI currently lacks specific licenses for data, which is why it is crucial for the OSI to collaborate with Creative Commons. Creative Commons maintains the ecosystem of “data licenses” that would be necessary under the proposed revisions to the Open Source AI Definition.
  • @arandal agrees with the need for collaboration with organizations like Creative Commons, noting that this coordination is already reflected in checklist v. 0.0.9. She suggests that such collaboration is necessary even without the proposed revisions to ensure the definition accurately addresses data licensing in AI.
  • @nick acknowledges the importance of working with organizations like Creative Commons and mentions that OSI is in ongoing communication with several relevant organizations, including MLCommons, the Open Future Foundation, and the Data and Trust Alliance. He highlights the recent publication of the Data Provenance Standards by the Data and Trust Alliance as an example of the kind of collaborative work that is being pursued.
  • @mjbommar reiterates the need for explicit coordination with Creative Commons, arguing that the OSI cannot realistically finalize the Open Source AI Definition without such collaboration. He also suggests that the OSI should explore AI preference signaling and work with Creative Commons and SPDX/LF to establish shared standards, which should be part of the OSAID standard’s roadmap.

Join this week’s town hall to hear the latest developments, give your comments and ask questions.

Click Here to View Original Source (opensource.org)

Leave a Reply

Your email address will not be published. Required fields are marked *

Shared by: voicesofopensource

Tags: , ,