Why datasets built on public domain might not be enough for AI
There is tension between copyright law and the large datasets needed to train large language models. Common Corpus is a dataset that uses only text from copyright-expired sources to avoid those legal issues. It’s a useful achievement, paving the way for research without immediate risk of lawsuits. I also fear
Shared by voicesofopensource May 7, 2024