TechzLab – Tech News, Gadgets, Mobile & IT Updates

Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft

By admin, December 12, 2024. 3 min read.

In addition to the trove of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from different newspapers now in the public domain, and it says it’s open to forming similar collaborations down the line. The exact way the books dataset will be released is not settled. The Institutional Data Initiative has asked Google to work together on public distribution, and the company has pledged its support.

However the IDI’s dataset is released, it will be joining a host of similar projects, startups, and initiatives that promise to give companies access to substantial and high-quality AI training materials without the risk of running into copyright issues. Firms like Calliope Networks and ProRata have emerged to issue licenses and devise compensation schemes that get creators and rightsholders paid for providing AI training data.

There are also other new public-domain projects. Last spring, the French AI startup Pleias rolled out its own public-domain dataset, Common Corpus, which contains an estimated 3 to 4 million books and periodical collections, according to project coordinator Pierre-Carl Langlais. Backed by the French Ministry of Culture, the Common Corpus has been downloaded more than 60,000 times this month alone on the open source AI platform Hugging Face. Last week, Pleias announced that it is releasing its first set of large language models trained on this dataset, which Langlais told WIRED constitute the first models “ever trained exclusively on open data and compliant with the [EU] AI Act.”

Efforts are underway to create similar image datasets as well. AI startup Spawning released its own this summer called Source.Plus, which contains public-domain images from Wikimedia Commons as well as a variety of museums and archives. Several significant cultural institutions, like the Metropolitan Museum of Art, have long made their own archives accessible to the public as standalone projects.

Ed Newton-Rex, a former executive at Stability AI who now runs a nonprofit that certifies ethically trained AI tools, says the rise of these datasets shows that there’s no need to steal copyrighted materials to build high-performing, high-quality AI models. OpenAI previously told lawmakers in the United Kingdom that it would be “impossible” to create products like ChatGPT without using copyrighted works. “Large public domain datasets like these further demolish the ‘necessity defense’ some AI companies use to justify scraping copyrighted work to train their models,” Newton-Rex says.

But he still has reservations about whether the IDI and projects like it will actually change the training status quo. “These datasets will only have a positive impact if they’re used, probably in conjunction with licensing other data, to replace scraped copyrighted work. If they’re just added to the mix, one part of a dataset that also includes the unlicensed life’s work of the world’s creators, they’ll overwhelmingly benefit AI companies,” he says.
