
OpenAI’s Licensing Deals for Content: The Modern Gold Rush in AI


The Evolution of Content


If we journey back over 40 years to the era of Bulletin Board Systems (BBSs), one motive stood out: content. Networks like FidoNet relayed message traffic between BBSs, giving local users content they could not otherwise reach without paying high long-distance charges. By 1995, the rise of the Internet had marked the end of the BBS world, but the mantra "Content is King" prevailed. Unlike the BBS era, where sharing was the norm, the Internet era let anyone reach any website from anywhere, and the competition for unique content became fierce.


Today, we're witnessing a similar gold rush fueled by Generative Pre-Trained Transformers (GPT) and Large Language Models (LLMs). In the early Internet days, websites that produced their own content had a significant advantage over those merely broadcasting others' content. As the global battle for the best generative AI heats up, OpenAI is making strategic moves that echo the lessons of the past. By securing licensing deals for high-quality content, OpenAI is positioning itself to lead in this new era of AI-driven content consumption.


Major Licensing Deals: A Strategic Move

OpenAI's proactive approach to securing licensing deals with major media companies underscores its commitment to feeding its AI models with high-quality, diverse content. Here’s a closer look at some of the significant partnerships in the past year:


Time Magazine

OpenAI has entered into a licensing agreement with Time Magazine, granting it access to 101 years of journalism. This partnership includes real-time content, enabling OpenAI to provide users with up-to-date information on breaking news while citing Time and linking to the source material on the publication's website.


Vox Media and The Atlantic

Agreements with Vox Media and The Atlantic allow OpenAI to use their content to train its AI models and to power real-time discovery products. The Atlantic's deal, in particular, includes provisions for the publisher to influence how news is surfaced and presented in OpenAI's products.


News Corp

A multi-year agreement with News Corp provides OpenAI with access to current and archived articles from a range of News Corp publications, including the Wall Street Journal, New York Post, The Daily Telegraph, Barron’s, MarketWatch, Investor’s Business Daily, FN, The Sunday Times, The Sun, and The Australian.


Dotdash Meredith, Financial Times, and Reddit

OpenAI has also secured content from Dotdash Meredith, the Financial Times, and Reddit, further expanding its repository of high-quality content for training its AI models.


Axel Springer

A notable partnership with Axel Springer, a major European digital publishing house, adds to the diverse content sources OpenAI can utilize.


The multitude of ways this content might be used boggles the mind. OpenAI has only begun to explore them, but it is betting big that the value is there.


Open-Sourced Training Datasets: A Foundation of Diversity


In addition to these licensing deals, large language models such as OpenAI's are typically trained on openly available datasets. These datasets provide a broad foundation of diverse content, improving a model's ability to generalize and provide accurate information across various contexts.


Some widely used open datasets include:

  • Common Crawl: Terabytes of raw web data extracted from billions of web pages.

  • RefinedWeb: A deduplicated and filtered version of the Common Crawl dataset.

  • The Pile: An 800 GB corpus designed to enhance a model’s generalization capability.

  • C4 (Colossal Clean Crawled Corpus): A 750 GB English corpus derived from the Common Crawl.

  • StarCoder Data: A programming-centric dataset with 783 GB of code in 86 programming languages.

  • BookCorpus: A dataset of about 11,000 books by unpublished authors, totaling 985 million words.

  • ROOTS: A 1.6TB multilingual dataset sourced in 59 languages.

  • Wikipedia Dataset: Cleaned text data derived from Wikipedia in all languages.

  • RedPajama: An open-source effort to replicate the LLaMA training dataset, comprising 1.2 trillion tokens from various sources.
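
To make "deduplicated and filtered" concrete, here is a minimal sketch of the idea in Python. It is an illustration only, not the pipeline used by RefinedWeb, C4, or OpenAI: the function names, the sample pages, and the five-word threshold are all assumptions, and production pipelines rely on far more elaborate fuzzy matching and quality heuristics.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedupe_and_filter(pages, min_words=5):
    """Drop very short pages and exact duplicates (after normalization).

    min_words is a hypothetical quality threshold chosen for illustration.
    """
    seen = set()
    kept = []
    for page in pages:
        norm = normalize(page)
        if len(norm.split()) < min_words:
            continue  # crude quality filter: too short to be useful
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already kept an equivalent page
        seen.add(digest)
        kept.append(page)
    return kept


# Made-up sample pages standing in for raw crawl output.
pages = [
    "Content is king on the early web.",
    "content  is KING on the early web.",  # duplicate after normalization
    "Click here",                          # too short, filtered out
    "Licensing deals give models access to archives.",
]
clean = dedupe_and_filter(pages)
print(len(clean), "of", len(pages), "pages kept")
```

Even this toy version shows why curated crawls are much smaller than the raw data they come from: boilerplate and repeated pages account for a large share of what a crawler collects.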


The Next Battleground: Data and Unsupervised Training


The battle for content is not just about what’s available now; it's about future-proofing AI models with the best possible datasets for unsupervised training. This will allow AI to codify, properly reference, and expand on original content, pushing the boundaries of what these models can achieve.


The real question is whether OpenAI will do the right thing. Will it create a system that mirrors the music industry's approach to royalties, fairly compensating end users for the original content they produce? Such a system could revolutionize how data and content are valued and monetized.
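
As a rough illustration of the music-industry royalty comparison above, the sketch below splits a payment pool among content sources in proportion to how often each was used. Everything here is hypothetical: the function name, the contributor names, the usage counts, and the pool size are invented for the example, and nothing suggests OpenAI has such a scheme.

```python
from typing import Dict


def split_royalties(pool_cents: int, usage_counts: Dict[str, int]) -> Dict[str, int]:
    """Divide a payment pool pro rata by usage, in whole cents.

    Integer division rounds each share down, so a few cents may remain
    unallocated; a real scheme would need a rule for the remainder.
    """
    total = sum(usage_counts.values())
    if total == 0:
        return {name: 0 for name in usage_counts}
    return {
        name: pool_cents * count // total
        for name, count in usage_counts.items()
    }


# Hypothetical usage: how many answers drew on each contributor's content.
usage = {"publisher_a": 600, "publisher_b": 300, "individual_user": 100}
payouts = split_royalties(100_000, usage)  # a $1,000.00 pool, in cents
for name, cents in payouts.items():
    print(f"{name}: ${cents / 100:.2f}")
```

The hard part of any such scheme is not the arithmetic but the attribution: deciding which source actually contributed to a given answer.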


The Potential of Consumer Data: A New Frontier


Imagine a scenario where Apple enables its Apple Watch users to sell their data to AI models. Researchers could then collect real-time health information for studies, and consumers could profit from their data. This concept isn't far-fetched; the value of such data is immense. Marketers, for example, would pay a premium for insights into consumers' vitals, location, and routines in any demographic. Imagine asking such an AI about the exercise routines of North Florida residents: which gyms they frequent, what their economic status is, and what times of day they work out. That is knowing exactly where your customers are, what they are doing, and when. We haven't even begun to explore the ramifications of this kind of knowledge.


Conclusion: The Future of AI and Content


OpenAI's strategic licensing deals and the use of extensive open-sourced datasets underscore its commitment to developing advanced AI models. These partnerships ensure that OpenAI's models are trained on high-quality, diverse, and up-to-date content, enhancing their ability to provide accurate and reliable information to users.


While OpenAI is focused on commercial content for now, the real battleground will soon shift to the end consumer. Companies like Apple are uniquely positioned to negotiate the value of consumer data on behalf of their customers. The coming months promise to be exciting as we edge closer to a future where AI can provide real-time, comprehensive answers to almost any query.


OpenAI is pioneering a model to buy and resell data that could quickly outpace competitors. The landscape of AI and content is evolving rapidly, and OpenAI's moves today are setting the stage for the next generation of AI-driven insights and innovations.
