Artificial intelligence will change the world as we know it, in most cases for the better. But there is a darker side, and not everything comes down to the much-discussed effects on the labor market or an apocalyptic end of the world, very cinematic but not quite real. This dark side of AI is here and now, and it is not just about the results but about how the technology is built, and about the enormous power of the new smart machines and of those who created them: no one knows what ethical criteria they operate by beyond the principle of building a great business. Much of this debate, which reveals a worrying dark side, has led to attempts to set legal limits, even at the risk of slowing the development of artificial intelligence, an issue on which Europe and the United States once again do not fully agree.

A recent news story illustrates this dark side. It all started shortly after GPT-4 was released in March of this year. Someone at the Washington Post (it is worth remembering that this classic newspaper of American journalism is owned by Amazon founder Jeff Bezos, who has also announced a generative AI project of his own to be launched shortly) decided to find out: where does the mass of information behind chatbots like OpenAI's come from? It seems wonderful that the chatbot appears to understand me when I ask it something, and can even answer me coherently, even when none of what it says is true. The tool is a great "stochastic parrot": it has no capacity for understanding; it simply uses the huge amounts of data (text, to be exact) that it processes during its complex training to calculate probabilities. The volume processed is so large that only the results give an idea of its size. So the question the newspaper wanted to analyze was not how the AI chatbots answered, but what they answered with and, above all, where the content of those answers came from. The short explanation is that they pulled this data from the internet: it is easy to access, and its digital format makes it easy to process further.
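To make the "stochastic parrot" idea concrete, here is a minimal sketch of my own (an illustration, not anything from the Post's investigation or from how GPT-4 is actually built): a toy bigram model that, like a chatbot at microscopic scale, does nothing but turn the text it was fed into next-word probabilities, with no understanding involved.

```python
from collections import Counter, defaultdict

# Toy "stochastic parrot": a bigram model with no understanding at all.
# It only counts which word followed which in its training text and
# turns those counts into next-word probabilities -- the same basic
# principle, at a vastly smaller scale, as a large language model
# predicting the next token.
training_text = "the cat sat on the mat and the cat slept on the sofa"

words = training_text.split()
follow_counts = defaultdict(Counter)
for current_word, next_word in zip(words, words[1:]):
    follow_counts[current_word][next_word] += 1

def next_word_probabilities(word):
    """Estimate P(next word | word) purely from observed frequencies."""
    counts = follow_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probabilities("the"))
# {'cat': 0.5, 'mat': 0.25, 'sofa': 0.25} -- frequency, not comprehension
```

The parrot can only repeat recombinations of what it ingested, which is exactly why the question of where that ingested text came from matters so much.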

The Washington Post found that when the big tech companies behind generative AI models were asked which specific internet sources their data came from, the answers were vague or nonexistent. OpenAI, for example, refused to disclose which resources it used to train its different ChatGPT models (it always has). When the journalists at the newspaper in the US capital, with the help of a specialized company, filtered databases of millions of domain names in use, something unexpected emerged: content from journalism, entertainment, software development, medicine, and content creation in general had been cannibalized to train artificial intelligence and "integrated" into the base of its own "knowledge". All of these websites carried clear warnings that their content was copyrighted, and they had never authorized this access. But now their information was part of the AI, and the AI could use it in its answers without citing any sources.

The Washington Post found up to 200 million references to copyrighted content in the material used to train AI models

In some cases the data theft was compounded when scraping bots could not access content openly: the Washington Post found 27 sites, identified by the US government as "markets" for pirated books, some of which were later shut down by the authorities. In training their artificial intelligence models, the technology companies are a great black hole, swallowing thousands of gigabytes of original data without permission. Up to 200 million pieces of copyrighted content appeared in the list of websites used. Into that black hole fell, of course, all the digital media, which right now is like saying "all" US media. In fact, of the 10 websites that "contribute the most" to AI training, five were media outlets. There were also millions of personal blogs, built on WordPress, Tumblr, Blogspot, or platforms such as sites.google.com.

A huge funnel consuming millions of pieces of data: this is how artificial intelligence itself pictures the training process. Image generated by DALL-E

Anyone who knows these facts will understand the reasonable panic that has gripped the companies and individuals who make a living producing original content. The big tech companies working on AI have spent heavily on the computing power and cloud space needed to develop their models, paying large sums to the major providers of those services, including Google and Amazon. They have raised billions in financing rounds because investors know these projects can generate millions of dollars in revenue. Yet not a dime has gone to the thousands of creators of information and content, or to the companies that support them. Not only are they absent from the future benefits of this technology; they also run the risk of seeing the value of their creations and work reduced to zero, because many of these AI tools do not acknowledge where they obtained the original content and avoid citing their sources.

Just 10 days ago, the News/Media Alliance, the powerful organization that brings together the US media, published a white paper (which you can download from this link) analyzing in detail this systematic piracy of its members' content "without permission or compensation". The organization stated its support for artificial intelligence, "but not at the expense of the editors and journalists who spend significant time and resources producing materials that inform and entertain our communities and keep our government officials and other decision-makers in check." Representing more than 2,000 North American media outlets, the Alliance has brought the matter before the United States Copyright Office, which has opened the door to correcting this unfair situation by soliciting comments from all parties involved.

Meta, the owner of Instagram, Facebook and WhatsApp and the creator of Llama 2, another generative AI model, has described this kind of copyright compensation as "impossible". OpenAI, Microsoft and Google said much the same. The latter's argument is striking: Alphabet, Google's parent company, stated that during training it performs "data collection" where authorized by applicable copyright laws. An unconvincing answer from companies that file hundreds of lawsuits every year over infringement of the patents they hold, legally shielding their algorithms and new developments from any attempt to analyze their impact on the digital market. The law of the funnel: if you touch my information I will sue you, but I can use yours however I want.

One final note, though: when these disputes arise between the big tech companies over their own private Game of Thrones, the copyright argument suddenly becomes valid. Elon Musk and Microsoft have been at odds ever since Bill Gates' company became the main bastion of OpenAI, replacing Musk as the leading investor in the firm led by Sam Altman. When Microsoft removed X, the former Twitter, from its advertising platform, Musk rebelled and threatened to sue Satya Nadella's company, accusing it of training its AI projects with data from the bird's old social network. Very telling.