Table of Contents
Legal and Ethical Quagmire
Accountability and Profit Sharing

Chatbots are digesting the internet, and the internet wants to reap the rewards

May 16, 2023, 04:31 PM
AI language model

Artificial intelligence companies are exploiting the content created by countless people on the Internet without their consent or compensation. Now, a growing number of tech and media companies are demanding payment in hopes of getting a piece of the chatbot craze.

If you’ve ever blogged, posted on Reddit, or shared anything on the open web, chances are you’ve contributed to the birth of the latest generation of artificial intelligence.

Google’s Bard, OpenAI’s ChatGPT, Microsoft’s new version of Bing, and similar tools from other startups are all built on artificial intelligence language models. But none of these clever machine writers would be possible without the vast amounts of text freely available on the internet.

Web content has once again become a battleground, in a way not seen since the early days of the search engine wars. Tech giants are scrambling to claim this irreplaceable source of information, now rich with new value, as their own territory.

Tech and media companies that once gave the matter little thought are realizing that this data is critical to building a new generation of language-based artificial intelligence. Reddit, one of OpenAI's most valuable training resources, recently announced that it would charge artificial intelligence companies for access to its data. OpenAI declined to comment.

Twitter recently began charging for data access as well, a change that affects many parts of its business, including the use of its data by artificial intelligence companies. The News Media Alliance, which represents publishers, argued in a paper this month that companies should pay licensing fees when they use its members' work to train artificial intelligence.

“What’s really important to us is ownership of information,” said Prashanth Chandrasekar, CEO of Stack Overflow, a Q&A site for programmers. The company plans to start charging large artificial intelligence companies for access to the user-generated content on its site. “The Stack Overflow community has spent so much effort answering questions over the past 15 years, and we really want to make sure that effort is rewarded.”

Earlier artificial intelligence services that learned to generate images, such as OpenAI’s Dall-E 2, have already been accused of large-scale theft of intellectual property, and the companies that built them are now fighting lawsuits over those allegations. The battle over AI-generated text may be even bigger, involving not only compensation and credit but also privacy.

But Emily M. Bender, a computational linguist at the University of Washington, notes that under current law, an artificial intelligence cannot be held responsible for its actions.

The dispute stems from the way artificial intelligence chatbots are built. The core algorithms behind these bots, known as large language models, learn to imitate the content and manner of human speech by absorbing and processing enormous amounts of existing text. This kind of data is different from the behavioral and personal information that services such as Facebook parent Meta Platforms use to target the ads we are used to seeing on the internet.

That data is created by people using all kinds of services, such as the hundreds of millions of posts written by Reddit users. Only on the internet can you find a large enough library of human-generated words, and without it, none of today’s chat-based AI and related technologies would exist.

Jesse Dodge, a research scientist at the non-profit Allen Institute for Artificial Intelligence, found in a 2021 paper that Wikipedia and countless copyrighted news articles from media organizations large and small are present in the most commonly used web-crawl datasets. Both Google and Facebook have used this data to train large language models, and OpenAI uses a similar database.

OpenAI no longer discloses its data sources, but according to a 2020 paper published by the company, its large language model uses posts scraped from Reddit to filter and improve the data used to train its artificial intelligence.

Reddit spokesman Tim Rathschmidt said the company is not sure how much revenue it will generate by charging for access to its data, but believes the data it holds could help improve today's state-of-the-art large language models.

Publishing executives have reportedly been investigating how extensively their content has been used to train ChatGPT and other artificial intelligence tools, how they should be compensated, and what legal options they have to defend their rights. But Danielle Coffey, the News Media Alliance's general counsel, said that so far no agreement has been reached with any of the owners of the large AI chat engines, such as Google, OpenAI, and Microsoft, to pay for the training data scraped from the Alliance's members.

Twitter did not respond to a request for comment. Microsoft declined to comment. A Google spokesperson said: “We have a long history of helping creators and publishers monetize their content and strengthen relationships with their audiences. In line with our AI principles, we will continue to innovate in a responsible and ethical way.” The spokesperson added that “it is still early days” and that Google is soliciting input on how to build artificial intelligence that benefits the open web.

Legal and Ethical Quagmire

In some cases, copying data available on the open web, also known as scraping, is legal, though companies are still debating the details of how, where, and when they are allowed to do so.

Most companies and organizations put their data online because they want it to be discovered and indexed by search engines so that people can find their content. Copying that data to train an artificial intelligence that replaces the need to visit the original source, however, is something else entirely.

Bender, the computational linguist, said technology companies that collect information from the internet to train artificial intelligence operate on the principle of “we can take it, therefore it is ours.” Converting text, including books, magazine articles, essays on personal blogs, patents, scientific papers, and Wikipedia content, into chatbot answers strips away the links to the source material. It also makes it harder for users to verify what a bot is telling them, which is a big problem for systems that often lie.

These large-scale scrapes also sweep up our personal information. Common Crawl is a non-profit organization that has been crawling vast amounts of content on the open web for more than a decade and making its database freely available to researchers. Common Crawl's database has also served as a starting point for companies looking to train artificial intelligence, including Google, Meta, OpenAI, and others.

Sebastian Nagel, a data scientist and engineer at Common Crawl, says a blog post you wrote years ago and have since deleted may still be present in the training data used by OpenAI, which draws on web content collected years ago to train its artificial intelligence.
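
The reach of these crawls is easy to check for yourself. As a rough illustration (not something described in the article), the short Python sketch below queries Common Crawl's public CDX index to see whether pages from a site were captured in one snapshot; the crawl ID, the example domain, and the use of the requests library are assumptions of this sketch, not details from the article.

    # Illustrative sketch: ask Common Crawl's public CDX index which pages from a
    # site appear in one crawl. The crawl ID and domain below are placeholders;
    # available crawls are listed at https://index.commoncrawl.org/collinfo.json
    import json
    import requests

    CRAWL = "CC-MAIN-2023-14"   # one Common Crawl snapshot from spring 2023
    SITE = "example.com/*"      # replace with a domain you care about

    resp = requests.get(
        f"https://index.commoncrawl.org/{CRAWL}-index",
        params={"url": SITE, "output": "json"},
        timeout=30,
    )
    if resp.status_code == 404:
        # The index returns 404 when a site has no captures in this snapshot.
        print("No captures of", SITE, "in", CRAWL)
    else:
        resp.raise_for_status()
        # Each line describes one captured page: when it was fetched and from which URL.
        for line in resp.text.strip().splitlines()[:10]:
            record = json.loads(line)
            print(record["timestamp"], record["url"])

Pages that show up in such an index, including ones long since deleted from the live web, may also have found their way into training sets built from these crawls.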

Unlike the search indexes run by Google and Microsoft, removing personal information from a trained AI requires retraining the entire model, Bender said. And because retraining a large language model is so expensive, Dodge added, a company is unlikely to do it even if users can prove that their personal data was used in training. Owing to the enormous computing power required, such models can cost tens of millions of dollars to train.

But Dodge added that in most cases it would also be difficult to get an AI trained on a data set that includes personal information to regurgitate that information. OpenAI said it has adjusted its chat-based system to reject requests for personal information. The European Union and U.S. governments are considering new laws and regulations to govern this type of artificial intelligence.

Accountability and Profit Sharing

Some proponents of AI believe these systems should have access to all the data their engineers can get, because that is how humans learn. Logically, why shouldn't a machine learn the same way?

Bender said that, setting aside the fact that artificial intelligence is not the same as a human, the problem with this argument is that under current law an AI cannot be held responsible for its own actions. People who plagiarize the work of others, or who try to pass off misinformation as truth, can face serious consequences, but a machine and its creators do not bear the same responsibility.

Of course, that may not always be the case. Just as Getty, a copyright owner, has sued image-generating AI companies for using its intellectual property as training data, businesses and other organizations whose content is used without authorization are likely to end up taking the makers of chat-based AI to court unless licensing agreements are reached.

Did the personal essays written by countless people, the posts on obscure forums and long-gone social networks, and all manner of other material really help make today's chatbots such capable writers? Perhaps the only reward the creators of that content can expect is the knowledge that they contributed something to how the chatbots use language.
