Tag: GPT4

  • OpenAI’s latest blunder shows the challenges facing Chinese AI models

    In fact, among the few long Chinese tokens in GPT-4o that aren’t pornography or gambling nonsense, two are “socialism with Chinese characteristics” and “People’s Republic of China.” The presence of these phrases suggests that a significant portion of the Chinese training data actually comes from state-media writing, where formal, lengthy expressions are extremely common.
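
    The tokenizer GPT-4o uses ships in OpenAI’s open-source tiktoken library as “o200k_base,” so this is easy to check. A minimal sketch; the two phrases below are standard written forms and an assumption, since the article doesn’t quote the tokens’ exact surface strings:

    ```python
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")  # the tokenizer introduced with GPT-4o

    # Standard written forms of the two phrases (assumed; the article does not
    # quote the exact token strings).
    for phrase in ["中国特色社会主义", "中华人民共和国"]:
        ids = enc.encode(phrase)
        print(f"{phrase} -> {len(ids)} token(s): {ids}")
    ```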

    OpenAI has historically been very tight-lipped about the data it uses to train its models, and it probably will never tell us how much of its Chinese training database is state media and how much is spam. (OpenAI didn’t respond to MIT Technology Review’s detailed questions sent on Friday.)

    But it is not the only company struggling with this problem. People inside China who work in its AI industry agree there’s a lack of quality Chinese text data sets for training LLMs. One reason is that the Chinese internet used to be, and largely remains, divided up by big companies like Tencent and ByteDance. They own most of the social platforms and aren’t going to share their data with competitors or third parties to train LLMs. 

    In fact, this is also why search engines, including Google, perform poorly when it comes to searching in Chinese. Since WeChat content can only be searched on WeChat, and content on Douyin (the Chinese TikTok) can only be searched on Douyin, that data is inaccessible to a third-party search engine, let alone an LLM. But these platforms are where actual human conversations happen, rather than on spam websites that keep trying to draw you into online gambling.

    The lack of quality training data is a much bigger problem than the failure to filter out the porn and general nonsense in GPT-4o’s token-training data. If there isn’t an existing data set, AI companies have to put in significant work to identify, source, and curate their own data sets and filter out inappropriate or biased content. 
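
    A deliberately minimal sketch of that filtering step, for illustration only: production pipelines rely on trained quality and safety classifiers rather than keyword lists, and the blocklist terms below are hypothetical placeholders.

    ```python
    # Hypothetical blocklist: "online gambling" and "Macau casino" in Chinese.
    BLOCKLIST = {"在线博彩", "澳门赌场"}

    def keep(document: str) -> bool:
        """Drop any document containing a blocklisted spam term."""
        return not any(term in document for term in BLOCKLIST)

    corpus = [
        "天气预报：明天多云转晴。",  # "Weather forecast: cloudy turning sunny tomorrow."
        "点击进入在线博彩平台！",    # "Click to enter the online gambling platform!"
    ]
    clean = [doc for doc in corpus if keep(doc)]
    print(clean)  # only the weather document survives
    ```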

    It doesn’t seem that OpenAI did that kind of curation for Chinese, which in fairness makes some sense, given that people in China can’t use its AI models anyway.

    Still, many people living outside China want to use AI services in Chinese. And they deserve a product that works properly, just as speakers of any other language do.

    How can we solve the problem of the lack of good Chinese LLM training data? Tell me your idea at [email protected].

  • GPT-4o’s Chinese token-training data is polluted by spam and porn websites

    The new tokenizer has 200,000 tokens in total, and about 25% of the tokens are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the number of tokens in different languages, and the top languages, besides English, are Russian, Arabic, and Vietnamese.
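
    A rough sketch of the kind of count Das describes, walking tiktoken’s public o200k_base vocabulary with a crude Unicode-range heuristic standing in for real language filters (it can’t tell Russian from other Cyrillic-script languages, so the buckets are only approximations):

    ```python
    import tiktoken
    from collections import Counter

    enc = tiktoken.get_encoding("o200k_base")

    def script_of(text: str) -> str:
        """Crude script detection by Unicode block; a stand-in for real language filters."""
        for ch in text:
            cp = ord(ch)
            if 0x0400 <= cp <= 0x04FF:
                return "Cyrillic"
            if 0x0600 <= cp <= 0x06FF:
                return "Arabic"
            if 0x4E00 <= cp <= 0x9FFF:
                return "CJK"
        return "Latin/other"

    buckets = Counter()
    for i in range(enc.n_vocab):
        try:
            token = enc.decode_single_token_bytes(i).decode("utf-8")
        except (KeyError, UnicodeDecodeError):
            continue  # skip unused ids and byte fragments that aren't valid UTF-8
        buckets[script_of(token)] += 1

    print(buckets.most_common())
    ```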

    “So the tokenizer’s main impact, in my opinion, is you get the cost down in these languages, not that the quality in these languages goes dramatically up,” Das says. When an LLM has better, longer tokens in non-English languages, it can represent prompts with fewer tokens, process them faster, and charge users less for the same answer. With the new tokenizer, “you’re looking at almost four times cost reduction,” he says.
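
    An illustrative back-of-the-envelope check of that effect, comparing GPT-4’s older “cl100k_base” tokenizer with the new one on a made-up Chinese sentence (requires a recent tiktoken release that includes o200k_base):

    ```python
    import tiktoken

    old = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4
    new = tiktoken.get_encoding("o200k_base")   # tokenizer used by GPT-4o

    text = "今天天气很好，我们一起去公园散步吧。"  # arbitrary sample sentence
    n_old, n_new = len(old.encode(text)), len(new.encode(text))
    # API pricing is per token, so fewer tokens for the same text means a cheaper request.
    print(f"cl100k_base: {n_old} tokens, o200k_base: {n_new} tokens, "
          f"{n_old / n_new:.1f}x fewer")
    ```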

    Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. The tokens reflect the discussions happening in those languages, so they include words like “Narendra” or “Pakistan.” But beyond those, the list looks much like one of common long English words, such as “Prime Minister,” “university,” and “international.” The tokens also don’t exhibit the issues seen in the Chinese tokens.

    That likely reflects the training data in those languages, Das says. “My working theory is the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There are not many spam bots and porn websites trying to happen in these languages. It’s mostly going to be in English.”

    Polluted data and a lack of cleaning

    However, things are drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scamming contexts. Even shorter tokens, like three-character-long Chinese words, are heavily concentrated in the same topics.
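
    That audit is straightforward to reproduce in outline: pull every token containing CJK characters out of the public vocabulary and sort by length. A sketch, with the output left to the reader given what it reportedly contains:

    ```python
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    cjk_tokens = []

    for i in range(enc.n_vocab):
        try:
            token = enc.decode_single_token_bytes(i).decode("utf-8")
        except (KeyError, UnicodeDecodeError):
            continue
        if any(0x4E00 <= ord(ch) <= 0x9FFF for ch in token):  # CJK Unified Ideographs
            cjk_tokens.append(token)

    # The longest Chinese tokens are the ones researchers flagged as spam terms.
    for token in sorted(cjk_tokens, key=len, reverse=True)[:20]:
        print(repr(token))
    ```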

    “The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai from Princeton University. Crawling spam and including it in training data is not rare, but usually significant effort is taken to clean up the data before it’s used. “It’s possible that they didn’t do proper data cleaning when it comes to Chinese,” he says.

    The content of these Chinese tokens could suggest that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages. 

    These messages are often advertisements for pornographic videos and gambling websites. They could be real businesses or merely scams. And the text is inserted into content-farm websites, or sometimes legitimate ones, so it can be indexed by search engines, circumvent spam filters, and surface in random searches. For example, Google indexed one search results page on a US National Institutes of Health website that lists a porn site in Chinese. The same site name also appeared in at least five Chinese tokens in GPT-4o.
