Science & Tech
GPT-4o’s Chinese token-training data is polluted by spam and porn websites
The new tokenizer has 200,000 tokens in total, and about 25% of the tokens are in non-English languages,…
by chemistadmin, May 17, 2024 (2 minute read)