Written by Tristan Greene, former writer. Reviewed by Alex Cohen, former editor.

Meta’s new Megabyte system solves one of the biggest roadblocks for GPTs

Latest News. Published May 25, 2023

Researchers at Meta AI may have developed a way to get around the “tokenization” problem with GPT models.

Meta AI recently published pre-print research showing off a radical new “Megabyte” framework for building generative pre-trained transformer (GPT) systems.

Dubbed “promising” by OpenAI’s Andrej Karpathy, former director of artificial intelligence at Tesla, the new architecture is designed to process large volumes of data — such as images, novels and video files — without the use of a process known as tokenization.

Promising. Everyone should hope that we can throw away tokenization in LLMs. Doing so naively creates (byte-level) sequences that are too long, so the devil is in the details.

Tokenization means that LLMs are not actually fully end-to-end. There is a whole separate stage with… https://t.co/t240ZPxPm7

Andrej Karpathy (@karpathy), May 15, 2023

Tokenization is a preprocessing step that’s comparable to file compression. To process large amounts of data, GPT models convert bytes to tokens. The tokens are then processed by the transformer and used to generate output tokens, which are then decoded back into text.

The tokenization process allows an AI system to process larger strings of data as numbers. The sentence “My favorite color is red.” if processed by OpenAI’s tokenizer, for example, would be converted to the six-token string “3666, 4004, 3124, 318, 2266, 13” for processing, with the final token representing the period.

OpenAI demonstration of tokenization process. Source: OpenAI
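
To make the contrast between tokens and raw bytes concrete, here is a minimal Python sketch using only the standard library. The token IDs are simply copied from the example quoted in this article rather than computed, and the variable names are illustrative assumptions, not part of any OpenAI or Meta API.

```python
text = "My favorite color is red."

# Byte-level view: one integer (0-255) per UTF-8 byte of the input.
byte_ids = list(text.encode("utf-8"))

# Token-level view: IDs copied from the article's example, not computed here.
token_ids_from_article = [3666, 4004, 3124, 318, 2266, 13]

print("byte positions: ", len(byte_ids))
print("token positions:", len(token_ids_from_article))

# The byte sequence is several times longer than the token sequence,
# which is why naive byte-level modeling produces very long inputs.
print("bytes per token:", round(len(byte_ids) / len(token_ids_from_article), 2))

# Bytes decode straight back to the original text, with no vocabulary needed.
print(bytes(byte_ids).decode("utf-8") == text)
```

The same sentence occupies about four times as many positions when handled byte by byte, which is the length problem Karpathy’s comment refers to.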

Unfortunately, even with tokenization, the amount of data current state-of-the-art systems can process still has a hard limit. For GPT-3.5, the limit is slightly over 4,000 tokens or about 3,000 words, whereas GPT-4 maxes out at around 32,000 tokens or about 24,000 words.

Meta’s new Megabyte system ditches tokenization in favor of a novel multi-layer prediction architecture capable of end-to-end modeling over 1 million bytes of data.

Most standard English-language encoding systems use 8-bit encoding, in which each character takes up one byte of data. An AI system capable of processing 1 million bytes without tokenization could therefore work with about 1 million characters of text. At roughly five to six bytes per English word, including the space, that is on the order of 170,000 to 200,000 words, or about seven to eight times GPT-4’s roughly 24,000 words.

For comparison, GPT-4 can currently handle about 10 feature-length news articles in a single prompt, whereas Megabyte would be able to parse roughly two average-length novels, though not the entirety of Leo Tolstoy’s War and Peace, which runs to well over half a million words.
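
One way to sanity-check a capacity estimate like this is a few lines of arithmetic. The sketch below is a rough estimate, not a figure from Meta’s paper: the bytes-per-word value is an assumption (about five letters plus a space for typical English text), and the function name is hypothetical.

```python
def estimated_words(n_bytes: int, bytes_per_word: float = 6.0) -> float:
    """Estimate the English word count of text stored at one byte per character.

    bytes_per_word is an assumed average that includes the trailing space.
    """
    return n_bytes / bytes_per_word


MEGABYTE_CONTEXT_BYTES = 1_000_000  # context size reported in the article
GPT4_WORDS = 24_000                 # approximate figure quoted in the article

for bytes_per_word in (5.0, 6.0):
    words = estimated_words(MEGABYTE_CONTEXT_BYTES, bytes_per_word)
    ratio = words / GPT4_WORDS
    print(f"{bytes_per_word} bytes/word: ~{round(words):,} words, "
          f"about {ratio:.1f}x the GPT-4 figure")
```

Changing the assumed bytes per word shifts the result, but any plausible value for English keeps the estimate well under 1 million words.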

Meta’s Megabyte model also performed well on ImageNet tests and benchmarks related to processing audio files, either equaling or surpassing existing byte-based transformer models such as DeepMind’s Perceiver AR on both:

“Megabyte matches the state-of-the-art performance of PerceiverAR whilst using only half the compute.”

The implications of this research could be far-reaching. Tokenization is considered a roadblock in the field due to its hard data limits and the amount of energy and time required to train systems.

Without tokenization, it should be possible to train AI models with stronger foundational support for non-English languages, especially those that can’t be easily encoded in standard 8-bit characters.
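
As a small illustration of why byte-level handling matters for such languages, the Python snippet below counts UTF-8 bytes per character for a few sample words. The sample words and the helper name are arbitrary choices for illustration, and UTF-8 is assumed as the encoding.

```python
def utf8_footprint(word):
    """Return (character count, UTF-8 byte count) for a string."""
    return len(word), len(word.encode("utf-8"))


# Arbitrary sample words: Latin, Cyrillic and Japanese scripts.
samples = {
    "English": "red",
    "Russian": "мир",
    "Japanese": "赤",
}

for language, word in samples.items():
    n_chars, n_bytes = utf8_footprint(word)
    print(f"{language}: {n_chars} character(s), {n_bytes} byte(s)")
```

Characters outside the basic Latin range take two or more bytes each in UTF-8, so a model that reads raw bytes sees every script through the same 256-value alphabet, at the cost of longer sequences.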

This could lead to the further democratization of these technologies and enable everything from cryptocurrency trading bots to decentralized autonomous organization technologies to be built in native language codes around the world.

It would also increase the capacity of models like ChatGPT to work with image, video and audio files, generating multimedia clips with around the same time and energy consumption as text.
