# Pixel Tokens
---
A *pixel token* is the fundamental unit that lets Transformer-based models process visual information the same way they process language. Instead of feeding a model hundreds of thousands of raw RGB values, an image is first partitioned into small patches, typically 16×16 pixels, which are flattened and projected into dense vector embeddings. These embeddings become the "tokens" of the image, analogous to subword tokens in text.

Recent work like _From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities_ pushes this idea further by applying byte-pair encoding (BPE) directly to quantized visual data, creating a learned vocabulary of recurring visual patterns rather than a fixed grid of patches. This injects structural priors into each token: early in the network a token might represent an edge or a texture, while deeper layers compose tokens into higher-level concepts like a traffic light or a human face.

The advantage is twofold. First, tokenization dramatically reduces sequence length, from 150,000+ raw values (e.g. a 224×224 RGB image) to a few hundred tokens, making self-attention computationally feasible. Second, it aligns the visual representation with the format of language tokens, so a single Transformer can reason across modalities without separate encoders. In practice, this tokenization strategy has been shown to improve multimodal understanding and data efficiency, helping models like Being-VL-0 achieve stronger performance even with limited training data.
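The grid-patch step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from any particular paper: the `patchify` helper and the 512-dimensional projection size are illustrative choices, and the projection matrix would be a learned parameter in a real model rather than random.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    one row per patch -- the raw form of a pixel-token sequence.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Carve the image into a (grid_h, patch, grid_w, patch, C) block layout,
    # then group the two grid axes together and flatten each patch.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3)).astype(np.float32)  # 150,528 raw values
tokens = patchify(image)                              # 196 tokens of dim 768

# Project each flattened patch to a dense embedding (learned in practice,
# random here purely for illustration).
W = rng.normal(size=(16 * 16 * 3, 512)).astype(np.float32)
embeddings = tokens @ W

print(image.size, tokens.shape, embeddings.shape)
```

Note how the sequence shrinks: 150,528 scalar values become 196 token embeddings, which is what makes quadratic self-attention over the image tractable.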