The point is not that tokenization is irrelevant; it's that the transformer architecture _requires_ information-dense inputs, which it gets by compressing the input space from raw characters to subwords. Give it something like raw audio or raw video frames, and its capabilities bottom out dramatically. That's why even today's SOTA transformer models heavily preprocess media inputs, even going as far as lightweight frame-importance sampling to extract the "best" parts of a video.
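To make the frame-importance idea concrete, here's a minimal sketch using one simple heuristic (my own toy example, not any particular model's pipeline): score each frame by how much it changed from the previous one, then keep only the top-k highest-scoring frames. Real systems use far more sophisticated scorers, but the shape of the trick is the same.

```python
import numpy as np

def sample_important_frames(frames: np.ndarray, k: int) -> np.ndarray:
    """Keep the k frames that differ most from their predecessor.

    frames: (num_frames, height, width, channels) uint8 array.
    A crude proxy for "importance": mean absolute pixel change.
    Returns the selected frames in their original temporal order.
    """
    # Per-frame score: how much did the picture change since the last frame?
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    scores = np.concatenate([[np.inf], diffs])  # always keep the first frame
    # Indices of the k highest-scoring frames, restored to temporal order.
    top = np.sort(np.argsort(scores)[-k:])
    return frames[top]

# Toy usage: 120 random "frames", keep the 8 most informative.
video = np.random.randint(0, 256, size=(120, 64, 64, 3), dtype=np.uint8)
kept = sample_important_frames(video, k=8)
print(kept.shape)  # (8, 64, 64, 3)
```

A 120-frame clip becomes 8 dense samples before the model ever sees a pixel: compression doing the work the architecture can't.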
In the future, all of these tricks may seem quaint. "Why don't you just pass the raw bits of the camera feed straight to the model layers?" we might ask.