In Tier 1, the VLA's image encoder was a black box. Now we open it: how CLIP embeddings, open-vocabulary detection, and spatial reasoning let the robot see what language describes.
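
As a preview of the first of those pieces, here is a minimal sketch of the core CLIP property: an image and candidate phrases are embedded into one shared space, so matching becomes a similarity score. It uses Hugging Face's `transformers`; the checkpoint name, image path, and phrases are illustrative assumptions, not choices made by this tier.

```python
# A minimal sketch: score candidate phrases against a scene image with CLIP.
# Assumes `transformers`, `torch`, and Pillow are installed; "scene.jpg" and
# the phrases below are placeholders for your own scene and vocabulary.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")
phrases = ["a red mug", "a blue sponge", "an empty table"]

# The processor tokenizes the text and resizes/normalizes the image;
# the model projects both into the same embedding space.
inputs = processor(text=phrases, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities: one score per phrase.
probs = outputs.logits_per_image.softmax(dim=-1)
for phrase, p in zip(phrases, probs[0]):
    print(f"{phrase}: {p:.3f}")
```

Because the phrase list is free-form text rather than a fixed label set, this same scoring trick is what open-vocabulary detectors build on, which is where we turn next.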