Compositional generalization is crucial for human intelligence, yet it remains unclear whether scaling data and model size can solve this challenge, particularly for vision models. We study how data scaling affects compositional generalization in a simplified visual setting. Our key finding is that while models can achieve compositional generalization, this ability depends critically on data diversity: models develop compositional structure in their latent space only when trained on sufficiently diverse data, and otherwise fail to learn compositional representations even when they succeed at discrimination. We further show that high data diversity induces linear concept representations, which, as we demonstrate, enable efficient compositional learning. Analyzing large-scale pretrained models through this framework reveals mixed results, suggesting that compositional generalization remains an open challenge.
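As a rough illustration of what "linear concept representations" can mean operationally (this is a minimal sketch, not the paper's actual diagnostic), the code below simulates latents that are additive in two hypothetical concepts (shape and color), estimates per-concept directions from all seen combinations, and checks whether the embedding of a held-out combination is predicted by summing those directions. The two-concept setup, the `embed` function, and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each image is described by two concepts
# (shape, color). We simulate latents that are linear in the
# concepts plus noise; a real analysis would use a trained
# model's embeddings instead.
n_shapes, n_colors, dim = 4, 4, 64
shape_dirs = rng.normal(size=(n_shapes, dim))
color_dirs = rng.normal(size=(n_colors, dim))

def embed(s, c, noise=0.05):
    """Toy stand-in for a model whose latents are additive in the concepts."""
    return shape_dirs[s] + color_dirs[c] + noise * rng.normal(size=dim)

# "Training" combinations: all pairs except one held-out combination.
held_out = (0, 0)
train = [(s, c) for s in range(n_shapes) for c in range(n_colors)
         if (s, c) != held_out]
Z = {sc: np.mean([embed(*sc) for _ in range(20)], axis=0) for sc in train}

# Estimate concept directions by marginal averaging over seen pairs.
mu = np.mean(list(Z.values()), axis=0)
d_shape = {s: np.mean([Z[(s, c)] for c in range(n_colors)
                       if (s, c) in Z], axis=0) - mu
           for s in range(n_shapes)}
d_color = {c: np.mean([Z[(s, c)] for s in range(n_shapes)
                       if (s, c) in Z], axis=0) - mu
           for c in range(n_colors)}

# Linear compositional prediction for the unseen combination.
pred = mu + d_shape[held_out[0]] + d_color[held_out[1]]
actual = np.mean([embed(*held_out) for _ in range(20)], axis=0)
cos = pred @ actual / (np.linalg.norm(pred) * np.linalg.norm(actual))
print(f"cosine(predicted, actual) = {cos:.3f}")  # near 1.0 => linear structure
```

Under this kind of test, a high cosine similarity between the predicted and actual embeddings of an unseen concept combination would indicate the linear, compositional latent structure described above; a low score would indicate that the representation does not compose linearly even if the model discriminates seen combinations well.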