Humans and other animals navigate varied landscapes and environments with ease, a feat that requires the brain to rapidly and accurately adapt to different visual domains, generalizing across contexts and backgrounds. Despite recent progress in applying deep learning to classification and detection in the presence of multiple confounds, including contextual ones, important challenges remain regarding how networks can perform context-dependent computations and how contextually invariant visual concepts are formed. For instance, recent studies have shown that artificial networks repeatedly misclassify familiar objects placed on new backgrounds, e.g., mislabeling known animals that appear in an unfamiliar setting. Here, we show how a bio-inspired network motif can explicitly address this issue. We do so using a novel dataset that can serve as a benchmark for future studies probing invariance to backgrounds. The dataset consists of MNIST digits of varying transparency, set on one of two backgrounds with different statistics: Gaussian noise or more naturalistic images drawn from the CIFAR-10 dataset. We use this dataset to train networks on digit classification when the contexts are presented sequentially, and find that both shallow and deep networks suffer a sharp drop in performance on the first background after learning the second, the catastrophic forgetting phenomenon well known in continual learning. To overcome this, we propose an architecture with additional ``switching'' units that are activated in the presence of a new background. We find that the switching network can learn the new context even with very few switching units while maintaining its performance in the previous context, but only when the switching units are recurrently connected to the network layers. When the task is difficult due to high transparency, the switching network trained on both contexts outperforms networks without switching trained on only one context. The switching mechanism leads to sparser activation patterns, and we provide intuition for why this helps to solve the task. We compare our architecture with other prominent learning methods and find that elastic weight consolidation is not successful in our setting, while progressive networks are more complex yet less effective. Our study therefore shows how a bio-inspired architectural motif can contribute to task generalization across contexts.
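As a concrete illustration of how such a benchmark can be constructed, the sketch below composites a digit onto the two background types via alpha blending. The blending rule, the noise parameters, and the CIFAR-10 preprocessing are our assumptions for illustration, not specifics taken from the paper.

```python
import numpy as np

def composite(digit, background, alpha):
    """Alpha-blend a grayscale digit onto a background patch.

    digit, background: float arrays in [0, 1], shape (28, 28).
    alpha: digit opacity in (0, 1]; lower alpha (higher transparency)
           makes the classification task harder.
    """
    return alpha * digit + (1.0 - alpha) * background

rng = np.random.default_rng(0)
digit = rng.random((28, 28))      # stand-in for a normalized MNIST digit

# Context A: Gaussian-noise background, clipped to the valid pixel range
bg_noise = np.clip(rng.normal(0.5, 0.25, size=(28, 28)), 0.0, 1.0)

# Context B: stand-in for a naturalistic patch, e.g. a CIFAR-10 image
# converted to grayscale and resized to 28x28
bg_natural = rng.random((28, 28))

x_context_a = composite(digit, bg_noise, alpha=0.4)
x_context_b = composite(digit, bg_natural, alpha=0.4)
```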
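To make the switching motif concrete, here is a minimal sketch of one plausible reading: a small group of context-indexed switching units whose recurrent weights feed back into a hidden layer, reshaping its activation pattern per context while the shared feedforward weights are left untouched. The wiring, the one-hot context activation, and all names here are illustrative assumptions; the paper's exact mechanism may differ.

```python
import numpy as np

class SwitchingLayer:
    """Hidden layer augmented with a few context 'switching' units.

    NOTE: an illustrative sketch, not the paper's exact architecture.
    A small group of switching units per context is activated when that
    context's background is present; their recurrent weights U project
    back into the hidden layer, shifting its activation pattern for the
    new context while the shared input weights W remain unchanged.
    """

    def __init__(self, n_in, n_hidden, n_switch, n_contexts, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_hidden, n_in))
        self.U = rng.normal(0.0, 0.1, (n_hidden, n_contexts * n_switch))
        self.n_switch = n_switch

    def forward(self, x, context):
        # Activate only the switching units assigned to this context.
        s = np.zeros(self.U.shape[1])
        s[context * self.n_switch:(context + 1) * self.n_switch] = 1.0
        return np.maximum(self.W @ x + self.U @ s, 0.0)  # ReLU

layer = SwitchingLayer(n_in=784, n_hidden=100, n_switch=3, n_contexts=2)
x = np.random.default_rng(1).random(784)
h_a = layer.forward(x, context=0)  # hidden code under context A
h_b = layer.forward(x, context=1)  # hidden code under context B
```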