Evidence from behavioral studies demonstrates that spoken language guides attention in a related visual scene and that attended scene information can influence the comprehension process. Here we model sentence comprehension within visual contexts. A recurrent neural network is trained to associate the linguistic input with the visual scene and to produce the interpretation of the described event, which is part of the visual scene. We further investigate a feedback mechanism that enables explicit utterance-mediated attention shifts to the relevant part of the scene. We compare four models - a simple recurrent network (SRN) and three models with specific types of additional feedback - in order to explore the role of the attention mechanism in the comprehension process. The results show that all networks not only learn to successfully produce the interpretation at the end of the sentence, but also exhibit predictive behavior, reflected in their ability to anticipate upcoming constituents. As expected, the SRN performs very well, but the comparison shows that adding an explicit attention mechanism does not degrade performance and even yields a slight improvement in one of the models.
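To make the architectural idea concrete, the following is a minimal sketch, not the authors' implementation, of an SRN that combines word-by-word linguistic input with a scene vector and optionally feeds an attention vector derived from the hidden state back to gate the scene input. All layer sizes, the gating scheme, and the class name `AttentionSRN` are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's implementation)
# of an SRN with an optional utterance-mediated attention feedback loop.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class AttentionSRN:
    def __init__(self, n_words, n_scene, n_hidden, n_interp,
                 use_feedback=True, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        self.W_in = rng.normal(0, s, (n_hidden, n_words))    # word input -> hidden
        self.W_scene = rng.normal(0, s, (n_hidden, n_scene))  # scene input -> hidden
        self.W_rec = rng.normal(0, s, (n_hidden, n_hidden))   # context (previous hidden) -> hidden
        self.W_out = rng.normal(0, s, (n_interp, n_hidden))   # hidden -> event interpretation
        self.W_att = rng.normal(0, s, (n_scene, n_hidden))    # hidden -> attention over scene (feedback)
        self.use_feedback = use_feedback
        self.h = np.zeros(n_hidden)

    def reset(self):
        self.h[:] = 0.0

    def step(self, word_vec, scene_vec):
        if self.use_feedback:
            # Utterance-mediated attention: gate the scene representation
            # with an attention vector computed from the previous hidden state.
            attention = sigmoid(self.W_att @ self.h)
            scene_vec = attention * scene_vec
        pre = self.W_in @ word_vec + self.W_scene @ scene_vec + self.W_rec @ self.h
        self.h = np.tanh(pre)
        # The interpretation can be read out at every word (anticipation)
        # and is evaluated at the end of the sentence.
        return sigmoid(self.W_out @ self.h)

# Usage: process a 3-word sentence paired with a fixed scene vector.
net = AttentionSRN(n_words=10, n_scene=6, n_hidden=20, n_interp=6)
scene = np.random.default_rng(1).random(6)
net.reset()
for t in range(3):
    word = np.zeros(10)
    word[t] = 1.0                     # one-hot word at position t
    interpretation = net.step(word, scene)
print(interpretation.shape)           # (6,)
```

Setting `use_feedback=False` reduces the sketch to a plain SRN, which is one way the comparison between the baseline and the feedback variants could be framed.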