Cross-media semantic retrieval (CSR) and cross-modal semantic mapping are key problems in multimedia search engines. The cognitive functions and neural structures underlying visual and auditory information processing provide an important reference for the study of brain-inspired CSR. In this paper, we analyze the hierarchy, functionality, and structure of the visual and auditory pathways in the brain. Drawing on ideas from deep belief networks and hierarchical temporal memory, we present a brain-inspired intelligent model, called cross-media semantic retrieval based on neural computing of visual and auditory sensations (CSRNCVA). We develop algorithms based on CSRNCVA that employ belief propagation over probabilistic graphical models together with hierarchical learning. Experiments show that our model and algorithms can be effectively applied to CSR. This work is of significance for building a brain-inspired cross-media intelligence framework.
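The abstract names belief propagation over probabilistic graphical models as a core algorithmic component of CSRNCVA. For reference only, the following is a minimal sketch of sum-product belief propagation on a toy three-node chain MRF in Python; the chain structure, potentials, and variable names are illustrative assumptions and not the paper's CSRNCVA implementation.

```python
# Minimal, illustrative sum-product belief propagation on a 3-node chain MRF
# with binary states. Potentials and structure are hypothetical toy values,
# not the CSRNCVA model from the paper.
import numpy as np

# Unary potentials phi[i] (shape: 2) and pairwise potentials psi[i]
# coupling node i and node i+1 (shape: 2 x 2).
phi = [np.array([0.7, 0.3]),
       np.array([0.4, 0.6]),
       np.array([0.5, 0.5])]
psi = [np.array([[0.9, 0.1],
                 [0.1, 0.9]]),
       np.array([[0.8, 0.2],
                 [0.2, 0.8]])]

n = len(phi)

# Forward messages m_fwd[i]: message passed from node i to node i+1.
m_fwd = [np.ones(2) for _ in range(n)]
for i in range(n - 1):
    incoming = phi[i] * (m_fwd[i - 1] if i > 0 else 1.0)
    m_fwd[i] = incoming @ psi[i]      # marginalize out the state of node i
    m_fwd[i] /= m_fwd[i].sum()        # normalize for numerical stability

# Backward messages m_bwd[i]: message passed from node i to node i-1.
m_bwd = [np.ones(2) for _ in range(n)]
for i in range(n - 1, 0, -1):
    incoming = phi[i] * (m_bwd[i + 1] if i < n - 1 else 1.0)
    m_bwd[i] = psi[i - 1] @ incoming  # marginalize out the state of node i
    m_bwd[i] /= m_bwd[i].sum()

# Node marginals: product of the local potential and all incoming messages.
for i in range(n):
    belief = phi[i].copy()
    if i > 0:
        belief *= m_fwd[i - 1]
    if i < n - 1:
        belief *= m_bwd[i + 1]
    print(f"P(x{i}) =", belief / belief.sum())
```

On a chain the forward and backward passes are exact; in a multi-layer, loopy graph of the kind a hierarchical visual-auditory model would induce, the same message updates are typically iterated until convergence (loopy belief propagation).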