The research aims to reveal neural indicators of the recognition of iconic words and the possible cross-modal multisensory integration underlying this process. The goals of this research are twofold: (1) to record event-related potentials (ERPs) in the brain during visual and auditory recognition of Russian imitative words at different stages of de-iconization; and (2) to establish whether brain activity differs when processing visual and auditory stimuli of different types. Sound-imitative (onomatopoeic, mimetic, and ideophonic) words are words with an iconic correlation between form and meaning (iconicity being a relationship of resemblance). Adult Russian participants (n = 110) were presented with 15 stimuli both visually and auditorily. The stimulus material was divided equally into three groups according to the criterion of (historical) iconicity loss: five explicit sound-imitative (SI) words, five implicit SI words, and five non-SI words. No statistically significant differences were found between visually presented explicit or implicit SI words and non-SI words. However, for auditorily presented stimuli, statistically significant differences were registered between explicit and implicit SI words in the N400 ERP component, and between implicit SI words and non-SI words in the P300 ERP component. We analyzed in detail the integrative brain activity in response to auditorily presented explicit SI words and compared it with the activity in response to implicit SI and non-SI words. This analysis showed that the N400 component recorded over the central channels (specifically Cz) was more prominent during the recognition of explicit SI words. We suggest that these results reflect a specific brain response associated with directed attention during cognitive decision-making tasks involving explicit and implicit SI words presented auditorily.
This may reflect the greater cognitive complexity of identifying this type of stimulus, given the demands of the experimental task, which may involve cross-modal integration.