Facial emotion classification for children is a relatively difficult task, and cross-cultural study that avoids bias in the formulation of emotion recognition models is indispensable to address its challenges. Motivated by evidence in the literature that hybrid feature extraction approaches improve image classification accuracy, this work develops a hybrid framework for emotion recognition from facial images of Tamil and Russian children. The dataset consists of audio-video recordings of 28 Tamil-speaking and 64 Russian-speaking children, collected in a controlled environment and labelled by experts. Traditional features, namely Grey Level Co-occurrence Matrix (GLCM) features and facial landmarks, are extracted and fused with You Only Look Once (YOLOv5) features. While facial landmarks and GLCM provide useful information about facial expressions and image texture, YOLOv5, being a single-stage object detector, makes the hybrid model fast and accurate at detecting small objects and in low-light settings. Several classifiers, including KNN, SVM, Random Forest, XGBoost, and Multilayer Perceptron, are employed on the fused features, yielding accuracies of 93%, 86%, 89%, 88%, and 90%, respectively. A majority-voting ensemble of these heterogeneous classifiers strengthens the model further, yielding an accuracy as high as 96% on the custom cross-cultural dataset of facial images of Russian and Indian children, consistent with the results obtained on the Indian and Russian datasets individually. Further, an ablation study reveals the effect of feature fusion in boosting performance and the dominance of the YOLOv5 features over the other two feature types.
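The majority-voting step over heterogeneous classifiers can be sketched as follows. This is a minimal illustration, not the paper's implementation: the classifier outputs below are hypothetical placeholder labels, and ties are broken by first occurrence.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists sample-by-sample,
    returning the most common label for each test sample."""
    return [Counter(sample).most_common(1)[0][0]
            for sample in zip(*predictions)]

# Hypothetical predictions from three of the base classifiers
# for five test images (labels are illustrative emotion classes)
knn = ["happy", "sad",   "happy", "angry", "neutral"]
svm = ["happy", "happy", "sad",   "angry", "neutral"]
rf  = ["sad",   "happy", "happy", "angry", "happy"]

print(majority_vote([knn, svm, rf]))
# → ['happy', 'happy', 'happy', 'angry', 'neutral']
```

With five heterogeneous classifiers, as in the paper, a clear majority is more likely and individual classifier errors are averaged out, which is what lifts the ensemble accuracy above the best single model.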