Although extensive research has addressed the retrieval of sea ice variables from synthetic aperture radar (SAR) and multimodal remote sensing data, cross-scene retrieval with regionally trained models remains a significant challenge. Previous studies have employed multi-task learning but have not sufficiently explored the interplay between network architecture and multi-task performance. Moreover, self-supervised learning (SSL) has shown promise for tasks with limited training samples, though its potential for sea ice variable retrieval requires further study. To address the challenge of cross-scene retrieval of sea ice variables, we introduce Multimodal Fusion Domain Adaptive (MFDA), a method that combines three key strategies: 1) multimodal SSL pre-training that improves the model's robustness to noise and promotes a hierarchical understanding of the modalities; 2) a unified convolutional and Transformer-based fusion architecture that strengthens the integration of multimodal data and improves semantic understanding; and 3) a domain adaptation module between the multimodal encoder and the multi-task decoding predictor that bridges the semantic gap between different regional environments. The performance of the proposed MFDA has been extensively evaluated on the Ai4Arctic dataset, and the experimental results show that MFDA outperforms state-of-the-art sea ice classification approaches on cross-scene sea ice retrieval.
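As a rough illustration of the pipeline outlined above, the sketch below wires per-modality convolutional stems and a Transformer fusion encoder to several task heads through an adversarial domain adaptation branch. This is a minimal sketch under stated assumptions, not the paper's implementation: all class names, layer sizes, and class counts are hypothetical; the gradient reversal layer is one common way to realize domain adaptation and is assumed here rather than taken from the source; and the SSL pre-training stage and pixel-level decoding are omitted for brevity.

```python
# Minimal PyTorch sketch of an encoder -> domain adaptation -> multi-task layout.
# Hypothetical illustration only; the actual MFDA architecture may differ.
import torch
import torch.nn as nn


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward
    pass; a common adversarial domain adaptation mechanism (assumed here)."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class MultimodalEncoder(nn.Module):
    """One convolutional stem per modality, then a Transformer encoder that
    fuses the concatenated token streams (hypothetical layout)."""

    def __init__(self, in_channels=(2, 1), dim=128):
        super().__init__()
        # e.g. dual-pol SAR (HH/HV) plus a single-channel auxiliary map
        self.stems = nn.ModuleList(
            nn.Conv2d(c, dim, kernel_size=4, stride=4) for c in in_channels
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, modalities):
        # patchify each modality into tokens: (B, N_i, dim)
        tokens = [s(x).flatten(2).transpose(1, 2) for s, x in zip(self.stems, modalities)]
        return self.fusion(torch.cat(tokens, dim=1))  # (B, sum N_i, dim)


class MFDASketch(nn.Module):
    """Shared encoder, a domain classifier fed through gradient reversal, and
    multi-task heads (e.g. concentration, stage of development, floe size;
    head names and class counts are placeholders)."""

    def __init__(self, dim=128, n_sic=11, n_sod=6, n_floe=7, lambd=1.0):
        super().__init__()
        self.encoder = MultimodalEncoder(dim=dim)
        self.lambd = lambd
        self.domain_head = nn.Linear(dim, 2)  # source scene vs. target scene
        self.heads = nn.ModuleDict({
            "SIC": nn.Linear(dim, n_sic),
            "SOD": nn.Linear(dim, n_sod),
            "FLOE": nn.Linear(dim, n_floe),
        })

    def forward(self, modalities):
        feats = self.encoder(modalities).mean(dim=1)  # pool tokens to (B, dim)
        # reversed gradients push the encoder toward scene-invariant features
        domain_logits = self.domain_head(GradientReversal.apply(feats, self.lambd))
        task_logits = {name: head(feats) for name, head in self.heads.items()}
        return task_logits, domain_logits


# usage: two modalities, e.g. dual-pol SAR and one auxiliary channel
sar = torch.randn(2, 2, 64, 64)
aux = torch.randn(2, 1, 64, 64)
tasks, dom = MFDASketch()([sar, aux])
```

In this kind of setup, the task heads and the domain classifier are trained jointly: the reversed gradient from the domain loss discourages the encoder from encoding scene-specific cues, which is one plausible way to narrow the regional semantic gap the abstract describes.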