Speech Multimodality

Basic Introduction

Speech multimodality refers to technology that combines speech with other modalities (such as text or images) for joint processing and analysis. Its core task is to improve the understanding and generation of speech content by fusing information from different modalities. Speech multimodality is widely applied in fields such as intelligent customer service, voice assistants, and video content analysis.

Currently, the speech multimodal large models deployed on MoArk include:

  • Qwen2-Audio-7B-Instruct
    Qwen2-Audio-7B-Instruct is a large-scale audio-language model that supports voice chat and audio analysis. It accepts input in multiple languages, improving both speech interaction and audio understanding.
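As a rough illustration of how such a model is typically invoked, the sketch below builds an OpenAI-style chat payload that pairs one audio clip (base64-encoded) with a text prompt. The `input_audio` content-part schema follows OpenAI's audio chat convention and is an assumption here; the exact request format for MoArk's endpoint should be checked against its API reference.

```python
import base64

def build_audio_chat_request(audio_bytes: bytes, prompt: str,
                             model: str = "Qwen2-Audio-7B-Instruct") -> dict:
    """Build a chat-completion payload carrying one audio clip plus text.

    NOTE: the `input_audio` content part and the "wav" format field are
    assumptions modeled on OpenAI-compatible audio APIs, not a confirmed
    MoArk schema.
    """
    audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Audio part: raw bytes encoded as base64 text.
                    {"type": "input_audio",
                     "input_audio": {"data": audio_b64, "format": "wav"}},
                    # Text part: the instruction about the audio.
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }

payload = build_audio_chat_request(b"\x00\x01", "What is said in this clip?")
```

The resulting dictionary can then be serialized to JSON and sent to a chat-completions endpoint with any HTTP client.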