
Voiceprint (Voice Cloning)

Basic Introduction

Voiceprint technology refers to the use of artificial intelligence models to generate speech audio that matches specific sound characteristics. By combining voiceprint information with text content, it can generate speech with specific timbre, intonation, and speech rate. Voiceprint technology is widely used in fields such as personalized text-to-speech, voice assistants, and intelligent customer service.

Most text-to-speech models currently available on Model Plaza support voiceprints, but they accept voiceprint information in different ways. The main ones are:

  • Built-in voiceprints
  • Directly using recordings as voiceprints
  • Using recordings with corresponding text content as voiceprints
  • Using .pt format voiceprint files
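As a quick reference, the four modes and the models this page names for each can be collected into a small lookup table. This is only a reading aid built from the sections that follow; the enum and helper names are illustrative and not part of any platform API.

```python
from enum import Enum


class VoiceprintMode(Enum):
    BUILT_IN = "built-in voiceprints"
    RECORDING = "recording only"
    RECORDING_WITH_TEXT = "recording plus matching text"
    PT_FILE = ".pt voiceprint file"


# Model names are the ones mentioned on this page; the mapping is a summary,
# not an exhaustive list of what Model Plaza offers.
MODEL_MODES = {
    "Spark-TTS-0.5B": VoiceprintMode.BUILT_IN,
    "IndexTTS-1.5": VoiceprintMode.RECORDING,
    "F5-TTS": VoiceprintMode.RECORDING_WITH_TEXT,
    "CosyVoice2-0.5B": VoiceprintMode.RECORDING_WITH_TEXT,
    "Step-Audio-TTS-3B": VoiceprintMode.RECORDING_WITH_TEXT,
    "CosyVoice-300M": VoiceprintMode.PT_FILE,
    "fish-speech-1.2-sft": VoiceprintMode.PT_FILE,
    "ChatTTS": VoiceprintMode.PT_FILE,
}


def models_for(mode):
    """Return the model names that use the given voiceprint mode, sorted."""
    return sorted(name for name, m in MODEL_MODES.items() if m is mode)
```

For instance, `models_for(VoiceprintMode.PT_FILE)` lists the three models that take a voice_url.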

1. Built-in Voiceprints

Some models do not support custom voiceprints; they have a few built-in voiceprint options, and users can directly use these models for text-to-speech. For example, Spark-TTS-0.5B allows selecting built-in male and female voiceprints.

Spark-TTS

2. Directly Using Recordings as Voiceprints

IndexTTS-1.5 uses recordings as voiceprint information. Users only need to upload an audio file, and the model will automatically extract sound features from the audio for text-to-speech.

Index-TTS

3. Using Recordings with Corresponding Text as Voiceprints

As shown in the figure below, F5-TTS, CosyVoice2-0.5B, and Step-Audio-TTS-3B all take a sample recording in WAV format, together with the text spoken in that recording, as voiceprint information. Use recording software to record clearly articulated speech. The recommended duration is 5-15 seconds, the file format is .wav, and the file should be kept reasonably small.

Text-to-Speech
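Before uploading a sample, it can help to confirm locally that the recording is in the recommended 5-15 second range. The sketch below uses only Python's standard `wave` module and assumes an uncompressed PCM .wav file; the function names are illustrative.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_sample(path, min_s=5.0, max_s=15.0):
    """True when the clip length falls in the recommended 5-15 second range."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

The duration limits are the recommendation from this page, not a hard limit enforced by the models, so a clip outside the range may still work.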

4. Using .pt Format Voiceprint Files

You only need to set voice_url in the parameter settings of the speech generation model. The .pt format voiceprint file is passed to the model through this URL, and the model generates audio that matches the sound characteristics in the file. The key steps are:

  1. Create a .pt format voiceprint file, which can be converted from an audio file. Different models require different voiceprint files:
  • For the CosyVoice-300M and fish-speech-1.2-sft models, use the voiceprint file interface provided by the platform.
  • For the ChatTTS model, use the ChatTTS voice cloning tool at http://region-9.autodl.pro:41137/ (this page is currently inaccessible, and the official service for generating .pt files is no longer provided).
  2. Store the created .pt format voiceprint file at a publicly downloadable address, such as your Gitee code repository.

  3. Copy the download address into the model's voice_url parameter. Finally, enter the text and run the model to generate speech with the same timbre as the file.

Step 1: Create .pt Format Voiceprint Files

1.1 Create Audio Files

Use recording software to record clearly articulated speech. The recommended duration is 5-15 seconds, and the file format is .mp3 or .m4a. Keep the file reasonably small.

1.2 Generate Voiceprint Files for CosyVoice-300M and fish-speech-1.2-sft Models

Upload the audio file to the voice feature extraction interface provided by the platform. The following is a detailed description of the interface:

Function Description

This interface is used to process audio files and extract key audio features.

Notes

  • File size limit: less than 5 MB
  • Supported audio formats: .mp3 or .m4a
  • This interface can extract key features from audio for subsequent processing and analysis.
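The two limits in the notes above can be checked locally before calling the interface. This is a minimal sketch: it reads "5 M" as 5 × 1024 × 1024 bytes, which is an assumption, and it only looks at the file extension rather than the actual audio encoding.

```python
import os

# Assumption: the "less than 5 M" limit is interpreted as 5 MiB.
MAX_BYTES = 5 * 1024 * 1024
ALLOWED_EXTENSIONS = {".mp3", ".m4a"}


def check_audio_file(path):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if os.path.getsize(path) >= MAX_BYTES:
        problems.append("file is 5 MB or larger")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report both problems at once.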

Calling Method

HTTPS call

POST https://moark.ai/v1/audio/voice-feature-extraction

Request Parameters

| Parameter Position | Name | Type | Required | Description |
| --- | --- | --- | --- | --- |
| Header | Authorization | string | Yes | Access token, which can be generated and obtained in Workbench -> Settings -> Access Tokens. Format: "Bearer access_token". Example: "Bearer t-g1044qeGEDXTB6NDJOGV4JQCYDGHRBARFTGT1234" |
| form-data | file | file | Yes | Voice content. Note: Content-Type is application/octet-stream, and the example value is a binary file. |
| form-data | prompt_text | string | Yes | Prompt content. Note: Text description consistent with the recording content. |
| form-data | model | string | Yes | Model name: CosyVoice-300M |

Return Parameters

When the HTTP status code is 200, it indicates success. Returns the file binary stream.
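Since a successful call returns the voiceprint as a raw binary stream, the caller has to write the response body to disk itself. The helper below is a hypothetical sketch of that step; its name and the choice to append a .pt suffix are illustrative, and the page does not describe what error responses look like, so any non-200 status is simply treated as a failure.

```python
def save_voiceprint(status_code, body, dest_path):
    """Write the returned binary stream to dest_path if the call succeeded."""
    if status_code != 200:
        raise RuntimeError(f"feature extraction failed with HTTP {status_code}")
    if not dest_path.endswith(".pt"):
        dest_path += ".pt"  # voice_url expects a .pt format file
    with open(dest_path, "wb") as f:
        f.write(body)
    return dest_path
```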

cURL Example

# --output saves the returned binary stream; the file name voiceprint.pt is only an example
curl --location --request POST 'https://moark.ai/v1/audio/voice-feature-extraction' \
  --header 'Authorization: Bearer <your access token>' \
  --form 'model="CosyVoice-300M"' \
  --form 'file=@"/path/to/recording.mp3"' \
  --form 'prompt_text="Text description consistent with the recording content"' \
  --output voiceprint.pt
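The same request can be sketched in Python using only the standard library. The endpoint, header, and form field names come from the parameter table above; the function name `build_request` and the hand-rolled multipart encoding are illustrative choices, not an official client. The function builds the request without sending it, so it can be inspected first.

```python
import urllib.request
import uuid

API_URL = "https://moark.ai/v1/audio/voice-feature-extraction"


def build_request(token, audio_name, audio_bytes, prompt_text, model="CosyVoice-300M"):
    """Build (but do not send) the multipart/form-data POST request."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields: model and prompt_text.
    for name, value in (("model", model), ("prompt_text", prompt_text)):
        field = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(field.encode("utf-8"))
    # File field: the audio is sent as application/octet-stream.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{audio_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(head.encode("utf-8") + audio_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it; on success the response body should be the binary stream to save as a .pt file.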

Example of Request Using APIfox Interface Tool

  1. Create a new interface and enter the interface address.

  2. Add the Authorization parameter. The parameter value can be obtained from MoArk Workbench - Access Tokens.

  3. Select Body -> form-data.

  4. Add the file parameter and upload the audio file; add the prompt_text parameter with a value that is the text description consistent with the recording content; add the model parameter listed in the request parameters (for example, CosyVoice-300M). After completion, click Send.

  5. After a successful request, download the file returned by the interface and rename it to xxxx.pt.


1.3 Generate Voiceprint Files for ChatTTS Model

  1. Visit the ChatTTS voice cloning tool at http://region-9.autodl.pro:41137/

  2. Upload or record audio and click submit. Wait for generation to finish.

  3. After successful generation, download the .pt file for the ChatTTS PyPI version.


Step 2: Upload Voiceprint File and Obtain Download URL

The speech generation model reads the voiceprint file through its voice_url parameter, so the file must be uploaded to a network disk or another publicly downloadable location. Obtain the file's download URL and set it as the voice_url parameter of the voice model.

We recommend creating or selecting a public repository on Gitee and uploading the .pt file to it. Find the uploaded file in the repository, click the file name to open its download page, then right-click the download button and copy the link address.
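Before pasting the address into voice_url, a quick check can catch a mistyped link or a file that is not actually public. The sketch below is an assumption-laden illustration: the function names are hypothetical, some hosts do not answer HEAD requests, and a 200 response only shows that something is served at the address without credentials, not that it is a valid voiceprint.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def looks_like_pt_url(url):
    """Cheap local check: an http(s) URL whose path ends in .pt."""
    parsed = urlparse(url)
    return (
        parsed.scheme in ("http", "https")
        and bool(parsed.netloc)
        and parsed.path.lower().endswith(".pt")
    )


def is_publicly_downloadable(url, timeout=10):
    """Send a HEAD request without credentials; True if the server answers 200."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False
```

Only `looks_like_pt_url` runs offline; `is_publicly_downloadable` needs network access to the host.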

Step 3: Paste the Address in Model Parameters

Depending on the type of .pt file you generated, select CosyVoice-300M, ChatTTS, or fish-speech-1.2-sft. Paste the voiceprint file download URL into the voice_url parameter, enter the text, and run the model to generate speech with the same timbre as the voiceprint.
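When scripting this step, the three inputs can be gathered and sanity-checked in one place. The sketch below is hypothetical: only `voice_url` is a parameter name taken from this page, while the `model` and `text` keys and the function name are placeholders, since the request format of the speech generation models is not described here.

```python
# The three models this page lists as accepting a .pt voiceprint via voice_url.
PT_MODELS = {"CosyVoice-300M", "ChatTTS", "fish-speech-1.2-sft"}


def build_params(model, voice_url, text):
    """Collect the settings for one run; only `voice_url` is a documented name."""
    if model not in PT_MODELS:
        raise ValueError(f"{model!r} does not take a .pt voiceprint on this page")
    if not text.strip():
        raise ValueError("text to synthesize must not be empty")
    # "model" and "text" are placeholder keys (assumption, not a documented schema).
    return {"model": model, "voice_url": voice_url, "text": text}
```

The model check matters because a .pt file generated for one model is meant to be used with that same model.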


You can create voiceprint file URLs for personal or specific voices, experience the generation effects of different models, and integrate the model API into your business to implement various interesting and useful applications.