Voiceprint (Voice Cloning)

Basic Introduction

Voiceprint technology refers to the use of artificial intelligence models to generate speech audio that matches specific sound characteristics. By combining voiceprint information with text content, it can generate speech with specific timbre, intonation, and speech rate. Voiceprint technology is widely used in fields such as personalized text-to-speech, voice assistants, and intelligent customer service.

Currently, most text-to-speech models available on Model Plaza support voiceprints, but each model uses voiceprints in different ways, mainly including the following:

Built-in voiceprints
Directly using recordings as voiceprints
Using recordings with corresponding text content as voiceprints
Using .pt format voiceprint files

1. Built-in Voiceprints

Some models do not support custom voiceprints; they have a few built-in voiceprint options, and users can directly use these models for text-to-speech. For example, Spark-TTS-0.5B allows selecting built-in male and female voiceprints.

Spark-TTS

2. Directly Using Recordings as Voiceprints

IndexTTS-1.5 uses recordings as voiceprint information. Users only need to upload an audio file, and the model will automatically extract sound features from the audio for text-to-speech.

Index-TTS

3. Using Recordings with Corresponding Text as Voiceprints

As shown in the figure below, the voiceprint requirements for models F5-TTS, CosyVoice2-0.5B, and Step-Audio-TTS-3B all require recording a sample audio in WAV format, along with the text content corresponding to the recording as voiceprint information. Please use recording software to record audio with clear articulation. The recommended duration is 5-15 seconds, and the file format is .wav. It is advisable that the file size is not too large.

Text-to-Speech

4. Using .pt Format Voiceprint Files

You only need to configure voice_url in the parameter settings of the speech generation model, passing the .pt format voiceprint file to the model via a URL, and you can generate audio that matches the sound characteristics in the file. It mainly includes the following key steps:

Create a .pt format voiceprint file, which can be converted from an audio file. Different models require different voiceprint files:

For CosyVoice-300M and fish-speech-1.2-sft models, you can use the voiceprint file interface provided by the platform.
For ChatTTS model, you can use the ChatTTS voice cloning tool at http://region-9.autodl.pro:41137/ (this page is currently inaccessible, and the official no longer provides direct services for generating pt files).

Store the created .pt format voiceprint file in a publicly downloadable address, such as your Gitee code repository.
Copy the download address to the voice_url parameter in the model as shown in the figure below. Finally, enter the text and run it to generate a sound with the same timbre as the file.

alt text

Step 1: Create .pt Format Voiceprint Files

1.1 Create Audio Files

Use recording software to record audio with clear articulation. The recommended duration is 5-15 seconds, and the file format is .mp3 or .m4a. It is advisable that the file size is not too large.

1.2 Generate Voiceprint Files for CosyVoice-300M and fish-speech-1.2-sft Models

Upload the audio file to the voice feature extraction interface provided by the platform. The following is a detailed description of the interface:

Function Description

This interface is used to process audio files and extract key audio features.

Notes

File size limit: less than 5 M
Supported audio formats: .mp3 or .m4a
This interface can extract key features from audio for subsequent processing and analysis.

Calling Method

HTTPS call

POST https://moark.ai/v1/audio/voice-feature-extraction

Request Parameters

Parameter Position	Name	Type	Required	Description
Header	Authorization	string	Yes	Access token, which can be generated and obtained in Workbench -> Settings -> Access Tokens. Format: "Bearer access_token", Example: "Bearer t-g1044qeGEDXTB6NDJOGV4JQCYDGHRBARFTGT1234"
form-data	file	file	Yes	Voice content. Note: Content-Type is application/octet-stream, and the example value is a binary file.
form-data	prompt_text	string	Yes	Prompt content. Note: Text description consistent with the recording content.
form-data	model	string	Yes	Model name: CosyVoice-300M

Return Parameters

When the HTTP status code is 200, it indicates success. Returns the file binary stream.

cURL Example

cURL
--location --request POST 'https://moark.ai/v1/audio/voice-feature-extraction'
--header 'Authorization: Bearer Enter your access token '
--form 'model="CosyVoice-300M"'
--form 'file=@"Upload .mp3 or .m4a format files"'
--form 'prompt_text="Text description consistent with the recording content"'

Example of Request Using APIfox Interface Tool

Create a new interface and enter the interface address.
Add the Authorization parameter. The parameter value can be obtained from MoArk Workbench - Access Tokens.
Select Body -> form-data.
Add the file parameter and upload the audio file; add the prompt_text parameter with a value that is the text description consistent with the recording content. After completion, click Send.
After a successful request, download the file returned by the interface and rename it to xxxx.pt.

alt text

1.3 Generate Voiceprint Files for ChatTTS Model

Visit the ChatTTS voice cloning tool at http://region-9.autodl.pro:41137/

alt text

Upload/record audio and click submit. Wait for generation.
After successful generation, download the pt file for the ChatTTS pypi version.

alt text

Step 2: Upload Voiceprint File and Obtain Download URL

The voice_url parameter of the speech generation model needs to read the voiceprint file, so we need to upload the voiceprint file to a network disk or other downloadable public space, obtain the file download URL, and configure it to the voice_url parameter of the voice model.

It is recommended that you create or select a public repository on Gitee and upload the .pt file to the repository. Find the uploaded file in the Gitee repository, click the file name to enter the download page, right-click to download and copy the download address, as shown below:

alt text

Step 3: Paste the Address in Model Parameters

According to the type of pt file you generated, select CosyVoice-300M, ChatTTS, or fish-speech-1.2-sft. Paste the voiceprint file download URL into the voice_url parameter, enter text, and run. You can generate a sound with the same timbre as the voiceprint.

alt text

You can create voiceprint file URLs for personal or specific voices, experience the generation effects of different models, and integrate the model API into your business to implement various interesting and useful applications.

Basic Introduction​

1. Built-in Voiceprints​

2. Directly Using Recordings as Voiceprints​

3. Using Recordings with Corresponding Text as Voiceprints​

4. Using .pt Format Voiceprint Files​

Step 1: Create .pt Format Voiceprint Files​

1.1 Create Audio Files​

1.2 Generate Voiceprint Files for CosyVoice-300M and fish-speech-1.2-sft Models​

1.3 Generate Voiceprint Files for ChatTTS Model​

Step 2: Upload Voiceprint File and Obtain Download URL​

Step 3: Paste the Address in Model Parameters​

Basic Introduction

1. Built-in Voiceprints

2. Directly Using Recordings as Voiceprints

3. Using Recordings with Corresponding Text as Voiceprints

4. Using .pt Format Voiceprint Files

Step 1: Create .pt Format Voiceprint Files

1.1 Create Audio Files

1.2 Generate Voiceprint Files for CosyVoice-300M and fish-speech-1.2-sft Models

1.3 Generate Voiceprint Files for ChatTTS Model

Step 2: Upload Voiceprint File and Obtain Download URL

Step 3: Paste the Address in Model Parameters