2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
@@ -17,7 +17,7 @@ jobs:
max-parallel: 10
matrix:
python-version: [3.7]
tensorflow-version: [2.3.0]
tensorflow-version: [2.3.1]
steps:
- uses: actions/checkout@master
- uses: actions/setup-python@v1
1 change: 0 additions & 1 deletion .gitignore
@@ -42,5 +42,4 @@ dump_baker/
dump_ljspeech/
dump_kss/
dump_libritts/
/examples/*/*
/notebooks/test_saved/
3 changes: 3 additions & 0 deletions README.md
@@ -19,6 +19,7 @@
:zany_face: TensorFlowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, Melgan, Multiband-Melgan, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2, we can speed up training and inference, optimize further by using [fake-quantize aware](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide) training and [pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras), and make TTS models run faster than real time and deployable on mobile devices or embedded systems.

## What's new
- 2020/11/24 **(NEW!)** Add HiFi-GAN vocoder. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/hifigan)
- 2020/11/19 **(NEW!)** Add Multi-GPU gradient accumulator. See [here](https://github.com/TensorSpeech/TensorFlowTTS/pull/377)
- 2020/08/23 Add Parallel WaveGAN tensorflow implementation. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/parallel_wavegan)
- 2020/08/23 Add MBMelGAN G + ParallelWaveGAN G example. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/multiband_pwgan)
@@ -85,6 +86,7 @@ TensorFlowTTS currently provides the following architectures:
4. **Multi-band MelGAN** released with the paper [Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech](https://arxiv.org/abs/2005.05106) by Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie.
5. **FastSpeech2** released with the paper [FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558) by Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu.
6. **Parallel WaveGAN** released with the paper [Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480) by Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim.
7. **HiFi-GAN** released with the paper [HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646) by Jungil Kong, Jaehyeon Kim, Jaekyoung Bae.

We are also implementing some techniques to improve quality and convergence speed from the following papers:

@@ -217,6 +219,7 @@ To know how to train model from scratch or fine-tune with other datasets/languag
- For Multiband-MelGAN tutorial, pls see [examples/multiband_melgan](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/multiband_melgan)
- For Parallel WaveGAN tutorial, pls see [examples/parallel_wavegan](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/parallel_wavegan)
- For Multiband-MelGAN Generator + Parallel WaveGAN Discriminator tutorial, pls see [examples/multiband_pwgan](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/multiband_pwgan)
- For HiFi-GAN tutorial, pls see [examples/hifigan](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/hifigan)
# Abstract Class Explanation

## Abstract DataLoader Tensorflow-based dataset
65 changes: 65 additions & 0 deletions examples/hifigan/README.md
@@ -0,0 +1,65 @@
# HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
Based on the script [`train_hifigan.py`](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/hifigan/train_hifigan.py).

## Training HiFi-GAN from scratch with LJSpeech dataset.
This example shows how to train HiFi-GAN from scratch with TensorFlow 2, using a custom training loop and tf.function. The data used for this example is LJSpeech; you can download the dataset at [link](https://keithito.com/LJ-Speech-Dataset/).

### Step 1: Create Tensorflow based Dataloader (tf.dataset)
First, you need to define a data loader based on the AbstractDataset class (see [`abstract_dataset.py`](https://github.com/tensorspeech/TensorFlowTTS/tree/master/tensorflow_tts/datasets/abstract_dataset.py)). In this example, the dataloader reads the dataset from a path. I use file suffixes to tell audio files apart from mel-spectrogram files (see [`audio_mel_dataset.py`](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/melgan/audio_mel_dataset.py)). If you already have a preprocessed version of your target dataset, you don't need to use this example dataloader; you just need to refer to my dataloader and modify the **generator function** to fit your case. Normally, a generator function should return [audio, mel].
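The suffix-based pairing described above can be sketched in plain Python. This is only an illustration, not the code in `audio_mel_dataset.py`: the suffixes `-wave.npy` and `-raw-feats.npy`, and the helper names `pair_audio_mel_files` and `generator`, are assumptions chosen for the example. The sketch yields file paths; loading them into arrays is left to your own dataloader.

```python
import os

# Assumed suffixes for illustration; adjust them to your preprocessing output.
AUDIO_SUFFIX = "-wave.npy"
MEL_SUFFIX = "-raw-feats.npy"


def pair_audio_mel_files(root_dir, audio_suffix=AUDIO_SUFFIX, mel_suffix=MEL_SUFFIX):
    """Return sorted (utt_id, audio_path, mel_path) tuples found under root_dir."""
    audio, mel = {}, {}
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(audio_suffix):
                audio[name[: -len(audio_suffix)]] = path
            elif name.endswith(mel_suffix):
                mel[name[: -len(mel_suffix)]] = path
    # Keep only utterances that have both an audio file and a mel file.
    common = sorted(set(audio) & set(mel))
    return [(utt_id, audio[utt_id], mel[utt_id]) for utt_id in common]


def generator(root_dir):
    """Yield one [audio_path, mel_path] pair per utterance."""
    for _, audio_path, mel_path in pair_audio_mel_files(root_dir):
        yield [audio_path, mel_path]
```

Utterances that are missing either file are skipped, so every yielded pair is complete.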

### Step 2: Training from scratch
After you re-define your dataloader, pls modify the input arguments, train_dataset and valid_dataset, in [`train_hifigan.py`](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/hifigan/train_hifigan.py). Here is an example command line for training HiFi-GAN from scratch:

First, you need to train the generator with only the STFT loss:

```bash
CUDA_VISIBLE_DEVICES=0 python examples/hifigan/train_hifigan.py \
--train-dir ./dump/train/ \
--dev-dir ./dump/valid/ \
--outdir ./examples/hifigan/exp/train.hifigan.v1/ \
--config ./examples/hifigan/conf/hifigan.v1.yaml \
--use-norm 1 \
--generator_mixed_precision 1 \
--resume ""
```

Then resume and start training generator + discriminator:

```bash
CUDA_VISIBLE_DEVICES=0 python examples/hifigan/train_hifigan.py \
--train-dir ./dump/train/ \
--dev-dir ./dump/valid/ \
--outdir ./examples/hifigan/exp/train.hifigan.v1/ \
--config ./examples/hifigan/conf/hifigan.v1.yaml \
--use-norm 1 \
--resume ./examples/hifigan/exp/train.hifigan.v1/checkpoints/ckpt-100000
```

If you want to train on multiple GPUs, you can replace `CUDA_VISIBLE_DEVICES=0` with `CUDA_VISIBLE_DEVICES=0,1,2,3`, for example. You also need to tune the `batch_size` for each GPU (in the config file) yourself to maximize performance. Note that multi-GPU is currently supported for training but not yet for decoding.

If you want to resume training, please follow the example command line below:
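When tuning `batch_size` per GPU, it can help to estimate how many samples one optimizer update sees in total. The sketch below assumes the common convention that this total is the per-GPU batch size times the number of GPUs times the gradient accumulation steps; whether `train_hifigan.py` combines them exactly this way is an assumption, and both helper names are hypothetical.

```python
def visible_gpu_count(env):
    """Count the device ids listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, because CUDA then applies
    no restriction and the count cannot be derived from the variable.
    """
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([device for device in value.split(",") if device.strip()])


def samples_per_update(per_gpu_batch_size, num_gpus, gradient_accumulation_steps=1):
    """Assumed convention: per-GPU batch x number of GPUs x accumulation steps."""
    return per_gpu_batch_size * num_gpus * gradient_accumulation_steps
```

For example, with `batch_size: 16` and `CUDA_VISIBLE_DEVICES=0,1,2,3`, this convention gives 64 samples per update, so you may want to lower the per-GPU value if you need to keep the total fixed.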

```bash
--resume ./examples/hifigan/exp/train.hifigan.v1/checkpoints/ckpt-100000
```

If you want to fine-tune a model, use `--pretrained` with the filename of the generator, like this:
```bash
--pretrained ptgenerator.h5
```

**IMPORTANT NOTES**:

- When training the generator only, we enable mixed precision to speed up training.
- We don't apply mixed precision when training both the generator and the discriminator. (The discriminator includes group convolution, which makes the discriminator slower when mixed precision is enabled.)
- 100k here is the *discriminator_train_start_steps* parameter from [hifigan.v1.yaml](https://github.com/tensorspeech/TensorflowTTS/tree/master/examples/hifigan/conf/hifigan.v1.yaml)
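The notes above amount to a two-phase schedule, which can be sketched as follows. This is an illustration only: the function name `training_phase` is hypothetical, and treating step 100000 itself as the first generator + discriminator step is an assumption about how the trainer compares the step counter.

```python
def training_phase(step, discriminator_train_start_steps=100000):
    """Describe which networks train at `step` and the suggested precision flag."""
    if step < discriminator_train_start_steps:
        # Phase 1: generator only, STFT loss, mixed precision enabled.
        return {"train_discriminator": False, "generator_mixed_precision": 1}
    # Phase 2: generator + discriminator, mixed precision disabled.
    return {"train_discriminator": True, "generator_mixed_precision": 0}
```

This matches the two commands in Step 2: the first run passes `--generator_mixed_precision 1`, and the second run resumes from `ckpt-100000` without that flag.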


## Reference

1. https://github.com/descriptinc/melgan-neurips
2. https://github.com/kan-bayashi/ParallelWaveGAN
3. https://github.com/tensorflow/addons
4. [HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646)
5. [MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis](https://arxiv.org/abs/1910.06711)
6. [Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480)
116 changes: 116 additions & 0 deletions examples/hifigan/conf/hifigan.v1.yaml
@@ -0,0 +1,116 @@

# This is the hyperparameter configuration file for Hifigan.
# Please make sure this is adjusted for the LJSpeech dataset. If you want to
# apply to the other dataset, you might need to carefully change some parameters.
# This configuration performs 4000k iters.

###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
sampling_rate: 22050 # Sampling rate of dataset.
hop_size: 256 # Hop size.
format: "npy"


###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
model_type: "hifigan_generator"

hifigan_generator_params:
    out_channels: 1
    kernel_size: 7
    filters: 512
    use_bias: true
    upsample_scales: [8, 8, 2, 2]
    stacks: 3
    stack_kernel_size: [3, 7, 11]
    stack_dilation_rate: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
    use_final_nolinear_activation: true
    is_weight_norm: false

###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
hifigan_discriminator_params:
    out_channels: 1 # Number of output channels (number of subbands).
    period_scales: [2, 3, 5, 7, 11] # List of period scales.
    n_layers: 5 # Number of layers in each period discriminator.
    kernel_size: 5 # Kernel size.
    strides: 3 # Strides.
    filters: 8 # Input conv filters of each period discriminator.
    filter_scales: 4 # Filter scales.
    max_filters: 1024 # Maximum filters of the period discriminator's conv.
    is_weight_norm: false # Use weight-norm or not.

melgan_discriminator_params:
    out_channels: 1 # Number of output channels.
    scales: 3 # Number of multi-scales.
    downsample_pooling: "AveragePooling1D" # Pooling type for the input downsampling.
    downsample_pooling_params: # Parameters of the above pooling function.
        pool_size: 4
        strides: 2
    kernel_sizes: [5, 3] # List of kernel size.
    filters: 16 # Number of channels of the initial conv layer.
    max_downsample_filters: 1024 # Maximum number of channels of downsampling layers.
    downsample_scales: [4, 4, 4, 4] # List of downsampling scales.
    nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
    nonlinear_activation_params: # Parameters of nonlinear activation function.
        alpha: 0.2
    is_weight_norm: false # Use weight-norm or not.

###########################################################
# STFT LOSS SETTING #
###########################################################
stft_loss_params:
    fft_lengths: [1024, 2048, 512] # List of FFT size for STFT-based loss.
    frame_steps: [120, 240, 50] # List of hop size for STFT-based loss.
    frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.

###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_feat_match: 10.0
lambda_adv: 4.0

###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size for each GPU with assuming that gradient_accumulation_steps == 1.
batch_max_steps: 8192 # Length of each audio in batch for training. Make sure it is divisible by hop_size.
batch_max_steps_valid: 81920 # Length of each audio for validation. Make sure it is divisible by hop_size.
remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
is_shuffle: true # shuffle dataset after each epoch.

###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
    lr_fn: "PiecewiseConstantDecay"
    lr_params:
        boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
        values: [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
    amsgrad: false

discriminator_optimizer_params:
    lr_fn: "PiecewiseConstantDecay"
    lr_params:
        boundaries: [100000, 200000, 300000, 400000, 500000]
        values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
    amsgrad: false

gradient_accumulation_steps: 1 # should be even number or 1.
###########################################################
# INTERVAL SETTING #
###########################################################
discriminator_train_start_steps: 100000 # steps begin training discriminator
train_max_steps: 4000000 # Number of training steps.
save_interval_steps: 20000 # Interval steps to save checkpoint.
eval_interval_steps: 5000 # Interval steps to evaluate the network.
log_interval_steps: 200 # Interval steps to record the training log.

###########################################################
# OTHER SETTING #
###########################################################
num_save_intermediate_results: 1 # Number of batch to be saved as intermediate results.
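
A few values in this config depend on each other, and a quick sanity check can catch mistakes when adapting it to another dataset. The sketch below is not part of the PR; the helper names are hypothetical. It assumes that the product of `upsample_scales` should equal `hop_size` (so that each mel frame expands to `hop_size` audio samples), and it mirrors the documented behaviour of `PiecewiseConstantDecay`, where `values` has one more entry than `boundaries` and a boundary step still uses the earlier value.

```python
from bisect import bisect_left
from functools import reduce
from operator import mul

# Values copied from hifigan.v1.yaml (generator learning-rate schedule).
CONFIG = {
    "hop_size": 256,
    "upsample_scales": [8, 8, 2, 2],
    "batch_max_steps": 8192,
    "batch_max_steps_valid": 81920,
    "boundaries": [100000, 200000, 300000, 400000, 500000, 600000, 700000],
    "values": [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625,
               0.00003125, 0.000015625, 0.000001],
}


def check_config(config):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if reduce(mul, config["upsample_scales"], 1) != config["hop_size"]:
        problems.append("product of upsample_scales differs from hop_size")
    for key in ("batch_max_steps", "batch_max_steps_valid"):
        if config[key] % config["hop_size"] != 0:
            problems.append(key + " is not divisible by hop_size")
    if len(config["values"]) != len(config["boundaries"]) + 1:
        problems.append("values must have exactly one more entry than boundaries")
    return problems


def piecewise_constant(step, boundaries, values):
    """Learning rate at `step`: values[i] applies while step <= boundaries[i]."""
    return values[bisect_left(boundaries, step)]
```

With the values above, `check_config(CONFIG)` returns an empty list, and the generator learning rate first drops from 0.0005 to 0.00025 after step 200000.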
116 changes: 116 additions & 0 deletions examples/hifigan/conf/hifigan.v2.yaml
@@ -0,0 +1,116 @@

# This is the hyperparameter configuration file for Hifigan.
# Please make sure this is adjusted for the LJSpeech dataset. If you want to
# apply to the other dataset, you might need to carefully change some parameters.
# This configuration performs 4000k iters.

###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
sampling_rate: 22050 # Sampling rate of dataset.
hop_size: 256 # Hop size.
format: "npy"


###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
model_type: "hifigan_generator"

hifigan_generator_params:
    out_channels: 1
    kernel_size: 7
    filters: 128
    use_bias: true
    upsample_scales: [8, 8, 2, 2]
    stacks: 3
    stack_kernel_size: [3, 7, 11]
    stack_dilation_rate: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
    use_final_nolinear_activation: true
    is_weight_norm: false

###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
hifigan_discriminator_params:
    out_channels: 1 # Number of output channels (number of subbands).
    period_scales: [2, 3, 5, 7, 11] # List of period scales.
    n_layers: 5 # Number of layers in each period discriminator.
    kernel_size: 5 # Kernel size.
    strides: 3 # Strides.
    filters: 8 # Input conv filters of each period discriminator.
    filter_scales: 4 # Filter scales.
    max_filters: 512 # Maximum filters of the period discriminator's conv.
    is_weight_norm: false # Use weight-norm or not.

melgan_discriminator_params:
    out_channels: 1 # Number of output channels.
    scales: 3 # Number of multi-scales.
    downsample_pooling: "AveragePooling1D" # Pooling type for the input downsampling.
    downsample_pooling_params: # Parameters of the above pooling function.
        pool_size: 4
        strides: 2
    kernel_sizes: [5, 3] # List of kernel size.
    filters: 16 # Number of channels of the initial conv layer.
    max_downsample_filters: 512 # Maximum number of channels of downsampling layers.
    downsample_scales: [4, 4, 4, 4] # List of downsampling scales.
    nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
    nonlinear_activation_params: # Parameters of nonlinear activation function.
        alpha: 0.2
    is_weight_norm: false # Use weight-norm or not.

###########################################################
# STFT LOSS SETTING #
###########################################################
stft_loss_params:
    fft_lengths: [1024, 2048, 512] # List of FFT size for STFT-based loss.
    frame_steps: [120, 240, 50] # List of hop size for STFT-based loss.
    frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.

###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_feat_match: 10.0
lambda_adv: 4.0

###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size for each GPU with assuming that gradient_accumulation_steps == 1.
batch_max_steps: 8192 # Length of each audio in batch for training. Make sure it is divisible by hop_size.
batch_max_steps_valid: 81920 # Length of each audio for validation. Make sure it is divisible by hop_size.
remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
is_shuffle: true # shuffle dataset after each epoch.

###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
    lr_fn: "PiecewiseConstantDecay"
    lr_params:
        boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
        values: [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
    amsgrad: false

discriminator_optimizer_params:
    lr_fn: "PiecewiseConstantDecay"
    lr_params:
        boundaries: [100000, 200000, 300000, 400000, 500000]
        values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
    amsgrad: false

gradient_accumulation_steps: 1 # should be even number or 1.
###########################################################
# INTERVAL SETTING #
###########################################################
discriminator_train_start_steps: 100000 # steps begin training discriminator
train_max_steps: 4000000 # Number of training steps.
save_interval_steps: 20000 # Interval steps to save checkpoint.
eval_interval_steps: 5000 # Interval steps to evaluate the network.
log_interval_steps: 200 # Interval steps to record the training log.

###########################################################
# OTHER SETTING #
###########################################################
num_save_intermediate_results: 1 # Number of batch to be saved as intermediate results.