TensorSpeech · dathudeptrai · Nov 24, 2020
diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
@@ -17,7 +17,7 @@ jobs:
       max-parallel: 10
       matrix:
         python-version: [3.7]
-        tensorflow-version: [2.3.0]
+        tensorflow-version: [2.3.1]
     steps:
       - uses: actions/checkout@master
       - uses: actions/setup-python@v1

diff --git a/.gitignore b/.gitignore
@@ -42,5 +42,4 @@ dump_baker/
 dump_ljspeech/
 dump_kss/
 dump_libritts/
-/examples/*/*
 /notebooks/test_saved/
diff --git a/README.md b/README.md
@@ -19,6 +19,7 @@
 :zany_face: TensorFlowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, Melgan, Multiband-Melgan, FastSpeech, FastSpeech2 based-on TensorFlow 2. With Tensorflow 2, we can speed-up training/inference progress, optimizer further by using [fake-quantize aware](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide) and [pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras), make TTS models can be run faster than real-time and be able to deploy on mobile devices or embedded systems.
 
 ## What's new
+- 2020/11/24 **(NEW!)** Add HiFi-GAN vocoder. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/hifigan)
 - 2020/11/19 **(NEW!)** Add Multi-GPU gradient accumulator. See [here](https://github.com/TensorSpeech/TensorFlowTTS/pull/377)
 - 2020/08/23  Add Parallel WaveGAN tensorflow implementation. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/parallel_wavegan)
 - 2020/08/23 Add MBMelGAN G + ParallelWaveGAN G example. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/multiband_pwgan)
@@ -85,6 +86,7 @@ TensorFlowTTS currently  provides the following architectures:
 4. **Multi-band MelGAN** released with the paper [Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech](https://arxiv.org/abs/2005.05106) by Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie.
 5. **FastSpeech2** released with the paper [FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558) by Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu.
 6. **Parallel WaveGAN** released with the paper [Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480) by Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim.
+7. **HiFi-GAN** released with the paper [HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646) by Jungil Kong, Jaehyeon Kim, Jaekyoung Bae.
 
 We are also implementing some techniques to improve quality and convergence speed from the following papers:
 
@@ -217,6 +219,7 @@ To know how to train model from scratch or fine-tune with other datasets/languag
 - For Multiband-MelGAN tutorial, pls see [examples/multiband_melgan](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/multiband_melgan)
 - For Parallel WaveGAN tutorial, pls see [examples/parallel_wavegan](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/parallel_wavegan)
 - For Multiband-MelGAN Generator + Parallel WaveGAN Discriminator tutorial, pls see [examples/multiband_pwgan](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/multiband_pwgan)
+- For HiFi-GAN tutorial, pls see [examples/hifigan](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/hifigan)
 # Abstract Class Explaination
 
 ## Abstract DataLoader Tensorflow-based dataset

diff --git a/examples/hifigan/README.md b/examples/hifigan/README.md
@@ -0,0 +1,65 @@
+# HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
+Based on the script [`train_hifigan.py`](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/hifigan/train_hifigan.py).
+
+## Training HiFi-GAN from scratch with LJSpeech dataset.
+This example shows how to train HiFi-GAN from scratch with TensorFlow 2 using a custom training loop and `tf.function`. The data used for this example is LJSpeech, which you can download [here](https://keithito.com/LJ-Speech-Dataset/).
+
+### Step 1: Create Tensorflow based Dataloader (tf.dataset)
+First, you need to define a data loader based on the AbstractDataset class (see [`abstract_dataset.py`](https://github.com/tensorspeech/TensorFlowTTS/tree/master/tensorflow_tts/datasets/abstract_dataset.py)). In this example, the dataloader reads the dataset from a path, and I use file suffixes to tell audio files from mel-spectrogram files (see [`audio_mel_dataset.py`](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/melgan/audio_mel_dataset.py)). If you already have a preprocessed version of your target dataset, you don't need to use this example dataloader; just refer to my dataloader and modify the **generator function** to fit your case. Normally, a generator function should return [audio, mel].
+
+### Step 2: Training from scratch
+After you re-define your dataloader, please modify the input arguments, train_dataset and valid_dataset in [`train_hifigan.py`](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/hifigan/train_hifigan.py). Here are example command lines for training HiFi-GAN from scratch:
+
+First, you need to train the generator with only the STFT loss:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python examples/hifigan/train_hifigan.py \
+  --train-dir ./dump/train/ \
+  --dev-dir ./dump/valid/ \
+  --outdir ./examples/hifigan/exp/train.hifigan.v1/ \
+  --config ./examples/hifigan/conf/hifigan.v1.yaml \
+  --use-norm 1 \
+  --generator_mixed_precision 1 \
+  --resume ""
+```
+
+Then resume from the checkpoint and start training the generator together with the discriminator:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python examples/hifigan/train_hifigan.py \
+  --train-dir ./dump/train/ \
+  --dev-dir ./dump/valid/ \
+  --outdir ./examples/hifigan/exp/train.hifigan.v1/ \
+  --config ./examples/hifigan/conf/hifigan.v1.yaml \
+  --use-norm 1 \
+  --resume ./examples/hifigan/exp/train.hifigan.v1/checkpoints/ckpt-100000
+```
+
+If you want to train on multiple GPUs, you can replace `CUDA_VISIBLE_DEVICES=0` with, for example, `CUDA_VISIBLE_DEVICES=0,1,2,3`. You also need to tune the `batch_size` per GPU (in the config file) yourself to maximize performance. Note that multi-GPU is currently supported for training but not yet for decoding.
+
+To resume training, follow the example command line below:
+
+```bash
+--resume ./examples/hifigan/exp/train.hifigan.v1/checkpoints/ckpt-100000
+```
+
+If you want to fine-tune a model, use `--pretrained` with the filename of the generator, like this:
+```bash
+--pretrained ptgenerator.h5
+```
+
+**IMPORTANT NOTES**:
+
+- When training the generator only, we enable mixed precision to speed up training.
+- We don't apply mixed precision when training both the generator and the discriminator. (The discriminator includes group convolutions, which make it slower when mixed precision is enabled.)
+- The 100k in `ckpt-100000` is the *discriminator_train_start_steps* parameter from [hifigan.v1.yaml](https://github.com/tensorspeech/TensorflowTTS/tree/master/examples/hifigan/conf/hifigan.v1.yaml)
+
+
+## Reference
+
+1. https://github.com/descriptinc/melgan-neurips
+2. https://github.com/kan-bayashi/ParallelWaveGAN
+3. https://github.com/tensorflow/addons
+4. [HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646)
+5. [MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis](https://arxiv.org/abs/1910.06711)
+6. [Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480)
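Step 1 of the README above says the dataloader uses file suffixes to tell audio from mel-spectrogram files, and that the generator function should return [audio, mel]. A minimal Python sketch of that pairing idea follows. The suffixes, the paths and the helper name `pair_audio_and_mel` are assumptions made for illustration; they are not taken from `audio_mel_dataset.py`.

```python
import os

# Hypothetical suffixes; adjust them to match your own preprocessing output.
AUDIO_SUFFIX = "-wave.npy"
MEL_SUFFIX = "-raw-feats.npy"


def pair_audio_and_mel(filenames, audio_suffix=AUDIO_SUFFIX, mel_suffix=MEL_SUFFIX):
    """Group file names by utterance id and keep only ids that have both files."""
    audio, mel = {}, {}
    for name in filenames:
        base = os.path.basename(name)
        if base.endswith(audio_suffix):
            audio[base[: -len(audio_suffix)]] = name
        elif base.endswith(mel_suffix):
            mel[base[: -len(mel_suffix)]] = name
    # Sort the ids so the pairing order is deterministic.
    return [(utt_id, audio[utt_id], mel[utt_id]) for utt_id in sorted(audio) if utt_id in mel]


# Hypothetical file listing, for illustration only.
files = [
    "dump/train/wavs/LJ001-0001-wave.npy",
    "dump/train/raw-feats/LJ001-0001-raw-feats.npy",
    "dump/train/wavs/LJ001-0002-wave.npy",  # no matching mel, so it is skipped
]
for utt_id, audio_path, mel_path in pair_audio_and_mel(files):
    print(utt_id, audio_path, mel_path)
```

A real generator function would then load each pair of files and yield the audio and mel arrays in that order.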
diff --git a/examples/hifigan/conf/hifigan.v1.yaml b/examples/hifigan/conf/hifigan.v1.yaml
@@ -0,0 +1,116 @@
+
+# This is the hyperparameter configuration file for Hifigan.
+# Please make sure this is adjusted for the LJSpeech dataset. If you want to
+# apply it to another dataset, you might need to carefully change some parameters.
+# This configuration performs 4000k iters.
+
+###########################################################
+#                FEATURE EXTRACTION SETTING               #
+###########################################################
+sampling_rate: 22050     # Sampling rate of dataset.
+hop_size: 256            # Hop size.
+format: "npy"
+
+
+###########################################################
+#         GENERATOR NETWORK ARCHITECTURE SETTING          #
+###########################################################
+model_type: "hifigan_generator"
+
+hifigan_generator_params:
+    out_channels: 1
+    kernel_size: 7
+    filters: 512
+    use_bias: true
+    upsample_scales: [8, 8, 2, 2]
+    stacks: 3
+    stack_kernel_size: [3, 7, 11]
+    stack_dilation_rate: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
+    use_final_nolinear_activation: true
+    is_weight_norm: false
+
+###########################################################
+#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
+###########################################################
+hifigan_discriminator_params:
+    out_channels: 1                     # Number of output channels (number of subbands).
+    period_scales: [2, 3, 5, 7, 11]     # List of period scales.
+    n_layers: 5                         # Number of layer of each period discriminator.
+    kernel_size: 5                      # Kernel size.
+    strides: 3                          # Strides
+    filters: 8                          # In Conv filters of each period discriminator
+    filter_scales: 4                    # Filter scales.
+    max_filters: 1024                   # maximum filters of period discriminator's conv.
+    is_weight_norm: false               # Use weight-norm or not.
+
+melgan_discriminator_params:
+    out_channels: 1                          # Number of output channels.
+    scales: 3                                # Number of multi-scales.
+    downsample_pooling: "AveragePooling1D"   # Pooling type for the input downsampling.
+    downsample_pooling_params:               # Parameters of the above pooling function.
+        pool_size: 4
+        strides: 2
+    kernel_sizes: [5, 3]              # List of kernel size.
+    filters: 16                       # Number of channels of the initial conv layer.
+    max_downsample_filters: 1024      # Maximum number of channels of downsampling layers.
+    downsample_scales: [4, 4, 4, 4]   # List of downsampling scales.
+    nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
+    nonlinear_activation_params:      # Parameters of nonlinear activation function.
+        alpha: 0.2
+    is_weight_norm: false             # Use weight-norm or not.
+
+###########################################################
+#                   STFT LOSS SETTING                     #
+###########################################################
+stft_loss_params:
+    fft_lengths: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
+    frame_steps: [120, 240, 50]     # List of hop size for STFT-based loss
+    frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
+
+###########################################################
+#               ADVERSARIAL LOSS SETTING                  #
+###########################################################
+lambda_feat_match: 10.0
+lambda_adv: 4.0
+
+###########################################################
+#                  DATA LOADER SETTING                    #
+###########################################################
+batch_size: 16                 # Batch size for each GPU with assuming that gradient_accumulation_steps == 1.
+batch_max_steps: 8192          # Length of each audio in batch for training. Make sure it is divisible by hop_size.
+batch_max_steps_valid: 81920   # Length of each audio for validation. Make sure it is divisible by hop_size.
+remove_short_samples: true     # Whether to remove samples the length of which are less than batch_max_steps.
+allow_cache: true              # Whether to allow cache in dataset. If true, it requires cpu memory.
+is_shuffle: true               # shuffle dataset after each epoch.
+
+###########################################################
+#             OPTIMIZER & SCHEDULER SETTING               #
+###########################################################
+generator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params: 
+        boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
+        values: [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
+    amsgrad: false
+
+discriminator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params: 
+        boundaries: [100000, 200000, 300000, 400000, 500000]
+        values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
+    amsgrad: false
+
+gradient_accumulation_steps: 1  # should be even number or 1.
+###########################################################
+#                    INTERVAL SETTING                     #
+###########################################################
+discriminator_train_start_steps: 100000  # steps begin training discriminator
+train_max_steps: 4000000                 # Number of training steps.
+save_interval_steps: 20000               # Interval steps to save checkpoint.
+eval_interval_steps: 5000                # Interval steps to evaluate the network.
+log_interval_steps: 200                  # Interval steps to record the training log.
+
+###########################################################
+#                     OTHER SETTING                       #
+###########################################################
+num_save_intermediate_results: 1  # Number of batch to be saved as intermediate results.
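The comments in `hifigan.v1.yaml` above ask that `batch_max_steps` be divisible by `hop_size`. For a mel-conditioned vocoder, the product of `upsample_scales` also has to equal `hop_size`, so that each mel frame expands to exactly one hop of audio samples. A small Python sketch of both checks follows. The flat dictionary and the helper name `check_vocoder_config` are assumptions made for illustration; in the YAML file `upsample_scales` is nested under `hifigan_generator_params`.

```python
from functools import reduce
from operator import mul

# Values copied from the config above, flattened into one dict for brevity.
config = {
    "hop_size": 256,
    "upsample_scales": [8, 8, 2, 2],
    "batch_max_steps": 8192,
    "batch_max_steps_valid": 81920,
}


def check_vocoder_config(cfg):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    total_upsampling = reduce(mul, cfg["upsample_scales"], 1)
    if total_upsampling != cfg["hop_size"]:
        problems.append(
            "product of upsample_scales is %d, but hop_size is %d"
            % (total_upsampling, cfg["hop_size"])
        )
    for key in ("batch_max_steps", "batch_max_steps_valid"):
        if cfg[key] % cfg["hop_size"] != 0:
            problems.append("%s is not divisible by hop_size" % key)
    return problems


problems = check_vocoder_config(config)
print(problems if problems else "config is consistent")
```

Running a check like this before a long training job is cheap, and it catches the most common mistake when adapting the config to a dataset with a different hop size.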
diff --git a/examples/hifigan/conf/hifigan.v2.yaml b/examples/hifigan/conf/hifigan.v2.yaml
@@ -0,0 +1,116 @@
+
+# This is the hyperparameter configuration file for Hifigan.
+# Please make sure this is adjusted for the LJSpeech dataset. If you want to
+# apply it to another dataset, you might need to carefully change some parameters.
+# This configuration performs 4000k iters.
+
+###########################################################
+#                FEATURE EXTRACTION SETTING               #
+###########################################################
+sampling_rate: 22050     # Sampling rate of dataset.
+hop_size: 256            # Hop size.
+format: "npy"
+
+
+###########################################################
+#         GENERATOR NETWORK ARCHITECTURE SETTING          #
+###########################################################
+model_type: "hifigan_generator"
+
+hifigan_generator_params:
+    out_channels: 1
+    kernel_size: 7
+    filters: 128
+    use_bias: true
+    upsample_scales: [8, 8, 2, 2]
+    stacks: 3
+    stack_kernel_size: [3, 7, 11]
+    stack_dilation_rate: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
+    use_final_nolinear_activation: true
+    is_weight_norm: false
+
+###########################################################
+#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
+###########################################################
+hifigan_discriminator_params:
+    out_channels: 1                     # Number of output channels (number of subbands).
+    period_scales: [2, 3, 5, 7, 11]     # List of period scales.
+    n_layers: 5                         # Number of layer of each period discriminator.
+    kernel_size: 5                      # Kernel size.
+    strides: 3                          # Strides
+    filters: 8                          # In Conv filters of each period discriminator
+    filter_scales: 4                    # Filter scales.
+    max_filters: 512                    # maximum filters of period discriminator's conv.
+    is_weight_norm: false               # Use weight-norm or not.
+
+melgan_discriminator_params:
+    out_channels: 1                          # Number of output channels.
+    scales: 3                                # Number of multi-scales.
+    downsample_pooling: "AveragePooling1D"   # Pooling type for the input downsampling.
+    downsample_pooling_params:               # Parameters of the above pooling function.
+        pool_size: 4
+        strides: 2
+    kernel_sizes: [5, 3]              # List of kernel size.
+    filters: 16                       # Number of channels of the initial conv layer.
+    max_downsample_filters: 512       # Maximum number of channels of downsampling layers.
+    downsample_scales: [4, 4, 4, 4]   # List of downsampling scales.
+    nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
+    nonlinear_activation_params:      # Parameters of nonlinear activation function.
+        alpha: 0.2
+    is_weight_norm: false             # Use weight-norm or not.
+
+###########################################################
+#                   STFT LOSS SETTING                     #
+###########################################################
+stft_loss_params:
+    fft_lengths: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
+    frame_steps: [120, 240, 50]     # List of hop size for STFT-based loss
+    frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
+
+###########################################################
+#               ADVERSARIAL LOSS SETTING                  #
+###########################################################
+lambda_feat_match: 10.0
+lambda_adv: 4.0
+
+###########################################################
+#                  DATA LOADER SETTING                    #
+###########################################################
+batch_size: 16                 # Batch size for each GPU with assuming that gradient_accumulation_steps == 1.
+batch_max_steps: 8192          # Length of each audio in batch for training. Make sure it is divisible by hop_size.
+batch_max_steps_valid: 81920   # Length of each audio for validation. Make sure it is divisible by hop_size.
+remove_short_samples: true     # Whether to remove samples the length of which are less than batch_max_steps.
+allow_cache: true              # Whether to allow cache in dataset. If true, it requires cpu memory.
+is_shuffle: true               # shuffle dataset after each epoch.
+
+###########################################################
+#             OPTIMIZER & SCHEDULER SETTING               #
+###########################################################
+generator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params: 
+        boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
+        values: [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
+    amsgrad: false
+
+discriminator_optimizer_params:
+    lr_fn: "PiecewiseConstantDecay"
+    lr_params: 
+        boundaries: [100000, 200000, 300000, 400000, 500000]
+        values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
+    amsgrad: false
+
+gradient_accumulation_steps: 1  # should be even number or 1.
+###########################################################
+#                    INTERVAL SETTING                     #
+###########################################################
+discriminator_train_start_steps: 100000  # steps begin training discriminator
+train_max_steps: 4000000                 # Number of training steps.
+save_interval_steps: 20000               # Interval steps to save checkpoint.
+eval_interval_steps: 5000                # Interval steps to evaluate the network.
+log_interval_steps: 200                  # Interval steps to record the training log.
+
+###########################################################
+#                     OTHER SETTING                       #
+###########################################################
+num_save_intermediate_results: 1  # Number of batch to be saved as intermediate results.
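Both configs use `PiecewiseConstantDecay` with one more entry in `values` than in `boundaries`. The sketch below is a pure-Python lookup that mirrors how `tf.keras.optimizers.schedules.PiecewiseConstantDecay` selects a value: `values[0]` applies while `step <= boundaries[0]`, and each later value applies once the step exceeds the previous boundary. The helper name `learning_rate_at` is an assumption made for illustration, and the code does not call TensorFlow.

```python
from bisect import bisect_left

# Schedules copied from the optimizer sections of the configs above.
generator_schedule = {
    "boundaries": [100000, 200000, 300000, 400000, 500000, 600000, 700000],
    "values": [0.0005, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001],
}
discriminator_schedule = {
    "boundaries": [100000, 200000, 300000, 400000, 500000],
    "values": [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001],
}


def learning_rate_at(step, schedule):
    """Look up the piecewise-constant learning rate for an optimizer step."""
    boundaries, values = schedule["boundaries"], schedule["values"]
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more entry than boundaries")
    # bisect_left keeps a step equal to a boundary in the earlier segment.
    return values[bisect_left(boundaries, step)]


for step in (0, 100000, 200001, 700001):
    print(step, learning_rate_at(step, generator_schedule))
```

The length check is the useful part when editing these lists by hand, since a schedule with mismatched `boundaries` and `values` is easy to produce while tuning.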