
convert to pkg, reorganize repo #228

Merged: 36 commits, Oct 23, 2024
Commits:

- 7c4034f group files in f5_tts directory (rsxdalv, Oct 16, 2024)
- ddd2b8a add setup.py (rsxdalv, Oct 16, 2024)
- 963d066 use global imports (rsxdalv, Oct 17, 2024)
- 943337c simplify demo (rsxdalv, Oct 17, 2024)
- a7df999 add install directions for library mode (rsxdalv, Oct 17, 2024)
- 37a6309 fix old huggingface_hub version constraint (rsxdalv, Oct 17, 2024)
- 2c387aa Merge remote-tracking branch 'upstream/main' (rsxdalv, Oct 17, 2024)
- 8d47002 move finetune to package (rsxdalv, Oct 17, 2024)
- 50e9d09 change imports to f5_tts.model (rsxdalv, Oct 17, 2024)
- 28a22f2 bump version (rsxdalv, Oct 17, 2024)
- 8e813b1 Merge remote-tracking branch 'upstream/main' (rsxdalv, Oct 20, 2024)
- 06abdd6 fix bad merge (rsxdalv, Oct 20, 2024)
- bf01663 Update inference-cli.py (SWivid, Oct 21, 2024)
- 4e8d724 Merge remote-tracking branch 'upstream/main' (rsxdalv, Oct 22, 2024)
- 1b1e183 fix HF space (rsxdalv, Oct 22, 2024)
- 8f0aeca reformat (rsxdalv, Oct 22, 2024)
- 01e57a4 fix utils.py vocab.txt import (rsxdalv, Oct 22, 2024)
- f6bc097 fix format (rsxdalv, Oct 22, 2024)
- 156ab15 Merge remote-tracking branch 'upstream/main' (rsxdalv, Oct 22, 2024)
- e44db1a adapt README for f5_tts package structure (rsxdalv, Oct 22, 2024)
- 285dd08 simplify app.py (rsxdalv, Oct 22, 2024)
- 1949837 add gradio.Dockerfile and workflow (rsxdalv, Oct 22, 2024)
- f5b5c1f refactored for pyproject.toml (ajkessel, Oct 22, 2024)
- c3fe551 refactored for pyproject.toml (ajkessel, Oct 22, 2024)
- 1c627ed added in reference to packaged files (ajkessel, Oct 22, 2024)
- 24545b5 use fork for testing docker image (rsxdalv, Oct 22, 2024)
- 5af1885 added in reference to packaged files (ajkessel, Oct 22, 2024)
- cc2af83 minor tweaks (ajkessel, Oct 22, 2024)
- 330d114 fixed inference-cli.toml path (ajkessel, Oct 22, 2024)
- a7b239f fixed inference-cli.toml path (ajkessel, Oct 22, 2024)
- 4dde328 fixed inference-cli.toml path (ajkessel, Oct 22, 2024)
- ebb9d5c fixed inference-cli.toml path (ajkessel, Oct 22, 2024)
- 2565a6d refactor eval_infer_batch.py (ajkessel, Oct 22, 2024)
- 61c47ad fix typo (ajkessel, Oct 22, 2024)
- 1dae0a6 added eval_infer_batch to scripts (ajkessel, Oct 22, 2024)
- 7427e0b Merge pull request #1 from ajkessel/main (rsxdalv, Oct 22, 2024)
61 changes: 61 additions & 0 deletions .github/workflows/publish-docker-image.yaml
@@ -0,0 +1,61 @@
name: Create and publish a Docker image

# Configures this workflow to run every time a change is pushed to the `main` branch.
on:
push:
branches: ['main']

# Defines two custom environment variables for the workflow. These are used for the Container registry domain, and a name for the Docker image that this workflow builds.
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}

# There is a single job in this workflow. It's configured to run on the latest available version of Ubuntu.
jobs:
build-and-push-image:
runs-on: ubuntu-latest
# Sets the permissions granted to the `GITHUB_TOKEN` for the actions in this job.
permissions:
contents: read
packages: write
#
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Free Up GitHub Actions Ubuntu Runner Disk Space 🔧
uses: jlumbroso/free-disk-space@main
with:
# If set to "true", this might remove tools that are actually needed, but it frees about 6 GB
tool-cache: false

# All of these default to true, but feel free to set to "false" if necessary for your workflow
android: true
dotnet: true
haskell: true
large-packages: false
swap-storage: false
docker-images: false
# Uses the `docker/login-action` action to log in to the Container registry using the account and password that will publish the packages. Once published, the packages are scoped to the account defined here.
- name: Log in to the Container registry
uses: docker/login-action@65b78e6e13532edd9afa3aa52ac7964289d1a9c1
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
# This step uses [docker/metadata-action](https://github.com/docker/metadata-action#about) to extract tags and labels that will be applied to the specified image. The `id` "meta" allows the output of this step to be referenced in a subsequent step. The `images` value provides the base name for the tags and labels.
- name: Extract metadata (tags, labels) for Docker
id: meta
uses: docker/metadata-action@9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
# This step uses the `docker/build-push-action` action to build the image, based on your repository's `Dockerfile`. If the build succeeds, it pushes the image to GitHub Packages.
# It uses the `context` parameter to define the build's context as the set of files located in the specified path. For more information, see "[Usage](https://github.com/docker/build-push-action#usage)" in the README of the `docker/build-push-action` repository.
# It uses the `tags` and `labels` parameters to tag and label the image with the output from the "meta" step.
- name: Build and push Docker image
uses: docker/build-push-action@f2a1d5e99d037542a71f64918e516c093c6f3fc4
with:
context: .
file: ./gradio.Dockerfile
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
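Once this workflow has run, the image is published to GHCR under the repository name. As a rough sketch of how a consumer would use it (the exact tag names are assumptions; real tags come from the `metadata-action` step's output):

```shell
# Pull the published image from GHCR (the ":main" tag is an assumption
# based on the branch-derived tags metadata-action typically emits)
docker pull ghcr.io/swivid/f5-tts:main

# Run it, forwarding the Gradio port
docker run --rm --gpus all -p 7860:7860 ghcr.io/swivid/f5-tts:main
```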
48 changes: 42 additions & 6 deletions README.md
@@ -63,11 +63,35 @@ pre-commit run --all-files
Note: Some model components have linting exceptions for E722 to accommodate tensor notation


### As a pip package

```bash
pip install git+https://github.com/SWivid/F5-TTS.git
```

```python
import gradio as gr
from f5_tts.gradio_app import app

with gr.Blocks() as main_app:
gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")

# ... other Gradio components

app.render()

main_app.launch()

```

## Prepare Dataset

- Example data processing scripts for Emilia and Wenetspeech4TTS, and you may tailor your own one along with a Dataset class in `model/dataset.py`.
+ Example data processing scripts for Emilia and Wenetspeech4TTS, and you may tailor your own one along with a Dataset class in `f5_tts/model/dataset.py`.

```bash
# switch to the main directory
cd f5_tts

# prepare a custom dataset to suit your needs
# download the corresponding dataset first, and fill in the path in the scripts

@@ -83,14 +107,17 @@ python scripts/prepare_wenetspeech4tts.py
Once your datasets are prepared, you can start the training process.

```bash
# switch to the main directory
cd f5_tts

# setup accelerate config, e.g. use multi-gpu ddp, fp16
# the config will be saved to: ~/.cache/huggingface/accelerate/default_config.yaml
accelerate config
accelerate launch train.py
```
Initial guidance on finetuning: [#57](https://github.com/SWivid/F5-TTS/discussions/57).

- Gradio UI finetuning with `finetune_gradio.py` see [#143](https://github.com/SWivid/F5-TTS/discussions/143).
+ Gradio UI finetuning with `f5_tts/finetune_gradio.py` see [#143](https://github.com/SWivid/F5-TTS/discussions/143).

### Wandb Logging

@@ -136,6 +163,9 @@
To change the model, use `--ckpt_file` to specify the checkpoint you want to load;
to change the vocab, use `--vocab_file` to provide your own vocab.txt file.

```bash
# switch to the main directory
cd f5_tts

python inference-cli.py \
--model "F5-TTS" \
--ref_audio "tests/ref_audio/test_en_1_ref_short.wav" \
@@ -161,27 +191,27 @@ Currently supported features:
You can launch a Gradio app (web interface) for inference (it will load the checkpoint from Huggingface; you may also use a local file in `gradio_app.py`). It currently loads the ASR model, F5-TTS, and E2 TTS all at once, and thus uses more GPU memory than `inference-cli`.

```bash
- python gradio_app.py
+ python f5_tts/gradio_app.py
```

You can specify the port/host:

```bash
- python gradio_app.py --port 7860 --host 0.0.0.0
+ python f5_tts/gradio_app.py --port 7860 --host 0.0.0.0
```

Or launch a share link:

```bash
- python gradio_app.py --share
+ python f5_tts/gradio_app.py --share
```

### Speech Editing

To test speech editing capabilities, use the following command.

```bash
- python speech_edit.py
+ python f5_tts/speech_edit.py
```

## Evaluation
Expand All @@ -199,6 +229,9 @@ python speech_edit.py
To run batch inference for evaluations, execute the following commands:

```bash
# switch to the main directory
cd f5_tts

# batch inference for evaluations
accelerate config # if not set before
bash scripts/eval_infer_batch.sh
@@ -234,6 +267,9 @@ pip install faster-whisper==0.10.1

Update the path with your batch-inferenced results, and carry out WER / SIM evaluations:
```bash
# switch to the main directory
cd f5_tts

# Evaluation for Seed-TTS test set
python scripts/eval_seedtts_testset.py

3 changes: 3 additions & 0 deletions app.py
@@ -0,0 +1,3 @@
from f5_tts.gradio_app import app

app.queue().launch()
27 changes: 27 additions & 0 deletions gradio.Dockerfile
@@ -0,0 +1,27 @@
FROM pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel

USER root

ARG DEBIAN_FRONTEND=noninteractive

LABEL github_repo="https://github.com/rsxdalv/F5-TTS"

RUN set -x \
&& apt-get update \
&& apt-get -y install wget curl man git less openssl libssl-dev unzip unar build-essential aria2 tmux vim \
&& apt-get install -y openssh-server sox libsox-fmt-all libsox-fmt-mp3 libsndfile1-dev ffmpeg \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean

WORKDIR /workspace

RUN git clone https://github.com/rsxdalv/F5-TTS.git \
&& cd F5-TTS \
&& pip install --no-cache-dir -r requirements.txt

ENV SHELL=/bin/bash

WORKDIR /workspace/F5-TTS/f5_tts

EXPOSE 7860
CMD python gradio_app.py
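For local testing before the CI workflow publishes anything, the image can be built and run straight from this file; the `-t` tag name below is illustrative, not part of the PR:

```shell
# Build the Gradio image from the repository root using the named Dockerfile
docker build -f gradio.Dockerfile -t f5-tts-gradio .

# Run it, exposing the port declared by EXPOSE in the Dockerfile
docker run --rm --gpus all -p 7860:7860 f5-tts-gradio
```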
10 changes: 0 additions & 10 deletions model/__init__.py

This file was deleted.

52 changes: 52 additions & 0 deletions pyproject.toml
@@ -0,0 +1,52 @@
[build-system]
requires = ["setuptools >= 61.0", "setuptools-scm>=8.0"]
build-backend = "setuptools.build_meta"

[project]
name = "f5-tts"
dynamic = ["version"]
description = "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
readme = "README.md"
classifiers = [
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3",
]
dependencies = [
"accelerate>=0.33.0",
"cached_path @ git+https://github.com/rsxdalv/cached_path@main",
"click",
"datasets",
"einops>=0.8.0",
"einx>=0.3.0",
"ema_pytorch>=0.5.2",
"gradio",
"jieba",
"librosa",
"matplotlib",
"numpy<=1.26.4",
"pydub",
"pypinyin",
"safetensors",
"soundfile",
"tomli",
"torch>=2.0.0",
"torchaudio>=2.0.0",
"torchdiffeq",
"tqdm>=4.65.0",
"transformers",
"vocos",
"wandb",
"x_transformers>=1.31.14",
]

[[project.authors]]
name = "Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen"

[project.urls]
Homepage = "https://github.com/SWivid/F5-TTS"

[project.scripts]
"finetune-cli" = "f5_tts.finetune_cli:main"
"inference-cli" = "f5_tts.inference_cli:main"
"eval_infer_batch" = "f5_tts.scripts.eval_infer_batch:main"