Spaces · Commit e88a846 · Parent(s): 563f033 · Status: Runtime error

Add FantasyTalking Hugging Face Space demo with complete deployment guide

Files changed:
- .gitignore +89 -0
- DEPLOYMENT.md +161 -0
- README.md +87 -6
- app.py +219 -0
- assets/README.md +7 -0
- deploy.py +129 -0
- infer.py +168 -0
- model.py +99 -0
- requirements.txt +15 -0
- utils.py +70 -0
.gitignore ADDED
@@ -0,0 +1,89 @@
# Gitignore for FantasyTalking project

# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyTorch
*.pth
*.pt
*.ckpt
*.safetensors

# Model files
models/
*.bin
*.h5

# Output
output/
results/
*.mp4
*.avi
*.mov
*.mkv

# Jupyter Notebook
.ipynb_checkpoints

# Environment
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# IDE
.vscode/
.idea/
*.swp
*.swo

# OS
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Logs
*.log
logs/

# Cache
.cache/
.huggingface/

# Gradio
gradio_cached_examples/
flagged/

# Large files
*.zip
*.tar.gz
*.tar.bz2
DEPLOYMENT.md ADDED
@@ -0,0 +1,161 @@
# FantasyTalking Hugging Face Space Deployment Guide

This project is the Hugging Face Space demo version of FantasyTalking. Because the models are very large (40GB+) and require substantial GPU memory, the online Space mainly showcases the interface; full functionality requires local deployment.

## 🚀 Deploying on Hugging Face Spaces

### Method 1: Copy the project directly

1. Log in to [Hugging Face](https://huggingface.co/)
2. Create a new Space: https://huggingface.co/new-space
3. Select the Gradio SDK
4. Upload all files of this project to the Space

### Method 2: Import from GitHub

1. Fork the original repository: https://github.com/Fantasy-AMAP/fantasy-talking
2. When creating the Space on Hugging Face, choose "Import from GitHub"
3. Enter the URL of your GitHub repository

### Space configuration

Make sure the Space's README.md contains the following configuration:

```yaml
---
title: FantasyTalking Demo
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
license: apache-2.0
---
```

## 💻 Full Local Deployment

### Requirements

- **OS**: Linux/Windows/macOS
- **Python**: 3.8+
- **GPU**: NVIDIA GPU with CUDA
- **VRAM**: at least 5GB (20GB+ recommended)
- **Storage**: 50GB+ free space
- **RAM**: 16GB+

### Quick deployment

```bash
# 1. Clone the repository
git clone https://github.com/Fantasy-AMAP/fantasy-talking.git
cd fantasy-talking

# 2. Automatic deployment (recommended)
python deploy.py

# 3. Manual deployment
# Install dependencies
pip install -r requirements.txt
pip install flash_attn  # Optional, requires CUDA

# Download models
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./models/Wan2.1-I2V-14B-720P
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./models/wav2vec2-base-960h
huggingface-cli download acvlab/FantasyTalking fantasytalking_model.ckpt --local-dir ./models

# Run the app
python app.py
```

### Docker deployment

```dockerfile
FROM nvidia/cuda:11.8.0-devel-ubuntu20.04

RUN apt-get update && apt-get install -y \
    python3 python3-pip git ffmpeg \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install -r requirements.txt

COPY . .
CMD ["python3", "app.py"]
```

## 🔧 Configuration Options

### Memory optimization

Adjust the `num_persistent_param_in_dit` parameter to match your GPU memory:

- **40GB+ VRAM**: `None` (no limit, fastest)
- **20GB VRAM**: `7000000000` (7B parameters)
- **5GB VRAM**: `0` (lowest memory use, slower)

A sketch of picking this value automatically follows this guide.

### Model precision

- `torch.bfloat16`: recommended; balances speed and quality
- `torch.float16`: faster, may reduce quality
- `torch.float32`: highest quality, needs more memory

## 📊 Performance Reference

| Configuration | GPU | VRAM | Generation time (81 frames) |
|---------------|-----|------|-----------------------------|
| Highest quality | A100 | 40GB | 15.5s/it |
| Balanced | RTX 4090 | 20GB | 32.8s/it |
| Memory saving | RTX 3060 | 5GB | 42.6s/it |

## 🛠 Troubleshooting

### Common issues

1. **CUDA out of memory**
   ```bash
   # Set the allocator environment variable
   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
   ```

2. **Model download failures**
   ```bash
   # Use a mirror
   export HF_ENDPOINT=https://hf-mirror.com
   ```

3. **Dependency conflicts**
   ```bash
   # Use a virtual environment
   python -m venv fantasy_talking
   source fantasy_talking/bin/activate  # Linux/Mac
   # fantasy_talking\Scripts\activate  # Windows
   ```

### Logging and debugging

```bash
# Enable verbose logging
export PYTHONPATH=.
export CUDA_LAUNCH_BLOCKING=1
python app.py --debug
```

## 🌐 Online Resources

- **Original repository**: https://github.com/Fantasy-AMAP/fantasy-talking
- **Paper**: https://arxiv.org/abs/2504.04842
- **Models**: https://huggingface.co/acvlab/FantasyTalking
- **Online demo**: https://huggingface.co/spaces/acvlab/FantasyTalking

## 📄 License

This project is licensed under the Apache-2.0 License. See the original repository for details.

## 🤝 Contributing

Issues and pull requests are welcome at the original repository:
https://github.com/Fantasy-AMAP/fantasy-talking
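To make the memory-optimization tiers above concrete, here is a minimal sketch that picks `num_persistent_param_in_dit` from the detected VRAM. The commented `enable_vram_management` call is an assumption based on DiffSynth's `WanVideoPipeline` API; verify it against the version you install.

```python
from typing import Optional

import torch


def pick_persistent_params() -> Optional[int]:
    """Map detected VRAM to the tiers from the deployment guide."""
    if not torch.cuda.is_available():
        return 0  # CPU fallback: keep nothing resident
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 40:
        return None  # no limit, fastest
    if vram_gb >= 20:
        return 7_000_000_000  # keep ~7B parameters resident
    return 0  # lowest memory use, slowest


# Assumed DiffSynth API; check the installed version:
# pipe.enable_vram_management(num_persistent_param_in_dit=pick_persistent_params())
```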
README.md CHANGED
@@ -1,13 +1,94 @@
---
title: FantasyTalking Demo
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
license: apache-2.0
short_description: Realistic Talking Portrait Generation via Coherent Motion Synthesis
---

# FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

This is a Hugging Face Space demo for the FantasyTalking project, which generates realistic talking portraits from a single image and audio input.

## 🔥 Features

- **Single Image Input**: Generate talking videos from just one portrait image
- **Audio-driven Animation**: Synchronize lip movements with input audio
- **High Quality Output**: 512x512 resolution with up to 81 frames
- **Controllable Generation**: Adjust prompt and audio guidance scales

## 📋 Requirements

Due to the large model size (~40GB+) and GPU memory requirements, this demo shows the interface but requires local deployment for full functionality.

### System Requirements
- NVIDIA GPU with at least 5GB VRAM (low-memory mode)
- 20GB+ VRAM recommended for optimal performance
- 50GB+ storage space for models

## 🚀 Local Deployment

To run FantasyTalking locally with full functionality:

```bash
# 1. Clone the repository
git clone https://github.com/Fantasy-AMAP/fantasy-talking.git
cd fantasy-talking

# 2. Install dependencies
pip install -r requirements.txt
pip install flash_attn  # Optional, for accelerated attention computation

# 3. Download models
# Base model (~20GB)
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./models/Wan2.1-I2V-14B-720P

# Audio encoder (~1GB)
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./models/wav2vec2-base-960h

# FantasyTalking weights (~2GB)
huggingface-cli download acvlab/FantasyTalking fantasytalking_model.ckpt --local-dir ./models

# 4. Run inference
python infer.py --image_path ./assets/images/woman.png --audio_path ./assets/audios/woman.wav

# 5. Start web interface
python app.py
```

## 🎯 Performance

Model performance on a single A100 (512x512, 81 frames):

| torch_dtype | num_persistent_param_in_dit | Speed | Required VRAM |
|-------------|-----------------------------|-------|---------------|
| torch.bfloat16 | None (unlimited) | 15.5s/it | 40G |
| torch.bfloat16 | 7×10⁹ (7B) | 32.8s/it | 20G |
| torch.bfloat16 | 0 | 42.6s/it | 5G |

## 📖 Citation

```bibtex
@article{wang2025fantasytalking,
  title={FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis},
  author={Wang, Mengchao and Wang, Qiang and Jiang, Fan and Fan, Yaqi and Zhang, Yunpeng and Qi, Yonggang and Zhao, Kun and Xu, Mu},
  journal={arXiv preprint arXiv:2504.04842},
  year={2025}
}
```

## 🔗 Links

- **Paper**: [arXiv:2504.04842](https://arxiv.org/abs/2504.04842)
- **Code**: [GitHub Repository](https://github.com/Fantasy-AMAP/fantasy-talking)
- **Models**: [Hugging Face](https://huggingface.co/acvlab/FantasyTalking)
- **Project Page**: [FantasyTalking](https://fantasy-amap.github.io/fantasy-talking/)

## 📄 License

This project is licensed under the Apache-2.0 License.
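To put the s/it figures in the performance table into context, a rough back-of-the-envelope estimate multiplies the per-iteration speed by the number of inference steps (infer.py defaults to 30); actual wall-clock time also includes model loading and VAE decoding.

```python
# Rough end-to-end generation time per 81-frame clip, from the table above.
steps = 30  # infer.py's default num_inference_steps
for label, sec_per_it in [
    ("40G, unlimited persistent params", 15.5),
    ("20G, 7B persistent params", 32.8),
    ("5G, minimal persistent params", 42.6),
]:
    print(f"{label}: ~{steps * sec_per_it / 60:.1f} min")
```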
app.py ADDED
@@ -0,0 +1,219 @@
# Copyright Alibaba Inc. All Rights Reserved.

import argparse
from pathlib import Path

import gradio as gr

# Running inside a Hugging Face Space, so the heavy imports are stubbed out.
# The full version additionally needs:
# from diffsynth import ModelManager, WanVideoPipeline
# from model import FantasyTalkingAudioConditionModel
# from utils import get_audio_features, resize_image_by_longest_edge, save_video

pipe, fantasytalking, wav2vec_processor, wav2vec = None, None, None, None


# Simplified inference function for the demo
def generate_video(
    image_path,
    audio_path,
    prompt,
    prompt_cfg_scale,
    audio_cfg_scale,
    audio_weight,
    image_size,
    max_num_frames,
    inference_steps,
    seed,
):
    """
    Simplified video-generation function for demonstration purposes.
    An actual deployment must load the full models first.
    """
    # Create the output directory
    output_dir = Path("./output")
    output_dir.mkdir(parents=True, exist_ok=True)

    # The real inference code goes here. For now, notify the user and return
    # None: returning a plain message string to a gr.Video output would error.
    gr.Info("The model is still being prepared; please wait for the full deployment.")
    return None


def create_args(
    image_path: str,
    audio_path: str,
    prompt: str,
    output_dir: str,
    audio_weight: float,
    prompt_cfg_scale: float,
    audio_cfg_scale: float,
    image_size: int,
    max_num_frames: int,
    inference_steps: int,
    seed: int,
) -> argparse.Namespace:
    """Build the argument namespace used by the inference pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--wan_model_dir", type=str, default="./models/Wan2.1-I2V-14B-720P")
    parser.add_argument("--fantasytalking_model_path", type=str, default="./models/fantasytalking_model.ckpt")
    parser.add_argument("--wav2vec_model_dir", type=str, default="./models/wav2vec2-base-960h")
    parser.add_argument("--image_path", type=str, default=image_path)
    parser.add_argument("--audio_path", type=str, default=audio_path)
    parser.add_argument("--prompt", type=str, default=prompt)
    parser.add_argument("--output_dir", type=str, default=output_dir)
    parser.add_argument("--image_size", type=int, default=image_size)
    parser.add_argument("--audio_scale", type=float, default=audio_weight)
    parser.add_argument("--prompt_cfg_scale", type=float, default=prompt_cfg_scale)
    parser.add_argument("--audio_cfg_scale", type=float, default=audio_cfg_scale)
    parser.add_argument("--max_num_frames", type=int, default=max_num_frames)
    parser.add_argument("--num_inference_steps", type=int, default=inference_steps)
    parser.add_argument("--seed", type=int, default=seed)
    parser.add_argument("--fps", type=int, default=24)
    parser.add_argument("--num_persistent_param_in_dit", type=int, default=7_000_000_000)

    return parser.parse_args([])


# Build the Gradio interface
with gr.Blocks(title="FantasyTalking Video Generation") as demo:
    gr.Markdown(
        """
        # FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

        <div align="center">
        <strong>Mengchao Wang<sup>1*</sup> Qiang Wang<sup>1*</sup> Fan Jiang<sup>1†</sup>
        Yaqi Fan<sup>2</sup> Yunpeng Zhang<sup>1,2</sup> Yonggang Qi<sup>2‡</sup>
        Kun Zhao<sup>1</sup> Mu Xu<sup>1</sup></strong>
        </div>

        <div align="center">
        <strong><sup>1</sup>AMAP, Alibaba Group&nbsp;&nbsp;<sup>2</sup>Beijing University of Posts and Telecommunications</strong>
        </div>

        <div style="display:flex;justify-content:center;column-gap:4px;">
        <a href="https://github.com/Fantasy-AMAP/fantasy-talking">
        <img src='https://img.shields.io/badge/GitHub-Repo-blue'>
        </a>
        <a href="https://arxiv.org/abs/2504.04842">
        <img src='https://img.shields.io/badge/ArXiv-Paper-red'>
        </a>
        </div>

        ## Note
        This demo version is still being prepared. Full functionality requires downloading large model files (about 40GB+).
        See the [GitHub repository](https://github.com/Fantasy-AMAP/fantasy-talking) for complete installation and usage instructions.
        """
    )

    with gr.Row():
        with gr.Column():
            image_input = gr.Image(label="Input Image", type="filepath")
            audio_input = gr.Audio(label="Input Audio", type="filepath")
            prompt_input = gr.Text(label="Input Prompt", value="A woman is talking.")

            with gr.Row():
                prompt_cfg_scale = gr.Slider(
                    minimum=1.0,
                    maximum=9.0,
                    value=5.0,
                    step=0.5,
                    label="Prompt CFG Scale",
                )
                audio_cfg_scale = gr.Slider(
                    minimum=1.0,
                    maximum=9.0,
                    value=5.0,
                    step=0.5,
                    label="Audio CFG Scale",
                )
                audio_weight = gr.Slider(
                    minimum=0.1,
                    maximum=3.0,
                    value=1.0,
                    step=0.1,
                    label="Audio Weight",
                )

            with gr.Row():
                image_size = gr.Number(
                    value=512, label="Max Width/Height", precision=0
                )
                max_num_frames = gr.Number(
                    value=81, label="Max Frames", precision=0
                )
                inference_steps = gr.Slider(
                    minimum=1, maximum=50, value=20, step=1, label="Inference Steps"
                )

            with gr.Row():
                seed = gr.Number(value=1247, label="Random Seed", precision=0)

            process_btn = gr.Button("Generate Video")

        with gr.Column():
            video_output = gr.Video(label="Output Video")

            gr.Markdown(
                """
                ## Usage

                1. **Upload an image**: choose a portrait photo
                2. **Upload audio**: choose the corresponding audio file
                3. **Set parameters**: adjust the generation parameters
                4. **Generate**: click the button to start generation

                ## Model Requirements

                - **Base model**: Wan2.1-I2V-14B-720P (~20GB)
                - **Audio encoder**: Wav2Vec2 (~1GB)
                - **FantasyTalking model**: dedicated weight file (~2GB)
                - **VRAM**: at least 5GB (in low-memory mode)

                ## Local Deployment

                ```bash
                # 1. Clone the repository
                git clone https://github.com/Fantasy-AMAP/fantasy-talking.git
                cd fantasy-talking

                # 2. Install dependencies
                pip install -r requirements.txt
                pip install flash_attn  # Optional, accelerates attention computation

                # 3. Download models
                huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./models/Wan2.1-I2V-14B-720P
                huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./models/wav2vec2-base-960h
                huggingface-cli download acvlab/FantasyTalking fantasytalking_model.ckpt --local-dir ./models

                # 4. Run inference
                python infer.py --image_path ./assets/images/woman.png --audio_path ./assets/audios/woman.wav

                # 5. Start the web interface
                python app.py
                ```
                """
            )

    process_btn.click(
        fn=generate_video,
        inputs=[
            image_input,
            audio_input,
            prompt_input,
            prompt_cfg_scale,
            audio_cfg_scale,
            audio_weight,
            image_size,
            max_num_frames,
            inference_steps,
            seed,
        ],
        outputs=video_output,
    )

if __name__ == "__main__":
    # share=True only matters when running locally; Spaces ignore it.
    demo.launch(inbrowser=True, share=True)
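Once the Space is live, the interface above can also be driven programmatically. A sketch with `gradio_client` follows; the Space id is hypothetical, and the `api_name` is assumed to be derived from the wired function name (`generate_video`) — check the Space's "Use via API" page for the exact endpoint.

```python
from gradio_client import Client, handle_file

client = Client("your-username/FantasyTalking-demo")  # hypothetical Space id
result = client.predict(
    handle_file("portrait.png"),  # image_input
    handle_file("speech.wav"),    # audio_input
    "A woman is talking.",        # prompt
    5.0,                          # prompt_cfg_scale
    5.0,                          # audio_cfg_scale
    1.0,                          # audio_weight
    512,                          # image_size
    81,                           # max_num_frames
    20,                           # inference_steps
    1247,                         # seed
    api_name="/generate_video",   # assumed endpoint name
)
print(result)  # path to the generated video once the full model is wired up
```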
assets/README.md ADDED
@@ -0,0 +1,7 @@
# Example image and audio files
# For an actual deployment, place example images and audio files in the corresponding directories

# assets/images/woman.png - example female portrait image
# assets/audios/woman.wav - example audio file

# If the example files are absent, users can upload their own image and audio through the interface
deploy.py ADDED
@@ -0,0 +1,129 @@
# FantasyTalking deployment script

import os
import subprocess
import sys
from pathlib import Path


def check_gpu():
    """Check GPU availability."""
    try:
        import torch

        if torch.cuda.is_available():
            gpu_count = torch.cuda.device_count()
            gpu_name = torch.cuda.get_device_name(0) if gpu_count > 0 else "Unknown"
            gpu_memory = torch.cuda.get_device_properties(0).total_memory // (1024**3) if gpu_count > 0 else 0

            print(f"✅ GPU available: {gpu_name}")
            print(f"✅ GPU memory: {gpu_memory}GB")

            if gpu_memory < 5:
                print("⚠️ Warning: GPU memory may be insufficient; at least 5GB VRAM is recommended")

            return True
        else:
            print("❌ No usable GPU detected")
            return False
    except ImportError:
        print("❌ PyTorch is not installed")
        return False


def install_dependencies():
    """Install dependencies."""
    print("📦 Installing dependencies...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])
    print("✅ Dependencies installed")


def download_models():
    """Download the models (requires huggingface-cli)."""
    print("📥 Starting model downloads...")

    models_dir = Path("./models")
    models_dir.mkdir(exist_ok=True)

    # Check that huggingface-cli is available
    try:
        subprocess.check_call(["huggingface-cli", "--help"], stdout=subprocess.DEVNULL)
    except (subprocess.CalledProcessError, FileNotFoundError):
        print("Installing huggingface_hub[cli]...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "huggingface_hub[cli]"])

    # Download the base and audio models
    models_to_download = [
        ("Wan-AI/Wan2.1-I2V-14B-720P", "./models/Wan2.1-I2V-14B-720P"),
        ("facebook/wav2vec2-base-960h", "./models/wav2vec2-base-960h"),
    ]

    for model_id, local_dir in models_to_download:
        print(f"Downloading {model_id}...")
        subprocess.check_call([
            "huggingface-cli", "download", model_id,
            "--local-dir", local_dir
        ])

    # Download the FantasyTalking weights
    print("Downloading FantasyTalking weights...")
    subprocess.check_call([
        "huggingface-cli", "download", "acvlab/FantasyTalking",
        "fantasytalking_model.ckpt", "--local-dir", "./models"
    ])

    print("✅ Model download complete")


def check_model_files():
    """Check whether the model files are present."""
    required_files = [
        "./models/Wan2.1-I2V-14B-720P",
        "./models/wav2vec2-base-960h",
        "./models/fantasytalking_model.ckpt",
    ]

    missing_files = []
    for file_path in required_files:
        if not os.path.exists(file_path):
            missing_files.append(file_path)

    if missing_files:
        print("❌ The following model files are missing:")
        for file in missing_files:
            print(f"  - {file}")
        return False
    else:
        print("✅ All model files are ready")
        return True


def start_app():
    """Start the application."""
    print("🚀 Starting the FantasyTalking app...")
    subprocess.check_call([sys.executable, "app.py"])


def main():
    """Entry point."""
    print("🎬 FantasyTalking automatic deployment script")
    print("=" * 50)

    # Check the GPU
    if not check_gpu():
        print("⚠️ Continuing in CPU mode (this will be very slow)")

    # Install dependencies
    install_dependencies()

    # Check the model files
    if not check_model_files():
        print("📥 Model files need to be downloaded...")
        download_models()

    print("✅ Deployment complete!")
    print("\nStarting the app...")
    start_app()


if __name__ == "__main__":
    main()
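As a side note, the same downloads can be performed without shelling out to `huggingface-cli`, using the `huggingface_hub` Python API directly. A sketch fetching the same three artifacts that `download_models()` retrieves above:

```python
from huggingface_hub import hf_hub_download, snapshot_download

# Full repositories (resumable, cache-aware, same as the CLI)
snapshot_download("Wan-AI/Wan2.1-I2V-14B-720P", local_dir="./models/Wan2.1-I2V-14B-720P")
snapshot_download("facebook/wav2vec2-base-960h", local_dir="./models/wav2vec2-base-960h")

# Single checkpoint file from the FantasyTalking repo
hf_hub_download("acvlab/FantasyTalking", "fantasytalking_model.ckpt", local_dir="./models")
```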
infer.py ADDED
@@ -0,0 +1,168 @@
# Copyright Alibaba Inc. All Rights Reserved.

import argparse
import os
from datetime import datetime

# The cv2/librosa/torch/PIL/transformers imports from the original are dropped:
# they are unused in demo mode, and opencv is not listed in requirements.txt.
# The full version additionally needs:
# from diffsynth import ModelManager, WanVideoPipeline
from model import FantasyTalkingAudioConditionModel
from utils import get_audio_features, resize_image_by_longest_edge, save_video


def parse_args():
    """Parse command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--wan_model_dir",
        type=str,
        default="./models/Wan2.1-I2V-14B-720P",
        help="Directory of the Wan I2V 14B model",
    )
    parser.add_argument(
        "--fantasytalking_model_path",
        type=str,
        default="./models/fantasytalking_model.ckpt",
        help="Path to the FantasyTalking model",
    )
    parser.add_argument(
        "--wav2vec_model_dir",
        type=str,
        default="./models/wav2vec2-base-960h",
        help="Directory of the Wav2Vec model",
    )
    parser.add_argument(
        "--image_path",
        type=str,
        default="./assets/images/woman.png",
        help="Input image path",
    )
    parser.add_argument(
        "--audio_path",
        type=str,
        default="./assets/audios/woman.wav",
        help="Input audio path",
    )
    parser.add_argument(
        "--prompt",
        type=str,
        default="A woman is talking.",
        help="Text prompt",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default="./output",
        help="Output directory",
    )
    parser.add_argument(
        "--image_size",
        type=int,
        default=512,
        help="Image size (longest edge)",
    )
    parser.add_argument(
        "--audio_scale",
        type=float,
        default=1.0,
        help="Audio-condition injection weight",
    )
    parser.add_argument(
        "--prompt_cfg_scale",
        type=float,
        default=5.0,
        help="Prompt CFG scale",
    )
    parser.add_argument(
        "--audio_cfg_scale",
        type=float,
        default=5.0,
        help="Audio CFG scale",
    )
    parser.add_argument(
        "--max_num_frames",
        type=int,
        default=81,
        help="Maximum number of frames",
    )
    parser.add_argument(
        "--num_inference_steps",
        type=int,
        default=30,
        help="Number of inference steps",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=1247,
        help="Random seed",
    )
    parser.add_argument(
        "--fps",
        type=int,
        default=24,
        help="Frames per second",
    )
    parser.add_argument(
        "--num_persistent_param_in_dit",
        type=int,
        default=7_000_000_000,
        help="Number of parameters kept resident in the DiT, for VRAM management",
    )

    return parser.parse_args()


def load_models(args):
    """Load the models."""
    print("Loading models...")

    # The full version loads the real models here:
    # model_manager = ModelManager(device="cpu")
    # model_manager.load_models([...])
    # pipe = WanVideoPipeline.from_model_manager(model_manager, torch_dtype=torch.bfloat16, device="cuda")

    # Simulated model loading
    pipe = None
    fantasytalking = None
    wav2vec_processor = None
    wav2vec = None

    print("Models loaded (demo mode)")
    return pipe, fantasytalking, wav2vec_processor, wav2vec


def main(args, pipe, fantasytalking, wav2vec_processor, wav2vec):
    """Main inference function."""
    print(f"Input image: {args.image_path}")
    print(f"Input audio: {args.audio_path}")
    print(f"Prompt: {args.prompt}")

    # Create the output directory
    os.makedirs(args.output_dir, exist_ok=True)

    # The full version runs the actual inference here
    print("Starting inference...")

    # Simulated output path
    current_time = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_path = f"{args.output_dir}/output_{current_time}.mp4"

    print(f"Output will be saved to: {output_path}")
    print("Inference finished (demo mode)")

    return output_path


if __name__ == "__main__":
    args = parse_args()
    pipe, fantasytalking, wav2vec_processor, wav2vec = load_models(args)
    main(args, pipe, fantasytalking, wav2vec_processor, wav2vec)
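Since `parse_args()` only reads `sys.argv`, infer.py can also be driven from Python by overriding fields on the returned namespace. A sketch (in the current demo mode this only prints the planned output path; example asset paths are the repo's defaults):

```python
from infer import load_models, main, parse_args

args = parse_args()  # uses defaults when the calling script has no CLI args
args.image_path = "./assets/images/woman.png"
args.audio_path = "./assets/audios/woman.wav"
args.prompt = "A woman is talking."

pipe, fantasytalking, wav2vec_processor, wav2vec = load_models(args)
output_path = main(args, pipe, fantasytalking, wav2vec_processor, wav2vec)
print(output_path)
```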
model.py ADDED
@@ -0,0 +1,99 @@
# Copyright Alibaba Inc. All Rights Reserved.

import os

import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioProjModel(nn.Module):
    """Audio projection model."""

    def __init__(self, audio_dim, proj_dim):
        super().__init__()
        self.audio_dim = audio_dim
        self.proj_dim = proj_dim

        self.projection = nn.Sequential(
            nn.Linear(audio_dim, proj_dim * 2),
            nn.ReLU(),
            nn.Linear(proj_dim * 2, proj_dim),
        )

    def forward(self, audio_features):
        return self.projection(audio_features)


class WanCrossAttentionProcessor(nn.Module):
    """Cross-attention processor for the Wan model."""

    def __init__(self, hidden_size, cross_attention_dim, audio_proj_dim):
        super().__init__()
        self.hidden_size = hidden_size
        self.cross_attention_dim = cross_attention_dim
        self.audio_proj_dim = audio_proj_dim

        # Query/key/value projection layers for the audio condition
        self.to_q_audio = nn.Linear(hidden_size, hidden_size, bias=False)
        self.to_k_audio = nn.Linear(audio_proj_dim, hidden_size, bias=False)
        self.to_v_audio = nn.Linear(audio_proj_dim, hidden_size, bias=False)

        self.scale = hidden_size ** -0.5

    def forward(self, hidden_states, audio_features=None, **kwargs):
        if audio_features is None:
            return hidden_states

        batch_size, seq_len, _ = hidden_states.shape

        # Compute query, key, and value
        query = self.to_q_audio(hidden_states)
        key = self.to_k_audio(audio_features)
        value = self.to_v_audio(audio_features)

        # Compute attention weights
        attention_scores = torch.matmul(query, key.transpose(-2, -1)) * self.scale
        attention_probs = F.softmax(attention_scores, dim=-1)

        # Apply the attention weights
        attention_output = torch.matmul(attention_probs, value)

        return hidden_states + attention_output


class FantasyTalkingAudioConditionModel(nn.Module):
    """FantasyTalking audio-condition model."""

    def __init__(self, base_model, audio_dim, proj_dim):
        super().__init__()
        self.base_model = base_model
        self.audio_dim = audio_dim
        self.proj_dim = proj_dim

        # Audio projection layer
        self.audio_proj = AudioProjModel(audio_dim, proj_dim)

        # Storage for the original attention processors
        self.original_processors = {}

    def load_audio_processor(self, checkpoint_path, base_model):
        """Load the audio-processor weights."""
        if os.path.exists(checkpoint_path):
            print(f"Loading FantasyTalking weights: {checkpoint_path}")
            # The actual weight file would be loaded here:
            # state_dict = torch.load(checkpoint_path, map_location="cpu")
            # self.load_state_dict(state_dict, strict=False)
        else:
            print(f"Weight file does not exist: {checkpoint_path}")

    def enable_audio_condition(self):
        """Enable audio conditioning."""
        # This should swap the audio attention processors into base_model
        pass

    def disable_audio_condition(self):
        """Disable audio conditioning."""
        # This should restore the original attention processors
        pass

    def forward(self, audio_features):
        """Forward pass."""
        return self.audio_proj(audio_features)
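A quick shape check for the modules above. The dimensions are illustrative assumptions, not the released model's values: wav2vec2-base does emit 768-dim features, but the projection width and DiT hidden size here are placeholders.

```python
import torch

from model import AudioProjModel, WanCrossAttentionProcessor

# Project a batch of wav2vec2 features into the assumed conditioning space
audio_proj = AudioProjModel(audio_dim=768, proj_dim=2048)
audio_features = torch.randn(1, 120, 768)   # (batch, audio tokens, audio_dim)
projected = audio_proj(audio_features)
print(projected.shape)                      # torch.Size([1, 120, 2048])

# Inject the projected audio into a (placeholder-sized) video token sequence
attn = WanCrossAttentionProcessor(hidden_size=1536, cross_attention_dim=1536, audio_proj_dim=2048)
hidden_states = torch.randn(1, 4096, 1536)  # (batch, video tokens, hidden)
out = attn(hidden_states, audio_features=projected)
print(out.shape)                            # torch.Size([1, 4096, 1536])
```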
requirements.txt ADDED
@@ -0,0 +1,15 @@
torch>=2.0.0
torchvision
transformers==4.46.2
gradio==5.34.2
spaces
imageio
imageio[ffmpeg]
safetensors
einops
sentencepiece
protobuf
librosa
numpy
pillow
tqdm
utils.py ADDED
@@ -0,0 +1,70 @@
# Copyright Alibaba Inc. All Rights Reserved.

import imageio
import librosa
import numpy as np
import torch
from PIL import Image
from tqdm import tqdm


def resize_image_by_longest_edge(image_path, target_size):
    """Resize an image so that its longest edge equals target_size."""
    image = Image.open(image_path)
    width, height = image.size

    if max(width, height) <= target_size:
        return image

    if width > height:
        new_width = target_size
        new_height = int(height * target_size / width)
    else:
        new_height = target_size
        new_width = int(width * target_size / height)

    return image.resize((new_width, new_height), Image.Resampling.LANCZOS)


def save_video(frames, save_path, fps, quality=9, ffmpeg_params=None):
    """Save video frames to an MP4 file."""
    if isinstance(frames, torch.Tensor):
        frames = frames.cpu().numpy()

    # Make sure the frame data is in the expected uint8 range
    if frames.max() <= 1.0:
        frames = (frames * 255).astype(np.uint8)
    else:
        frames = frames.astype(np.uint8)

    # Write the video with imageio (requires the ffmpeg backend); pass
    # ffmpeg_params through instead of silently ignoring it
    writer = imageio.get_writer(save_path, fps=fps, quality=quality, ffmpeg_params=ffmpeg_params)
    for frame in tqdm(frames, desc="Saving video"):
        writer.append_data(frame)
    writer.close()


def get_audio_features(wav2vec, audio_processor, audio_path, fps, num_frames):
    """Extract audio features with Wav2Vec2."""
    sr = 16000
    audio_input, sample_rate = librosa.load(audio_path, sr=sr)  # resample to 16 kHz

    start_time = 0
    end_time = num_frames / fps

    start_sample = int(start_time * sr)
    end_sample = int(end_time * sr)

    # NumPy slicing past the end of the array simply truncates, so no
    # exception handling is needed here
    audio_segment = audio_input[start_sample:end_sample]

    input_values = audio_processor(
        audio_segment, sampling_rate=sample_rate, return_tensors="pt"
    ).input_values.to("cuda")

    with torch.no_grad():
        fea = wav2vec(input_values).last_hidden_state

    return fea
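A minimal smoke test of `save_video` with synthetic frames. It requires imageio's ffmpeg backend, which requirements.txt installs via `imageio[ffmpeg]`; the file name is arbitrary.

```python
import numpy as np

from utils import save_video

# 24 random 256x256 RGB frames as floats in [0, 1); save_video scales
# them to uint8 and writes a one-second clip at 24 fps
frames = np.random.rand(24, 256, 256, 3)
save_video(frames, "smoke_test.mp4", fps=24)
```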