def _list_dir_entries(driver, url, wait=10):
    """Load *url* in the Selenium driver and return the names of its entries.

    An entry is any ``<span class="icon">`` whose icon is a folder or archive;
    its display name lives in the following sibling ``<span class="m-l-xs">``.
    ``wait`` seconds are slept after navigation to let the JS-rendered page
    settle (crude, but matches the original pacing — TODO: use WebDriverWait).
    """
    driver.get(url)
    time.sleep(wait)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    names = []
    for icon in soup.find_all('span', class_='icon'):
        # Keep only folder / archive rows; skip plain files.
        if (icon.find('i', class_='fas fa-file-archive')
                or icon.find('i', class_='fas fa-folder')):
            names.append(icon.find_next_sibling('span', class_='m-l-xs').text.strip())
    return names


# First directory level.
first_level_dirs = _list_dir_entries(driver, base_url)

# Download links for every .asset file found two levels deep.
all_asset_links = []
for first_dir in first_level_dirs:
    # Second directory level (same listing logic, one level down).
    second_level_dirs = _list_dir_entries(driver, f"{base_url}/{first_dir}")
    for second_dir in second_level_dirs:
        driver.get(f"{base_url}/{first_dir}/{second_dir}")
        time.sleep(10)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # Anchor tags carrying a `download` attribute are the asset files.
        links = soup.find_all('a', attrs={'download': True})
        asset_links = ['https://bestdori.com' + link['href'] for link in links]
        print(asset_links)
        all_asset_links.extend(asset_links)
# Download every collected .asset file into save_path.
save_path = r'C:\Users\Admin\爬虫\assert'
# exist_ok avoids the exists()/makedirs() race of the original.
os.makedirs(save_path, exist_ok=True)

for link in all_asset_links:
    # Local file name = last URL path segment.
    filename = os.path.join(save_path, link.split('/')[-1])
    # Use the response as a context manager so the connection is released
    # even when the download is skipped (the original leaked it on non-200).
    # timeout keeps a dead server from hanging the whole loop forever.
    with requests.get(link, stream=True, timeout=60) as response:
        if response.status_code == 200:
            with open(filename, 'wb') as f:
                # Stream in 1 KiB chunks to avoid loading whole files in RAM.
                for chunk in response.iter_content(1024):
                    f.write(chunk)
def extract_data(data, current_data=None, output_file=None):
    """Recursively walk parsed .asset JSON and write dialogue rows.

    Accumulates ``windowDisplayName``, ``body`` (newlines stripped) and
    ``voiceId`` into *current_data* while descending; whenever all three are
    present — and the name contains no "・" and the body is not punctuation
    only — one ``voiceId|windowDisplayName|body`` row is written to
    *output_file* and the accumulator is reset.

    :param data: any JSON fragment (dict / list / scalar).
    :param current_data: shared accumulator dict; created when ``None``.
    :param output_file: writable text file receiving the rows.
    """
    if current_data is None:
        current_data = {}

    if isinstance(data, dict):
        # Harvest the fields of interest from this mapping.
        if 'windowDisplayName' in data:
            current_data['windowDisplayName'] = data['windowDisplayName']
        if 'body' in data:
            # Newlines inside the dialogue text would break the row format.
            current_data['body'] = data['body'].replace('\n', '')
        if 'voiceId' in data:
            current_data['voiceId'] = data['voiceId']

        # A row is emitted only when every field is non-empty, ...
        complete = all(current_data.get(k)
                       for k in ('windowDisplayName', 'body', 'voiceId'))
        # ... the speaker name carries no "・" separator, ...
        name_ok = '・' not in current_data.get('windowDisplayName', "")
        # ... and the body still has word characters after stripping the rest.
        body_ok = bool(re.sub(r'[^\w]', '', current_data.get('body', "")))

        if complete and name_ok and body_ok:
            row = '|'.join((current_data['voiceId'],
                            current_data['windowDisplayName'],
                            current_data['body']))
            output_file.write(row + '\n')
            current_data.clear()  # reset the accumulator for the next row

        # Descend into every value, sharing the accumulator.
        for value in data.values():
            extract_data(value, current_data, output_file)

    elif isinstance(data, list):
        for element in data:
            extract_data(element, current_data, output_file)
# Collect every qualifying dialogue row into BangDreamSortPath.txt.
# NOTE(review): `files` and `directory` must be defined earlier in the
# script (directory listing of the downloaded assets) — confirm.
with open("BangDreamSortPath.txt", "w", encoding="utf-8") as output_file:
    for filename in files:
        if not filename.endswith(".asset"):
            continue
        file_path = os.path.join(directory, filename)
        with open(file_path, 'r', encoding='utf-8') as file:
            extract_data(json.load(file), output_file=output_file)
# Build a mapping from audio id (mp3 basename, extension dropped) to the
# full URL line taken from WholeMp3UrlPaths.txt.
path_mapping = {}
with open("WholeMp3UrlPaths.txt", "r", encoding="utf-8", errors="ignore") as f:
    for line in f:
        try:
            basename = line.strip().rsplit("/", 1)[-1]
            path_mapping[basename.replace(".mp3", "")] = line.strip('\n')
        except Exception as e:
            print(f"Error processing line {line}: {e}")
# Walk every line of BangDreamSortPath.txt and swap the leading audio id
# for its full URL when we have one in path_mapping.
new_results = []
with open("BangDreamSortPath.txt", "r", encoding="utf-8") as f:
    for line in f:
        try:
            # Only rewrite the first "|"-separated field. The original used
            # line.replace(audio_id, ...), which also substituted any
            # occurrence of the id inside the speaker name or dialogue body.
            audio_id, sep, rest = line.partition("|")
            if sep and audio_id in path_mapping:
                line = path_mapping[audio_id] + sep + rest
            new_results.append(line)
        except Exception as e:
            print(f"Error processing line {line}: {e}")

# Persist the rewritten rows.
with open("SortPathUrl.txt", "w", encoding="utf-8") as f:
    f.writelines(new_results)
# Crude cleanup: rewrite the file keeping only lines that are URLs.
# NOTE(review): `file_path` here is whatever was last assigned above —
# confirm it points at the intended URL list and not a leftover loop variable.
with open(file_path, 'r', encoding='utf-8') as file:
    filtered_lines = [line for line in file if line.startswith('https')]

# The read handle is closed before reopening for the overwrite.
with open(file_path, 'w', encoding='utf-8') as file:
    file.writelines(filtered_lines)