Scraping BanG Dream Game Scripts & Building a VITS Model: Audio Scraping and Dataset Alignment

Scraper results
The output can be fed straight into TTS or SVC training.

Preliminary research
With no MyGO!!!!! left to watch, I decided to train a TTS model covering the whole MyGO cast to cook up some sweetness for myself. I still remember the painful, sleep-deprived days spent unpacking the mobile game's assets back when I trained the Nijigasaki models. As with Revue Starlight, the Bestdori platform saves the trouble of unpacking the game client.
The game scripts contain all the information needed to build the dataset.
The paths to the stored audio files can also be found on the platform.
I could hardly wait: scrape it all and serve up the dataset.
Then I took a look at this rather amusing naming scheme:

-scenario
area_opening_story34
area_opening_story35
area_opening_story7
area_opening_story9
backstagestory1
bandstory1
bandstory10
bandstory100
bandstory101
bandstory102
bandstory103
bandstory104
......
eventstory199_3
eventstory199_4
eventstory199_5
eventstory199_6
eventstory2_0
eventstory2_1
eventstory2_2
eventstory2_3
eventstory2_4
eventstory2_5
......

Thoroughly chopped up.
It looks like there is no way to enumerate every audio folder's exact path just by generalizing the naming scheme, so brute force it is.

Analysis
The main Python packages needed are beautifulsoup4, requests, and selenium.
Let's first try parsing the page directly.

import requests
from bs4 import BeautifulSoup

# Target URL
URL = "https://bestdori.com/tool/explorer/asset/jp/sound/voice/scenario/bandstory294"

# Fetch the page with requests
response = requests.get(URL)
response.raise_for_status()

# Parse the HTML
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())

Let's look at what the parse gives us.

<body>
<noscript>
<strong>
We're sorry but Bestdori! doesn't work properly without JavaScript enabled. Please enable it to continue.
</strong>
</noscript>
<div id="app">
</div>
<script src="/js/chunk-vendors.9309a223.js">
</script>
<script src="/js/app.3b5d5ef2.js">
</script>
<script crossorigin="anonymous" data-cf-beacon='{"rayId":"80e4bf36cdaf49cc","version":"2023.8.0","r":1,"b":1,"token":"6e9e54182ea54522b7f2b9e65e87b87e","si":100}' defer="" integrity="sha512-bjgnUKX4azu3dLTVtie9u6TKqgx29RBwfj3QXYt5EKfWM/9hPSAI/4qcV5NACjwAo8UtTeWefx6Zq5PHcMm7Tg==" src="https://static.cloudflareinsights.com/beacon.min.js/v8b253dfea2ab4077af8c6f58422dfbfd1689876627854">
</script>
</body>

Did it parse? "Parsed", I suppose: we got everything except the actual content of the body, since the #app div is filled in by JavaScript. Clearly not the result I wanted.

Scraping the site

To parse a dynamically rendered site, first download chromedriver, and pip install the selenium package mentioned earlier.
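A side note: on newer Selenium 4 releases the executable_path keyword used in the snippets below has been removed. If it errors out, a hedged alternative is to wrap the chromedriver path in a Service object (Selenium 4.6+ can even locate the driver on its own):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: hand the chromedriver path over via a Service object
driver_path = r'C:\Users\Admin\爬虫\chromedriver-win64\chromedriver.exe'
driver = webdriver.Chrome(service=Service(driver_path))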

from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Test URL
URL = "https://bestdori.com/tool/explorer/asset/jp/sound/voice/scenario/bandstory294"

# Path to chromedriver
driver_path = r'C:\Users\Admin\爬虫\chromedriver-win64\chromedriver.exe'
driver = webdriver.Chrome(executable_path=driver_path)

# Open the URL
driver.get(URL)

# Wait for the page to finish loading; a fixed delay or Selenium's wait mechanisms both work
# Simple fixed delay (10 seconds):
time.sleep(10)

# Grab the rendered HTML
page_source = driver.page_source

# Parse with BeautifulSoup
soup = BeautifulSoup(page_source, 'html.parser')
print(soup.prettify())

# Close the browser
driver.quit()

This time the result matches what Chrome's right-click "Inspect" shows, so let's jump straight to the part we care about.

<div class="m-b-l">
<div class="m-b-xs">
<span class="icon">
<i class="fas fa-file-audio">
</i>
</span>
<span>
event57-07-048.mp3
</span>
</div>

The "icon" and "fas fa-file-audio" classes are the handles we can search on.

from bs4 import BeautifulSoup

html_content = soup.prettify()

soup = BeautifulSoup(html_content, 'html.parser')

# For every <span class="icon"> that holds a fas fa-file-audio icon, grab its next sibling (the file name)
spans = soup.find_all('span', class_='icon')
file_names = [span.find_next_sibling().text.strip() for span in spans if span.find('i', class_='fas fa-file-audio')]

for file_name in file_names:
    print(file_name)

Let's check what the search returns:

event57-07-001.mp3
event57-07-002.mp3
......
event57-07-048.mp3
event57-07-049.mp3
......

Exactly what I wanted.
By the same logic, go one level up and swap "fas fa-file-audio" for "fas fa-file-archive":

from selenium import webdriver
from bs4 import BeautifulSoup
import time

URL = "https://bestdori.com/tool/explorer/asset/jp/sound/voice/scenario"

# Path to chromedriver
driver_path = r'C:\Users\Admin\爬虫\chromedriver-win64\chromedriver.exe'
driver = webdriver.Chrome(executable_path=driver_path)

# Open the URL
driver.get(URL)

# Wait for the page to finish loading; a fixed delay or Selenium's wait mechanisms both work
# Simple fixed delay (10 seconds):
time.sleep(10)

# Grab the rendered HTML
page_source = driver.page_source

# Parse with BeautifulSoup
soup = BeautifulSoup(page_source, 'html.parser')

# Close the browser
driver.quit()

html_content = soup.prettify()

soup = BeautifulSoup(html_content, 'html.parser')

# For every <span class="icon"> that holds a fas fa-file-archive icon, grab its next sibling (the folder name)
spans = soup.find_all('span', class_='icon')
file_names = [span.find_next_sibling().text.strip() for span in spans if span.find('i', class_='fas fa-file-archive')]

for file_name in file_names:
    print(file_name)

and the names of the parent directories can be pulled out.

Complete code

from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Path to chromedriver
driver_path = r'C:\Users\Admin\爬虫\chromedriver-win64\chromedriver.exe'
driver = webdriver.Chrome(executable_path=driver_path)

# 1. Visit the top-level page and collect the names of all subdirectories
URL = "https://bestdori.com/tool/explorer/asset/jp/sound/voice/scenario"
driver.get(URL)
time.sleep(2)  # wait for the page to load
soup = BeautifulSoup(driver.page_source, 'html.parser')
subdirs = [span.find_next_sibling().text.strip() for span in soup.find_all('span', class_='icon') if span.find('i', class_='fas fa-file-archive')]

# Open the txt file that will hold the mp3 URLs
with open("C:/Users/Admin/爬虫/WholeMp3UrlPaths.txt", "w") as f:

    # 2. For each subdirectory, collect the names of all mp3 files
    for subdir in subdirs:
        URL_subdir = f"https://bestdori.com/tool/explorer/asset/jp/sound/voice/scenario/{subdir}"
        driver.get(URL_subdir)
        time.sleep(2)  # wait for the page to load
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        mp3_names = [span.find_next_sibling().text.strip() for span in soup.find_all('span', class_='icon') if span.find('i', class_='fas fa-file-audio')]

        # 3. Write each mp3 file's download URL to the txt file
        for mp3_name in mp3_names:
            f.write(f"https://bestdori.com/assets/jp/sound/voice/scenario/{subdir}_rip/{mp3_name}\n")

# Close the browser
driver.quit()

All of the voice-line URLs are now recorded in WholeMp3UrlPaths.txt. One thing to note: the real download URL for an audio file is not simply the explorer path plus the file name, but rather

f.write(f"https://bestdori.com/assets/jp/sound/voice/scenario/{subdir}_rip/{mp3_name}\n")
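For instance, following that pattern, the file event57-07-048.mp3 listed under the bandstory294 folder ends up at https://bestdori.com/assets/jp/sound/voice/scenario/bandstory294_rip/event57-07-048.mp3 (note the _rip suffix on the folder).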

Parsing these dynamically rendered pages takes a very long time, and at this point the job is only half done: the information inside the asset files still has to be organized.
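If the fixed ten-second sleeps are the bottleneck, Selenium's explicit waits can return as soon as the file list has rendered. A minimal sketch of a drop-in replacement for the time.sleep calls in the scripts above (it assumes the rows render as the span.icon / fa-file-audio elements observed earlier, and that driver and BeautifulSoup are already in scope):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block until at least one audio-file icon appears (30 s timeout) instead of sleeping blindly
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "span.icon i.fas.fa-file-audio"))
)
soup = BeautifulSoup(driver.page_source, 'html.parser')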

Downloading all the story scripts

The in-game story scripts are similar to a galgame's (actually far more complex, since they also carry Live2D motions and the like).
Download all of them the same way the mp3 URLs were collected.

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import requests
import os

# Path to chromedriver and the starting URL
driver_path = r'C:\Users\Admin\爬虫\chromedriver-win64\chromedriver.exe'
base_url = "https://bestdori.com/tool/explorer/asset/jp/scenario"

driver = webdriver.Chrome(executable_path=driver_path)

# First-level directories
driver.get(base_url)
time.sleep(10)
soup = BeautifulSoup(driver.page_source, 'html.parser')
first_level_dirs = [span.find_next_sibling('span', class_='m-l-xs').text.strip() for span in soup.find_all('span', class_='icon') if (span.find('i', class_='fas fa-file-archive') or span.find('i', class_='fas fa-folder'))]

# Collect the links of every .asset file found
all_asset_links = []

for first_dir in first_level_dirs:
    driver.get(f"{base_url}/{first_dir}")
    time.sleep(10)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    second_level_dirs = [span.find_next_sibling('span', class_='m-l-xs').text.strip() for span in soup.find_all('span', class_='icon') if (span.find('i', class_='fas fa-file-archive') or span.find('i', class_='fas fa-folder'))]

    for second_dir in second_level_dirs:
        driver.get(f"{base_url}/{first_dir}/{second_dir}")
        time.sleep(10)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        links = soup.find_all('a', attrs={'download': True})
        asset_links = ['https://bestdori.com' + link['href'] for link in links]
        print(asset_links)
        all_asset_links.extend(asset_links)

# Download every .asset file found
save_path = r'C:\Users\Admin\爬虫\assert'
if not os.path.exists(save_path):
    os.makedirs(save_path)

for link in all_asset_links:
    response = requests.get(link, stream=True)
    filename = os.path.join(save_path, link.split('/')[-1])

    if response.status_code == 200:
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(1024):
                f.write(chunk)

# Close the browser
driver.quit()

Building the labeled dataset

Reorganizing the story scripts gives us a labeled dataset.
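For reference, somewhere inside each deeply nested .asset JSON there are dicts carrying the three fields the script below looks for. Roughly, with placeholder values and the nesting simplified (the exact layout varies by scenario type, which is why the script walks the whole structure recursively):

# Illustrative only: placeholder values, real nesting is deeper and varies
example_entry = {
    "windowDisplayName": "<speaker shown in the dialogue window>",
    "body": "<the spoken line of text>",
    "voiceId": "<mp3 file name without the extension>",
}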

import os
import json
import re

def extract_data(data, current_data=None, output_file=None):
    if current_data is None:
        current_data = {}

    if isinstance(data, dict):
        if 'windowDisplayName' in data:
            current_data['windowDisplayName'] = data['windowDisplayName']
        if 'body' in data:
            # Strip newlines from the body text
            current_data['body'] = data['body'].replace('\n', '')
        if 'voiceId' in data:
            current_data['voiceId'] = data['voiceId']

        # All three fields must be non-empty
        valid_data = all(current_data.get(k) for k in ['windowDisplayName', 'body', 'voiceId'])
        # windowDisplayName must not contain "・"
        valid_displayname = "・" not in current_data.get('windowDisplayName', "")
        # body must contain something other than punctuation
        valid_body = bool(re.sub(r'[^\w]', '', current_data.get('body', "")))

        # If every condition holds, write the entry to the output file
        if valid_data and valid_displayname and valid_body:
            output_file.write(f"{current_data['voiceId']}|{current_data['windowDisplayName']}|{current_data['body']}\n")
            current_data.clear()  # reset for the next entry

        for key in data:
            extract_data(data[key], current_data, output_file)
    elif isinstance(data, list):
        for item in data:
            extract_data(item, current_data, output_file)

# List every file in the directory
directory = "C:/Users/Admin/爬虫/assert"
files = os.listdir(directory)

# Open a txt file to collect the results
with open("BangDreamSortPath.txt", "w", encoding="utf-8") as output_file:
    # Walk through every file
    for filename in files:
        if filename.endswith(".asset"):
            file_path = os.path.join(directory, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                content = json.load(file)
                extract_data(content, output_file=output_file)
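Each line of BangDreamSortPath.txt therefore has the form voiceId|speaker|text, which is what the merging step below relies on.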

Building the preprocessing list
To make it easy to process the dataset quickly on Colab, merge the mp3 URLs back into the cleaned-up "BangDreamSortPath.txt":

# Build a mapping from WholeMp3UrlPaths.txt: audio id -> full URL
path_mapping = {}
with open("WholeMp3UrlPaths.txt", "r", encoding="utf-8", errors="ignore") as f:
    for line in f:
        try:
            audio_id = line.strip().split("/")[-1].replace(".mp3", "")
            path_mapping[audio_id] = line.strip('\n')
        except Exception as e:
            print(f"Error processing line {line}: {e}")

# Walk through BangDreamSortPath.txt and replace each audio id with its URL where possible
new_results = []
with open("BangDreamSortPath.txt", "r", encoding="utf-8") as f:
    for line in f:
        try:
            audio_id = line.split("|")[0]
            if audio_id in path_mapping:
                line = line.replace(audio_id, path_mapping[audio_id])
            new_results.append(line)
        except Exception as e:
            print(f"Error processing line {line}: {e}")

# Save the merged results to a new txt file
with open("SortPathUrl.txt", "w", encoding="utf-8") as f:
    f.writelines(new_results)

# Crudely filter out invalid URLs
file_path = "SortPathUrl.txt"  # the file written just above
with open(file_path, 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Keep only lines that start with https
filtered_lines = [line for line in lines if line.startswith('https')]

# Write the filtered lines back
with open(file_path, 'w', encoding='utf-8') as file:
    file.writelines(filtered_lines)

print("File has been processed.")

This way the tedious upload step is skipped entirely and everything can be processed directly on Colab.

Model training

The Bert-VITS2 guides floating around online are pretty dreadful, presumably because people don't want a one-click pipeline circulating and getting abused.
The key points first.
There isn't much to say about data preprocessing: when downloading the audio files from SortPathUrl.txt with the requests module, just convert them to 44100 Hz mono on the fly (see the sketch after the snippet below), so resample.py isn't needed.
Before training, run preprocess_text.py on SortPathUrl.txt (remember to replace the URLs inside it with the real local paths); the config is generated automatically.
Download the Japanese BERT model, run bert_gen.py, and then start training.
Then add the following above the def run() function in train_ms.py:

import os

# Point torch.distributed at a single local process (one machine, one GPU)
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '8880'
os.environ['WORLD_SIZE'] = '1'
os.environ['RANK'] = '0'
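For the download-and-resample step mentioned above, here is a minimal sketch. It assumes pydub (and ffmpeg) are available on the Colab runtime, and it writes 44100 Hz mono wavs into a hypothetical raw_audio/ folder; adjust the names to taste:

import os
from io import BytesIO

import requests
from pydub import AudioSegment  # assumes pydub + ffmpeg are installed

os.makedirs("raw_audio", exist_ok=True)  # hypothetical output folder

with open("SortPathUrl.txt", encoding="utf-8") as f:
    for line in f:
        url = line.split("|")[0].strip()  # each line is url|speaker|text
        if not url.startswith("https"):
            continue
        resp = requests.get(url)
        if resp.status_code != 200:
            continue
        # Decode the mp3, downmix to mono and resample to 44.1 kHz
        audio = AudioSegment.from_mp3(BytesIO(resp.content))
        audio = audio.set_channels(1).set_frame_rate(44100)
        wav_name = url.split("/")[-1].replace(".mp3", ".wav")
        audio.export(os.path.join("raw_audio", wav_name), format="wav")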

Postscript: infuriatingly, the audio resampled on Colab most likely fails to save, so a certain amount of wasted compute is pretty much guaranteed (sigh).