Getting QQ Chat Data
- Tutorial reference: NTQQ Windows Data Decryption
- Supplementary material: Database Decoding Reference
- These are two chapters of the same tutorial. Read them patiently; the process is not complicated (if you get stuck, scroll to the bottom to find me).
- Open the database in DB Browser for SQLite and enter the 16-character key you obtained as the password.
- The HMAC algorithm is usually SHA1, but some databases use SHA512 or SHA256. The database will not open with the wrong algorithm, so try each one until it opens (an AI assistant can also help you adapt the settings).
- In DB Browser, export the SQL of the c2c_msg_table table.
- Create a new database, import the SQL file you just exported
- The result is a plaintext database. If you can open it and see the data, the conversion worked.
- Rename the database to qq.db and place it in the dataset/original folder, or change qq_db_path in setting.jsonc to point to it.
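Before moving on, it can help to confirm that the file is plaintext SQLite. The following is a minimal sketch using Python's built-in sqlite3 module. The table name c2c_msg_table and the path dataset/original/qq.db come from the steps above; the helper name count_rows is made up for illustration and is not part of the project.

```python
import os
import sqlite3


def count_rows(db_path, table="c2c_msg_table"):
    """Return the row count of `table`, or None if the table is missing.

    sqlite3 raises DatabaseError when the file is still encrypted or is
    not a SQLite database at all, so an exception here suggests that the
    export/import step did not produce a plaintext database.
    """
    if not os.path.isfile(db_path):
        # Avoid sqlite3.connect() silently creating an empty file.
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        found = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
        if found is None:
            return None
        # Table names cannot be bound as parameters, so quote it instead.
        return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    finally:
        conn.close()
```

Calling count_rows("dataset/original/qq.db") should return the number of exported messages. A return value of None means the table was not imported.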
Getting Telegram (TG) Chat Data
Use Telegram Desktop to export the chat data.
Click the Export chat history button.
Select the JSON (Machine-readable JSON) option.
Leave the other options unchecked, as this project does not support multimodal data yet.
Move all ChatExport_ folders from the export folder to the dataset/original/ folder.
Important: In the setting.jsonc file, change telegram_chat_id to your Telegram chat ID, including any spaces!
- For example, an ID that contains a space must be filled in exactly as it appears: qqqqq f
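To double-check what was exported, the sketch below lists each ChatExport_ folder with its chat name, ID, and message count. It assumes the single-chat layout that Telegram Desktop writes (a result.json file with name, id, and messages fields). The function names are illustrative, and which field the project compares against telegram_chat_id is not confirmed here.

```python
import json
from pathlib import Path


def flatten_text(text):
    """Telegram stores message text as a string, or as a list that mixes
    plain strings with entity dicts carrying their own "text" key."""
    if isinstance(text, str):
        return text
    parts = []
    for item in text:
        parts.append(item if isinstance(item, str) else item.get("text", ""))
    return "".join(parts)


def summarize_exports(root="dataset/original"):
    """Return (chat name, chat id, message count) for each ChatExport_* folder."""
    rows = []
    for result in sorted(Path(root).glob("ChatExport_*/result.json")):
        data = json.loads(result.read_text(encoding="utf-8"))
        # Skip service entries such as "user joined"; keep real messages.
        messages = [m for m in data.get("messages", []) if m.get("type") == "message"]
        rows.append((data.get("name"), data.get("id"), len(messages)))
    return rows
```

Printing the result of summarize_exports() shows the chat name exactly as stored, including spaces, which makes it easier to copy into setting.jsonc.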
Getting WeChat (WX) Chat Data
Go to the WeChatBakTool GitHub project and download the latest version from the releases page.
Or click here to download WeChatBakTool.
Go to this project to download an older version of WeChat (v3.9.12.15).
Or click this link to quickly download WeChat.
Install and log in to WeChat.
On your phone, go to Settings - Chats - Chat History Migration & Backup - Migrate - Migrate to PC/Mac and proceed.
Unzip BakTool.
- Install .NET Desktop Runtime (note: this is the Desktop Runtime version 6.0; ignore if already installed).
- Open and log in to WeChat.
- At the bottom left of the software, click "New Workspace".
- In the "New Workspace" interface, select the WeChat process for which you want to create a workspace, and confirm that the WeChat ID below is correct.
- For the decryption method, "Username Inference Search" is recommended. In theory it supports all 64-bit versions of WeChat, but it only works if the WeChat account shown is correct.
- Beginners should ignore other options and directly click "Create Workspace". The program will automatically create and decrypt the workspace.
Right-click the workspace -> Manage, then export all friend chats.
Go to the baktool folder and open workspace-[random_folder_name]-DecDB.
Find all MSG*.db files, for example MSG1.db, and move them all to the dataset/original/wechat folder.
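The last step can also be scripted. This is a minimal sketch, assuming you pass in the path of your own workspace-[random_folder_name]-DecDB folder. The helper name collect_msg_dbs is illustrative. It copies rather than moves the files so the workspace stays intact, and it searches subfolders as well, which is an assumption about the folder layout.

```python
import shutil
from pathlib import Path


def collect_msg_dbs(decdb_dir, target_dir="dataset/original/wechat"):
    """Copy every MSG*.db found under decdb_dir into target_dir.

    Returns the copied file names in sorted order.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for db in sorted(Path(decdb_dir).rglob("MSG*.db")):
        # copy2 keeps timestamps; files with the same name overwrite each other.
        shutil.copy2(db, target / db.name)
        copied.append(db.name)
    return copied
```

If the returned list is empty, the workspace was probably not decrypted or the wrong folder was given.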
(Optional) Getting Chat Data from Video/Audio Files
- Extract from dual-track video/audio (requires files with separated audio tracks)
1. Install Dependencies
# Automatically install all dependencies
python process_data/chat_parser/video-to-chatml/start-vtc.py --install
# Or install manually
pip install -r requirements.txt
Note: You also need to install ffmpeg:
- Windows: download from https://ffmpeg.org or run choco install ffmpeg
- Ubuntu: sudo apt install ffmpeg
- macOS: brew install ffmpeg
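A quick way to confirm that ffmpeg is reachable after installation is to look it up on PATH. The sketch below uses only the standard library. The function names are illustrative, and checking for ffprobe as well is an assumption, since it normally ships alongside ffmpeg.

```python
import shutil


def find_tool(name="ffmpeg"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


def missing_tools(names=("ffmpeg", "ffprobe")):
    """Return the tools from `names` that cannot be found on PATH."""
    return [name for name in names if find_tool(name) is None]
```

An empty list from missing_tools() means both tools were found. Otherwise, reopen the terminal or check your PATH settings.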
2. Usage
Interactive Mode (Recommended for new users)
python process_data/chat_parser/video-to-chatml/start-vtc.py -i
Direct Command Line Mode
python process_data/chat_parser/video-to-chatml/start-vtc.py video.mp4 -u 0 -a 1 -o output.json -m base
Using Main Program
python process_data/chat_parser/video-to-chatml/video-to-chatml.py video.mp4 -u 0 -a 1 -o output.json -m base
Parameter Description
- video: Input video file path
- -u, --user-track: User audio track index (default: 0)
- -a, --assistant-track: Assistant audio track index (default: 1)
- -o, --output: Output ChatML file path
- -m, --model: Whisper model size (tiny/base/small/medium/large, default: base)
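When several recordings need converting, the documented flags can be assembled from Python. The following is a sketch, not part of the project: build_command and convert are hypothetical helpers, and they use only the script path and options listed above.

```python
import subprocess
import sys

SCRIPT = "process_data/chat_parser/video-to-chatml/start-vtc.py"
MODELS = ("tiny", "base", "small", "medium", "large")


def build_command(video, output, user_track=0, assistant_track=1, model="base"):
    """Build the argument list for one conversion using the documented flags."""
    if model not in MODELS:
        raise ValueError(f"unknown Whisper model size: {model}")
    return [
        sys.executable, SCRIPT, str(video),
        "-u", str(user_track),
        "-a", str(assistant_track),
        "-o", str(output),
        "-m", model,
    ]


def convert(video, output, **options):
    """Run the converter; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_command(video, output, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.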
Whisper Model Selection
Model | Parameters | Memory Usage | Speed | Accuracy |
---|---|---|---|---|
tiny | 39M | ~1GB | Fastest | Lowest |
base | 74M | ~1GB | Fast | Low |
small | 244M | ~2GB | Medium | Medium |
medium | 769M | ~5GB | Slow | High |
large | 1550M | ~10GB | Slowest | Highest |
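As a rough aid for choosing a model, the memory column of the table can be turned into a lookup. This is a sketch only: the figures are the approximate values from the table, and pick_model is a hypothetical helper that returns the largest model fitting a given memory budget.

```python
# Approximate memory needs in GB, copied from the table above.
MODEL_MEMORY_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_model(available_gb):
    """Return the largest model whose approximate memory need fits the budget.

    Falls back to "tiny" when even the smallest model does not fit.
    """
    choice = "tiny"
    for name, needed_gb in MODEL_MEMORY_GB:
        if needed_gb <= available_gb:
            choice = name
    return choice
```

Leave some headroom in practice, since the table values are estimates and other programs also use memory.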
Output Format
The generated ChatML file format is as follows:
[
  {
    "role": "user",
    "content": "User's spoken content",
    "timestamp": {
      "start": 0.0,
      "end": 2.5
    }
  },
  {
    "role": "assistant",
    "content": "Assistant's response",
    "timestamp": {
      "start": 2.5,
      "end": 5.0
    }
  }
]
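A file in this format can be inspected with a few lines of Python. The sketch below assumes every entry carries the role and timestamp fields shown above. The function names are illustrative and not part of the project.

```python
import json
from collections import defaultdict


def load_chatml(path):
    """Load the list of turns from a generated output file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def speaking_time(entries):
    """Sum the duration (end - start, in seconds) of all turns per role."""
    totals = defaultdict(float)
    for entry in entries:
        stamp = entry["timestamp"]
        totals[entry["role"]] += stamp["end"] - stamp["start"]
    return dict(totals)
```

A heavily lopsided result, such as almost no time for one role, can be a hint that the user and assistant track indexes were swapped or pointed at the wrong track.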
Common Issues
1. CUDA Support
If you have an NVIDIA GPU, the program will automatically use CUDA acceleration. Check CUDA support:
python process_data/chat_parser/video-to-chatml/start-vtc.py --check
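The same question can be asked of PyTorch directly. This is a minimal sketch and not the implementation behind the --check option, which may report more. The helper name cuda_status is made up for illustration.

```python
def cuda_status():
    """Describe CUDA availability as seen by PyTorch."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # Device 0 is the first GPU visible to PyTorch.
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "CUDA not available, running on CPU"
```

If an NVIDIA GPU is installed but this reports no CUDA, the installed torch build may be CPU-only.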
2. Audio Track Recognition
Using interactive mode allows you to view all audio track information in the video, helping you select the correct track index.
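Audio tracks can also be listed outside the program with ffprobe, which is installed together with ffmpeg. The sketch below is an independent helper, not the project's code. It assumes that the track index expected by -u and -a is the position among audio streams, which is worth verifying in interactive mode.

```python
import json
import subprocess


def parse_audio_streams(ffprobe_json):
    """Turn ffprobe's JSON output into a list of audio track descriptions."""
    data = json.loads(ffprobe_json)
    tracks = []
    for position, stream in enumerate(data.get("streams", [])):
        tags = stream.get("tags", {})
        tracks.append({
            "track": position,                   # position among audio streams
            "stream_index": stream.get("index"),  # index within the whole file
            "codec": stream.get("codec_name"),
            "language": tags.get("language"),
        })
    return tracks


def list_audio_tracks(video):
    """Run ffprobe on `video` and return its audio tracks."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "a",
            "-show_entries", "stream=index,codec_name:stream_tags=language",
            "-of", "json",
            str(video),
        ],
        capture_output=True, text=True, check=True,
    )
    return parse_audio_streams(result.stdout)
```

A file with fewer than two entries in the result does not have separated tracks and cannot be split into user and assistant turns this way.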
3. Out of Memory
If you encounter memory issues, try using smaller Whisper models (like tiny or base).
Dependencies
- Python 3.7+
- openai-whisper
- torch
- ffmpeg-python
- ffmpeg (system dependency)
Supported Formats
Video Formats: MP4, MKV, AVI, MOV, WMV, and other formats supported by ffmpeg
Audio Codecs: Most common audio codecs (AAC, MP3, WAV, etc.)