Chinese Word Frequency Statistics in VS Code

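Counting word frequency requires word segmentation (especially for Chinese, where "我爱编程" has to be split into "我", "爱", "编程"), so the simplest approach is to run a short Python script in the VS Code terminal. High-frequency words can be used directly as article tags, which makes the site's content structure clearer and is also friendlier to search engines. The same counts can help you analyze how focused your articles' topics are and which way the content leans, as a reference for future writing.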
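
As a rough illustration of how frequency counts could turn into tags and a crude concentration figure, the sketch below works on a hand-written token list rather than real segmentation output. The helper names `suggest_tags` and `top_share`, and the idea of measuring concentration as the share of tokens covered by the top k words, are assumptions made for illustration and are not part of the script in this article.

```python
from collections import Counter

# Hand-written sample tokens standing in for segmented article text (illustrative only)
tokens = ["路由器", "配置", "路由器", "备份", "配置", "路由器", "星空", "拍摄"]

def suggest_tags(words, k=3):
    """Return the k most frequent words as candidate tags (hypothetical helper)."""
    return [word for word, _ in Counter(words).most_common(k)]

def top_share(words, k=3):
    """Share of all tokens covered by the k most frequent words (one crude concentration measure)."""
    counter = Counter(words)
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counter.most_common(k)) / total

tags = suggest_tags(tokens, k=2)
share = top_share(tokens, k=2)
print(tags, f"{share:.3f}")
```

A higher share suggests the writing circles around a few topics; what counts as "high" depends on your own corpus.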

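1. Environment setup

Install Python: make sure Python 3 is installed on your computer.
Install the segmentation tool: open the VS Code terminal (shortcut Ctrl + ` ), type the following command, and press Enter:  python3 -m pip install jieba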
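
To confirm that both pieces are in place, a small check like the one below may help. It is only a sketch: the function name `jieba_available` is made up for this example, and the 3.6 threshold is there because the script in this article uses f-strings.

```python
import importlib.util
import sys

def jieba_available():
    """Return True if the jieba package can be imported by this interpreter."""
    return importlib.util.find_spec("jieba") is not None

print("Python version:", sys.version.split()[0])

if sys.version_info < (3, 6):
    # The script in this article uses f-strings, which need Python 3.6 or later
    print("Please upgrade to Python 3.6 or later.")
elif jieba_available():
    import jieba
    # lcut returns the segmentation result as a list
    print(jieba.lcut("我爱编程"))
else:
    print("jieba is not installed; run: python3 -m pip install jieba")
```

Run it with the same interpreter you plan to use for the script, since pip installs packages per interpreter.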

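2. Place the script file

Open your project folder in VS Code.
Create a new file in the root directory and name it tags.py.

3. Write the script

Copy the following code in full and paste it into tags.py: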

import os
import jieba
import re
from collections import Counter

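# ================= Configuration =================
# 1. Target folder: '.' scans the folder the script is run from, including subfolders.
#    If your articles live in a _posts folder, change this to './_posts'.
POSTS_PATH = '.'

# 2. Counting parameters
TOP_N = 5000        # Maximum number of high-frequency words to display
MIN_WORD_LEN = 2    # Drop single characters (such as "的", "我"); keep words of 2+ characters

# 3. Stop-word filter: add the meaningless words you want to exclude here
STOP_WORDS = {

}
# =================================================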

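def get_content_without_header(text):
    """Strip the YAML Front Matter (the block between --- lines) from the top of a Markdown file."""
    # Split on the first two lines that consist only of ---
    split_content = re.split(r'^---\s*$', text, maxsplit=2, flags=re.MULTILINE)
    # Only treat it as Front Matter when nothing but whitespace precedes the first ---
    if len(split_content) >= 3 and not split_content[0].strip():
        return split_content[2]  # Return the body after the closing ---
    return text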

def main():
    all_words = []
    found_files = 0

    # Check that the path exists
    if not os.path.exists(POSTS_PATH):
        print(f"❌ Error: path '{POSTS_PATH}' not found.")
        print("Check that the script is in the right place, or edit POSTS_PATH in the code.")
        return

    # Walk the folder tree
    for root, _, files in os.walk(POSTS_PATH):
        for file in files:
            # Match .md and .markdown files
            if file.endswith(('.md', '.markdown')):
                found_files += 1
                file_path = os.path.join(root, file)
                try:
                    # utf-8-sig also accepts files that start with a BOM
                    with open(file_path, 'r', encoding='utf-8-sig') as f:
                        raw_text = f.read()
                    # Step 1: strip the YAML header
                    content = get_content_without_header(raw_text)
                    # Step 2: keep only runs of Chinese characters (drops Latin text, digits, punctuation)
                    segments = re.findall(r'[\u4e00-\u9fa5]+', content)
                    # Step 3: segment each run with jieba, so words never span punctuation
                    for segment in segments:
                        for word in jieba.cut(segment):
                            # Step 4: length filter + stop-word filter
                            if len(word) >= MIN_WORD_LEN and word not in STOP_WORDS:
                                all_words.append(word)
                except Exception as e:
                    print(f"Error reading file {file}: {e}")

    # Print the results
    print(f"✅ Diagnostics: scanned {found_files} articles")
    print("-" * 35)

    if not all_words:
        print("⚠️ No results. Check that the files contain Chinese text and that the path is correct.")
        return

    # Count word frequencies
    counter = Counter(all_words)
    common_words = counter.most_common(TOP_N)

    print(f"{'Keyword':<15} | {'Count':<5}")
    print("-" * 35)
    for word, count in common_words:
        # Pad both columns with f-string format specs
        print(f"{word:<15} | {count:<5}")

if __name__ == "__main__":
    main()
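
To see what the two pre-processing steps do before segmentation, the sketch below applies the same regular expressions to a made-up post. The sample text and the name `strip_front_matter` are invented for illustration; this variant only strips a header when the file actually begins with `---`, so a horizontal rule later in the body is left alone.

```python
import re

# Made-up post: a YAML header, a mixed Chinese/English line, and a horizontal rule
SAMPLE_POST = """---
title: 示例文章
tags: [demo]
---
我爱编程。Python makes it easy!

---

这是正文。
"""

def strip_front_matter(text):
    """Drop a leading YAML front matter block, if the text starts with one."""
    parts = re.split(r'^---\s*$', text, maxsplit=2, flags=re.MULTILINE)
    if len(parts) >= 3 and parts[0].strip() == "":
        return parts[2]
    return text

body = strip_front_matter(SAMPLE_POST)
# Keep only runs of Chinese characters; each run is what would be handed to jieba
segments = re.findall(r'[\u4e00-\u9fa5]+', body)
print(segments)
```

The title in the header does not show up in `segments`, and the English sentence and punctuation are dropped, leaving only the Chinese runs from the body.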

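4. Run the statistics

Open the terminal: use the terminal panel at the bottom of VS Code.
Run the script: python3 tags.py. If the command is not found, try replacing python3 with python.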

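Advanced tip: filtering out unwanted words

If the top of the ranking is full of words like "这个", "那个", "我们": find the STOP_WORDS line in the script and add those words, for example STOP_WORDS = {'这个', '那个', '我们', '结果'} (use straight quotes, not curly ones, or Python will raise a syntax error). Then run the script again.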
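
The effect of the stop-word set can be tried in isolation. The sketch below applies the script's filter condition to a hand-written token list (standing in for jieba output, which is an assumption made for illustration) and counts what survives.

```python
from collections import Counter

MIN_WORD_LEN = 2
STOP_WORDS = {'这个', '那个', '我们', '结果'}

# Hand-written tokens standing in for jieba output (illustrative only)
tokens = ['我们', '这个', '路由器', '配置', '的', '路由器', '结果', '配置', '路由器']

# Same condition as in the script: length filter plus stop-word filter
kept = [w for w in tokens if len(w) >= MIN_WORD_LEN and w not in STOP_WORDS]
counter = Counter(kept)
print(counter.most_common(2))
```

Only "路由器" and "配置" remain: "的" is dropped by the length filter and the other words by the stop-word set.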
