[Case study] Article collection: using AI马驹 (a browser plugin) to automatically download digital-tech article content [AI Browser]

Task prompt (DeepSeek can be used)

I want to write Node.js code that automates the collection of digital-tech articles. The following browser-automation plugin endpoints are available: http://browser.bb53m4hk65am1ogwky0n.html.dtns.top/api/tabs/open?url={url} opens a link. First, use this API to open the site URL https://m.ithome.com/digi/, then use the API http://browser.bb53m4hk65am1ogwky0n.html.dtns.top/api/page/links to get all links on the page (they include the article links I need). The returned JSON has the following format:

{"success":true,"links":[{"href":"https://m.ithome.com/digi/###","text":"","title":"","target":"","rel":"","visible":false},{"href":"https://m.ithome.com/","text":"","title":"","target":"","rel":"","visible":true},{"href":"https://www.zuihuimai.com/","text":"","title":"最会买","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953698.htm","text":"极空间私有云 Z425 轻享版发布:八盘位、万兆网口,首发价 6699 元起\n \n 08:58\n \n \n 2评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953697.htm","text":"Ayaneo Konkr Pocket Block 掌机亮相:两色可选,具备 AI 功能\n \n 08:57\n \n \n 3评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953693.htm","text":"华硕上线背插主板 TUF GAMING Z890-BTF WIFI7,全新干练设计语言\n \n 08:43\n \n \n 6评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953692.htm","text":"壹号本全新 OnexPlayer X1 Pro 掌机发布:锐龙 AI9 HX 470 处理器,65Wh 电池\n \n 08:40\n \n \n 2评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953689.htm","text":"联想 IdeaPad Slim 3i 17IWC11 笔记本发布:17 寸大屏、英特尔 Wildcat Lake 平台\n \n 08:18\n \n \n 10评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953686.htm","text":"【视频】驾驶小米YU7 GT 开启一场高原自驾之旅体验如何?\n \n 08:08\n \n \n 12评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953681.htm","text":"HMD Vibe 2 5G 手机海外发售:紫光展锐 T8200 芯片、6.75 英寸 720P+ LCD 屏,10999 卢比起\n \n 07:36\n \n \n 12评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953663.htm","text":"消息称某厂 7 英寸 2K 大屏性能机测试天玑 9 系芯片,预计为小米旗下\n \n 昨日 23:00\n \n \n 60评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953648.htm","text":"AYANEO Pocket AIR Mini 安卓掌机 Arcade Home 联名款发布:IGS 正版授权游戏,739 元起\n \n 昨日 21:46\n \n \n 11评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953641.htm","text":"小米大家电 MiCare 官方保障服务上线:限时 5 折优惠,54 元起\n \n 昨日 21:27\n \n \n 9评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953630.htm","text":"哈趣 H3 Pro 投影仪开售:400 CVIA 亮度、原生 1080P,到手价 839 元\n \n 昨日 20:50\n \n \n 16评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953615.htm","text":"AOC QD-OLED 电竞显示器 U27G4D 开售:26.5 英寸 4K 165Hz,3499 元\n \n 昨日 19:53\n \n \n 16评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953613.htm","text":"全家畅看《低智商犯罪》:爱奇艺电视白金会员 + 京东 PLUS 年卡 239 元直降\n \n 昨日 19:41\n \n \n 21评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953607.htm","text":"小米手环 10 Pro 发布:支持苹果生态、游戏模式,首销 399 元起\n \n 昨日 19:25\n \n \n 138评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953606.htm","text":"小米首款耳夹式耳机发布:IP57、单只耳机重 5.5g,首销 799 元\n \n 昨日 19:21\n \n \n 168评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953323.htm","text":"20:15~20:30 速抢京东 6.6 元大红包,不限量 1.18 元大红包人人有份\n \n 昨日 14:04\n \n \n 38评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/952469.htm","text":"赠 30 元 UP 主充电券:B站大会员年卡 233 → 108 元持平去年双 11\n \n 05月19日\n \n \n 93评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953596.htm","text":"原生 1080P 1000Hz,AOC 爱攻联合京东方发布 PRO 电竞显示器 AGP257FT\n \n 昨日 18:34\n \n \n 18评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953589.htm","text":"小天才电话手表 Z12 发布:首搭“转一转”、GSR 皮肤电传感器,2699 元\n \n 昨日 18:12\n \n \n 116评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953585.htm","text":"漫步者 HECATE 推出微动热插拔鼠标 G5M Ultra,首发价 299 元\n \n 昨日 18:00\n \n \n 19评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953585.htm","text":"漫步者 HECATE 推出微动热插拔鼠标 G5M Ultra,首发价 299 元昨日 18:0019评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953582.htm","text":"1099 → 859 元新低:安克 300W 充电宝百亿补贴,华为小米协议全兼容昨日 17:5630评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953580.htm","text":"【IT之家评测室】绿联 NAS 私有云 DXP4800 GT 深度体验:四盘位配双万兆,这配置太“GT”了昨日 17:5330评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953579.htm","text":"云深处小型轮足机器人山猫 S10 发布:极限速度 8m/s,能跨越半米高障碍昨日 17:4910评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953572.htm","text":"DEVIALET 帝瓦雷 Mania 便携音响新增晨雾白、深苔绿等配色,8980 元昨日 17:4323评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953571.htm","text":"EPOMAKER 复合碳纤维鼠标 Carbonis 国行发售:PAW3950,499 元昨日 17:3713评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953536.htm","text":"【IT之家开箱】黑金配色,简约设计,绿联 NAS 私有云 DXP4800 GT 开箱图赏昨日 17:2218评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953533.htm","text":"新机上手即更新,联想拯救者 Y900 13 平板 OTA0 已推送昨日 17:2110评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953519.htm","text":"AMD 携手多家中国台湾地区 OSAT 企业推进新一代 EFB 技术研发昨日 17:067评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953513.htm","text":"Counterpoint 报告:2026 年五一假期中国智能手机销量同比下降 16%,华为领跑市场昨日 16:5882评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953506.htm","text":"技嘉推出双模 IPS 显示器 G27U,支持 UHD 160Hz / FHD 320Hz昨日 16:4411评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953502.htm","text":"vivo Y600 Turbo 手机配置曝光:9020mAh 电池 + 天玑 8500,5 月 25 日预售昨日 16:3629评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953490.htm","text":"热泵洗烘一步到位:海信棉花糖 E8Q 一体机 13kg 京东 7999 → 4648 元新低昨日 16:0412评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953486.htm","text":"JBL CLIPS 琉璃扣耳夹耳机新增祖母绿色特别版:半透明设计,999 元昨日 15:5810评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953467.htm","text":"AMD 下一代霄龙处理器“Venice”进入量产阶段,采用台积电 2nm 工艺昨日 15:286评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953452.htm","text":"三星 Galaxy S23 Ultra 国行版获推 One UI 8.5 版本,界面设计焕然一新昨日 15:1739评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953449.htm","text":"荣耀手表 6 Plus 开启预约:续航最长 35 天 + 专业羽毛球模式昨日 15:1625评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953402.htm","text":"梅雨季将至:德施曼龙霆 3D 人脸识别智能门锁 288 元新低,支持天地钩昨日 14:5825评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953044.htm","text":"支持牙刷以旧换新:海尔云溪 582 热泵洗烘套装 12998 → 5698 元国补新低05月20日19评","title":"","target":"","rel":"","visible":true},{"href":"https://m.ithome.com/html/953372.htm","text":"荣耀 600 系列手机官宣搭载“教科书级”护眼屏,支持行业独家纸质全彩模式昨日 14:4634评","title":"","target":"","rel":"","visible":true},{"href":"https://www.ithome.com/openapp/ithome?pagetype=index&tab=news","text":"","title":"","target":"","rel":"","visible":false},{"href":"https://m.ithome.com/search/index.htm","text":"搜索关键词","title":"","target":"","rel":"","visible":false},{"href":"https://m.ithome.com/user/usercomment","text":"我的评论","title":"","target":"","rel":"","visible":false},{"href":"http://www.ithome.com/about.htm","text":"关于IT之家","title":"","target":"_blank","rel":"","visible":false},{"href":"http://www.ruanmei.com/contact/","text":"联系我们","title":"","target":"_blank","rel":"","visible":false},{"href":"http://www.ruanmei.com/","text":"软媒公司","title":"","target":"_blank","rel":"","visible":false},{"href":"https://www.ithome.com/openapp/ithome?pagetype=index&tab=news","text":"立即打开","title":"","target":"","rel":"","visible":false}]}

Only URLs of the form https://m.ithome.com/html/{numeric article ID}.htm are article links. Write the article titles and article links to a data file named {数码文章-current date}.json, saved under the data subdirectory of the current directory. Then open each article link through http://browser.bb53m4hk65am1ogwky0n.html.dtns.top/api/tabs/open?url={url}. Use the API http://browser.bb53m4hk65am1ogwky0n.html.dtns.top/api/page/visible-text/merged to get the plain-text content, which looks like this:

{"success":true,"text":"首页 数码之家 电脑硬件 极空间私有云 Z425 轻享版发布:八盘位、万兆网口,首发价 6699 元起\n2026-05-22 08:58 IT之家 - 汪淼\nIT之家 5 月 22 日消息,极空间私有云 Z425 轻享版于 5 月 21 日发布,\n首发价 6699 元起\n,国补后 5637.61 元起。\n英特尔酷睿 Ultra 5 125H 处理器(14 核 18 线程,睿频 4.5GHz)\n“八盘位”(4 个 SATA + 4 个 M.2 SSD) ,最大 160TB 存储\n标配 8GB DDR5 内存,最高可升 128GB\n万兆有线网口 + 双雷电 4 接口\n“AI 空间”功能完整继承,自然语义搜索、视频音频摘要生成、智能分类……\nIT之家附配置表如下:\n京东 极空间 Z425 轻享版\n券后 5930.1 元 领 110 元券\n京东 618 无门槛红包 面额至高 26618 元,每天抽 3 次: 点此抽红包\n淘宝 618 无门槛红包 面额至高 26888 元,每天抽 1 次: 点此抽红包\n广告声明:文内含有的对外跳转链接(包括不限于超链接、二维码、口令等形式),用于传递更多信息,节省甄选时间,结果仅供参考,IT之家所有文章均包含本声明。\n责任编辑:汪淼 投诉水文 我要纠错\n打分后显示文章质量得分,当前2人打分\n有价值 无价值\n相关文章\n【IT之家开箱】八盘位设计,质感与实用并存,极空间 Z425 私有云开箱图赏\n2025.10.16\n极空间 NAS 发布 7 月升级:夸克网盘挂载、HDMI 桌面应用、手机面对面分享等\n2025.07.29\n1499 → 666 元:极空间 Z2 Pro 私有云再降价,配备 2.5G 网口 + 4K 输出\n2025.06.02\n查看更多\n全部评论\n软媒旗下人气应用\n最会买 IT之家 要知\n关于IT之家 | 联系我们 © 2006 - 2026\n软媒公司 版权所有\n说两句... 12 打开APP","length":829}

Use the article title, article link, and text content to build a record file in Markdown format, saved to data/{current date}/{file name}.md. Please output this Node.js code.


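Task result:

Below is Node.js code that implements your requirements. It automatically collects the digital-tech articles: it fetches the links, extracts the content, and saves each article as a Markdown file.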

const fs = require('fs').promises;
const path = require('path');

// API configuration
const API_BASE = 'http://browser.bb53m4hk65am1ogwky0n.html.dtns.top/api';
const TARGET_URL = 'https://m.ithome.com/digi/';

// Helper: delay
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

// Current date as a YYYY-MM-DD string
const getTodayStr = () => {
    const now = new Date();
    return `${now.getFullYear()}-${String(now.getMonth() + 1).padStart(2, '0')}-${String(now.getDate()).padStart(2, '0')}`;
};

// Whether href is an article link (format: https://m.ithome.com/html/<digits>.htm)
const isArticleLink = (href) => {
    return href && /^https:\/\/m\.ithome\.com\/html\/\d+\.htm$/.test(href);
};

// Call the API to open a tab
const openTab = async (url) => {
    const apiUrl = `${API_BASE}/tabs/open?url=${encodeURIComponent(url)}`;
    const response = await fetch(apiUrl);
    if (!response.ok) {
        throw new Error(`Failed to open tab: ${response.statusText}`);
    }
    const data = await response.json();
    console.log(`✓ Opened: ${url}`);
    return data;
};

// Fetch all links on the page
const fetchPageLinks = async () => {
    const apiUrl = `${API_BASE}/links`;
    const response = await fetch(apiUrl);
    if (!response.ok) {
        throw new Error(`Failed to fetch links: ${response.statusText}`);
    }
    const data = await response.json();
    if (!data.success) {
        throw new Error(`Links API returned an error: ${JSON.stringify(data)}`);
    }
    return data.links || [];
};

// Fetch the plain-text content of the page
const fetchPageText = async () => {
    const apiUrl = `${API_BASE}/visible-text/merged`;
    const response = await fetch(apiUrl);
    if (!response.ok) {
        throw new Error(`Failed to fetch text: ${response.statusText}`);
    }
    const data = await response.json();
    if (!data.success) {
        throw new Error(`Text API returned an error: ${JSON.stringify(data)}`);
    }
    return data.text || '';
};

// Save the article list as JSON
const saveArticlesJson = async (articles, today) => {
    const dataDir = path.join(process.cwd(), 'data');
    await fs.mkdir(dataDir, { recursive: true });
    const filename = `数码文章-${today}.json`;
    const filepath = path.join(dataDir, filename);
    await fs.writeFile(filepath, JSON.stringify(articles, null, 2), 'utf-8');
    console.log(`✓ Saved article list: ${filepath}`);
    return filepath;
};

// Save a single article as Markdown
const saveArticleMarkdown = async (article, today) => {
    const { title, url, content } = article;
    const articleDir = path.join(process.cwd(), 'data', today);
    await fs.mkdir(articleDir, { recursive: true });

    // Use the ID extracted from the URL as the file name prefix
    const idMatch = url.match(/\/html\/(\d+)\.htm/);
    const id = idMatch ? idMatch[1] : Date.now();
    const safeTitle = title.replace(/[\\/:*?"<>|]/g, '_').substring(0, 50);
    const filename = `${id}_${safeTitle}.md`;
    const filepath = path.join(articleDir, filename);

    // Build the Markdown content
    const markdown = `# ${title}\n\n> Source: ${url}\n\n---\n\n${content}\n\n---\n\n*Generated by an automated collection tool; collected at: ${new Date().toLocaleString()}*\n`;

    await fs.writeFile(filepath, markdown, 'utf-8');
    console.log(`✓ Saved article: ${filename}`);
    return filepath;
};

// Main function
async function main() {
    console.log('=== Digital-tech article collection started ===\n');
    const today = getTodayStr();
    const articles = [];

    try {
        // 1. Open the target page
        console.log(`Step 1: opening page ${TARGET_URL}`);
        await openTab(TARGET_URL);
        await sleep(2000); // wait for the page to load

        // 2. Fetch all links on the page
        console.log('\nStep 2: fetching page links...');
        const allLinks = await fetchPageLinks();
        console.log(`Fetched ${allLinks.length} links in total`);

        // 3. Keep only the article links
        const articleLinks = allLinks.filter(link => isArticleLink(link.href));
        console.log(`Found ${articleLinks.length} article links`);

        if (articleLinks.length === 0) {
            console.log('No article links found, exiting');
            return;
        }

        // 4. Extract the basic article info
        for (const link of articleLinks) {
            // Take the title from the text field
            let title = link.text || '';
            // Collapse newlines and repeated whitespace
            // (the trailing time and comment count stay in the title)
            title = title.replace(/\n/g, ' ').replace(/\s+/g, ' ').trim();
            // If the title is empty, fall back to the ID from the URL
            if (!title) {
                const idMatch = link.href.match(/\/html\/(\d+)\.htm/);
                title = `article_${idMatch ? idMatch[1] : Date.now()}`;
            }

            articles.push({
                title: title,
                url: link.href,
                collectedAt: new Date().toISOString()
            });
        }

        // 5. Save the article list as JSON
        console.log('\nStep 3: saving article list...');
        await saveArticlesJson(articles, today);

        // 6. Visit each article, fetch its content, and save it as Markdown
        console.log('\nStep 4: collecting article content...');
        for (let i = 0; i < articles.length; i++) {
            const article = articles[i];
            console.log(`\n[${i + 1}/${articles.length}] Processing: ${article.title.substring(0, 50)}...`);

            try {
                // Open the article page
                await openTab(article.url);
                await sleep(1500); // wait for the page to load

                // Fetch the page text
                const pageText = await fetchPageText();

                // Save as Markdown
                await saveArticleMarkdown({
                    title: article.title,
                    url: article.url,
                    content: pageText
                }, today);

                // Record the content length on the in-memory article object
                // (optional; the JSON list was already written in step 5)
                article.contentLength = pageText.length;

                // Avoid sending requests too quickly
                await sleep(500);

            } catch (err) {
                console.error(`✗ Failed to process article: ${article.url}`, err.message);
                article.error = err.message;
            }
        }

        console.log('\n=== Collection finished ===');
        console.log(`- Total articles: ${articles.length}`);
        console.log(`- Article list: data/数码文章-${today}.json`);
        console.log(`- Article content: data/${today}/`);

    } catch (err) {
        console.error('Program failed:', err);
        process.exit(1);
    }
}

// Run the main function
main().catch(console.error);

Usage

1. Runtime environment

  • Node.js 18+ (built-in fetch support)
  • Make sure the network can reach the provided API service

2. How to run

# Save the code as index.js
node index.js

3. Output structure

Project directory/
├── index.js
└── data/
    ├── 数码文章-2026-05-22.json          # article list
    └── 2026-05-22/                       # grouped by date
        ├── 953698_极空间私有云Z425轻享版发布.md
        ├── 953697_Ayaneo Konkr Pocket Block掌机亮相.md
        └── ...

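4. Core features

  • Article filtering: the regex /^https:\/\/m\.ithome\.com\/html\/\d+\.htm$/ matches article links exactly
  • JSON output: saves every article title and link; the file name includes the date
  • Markdown storage: each article is saved to its own file with the title, source link, and body text
  • Error handling: a failure on one article does not stop the collection of the others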
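The sample response in the task prompt lists https://m.ithome.com/html/953585.htm twice, so filtering by regex alone would collect that article twice. A minimal sketch of filtering and de-duplicating in one pass follows; `uniqueArticleLinks` and `ARTICLE_RE` are hypothetical names that are not part of the generated script, and the `sample` array is made-up test data.

```javascript
// Matches article URLs of the form https://m.ithome.com/html/<numeric id>.htm
const ARTICLE_RE = /^https:\/\/m\.ithome\.com\/html\/(\d+)\.htm$/;

// Keep only article links and drop repeats of the same URL.
// If a URL appears more than once, the first entry is kept unless its
// text is empty and a later entry has text.
function uniqueArticleLinks(links) {
    const byUrl = new Map();
    for (const link of links) {
        if (!link || !ARTICLE_RE.test(link.href)) continue;
        const seen = byUrl.get(link.href);
        if (!seen || (!seen.text && link.text)) {
            byUrl.set(link.href, link);
        }
    }
    return [...byUrl.values()];
}

// Made-up input for illustration only
const sample = [
    { href: 'https://m.ithome.com/', text: '' },
    { href: 'https://m.ithome.com/html/953585.htm', text: 'A' },
    { href: 'https://m.ithome.com/html/953585.htm', text: 'B' },
    { href: 'https://m.ithome.com/html/953582.htm', text: 'C' },
];
console.log(uniqueArticleLinks(sample).map(link => link.href));
```

In the script, this could stand in for the `allLinks.filter(link => isArticleLink(link.href))` step.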

5. Adjustable parameters

await sleep(2000);  // page-load wait; adjust to your network conditions
await sleep(500);   // interval between requests, to avoid going too fast
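As an alternative to tuning a fixed delay, one option is to poll the text endpoint until two consecutive reads agree. The sketch below is an assumption rather than part of the generated script: `waitForStableText`, `intervalMs`, and `timeoutMs` are hypothetical names, and `fetchText` stands for any async function that returns the page text, such as the script's `fetchPageText`.

```javascript
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

// Poll `fetchText` until two consecutive reads return the same non-empty
// text, or until roughly `timeoutMs` has elapsed.
// Returns the last text that was read.
async function waitForStableText(fetchText, { intervalMs = 500, timeoutMs = 10000 } = {}) {
    const deadline = Date.now() + timeoutMs;
    let previous = null;
    while (true) {
        const current = await fetchText();
        // Two identical non-empty reads in a row: treat the page as loaded
        if (current && current === previous) return current;
        previous = current;
        // Give up if another wait would run past the deadline
        if (Date.now() + intervalMs > deadline) return current;
        await sleep(intervalMs);
    }
}
```

With this helper, `await sleep(...)` followed by `fetchPageText()` could become `const pageText = await waitForStableText(fetchPageText);`. A short initial delay may still be needed, because this check cannot tell whether the plugin is still reporting the previously opened page.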

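6. Notes

  • The API service may only be available for a limited time; make sure the endpoints are reachable
  • Keep the collection rate low; the code already includes built-in delays
  • If there are many articles, consider adding a way to resume from where a run stopped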
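One way to approach the resume idea is to skip articles whose Markdown file already exists. The sketch below assumes the file naming used by the script above (`<id>_<title>.md` inside `data/<date>/`); `loadSavedIds` and `pendingArticles` are hypothetical helpers, not part of the generated code.

```javascript
const fs = require('fs').promises;
const path = require('path');

// Read the numeric article IDs of Markdown files already present in `dir`.
// File names are expected to look like "<id>_<title>.md".
async function loadSavedIds(dir) {
    let names;
    try {
        names = await fs.readdir(dir);
    } catch (err) {
        if (err.code === 'ENOENT') return new Set(); // nothing saved yet
        throw err;
    }
    const ids = new Set();
    for (const name of names) {
        const m = name.match(/^(\d+)_.*\.md$/);
        if (m) ids.add(m[1]);
    }
    return ids;
}

// Return the articles whose ID (taken from the URL) is not in `savedIds`.
// Articles without a recognizable ID are kept, so they are still processed.
function pendingArticles(articles, savedIds) {
    return articles.filter(article => {
        const m = article.url.match(/\/html\/(\d+)\.htm/);
        return !(m && savedIds.has(m[1]));
    });
}
```

A run could then call `loadSavedIds(path.join(process.cwd(), 'data', today))` before the article loop and iterate over `pendingArticles(articles, savedIds)` instead of `articles`.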

The correct script, after fixing the delays and some API endpoint errors


const fs = require('fs').promises;
const path = require('path');

// API configuration
const API_BASE = 'http://browser.bb53m4hk65am1ogwky0n.html.dtns.top/api';
const TARGET_URL = 'https://m.ithome.com/digi/';

// Helper: delay
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

// Current date as a YYYY-MM-DD string
const getTodayStr = () => {
    const now = new Date();
    return `${now.getFullYear()}-${String(now.getMonth() + 1).padStart(2, '0')}-${String(now.getDate()).padStart(2, '0')}`;
};

// Whether href is an article link (format: https://m.ithome.com/html/<digits>.htm)
const isArticleLink = (href) => {
    return href && /^https:\/\/m\.ithome\.com\/html\/\d+\.htm$/.test(href);
};

// Call the API to open a tab
const openTab = async (url) => {
    const apiUrl = `${API_BASE}/tabs/open?url=${encodeURIComponent(url)}`;
    const response = await fetch(apiUrl);
    if (!response.ok) {
        throw new Error(`Failed to open tab: ${response.statusText}`);
    }
    const data = await response.json();
    console.log(`✓ Opened: ${url}`);
    return data;
};

// Fetch all links on the page
const fetchPageLinks = async () => {
    const apiUrl = `${API_BASE}/page/links`;
    const response = await fetch(apiUrl);
    if (!response.ok) {
        throw new Error(`Failed to fetch links: ${response.statusText}`);
    }
    const data = await response.json();
    if (!data.success) {
        throw new Error(`Links API returned an error: ${JSON.stringify(data)}`);
    }
    return data.links || [];
};

// Fetch the plain-text content of the page
const fetchPageText = async () => {
    const apiUrl = `${API_BASE}/page/visible-text/merged`;
    const response = await fetch(apiUrl);
    if (!response.ok) {
        throw new Error(`Failed to fetch text: ${response.statusText}`);
    }
    const data = await response.json();
    if (!data.success) {
        throw new Error(`Text API returned an error: ${JSON.stringify(data)}`);
    }
    return data.text || '';
};

// Save the article list as JSON
const saveArticlesJson = async (articles, today) => {
    const dataDir = path.join(process.cwd(), 'data');
    await fs.mkdir(dataDir, { recursive: true });
    const filename = `数码文章-${today}.json`;
    const filepath = path.join(dataDir, filename);
    await fs.writeFile(filepath, JSON.stringify(articles, null, 2), 'utf-8');
    console.log(`✓ Saved article list: ${filepath}`);
    return filepath;
};

// Save a single article as Markdown
const saveArticleMarkdown = async (article, today) => {
    const { title, url, content } = article;
    const articleDir = path.join(process.cwd(), 'data', today);
    await fs.mkdir(articleDir, { recursive: true });

    // Use the ID extracted from the URL as the file name prefix
    const idMatch = url.match(/\/html\/(\d+)\.htm/);
    const id = idMatch ? idMatch[1] : Date.now();
    const safeTitle = title.replace(/[\\/:*?"<>|]/g, '_').substring(0, 50);
    const filename = `${id}_${safeTitle}.md`;
    const filepath = path.join(articleDir, filename);

    // Build the Markdown content
    const markdown = `# ${title}\n\n> Source: ${url}\n\n---\n\n${content}\n\n---\n\n*Generated by an automated collection tool; collected at: ${new Date().toLocaleString()}*\n`;

    await fs.writeFile(filepath, markdown, 'utf-8');
    console.log(`✓ Saved article: ${filename}`);
    return filepath;
};

// Main function
async function main() {
    console.log('=== Digital-tech article collection started ===\n');
    const today = getTodayStr();
    const articles = [];

    try {
        // 1. Open the target page
        console.log(`Step 1: opening page ${TARGET_URL}`);
        await openTab(TARGET_URL);
        await sleep(3000); // wait for the page to load

        // 2. Fetch all links on the page
        console.log('\nStep 2: fetching page links...');
        const allLinks = await fetchPageLinks();
        console.log(`Fetched ${allLinks.length} links in total`);

        // 3. Keep only the article links
        const articleLinks = allLinks.filter(link => isArticleLink(link.href));
        console.log(`Found ${articleLinks.length} article links`);

        if (articleLinks.length === 0) {
            console.log('No article links found, exiting');
            return;
        }

        // 4. Extract the basic article info
        for (const link of articleLinks) {
            // Take the title from the text field
            let title = link.text || '';
            // Collapse newlines and repeated whitespace
            // (the trailing time and comment count stay in the title)
            title = title.replace(/\n/g, ' ').replace(/\s+/g, ' ').trim();
            // If the title is empty, fall back to the ID from the URL
            if (!title) {
                const idMatch = link.href.match(/\/html\/(\d+)\.htm/);
                title = `article_${idMatch ? idMatch[1] : Date.now()}`;
            }

            articles.push({
                title: title,
                url: link.href,
                collectedAt: new Date().toISOString()
            });
        }

        // 5. Save the article list as JSON
        console.log('\nStep 3: saving article list...');
        await saveArticlesJson(articles, today);

        // 6. Visit each article, fetch its content, and save it as Markdown
        console.log('\nStep 4: collecting article content...');
        for (let i = 0; i < articles.length; i++) {
            const article = articles[i];
            console.log(`\n[${i + 1}/${articles.length}] Processing: ${article.title.substring(0, 50)}...`);

            try {
                // Open the article page
                await openTab(article.url);
                await sleep(3500); // wait for the page to load

                // Fetch the page text
                const pageText = await fetchPageText();

                // Save as Markdown
                await saveArticleMarkdown({
                    title: article.title,
                    url: article.url,
                    content: pageText
                }, today);

                // Record the content length on the in-memory article object
                // (optional; the JSON list was already written in step 5)
                article.contentLength = pageText.length;

                // Avoid sending requests too quickly
                await sleep(500);

            } catch (err) {
                console.error(`✗ Failed to process article: ${article.url}`, err.message);
                article.error = err.message;
            }
        }

        console.log('\n=== Collection finished ===');
        console.log(`- Total articles: ${articles.length}`);
        console.log(`- Article list: data/数码文章-${today}.json`);
        console.log(`- Article content: data/${today}/`);

    } catch (err) {
        console.error('Program failed:', err);
        process.exit(1);
    }
}

// Run the main function
main().catch(console.error);
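
The title cleanup in the script collapses whitespace but leaves in place the time and comment count that the list page appends to each link text (for example `08:58 2评`). A minimal sketch of stripping that suffix follows. `cleanTitle` is a hypothetical helper, and the pattern only covers the suffix shapes visible in the sample response (`HH:MM`, `昨日 HH:MM`, and `MM月DD日`, each followed by `N评`); other shapes are an open assumption.

```javascript
// Trailing "<time or date> <N>评" suffix, with or without separating spaces.
// Covers only the shapes seen in the sample response.
const LIST_SUFFIX_RE = /\s*(?:昨日\s*)?(?:\d{1,2}:\d{2}|\d{1,2}月\d{1,2}日)\s*\d+评$/;

// Collapse whitespace, then remove the time/comment-count suffix if present.
function cleanTitle(rawText) {
    const collapsed = (rawText || '').replace(/\s+/g, ' ').trim();
    return collapsed.replace(LIST_SUFFIX_RE, '').trim();
}

console.log(cleanTitle('小米手环 10 Pro 发布\n \n 昨日 19:25\n \n \n 138评'));
```

In the script, `cleanTitle(link.text)` could replace the two chained `replace` calls in step 4. Titles that themselves end in a time-like pattern would be truncated, so the result is worth spot-checking.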

Step-by-step screenshots

1. View the API

http://browser.bb53m4hk65am1ogwky0n.html.dtns.top

2. Open 数码之家 (the IT之家 digital channel) and get the digital-tech article links

http://browser.bb53m4hk65am1ogwky0n.html.dtns.top/api/tabs/open?url={url} (opens the link)

3. Using the article links from the list, have the browser open each article automatically

http://browser.bb53m4hk65am1ogwky0n.html.dtns.top/api/tabs/open?url={url} (opens the link)

4. Get the plain-text content of the article currently open in the browser

http://browser.bb53m4hk65am1ogwky0n.html.dtns.top/api/page/visible-text/merged
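
The merged text returned by this endpoint includes page chrome after the article body. In the sample response, the body is followed by a line that starts with `广告声明:`, then related articles and the footer. A minimal sketch of cutting the text at that line follows; `trimArticleText` and `FOOTER_MARKER` are hypothetical names, and relying on this marker is an assumption drawn from the single sample shown in the task prompt.

```javascript
// Line prefix that opens the advertising disclaimer in the sample response.
// Everything from that line onward is treated as page chrome.
const FOOTER_MARKER = '广告声明:';

// Return the text before the first line that starts with the marker.
// If the marker is absent, the text is returned unchanged apart from trimming.
function trimArticleText(text) {
    const lines = (text || '').split('\n');
    const cut = lines.findIndex(line => line.startsWith(FOOTER_MARKER));
    const kept = cut === -1 ? lines : lines.slice(0, cut);
    return kept.join('\n').trim();
}

console.log(trimArticleText('title\nbody line\n广告声明:...\n相关文章'));
```

The breadcrumb at the start of the text (`首页 数码之家 ...`) is left untouched here, since the sample does not show a reliable way to separate it from the title.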

5. Command-line output from running the finished Node.js script


node .\shuma-posts.js

Summary: this completes the small task of automatically collecting digital-tech articles.

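Note: the tab bar at the top of the browser shows that the plugin opened many tabs, one for each of the digital-tech articles that were just opened and collected.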


Appendix

AI灵驹 (browser plugin)

Open-source repository:

https://gitee.com/dtnsman/dtns.ai-pet-pony/tree/master/browser-plugin