forked from jaydenwen123/GolangSpider
Commit a583aaa (1 parent: 197fd33), showing 12 changed files with 288 additions and 19 deletions.
juejin
# Crawling Juejin Articles #
----------

> [Official site: https://juejin.im](https://juejin.im)
>
> ![Juejin home page](https://github.com/jaydenwen123/GolangSpider/blob/master/GolangSpider/example/juejin/images/home.png)
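## Main Tasks ##
1. Crawl article information from Juejin's **newest**, **most popular**, and **hot list (3-day, 7-day, 30-day)** feeds
2. Search by keyword across **articles (past day, past week, past three months)**, **users**, **tags**, and **combined results**, and crawl the matching article information
3. Fetch information for all tags
4. **Follow tags** in bulk
5. Crawl **all articles under a given tag**
6. Save the crawled articles in **Markdown format**
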
## Backend API Analysis ##

### 1. Home Page APIs ###

> ![Juejin home page](https://github.com/jaydenwen123/GolangSpider/blob/master/GolangSpider/example/juejin/images/home.png)

**1.1 Popular content endpoint**

> https://web-api.juejin.im/query
> post
> **Request headers:**
> Content-Type: application/json
> X-Agent: Juejin/Web
> **Parameters:**
> `{"operationName":"","query":"","variables":{"first":20,"after":"","order":"POPULAR"},"extensions":{"query":{"id":"21207e9ddb1de777adeaca7a2fb38030"}}}`

**1.2 Hot list content endpoint**

> https://web-api.juejin.im/query
> post
> **Request headers:**
> Content-Type: application/json
> X-Agent: Juejin/Web
> **Parameters:**
> `{"operationName":"","query":"","variables":{"first":20,"after":"","order":"THREE_DAYS_HOTTEST"},"extensions":{"query":{"id":"21207e9ddb1de777adeaca7a2fb38030"}}}`

**1.3 Newest content endpoint**

> https://web-api.juejin.im/query
> post
> **Request headers:**
> Content-Type: application/json
> X-Agent: Juejin/Web
> **Parameters:**
> `{"operationName":"","query":"","variables":{"first":20,"after":"","order":"NEWEST"},"extensions":{"query":{"id":"21207e9ddb1de777adeaca7a2fb38030"}}}`

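The three home page endpoints differ only in the `order` variable, so one helper can cover all of them. The following is a minimal sketch, not the repository's actual code: the names `buildFeedPayload` and `newFeedRequest` are hypothetical, and the request is only constructed, never sent. How the `after` cursor is obtained for later pages is not documented above, so it is left as a plain parameter.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Values copied from the captured requests above.
const (
	queryURL    = "https://web-api.juejin.im/query"
	feedQueryID = "21207e9ddb1de777adeaca7a2fb38030"
)

// buildFeedPayload (hypothetical helper) serializes the request body.
// encoding/json emits map keys in sorted order, which differs from the
// captured payload; JSON object key order carries no meaning.
func buildFeedPayload(order, after string, first int) ([]byte, error) {
	payload := map[string]interface{}{
		"operationName": "",
		"query":         "",
		"variables": map[string]interface{}{
			"first": first,
			"after": after,
			"order": order,
		},
		"extensions": map[string]interface{}{
			"query": map[string]interface{}{"id": feedQueryID},
		},
	}
	return json.Marshal(payload)
}

// newFeedRequest (hypothetical helper) builds the POST request with the
// two headers listed above. It does not send anything.
func newFeedRequest(order, after string, first int) (*http.Request, error) {
	body, err := buildFeedPayload(order, after, first)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, queryURL, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Agent", "Juejin/Web")
	return req, nil
}

func main() {
	for _, order := range []string{"POPULAR", "THREE_DAYS_HOTTEST", "NEWEST"} {
		req, err := newFeedRequest(order, "", 20)
		if err != nil {
			fmt.Println("build request:", err)
			return
		}
		fmt.Println(req.Method, req.URL, order)
	}
}
```

Passing the finished request to `http.Client.Do` would perform the call; that step is omitted here so the sketch runs without network access.
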
### 2. Article Detail API ###

> ![Article detail](https://github.com/jaydenwen123/GolangSpider/blob/master/GolangSpider/example/juejin/images/article_detail.png)

**2.1 Article detail endpoint**

> https://juejin.im/post/{id}
> id: 5cf61ed3e51d4555fd20a2f3
> GET request

**2.2 Article content**

>
<h1 class="article-title" data-v-3f6f7ca1>如何提升JSON.stringify()的性能?</h1>
<div data-id="5cf7ae1b6fb9a07ef06f830a" itemprop="articleBody" class="article-content" data-v-3f6f7ca1>
.........content.....
</div>

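The title and body can be pulled out of the page markup shown above with two regular expressions. This is a sketch under assumptions: the patterns and the `extractArticle` helper are illustrative, not taken from the repository, and the non-greedy match stops at the first `</div>`, so a body containing nested `<div>` elements would be cut short.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// samplePage mirrors the markup captured in section 2.2.
const samplePage = `<h1 class="article-title" data-v-3f6f7ca1>如何提升JSON.stringify()的性能?</h1>
<div data-id="5cf7ae1b6fb9a07ef06f830a" itemprop="articleBody" class="article-content" data-v-3f6f7ca1>
.........content.....
</div>`

// (?s) lets "." match newlines so a multi-line body is captured.
var (
	titleRe   = regexp.MustCompile(`(?s)<h1[^>]*class="article-title"[^>]*>(.*?)</h1>`)
	contentRe = regexp.MustCompile(`(?s)<div[^>]*class="article-content"[^>]*>(.*?)</div>`)
)

// extractArticle (hypothetical helper) returns the trimmed title and body,
// or ok=false when either element is missing.
func extractArticle(page string) (title, content string, ok bool) {
	t := titleRe.FindStringSubmatch(page)
	c := contentRe.FindStringSubmatch(page)
	if t == nil || c == nil {
		return "", "", false
	}
	return strings.TrimSpace(t[1]), strings.TrimSpace(c[1]), true
}

func main() {
	title, content, ok := extractArticle(samplePage)
	if !ok {
		fmt.Println("article markup not found")
		return
	}
	fmt.Println("title:", title)
	fmt.Println("content:", content)
}
```

For real pages an HTML parser would be more robust than regular expressions; the regex form is shown because it matches the approach listed under the key techniques.
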
### 3. Search API ###

> ![Search](https://github.com/jaydenwen123/GolangSpider/blob/master/GolangSpider/example/juejin/images/search.png)

**3.1 Juejin search endpoint**

> https://web-api.juejin.im/query
> post
> **Request headers:**
> User-Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36
> X-Agent: Juejin/Web
> **Parameters:**
>
> `{"operationName":"","query":"","variables":{"type":"ALL","query":"golang","after":"","period":"ALL","first":20},"extensions":{"query":{"id":"d9997080c3d67a02bfdae094729fed3b"}}}`
> `{"operationName":"","query":"","variables":{"type":"ALL","query":"golang","after":"","period":"M3","first":20},"extensions":{"query":{"id":"d9997080c3d67a02bfdae094729fed3b"}}}`
>
> // type: ALL/ARTICLE/TAG/USER
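The search body can be modeled with structs so that the field order matches the captured payload. The sketch below is illustrative only: the type names and `buildSearchPayload` are hypothetical. It validates `type` against the four values listed above; `period` is passed through unchecked because only `ALL` and `M3` appear in the captures and the full set of values is not documented here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// searchQueryID is copied from the captured search requests above.
const searchQueryID = "d9997080c3d67a02bfdae094729fed3b"

// The structs below are hypothetical; their field order mirrors the capture.
type searchVariables struct {
	Type   string `json:"type"`
	Query  string `json:"query"`
	After  string `json:"after"`
	Period string `json:"period"`
	First  int    `json:"first"`
}

type queryRef struct {
	ID string `json:"id"`
}

type extensions struct {
	Query queryRef `json:"query"`
}

type searchPayload struct {
	OperationName string          `json:"operationName"`
	Query         string          `json:"query"`
	Variables     searchVariables `json:"variables"`
	Extensions    extensions      `json:"extensions"`
}

// validTypes holds the type values listed in the notes above.
var validTypes = map[string]bool{"ALL": true, "ARTICLE": true, "TAG": true, "USER": true}

// buildSearchPayload (hypothetical helper) returns the JSON request body.
func buildSearchPayload(searchType, keyword, period, after string, first int) ([]byte, error) {
	if !validTypes[searchType] {
		return nil, fmt.Errorf("unsupported search type %q", searchType)
	}
	p := searchPayload{
		Variables: searchVariables{
			Type: searchType, Query: keyword, After: after, Period: period, First: first,
		},
		Extensions: extensions{Query: queryRef{ID: searchQueryID}},
	}
	return json.Marshal(p)
}

func main() {
	body, err := buildSearchPayload("ALL", "golang", "M3", "", 20)
	if err != nil {
		fmt.Println("build payload:", err)
		return
	}
	fmt.Println(string(body))
}
```
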

### 4. Tags ###

> ![Tags](https://github.com/jaydenwen123/GolangSpider/blob/master/GolangSpider/example/juejin/images/tag.png)

**4.1 Fetching tag information**

> https://gold-tag-ms.juejin.im/v1/tags/type/hot/page/1/pageSize/40
> Origin: https://juejin.im
> Referer: https://juejin.im/subscribe/all
> User-Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
> X-Juejin-Client: 1559818729874
> X-Juejin-Src: web
> X-Juejin-Token: eyJhY2Nlc3NfdG9rZW4iOiI3d2MzSG9Sb0JOeEV3dnpkIiwicmVmcmVzaF90b2tlbiI6ImdhbklJaE9LdnRJVWdBSkUiLCJ0b2tlbl90eXBlIjoibWFjIiwiZXhwaXJlX2luIjoyNTkyMDAwfQ==
> X-Juejin-Uid: 5ce8befdf265da1bd1463390

**4.2 Following a tag**

> addTagUrl=`https://gold-tag-ms.juejin.im/v1/tag/subscribe/5597a23fe4b08a686ce5a7c4`
> // PUT
> **Request headers:**
> User-Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
> X-Juejin-Client: 1559818729874
> X-Juejin-Src: web
> X-Juejin-Token: eyJhY2Nlc3NfdG9rZW4iOiI3d2MzSG9Sb0JOeEV3dnpkIiwicmVmcmVzaF90b2tlbiI6ImdhbklJaE9LdnRJVWdBSkUiLCJ0b2tlbl90eXBlIjoibWFjIiwiZXhwaXJlX2luIjoyNTkyMDAwfQ==
> X-Juejin-Uid: 5ce8befdf265da1bd1463390

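Both tag endpoints carry the same `X-Juejin-*` headers, so the request construction can be shared. The sketch below is an assumption-laden illustration: the `credentials` struct and the helpers `newTagRequest`, `hotTagsURL`, and `subscribeURL` are hypothetical names, the credential values are placeholders, and requests are built but not sent. The bulk-follow task then reduces to one PUT request per tag id.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// tagServiceBase is the common prefix of the two URLs captured above.
const tagServiceBase = "https://gold-tag-ms.juejin.im/v1"

// credentials (hypothetical) groups the per-session header values.
type credentials struct {
	Client string // sent as X-Juejin-Client
	Token  string // sent as X-Juejin-Token
	UID    string // sent as X-Juejin-Uid
}

// newTagRequest builds, without sending, a request with the X-Juejin-* headers.
func newTagRequest(method, rawURL string, c credentials) (*http.Request, error) {
	req, err := http.NewRequest(method, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Juejin-Src", "web")
	req.Header.Set("X-Juejin-Client", c.Client)
	req.Header.Set("X-Juejin-Token", c.Token)
	req.Header.Set("X-Juejin-Uid", c.UID)
	return req, nil
}

// hotTagsURL returns the tag-list URL; page and pageSize are path segments.
func hotTagsURL(page, pageSize int) string {
	return fmt.Sprintf("%s/tags/type/hot/page/%d/pageSize/%d", tagServiceBase, page, pageSize)
}

// subscribeURL returns the follow-tag URL for one tag id.
func subscribeURL(tagID string) string {
	return tagServiceBase + "/tag/subscribe/" + url.PathEscape(tagID)
}

func main() {
	// Placeholder values; real ones come from a logged-in session.
	cred := credentials{Client: "<client-id>", Token: "<token>", UID: "<uid>"}

	fmt.Println("tag list:", hotTagsURL(1, 40))

	// Bulk follow: one PUT request per tag id.
	for _, id := range []string{"5597a23fe4b08a686ce5a7c4"} {
		req, err := newTagRequest(http.MethodPut, subscribeURL(id), cred)
		if err != nil {
			fmt.Println("build request:", err)
			return
		}
		fmt.Println(req.Method, req.URL)
	}
}
```
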
**4.3 Fetching all articles for a tag**

> ![Tag articles](https://github.com/jaydenwen123/GolangSpider/blob/master/GolangSpider/example/juejin/images/tag_article.png)
> **GET request**
>
> `https://timeline-merger-ms.juejin.im/v1/get_tag_entry?src=web&uid=5ce8befdf265da1bd1463390&device_id=1559818729874&token=eyJhY2Nlc3NfdG9rZW4iOiI3d2MzSG9Sb0JOeEV3dnpkIiwicmVmcmVzaF90b2tlbiI6ImdhbklJaE9LdnRJVWdBSkUiLCJ0b2tlbl90eXBlIjoibWFjIiwiZXhwaXJlX2luIjoyNTkyMDAwfQ%3D%3D&tagId=5597a063e4b08a686ce57030&page=0&pageSize=20&sort=rankIndex`

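The tag-article URL above is a plain GET with query parameters, and the token's trailing `==` has to be percent-encoded as `%3D%3D`. A sketch of building it with `net/url` follows; `tagEntryParams` and `tagEntryURL` are hypothetical names, and the credential values in `main` are placeholders. `url.Values.Encode` sorts parameters by key, so the order differs from the capture, which should not matter for a query string.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// tagEntryEndpoint is copied from the captured URL above.
const tagEntryEndpoint = "https://timeline-merger-ms.juejin.im/v1/get_tag_entry"

// tagEntryParams (hypothetical) holds the variable parts of the query.
type tagEntryParams struct {
	UID      string
	DeviceID string
	Token    string
	TagID    string
	Page     int
	PageSize int
}

// tagEntryURL builds the full URL; Encode percent-escapes each value.
func tagEntryURL(p tagEntryParams) string {
	q := url.Values{}
	q.Set("src", "web")
	q.Set("uid", p.UID)
	q.Set("device_id", p.DeviceID)
	q.Set("token", p.Token)
	q.Set("tagId", p.TagID)
	q.Set("page", strconv.Itoa(p.Page))
	q.Set("pageSize", strconv.Itoa(p.PageSize))
	q.Set("sort", "rankIndex")
	return tagEntryEndpoint + "?" + q.Encode()
}

func main() {
	// Walk the first three pages of one tag with placeholder credentials.
	for page := 0; page < 3; page++ {
		fmt.Println(tagEntryURL(tagEntryParams{
			UID:      "<uid>",
			DeviceID: "<device-id>",
			Token:    "<token>",
			TagID:    "5597a063e4b08a686ce57030",
			Page:     page,
			PageSize: 20,
		}))
	}
}
```
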
## Results ##

**1. Article list**
> ![Article list](https://github.com/jaydenwen123/GolangSpider/blob/master/GolangSpider/example/juejin/images/article_list.png)

**2. Article detail**
> ![Article detail](https://github.com/jaydenwen123/GolangSpider/blob/master/GolangSpider/example/juejin/images/article_show.png)

**3. Download log**
> ![Download log](https://github.com/jaydenwen123/GolangSpider/blob/master/GolangSpider/example/juejin/images/downloadlog.png)

**4. Article summary information**
> ![Article summary information](https://github.com/jaydenwen123/GolangSpider/blob/master/GolangSpider/example/juejin/images/article_info.png)

## Key Techniques ##
1. Converting HTML to Markdown in Go
2. Extracting article HTML with regular expressions
3. Sending HTTP requests with different methods: PUT, DELETE, GET, POST
4. Parsing JSON data with gjson

## References ##
1. [html2text: https://jaytaylor.com/html2text](https://jaytaylor.com/html2text)
2. [gjson: https://github.com/tidwall/gjson](https://github.com/tidwall/gjson)

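## Planned Improvements ##
1. Save article data to ElasticSearch and expose a search interface through a web UI
2. Code blocks come out badly formatted when articles are saved as Markdown; this is to be improved later
3. Use Redis to deduplicate articles that have already been crawled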
[
  {
    "Title": "中高级前端大厂面试秘籍,为你保驾护航金三银四,直通大厂(上)",
    "Id": "5c64d3b7f265da2d943f4acb",
    "OriginalUrl": "https://juejin.im/post/5c64d15d6fb9a049d37f9c20",
    "CommentCount": 389,
    "LikeCount": 3929
  },
  {
    "Title": "2018前端面试总结,看完弄懂,工资少说加3K | 掘金技术征文",
    "Id": "5b94d9d9e51d450e9704a4cb",
    "OriginalUrl": "https://juejin.im/post/5b94d8965188255c5a0cdc02",
    "CommentCount": 101,
    "LikeCount": 4174
  },
  {
    "Title": "这一次,彻底弄懂 JavaScript 执行机制",
    "Id": "5a13e00bf265da432b4a729f",
    "OriginalUrl": "https://juejin.im/post/59e85eebf265da430d571f89",
    "CommentCount": 395,
    "LikeCount": 4007
  },
  {
    "Title": "一名【合格】前端工程师的自检清单",
    "Id": "5cc2511e5188252e843b539c",
    "OriginalUrl": "https://juejin.im/post/5cc1da82f265da036023b628",
    "CommentCount": 456,
    "LikeCount": 3472
  },
  {
    "Title": "技术胖155集前端视频教程-全部免费观看",
    "Id": "5a5be14cf265da3e303c73dd",
    "OriginalUrl": "https://juejin.im/post/5a5bc8c36fb9a01ca26774eb",
    "CommentCount": 138,
    "LikeCount": 3567
  },
  {
    "Title": "近两万字小程序攻略发布了",
    "Id": "5b8fd30ce51d450ea362f0d3",
    "OriginalUrl": "https://juejin.im/post/5b8fd1416fb9a05cf3710690",
    "CommentCount": 80,
    "LikeCount": 3611
  },
  {
    "Title": "总结了17年初到18年初百场前端面试的面试经验(含答案)",
    "Id": "5b44a4bf6fb9a04faf4790b0",
    "OriginalUrl": "https://juejin.im/post/5b44a485e51d4519945fb6b7",
    "CommentCount": 114,
    "LikeCount": 3051
  },
  {
    "Title": "新年献礼 技术胖262集前端免费视频 让您走的更容易些",
    "Id": "5c11c080e51d4536ee0b92fb",
    "OriginalUrl": "https://juejin.im/post/5c11bf145188252704368b98",
    "CommentCount": 595,
    "LikeCount": 3161
  },
  {
    "Title": "2018 Java 后端工程师的书单推荐",
    "Id": "59c2f4b05188252c237f85c4",
    "OriginalUrl": "https://juejin.im/post/59c2f3e16fb9a00a600f6a5c",
    "CommentCount": 47,
    "LikeCount": 1309
  },
  {
    "Title": "疑因内部宫斗被离职,中兴70后程序员从公司坠楼 ",
    "Id": "5a329923f265da432840e359",
    "OriginalUrl": "https://juejin.im/post/5a32942ef265da43104868fe",
    "CommentCount": 243,
    "LikeCount": 117
  },
  {
    "Title": "2018春招前端面试: 闯关记(精排精校) | 掘金技术征文",
    "Id": "5a9d5cf66fb9a028db583215",
    "OriginalUrl": "https://juejin.im/post/5a998991f265da237f1dbdf9",
    "CommentCount": 172,
    "LikeCount": 2858
  },
  {
    "Title": "ES6、ES7、ES8、ES9、ES10新特性一览",
    "Id": "5ca2f441e51d457b0d00ffc8",
    "OriginalUrl": "https://juejin.im/post/5ca2e1935188254416288eb2",
    "CommentCount": 97,
    "LikeCount": 2045
  },
  {
    "Title": "2万5千字大厂面经 | 掘金技术征文",
    "Id": "5ba3644ae51d450e5766fb40",
    "OriginalUrl": "https://juejin.im/post/5ba34e54e51d450e5162789b",
    "CommentCount": 55,
    "LikeCount": 2089
  },
  {
    "Title": "前端常用插件、工具类库汇总,不要重复造轮子啦!!!",
    "Id": "5ba7d9485188255c5c45f043",
    "OriginalUrl": "https://juejin.im/post/5ba7d5dd5188255c6140cc9d",
    "CommentCount": 103,
    "LikeCount": 3173
  },
  {
    "Title": "干货!各种常见布局实现+知名网站实例分析",
    "Id": "5aa7246ff265da239f070791",
    "OriginalUrl": "https://juejin.im/post/5aa252ac518825558001d5de",
    "CommentCount": 72,
    "LikeCount": 2336
  },
  {
    "Title": "大型项目前端架构浅谈(8000字原创)",
    "Id": "5cea200c6fb9a07ef56212b2",
    "OriginalUrl": "https://juejin.im/post/5cea1f705188250640005472",
    "CommentCount": 309,
    "LikeCount": 2251
  }
]
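
The saved article list above can be read back with the standard library. The following sketch is not the repository's code (the README names gjson for parsing): the `Article` struct, `parseArticles`, and `sortByLikes` are hypothetical, and `sample` contains three records copied from the data above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Article (hypothetical type) mirrors the keys of one saved record.
type Article struct {
	Title        string `json:"Title"`
	ID           string `json:"Id"`
	OriginalURL  string `json:"OriginalUrl"`
	CommentCount int    `json:"CommentCount"`
	LikeCount    int    `json:"LikeCount"`
}

// sample holds three records copied from the saved list above.
const sample = `[
  {"Title": "中高级前端大厂面试秘籍,为你保驾护航金三银四,直通大厂(上)", "Id": "5c64d3b7f265da2d943f4acb", "OriginalUrl": "https://juejin.im/post/5c64d15d6fb9a049d37f9c20", "CommentCount": 389, "LikeCount": 3929},
  {"Title": "2018前端面试总结,看完弄懂,工资少说加3K | 掘金技术征文", "Id": "5b94d9d9e51d450e9704a4cb", "OriginalUrl": "https://juejin.im/post/5b94d8965188255c5a0cdc02", "CommentCount": 101, "LikeCount": 4174},
  {"Title": "疑因内部宫斗被离职,中兴70后程序员从公司坠楼 ", "Id": "5a329923f265da432840e359", "OriginalUrl": "https://juejin.im/post/5a32942ef265da43104868fe", "CommentCount": 243, "LikeCount": 117}
]`

// parseArticles decodes a JSON array of saved records.
func parseArticles(data []byte) ([]Article, error) {
	var articles []Article
	if err := json.Unmarshal(data, &articles); err != nil {
		return nil, err
	}
	return articles, nil
}

// sortByLikes orders articles in place, most liked first.
func sortByLikes(articles []Article) {
	sort.Slice(articles, func(i, j int) bool {
		return articles[i].LikeCount > articles[j].LikeCount
	})
}

func main() {
	articles, err := parseArticles([]byte(sample))
	if err != nil {
		fmt.Println("parse:", err)
		return
	}
	sortByLikes(articles)
	for _, a := range articles {
		fmt.Println(a.LikeCount, a.OriginalURL)
	}
}
```
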