0. Preface
Java's built-in java.net package is not pleasant to use. After fiddling with Maven dependencies for quite a while, I finally reproduced the exact behavior of my Python crawler using HttpClient. If I find the time I will write a GUI for it, but right now I am still fighting the spread of Omicron; if I lose that fight I will be fighting Omicron itself. Wish me luck.
1. Java's Built-in URL Class
Before that, let's look at how Java's built-in URL class works. I'll use downloading an image as the example.
import java.net.*;
import java.io.*;

public class Main {
    public static void main(String[] args) {
        String strurl = "http:///wp-content/uploads/2022/11/98987278.png";
        try {
            URL url = new URL(strurl);
            // Open the URL connection
            URLConnection con = url.openConnection();
            // Get the connection's input stream
            InputStream input = con.getInputStream();
            // Data buffer
            byte[] bs = new byte[1024 * 2];
            // Number of bytes read in each pass
            int len;
            // Output stream that saves the image locally
            OutputStream os = new FileOutputStream("a.png");
            while ((len = input.read(bs)) != -1) {
                os.write(bs, 0, len);
            }
            os.close();
            input.close();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
A quick note on how the InputStream contents end up as an image file: much like in C++, we allocate a byte array as a buffer and move the data chunk by chunk. read() returns the number of bytes actually read (or -1 at end of stream), and the three-argument write() takes the buffer holding the data, an offset (usually 0), and the number of bytes to write.
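The same copy loop can be exercised without any network access; here is a minimal sketch using in-memory streams (the class and method names are my own, for illustration only):

```java
import java.io.*;

public class StreamCopyDemo {
    // Copies everything from in to out with a fixed-size buffer,
    // exactly like the download loop above.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[2 * 1024];
        int len; // number of bytes actually read this round
        while ((len = in.read(buf)) != -1) {
            out.write(buf, 0, len); // write only the bytes that were read
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello image bytes".getBytes();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(data), out);
        System.out.println(out.toString()); // prints "hello image bytes"
    }
}
```

The key detail is writing `buf, 0, len` rather than the whole buffer: the last read almost never fills the buffer completely, and writing all of it would append stale bytes to the file.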
2. Implementation with HttpClient
For how to set up Maven, please search on your own. Two libraries are used here: HttpClient, for sending GET requests, and fastjson, for parsing JSON. A third useful one (not used here) is jsoup, which is roughly Java's BeautifulSoup. Their dependencies are declared as follows.
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.4</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.47</version>
</dependency>
The rest of the approach is exactly the same as the Python version, so I won't repeat it here.
package org.example;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.*;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BiliSpyder {
    // 1. Create the HttpClient object
    private CloseableHttpClient httpclient = HttpClients.createDefault();
    private String id;

    public BiliSpyder(String id) {
        this.id = id;
    }

    public BiliSpyder() {
        this("401742377");
    }

    public CloseableHttpResponse getResponse(String url) throws IOException {
        // 2. Create the HttpGet request and set a browser User-Agent
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36");
        // 3. Send the request
        CloseableHttpResponse response = this.httpclient.execute(httpGet);
        return response;
    }

    public void downLoadImage(String url, String title) throws IOException {
        CloseableHttpResponse response = getResponse(url);
        InputStream in = response.getEntity().getContent();
        byte[] bs = new byte[1024 * 2];
        // Number of bytes read in each pass
        int len;
        // Output stream that saves the image locally
        OutputStream os = new FileOutputStream(title + ".jpg");
        while ((len = in.read(bs)) != -1) {
            os.write(bs, 0, len);
        }
        os.close();
        in.close();
        response.close();
    }

    public JSONArray getData(String url) throws IOException {
        // Get the response
        CloseableHttpResponse response = this.getResponse(url);
        String html = "";
        // Check the status code before reading the body
        if (response.getStatusLine().getStatusCode() == 200) { // 200 means success
            html = EntityUtils.toString(response.getEntity(), "UTF-8");
        }
        // Drill down to data -> list -> vlist, the array of videos
        JSONObject dataDict = JSONObject.parseObject(html);
        JSONObject data2 = dataDict.getJSONObject("data").getJSONObject("list");
        JSONArray data = data2.getJSONArray("vlist");
        return data;
    }

    public void run() throws IOException {
        // Build the URL
        int page = 1;
        String url_pattern = "https://api.bilibili.com/x/space/wbi/arc/search?mid=%s&ps=30&tid=0&pn=%d&keyword=&order=pubdate&order_avoided=true&w_rid=64a17313d0ab4fe3a74503517fe017b4&wts=1671862074";
        JSONArray data = this.getData(String.format(url_pattern, id, page));
        File directory = new File(JSONObject.parseObject(data.get(0).toString()).get("author").toString());
        if (directory.exists()) {
            System.out.println("Directory already exists");
        } else {
            directory.mkdirs();
        }
        while (data.size() > 0) {
            System.out.println(data.size());
            // 4. For each video, take its cover URL and title, strip characters
            // that are illegal in file names, then download the cover
            for (int i = 0; i < data.size(); i++) {
                JSONObject video = JSONObject.parseObject(data.get(i).toString());
                String url = video.get("pic").toString();
                String title = video.get("title").toString();
                Pattern pattern = Pattern.compile("[\\\\/:*?\"<>|]");
                Matcher matcher = pattern.matcher(title);
                title = matcher.replaceAll("");
                downLoadImage(url, directory.getPath() + "/" + title);
            }
            page++;
            data = getData(String.format(url_pattern, id, page));
        }
        // 5. Release resources
        this.httpclient.close();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Enter UID:");
        String id;
        Scanner input = new Scanner(System.in);
        id = input.next();
        BiliSpyder spyder = new BiliSpyder(id);
        spyder.run();
    }
}
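The title-sanitizing regex inside run() can be tried on its own. A minimal sketch (the class name TitleSanitizer is my own, for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleSanitizer {
    // Strip characters that are illegal in Windows file names:
    // \ / : * ? " < > |
    static String sanitize(String title) {
        Pattern pattern = Pattern.compile("[\\\\/:*?\"<>|]");
        Matcher matcher = pattern.matcher(title);
        return matcher.replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("a/b:c*d?e\"f<g>h|i")); // prints "abcdefghi"
    }
}
```

Without this step, a video title such as "foo/bar?" would make FileOutputStream throw, since those characters cannot appear in a file name.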
3. Afterword
My Java study was supposed to wrap up here, but the advanced Java textbook I ordered has arrived. Skimming through it, it mostly covers Java data structures, advanced GUI topics, and the like, so I may not work through all of it. I'm still deciding what to do with the rest of the winter break.