Example Code: A Multithreaded C# Crawler for Scraping Free Proxies
Author: L-H  Date: 2023-02-08 01:22:01
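This example relies on an HTML parsing helper library, HtmlAgilityPack. If you don't have it, download a copy and add it to your project references. The library comes in several builds, so pick the one that matches your development environment: the 2.0 build for VS2005, the 4.0 build for VS2010, and so on. Then create a console application, replace the generated code with the code from this article, and it should run.

A few words about my own crawler experience from two years ago. I was building one for a company, also in C#, and ran into a real headache: some of the images it downloaded carried viruses, and the system went down several times. So pay attention to security when you scrape images from websites. This example does not touch images, but I still want to give readers a heads-up.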
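For readers who have not used HtmlAgilityPack before, here is a minimal sketch of the kind of calls the crawler depends on: load an HTML string, select rows with XPath, and read cell text. The HTML fragment and the names `HapSketch`, `SampleHtml` and `ExtractEndpoints` are made up for illustration; the real page layout is different and may change at any time.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // external library, must be referenced by the project

static class HapSketch
{
    // Invented fragment for illustration only; not the real page markup.
    const string SampleHtml =
        "<table>" +
        "<tr><th>Country</th><th>IP</th><th>Port</th></tr>" +
        "<tr><td>CN</td><td>10.0.0.1</td><td>8080</td></tr>" +
        "<tr><td>CN</td><td>10.0.0.2</td><td>3128</td></tr>" +
        "</table>";

    // Returns "ip:port" for every data row; header rows have no <td> and are skipped.
    public static List<string> ExtractEndpoints(string html)
    {
        var endpoints = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//table//tr[td]");
        if (rows == null) return endpoints;

        foreach (HtmlNode row in rows)
        {
            // Selecting ./td skips the whitespace text nodes that ChildNodes would include.
            HtmlNodeCollection cells = row.SelectNodes("./td");
            if (cells == null || cells.Count < 3) continue;
            endpoints.Add(cells[1].InnerText.Trim() + ":" + cells[2].InnerText.Trim());
        }
        return endpoints;
    }

    static void Main()
    {
        foreach (string endpoint in ExtractEndpoints(SampleHtml))
            Console.WriteLine(endpoint);
    }
}
```

Selecting `./td` rather than indexing into `ChildNodes` keeps the cell indexes independent of how much whitespace the page puts between cells.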
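using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Threading;
using HtmlAgilityPack;

class Program
{
    // Holds every proxy collected by the worker threads.
    public static List<proxy> masterPorxyList = new List<proxy>();

    // A single proxy entry.
    public class proxy
    {
        public string ip;
        public string port;
        public int speed;

        public proxy(string pip, string pport, int pspeed)
        {
            this.ip = pip;
            this.port = pport;
            this.speed = pspeed;
        }
    }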
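    // Fetches one listing page and stores the fast proxies found on it.
    static void getProxyList(object pageIndex)
    {
        string urlCombin = "http://www.xicidaili.com/wt/" + pageIndex.ToString();
        string catchHtml = catchProxIpMethord(urlCombin, "UTF8");
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(catchHtml);
        HtmlNode table = doc.DocumentNode.SelectSingleNode("//div[@id='wrapper']//div[@id='body']/table[1]");
        // SelectSingleNode and SelectNodes return null when nothing matches,
        // for example when the download failed and catchHtml is empty.
        if (table == null) return;
        HtmlNodeCollection collectiontrs = table.SelectNodes("./tr");
        if (collectiontrs == null) return;
        // The first row of the table is the header, so start from the second <tr>.
        for (int i = 1; i < collectiontrs.Count; i++)
        {
            HtmlAgilityPack.HtmlNode itemtr = collectiontrs[i];
            // ChildNodes also counts the text nodes between cells,
            // which is presumably why the cell indexes below are odd numbers.
            HtmlNodeCollection collectiontds = itemtr.ChildNodes;
            if (collectiontds.Count <= 13) continue;
            HtmlNode itemtdip = collectiontds[3];
            HtmlNode itemtdport = collectiontds[5];
            HtmlNode itemtdspeed = collectiontds[13];
            string ip = itemtdip.InnerText.Trim();
            string port = itemtdport.InnerText.Trim();
            string speed = itemtdspeed.InnerHtml;
            // The speed value is the number between the first ':' and the first '%'.
            int beginIndex = speed.IndexOf(":");
            int endIndex = speed.IndexOf("%");
            if (beginIndex < 0 || endIndex <= beginIndex) continue;
            int subSpeed;
            if (!int.TryParse(speed.Substring(beginIndex + 1, endIndex - beginIndex - 1).Trim(), out subSpeed)) continue;
            // A speed bar value above 90 means the proxy is fast.
            if (subSpeed > 90)
            {
                proxy temp = new proxy(ip, port, subSpeed);
                // List<T> is not thread-safe, and several pool threads add to it concurrently.
                lock (masterPorxyList)
                {
                    masterPorxyList.Add(temp);
                    Console.WriteLine("Fast proxies collected so far: " + masterPorxyList.Count.ToString());
                }
            }
        }
    }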
    // Downloads a page and returns its HTML, or an empty string on failure.
    static string catchProxIpMethord(string url, string encoding)
    {
        string htmlStr = "";
        try
        {
            if (!String.IsNullOrEmpty(url))
            {
                WebRequest request = WebRequest.Create(url);
                // "UTF8" selects UTF-8; any other value falls back to the system default encoding.
                Encoding ec = encoding == "UTF8" ? Encoding.UTF8 : Encoding.Default;
                // The using blocks close the response, stream and reader even if reading throws.
                using (WebResponse response = request.GetResponse())
                using (Stream datastream = response.GetResponseStream())
                using (StreamReader reader = new StreamReader(datastream, ec))
                {
                    htmlStr = reader.ReadToEnd();
                }
            }
        }
        catch (Exception ex)
        {
            // Report the failure instead of swallowing it silently.
            Console.WriteLine("Failed to fetch " + url + ": " + ex.Message);
        }
        return htmlStr;
    }
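    static void Main(string[] args)
    {
        // Fetch 15 pages concurrently on the thread pool.
        for (int i = 1; i <= 15; i++)
        {
            ThreadPool.QueueUserWorkItem(getProxyList, i);
        }
        // Pool threads are background threads, so block here to keep the process alive.
        Console.Read();
    }
}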
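The program above shares one list between all pool threads and relies on `Console.Read()` to keep the process alive. As a rough sketch of an alternative, the example below collects results in a `ConcurrentBag<T>` and uses a `CountdownEvent` to block until every queued page has finished. The names `PoolSketch`, `FetchPage` and `RunAll` are hypothetical, and `FetchPage` only simulates work instead of making a real request.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

static class PoolSketch
{
    // Stand-in for the real page fetch; it only simulates some work.
    static string FetchPage(int pageIndex)
    {
        Thread.Sleep(10);
        return "page-" + pageIndex;
    }

    // Queues one work item per page and blocks until all of them have finished.
    public static ConcurrentBag<string> RunAll(int pageCount)
    {
        // ConcurrentBag allows concurrent Add calls without an explicit lock.
        var results = new ConcurrentBag<string>();
        using (var done = new CountdownEvent(pageCount))
        {
            for (int i = 1; i <= pageCount; i++)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try { results.Add(FetchPage((int)state)); }
                    finally { done.Signal(); } // count down even if the fetch throws
                }, i);
            }
            done.Wait(); // returns once Signal has been called pageCount times
        }
        return results;
    }

    static void Main()
    {
        ConcurrentBag<string> results = RunAll(15);
        Console.WriteLine("Pages processed: " + results.Count);
    }
}
```

With this shape the main thread knows when the crawl is complete, so it can go on to sort, filter or save the collected proxies instead of waiting for a key press.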
Source: http://www.cnblogs.com/xiaoliao/p/7436711.html?utm_source=tuicool&utm_medium=referral
Tags: C#, crawler