Where search begins, or a few thoughts about the crawler

Continuing the thread about building my own search engine.

So, there are a few major tasks that a search system must solve. Let's start with the fact that each individual page has to be fetched and stored.
There are several ways to do this, depending on which processing methods you choose later.

Obviously, you need a queue of pages to download from the web, if only to look through them on long winter evenings when nothing better comes to mind. I prefer to keep a queue of sites with their home pages, plus a small local queue for the site I am processing at the moment. The reason is simple: the list of all the pages I would like to load in just one month can significantly exceed the capacity of my rather large hard drive :) so I keep only what is really needed: the sites, currently 600 thousand of them, with their priorities and load times.


When a page is loaded, every link on it should be added either to the local queue, if it stays within the site I am processing, or to the main list of sites that I will return to sooner or later.
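
The routing rule above can be sketched in Python roughly as follows. This is only an illustration, not the author's actual code: the function name `route_links`, the two deques, and the two "already seen" sets are all assumptions. It also assumes that a site is identified by its exact host name, so "www.site.ru" and "site.ru" would count as different sites.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit


def route_links(page_url, hrefs, local_queue, site_queue, seen_pages, known_sites):
    """Split the links found on page_url between the local queue and the site list."""
    current_host = urlsplit(page_url).hostname
    for href in hrefs:
        # Resolve relative links against the page and drop any #fragment.
        url = urldefrag(urljoin(page_url, href)).url
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.hostname:
            continue  # skip mailto:, javascript: and similar links
        if parts.hostname == current_host:
            # Same site: goes to the small local queue, once per URL.
            if url not in seen_pages:
                seen_pages.add(url)
                local_queue.append(url)
        elif parts.hostname not in known_sites:
            # Another site: remember only its home page in the main list.
            known_sites.add(parts.hostname)
            site_queue.append(f"{parts.scheme}://{parts.hostname}/")


if __name__ == "__main__":
    local, sites = deque(), deque()
    route_links(
        "http://site.ru/a/index.html",
        ["b.html", "/c.html#top", "http://other.ru/x/y"],
        local, sites, set(), set(),
    )
    print(list(local))
    print(list(sites))
```

Only the home page of a foreign site is stored, which keeps the main list small in the spirit of the text; priorities and load times would be extra fields on each site entry.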

How many pages should be fetched from one site in a single pass? Personally, I prefer no more than 100 thousand, although from time to time I lower this limit to just 1000 pages. Besides, there are not that many sites with more pages than that.
Now let's do the math:

If we fetch 1 page at a time, all pages one after another, how many pages will we process in, say, an hour?
The time to fetch a page is made up of:
· the time we wait for the DNS response (which, as practice shows, is not small). DNS maps the site name "site.ru" to the IP address of the server that hosts it, and that is not the easiest task, given that sites tend to move, packet routes change, and much more. In short, a DNS server keeps a table of addresses, and every time we fetch a page we knock on its door to find out where to go for it.
· the time to connect and send the request (fast if you have at least an average channel)
· the time to receive the actual response with the page

This is why Yandex, rumor has it, ran into the first problem back in the day: if you fetch really a lot of pages, the provider's DNS cannot cope. In my experience the delay was up to 10 seconds per address, especially since the answer still has to travel back and forth over the network, and I am not the provider's only customer. Note that if you request 1000 pages from one site sequentially, you will hit the provider 1000 times.

With modern hardware it is quite simple to set up a caching DNS server on the local network and hand this work to it instead of the provider; the provider will then forward your packets faster. Alternatively, you can take the trouble to write a cache inside your page downloader, if you write at a low level.
If you use ready-made solutions such as LWP or the HTTP modules for Perl, a local DNS server is the best option.
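
For the "cache inside the downloader" option, a minimal Python sketch might look like the one below. The class name `DnsCache` and the fixed one-hour lifetime are assumptions made for illustration; a real cache would rather honor the TTL of each DNS record, which `socket.gethostbyname` does not expose. The resolver and the clock are passed in as parameters so that the cache can be tried out without touching the network.

```python
import socket
import time


class DnsCache:
    """In-process cache of host name -> IP address with a fixed lifetime."""

    def __init__(self, ttl_seconds=3600.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl = ttl_seconds
        self.resolver = resolver  # called only on a cache miss
        self.clock = clock
        self._entries = {}        # host -> (ip, expires_at)

    def resolve(self, host):
        now = self.clock()
        entry = self._entries.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]       # still fresh: no DNS request is sent
        ip = self.resolver(host)  # miss or expired: ask the real resolver
        self._entries[host] = (ip, now + self.ttl)
        return ip
```

With such a cache, requesting 1000 pages from one site costs one DNS lookup per lifetime window instead of 1000.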

Now assume that a response reaches you in 1-10 seconds on average: some servers are fast, some are very slow. Then you get 6-60 pages per minute, 360-3600 per hour, and roughly 8000 to 60000 per day (I deliberately round down to allow for all sorts of delays: in reality, requesting 1 page at a time without a local DNS on a 100 Mbit/s channel, you will get about 10,000 pages a day, provided of course that the sites are different rather than one very fast site).
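
The estimate above is plain division, and a few lines of Python reproduce it. The helper name `pages_fetched` is made up for this illustration. The exact daily figures come out as 8640 and 86400; the text rounds them down on purpose to allow for delays.

```python
def pages_fetched(seconds_per_page, period_seconds):
    """Whole pages fetched sequentially in a period at a fixed time per page."""
    return int(period_seconds // seconds_per_page)


for label, period in (("minute", 60), ("hour", 3600), ("day", 86400)):
    slow = pages_fetched(10, period)  # 10 seconds per page
    fast = pages_fetched(1, period)   # 1 second per page
    print(f"per {label}: {slow}-{fast} pages")
```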

And given that this estimate does not even include the time to process and save the pages, the result is, frankly, miserable.

OK, I said, and made 128 requests in parallel. Everything flew along perfectly, peaking at 120 thousand pages per hour, until obscene messages started arriving from the admins of the servers I was knocking on, complaining about a DDoS attack. And indeed, 5000 requests in 5 minutes is probably more than most hosting allows.

The solution was to load 8-16 different sites at the same time, with no more than 2-3 pages in parallel from each. That came out to about 20-30 thousand pages per hour, and that's fine by me. I must say that at night the figures are much better.
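
One possible way to express this limit is with two levels of semaphores in Python's asyncio: one caps the number of sites in progress, the other caps parallel requests per site. This is a sketch under assumptions, not the author's implementation: `crawl`, `crawl_site`, and the `fetch` coroutine, which you would supply yourself, are hypothetical names.

```python
import asyncio


async def crawl_site(urls, fetch, site_slots, pages_per_site):
    """Fetch one site's pages, at most pages_per_site at a time."""
    async with site_slots:  # wait until one of the site slots is free
        page_slots = asyncio.Semaphore(pages_per_site)

        async def fetch_one(url):
            async with page_slots:  # per-site politeness limit
                return await fetch(url)

        return await asyncio.gather(*(fetch_one(u) for u in urls))


async def crawl(sites, fetch, max_sites=16, pages_per_site=3):
    """sites maps a site name to the list of its page URLs."""
    site_slots = asyncio.Semaphore(max_sites)
    results = await asyncio.gather(
        *(crawl_site(urls, fetch, site_slots, pages_per_site)
          for urls in sites.values())
    )
    return dict(zip(sites, results))
```

A call such as `asyncio.run(crawl(sites, fetch, max_sites=16, pages_per_site=3))` then never has more than 16 × 3 = 48 requests in flight, and never more than 3 against any single server.
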
The full table of contents and the list of my articles on the search engine will be updated here: http://habrahabr.ru/blogs/search_engines/123671/
Article based on information from habrahabr.ru
