Web Crawlers

"Crawler4j is an open source Java Crawler which provides a simple interface for crawling the web. Using it, you can setup a multi-threaded web crawler in 5 minutes!"
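
To make the idea of a multi-threaded crawler concrete, here is a minimal sketch using only the Python standard library. It is not Crawler4j's API: the names `crawl`, `extract_links`, and `LinkParser` are hypothetical, and the `fetch` callable (which returns HTML text, or None on failure) is an assumption that keeps the download step pluggable. The crawl proceeds breadth-first and fetches each batch of URLs in parallel on a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for the links in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed, fetch, max_pages=100, workers=4):
    """Breadth-first crawl from seed; fetch(url) returns HTML text or None.

    Returns a dict mapping each successfully fetched URL to its HTML.
    """
    seen = {seed}        # URLs already queued, so nothing is fetched twice
    frontier = [seed]    # URLs waiting to be fetched
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(pages) < max_pages:
            # Take no more URLs than there is room for under max_pages.
            room = max_pages - len(pages)
            batch, frontier = frontier[:room], frontier[room:]
            # pool.map runs fetch concurrently and preserves input order.
            for url, html in zip(batch, pool.map(fetch, batch)):
                if html is None:
                    continue
                pages[url] = html
                for link in extract_links(url, html):
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return pages
```

In practice `fetch` could be a thin wrapper around `urllib.request.urlopen` that returns None on errors. A real crawler would also need politeness delays, robots.txt handling, and a restriction to the domains it is allowed to visit, all of which this sketch leaves out.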

Key features

Support for http, https, ftp, nntp and news URL schemes.
htdb virtual URL scheme for indexing SQL databases.
Indexes text/html, text/xml, text/plain, audio/mpeg (mp3) and image/gif MIME types natively....

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Heritrix (sometimes spelled heretrix, or misspelled or mis-said as heratrix/heritix/heretix/heratix) is an archaic word for...

HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility.

It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and...
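
As an illustration of how an offline mirror can rebuild a site's directory tree locally, the sketch below maps a URL to a file path under a local root. This is not HTTrack's actual layout or code: the function name `local_path`, the default root `mirror`, and the choice of `index.html` for directory URLs are all assumptions made for the example.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def local_path(url, root="mirror"):
    """Map a URL to a path under root that mirrors the site's directories."""
    parts = urlsplit(url)
    path = unquote(parts.path)
    # Directory-style URLs (empty path or trailing slash) get a default file name.
    if not path or path.endswith("/"):
        path += "index.html"
    # Drop empty, "." and ".." segments so the result cannot escape root.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return PurePosixPath(root, parts.netloc, *segments)
```

A mirroring tool would create the parent directories of this path, write the downloaded bytes there, and rewrite links inside the saved HTML so they point at the local copies. Query strings are ignored here, so two URLs that differ only in their query would collide.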

Apache Nutch is an open source web-search software project. Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.

OpenSearchServer is an open source search engine and crawler software based on the best open source technologies.

Give your users the best search experience: Suggestion box (autocompletion), spell checking, facet and filter search,...

"OpenWebSpider would be the base for a new Search engine developed from a community of opensource developers, it has a powerful features like MP3 and PDF support!!!"

Spider is a complete standalone Java application designed to easily integrate varied data sources.

XML-driven framework for data retrieval from network-accessible sources
Scheduled pulling
Highly extensible...

Web-Harvest is an open source web data extraction tool written in Java. It offers a way to collect desired web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/xml...

"YaCy is a free search engine that anyone can use to build a search portal for their intranet or to help search the public internet. YaCy can be operated as a Private Search Appliance and YaCy can also operate as a peer in a peer-to-peer search...

