Presentation
Circa is a search engine for your Web site, or for a list of sites. It
indexes pages the way AltaVista does: it reads each page, parses it and
adds every URL found in it, provided the page is on the same server.
Circa is free software, released under the GNU license.
Try it!
Run a search on AlianWebServer.
Features
- Full-text indexing
- Different weights for the title, keywords, description and the rest of
the HTML page can be set in the configuration
- Boolean query language support: or (default), and ("+"),
not ("-"). Ex. perl + faq -cgi : documents that contain faq, possibly
perl, and not cgi.
- Supports the HTTP and FTP protocols
- Builds its index in MySQL
- Perl or PHP client
- Reads HTML and plain text
- Can index a filesystem directly, without going through the Web server
- Can browse a site by directory / category.
- Several kinds of indexing: full, incremental, or only on a particular
server. Documents that have not been updated are not reindexed. Every
request for a file is first made as an HTTP HEAD request, to get
information such as validity, last update, size, etc.
- The size of documents read can be restricted (e.g. skip all documents
> 5 MB). Useful with low-bandwidth connections, or on computers that
do not have much memory.
- HTML template can be easily customized for your needs.
- Search by different criteria: news, last modified date, language,
URL / site.
- Admin functions available by browser interface or command-line.
- Full support of the robots exclusion standard (robots.txt). Identifies
itself as CircaIndexer/0.1, mail alian@alianwebserver.com.
- Requests to the same server are delayed by 8 seconds. "It's not a
bug, it's a feature!" A basic rule for limiting HTTP server load.
- Indexes the different links found in a CGI (everything after
name_of_file?)
- Supports HTTP proxies
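The Boolean query rules listed above can be illustrated with a short sketch. This is not Circa's own code (Circa is written in Perl); it is a minimal Python illustration under one reading of the example: words marked "+" must all be present, words marked "-" must all be absent, and unmarked words are alternatives that only decide the match when no "+" word is given. The function names parse_query and matches are hypothetical, and a lone "+" or "-" is assumed to apply to the word that follows it.

```python
def parse_query(query):
    """Split a query into optional, required ("+") and excluded ("-") words."""
    optional, required, excluded = [], [], []
    pending = None  # a lone "+" or "-" applies to the next word (assumption)
    for token in query.lower().split():
        if token in ("+", "-"):
            pending = token
            continue
        sign, word = pending, token
        if token[0] in "+-" and len(token) > 1:
            # Sign glued to the word, as in "-cgi"
            sign, word = token[0], token[1:]
        pending = None
        if sign == "+":
            required.append(word)
        elif sign == "-":
            excluded.append(word)
        else:
            optional.append(word)
    return optional, required, excluded


def matches(text, query):
    """Return True if the text satisfies the query under the rules above."""
    words = set(text.lower().split())
    optional, required, excluded = parse_query(query)
    if any(w in words for w in excluded):
        return False
    if required:
        return all(w in words for w in required)
    # No "+" word: plain "or" between the remaining words
    return any(w in words for w in optional)


if __name__ == "__main__":
    print(parse_query("perl + faq -cgi"))
    print(matches("the perl faq", "perl + faq -cgi"))
    print(matches("a faq about cgi", "perl + faq -cgi"))
```

In a real engine the unmarked words would normally still raise the rank of a document; the sketch only answers whether a document matches.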
To do
- Support NNTP
- Support for different character sets
- Support for other databases
Requirements
- MySQL
- Perl
- Modules DBI, DBD::mysql, LWP::RobotUA, HTML::LinkExtor
Benchmark
Memory: indexing: 5.5 MB
Processor: on a Sun SPARCstation 4: (5 seconds at 2%, 2 s at
20%, 1 s at 30%) per URL indexed.
Size in MySQL: 2-5 KB per URL.
Building the index is heavy work, so it is not suited to CGI. Try to use
admin.pl to update the index; if you don't have telnet access, try to
launch the process in the background from another CGI. Or install MySQL
on a local disk, build your index, and export it to your search machine.
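To get a feel for these figures, the arithmetic can be sketched as follows. This is only a rough estimate, not a measurement: it takes the 2-5 KB per URL quoted above and the 8-second delay between requests to the same server listed under Features, and assumes 1 MB = 1024 KB and that all URLs are on a single server. The function name estimate is hypothetical.

```python
def estimate(url_count, kb_per_url=(2, 5), delay_seconds=8):
    """Return ((low_mb, high_mb), hours) for indexing url_count URLs.

    The size range comes from the 2-5 KB per URL figure; the time is only
    the waiting imposed by the delay between requests to one server, so
    it is a lower bound on the real crawl time.
    """
    low_kb, high_kb = kb_per_url
    size_mb = (url_count * low_kb / 1024, url_count * high_kb / 1024)
    # One delay between each pair of consecutive requests
    waits = max(url_count - 1, 0)
    hours = waits * delay_seconds / 3600
    return size_mb, hours


if __name__ == "__main__":
    (low, high), hours = estimate(10000)
    print(f"index size: {low:.1f} to {high:.1f} MB")
    print(f"minimum crawl time on one server: {hours:.1f} hours")
```

Even for a site of modest size the delay alone adds up to hours, which is one more reason to run the indexer from the command line and not from a CGI.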
Install
- Download one of the archive files and uncompress it.
- Edit search.cgi and search.pl (search scripts) and admin.cgi and
admin.pl (admin scripts) to set your MySQL parameters: user, password,
database, and IP address if different from 'localhost'.
- Run admin.cgi (CGI interface) or admin.pl (command line) to add your
URLs, drop or create tables, ... I suggest using admin.pl on the
command line, because indexing can take a long time and is not suited
to CGI.
- Run search.cgi. You can reuse the default form in your own page.
Only the 'words' field is required.
- For customized HTML results, look in the file circa.htm
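Since only the 'words' field is required, a link to a search result can be built by hand. The sketch below shows one way to do it in Python. It assumes that search.cgi accepts the field in the query string of a GET request, which is usual for CGI scripts but is not stated here. The host name and path are placeholders, and search_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def search_url(base_url, words):
    """Build a URL that passes the 'words' field to search.cgi."""
    # urlencode escapes spaces as "+" and a literal "+" as "%2B",
    # so the "+" and "-" query operators survive the trip.
    return base_url + "?" + urlencode({"words": words})


if __name__ == "__main__":
    # Placeholder location: replace with where search.cgi is installed
    base = "http://www.example.com/cgi-bin/search.cgi"
    print(search_url(base, "perl +faq -cgi"))
```

Escaping matters here: an unescaped "+" in a query string is decoded as a space, which would silently turn an "and" word into a plain "or" word.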
Documentation
POD documentation is available. To read it, run
pod2html name_of_file.pm > name_of_file.html
Download
If you have root privileges and can install Perl modules, you can install
these two modules: Circa::Search
and Circa::Indexer. See the demo directory
for how to use these modules. Install Circa::Indexer first.
Otherwise, you can use this distribution:
ZIP format
or tar.gz format
Author
Alain BARBET alian@alianwebserver.com
Reference
Rules and security:
http://info.webcrawler.com/mak/projects/robots/robots.html
Features:
http://search.mnogo.ru/features.html
Why?
I read about this need, I needed a search engine for AlianWebServer, and
I think other people need one too.