extract-links - build links database
extract-links [options]
extract-links [options] URI directory
extract-links [options] URL
This program extracts links and builds the databases used by LinkController's various other programs.
The first database built is the links database, containing information about the status of all of the links that are being checked. The other two are in cdb format and can be used as indexes for identifying which files contain which URIs and vice versa.
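The cdb indexes can be consulted with the standard cdb tools (see cdbget below) or from Perl. The following is only a sketch: the index file name url.cdb and its key layout are assumptions, not necessarily the names extract-links actually uses.

    use CDB_File;

    # Assumed index: URI => a file that contains it.
    tie my %uri_index, 'CDB_File', 'url.cdb'
        or die "cannot tie url.cdb: $!";

    my $uri = 'http://www.example.com/docs/index.html';
    print "$uri is referenced from $uri_index{$uri}\n"
        if exists $uri_index{$uri};

    untie %uri_index;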
In file mode the program goes through a directory tree. It requires a base URI, which is the URI that would be used on the World Wide Web to reference the directory in which all of the files are contained. This is used to convert internal references into full URIs, which can then be used to check that the files are visible from the outside.
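For example, converting an internal reference into a full URI can be done with the URI module; the base and relative values here are invented for illustration.

    use URI;

    my $base     = 'http://www.example.com/docs/';   # base URI of the tree
    my $relative = '../images/logo.png';             # link found in a file
    my $absolute = URI->new_abs($relative, $base);

    print "$absolute\n";   # http://www.example.com/images/logo.png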
In WWW mode the program goes through a set of World Wide Web pages generating the databases.
The program requires a base URL, which is where it starts from. Its default mode is to work only down from that URL; that is, it will only gather URIs from WWW pages whose URL starts with the base URL.
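Roughly speaking, this rule amounts to a prefix test like the following sketch (the variable names and URLs are illustrative only):

    my $base_url   = 'http://www.example.com/docs/';
    my @found_uris = ('http://www.example.com/docs/a.html',
                      'http://other.example.org/b.html');

    # Keep only URIs that start with the base URL.
    my @to_visit = grep { index($_, $base_url) == 0 } @found_uris;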
There are two regular expressions which can be given for filtering. If the regular expression given with --exclude-regex matches a file name, then that file will not be read in. If the regular expression given with --prune-regex matches a directory name, then that entire directory and all of its subdirectories are excluded.
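The distinction matters during the directory walk: a pruned directory is never entered, while an excluded file is merely skipped. The following File::Find sketch shows the idea; the sample regular expressions and the traversal code are illustrations, not extract-links' own implementation.

    use File::Find;

    my $exclude = qr/\.(?:gif|jpe?g|png)$/;      # e.g. --exclude-regex
    my $prune   = qr{(?:^|/)(?:CVS|\.git)$};     # e.g. --prune-regex

    find(sub {
        if (-d && $File::Find::name =~ $prune) {
            $File::Find::prune = 1;              # skip this directory tree
            return;
        }
        return if -d || /$exclude/;              # skip excluded files
        # ... read the file and extract its links here ...
    }, '/var/www/html');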
By default, extract-links extracts and refreshes all of the infostructures listed in the file $::infostrucs. The file looks like this:

    #mode  url                          directory
    www    http://myserver.example.com  /var/www/html
This is covered in detail in the LinkController reference manual.
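Reading such a file is straightforward. This sketch assumes plain whitespace-separated mode, url and directory fields with '#' starting a comment, which is all the example above shows; the reference manual is authoritative.

    open my $fh, '<', $::infostrucs     # normally set in ~/.link-control.pl
        or die "cannot open $::infostrucs: $!";

    while (<$fh>) {
        next if /^\s*#/ or /^\s*$/;     # skip comments and blank lines
        my ($mode, $url, $directory) = split ' ';
        print "refresh $mode infostructure $url rooted at $directory\n";
    }
    close $fh;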
There are several configuration files:
$HOME/.link-control.pl - base configuration file
This contains configuration variables which point to further files.
$::links - link database
$::infostrucs - infostructure configuration file
Full details of the format of these configuration files can be found in the LinkController reference manual.
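As a rough illustration only, a $HOME/.link-control.pl might simply set the variables listed above; the paths here are invented and the real format is described in the reference manual.

    # Hypothetical ~/.link-control.pl
    $::links      = "$ENV{HOME}/.link-control/links.db";
    $::infostrucs = "$ENV{HOME}/.link-control/infostrucs";
    1;    # a Perl configuration file must return a true value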
Unlike other programs, which tend to resort to closing and re-opening files containing lists of links, or to holding them all in memory, this program uses a file containing a list of links to follow the recursion through the WWW without actually using recursive functions. This relies on output written to an unbuffered file being available for input immediately afterwards.
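The technique looks roughly like the sketch below: new links are appended to a plain file while earlier entries are still being read back from it. The file name and loop body are placeholders.

    use IO::Handle;

    open my $queue, '+>', 'work-queue.tmp' or die "cannot open queue: $!";
    $queue->autoflush(1);                  # output must be readable at once

    print {$queue} "http://www.example.com/docs/index.html\n";   # seed URI

    my $read_pos = 0;
    while (1) {
        seek $queue, $read_pos, 0;         # back to where reading stopped
        defined(my $uri = <$queue>) or last;    # queue exhausted
        $read_pos = tell $queue;
        chomp $uri;
        seek $queue, 0, 2;                 # to the end, ready for appending
        # ... fetch/parse $uri and print any newly found links to $queue ...
    }
    close $queue;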
We also use a temporary database to record which links have been seen before. This could get LARGE.
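A minimal sketch of such a seen-before record, assuming an on-disk hash tied with DB_File (the database extract-links actually creates may differ):

    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    tie my %seen, 'DB_File', 'seen-links.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "cannot tie seen-links.db: $!";

    my $uri = 'http://www.example.com/docs/index.html';
    unless ($seen{$uri}++) {
        print "new link: $uri\n";          # first encounter: queue it
    }

    untie %seen;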
The HTML parsing is done by the Perl HTML parser. This provides excellent and controllable results, but a custom parser carefully written in C would be a lot faster. This program takes a long time to run, and since it is run under human control this matters. If anyone knows of an efficient but good C-based parser, suggestions would be gratefully accepted. Direct interface compatibility with the current Perl parser would be even better.
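Link extraction with the standard Perl parser modules looks roughly like this; it is a sketch of the technique, not extract-links' own code, and the file name and base URI are made up.

    use HTML::LinkExtor;
    use URI;

    my $base = 'http://www.example.com/docs/';
    my @links;

    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, map { URI->new_abs($_, $base)->as_string } values %attr;
    });
    $parser->parse_file('/var/www/html/docs/index.html');

    print "$_\n" for @links;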
I think the program can get trapped in directories it can change into but cannot read (mode o+x,o-r). This should be fixed.
This program could put a large load on a given server if accidentally let loose where it shouldn't be. This is your responsibility, since it isn't reasonable to slow the program down when it is being used on a local machine or LAN. Some warning should be provided, e.g. an out-of-local-domain check.
I don't know whether the tied database is really needed, but I want to allow massive link collections.
the verify-link-control manpage; the extract-links manpage; the build-schedule manpage; the link-report manpage; the fix-link manpage; the link-report.cgi manpage; the fix-link.cgi manpage; the suggest manpage; the configure-link-control manpage
cdbmake, cdbget, cdbmultiget
The LinkController manual, included in the distribution in HTML, info, and postscript formats.
http://scotclimb.org.uk/software/linkcont - the LinkController homepage.