
3. Advanced Configuration

There are various advanced ways to configure LinkController. Most of them are not needed for simple checking of a small collection of web pages. For larger sites and special situations, however, they can make life much easier.

3.1 Advanced Infostructure Configuration  Advanced control of checking
3.2 Authorisation Configuration  Checking pages which require basic authentication.
3.3 Configuring CGI Programs  Setting up LinkController's web interface



3.1 Advanced Infostructure Configuration

With more advanced configuration it is possible to skip over certain resources during link extraction and to ignore some of the links found. You may want to skip over this section initially and come back to it only when you find that there are links or pages being checked that you would rather avoid.

For this section, we assume that you already know how to write basic Perl code. If not, then please read through the Perl manual pages `perl', `perlsyn' and `perldata'. Even so, you may find that the examples given below are sufficient to get you started.

In order to get extract-links to extract links using an advanced infostructure, you must use the advanced keyword in the `infostrucs' file. Infostructures which are defined in the configuration but not listed there will be ignored, but won't cause any harm.

Advanced configuration is done in the `.link-controller.pl' configuration file by adding definitions to the %::infostrucs hash. These look like the following:

 
$::infostrucs{'http://www.mypages.org/'} = {
   mode => "directory",
   file_base => "/home/myself/www",
   prune_re => '^(/home/myself/www/statistics)' # ignore referrals
              . '|(cgi-bin)', # do CGIs separately
   resource_exclude_re => '\.secret$', # secrets should stay secret
   link_exclude_re => '^http://([a-z]+\.)+example\.com',
};

$::infostrucs{'http://www.mypages.org/cgi-bin/'} = {
   mode => "www",
   resource_exclude_re => "query", # query space is infinite!!
};

There are a number of keywords that can be used.

`mode'
This decides how to download the links: either `www' or `directory'.
`file_base'
If defined, this gives the directory which corresponds to the URL where the infostructure is based. This must be defined if the mode is set to `directory'.
`resource_include_re'
If defined, this regular expression must match the URL of a resource before links will be extracted from it.
`resource_exclude_re'
If defined, this regular expression must not match the URL of a resource; links are extracted only from resources it does not match.
`link_include_re'
If defined, this regular expression must match a URL found in a resource before that URL will be extracted and saved.
`link_exclude_re'
If defined, this regular expression must not match a URL found in a resource; only URLs it does not match are extracted and saved.
`prune_re'
Used only in directory mode, this will completely exclude all files and sub-directories of directories matched by the regular expression.
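
To make the effect of pruning concrete, here is a minimal sketch using the standard File::Find module. It only illustrates the idea of skipping a matched directory together with everything below it; LinkController's own directory walk may well be implemented differently, and the function name files_after_pruning and the directory tree are invented for the example.

```perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Build a small throwaway tree to walk (names invented for the example).
my $base = tempdir(CLEANUP => 1);
make_path("$base/docs", "$base/statistics/2002", "$base/cgi-bin");
for my $file ("index.html", "docs/a.html",
              "statistics/2002/hits.html", "cgi-bin/form.html") {
    open my $fh, '>', "$base/$file" or die "cannot create $base/$file: $!";
    close $fh;
}

# Hypothetical walker: lists plain files, skipping every directory whose
# path matches $prune_re together with all files and sub-directories in it.
sub files_after_pruning {
    my ($dir, $prune_re) = @_;
    my @files;
    find({ no_chdir => 1, wanted => sub {
        if (-d $File::Find::name and $File::Find::name =~ m/$prune_re/) {
            $File::Find::prune = 1;   # do not descend into this directory
            return;
        }
        push @files, $File::Find::name if -f $File::Find::name;
    } }, $dir);
    return sort @files;
}

# Only index.html and docs/a.html remain in the listing.
print "$_\n" for files_after_pruning($base, '/statistics$|/cgi-bin$');
```

Note that the file below `statistics/2002' is never visited, even though the expression does not match that sub-directory itself.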

N.B. the exclude and include regular expressions can be used together. For a match, the include regular expression must match and the exclude must not match. In other words excludes override includes.
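
The rule for combining the two kinds of expression can be sketched in a few lines of Perl. This is an illustration of the rule as described here, not LinkController's own code; the function name url_wanted and the sample URLs are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of LinkController: decides whether a URL
# passes a pair of optional include / exclude regular expressions.
sub url_wanted {
    my ($url, $include_re, $exclude_re) = @_;
    # an include expression, if defined, must match
    return 0 if defined $include_re and $url !~ m/$include_re/;
    # an exclude expression, if defined, must not match
    return 0 if defined $exclude_re and $url =~ m/$exclude_re/;
    return 1;
}

my $include = '^http://www\.mypages\.org/';
my $exclude = '\.secret$';

for my $url ('http://www.mypages.org/index.html',
             'http://www.mypages.org/plans.secret',
             'http://www.example.com/index.html') {
    printf "%-40s %s\n", $url,
        url_wanted($url, $include, $exclude) ? "kept" : "skipped";
}
```

Of the three URLs only the first is kept: the second matches the include expression but is rejected by the exclude, and the third never matches the include.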

In order for the infostructure to be used by extract-links an entry must still be made in the `infostrucs' file. For this use the advanced keyword. The second argument is a URL used to look up the definition in the $::infostrucs hash.

 
advanced   http://www.mypages.org/
advanced   http://www.mypages.org/cgi-bin/

The URL used here must match exactly the one used in the hash. It is important to note that `directory' and `www' definitions in the `infostrucs' file will override any advanced configuration given.
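
Because the match has to be exact, a missing trailing slash is enough to lose a definition. The following sketch shows one way such mismatches could be spotted. It is not a tool shipped with LinkController: the function name missing_definitions is hypothetical, and the sample data stands in for the real %::infostrucs hash and `infostrucs' file.

```perl
use strict;
use warnings;

# Assumed data for illustration; in practice this would be the
# %::infostrucs hash from `.link-controller.pl'.
my %infostrucs = (
    'http://www.mypages.org/'         => { mode => "directory" },
    'http://www.mypages.org/cgi-bin/' => { mode => "www" },
);

# Lines as they might appear in the `infostrucs' file.
my @lines = (
    "advanced   http://www.mypages.org/",
    "advanced   http://www.mypages.org/cgi-bin",   # trailing slash missing
);

# Returns the advanced URLs which have no definition under exactly that key.
sub missing_definitions {
    my ($defs, @entries) = @_;
    my @missing;
    for my $line (@entries) {
        my ($keyword, $url) = split ' ', $line;
        next unless defined $url and $keyword eq 'advanced';
        push @missing, $url unless exists $defs->{$url};
    }
    return @missing;
}

print "no definition for $_\n" for missing_definitions(\%infostrucs, @lines);
```

Here the second line is reported, since `http://www.mypages.org/cgi-bin' is not the same key as `http://www.mypages.org/cgi-bin/'.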



3.2 Authorisation Configuration

One problem when checking links, especially within an intranet, is that some pages may be protected with basic authentication. In order to extract links from those pages, or simply to know that they are there, we have to get through that authentication. By using the advanced Authorisation Configuration we can give LinkController authority to access these pages and allow link checking to work as normal.

 
Using this method to allow LinkController to work in an
environment with authentication is inherently a security issue since
authentication tokens must be stored, effectively in plaintext, in
files.  This risk may, however, not be much higher than the one that you
currently accept, so this can be useful.

We store the authentication tokens in the %::credentials hash, which we create in the `.link-controller.pl' configuration file. The keys of the hash are the exact realm strings which will be sent by the web server. Each value of this hash is a reference to a hash with a pair of keys. The `credentials' key should be associated with the authentication token. The `uri_re' key should be a regular expression which matches the web pages you want to visit. For security reasons it shouldn't match any others.

 
%::credentials = (
  my_realm => { uri_re => "https://myhost.example.com",
                credential => "my_secret" }
);

As a sanity check, every `uri_re' will be tried on `http://3133t3hax0rs.rhere.com' and `http://3133t3hax0rs.rhere.com/secretstuff/www.goodplace.com/'. If the expression matches either of them then the credentials will be ignored. If you know enough to do this safely then you should definitely know how to get past this check. The owners of the domain `3133t3hax0rs.rhere.com' will just have to hack the code.
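
You can try a candidate expression against the same two URLs yourself before putting it into the configuration file. The sketch below only mimics the check as it is described here; how LinkController itself performs the test may differ, and the function name uri_re_looks_safe and the candidate expressions are made up for the example.

```perl
use strict;
use warnings;

# The two URLs named in the text above.
my @bad_urls = (
    'http://3133t3hax0rs.rhere.com',
    'http://3133t3hax0rs.rhere.com/secretstuff/www.goodplace.com/',
);

# Hypothetical helper: true if the expression matches neither test URL.
sub uri_re_looks_safe {
    my ($uri_re) = @_;
    for my $url (@bad_urls) {
        return 0 if $url =~ m/$uri_re/;
    }
    return 1;
}

for my $re ('^https://myhost\.example\.com/', 'www\.goodplace\.com', '.*') {
    printf "%-35s %s\n", $re,
        uri_re_looks_safe($re) ? "ok" : "would be ignored";
}
```

The first expression is anchored to the start of the URL and passes. The second is unanchored, so it also matches the host name buried in the path of the second test URL, and the third matches everything; both would lose their credentials.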

For more discussion of the security risks and how to mitigate them, see the file `authorisation.pod' included with the LinkController distribution. If you didn't understand the security risk from the above description then you should probably avoid using this mechanism.



3.3 Configuring CGI Programs

The CGI programs use the same configuration variables as the other programs. However, to avoid any confusion and related security problems, a Perl script should be written which has the configuration variables hard-wired in and then runs the appropriate CGI program. configure-link-cgi is a program designed to set up such a script.

FIXME: this section needs to be rewritten.



This document was generated by Michael De La Rue on February 3, 2002 using texi2html