How to generate full visitor count from an Apache log file

Gabor Szabo (szabgab)

Jan 10, 2012

CPOL

Count how many hits were generated from each IP address and show the top 10 sources.

Introduction

In the previous article, I described how to create a report from an Apache log file for the number of hits from localhost vs. elsewhere. That script can be easily changed to provide a report for any single IP address vs. the rest of the world just by replacing the IP address with another address.

It can also be changed to produce a full visitor count, showing how many hits came from each IP address. From there it is easy to show the top 10 sources, or to filter them in some other way.

Background

As a reminder, in the default format each line of the Apache log file starts like this:

127.0.0.1 - - [10/Apr/2007:10:39:11 +0300] ...
127.0.0.1 - - [10/Apr/2007:10:39:11 +0300] ...
139.12.0.2 - - [10/Apr/2007:10:40:54 +0300] ...
217.1.20.22 - - [10/Apr/2007:10:40:54 +0300] ...

The IP address is everything before the first space. So if we put any single line in the $line variable, index gives us the position of the first space and substr extracts the text in front of it:

my $length = index ($line, " ");
my $ip = substr($line, 0, $length);
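
The two lines above assume that the line contains at least one space. If it does not, index returns -1 and substr then returns the line without its last character. The following is a minimal sketch of a more defensive variant; the helper name extract_ip is hypothetical and is not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text before the first space,
# or undef (in scalar context) if the line contains no space at all.
sub extract_ip {
    my ($line) = @_;
    my $length = index($line, " ");
    return if $length < 0;    # index() returns -1 when nothing is found
    return substr($line, 0, $length);
}

my $ip = extract_ip('127.0.0.1 - - [10/Apr/2007:10:39:11 +0300] ...');
print "$ip\n";
```

Whether malformed lines should be skipped or reported depends on the log being processed; the sketch only makes the case detectable.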

Using the code

In order to count an arbitrary set of strings, we need a data structure that can map strings to scalar values. In Perl, this data structure is called an "associative array" or, for short, a "hash". In other languages, a similar thing might be called a map, a dictionary, or a look-up table.

A hash is basically an unordered set of key-value pairs, where the keys are unique strings and the values can be any scalar value (number, string, or a reference).

In Perl, a hash is marked with the percent sign (%). So we declare the %count hash to hold the mapping from IP address to number of hits. Most of the code is the same as in the previous example, but instead of incrementing two separate scalars, we increment the elements of the hash using the following construct:

$count{$ip}++;

When we encounter an IP address for the first time, $count{$ip} does not exist yet. A missing hash element yields undef, and when undef is used in a numerical operation such as the ++ auto-increment, it is treated as the number 0. The increment turns that into 1 and also creates the appropriate entry in the hash, so the key-value pair automatically springs into existence. This is also called auto-vivification.
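
The behaviour described above can be observed directly with the built-in exists function. This is a small illustrative sketch; the variable names $before and $after are made up for the example.

```perl
use strict;
use warnings;

my %count;
my $ip = '127.0.0.1';

# Before the first increment there is no entry for this key.
my $before = exists $count{$ip} ? 'yes' : 'no';

$count{$ip}++;    # undef is treated as 0, so the value becomes 1
$count{$ip}++;    # an ordinary increment from 1 to 2

# The first increment created the entry.
my $after = exists $count{$ip} ? 'yes' : 'no';

print "exists before: $before\n";
print "exists after:  $after\n";
print "$ip   $count{$ip}\n";
```

Note that the increment produces no "uninitialized value" warning even with use warnings enabled, because ++ is one of the operators that is allowed to start from undef.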

As you can see, the hash grows automatically. Perl does all the memory management.

Once this is done, we'll have a hash in which each key is an IP address and each value is the number of times that IP address appears in the file. The keys function takes a hash and returns its keys in no particular order. This code will print all the IP addresses with the corresponding number of hits:

foreach my $ip (keys %count) {
    print "$ip   $count{$ip}\n";
}

Code

The full script is here:

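#!/usr/bin/perl
use strict;
use warnings;

# The log file name is the first command-line argument.
my $file = shift or die "Usage: $0 FILENAME\n";
open my $fh, '<', $file or die "Could not open '$file': $!";

# Maps each IP address to its number of hits.
my %count;

while (my $line = <$fh>) {
    # The IP address is everything before the first space.
    my $length = index($line, " ");
    my $ip = substr($line, 0, $length);
    $count{$ip}++;
}
close $fh;

# Print the IP addresses with their counts, in no particular order.
foreach my $ip (keys %count) {
    print "$ip   $count{$ip}\n";
}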

Points of interest

Of course, it would be nicer to have the output sorted, and this code does that:

foreach my $ip (sort keys %count) {
    print "$ip   $count{$ip}\n";
}

But this sorts the IP addresses as strings, character by character, so 139.12.0.2 comes before 2.0.0.1, for example. That is probably not very interesting.

A better sorting might be this:

foreach my $ip (reverse sort { $count{$a} <=> $count{$b} } keys %count) {
    print "$ip   $count{$ip}\n";
}

Here we sort the keys according to the corresponding values and then reverse the order to get the IP addresses with the largest numbers first. The key expression is the following; let's take it apart:

reverse sort { $count{$a} <=> $count{$b} } keys %count

You can sort any list of strings.

sort @strings;

By default, sort compares the values as strings, character by character, following the order of the ASCII table.

You can also sort them using any other condition. E.g., the length of the strings:

sort { length($a) <=> length($b) } @strings;

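The sort() function of Perl will take any two values it wants to compare, put them in the two variables $a and $b, and evaluate the block. Depending on whether the block returns a negative number, zero, or a positive number, sort decides which of the two values comes first. The <=> operator returns exactly such a value when comparing two numbers.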
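
To make the role of the block concrete, here is a short sketch that sorts the same list twice, once with the default comparison and once by length. The sample words are made up, and the tie-break with cmp for strings of equal length is an addition for this sketch so that the result is fully determined.

```perl
use strict;
use warnings;

my @strings = ('pear', 'fig', 'banana', 'kiwi');

# Default sort: plain string comparison.
my @by_string = sort @strings;

# Custom block: shorter strings first; strings of equal length
# are ordered by string comparison.
my @by_length = sort { length($a) <=> length($b) or $a cmp $b } @strings;

print "@by_string\n";    # banana fig kiwi pear
print "@by_length\n";    # fig kiwi pear banana
```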

sort { $count{$a} <=> $count{$b} } keys %count

This code does the same, but it sorts the keys of the hash, and when comparing two keys, the block compares the values that belong to them. The result is in increasing order, so if we would like to display the IP addresses with the most hits first, we need to reverse the result:

reverse sort { $count{$a} <=> $count{$b} } keys %count

In the last example, we do the same, but when displaying the results we use a helper variable to limit the output to the top two IP addresses. Setting $top to 10 gives the top 10 sources.

my $top = 2;
foreach my $ip (reverse sort { $count{$a} <=> $count{$b} } keys %count) {
    print "$ip   $count{$ip}\n";
    $top--;
    if ($top <= 0) {
        last;
    }
}
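
As an alternative to counting down inside the loop, the limit can be applied with an array slice. The following is a sketch under a few assumptions: the helper name top_sources is hypothetical, the hash is passed by reference, and the block swaps $a and $b to get descending order instead of calling reverse. Ties are broken by comparing the IP strings so that the output is repeatable.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $top keys of the hash,
# the ones with the highest counts first.
sub top_sources {
    my ($count, $top) = @_;
    my @sorted = sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
    $top = @sorted if $top > @sorted;    # fewer keys than requested
    return @sorted[0 .. $top - 1];
}

# Sample counts matching the log excerpt shown earlier.
my %count = ('127.0.0.1' => 2, '139.12.0.2' => 1, '217.1.20.22' => 1);

foreach my $ip (top_sources(\%count, 2)) {
    print "$ip   $count{$ip}\n";
}
```

Both approaches sort the complete list of keys first, so the choice between them is mostly a matter of readability.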