Finding the number of unique XML elements

Example Perl script

I wrote the following small Perl program when I needed to find out which elements were used across several XML files, and how often each element occurred.

The program expects at least one directory to scan for XML files as an argument. It first checks whether any arguments were given, and prints a usage message and exits if there are none.

Next, the program creates an XML::Parser object and sets its start element handler. The handler is called for each opening tag and counts occurrences with the well-known idiom of incrementing a hash entry keyed by the element name.
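To see the counting idiom in isolation, here is a minimal sketch that parses an in-memory string instead of files. The sample XML is made up for this illustration, and the handler is written as an anonymous sub rather than a named one; otherwise it mirrors the approach described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Parser;

# Hypothetical sample document, made up for this illustration.
my $xml = '<doc><p>one</p><p>two <em>three</em></p></doc>';

my %element_count;

my $parser = XML::Parser->new(
    Handlers => {
        # Called once per opening tag; the second argument is the
        # element name. Attributes follow but are ignored here.
        Start => sub {
            my ( $expat, $element ) = @_;
            $element_count{ $element }++;
        },
    },
);

# parse() takes a string (or filehandle); parsefile() takes a file name.
$parser->parse( $xml );

print "$_ ($element_count{ $_ })\n"
    for sort keys %element_count;
```

Because the hash entry springs into existence on the first increment, no check for "have I seen this element before" is needed.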

Next, the File::Find module is used to recurse into each directory passed on the command line. For each item found, the process_xml function is called, which parses the item if and only if it is a file and its name ends in .xml (the comparison is case-sensitive). Note that File::Find changes the current directory to the directory containing the item, so $_, which holds the name of the item without its path, can safely be passed to the parsefile method.
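The relationship between $_, the full path, and the current directory can be checked with a small experiment. The sketch below builds a throwaway directory tree (the names sub and sample.xml are invented for the example) and records what the callback sees; it relies on the default File::Find behavior, that is, without the no_chdir option.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Cwd qw( getcwd );
use File::Basename qw( basename );
use File::Find;
use File::Path qw( make_path );
use File::Temp qw( tempdir );

# Build a throwaway tree: $root/sub/sample.xml (hypothetical names).
my $root = tempdir( CLEANUP => 1 );
make_path( "$root/sub" );
open my $fh, '>', "$root/sub/sample.xml" or die "open: $!";
print $fh "<a/>\n";
close $fh;

my @seen;

find sub {
    return unless -f $_ and substr( $_, -4 ) eq '.xml';

    # $_ is the bare file name, $File::Find::name includes the
    # directory, and the working directory is the one holding the file.
    push @seen, [ $_, $File::Find::name, basename( getcwd() ) ];
}, $root;

printf "name=%s  path=%s  cwd ends in %s\n", @$_ for @seen;
```

If the no_chdir option were set, $_ would instead hold the full path, and the bare-name assumption would no longer apply.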

Finally, a simple report is printed by looping over all keys of the %element_count hash in alphabetical order.
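If the most frequently used elements are of more interest than an alphabetical list, the report can be ordered by count instead. This is a variation, not what the program does; the counts below are hypothetical values standing in for the hash filled by the start handler.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical counts, standing in for the hash filled by the handler.
my %element_count = (
    a     => 405,
    code  => 25,
    h2    => 169,
    div   => 11,
    group => 25,
);

# Highest count first; ties fall back to alphabetical order.
my @by_count = sort {
    $element_count{ $b } <=> $element_count{ $a }
        or $a cmp $b
} keys %element_count;

print "$_ ($element_count{ $_ })\n" for @by_count;
```

The secondary comparison keeps the output stable between runs, since the order of hash keys in Perl is not predictable.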

#!/usr/bin/perl
#
# xmlelements.pl
#
# © Copyright, 2006 by John Bokma, http://johnbokma.com/
# License: The Artistic License
#
# $Id$ 

use strict;
use warnings;


use XML::Parser;
use File::Find;

@ARGV or die "usage: xmlelements DIR [DIR ...]\n";

my %element_count;

my $parser = XML::Parser->new(

    Handlers => {

        Start => \&start_element,
    },
);

find \&process_xml, @ARGV;

print "$_ ($element_count{ $_ })\n"
    for sort keys %element_count;

exit;


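# Called by File::Find for every item found; $_ holds the item's name
# and the current directory is the one containing it.
sub process_xml {

    $parser->parsefile( $_ )
        if substr( $_, -4 ) eq '.xml' and -f;
}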


# Called by XML::Parser for every opening tag.
sub start_element {

    my ( $expat, $element, @attrval ) = @_;

    $element_count{ $element }++;
}

A snippet of the program's output:

a (405)
blockquote (1)
cimage (103)
code (25)
collection (15)
define-link (14)
div (11)
document-root (1)
fp (273)
group (24)
h2 (169)
h3 (14)
id (191)
img (19)
include (16)
include-code (29)
