The following small Perl program was written when I needed to find out which XML elements were used by several XML files. I also wanted to know how often each element occurred.
The program expects at least one directory to scan for XML files as a command line argument. It therefore first checks whether any arguments were given, and prints a usage message and exits if there are none.
Next, the program creates an XML::Parser object and sets its start element handler. The handler simply counts element occurrences using the well-known idiom of incrementing a hash entry keyed by the element name.
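The counting idiom can be tried on its own, without a parser. The following is a minimal sketch; the list in @names is made-up sample data standing in for the element names a start handler would receive.

```perl
use strict;
use warnings;

# Hypothetical sample data: element names as a start handler might see them.
my @names = qw( a div a img a div );

my %count;

# A key that does not exist yet is undef, which ++ treats as 0,
# so no explicit initialisation is needed.
$count{ $_ }++ for @names;

print "$_ ($count{ $_ })\n" for sort keys %count;
```

Because the increment creates missing entries on the fly, the handler needs no special case for the first time an element is seen.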
Next, the File::Find module is used to recurse over each directory passed on the command line.
For each item found, the process_xml function is called, which parses the item if and only if it is a file and its name ends in .xml. Note that File::Find changes the current directory to the one containing the item, so $_, which holds the name of the item itself without its path, can safely be passed to the parsefile method.
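The behaviour of $_ inside the callback can be checked with a small experiment. This is only a sketch: the temporary directory, the sub directory name and the two file names are invented for the demonstration, and an anonymous sub takes the place of process_xml.

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;
use File::Temp qw( tempdir );

# Build a throwaway tree: <tmp>/sub/example.xml and <tmp>/sub/notes.txt
# (hypothetical names, used only for this demonstration).
my $root = tempdir( CLEANUP => 1 );
my $sub  = File::Spec->catdir( $root, 'sub' );
mkdir $sub or die "mkdir $sub: $!";

for my $name ( 'example.xml', 'notes.txt' ) {
    my $path = File::Spec->catfile( $sub, $name );
    open my $fh, '>', $path or die "open $path: $!";
    print {$fh} "<root/>\n";
    close $fh or die "close $path: $!";
}

my @seen;
find( sub {
    # $_ is the bare name. The -f test on it works because find() has
    # already changed into $File::Find::dir; the full path is available
    # in $File::Find::name.
    return unless -f $_ and /\.xml\z/;
    push @seen, [ $_, $File::Find::name ];
}, $root );

print "$_->[0] found at $_->[1]\n" for @seen;
```

Only example.xml is reported, and its first field carries no directory part, which is why a relative file test and a relative parsefile call both work from within the callback.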
Finally, a simple report is printed by looping over all keys of the elements hash in sorted order.
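If the most frequent elements are of more interest than an alphabetical list, the same loop can be fed a differently sorted key list. The following variation is a sketch and not part of the program; the hash is filled with a few hand-picked sample counts so that it runs on its own.

```perl
use strict;
use warnings;

# Hypothetical sample counts, standing in for the hash the parser fills.
my %element_count = (
    a     => 405,
    code  => 25,
    div   => 11,
    group => 24,
    h2    => 169,
    h3    => 14,
);

# Highest count first; elements with equal counts fall back to name order.
my @by_count = sort {
    $element_count{ $b } <=> $element_count{ $a } or $a cmp $b
} keys %element_count;

print "$_ ($element_count{ $_ })\n" for @by_count;
```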
#!/usr/bin/perl
#
# xmlelements.pl
#
# © Copyright, 2006 by John Bokma, http://johnbokma.com/
# License: The Artistic License
#
# $Id$
use strict;
use warnings;
use XML::Parser;
use File::Find;
@ARGV or die "usage: xmlelements DIR [DIR ...]\n";
my %element_count;
my $parser = XML::Parser->new(
    Handlers => {
        Start => \&start_element,
    },
);
find \&process_xml, @ARGV;
print "$_ ($element_count{ $_ })\n"
    for sort keys %element_count;
exit;
sub process_xml {
    $parser->parsefile( $_ )
        if substr( $_, -4 ) eq '.xml' and -f;
}
sub start_element {
    my ( $expat, $element, @attrval ) = @_;
    $element_count{ $element }++;
}
Example output snippet of the Perl program:
a (405)
blockquote (1)
cimage (103)
code (25)
collection (15)
define-link (14)
div (11)
document-root (1)
fp (273)
group (24)
h2 (169)
h3 (14)
id (191)
img (19)
include (16)
include-code (29)