Suggestions On How To Build An HTML Diff Tool?

July 31, 2022

In this post I asked if there were any tools that compare the structure (not actual content) of 2 HTML pages. I ask because I receive HTML templates from our designers, and frequen

Solution 1:

The DOM is a data structure - it's a tree.

Solution 2:

Run both files through the following Perl script, then use diff -iw to do a case-insensitive, whitespace-ignoring diff.

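#!/usr/bin/perl

use strict;
use warnings;

# Slurp all of STDIN at once instead of reading it line by line.
undef $/;

my $html = <STDIN>;

# Consume the input from the front, printing one line per tag or text run.
while ($html =~ /\S/) {
  if ($html =~ s/^\s*<//) {
    # The /s modifier lets a tag span several lines.
    $html =~ s/^(.*?)>//s or die "malformed HTML";
    print "<$1>\n";
  } else {
    # Replace each run of text with a placeholder so content is ignored.
    $html =~ s/^([^<]+)//;
    print "(text)\n";
  }
}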
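For comparison, here is a rough Python sketch of the same idea, using only the standard library's html.parser and difflib. It is not a port of the script above: it drops attributes (the Perl version keeps them), and html.parser lowercases tag names, so the comparison is case-insensitive without any diff flags. The names TagStream, structure_lines and structure_diff are made up for this example.

```python
import difflib
from html.parser import HTMLParser


class TagStream(HTMLParser):
    """Collects one line per tag and a "(text)" placeholder per text run."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored on purpose; only the tag name is kept.
        self.lines.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.lines.append(f"</{tag}>")

    def handle_data(self, data):
        # Skip whitespace-only runs and collapse adjacent text into one line.
        if data.strip() and self.lines[-1:] != ["(text)"]:
            self.lines.append("(text)")


def structure_lines(html):
    """Return the tag/text skeleton of an HTML string as a list of lines."""
    parser = TagStream()
    parser.feed(html)
    parser.close()
    return parser.lines


def structure_diff(old_html, new_html):
    """Return a unified diff of the two skeletons as a list of lines."""
    return list(difflib.unified_diff(
        structure_lines(old_html), structure_lines(new_html),
        fromfile="old", tofile="new", lineterm=""))


if __name__ == "__main__":
    old = "<div>Hello</div>"
    new = "<DIV>\n  <p>Other text</p>\n</DIV>"
    print("\n".join(structure_diff(old, new)))
```

Two pages that differ only in their text produce an empty diff, which is the behaviour the question asks for.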

Solution 3:

@Mike - that would compare everything, including the content of the page, which isn't what the original poster wanted.

Assuming that you have access to the browser's DOM (by writing a Firefox/IE plugin or whatever), I would probably put all of the HTML elements into a tree, then compare the two trees. If the tag name is different, then the node is different. You might want to stop enumerating at a certain point (you probably don't care about span, bold, italic, etc. - maybe only worry about divs?), since some tags are really the content, rather than the structure, of the page.
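A minimal sketch of that tree comparison, assuming you parse the HTML yourself with Python's html.parser rather than reading a browser's DOM. The set of tags to skip (INLINE_TAGS) is a guess at what counts as content rather than structure, and TreeBuilder, parse_tree and diff_trees are hypothetical names.

```python
from html.parser import HTMLParser

# Assumption: tags treated as content rather than structure; adjust to taste.
INLINE_TAGS = {"span", "b", "i", "em", "strong", "a", "u"}
# Void elements never have a closing tag, so they must not stay open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Builds a tree of [tag, children] nodes, skipping INLINE_TAGS."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            return
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS or tag in VOID_TAGS:
            return
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def parse_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def diff_trees(a, b, path=""):
    """Compare children position by position; return a list of differences."""
    diffs = []
    for i in range(max(len(a[1]), len(b[1]))):
        left = a[1][i] if i < len(a[1]) else None
        right = b[1][i] if i < len(b[1]) else None
        where = f"{path}/{(left or right)[0]}[{i}]"
        if left is None:
            diffs.append(f"{where}: only in second page")
        elif right is None:
            diffs.append(f"{where}: only in first page")
        elif left[0] != right[0]:
            # If the tag name is different, the node is different.
            diffs.append(f"{path}/[{i}]: <{left[0]}> vs <{right[0]}>")
        else:
            diffs.extend(diff_trees(left, right, where))
    return diffs


if __name__ == "__main__":
    first = parse_tree("<div><span>Hi</span></div>")
    second = parse_tree("<div><p><b>Other</b></p></div>")
    print(diff_trees(first, second))
```

The comparison is positional, so one inserted sibling makes every later sibling at that level look different. A real tool would probably align the child lists first (for example with a sequence diff) before recursing.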

Solution 4:

If I were to tackle this issue, I would do this:

Plan some kind of DOM for HTML pages. Start lightweight and add more as needed. I would use the composite pattern for the data structure, i.e. every element has a collection of children of the base class type.
Create a parser to parse HTML pages.
Use the parser to load the HTML elements into the DOM.
Once the pages have been loaded into the DOM, you have a hierarchical snapshot of each page's structure.
Keep iterating through every element on both sides until the end of the DOM. You'll find the structural difference when you hit a mismatch in element type.

In your example, one side would have only a div element object loaded, while the other side would have a div element object with one child element of type paragraph. Fire up your iterator: on the first iteration you match up the div elements; on the second you match the paragraph against nothing. You've got your structural difference.
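The div example can be sketched roughly as follows. This illustrates only the composite structure and the lockstep iteration; the trees are built by hand instead of by a parser, and Element, walk and first_mismatch are hypothetical names. Each step carries the nesting depth as well as the tag, so a nested element is not mistaken for a sibling.

```python
from dataclasses import dataclass, field
from itertools import zip_longest


@dataclass
class Element:
    """Composite node: every element holds children of the same type."""
    tag: str
    children: list = field(default_factory=list)

    def walk(self, depth=0):
        """Yield (depth, tag) pairs in document order (pre-order)."""
        yield depth, self.tag
        for child in self.children:
            yield from child.walk(depth + 1)


def first_mismatch(left, right):
    """Step through both trees together; return the first differing step.

    Returns (step, left_item, right_item), where an item is None if that
    side ran out of elements, or None if the structures are identical.
    """
    for step, (a, b) in enumerate(zip_longest(left.walk(), right.walk())):
        if a != b:
            return step, a, b
    return None


if __name__ == "__main__":
    one_side = Element("div")
    other_side = Element("div", [Element("p")])
    print(first_mismatch(one_side, other_side))  # (1, None, (1, 'p'))
```

Stopping at the first mismatch keeps the sketch short. Reporting every difference would need some way to resynchronize the two iterators after a mismatch.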

Solution 5:

See http://www.semdesigns.com/Products/SmartDifferencer/index.html for a tool that is parameterized by language grammar and produces deltas in terms of language elements (identifiers, expressions, statements, blocks, methods, ...) that were inserted, deleted, moved, replaced, or had identifiers substituted across them consistently. This tool ignores whitespace reformatting (e.g., different line breaks or layouts) and semantically indistinguishable values (e.g., it knows that 0x0F and 15 are the same value). This can be applied to HTML using an HTML parser.

EDIT: 9/12/2009. We've built an experimental SmartDiff tool using an HTML editor.
