User:Cacycle/diff

From wikishia


File:WikEdDiff screenshot.png
wikEd diff improved diff view:

Legend (in order): deleted text, inserted text, block move mark, moved block, single character changes, highlighted moved block and block mark, and ambiguous insertion aligned to line.

wikEd diff is a free JavaScript visual diff library for inline text comparisons. It is the only available JavaScript diff library that detects and highlights block moves and that works on the word and character level. While wikEd diff has been developed and optimized for comparing Wikipedia source texts, it works great for any type of text, including program code. The library is customizable, has Unicode and multilingual support, is fully commented and documented, and is free (public domain). The script is used by the Wikipedia/MediaWiki in-browser editor wikEd and by the gadget wikEdDiff. You can test the library by checking any of these gadgets in your English Wikipedia preferences. For easy text comparisons by copy-and-pasting you can also use the wikEd diff online tool and demo. The library has also been ported to PHP and is available as a MediaWiki extension.

Features

  • Visual inline style, changes are shown in a single output text
  • Block move detection and highlighting
  • Resolution down to characters level
  • Unicode and multilingual support
  • Stepwise split (paragraphs, sentences, words, characters)
  • Recursive diff
  • Optimized code for resolving unmatched sequences
  • Minimization of length of moved blocks
  • Alignment of ambiguous unmatched sequences to next line break or word border
  • Clipping of unchanged irrelevant parts from the output (optional)
  • Fully customizable
  • Text split optimized for MediaWiki source texts
  • Well commented and documented code

Code

The JavaScript code can be found under User:Cacycle/diff.js (raw code).

Supported browsers

The code works for all browsers, including Internet Explorer 8 under Windows XP.

Usage

To call this library, please use the following code:

 var wikEdDiff = new WikEdDiff();
 var diffHtml = wikEdDiff.diff( oldVersionString, newVersionString );

To raw code of the library is available under:

https://en.wikipedia.org/w/index.php?title=User:Cacycle/diff.js&action=raw&ctype=text/javascript

To use the library in other JavaScript programs, add the following line to the HTML <head> tag:

<script src="https://en.wikipedia.org/w/index.php?title=User:Cacycle/diff.js&action=raw&ctype=text/javascript"></script>

Customization

The library is fully customizable. The following settings can be changed on your User:YourUsername/common.js page. Please check the top of the diff.js program code for additional settings.Please see wikEdDiff for wikEdDiff-specific settings.

Colors can be customized by adding CSS code to your User:YourUsername/common.css page. Please check the code for available CSS classes.

Block mark symbols:

var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.cssMarkLeft = '◀';
wikEdDiffConfig.cssMarkRight = '▶';

Show complete un-clipped diff text:

var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.fullDiff = true;

Enable block move layout with highlighted blocks and marks at their original positions (default):

var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.showBlockMoves = true;

Reject blocks if they are too short and their words are not unique, prevents fragmentated diffs for very different versions (default):

var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.unlinkBlocks = true;

Reject blocks if shorter than this number of real words (default):

var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.blockMinLength = 3;

Enable character-refined diff (default):

var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.charDiff = true;

Enable recursive diff to resolve problematic sequences (default):

var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.recursiveDiff = true;

Maximum recursion depth (default):

var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.recursionMax = 10;

Display blocks in differing colors (rainbow color scheme):

var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.coloredBlocks = false;

Show debug infos and stats (block and group data object) in browser console:

var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.debug = true; }

Show timing results in browser console:

var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.debugTime = true; }

Run unit tests to prove correct working, display results in browser console:

var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.unitTesting = true;

Text type customization

The regular expressions for splitting the texts are optimized for natural language containing wiki markup. While this default also works great for other text types such as source code, it can be customized:

wikEdDiffConfig.regExp.split = {
  paragraph: new RegExp('…', 'g'),
  line: new RegExp('…', 'g'),
  sentence: new RegExp('…', 'g'),
  word: new RegExp('…', 'g'),
  character: new RegExp('…', 'g')
};

Implementation

The code is an implementation and extension of the following basic algorithm:

Paul Heckel: A technique for isolating differences between files
Communications of the ACM 21(4):264 (1978)
http://doi.acm.org/10.1145/359460.359467 (access restricted)
Mirror: http://documents.scribd.com/docs/10ro9oowpo1h81pgh1as.pdf (open access)

This method is word-based and uses unique words as anchor points to identify matching text and moved blocks.

wikEd diff has additional code for resolving unmatched islands that are caused by common tokens at the border of sequences of common tokens or by sequence crossing-overs. From these matching data, the program extracts moved blocks and compiles a new text with added insertion, deletion, and moved blocks, and block marks.

Details

  1. Matching tokens (Heckel method):
    1. A list of real words and chunks is compiled from the old and the new text for later "unlinking" step wordParse()
    2. Old and new text versions are split into tokens ("symbols", "words"), creating two token lists splitText() .tokens
    3. The number of token occurrences in each version are counted in the symbol tables calculateDiff() symbol
    4. Tokens unique in both texts serve as anchors or seed crystals and are connected ("linked", "matched") between old and new version calculateDiff() .tokens.link
    5. Starting from these connected pairs, adjacent identical pairs are connected outwards
    6. In addition to the classical Heckel method, the following improvements have been implemented:
      1. The process is repeated ("levels") while decreasing the size of unmatched tokens from paragraphs to lines, chunks (i.e. html and link markup), words, and single characters
        1. Words are split into characters only in corresponding unmatched regions ("gaps") of identical token number and strong similarity in all tokens: splitRefineChars()
          1. Identical tokens (i.e. space separators) will be linked, resulting in word-wise character-level linking
          2. One token became connected or separated by space or dash (or any token) (axb → a b)
          3. Addition or deletion of flanking strings in tokens (xabx → ab)
          4. Addition or deletion of internal string in tokens (axb → ab)
          5. Same length and at least 50 % identity (abc → abx)
          6. Same start or end, same text longer than different text (abcd → abcxx)
          7. Same length and at least 50 % identity (abcd → abcxx)
      2. Gaps caused by "crossing-overs" are resolved by repeating the procedure at each level with a fresh symbol table for unlinked tokens
      3. Corresponding gaps caused by non-unique tokens on both sides of the gap borders are resolved by repeating the procedure at each level for each gap pair recursively
      4. Ambiguous gaps with the same front and back tokens are aligned with newlines or word borders if possible ("sliding") slideGaps()
  2. Extracting block information: detectBlocks()
    1. Unchanged text blocks ('same' blocks, sequences of matched tokens) are added to the blocks table in new text order getSameBlocks() blocks
    2. To set blocks as fixed vs. moved, the longest sequences of blocks in increasing old text order are identified (longest increasing subsequence problem for moved blocks):
      1. Independent block sections with no block moves outside the sections are identified getSections() sections
      2. Continuous blocks in old text order are identified as groups getGroups() groups
      3. The longest subsequence of groups in sections in old text order is identified using a cached recursive method findMaxPath() and set fixed setFixed()
      4. Blocks outside sections are set fixed
  3. Preventing fragmented outputs caused by false positive matchings in texts with many changes: unlinkBlocks()
    1. Unchanged blocks are converted into insertion/deletion pairs in token lists ("unlinking"):
      1. Unlink whole group if no block is shorter than three real words blockMinLength and contains no word unique to the whole text
      2. Unlink group blocks stepwise from the front and the end until a block contains more than one real word or a word unique to the whole text
    2. After unlinking, the whole process of extracting block information from the token lists has to be repeated
    3. Unlinking is repeated as long as possible
  4. Deletion blocks ('del' blocks, unmatched sequences of text blocks from the old text token list) are added to the blocks table getDelBlocks()
  5. Deletion blocks are sorted-in in between the unchanged blocks at the correct position in new text order: positionDelBlocks()
    1. Deletion blocks that are next to a fixed neighbor block in old text order are positioned next to that block
    2. Deletion blocks without a fixed neighbor in old text order are positioned next to a neighbor if that block is not the start or end of a group
    3. Alternatively, the deletion block is positioned next to the closest fixed neighbor to the left
    4. Deletion blocks around unchanged blocks will be arranged in order of their old text positions
  6. Insertion blocks ('ins' blocks, unmatched text blocks from the new text token list) are added to the blocks table getInsBlocks()
  7. Insertion block group numbers and fixed status are set: setInsGroups()
    1. If the insertion is inside a group, that group number and fixed status is used
    2. For insertions outside existing groups a new single-block group is created
  8. Adding marks at the original position of moved groups ('mark' blocks) to the blocks table: insertMarks()
    1. This is similar to positioning deletion blocks
    2. Marks of moved goups that are next to a fixed group in old text order are positioned next to that group
    3. Alternatively, the mark is positioned next to the closest fixed group to the left
    4. Mark blocks and deletion blocks around unchanged blocks will be arranged in order of their old text positions
    5. Moved group colors are set in order of the marks in the new text
  9. Collect diff fragment list for markup (abstraction layer for customized diffs): getDiffFragments()
    1. Groups are processed in new text order
    2. If block moves are not displayed as per configuration showBlockMoves, then this text is inserted as a deletion
    3. Otherwise, a mark is added with this text as a popup title
  10. Clip unchanged sections from unmoved block text: clipDiffFragments()
    1. Unmoved block text is clipped at the closest feature that is detected from the borders (lines, heading, paragraph, line break, blank, fixed number of chars)
  11. Create html formatted diff code from: getDiffHtml()


See also

  • wikEd, a full-featured MediaWiki-integrated text editor that adds enhanced text processing functions to edit pages. wikEd provides syntax highlighting, search and replace functions, and on-page previews and diff views
  • wikEdDiff, uses this library to provides an improved and easier to read diff view for comparing article versions
  • wikEd diff online tool and demo
  • wikEdDiff extension

License

The code has been released into the public domain.