Analyser: Data Grouping Logic / Reading Troublesome Disks


Since adding format pattern recognition requires a different approach to track resolving, the script generator required various changes. During the process we came up with an enhanced version of the data grouping logic that is less sensitive to random read errors on a track. The result is that many more difficult-to-read disks are completely resolved, automatically, the first time.

For this to make sense you may need to know what the data grouping logic is, so let's take a moment to explain it.

A track is read several times by the dumping tool. CTA selects data from any of these reads when analysing formats, and pieces the data together like a puzzle until that puzzle is solved. In most cases this makes it possible to obtain a good read even from a poorly readable disk that produces random errors or changes between reads.
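The puzzle idea can be sketched as follows. This is a hedged illustration only: with a known format, each piece (say, a sector) can be taken from whichever read decoded it cleanly. The `crc_ok` check, the sector layout, and the function name are assumptions for the example, not CTA's actual internals.

```python
def assemble_track(reads, crc_ok):
    """reads: list of track reads, each a list of sector payloads
    (None where a sector could not be decoded at all).
    Returns one complete track if every sector is good in at least
    one read, else None (the puzzle is unsolved)."""
    n_sectors = len(reads[0])
    track = []
    for s in range(n_sectors):
        # take the sector from the first read where it decoded and verifies
        piece = next((r[s] for r in reads
                      if r[s] is not None and crc_ok(r[s])), None)
        if piece is None:
            return None  # sector bad in every read: puzzle not solved
        track.append(piece)
    return track
```

A single poor read can thus still contribute its good sectors, as long as every sector is readable somewhere across the set of reads.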

This is all fine as long as a format is manually scripted, i.e. pre-defined by a format script, so the analyser knows what to expect and what the missing pieces should look like.

But what happens in the script generator? There we are creating the script itself from potentially unreadable or changing data. Even checking integrity through CRCs can be misleading, especially on protections. What we need to do instead is group the data segments from the various reads of the same data stream found on the disk.

A group consists of data segments that should contain the same data but might not. A good segmenting point is a sync found on the track, since a sync means re-alignment of the data stream, which consists of clock and data signals. Each group therefore holds segments that in theory contain the same data, but quite often they don't in practice. Some elements of the data, or the sync itself, may be completely illegible in some cases. If a sync is illegible, the number of segmenting points differs from read to read. The distance from the index signal is not a good enough indicator either, as both the speed and the signal can and do skew.
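As a rough sketch of the segmenting step: each read of a track can be cut into segments at every sync, so that segments at matching cut points across reads become candidates for the same group. The `SYNC` word, the `Segment` record, and `segment_read` are illustrative assumptions, not CTA's real data structures.

```python
from dataclasses import dataclass, field

SYNC = 0x4489  # a common MFM sync word, used here as the segmenting marker

@dataclass
class Segment:
    read_no: int              # which read of the track this came from
    start: int                # offset of the segment within that read
    words: list = field(default_factory=list)  # data up to the next sync

def segment_read(read_no, words):
    """Split one decoded read into segments, cutting at every sync."""
    segments, current, start = [], [], 0
    for i, w in enumerate(words):
        if w == SYNC and current:
            segments.append(Segment(read_no, start, current))
            current, start = [], i
        current.append(w)
    if current:
        segments.append(Segment(read_no, start, current))
    return segments
```

Note that if a sync is unreadable in one read, that read simply yields fewer (longer) segments, which is exactly why matching segments across reads by position alone cannot work.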

Positively identifying segmenting points belonging to the same group is far from trivial, especially when very small segments must be paired, such as segments only a few bytes long found in certain protection schemes.

The grouping algorithm previously used for script generation more or less required a faultless read to work properly, and was a fairly straightforward two-pass process.

This is no longer the case.

Right now the algorithm is multi-pass and multi-stage. One pass consists of pairing every unused segment possible according to the conditions set by the current stage. As long as any new segment gets resolved during a pass, the pass is repeated, since each new match can potentially satisfy the matching conditions for another segment (once again, just like a puzzle). Once no new matches can be found and there are still stray segments left, a different approach to grouping is taken; this is the next stage, and new stages are tried until all possible approaches are exhausted.

A sanity check after this process signals an error should any segments fall through the cracks - this is how it was recently possible to add a new stage handling a fairly extreme condition. Obviously, script generation itself is fairly complex and works on the segment groups generated. For example, the recently implemented block matching works on groups, not on individual segments.
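A sanity check in the spirit described could be as simple as the following: after grouping, every segment from every read must have ended up in exactly one group, otherwise something has fallen through the cracks. The function name and error wording are illustrative.

```python
def verify_grouping(all_segments, groups):
    """Raise if any segment was dropped or double-assigned by grouping."""
    grouped = [seg for g in groups for seg in g]
    if len(grouped) != len(all_segments):
        raise ValueError("segment count mismatch: segments fell through the cracks")
    for seg in all_segments:
        # identity check, since distinct segments may hold equal data
        if not any(seg is member for member in grouped):
            raise ValueError("stray segment not assigned to any group")
```

Tripping this check on a real-world disk is precisely what exposes a condition no existing stage handles, which is how a new stage gets added.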