KryoFlux - Stupid Timers

2009-10-13

Implemented some code that transfers data and, on the host side, checks data validity and spots suspect data - just to be on the safe side.

This led to the discovery of an unfortunate problem: the timer value is unreliable for anything above 16 bits (it can count 0...65535, then starts from 0 again and so on). That is, we can only measure 2.686 ms per cell at most, and beyond that we would have to record a flux change regardless of whether it actually happened or not.

In practice, you’d have a hard time reading a track with no flux change for that long: the AGC (Automatic Gain Control) on the drive amplifies the signal read from the disk, and the longer it goes without seeing a flux change, the lower the detection threshold gets.

So, theoretically speaking, it shouldn’t matter: most (all?) drives would report a flux change sooner than 2.68 ms even if there actually wasn’t one. However, it would be nice to be able to do it properly.

Some background:

The hardware only has a 16 bit timer, but it can tell if there was an overflow.

So how can you implement a 32 bit counter?

  1. Generate an interrupt on either timer overflow or data sampled.
  2. Store the number of overflows, add sampled cell size.

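A minimal sketch of that naive approach in C (the register names, flags and store_sample() are made up for illustration, not the actual firmware):

    #include <stdint.h>

    /* Hypothetical 16 bit timer registers. */
    extern volatile uint16_t TIMER_CAPTURE;   /* cell length, frozen on the sampling edge */
    extern volatile uint8_t  TIMER_OVERFLOW;  /* set when the free-running counter wraps  */
    extern volatile uint8_t  TIMER_SAMPLED;   /* set when a complete cell was sampled     */

    extern void store_sample(uint32_t cell);  /* hypothetical: queue the value for the host */

    static uint32_t overflow_count;           /* 65536-tick wraps seen in the current cell */

    /* Interrupt handler: fires on timer overflow or on a sampled cell. */
    void timer_irq_handler(void)
    {
        if (TIMER_OVERFLOW) {
            overflow_count++;                 /* extend the counter beyond 16 bits */
            TIMER_OVERFLOW = 0;
        }

        if (TIMER_SAMPLED) {
            /* 32 bit cell length = completed wraps plus the captured 16 bit value. */
            store_sample((overflow_count << 16) + TIMER_CAPTURE);
            overflow_count = 0;               /* next cell starts counting wraps from zero */
            TIMER_SAMPLED = 0;
        }
    }
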
Easy, right? ... Wrong.

Consider the following scenario that our host data checker kindly pointed out:

  • Timer is near overflow, say at 65534.
  • A complete cell has been sampled and an interrupt is generated.
  • Some time passes; this is the interrupt latency and is unpredictable as it depends on the instruction being executed at the time of interrupt.
  • Interrupt handler checks overflow (this value is NOT frozen at the time of sampling).
  • It reads the sampled cell size (this value is frozen at the time of sampling so it’s correct).

If your interrupt latency is more than 2 cycles, you read 65534 as the sampled cell size, but by the time the handler runs the free-running counter will have overflowed as well, making the reconstructed cell size 65536+65534 - which is obviously a bit off!

  • You could say that any such value should be treated as a special case, with the overflow ignored - but remember that the interrupt latency is not a fixed number of cycles, so a whole range of values would have to have their overflow ignored instead (see the sketch after this list).
  • If for whatever reason you really do have a cell whose length falls into that suspect range for real, it will be measured as 65536 cycles less than it is in reality.
  • You could change the order of overflow handling, but that just means an underflow error gets introduced instead, similar to how the overflow error happens.
  • You have the exact same problem when measuring the time between index pulses (apart from the fact that an overflow always happens there, as a disk revolution takes about 200 ms): the value might be off by 2.68 ms or be correct - you just can’t tell.
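
For what it’s worth, a sketch of that range-based workaround - the latency bound is an assumed value, since the real worst-case interrupt latency is not known here:

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed worst-case interrupt latency in timer ticks (made-up number). */
    #define MAX_IRQ_LATENCY_TICKS 64u

    /*
     * Decide whether a reported overflow should be trusted for a sampled cell.
     * If the captured value is close enough to 65535 that the overflow could
     * have happened during the interrupt latency, the overflow is treated as
     * suspect and ignored - accepting that a genuine cell in that range would
     * then be measured 65536 ticks short.
     */
    static uint32_t decode_cell(uint16_t captured, bool overflow_seen)
    {
        if (overflow_seen && captured >= 65536u - MAX_IRQ_LATENCY_TICKS)
            return captured;                  /* overflow most likely spurious */

        return overflow_seen ? 65536u + (uint32_t)captured : captured;
    }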

In short: no matter what you do, there will be values for which you cannot reliably determine (only guess) whether they are the genuine values or not. It’s very, very non-obvious how an error can get introduced (well, it’s simple once you understand it - you are measuring a stopped and a running signal at the same time), so without writing an automated check we would never have thought of this, to be honest...

For the index signal measurement, we can get away with:

  • Resampling the counter several times until the value read is stable (a sketch of this follows below). This cannot be done for cell sampling.
  • Using a different reference clock. As the index signal is mechanical, it’s not clockwork precision anyway.
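
A sketch of the first idea - combining the 16 bit hardware counter with a software overflow count and rereading until the pair is consistent (names are made up):

    #include <stdint.h>

    extern volatile uint16_t TIMER_COUNT;      /* free-running 16 bit counter             */
    extern volatile uint32_t overflow_count;   /* wraps counted by the overflow interrupt */

    /*
     * Read a 32 bit time value built from the overflow count and the hardware
     * counter.  If an overflow sneaks in between the two reads, the pair is
     * inconsistent, so reread until it is stable.  Fine for the index, which we
     * can poll at leisure; useless for cells, where the value has to be the one
     * frozen at the flux edge.
     */
    static uint32_t read_time32(void)
    {
        uint32_t hi;
        uint16_t lo;

        do {
            hi = overflow_count;
            lo = TIMER_COUNT;
        } while (hi != overflow_count);        /* changed mid-read: try again */

        return (hi << 16) | lo;
    }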

Since we already have a working reference clock that is within mechanical error range (1/8th the resolution of the sampler clock), we’ll just use that - it will be more accurate than the hardware signal anyway.

Unfortunately we cannot use this reference clock for sampling, as the cell resolution would drop from the current 40 ns to 40*8 ns. That is still better than other existing hardware, but... we’d consider that kind of resolution poor, and it is unsuitable for error correction algorithms.

The options we have:

  1. Say we are safe with drives that use AGC (all?), as they would fake a flux change before our measurable cell period anyway. We can still check for overflow, and send a message to the host if it happens. The host can try and resample (to see if there is an electrical problem) and if the problem persists, bail out, indicating to the user they should get another drive.
  2. Add RAM, and sample cells in a tight loop. The problem is that this may still fail, just in a slightly different way: there is a hardware bug that can cause the interrupt signal to be lost when it is polled, due to the difference between the timer rate and the master clock rate, so some signals might be lost...!
  3. Get a Beagle Board - it has dedicated 32 bit timers.
  4. Add dedicated timing hardware to this board that can properly sample long signals... not to mention buffering the measured data.

We think that, for now, we’ll go for option 1. We’ll do some testing in the evening to see how common it is to see a timer overflow. We will see if we can get away with the 2.686 ms measurement limit in the short term, but we will still leave the actual 32 bit processing in there just in case we add 32 bit buffered timer hardware later - no need to rewrite the whole code.
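
A minimal sketch of what the option 1 check could look like in the firmware (the message code and the send functions are hypothetical):

    #include <stdint.h>

    #define MSG_OVERFLOW 0x0B                  /* hypothetical out-of-band message code */

    extern volatile uint16_t TIMER_CAPTURE;    /* cell length, frozen at the flux edge  */
    extern volatile uint8_t  TIMER_OVERFLOW;   /* set when the counter wrapped          */

    extern void send_sample(uint16_t cell);    /* hypothetical: stream a 16 bit cell     */
    extern void send_message(uint8_t code);    /* hypothetical: out-of-band host message */

    /* Cell sampled: stream the 16 bit value as-is; if the timer overflowed,
     * just tell the host, which can resample the track or flag the drive. */
    void cell_sampled_handler(void)
    {
        if (TIMER_OVERFLOW) {
            TIMER_OVERFLOW = 0;
            send_message(MSG_OVERFLOW);
        }

        send_sample(TIMER_CAPTURE);
    }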

...Even on the C64 you could link a 16 bit timer to another one making it 32 bit - and that was not exactly cutting edge technology. ;)

(later)

We will try to use an external trigger to reset the timer and start sampling at the same time. If it works, it will not have the overflow bug: as long as a cell never reaches 65536 ticks, the timer is reset before the overflow condition can occur, and if it does reach it, then the overflow is always real, because the counter is no longer free-running as far as the overflow is concerned.
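
Under that scheme the handler would look roughly like this (again with made-up names); the difference from the earlier sketch is that the flux edge itself resets the counter in hardware, so the overflow flag can no longer be set by interrupt latency:

    #include <stdint.h>

    extern volatile uint16_t TIMER_CAPTURE;    /* cell length, frozen at the flux edge       */
    extern volatile uint8_t  TIMER_OVERFLOW;   /* set only if a full 65536 ticks elapsed     */
    extern volatile uint8_t  TIMER_SAMPLED;    /* set when a cell was captured               */

    extern void store_sample(uint32_t cell);   /* hypothetical: queue the value for the host */

    static uint32_t overflow_count;

    void timer_irq_handler(void)
    {
        if (TIMER_OVERFLOW) {
            /* The counter restarts from zero at every flux edge, so an overflow
             * can only mean the current cell really exceeded 65536 ticks. */
            overflow_count++;
            TIMER_OVERFLOW = 0;
        }

        if (TIMER_SAMPLED) {
            store_sample((overflow_count << 16) + TIMER_CAPTURE);
            overflow_count = 0;
            TIMER_SAMPLED = 0;
        }
    }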

Originally using the trigger was considered, but we opted for the current solution because:

1) Two very short cells would be measured as one, with the first one being measured and the second one ignored if the measured value is not read before the second edge arrives. This is a hardware limitation, as timer values are not buffered. At the moment we use an absolute (free-running) value instead, which would also measure two very short signals as one, but as the sum of the two short intervals. This is actually correct - it means we have a glitch filter (see the sketch below).

2) The position of the index can’t be measured that way - there is no reference time available from the sampling. However, since the index can’t use the sampling counter as a reference timer anyway (it will not always be correct), this problem no longer applies. We’ll lose glitch filtering on very short intervals (a few nanoseconds long, which are obviously electrical problems) and those cells will be lost.
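
To make the free-running “absolute value” behaviour in 1) concrete, here is a sketch of the glitch-filter effect (variables are made up):

    #include <stdint.h>

    /*
     * With a free-running counter, a cell length is the difference between two
     * consecutive captured values.  If two flux edges arrive so close together
     * that the first capture is overwritten before it can be read, the next
     * delta simply spans both cells - the two very short intervals are summed,
     * which is exactly the glitch-filter behaviour described above.
     */
    static uint16_t previous_capture;

    static uint16_t cell_length(uint16_t capture)
    {
        uint16_t delta = (uint16_t)(capture - previous_capture);  /* wraps correctly mod 65536 */

        previous_capture = capture;
        return delta;
    }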

Ideally, a track would be first sampled with the external trigger using real 32 bit precision (again, if it works... we’ll see) and if there is no cell measured during that time that would require more than 16 bits to represent, resample with the free running counter as that would filter the very small cells. Then the two different kinds of samples could be matched up, and very short cells identified and inserted in the 32 bit sampled version... We’re not that bothered to go down this route, but it is a possibility for extreme correctness. :)

In case you are wondering why measuring two very short cells is an issue at all: the hardware glitch filter does not apply here, because the timer measures the pin directly rather than going through the IO controller - the timer can’t measure internal signals.

(later)

It’s the sampling method we use to compensate for the lack of glitch filter that is bugged.

If external triggering works for starting and sampling at the same time on the same edge, then the 32 bit timing will work, but we’ll lose the “software glitch filter” (adding up the very short cells), plus a few of the very short cells here and there...

If sampling on the same edge does not work, we’ll stick with the 16 bit resolution, for the simple reason that measuring from a falling edge to a rising edge would not measure the asserted state. That should be a constant value, but it’s probably not always 100% the same. Also, rounding errors get introduced over time (imagine rounding errors adding up over 1 million samples...), which does not matter if the whole period is measured, but does matter if only half of it is.

It’s amazing how cheap the designers were, saving on all sorts of things... like another 16 bit counter that could be cascaded? Compared to the cost of the rest of the chip it’s pointless not to have it, and it makes life miserable for developers.

(later)

Triggering and sampling on the same edge does work (for a change)! So the 32 bit sampling is back, and we just happily sampled 600 ms without any flux change. To be clear, this means the above problems with the timer completely go away.

So we have been able to stream HD disks over USB without any encoding, as raw 32 bit samples... Pretty obviously this would fail in the worst case scenario of an entire track filled with zeros, as that is the densest bit pattern you can have in MFM (a run of zeros gets a clock transition in every bit cell). Still nice!

With an HD disk (and twice as many sampling interrupts as with DD disks) the maximum throughput is about 900 KiB/s - so it’s safe to say overall performance on DD disks would be around 1800 KiB/s, if the full speed USB transfer rate allowed that.

Update: While there was doubt we would ever see a title using a long period of no flux changes, Guardian Angel showed that some do indeed exist. Caring about this possibility from the beginning was not a useless theoretical issue. This will be explained in further detail in a later WIP report.