data anomaly detection

Started by Unknown February 15, 2007
I am trying to develop a data anomaly detector. I basically want to
detect clipped data, spikes, and drifting data to begin with. Any
suggestions on how to do it?

lakshmanan.meyyappan@gmail.com wrote:
> I am trying to develop a data anomaly detector. I basically want to
> detect clipped data, spikes, and drifting data to begin with. Any
> suggestions on how to do it.
You want to make quality judgments. The first task is to define the characteristics that flag poor quality in your application. An anomaly can't be reliably found if it can't be rigorously defined.

Jerry
--
Engineering is the art of making what you want from things you can get.
On Feb 15, 2:37 pm, lakshmanan.meyyap...@gmail.com wrote:
> I am trying to develop a data anomaly detector. I basically want to
> detect clipped data, spikes, and drifting data to begin with. Any
> suggestions on how to do it.
It would be helpful if you could provide a description of the non-anomalous data. For example, if it is a sine wave, then the instantaneous frequency could be a useful metric.

John
lakshmanan.meyyappan@gmail.com wrote:
> I am trying to develop a data anomaly detector. I basically want to
> detect clipped data, spikes, and drifting data to begin with. Any
> suggestions on how to do it.
You can try difference filters (e.g. [1/4 -1/2 1/4]). For clips, the output of the difference filter is close to zero for several samples. For spikes, the output varies wildly for several samples.

To detect drift, subtract the output of the difference filter from the input (delayed by one sample). If the output grows or is very large (compared to the input), you have drift.

Regards,
Andor
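A minimal sketch of the clip and spike checks described above, assuming NumPy and synthetic data. The function names, the tolerance `tol` and the test signals are illustrative and not from the thread; the drift check is not covered here.

```python
import numpy as np

def second_difference(x):
    """Filter x with the kernel [1/4, -1/2, 1/4] (no edge padding)."""
    kernel = np.array([0.25, -0.5, 0.25])
    return np.convolve(np.asarray(x, dtype=float), kernel, mode="valid")

def longest_quiet_run(d, tol=1e-9):
    """Length of the longest run of consecutive outputs with |d| <= tol."""
    longest = current = 0
    for quiet in np.abs(d) <= tol:
        current = current + 1 if quiet else 0
        longest = max(longest, current)
    return longest

# Synthetic test signals (hypothetical data, not from the thread).
t = np.arange(200)
clean = np.sin(2 * np.pi * t / 50)
clipped = np.clip(clean, -0.8, 0.8)   # flat tops and bottoms
spiky = clean.copy()
spiky[100] += 5.0                     # one isolated spike

# Clipping shows up as a long run of near-zero filter outputs.
flat_run_clean = longest_quiet_run(second_difference(clean))
flat_run_clipped = longest_quiet_run(second_difference(clipped))

# A spike shows up as a filter output far larger than on clean data.
peak_clean = np.max(np.abs(second_difference(clean)))
peak_spiky = np.max(np.abs(second_difference(spiky)))
```

On real, noisy data the tolerance would have to be set relative to the noise floor, since the filter output over a clipped stretch is only exactly zero when the clipped samples are identical.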
On Feb 15, 11:37 am, lakshmanan.meyyap...@gmail.com wrote:
> I am trying to develop a data anomaly detector. I basically want to
> detect clipped data, spikes, and drifting data to begin with. Any
> suggestions on how to do it.
What do you mean by drifting?

If you're trying to detect high-frequency anomalies, like a clip or spike, I would suggest using multirate analysis (i.e. filter banks). This is very similar to what Andor wrote regarding difference filters. Basically, the signal is input to a multi-channel filter bank, where one or more of the channels are expected to catch anomalies within a particular frequency range. During the analysis stage you could monitor the magnitude of your output channels, and when you detect a large increase in magnitude you can flag that as an anomaly.

The theory behind multirate analysis is somewhat advanced, but the actual implementation is super easy with Haar wavelets, assuming you are dealing with discrete samples. Email me if you're interested in a better explanation.

-marc
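As a rough illustration of the Haar idea, here is a sketch of a single analysis level, assuming NumPy. It computes only the high-frequency (detail) channel and flags pairs of samples whose detail magnitude is large compared with the median. The threshold factor `k`, the function names and the test signal are assumptions for the example, not marc's implementation.

```python
import numpy as np

def haar_detail(x):
    """One level of Haar detail coefficients: (x[2k] - x[2k+1]) / sqrt(2)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - len(x) % 2            # drop a trailing odd sample
    pairs = x[:n].reshape(-1, 2)
    return (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)

def flag_detail_jumps(x, k=5.0):
    """Start indices of sample pairs whose |detail| exceeds k times the median |detail|."""
    mag = np.abs(haar_detail(x))
    threshold = k * np.median(mag)
    return 2 * np.flatnonzero(mag > threshold)

# Synthetic sine with one spike (hypothetical data).
t = np.arange(200)
signal = np.sin(2 * np.pi * t / 50)
signal[101] += 5.0

flagged = flag_detail_jumps(signal)    # the pair (100, 101) holds the spike
```

A full filter bank would repeat the split on the low-frequency channel to cover several frequency ranges; this sketch stops at the first level.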
On 15 Feb, 20:37, lakshmanan.meyyap...@gmail.com wrote:
> I am trying to develop a data anomaly detector. I basically want to
> detect clipped data, spikes, and drifting data to begin with. Any
> suggestions on how to do it.
The easy stuff first: Clipped data. In a fixed-point numerical format, you can check for the maximum and minimum integer values. In a system with a floating-point ADC you might have to check against some tabulated values. A bit more cumbersome, but not at all impossible.

For outliers, check median filters. It was discussed here last summer:

http://groups.google.no/group/comp.dsp/msg/9f740be9bda608d4?hl=no&

For data drift, select a window frame length and map mean or median values inside the frames.

Rune
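One way the median-filter suggestion for outliers could look, as a sketch assuming NumPy: compare each sample with the running median of its neighbourhood and flag large residuals. The window width, the threshold and the edge padding are illustrative choices, not something specified in the thread.

```python
import numpy as np

def running_median(x, width=5):
    """Median of a sliding window of odd length `width`, edges padded by repetition."""
    x = np.asarray(x, dtype=float)
    half = width // 2
    padded = np.pad(x, half, mode="edge")
    return np.array([np.median(padded[i:i + width]) for i in range(len(x))])

def median_outliers(x, width=5, threshold=1.0):
    """Indices where a sample differs from the running median by more than `threshold`."""
    x = np.asarray(x, dtype=float)
    residual = np.abs(x - running_median(x, width))
    return np.flatnonzero(residual > threshold)

# Synthetic sine with one outlier (hypothetical data).
t = np.arange(200)
data = np.sin(2 * np.pi * t / 50)
data[60] += 5.0

outliers = median_outliers(data)
```

The median ignores an isolated outlier inside the window, so the neighbours of the bad sample are not flagged along with it.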
Thanks a lot for all your replies.

To answer some of your questions ... I am actually trying to develop a
generic data anomaly detection toolkit for my project. My group mainly
deals with engineering data ... temperatures, speed, torque, stress,
pressure and so on.

I will summarize what I have done as I think it will also be useful
for someone else ...
Here's what I have done so far ... (if you think what I am doing is
not correct or if you think there is a better way to do it, please let
me know)

For Clipping:
I am getting the indices of the max/min values from the data. If the max
or min values occur at consecutive indices, then I raise a flag that the
data could possibly be clipped. If a max value flat line is followed by a
min value flat line, I conclude that it is possibly a digital on/off type
signal. I also give the user the option to enter the valid range. If the
user does that, the check is more accurate.
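A sketch of the consecutive max/min check, assuming NumPy. The minimum run length `min_run`, the function names and the optional `limits` argument (standing in for the user-entered range) are assumptions for illustration.

```python
import numpy as np

def longest_run(mask):
    """Length of the longest run of consecutive True values."""
    longest = current = 0
    for hit in mask:
        current = current + 1 if hit else 0
        longest = max(longest, current)
    return longest

def looks_clipped(x, min_run=3, limits=None):
    """True if the data sits at or beyond a limit for at least min_run consecutive samples.

    Without `limits`, the observed min and max of the data are used as the limits.
    """
    x = np.asarray(x, dtype=float)
    lo, hi = limits if limits is not None else (x.min(), x.max())
    return longest_run(x >= hi) >= min_run or longest_run(x <= lo) >= min_run

# Synthetic examples (hypothetical data).
t = np.arange(200)
clean = np.sin(2 * np.pi * t / 50)
clipped = np.clip(clean, -0.8, 0.8)
strain = np.clip(1500 * clean, -1000, 1000)   # clipped at a known full-scale range
```

With a known range, `looks_clipped(strain, limits=(-1000, 1000))` tests against the real limits instead of whatever extremes happen to be in the record.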

Drifting Data:
I am breaking the entire plot area into 20 windows. Each window
contains 5% of the data. I calculate the mean value within each window
and store it in an array of size 20. Then I calculate the standard
deviation among these 20 values. If the data is normal, the standard
deviation should be reasonably small. If not, then the data is either
drifting or it's a ramp signal.
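The windowed-mean check can be sketched as follows, assuming NumPy. The post does not say what "reasonably small" is measured against, so dividing by the typical within-window standard deviation is an assumption made here to get a unitless score; the function name and test data are also illustrative.

```python
import numpy as np

def drift_ratio(x, n_windows=20):
    """Std of the window means divided by the median within-window std."""
    x = np.asarray(x, dtype=float)
    chunks = np.array_split(x, n_windows)        # 20 windows of about 5% each
    means = np.array([c.mean() for c in chunks])
    within = np.median([c.std() for c in chunks])
    spread = means.std()
    if within == 0.0:
        return 0.0 if spread == 0.0 else float("inf")
    return float(spread / within)

# Synthetic examples (hypothetical data): steady noise, and the same noise on a ramp.
rng = np.random.default_rng(0)
steady = rng.normal(0.0, 1.0, 2000)
drifting = steady + np.linspace(0.0, 10.0, 2000)

steady_score = drift_ratio(steady)
drifting_score = drift_ratio(drifting)
```

A score well below 1 means the window means move less than the noise inside a window; a large score means the means wander, which covers drift and ramps but also any slow periodic component.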

Noise/Spikes:
Method 1: Amplitude Threshold Detection
Take user input on the max and min thresholds of the data. Anything
beyond that is a spike.
Method 2: Amplitude Threshold Detection - no user input
Anything beyond mean ± 5 times the standard deviation is a spike.
Method 3: Differential Threshold Detection
Calculate the absolute value of the slope between consecutive points. If
the slope increases dramatically, it's a spike.
Method 4: Running standard deviation
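Methods 2 and 3 might be sketched like this, assuming NumPy. The post gives no rule for what counts as a "dramatic" slope increase, so comparing against `k` times the median absolute slope is an assumption; the function names and test data are illustrative.

```python
import numpy as np

def spikes_by_sigma(x, k=5.0):
    """Method 2: indices of samples more than k standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    return np.flatnonzero(np.abs(x - x.mean()) > k * x.std())

def spikes_by_slope(x, k=5.0):
    """Method 3: indices reached by a step larger than k times the median absolute step."""
    x = np.asarray(x, dtype=float)
    slope = np.abs(np.diff(x))
    typical = np.median(slope)
    return np.flatnonzero(slope > k * typical) + 1   # index of the later sample

# Synthetic sine with one large spike (hypothetical data).
t = np.arange(200)
x = np.sin(2 * np.pi * t / 50)
x[100] += 10.0

sigma_hits = spikes_by_sigma(x)
slope_hits = spikes_by_slope(x)
```

The slope method flags both the jump into the spike and the jump back out of it, so a single bad sample yields two adjacent indices. Note also that a large spike inflates the mean and standard deviation used by Method 2, which can hide smaller spikes in the same record.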

I will be adding a few more data anomaly checks. Will keep you posted

Thanks
Laks

lakshmanan.meyyappan@gmail.com wrote:
> Thanks a lot for all your replies.
>
> To answer some of your questions ... I am actually trying to develop a
> generic data anomaly detection toolkit for my project. My group mainly
> deals with engineering data ... temperatures, speed, torque, stress,
> pressure and so on.
>
> I will summarize what I have done as I think it will also be useful
> for someone else ...
> Here's what I have done so far ... (if you think what I am doing is
> not correct or if you think there is a better way to do it, please let
> me know)
>
> For Clipping:
> I am getting indices of max/min values from the data. If the max or
> min values are consecutive, then I raise a flag that it could possibly
> be clipped. If a max value flat line is followed by a min value flat
> line, I conclude that it is possibly a digital on/off type signal. I also
> give the user the option to enter a range. If the user does that, it is
> more accurate.
How does clipping happen in your environment? If it is internal to the computer and in integer format, there could be numerical wraparound. That can sometimes be detected as a change in sign between successive numbers of large magnitude. If it is in an analog sensor, there might be saturation, in which case successive values would be close but not necessarily identical.
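The wraparound check could be sketched as below, assuming NumPy and a signed 16-bit sample format. The `fraction` of full scale that counts as "large magnitude" is an arbitrary illustrative choice, as are the function name and test data.

```python
import numpy as np

def wraparound_indices(x, full_scale, fraction=0.5):
    """Indices where two successive large-magnitude samples have opposite signs."""
    x = np.asarray(x, dtype=np.int64)            # widen so abs() cannot overflow
    big = np.abs(x) >= fraction * full_scale
    sign_flip = np.sign(x[1:]) * np.sign(x[:-1]) < 0
    return np.flatnonzero(big[1:] & big[:-1] & sign_flip) + 1

# A ramp that runs past the int16 maximum and wraps to negative values
# (hypothetical data; the wrap is computed explicitly for clarity).
ramp = np.arange(32000, 33000, 100)
wrapped = (ramp + 32768) % 65536 - 32768

# An ordinary signal crossing zero at small magnitude should not be flagged.
sine = np.round(1000 * np.sin(2 * np.pi * np.arange(200) / 50)).astype(np.int64)

wrap_hits = wraparound_indices(wrapped, full_scale=32768)
sine_hits = wraparound_indices(sine, full_scale=32768)
```

A legitimate signal that swings from near positive full scale to near negative full scale in one sample would also be flagged, so this is a hint rather than proof of wraparound.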
> Drifting Data:
> I am breaking the entire plot area into 20 windows. Each window
> contains 5% of the data. Calculate the mean values within each window
> and store it in an array of size 20. Then I calculate the standard
> deviation among these 20 values. If the data is normal the standard
> deviation should be reasonably small. If not, then data is either
> drifting or it's a ramp signal
Or overlain with low-frequency AC.
> Noise/Spikes
> Method 1: Amplitude Threshold Detection
> Take user input on max and min thresholds of data. Anything beyond that
> is spike
> Method 2: Amplitude Threshold Detection - no user input
> Anything beyond mean + 5 times std deviation is a spike
> Method 3: Differential Threshold Detection
> Calculate the abs value of slope of each consecutive points. If slope
> increase dramatically, it's a spike
> Method 4: Running standard deviation
>
> I will be adding a few more data anomaly checks. Will keep you posted
Thanks for that.

Jerry
Clipped data is where the real data values exceed the full-scale
limits of the calibrated acquisition unit. For example, if I configure
my DAQ to measure strains in the range of -1000 µε to +1000 µε, any
value outside this range is clipped.


lakshmanan.meyyappan@gmail.com wrote:
> Clipped data is where the real data values exceed the full scale
> limits of the calibrated acquisition unit. For example, if I configure
> my daq to measure strains in the range of -1000 µε to +1000 µε, any
> value outside this range are clipped
Sure, but what number is reported? Some analog circuits, op amps especially, fold back with large overloads. I bound those when I use them in instrumentation and set their max output just under full ADC scale. The principle is simple: "Once bitten, twice shy."

Jerry