I've posted this on the MATLAB newsgroup, but that group is pretty noisy and I know there are a lot of MATLAB users here, so here goes... I'm looking for suggestions/code on doing the following. They generally get more complex as the list goes on, numbers (1) and (4) are really the startings point and most important to me. 1) Given a sequence of bytes, find all occurances of that sequence in a binary file. 2) Given a sequence of bytes with possible wildcards (fixed wildcard length of one byte), find all occurances of that sequence in a binary file. 3) Given a sequence of bytes with possible wildcards (wildcards representing a variable number of bytes), find all occurances of that sequence in a binary file. 4-6) Same as the above except for bits rather than bytes--bits are not necessarily aligned with byte boundaries in any way. It's not really necessary for this to be in MATLAB (and would probably be much faster in C), but the files won't be huge (generally no larger than a few hundred KB) and search seqeuences relatively short (3-5 bytes typically) so MATLAB processing time may be acceptable. Are there some existing tools/code I could use? General suggestions? Thanks, Matt -- Remove Xs from address to reply via e-mail. -- Remove Xs from address to reply via e-mail.
MATLAB question
Started by ●March 25, 2005
Reply by ●March 25, 20052005-03-25
Matt Roos wrote: (snip)> I'm looking for suggestions/code on doing the following. They generally get > more complex as the list goes on, numbers (1) and (4) are really the > startings point and most important to me.> 1) Given a sequence of bytes, find all occurances of that sequence in a > binary file.Deterministic Finite State Automata are popular here.> 2) Given a sequence of bytes with possible wildcards (fixed wildcard length > of one byte), find all occurances of that sequence in a binary file.> 3) Given a sequence of bytes with possible wildcards (wildcards representing > a variable number of bytes), find all occurances of that sequence in a > binary file.The ones I know are dynamic programming algorithms. It might be that FSA can do it, but dynamic programming allows different weights to be added to different characters, and gap lengths.> 4-6) Same as the above except for bits rather than bytes--bits are not > necessarily aligned with byte boundaries in any way.This doesn't really change the algorithm, but it does change the was the data is read. Read bytes, separate the bits, supply them to the algorithm being used. It also tends to be slow. For the no gap (wildcard) or fixed length gap you can consider the byte pattern and match that. A fixed bit pattern will require eight possible byte patterns, not including the partial bytes at the end. With fixed gap lengths you can also find an equivalent byte pattern. Find the literature on dynamic programming. These problems have been well studied, mostly in the case of DNA and protein sequence comparison for biology. -- glen