
Re: UNIX problem



On Tue, Aug 03, 1999 at 10:47:54AM +0530, Himadri Hazarika wrote:

> The problem comes here: For the projected data volumes, the system
> must support a read and write per day of approx. 13 terabytes. In
> order to reduce this figure, we are looking at some solution for
> searching through a compressed file without uncompressing it. (Though
> zcat does this without uncompressing to a file explicitly, it
> actually uncompresses and redirects to standard output.)

compress uses the LZW compression algorithm, which works by building a
table of commonly occurring strings and replacing each such string with
a single number (its index in the table). LZW is patented, so gzip uses
a slightly different, unpatented method (LZ77).
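To make the table-building concrete, here is a minimal sketch of LZW
compression in Python. The function name lzw_compress and the use of
byte strings as table keys are my own choices for illustration; the real
compress also caps the code width and resets the table, which this
sketch omits.

    def lzw_compress(data):
        # Start the table with every single byte mapped to its own code.
        table = {bytes([i]): i for i in range(256)}
        next_code = 256
        codes = []
        w = b""
        for b in data:
            wc = w + bytes([b])
            if wc in table:
                # Keep extending the current match.
                w = wc
            else:
                # Emit the code for the longest known string, then add
                # the new, one-byte-longer string to the table.
                codes.append(table[w])
                table[wc] = next_code
                next_code += 1
                w = bytes([b])
        if w:
            codes.append(table[w])
        return codes

For example, lzw_compress(b"abababab") returns [97, 98, 256, 258, 98]:
eight bytes shrink to five codes because the repeated "ab" fragments
collapse into table indices.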

So you could potentially do a better job of string searching than
zcat file | grep (which is what zgrep seems to do) by constructing the
string table yourself. That still requires scanning the file, but if
you're lucky you'll find the string you're searching for early in the
table. Once you have the string's index, you can search the rest of the
file for that single number (the index), which is likely to give you
some speedup.
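As a rough illustration of that idea, here is a sketch that rebuilds the
LZW table while walking the code stream and flags codes whose table
entry contains the pattern, without materializing the full decompressed
output. The function name and the code-list input are assumptions tied
to the lzw_compress sketch above; note that it misses matches straddling
two adjacent entries, which is part of why the trick only buys so much.

    def find_pattern_codes(codes, pattern):
        # Rebuild the decoder's string table as we scan the codes.
        table = {i: bytes([i]) for i in range(256)}
        next_code = 256
        hits = []
        prev = None
        for pos, code in enumerate(codes):
            if code in table:
                entry = table[code]
            elif code == next_code and prev is not None:
                # The KwKwK corner case: the code refers to the table
                # entry that is being defined right now.
                entry = prev + prev[:1]
            else:
                raise ValueError("corrupt LZW stream")
            if prev is not None:
                table[next_code] = prev + entry[:1]
                next_code += 1
            # Whole-entry containment test only; matches that span two
            # consecutive entries are not detected here.
            if pattern in entry:
                hits.append(pos)
            prev = entry
        return hits

Running find_pattern_codes(lzw_compress(b"abababab"), b"ab") reports
hits at code positions 2 and 3, the two codes whose entries ("ab" and
"aba") contain the pattern.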

> Please could you give me more information on the following:

But given that no one seems to have implemented what I've proposed
above, the gains are probably quite limited.

In any case, with such large amounts of data you should store it in a
database and index it rather than searching it with grep; the search
will be much faster that way. If your database doesn't compress the
data itself, you can use filesystem-level compression.
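As a sketch of the index-instead-of-grep idea, here is a small Python
example using sqlite3 with SQLite's FTS5 full-text index. This is a
modern illustration, not something available in 1999, and the table and
column names are invented for the example; FTS5 matches whole words, not
arbitrary substrings.

    import sqlite3

    con = sqlite3.connect("records.db")
    # A full-text index over the stored lines; MATCH queries consult
    # the index instead of scanning every row.
    con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(body)")
    con.executemany(
        "INSERT INTO docs (body) VALUES (?)",
        [("disk quota exceeded on /home",),
         ("nightly backup completed",)],
    )
    con.commit()
    # MATCH uses the index; a LIKE '%...%' scan would not.
    for (body,) in con.execute(
            "SELECT body FROM docs WHERE docs MATCH ?", ("backup",)):
        print(body)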

	-Arun
