
Re: Journalled File System (long)



Hi,

Shanu wrote:
>         I read somewhere that the forthcoming 2.4 series will have a
> "journalized file system (ext3fs) which will eliminate the need for 
> long fsck".
> 
>         How different is a journalized file system from the ones we 
> have now?
>         What are its features?
>         What other OS's use this feature currently?

> How different is a journalized file system from the ones we have now?

Traditionally, UNIX file systems such as the Berkeley Fast File System
have to write their metadata updates straight through to the disk. When
you update or create a file, they write the changes all the way to the
disk and wait for that write to complete. This is a synchronous write.

With Journalled File Systems (JFS for short), changes to the inodes,
directories and bitmaps are logged to disk before the original entries
are updated. Should the system crash before the updates are done, they
can be replayed from the log and completed as intended.

> What are its features?
Journalling preserves file system integrity, which in turn means fast
boot times, and it effectively eliminates the need for a full fsck.

> What other OS's use this feature currently?
SGI's IRIX (with XFS) and Be's BeOS use this feature. In the 80s,
IBM's JFS, Veritas and Tolerant were among the first to support it.

SGI recently announced that they will port the journalling features of
their XFS to Linux; there was a thread earlier on this list that gave
the details. It is scheduled to be released in about a year's time.

To elaborate, the basic idea is that the disk blocks that are involved
in a disk modification (file creation, deletion, write, and so on) are
written to the disk's "journal" *before* they're actually written to
their final resting places on the disk. This ensures that your disk's
structures will be consistent even if you crash during a disk access: 

(a) If you crash before the journal entry is written for a given
operation then the operation will appear to have never happened. 

(b) If you crash while data blocks are being written (that is, after the
journal blocks have been written), the system will "replay" the journal
entry when you reboot, and the aborted transactions will complete
normally. 

(c) If you were to crash after the data blocks were flushed, but before
the journal entry was removed, the disk blocks would simply be
re-written. 
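
And a hedged sketch of what the "replay" in (b) and (c) amounts to,
using the same made-up journal_record layout as in the sketch above
(real implementations also checksum and order the records, which is
omitted here):

/* Illustrative journal replay at mount time; not real ext3/XFS code.
 * journal_record is the same invented layout as in the earlier sketch. */
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 512L

struct journal_record {
    long txn_id;
    long block_nr;
    char new_data[BLOCK_SIZE];
    int  committed;
};

static void replay_journal(int journal_fd, int disk_fd)
{
    struct journal_record rec;

    /* Walk the log from the start.  Case (a): entries that never made it
     * to the journal are simply absent.  Cases (b) and (c): complete
     * entries are re-applied; writing the same block twice is harmless
     * because the contents are identical. */
    while (read(journal_fd, &rec, sizeof(rec)) == (ssize_t)sizeof(rec)) {
        if (!rec.committed)
            continue;            /* partial transaction: pretend it
                                    never happened, as in case (a)  */
        lseek(disk_fd, rec.block_nr * BLOCK_SIZE, SEEK_SET);
        write(disk_fd, rec.new_data, sizeof(rec.new_data));
    }
    fsync(disk_fd);              /* make the replayed updates durable */
}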

Journalling ensures integrity, but it can't guarantee that the file
system will always be 100% up to date. Because disk blocks get buffered
in memory, a crash may prevent some of them from making it to the
journal, so when you reboot you won't see the transactions that died in
the cache. For example, you might not see the last file that was
created before a power failure -- but your hard disk wouldn't be
corrupted.

Part of the disk is allocated for the FS metadata journal: you might
lose about 2MB for every 10GB to the journal, but that is a small price
to pay for file system integrity (it's just about 0.02% anyway). Even
if you had to replay the entire 2MB journal it would take only about 30
seconds.

Journalling maintains only file system metadata, _NOT_ the user data.
The data that you were writing when the system crashed has gone into
the eternal bitbucket.

Depending on the file system architecture and the disk block size,
journalling may or may not reduce performance; it depends on how the
buffer cache and disk DMA are used. For example, the BeOS provides up
to 6MB/s of sustained transfer on a standard IDE disk. Pre-allocating
long runs of disk blocks also improves performance, because the writes
for a given file become contiguous (which is one of the reasons why
_modern_ operating systems do not need to "defrag" the hard disk). How
large a run to pre-allocate depends on how much data has already been
written to the file.
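
As an aside on the pre-allocation point: on newer systems an
application can also hint at this itself with the standard
posix_fallocate() call. This is not BeOS- or XFS-specific code, just a
sketch of the same idea, and the file system is still free to decide
the actual on-disk layout:

/* Sketch: reserving space up front so the file system can try to pick
 * one long, contiguous run of blocks for the file.  The file name and
 * size are arbitrary; contiguity is up to the file system. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile.dat", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Ask for 64 MB before writing a single byte. */
    int err = posix_fallocate(fd, 0, 64L * 1024 * 1024);
    if (err != 0) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        close(fd);
        return 1;
    }

    /* ... subsequent writes land inside the reserved extent ... */
    close(fd);
    return 0;
}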

Also, journalling clubs multiple updates together into a single
transaction, thereby increasing performance. Writes to and from the
journal are asynchronous, and the data is flushed to its final location
on disk when the system is idle. For crucial data it is possible to
make sync and fsync calls to force a write to disk. On my machine the
BeOS takes about 17 seconds to boot, regardless of what it was doing
when I pull the plug, and in over a year of regular use I have yet to
lose any data. YMMV. So we should be seeing this kind of performance on
Linux soon...
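
The sync and fsync calls mentioned above are the ordinary POSIX ones;
here is a minimal example of forcing a crucial piece of data out of the
buffer cache and onto the disk (the file name and contents are only for
illustration):

/* Forcing a write to stable storage with fsync(2).  Standard POSIX
 * calls; the file name and contents are only for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "balance: 1000\n";
    int fd = open("account.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (write(fd, msg, sizeof(msg) - 1) != (ssize_t)(sizeof(msg) - 1)) {
        perror("write");
        close(fd);
        return 1;
    }

    /* Without this the data may sit in the buffer cache and be lost in
     * a crash; fsync() blocks until the kernel has pushed it to disk. */
    if (fsync(fd) != 0) {
        perror("fsync");
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}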

Journalling is not something you can simply layer on top of an existing
FS; it has to be built into the FS from scratch, which might be the
reason for the one-year delay in bringing it to Linux. I also hear
rumors that the media capabilities of XFS may not be present in the
port, but I'm only guessing. Can someone clarify?

BGa
