The errors "SYSTEM ERROR: wrong dbkey in block. Found <dbkey>, should be <dbkey2> (1124)" and "SYSTEM ERROR: read wrong dbkey at offset <offset> in file <file> found <dbkey>, expected <dbkey>, retrying. (9445)" are very often caused by hardware or operating system problems.
Errors 1124 and 9445 indicate database corruption. These very serious errors are reported by the database storage manager when database block header validation fails after a disk read. Error 1124 may also be reported in three other cases, described below.
Each database block has a unique identifier called a "dbkey". The dbkey identifies a block's location within the database. Every database block stored on disk contains a copy of its dbkey in its block header. When the Progress database storage manager reads the block for a particular dbkey from disk, it compares the dbkey in the header of the block that was just read with the dbkey that was requested. If they do not match, error 1124 is reported. Errors 1124 and 9445 are typically immediately preceded by error 4229 (Corrupt block detected when reading from database).
If the dbkeys do not match, it means either that the block has been damaged and does not contain valid data, or that the read operation returned the wrong block. If the block does not contain valid data, then those data are permanently lost and cannot be reconstructed by the crash recovery mechanism of the database.
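The read-time check described above can be illustrated with a minimal sketch. This is not the Progress storage manager's code or on-disk format: the block size, the 8-byte header layout, the rule that a dbkey equals the block index, and the names make_block, read_block and WrongDbkeyError are all assumptions made for illustration only.

```python
import io
import struct

BLOCK_SIZE = 64                 # hypothetical; real database blocks are much larger
HEADER = struct.Struct("<q")    # hypothetical header: a single 64-bit dbkey


class WrongDbkeyError(Exception):
    """The dbkey found in the block header differs from the one requested."""


def make_block(dbkey, payload=b""):
    """Build a toy block: header holding the dbkey, then zero-padded payload."""
    body_size = BLOCK_SIZE - HEADER.size
    if len(payload) > body_size:
        raise ValueError("payload does not fit in one block")
    return HEADER.pack(dbkey) + payload.ljust(body_size, b"\0")


def read_block(f, dbkey):
    """Read the block for `dbkey` and validate the dbkey stored in its header."""
    # Assumption for this sketch: the dbkey is simply the block index in the file.
    f.seek(dbkey * BLOCK_SIZE)
    block = f.read(BLOCK_SIZE)
    if len(block) != BLOCK_SIZE:
        raise WrongDbkeyError(f"short read for dbkey {dbkey}")
    (found,) = HEADER.unpack_from(block)
    if found != dbkey:
        # Either the block is damaged or the read returned the wrong block.
        raise WrongDbkeyError(
            f"wrong dbkey in block. Found {found}, should be {dbkey}"
        )
    return block


if __name__ == "__main__":
    good = io.BytesIO(b"".join(make_block(i) for i in range(4)))
    read_block(good, 2)  # passes validation silently
```

Note that the check only detects the mismatch. As stated above, it cannot tell a damaged block from a misdirected read, and it cannot recover the lost data.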
If a database block has been read successfully from disk into the memory-resident buffer pool, then the database manager validates the block header again whenever a buffer lock is released, after the block has been updated, and before writing the block back to disk. If the validation fails any of these checks, then the block was damaged while it was in memory. In each case, error 1124 is reported, preceded by one of errors 4232, 4231, or 4230, depending on when the error was detected. The affected block is NOT WRITTEN back to the database.
The Progress database storage manager applies the same block header validation in all executables that read from or write to the data extents of a database, e.g., self-service clients, page writers, probkup and prorest. This block header validation is also performed for on-disk temporary storage used for 4GL TEMP-TABLES.
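The three in-memory checkpoints can be sketched in the same spirit. The Buffer class, its method names, and the dict standing in for the disk are hypothetical. The sketch deliberately does not map the checkpoints to error numbers 4230, 4231 and 4232, because the text above does not say which number belongs to which checkpoint.

```python
class BufferDamagedError(Exception):
    """The block header no longer matches the dbkey the buffer was loaded for."""


class Buffer:
    """Toy buffer-pool entry; names and layout are illustrative assumptions."""

    def __init__(self, dbkey, data):
        self.dbkey = dbkey           # dbkey this buffer was loaded for
        self.header_dbkey = dbkey    # copy of the dbkey kept in the block header
        self.data = bytearray(data)

    def _validate(self, stage):
        if self.header_dbkey != self.dbkey:
            raise BufferDamagedError(
                f"{stage}: wrong dbkey in block. "
                f"Found {self.header_dbkey}, should be {self.dbkey}"
            )

    def update(self, new_data):
        self.data[:] = new_data
        self._validate("after update")

    def release_lock(self):
        self._validate("on buffer lock release")

    def write_back(self, disk):
        # Validate first, so a damaged block never reaches the disk.
        self._validate("before write")
        disk[self.dbkey] = bytes(self.data)


if __name__ == "__main__":
    disk = {7: b"old"}
    buf = Buffer(7, b"old")
    buf.update(b"new")
    buf.release_lock()
    buf.write_back(disk)
```

Validating before the write is what keeps the on-disk copy intact: in this sketch a buffer whose header was overwritten in memory raises an exception and the entry in disk is left unchanged.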
Error 1124 can occur for many different reasons. In most cases, especially when preceded by error 4229, the cause is external to the Progress database and occurs sometime between the moment the database manager wrote a good block to disk and the moment it later reads the block again and the header validation fails. Isolating the cause of the problem is often difficult and time-consuming. Once the cause has been determined and the problem corrected, the best course of action is to restore the database from backup.
Many actions occur between the database storage manager writing a good block and a subsequent read of the same block. Between these two events, there are many possible points of failure. A simplified sequence of events is as follows:
1) The storage manager issues a buffered write request, and the block is copied into the operating system's buffers.
   Faulty RAM or an operating system or file system bug can cause corruption here.
   Action: Replace any newly installed RAM, and check with your OS supplier for the latest OS patch information.
2) The operating system then passes the block to a device driver.
   A device driver bug can cause corruption here.
   Action: Check with your OS supplier for the latest patch information.
3) The device driver then passes the block to the disk controller.
   A faulty disk controller or a controller firmware bug can cause corruption here.
4) The disk controller then transfers the block to the disk, possibly via an external cable.
   A faulty disk or cable can cause corruption here.
In reading the block, the reverse happens. A similar sequence of events occurs when backing up and restoring. Using non-PROGRESS backup utilities (particularly if they are not the standard ones provided with the operating system) introduces another potential point of failure.
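When a file has been copied or restored with an operating system utility, one generic way to look for damage introduced along this path is to compare checksums of the source and the copy. The sketch below is an assumption-laden illustration and not a Progress utility: the function names are hypothetical, and the comparison is only meaningful if the source file was not being modified (for example, the database was shut down) while it was copied and hashed.

```python
import hashlib
import io


def digest_stream(f, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a binary file object, read in chunks."""
    h = hashlib.sha256()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()


def files_match(path_a, path_b):
    """True if both files have identical contents according to SHA-256."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        return digest_stream(a) == digest_stream(b)


if __name__ == "__main__":
    original = io.BytesIO(b"block data" * 1000)
    copy = io.BytesIO(original.getvalue())
    print(digest_stream(original) == digest_stream(copy))
```

A matching checksum only shows that the copy equals what was read from the source at that moment. It says nothing about whether the source itself was already damaged.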
In addition, after a block has been read from disk and while it is present in the database manager's buffer pool, a memory shortage may cause the operating system to page the buffer to disk (in the paging file) and retrieve it later, unbeknownst to the storage manager. Errors and corruption can occur during this process as well.
On SCO UNIX, UnixWare, Linux, and Solaris Intel systems running on old PC hardware, the BIOS sector translation for DOS drives greater than 1 GB _MUST_ be disabled. When that translation is turned on, the BIOS translates sectors for the benefit of DOS; other operating systems do not need it. Turning BIOS sector translation off may greatly reduce, if not eliminate, the 1124 errors and may greatly improve performance as well.
MISCELLANEOUS USER EXPERIENCES
These 'case studies' are provided merely as suggestions on where to focus your research efforts when trying to troubleshoot the cause of the 1124 error.
1) During an idxbuild on a system with a defective SCSI cable, errors occurred at random points in the job, different each time.
If the problem seems to come and go, check the terminators, the termination power, whether there is too little cable between connectors, and anything else related to the cabling.
Another customer, on an AIX system with all the disks connected via SCSI controllers, found that the combined length of their cables exceeded the specified maximum length for SCSI. They split off the disks and the 1124 errors went away.
2) The 1124 error occurred on a database that had been cpio'd from another machine. The customer was able to run the process on the original machine but not on the copy.
Damage may have been caused when the cpio copy was performed.
Check the size of the database against the ulimit size on the machine that you copied the database to. cpio will truncate the database at the ulimit size without giving you an error message.
NOTE: The PROGRESS backup utility, probkup, will override the ulimit size but cpio will not.
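The size check suggested above can be scripted. The following is a minimal sketch for POSIX systems, assuming you know the size of the file on the original machine; the function names are hypothetical. Python's resource module reports the file size limit in bytes, whereas the shell's ulimit command typically reports it in blocks, so the two numbers may differ by a constant factor.

```python
import os
import resource  # available on POSIX systems only


def file_size_limit():
    """Return the soft per-process file size limit in bytes, or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    return None if soft == resource.RLIM_INFINITY else soft


def copy_problems(actual_size, expected_size, limit):
    """List reasons to suspect that a copied file was truncated."""
    problems = []
    if limit is not None and expected_size > limit:
        problems.append(
            f"source size {expected_size} exceeds file size limit {limit}"
        )
    if actual_size != expected_size:
        problems.append(
            f"copy is {actual_size} bytes, expected {expected_size}"
        )
    return problems


def check_copied_file(path, expected_size):
    """Check a copied file against its original size and the current limit."""
    return copy_problems(os.path.getsize(path), expected_size, file_size_limit())
```

An empty list from check_copied_file means neither symptom was found. A copy whose size equals the limit exactly is a strong hint that the copy was cut off at the ulimit.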
3) A shared memory client might corrupt memory.
At one customer site, shared memory clients were running a third-party shared library (.DLL on Windows, .so on UNIX). The library contained a bug that caused a random region of memory to be overwritten with random data. Most of the time, the overwritten region was memory private to the client, so only the client that hit the bug would crash; occasionally, however, the overwritten region was in shared memory. A database crash with error 1124 would follow almost immediately.
4) It is possible that the machine has memory problems or, if you are using a disk cache, that the caching process is corrupting the block.
5) Verify that the motherboard's speed is correctly set to match the CPU's speed.
6) Examine other simultaneous processes accessing the drives.
One customer had a benchmark process that was able to cause the error. They also found that running multiple index rebuilds on multiple databases at the same time caused the error. It seemed that any workload that generated high levels of disk activity would cause the error.
The customer was able to determine that this problem was only occurring on one particular model of hard drive they were using. The problem occurred regardless of whether the hard drive was a master or slave drive.
The cause was ultimately determined to be a flawed disk drive design; all drives of this particular model were faulty.