Some notes on handling SQL Server corruption

Version : 1.1
Date : 23/09/2010
By : Albert van der Sel
Type of doc : It's just a few notes on SQL Server corruption in very simple words. It's no more than "entry level".
For who : For anyone who likes a short orientation on the subject.



Although you do not need to be an expert, it is assumed that you have at least a "reasonable" level of knowledge of SQL Server.

Also, I did my best to put trustworthy and correct information in this note, but due to the "nature of the game",
use it at your own risk.

Chapter 1 just describes the creation of a Demo database which will be used to simulate several sorts of corruption.
But, as of Chapter 2, real scenarios are described.

Contents:

Chapter 1. Creating the example database, and getting a hex editor.

-- Notes on object- or data page corruption:
Chapter 2: "Soft" corruption on (ordinary) data and index pages (corrupt table).
Chapter 3: Severe corruption on (ordinary) data and index pages (corrupt table).

-- Notes on file- or admin page corruption:
Chapter 4: Inaccessible or Suspect database due to a Missing or Corrupt Transaction log.
Chapter 5: Inaccessible or Suspect database due to Severe corruption on a .ndf file.

-- Other notes:
Chapter 6: Overview of Microsoft SQL Server Repair commands.
Chapter 7: Using queries to salvage data from corrupt tables.


Chapter 1. Creating the example database, and getting a hex editor.


Example Database:

In the following sections, we will use the following example database.
With the script below, it's easy to create the demo database. Note that we have several filegroups and multiple datafiles,
in order to "simulate" a certain level of "complexity".
Also, we create a few sample tables, and an index, on different filegroups.

To follow this note, you do not need to create this sample database. You can just browse through, or just read this note.
But, if you would like to repeat the actions yourself, it is really advisable to create it.
And, since we have a script, it's not too bad if you hose up the database. Creating it again is just a matter of seconds.

I really advise you to create the database on SQL Server 2005 or 2008, and follow along with me, starting at Chapter 2,
working your way through each following chapter.

create database SALES
on PRIMARY
(
name='SALES',
filename='c:\mssql\data\SALES.mdf',
size=40MB,
filegrowth= 10MB,
maxsize= 100MB
),
FILEGROUP SALESDATA01
(
name='SALES_DATA_01',
filename='c:\mssql\data\SALES_DATA_01.ndf',
size= 40MB,
filegrowth= 10MB,
maxsize= 100MB
),
FILEGROUP SALESINDEX01
(
name='SALES_INDEX_01',
filename='c:\mssql\data\SALES_INDEX_01.ndf',
size= 40MB,
filegrowth= 10MB,
maxsize= 100MB
)
LOG ON
(
name='SALES_LOG_001',
filename='c:\mssql\data\SALES_LOG_001.ldf',
size= 40MB,
filegrowth= 10MB,
maxsize= 100MB
)

ALTER DATABASE SALES
MODIFY FILEGROUP SALESDATA01 DEFAULT
GO

USE SALES
GO

CREATE TABLE dbo.EMPLOYEE
(
EMP_ID INT NOT NULL,
EMP_NAME VARCHAR(20) NOT NULL,
SALARY DECIMAL(7,2)
)
ON [SALESDATA01]

CREATE TABLE dbo.EMPLOYEE2
(
EMP_ID INT NOT NULL,
EMP_NAME VARCHAR(20) NOT NULL,
SALARY DECIMAL(7,2)
)
ON [PRIMARY]

USE SALES
GO

-- insert data into the EMPLOYEE table in Filegroup SALESDATA01 (the .ndf file)
insert into EMPLOYEE
values
(1,'Harry',2000.50)

insert into EMPLOYEE
values
(2,'Nadia',3000.00)

insert into EMPLOYEE
values
(3,'Albert',5000.00)

-- insert data into the EMPLOYEE2 table in Filegroup PRIMARY (the .mdf file)
insert into EMPLOYEE2
values
(1,'Harry',2000.50)

insert into EMPLOYEE2
values
(2,'Nadia',3000.00)

insert into EMPLOYEE2
values
(3,'Albert',5000.00)

CREATE NONCLUSTERED INDEX indx_employee_empname ON EMPLOYEE(EMP_NAME) ON SALESINDEX01
GO


Ok, we have a database right now.

Get a Hex editor:

Next, to emulate several crashes, we need a hex editor, by which we can open a database file,
and "corrupt" the file manually.
There are several choices here. You might consider the free utility XVI32, which you can download from
the site below.

http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm

You only have to download the zip file, and extract it. Then you are ready to run it.
Many alternatives exist too. Just "google" on "hex editor".
But I really recommend xvi32, due to its simplicity and capabilities.


Chapter 2: A "soft" corruption on an ordinary data page (1).

This section primarily deals with a "soft corruption" based error, in an ordinary data page,
and how to understand it. In this section we used SQL Server 2008 32bit.
This type of corruption is not very severe. We will do the corruption ourselves, by using
a hex editor, and change some value in a data page while the database is offline.
In this case, we do not "destroy" a page by killing a page header or other critical parts.
This section only deals with a data alteration, but one which will be detected by SQL Server,
and it will throw an error message (if the "page verify" option is higher than "none").
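By the way, you can inspect and set the "page verify" option with T-SQL as well. A small sketch (the sys.databases view exists on SQL Server 2005 and later):

```sql
-- Show the "page verify" setting of all databases
SELECT name, page_verify_option_desc FROM sys.databases

-- Set it explicitly for our demo database (CHECKSUM is the default on 2005/2008)
ALTER DATABASE SALES SET PAGE_VERIFY CHECKSUM
GO
```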

The involved error message is:

Message 824: logical consistency-based I/O error: incorrect checksum

As we have seen in Chapter 1, we have put three records in the EMPLOYEE table. This table is stored
in the SALESDATA01 filegroup, which consists of the "c:\mssql\data\SALES_DATA_01.ndf" file.

In section 2.1, we will explore where those records are located, that is, in which page the records are stored.
Various methods exist. Here we will use the DBCC PAGE() command.

In section 2.2, we will use a hex editor (like xvi32) to emulate a soft corruption, by altering
an employee name directly in the file itself. Thus we bypass SQL Server (using the hex editor) while the SALES database is offline.
Here we alter a 'non critical' page, meaning that we just alter a data page.
What is the effect?

Section 2.1: DBCC PAGE()

Now, the three records of the EMPLOYEE table are likely to be located at the "first" usable data page
in the "SALES_DATA_01.ndf" file, which (in my case) is page 8.

You know that the very first pages in any database file are used for administrative purposes, like the "file header page" (page 0),
the "Page Free Space (PFS) page" (page 1) etc.. It's easy to check that our records are in page 8,
using the DBCC PAGE() command. Let's try that.

First use "DBCC TRACEON (3604)" in your database, in order to view output of the DBCC PAGE() command,
otherwise we won't see any output at all.

DBCC TRACEON (3604)
GO

Since a database usually consists of several datafiles, a page in a database is uniquely identified using the "file_id"
and the "page number" (in that file), like for example (1,355), meaning page 355 in file number 1.
Let's first see which file_ids exist in our SALES database.
Just use the query "select * from sysfiles".
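For example (the exact fileid values depend on your system; in my case "SALES_DATA_01.ndf" got fileid 3):

```sql
USE SALES
GO
-- sysfiles lists each file of the current database, with its fileid
SELECT fileid, name, filename FROM sysfiles
GO
```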


So, let's take a look at the database page in file number 3 (SALES_DATA_01.ndf), which contains the three records.
We are going to use DBCC PAGE() for that. The command takes a number of parameters, which are:
database, file_id, page number, output_mode.
So, the following would be our statement, to view the contents of page 8:


DBCC PAGE('sales',3,8,3)

This command produces a lot of output, of which below a small sample is shown:

..
Slot 0 Column 2 Offset 0x14 Length 5 Length (physical) 5
EMP_NAME = Harry
Slot 0 Column 3 Offset 0x8 Length 5 Length (physical) 5
SALARY = 2000.50
..
etc..

Anyway, we have shown that the records are indeed stored in page 8.
Also, we have learned how to use the DBCC PAGE() command.

Section 2.2: Creating a soft corruption in the page:

Now, with our hex editor, we are going to open the "c:\mssql\data\SALES_DATA_01.ndf" datafile.
First, of course, we put the SALES database offline, using the Management Studio (SSMS), or with the following
SQL command:

alter database sales set offline

So, now start your hex editor (like "xvi32") and open "c:\mssql\data\SALES_DATA_01.ndf".
Since page 8 starts at 8 x 8192 = 65536, which is "10000" in hexadecimal, let's scroll down to offset 10000.
There you will see the contents of page 8.
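If you don't like hex arithmetic, you can let SQL Server compute the offset for you; page N of a database file starts at byte N x 8192:

```sql
-- Offset of page 8, shown as a hex value (0x00010000)
SELECT CONVERT(VARBINARY(4), 8*8192) AS offset_of_page_8
```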


In the right pane of the hex editor, you can clearly see the names of our records, like "Albert".

Now, let's "emulate" a soft corruption in that page. Navigate to the string Albert and replace that string
with "Nelson". What we have done here, is just replace a piece of text, with some other characters of the same length.
Now, save the file, and close your hexeditor.
Secondly, try to put the database online, using "alter database sales set online".

As you see, the database will open without any complaints. But actually, we have altered a page!
Why does SQL Server NOT throw an error message at us?

It would have done that right away, at the moment it was trying to set the database online, if we had altered
a critical page, like page 0 (the file header), or the boot page (page 9 in the .mdf file).
Now, we cannot expect SQL Server to check all pages of the database. This is just an ordinary data page.
But it will detect the "corruption" at the moment it accesses that page.

Suspect Mode: If you have some sort of corruption, and for example, after a restart SQL Server needs
to rollforward or rollback transactions, but it can't for some reason (for example, due to one or more corrupt tables),
it will likely place the database in "Suspect" mode.
There are other reasons too, as to why SQL Server might place a database in "Suspect" mode. In later chapters, we will see
how to deal with this state of a database.

Let's try to query the EMPLOYEE table:

SELECT * FROM EMPLOYEE

This makes SQL Server access the page, and it will discover the alteration we have made.
The following output appears:

Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x794965ef; actual: 0x724e6c66).
It occurred during a read of page (3:8) in database ID 5 at offset 0x00000000010000 in file 'c:\mssql\data\SALES_DATA_01.ndf'.
Additional messages in the SQL Server error log or system event log may provide more detail.
This is a severe error condition that threatens database integrity and must be corrected immediately.

Note that indeed the page involved was page 8 (in file no 3). At the moment that SQL Server accessed the page, it discovered
that the current checksum is different from the checksum stored in the page header.
Indeed, by altering the employee name, we altered the (newly computed) checksum as well! Ok, but how do we recover from this situation?

Did you notice how SQL Server only complained about the wrong checksum?
Physically, there is nothing wrong with the page and its data. We only changed an employee name.

Take a look at this. In SSMS, rightclick the SALES database, and choose "properties".
In the "Options" pane, you will see an option called "page verify". The possible values are:
Torn Page Detection, Checksum, None.



Now, just change the "page verify" to None, and run the query again.

This time, SQL Server no longer cares about changed page checksums, so it just returns the records from the table,
although "formally" a corruption exists.
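Instead of clicking through SSMS, the same change can be made with a T-SQL statement:

```sql
-- Turn page verification off (only do this when trying to salvage data!)
ALTER DATABASE SALES SET PAGE_VERIFY NONE
GO
```

Don't forget to set it back to CHECKSUM when you are done.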

It's also instructive to run the DBCC CHECKTABLE() command on the EMPLOYEE table.

While the "page verify" option is still NONE, DBCC CHECKTABLE(EMPLOYEE) produces the following output.

DBCC CHECKTABLE(EMPLOYEE)

DBCC results for 'EMPLOYEE'.
Msg 8951, Level 16, State 1, Line 1
Table error: Table 'EMPLOYEE' (ID 2073058421). Data row does not have a matching index row in index 'indx_employee_empname' (ID 2). Possible missing or invalid keys for the index row matching:
Msg 8955, Level 16, State 1, Line 1
Data row (3:8:2) identified by (HEAP RID = (3:8:2)) with index values (EMP_NAME = 'Nelson' and HEAP RID = (3:8:2)).
Msg 8952, Level 16, State 1, Line 1
Table error: Table 'EMPLOYEE' (ID 2073058421). Index row in index 'indx_employee_empname' (ID 2) does not match any data row.
Possible extra or invalid keys for:
Msg 8956, Level 16, State 1, Line 1
Index row (4:8:0) with values (EMP_NAME = 'Albert' and HEAP RID = (3:8:2)) pointing to the data row identified by (HEAP RID = (3:8:2)).
There are 3 rows in 1 pages for object "EMPLOYEE".
CHECKTABLE found 0 allocation errors and 2 consistency errors in table 'EMPLOYEE' (object ID 2073058421).
repair_rebuild is the minimum repair level for the errors found by DBCC CHECKTABLE (SALES.dbo.EMPLOYEE).
DBCC execution completed. If DBCC printed error messages, contact your system administrator.


Note also, how the command found that the index entry (of Albert) does not match the table data (which is Nelson).

In fact, with this type of soft corruption, you can "repair" the table (actually page 8), and the index entry, by using

DBCC CHECKTABLE(EMPLOYEE,repair_rebuild)

This command will repair the checksum, and makes sure that the index is in sync with the table data.
If you change the "page verify" option back to CHECKSUM, you can query the table and no inconsistencies are reported.

I admit that this section was more for demonstration purposes than for a real recovery operation.
Anyway, it taught us a lot about certain details of corruption.

Suppose you encounter a situation similar to the above, with many checksum errors on many pages.
Then, by just changing the "page verify" database option, you might be able to select all records of the affected tables.
But that condition is a bit "disturbing", of course. You might consider:

1: Use DBCC CHECKTABLE(table_name,repair_rebuild)
Just try the above command on all affected tables.

2: "Rebuilding" the affected tables into new tables.
Essentially this means putting the data into new pages, with correct checksums.
You can simply insert the data from the old tables into new tables (if "select *" works, with page verify=NONE).
You can use statements like:

SELECT * INTO table_b -- create table_b "on the fly"
FROM table_a

INSERT INTO table_b -- table_b needs to exist already
SELECT * FROM table_a

(Note: it sounds easy, but could be complicated due to PK-FK relations, renaming tables, triggers etc..)
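As a sketch, such a rebuild of our demo table could look like this (assuming, as in our demo database, no foreign keys or triggers on the table; EMPLOYEE_NEW is just a name for this example):

```sql
-- With page verify = NONE, copy all rows into a brand-new table (new, sound pages)
SELECT * INTO dbo.EMPLOYEE_NEW FROM dbo.EMPLOYEE

-- Then drop the corrupt table, and give the copy the original name
DROP TABLE dbo.EMPLOYEE
EXEC sp_rename 'dbo.EMPLOYEE_NEW', 'EMPLOYEE'
```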

3: Restore individual pages from backup.
In principle, if your database uses the "Full Recovery model" (or Bulk-logged model), then you
are able to restore an individual page, in case that page is corrupt.
The syntax is like in the following example:

RESTORE DATABASE SALES PAGE = '2:155' FROM DISK='d:\backups\sales.dmp' WITH RECOVERY

Here, we restore page number 155 in file number 2, from the disk-based backup.
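Note that when log backups exist, a page restore is normally completed by applying the log chain after the page itself is restored. A sketch, with assumed backup file names:

```sql
-- Restore only the damaged page from a full backup
RESTORE DATABASE SALES PAGE='2:155'
FROM DISK='d:\backups\sales.dmp' WITH NORECOVERY

-- Apply the available log backups (file name is just an example)
RESTORE LOG SALES FROM DISK='d:\backups\sales_log.trn' WITH NORECOVERY

-- Finish the restore sequence
RESTORE DATABASE SALES WITH RECOVERY
```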


Chapter 3: Severe corruption on ordinary data pages.

In Chapter 2, we only changed an employee name in a data page, using a hex editor.
In fact, it wasn't a severe form of corruption, but it was corruption all right.

In this section, I will put a lot of records into the EMPLOYEE table. Then, when the database is offline,
I will use xvi32 to wipe out pages N and N+1. I will start somewhere "in the middle" of page N, all the way
through to the middle of page N+1, and fill that region with all zeros.
Actually, this is a severe corruption, because I cross page boundaries and also destroy a page header.

What will be the effect, and how can we resolve the situation?

If you want to follow along with the exercise, then delete the SALES database, and create it again using the script of Chapter 1.
Then, only create the EMPLOYEE and EMPLOYEE2 tables, but don't insert the 3 records per table.
Also, do not create the index (the last statement).

Ok, let's insert a bunch of records in the EMPLOYEE table.
Use the following script to insert 9999 records in the EMPLOYEE table:

DECLARE @i INT

SET @i=1

WHILE (@i<10000)
BEGIN
INSERT INTO EMPLOYEE
VALUES (@i,'Harry',1000)
SELECT @i=@i+1
END

Thanks to the @i, we have unique records in the table, although the employee name and salary are the same in all records.
Actually, for our purpose, it does not matter at all, as long as we are able to destroy a few data pages,
with good pages to the "left" and "right" of pages N and N+1.
Of course, if you like, you can insert many more rows, by just varying the WHILE clause, for example using
(@i<=1000000), which will insert a million rows.

Actually, how many pages are allocated to the EMPLOYEE table right now? Various options exist, but we can use
the DBCC PAGE() command again, and this time request information from "page 1" in "file number 3" (sales_data_01.ndf).
This is the Page Free Space (PFS) page, which stores information about the allocation and free space of pages.
So, let's try that:

DBCC TRACEON (3604)
GO

DBCC PAGE('sales',3,1,3)
GO

Output:
(skipping a lot of lines, only showing the last lines...)

PFS: Page Alloc Status @0x4416C000

(3:0) - (3:3) = ALLOCATED 0_PCT_FULL
(3:4) - (3:5) = NOT ALLOCATED 0_PCT_FULL
(3:6) - (3:7) = ALLOCATED 0_PCT_FULL
(3:8) - = ALLOCATED 100_PCT_FULL Mixed Ext
(3:9) - = ALLOCATED 0_PCT_FULL IAM Page Mixed Ext
(3:10) - (3:16) = ALLOCATED 100_PCT_FULL Mixed Ext
(3:17) - (3:23) = NOT ALLOCATED 0_PCT_FULL
(3:24) - (3:50) = ALLOCATED 100_PCT_FULL
(3:51) - = ALLOCATED 50_PCT_FULL
(3:52) - (3:5119) = NOT ALLOCATED 0_PCT_FULL

What we see here is that, for example, pages 3:24 to 3:50 are allocated to the EMPLOYEE table, and are filled with records.
So, if I kill pages 30 and 31, we have the situation that we want to investigate.

Next, put the sales database offline.
Start xvi32, and open the "c:\mssql\data\SALES_DATA_01.ndf" datafile.
In xvi32, we want to go to database page 30. Now 30 x 8192 = 245760, or in hexadecimal that would be 3C000.
Page 31 would start at 3E000. So, I will go to somewhere in between 3C000 and 3E000, and fill
a region with all zeros, crossing the boundary to the next page.

Save the file, and close xvi32.
Next, put the SALES database online. As expected, SQL Server does not complain at all.

Now let's query the EMPLOYEE table.

SELECT * FROM EMPLOYEE

We will see a number of records pass the screen, as SQL Server will just read the table from the beginning.
Then, it will try to read the corrupt page, and an error message appears:

Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x90ee5853; actual: 0x949c5001).
It occurred during a read of page (3:30) in database ID 5 at offset 0x0000000003c000 in file 'c:\mssql\data\SALES_DATA_01.ndf'.
This is a severe error condition that threatens database integrity and must be corrected immediately.
Complete a full database consistency check (DBCC CHECKDB).

This time, it will not help to change the database option "page verify" from "CHECKSUM" to "NONE", since this time
we are not dealing with just a "soft" corruption. No, this time, two database pages are truly destroyed.

How can we recover from this situation?
Let's try to repair the table with:

Method 1. using SQL server repair methods like DBCC CHECKTABLE(table_name, repair_clause).
Method 2. Our own "monkey" method where we try to salvage as many rows as possible.

Method 1: DBCC CHECKTABLE (table_name,repair_allow_data_loss)

Let's first let SQL Server check the integrity of the table, using the DBCC CHECKTABLE(EMPLOYEE) command.

DBCC CHECKTABLE(EMPLOYEE)

A lot of output appears, but the most important lines are the following:

..
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (SALES.dbo.EMPLOYEE).

Ok, so if we want SQL Server to repair the table, our only chance is by using
the DBCC CHECKTABLE (EMPLOYEE,repair_allow_data_loss) command.

Here, SQL Server warns us that we may experience possible data loss. In our case, the data in a few pages
is really lost. But if we can save the majority of rows in the EMPLOYEE table, the damage is somewhat limited,
so let's try that command.

First, in order to use the command, you need to put the database in "Single User Mode". After you have done that, we try the SQL Server repair method:
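The "Single User Mode" switch itself can be done with a statement like this (ROLLBACK IMMEDIATE disconnects any other sessions):

```sql
ALTER DATABASE SALES SET SINGLE_USER WITH ROLLBACK IMMEDIATE
GO
```

With the database in single user mode, we can now run the repair command: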

DBCC CHECKTABLE (EMPLOYEE,repair_allow_data_loss)

A lot of output appears, but the most important lines are the following:

..
Repair: The page (3:30) has been deallocated from object ID 2073058421, index ID 0,
partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
Repair: The page (3:31) has been deallocated from object ID 2073058421,
index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
..
Msg 8928, Level 16, State 1, Line 1
The error has been repaired.
..

Yes, we have success! SQL Server essentially deallocated pages 30 and 31 (as expected), and moved some pointers
so that all pages of the table have pointers to the previous and next page.
The table is repaired, except that we miss the data from pages 30 and 31, which was unavoidable of course.
Again, we can perform queries on the EMPLOYEE table in the usual way.

The following listing clearly shows what is missing:

SELECT * FROM EMPLOYEE

skipping a lot of records
..
3987 Harry 1000.00
3988 Harry 1000.00
3989 Harry 1000.00
3990 Harry 1000.00 <- here the "gap" starts
4561 Harry 1000.00 <- here the "gap" ends
4562 Harry 1000.00
4563 Harry 1000.00
4564 Harry 1000.00
4565 Harry 1000.00
4566 Harry 1000.00
4567 Harry 1000.00
..

Of the 9999 records, we miss 570 records (4561 - 3990 - 1). That's the penalty of killing 2 database pages.
But since our records are so small, many records fit in a database page. So, with many other
true production tables, you might "only" miss just a few records.
I am not saying that it's perfect, but at least you have a sound table back, with the majority of records.

Method 2: Selecting "around" corrupt pages, just using queries.

In the unlikely event that "method 1" does not work for some reason, and you still have the corrupt table,
you might consider this method:
It's also possible to "select around" the corrupt pages. In effect, you just use SQL to select as many rows
as you can, skipping the rows on the damaged pages.
This "trick" usually is not easy. It certainly helps if the table has a unique key, and if an index is defined
on that key.

Still, using no more than common sense, we can imagine that we can just select rows, and put them in a copy table,
as long as SQL Server does not "touch" the damaged pages.

Here is a trivial (wrong) example that intuitively might look all right (but most often doesn't work well),
in order to try to salvage "sound" rows from the corrupt table.


declare @i int
set @i=0

begin try
while (@i<100000)
begin
insert into employee_copy
select * from employee where emp_id>@i and emp_id<=@i+10
select @i=@i+10
end
end try

begin catch
print @i  -- roughly where we hit the corrupt page
end catch

As you can see, we try to start at "the top" of the table, and work our way right through it, copying rows
to a copy table. Until SQL Server meets the corrupt page, of course.
Then we could restart this game (setting @i at an appropriate value) to process the records "below" the
damaged pages.
Well, it could work. But the above code is way too simple and often does not work.
That's why Chapter 7 will go into some details of this method.
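Even with the simple loop above, once it has aborted you can at least check how far it got, and pick a restart value for @i just "above" the damaged pages:

```sql
-- See which rows made it into the copy table before the error
SELECT MAX(emp_id) AS last_copied_id FROM employee_copy

-- Then restart the WHILE loop with @i set somewhat above that value
```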

Recap:

In Chapters 2 and 3, we have seen how we can handle corruption on ordinary data pages.

  1. We have seen what happens if a "soft" corruption in a table page has occurred, which we solved without data loss.

  2. We have seen what happens if a "hard" corruption in a few table pages has occurred, which we solved with some data loss.
    In the latter case, a few table pages were truly destroyed, so some data loss was actually unavoidable.


=> For finding problems, we used:

DBCC CHECKTABLE(table_name)

This will not repair or alter anything, but the output will show you (in most cases) what's wrong, and might give
a hint on which minimum level of DBCC command should be used next.

In order to check the whole database in one run, you can use:

DBCC CHECKDB(database_name)

Here too, only problems will be reported, but not repaired. Since it checks the whole database, possibly multiple
objects are reported.
Of course, it's recommended to check your databases on a regular basis.
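For example, a simple statement you could schedule as a regular job:

```sql
-- Check the whole database; NO_INFOMSGS suppresses the informational output
DBCC CHECKDB('SALES') WITH NO_INFOMSGS
GO
```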

=> For solving the soft corruption of (a) table page(s) we used:

(1): Set the "PAGE_VERIFY" database option to NONE, and see if you can now select the whole table.
If you can retrieve all records, multiple options exist to rebuild the data (like INSERTing rows into a copy table).
But you likely have no insurmountable problems, so you just might as well try to repair the table.

(2): In order to repair the table, we used:

DBCC CHECKTABLE(table_name,repair_rebuild)

This will not lead to data loss, since it will only rebuild checksums, pointers, and likely index entries too.

Here too, a similar "database wide" DBCC CHECKDB command exists as well. It's:

DBCC CHECKDB(database_name, repair_rebuild)

However, if you have multiple corrupt objects, I still would recommend handling them one at a time.

=> For solving the hard corruption of (a) table page(s) we used:

DBCC CHECKTABLE (table_name,repair_allow_data_loss)

It does not necessarily lead to data loss. In the example we have used, it was a brutal attack on data pages, fully
destroying them. Obviously, some data loss after repair was unavoidable.
In general, it's likely (but not always so) that the "bad" pages get deallocated from the object.

Here too, a similar "database wide" DBCC CHECKDB command exists as well. It's obviously:

DBCC CHECKDB(database_name, repair_allow_data_loss)

However, if you have multiple corrupt objects, I still would recommend handling them one at a time.

Another option for repair is the use of "smart" queries (the monkey method), where we try to select "around"
the bad pages. However, certain conditions should be in place, like a unique key.
This is the subject of Chapter 7.

Please note that all of the above handles ordinary data pages. If a "special" page in a database file
gets corrupted (like page 0: the header page, or page 1: the PFS page, or page 2: the GAM page, etc..),
we have a larger problem, and in many cases, DBCC commands will not be of help.

Note: if you have determined that the corrupt object is "just" an index, you might simply DROP and CREATE the index again.
An index is just "derived" data, and can be rebuilt without loss of table data. You only have to keep in mind
that the index might support a "constraint". If that's true, there are some follow-up actions.
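For our demo index from Chapter 1, that would simply be:

```sql
DROP INDEX indx_employee_empname ON EMPLOYEE
GO
CREATE NONCLUSTERED INDEX indx_employee_empname ON EMPLOYEE(EMP_NAME) ON SALESINDEX01
GO
```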


Chapter 4: Inaccessible or Suspect database, due to a Missing or corrupt Transaction log. Msg 945.


In this chapter, we will explore what we can do when the Transaction log is missing, or is corrupt.

- If the Transaction Log is missing, your database will be "inaccessible". You will get an error message that will
inform you of that fact.
- If the Transaction Log is really corrupt, the database might be inaccessible, or it will be put in "Suspect" mode.
- If you have corrupt objects like tables and indexes, and SQL Server needs to rollback or rollforward data
related to those tables, the database will be placed in Suspect mode. That's why, after a crash, a database might
show up as Suspect. In Chapter 3 we have seen that even if some ordinary data pages (table or index) are hard corrupted, but they
do not need to be accessed during the start (at the recovery phase), SQL Server will just start normally.

As said before, in this chapter we will see what we can do if the Transaction log is corrupted or missing.
Of course, such a small note as this one cannot list all possible solutions. But three realistic solutions
are presented here, which are:

(1) ALTER DATABASE DATABASE_NAME SET EMERGENCY (not recommended)
This allows you to SELECT all tables, and thus you can salvage all data. But this method is not recommended,
since below, two methods are listed that might repair your database.
However, the method is often used to get out of the Suspect state, and to go to the Emergency state.

(2) ALTER DATABASE DATABASE_NAME REBUILD LOG ON (NAME=logical_name,FILENAME=physical_name)
This method is recommended. In many cases, if the transaction log files were gone, it will rebuild
a new transaction log file. If it succeeds, you have an operational database.

(3) CREATE DATABASE DATABASE_NAME ON (FILENAME = 'path to mdf file') FOR ATTACH_FORCE_REBUILD_LOG
If method (2) has failed, possibly due to the fact that SQL Server thinks the database is in "recovery_pending",
or due to other reasons, we can try to "detach" and "attach" the database again, with an ATTACH_FORCE_REBUILD_LOG clause.
There is a warning though! Preferably, first create copies of your database files to another disk or folder(s).

4.1 Simulating a Missing (or corrupt) Transaction log file.

Please create the SALES database again, using the full script as listed in chapter 1.

Again, we have a sound database, with a couple of tables which we can access in the usual way.
Now, to simulate a serious error condition, in which the transaction log file is missing, do the following:

- Shut down the SQL Server service.
- Navigate the filesystem to the "c:\mssql\data" folder.
- Delete the "SALES_LOG_001.ldf" transaction log file.
- Start the SQL Server service again.

Next, open SQL Server Management Studio (SSMS), and try to access the SALES database.
Likely, SSMS will throw the error message "SALES is not accessible.." at you.
In SSMS, you will also see that next to the SALES database icon, there is no plus sign (+) to examine objects in that database.
If we take a look at the SQL Server log, we find entries like the following:

File activation failure. The physical file name "c:\mssql\data\SALES_LOG_001.ldf" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
Msg 945, Level 14, State 2, Line 1
Database 'SALES' cannot be opened due to inaccessible files or insufficient memory or disk space.
Msg 5069, Level 16, State 1, Line 1
ALTER DATABASE statement failed.

and maybe also:
FileMgr::StartLogFiles: Operating system error 2(error not found) occurred while creating or opening file
'c:\mssql\data\SALES_LOG_001.ldf'. Diagnose and correct the operating system error, and retry the operation.

We are not able to open and access the SALES database. There are several strategies that we can follow.
The next sections will discuss various methods.

4.2 Solution 1: Perform the necessary Trivial checks

Here, of course, we know that the logfile is missing, but in general you should quickly check the following trivial items:

- did somebody accidentally move database files (on your Server) to another location?
- is the file perhaps in the Recycle Bin?
- do you have an NTFS undelete utility? (there are even a couple of free tools available)
- were perhaps the filesystem permissions changed, so that the service cannot access the file?

We are not going to undelete the file, and we further treat this case as if the logfile is permanently missing.

4.3 Solution 2: place the Database in Emergency Mode (not recommended)


If it's clear from the error messages that it is the Transaction log that is missing or corrupt,
then this method is not recommended.
But in other circumstances, for example, when you first see a "Suspect" database, then it could be
a good action.

But suppose we get spooked by now, because we know we have a serious problem. Although better methods exist to bring the database
back to life, we might decide to retrieve the information from the SALES database as soon as possible.

In many cases, if you can't access a database (for example, due to missing transaction log files, or due to corrupt objects), you can place a database in "Emergency Mode".
If the database was "cleanly" shutdown before the deletion of the transaction logfile, it probably will work.

If it works, the database is then effectively to be considered as "read-only", but we are able to SELECT all tables,
and copy the data to another database. Also, export tools such as "bcp" will work.
There is no guarantee that it will always work, but in many cases, it really does.

Let's try it. Start a Query Window, and execute the following TSQL statement:

ALTER DATABASE SALES SET EMERGENCY

If it worked, then if you refresh the databases icon in SSMS, you will now see the "+" appear before the SALES database,
and you can view the objects as usual. Effectively, the database is read-only, but you are able to retrieve the data.
Just try to select the rows from the EMPLOYEE table, to check if it works.

At least this solution is a method to salvage all your data, but it is not recommended.
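If Emergency mode indeed gives you access, the simplest way to salvage a table is a SELECT INTO another, healthy, database. A short sketch; the "RESCUEDB" database name below is just an example, and is assumed to exist already:

-- Copy the EMPLOYEE table out of the damaged SALES database
-- into a new table of the same name in a healthy rescue database:
SELECT *
INTO RESCUEDB.dbo.EMPLOYEE
FROM SALES.dbo.EMPLOYEE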

It can also be interesting to take a look at a couple of system views, like for example:

select substring(name,1,30), database_id, state, state_desc from sys.databases
select * from sys.dm_tran_database_transactions


4.4 Solution 3: Rebuild the Transaction Log (recommended)

Several approaches exist to let SQL Server create a new Transaction logfile.

Once, we had the DBCC REBUILD_LOG() statement at our disposal, but it's not available in all versions,
so we don't consider that one.

What we will try instead is the following command:

ALTER DATABASE SALES REBUILD LOG ON (NAME=logical_name,FILENAME=physical_name)

So, let's try it:

ALTER DATABASE SALES REBUILD LOG ON (NAME=Sales_Log,FILENAME='c:\mssql\data\sales_log.ldf')

The following output hopefully appears:

Warning: The log for database 'SALES' has been rebuilt. Transactional consistency has been lost.
The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were.
You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode.
When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.

If the above output appeared, we have success! If you now refresh SSMS, the Emergency mode is replaced by "Restricted User" mode.
Now, change the database properties so that the Database is multi-user again.
Also, we have a rebuilt transaction log, and you can add files as you see appropriate for the Log.
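Instead of using the database properties dialog in SSMS, the same can be done with a simple TSQL statement:

-- Make the database available to all users again:
ALTER DATABASE SALES SET MULTI_USER
GO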

4.5 Solution 4: Rebuild log with CREATE DATABASE .. FOR ATTACH_FORCE_REBUILD_LOG

If the solution of section 4.4 has failed, we are still not out of options yet.
Indeed, in some cases, SQL Server might "think" that a database is in a "RECOVERY_PENDING" state.
Then, if a Log is lost, it might refuse to rebuild the transaction logfile, but there are other causes as well.
In this simple note, we emulate problems with the transaction log by just deleting it (while the database is offline).
But there could be really complex problems due to (what SQL Server thinks are) unfinished transactions that must roll back.

Below is actually a bit of a trick, so if you are working right now on the SALES demo database, you do not have to take
any precautions.

But if you want to apply the following method to a database that is of any value, I advise you to create copies of the
database files (e.g. to another disk, or folder). So, you might consider shutting down the SQL Server service, and make copies.

Anyway, we go ahead with SALES database.
Please drop it, and create it again using the full script in Chapter 1.

Again, we have a sound database, with a couple of tables which we can access in the usual way.
Now, just as in section 4.1, do the following:

- Shut down the SQL Server service.
- Navigate the filesystem to the "c:\mssql\data" folder.
- Delete the "SALES_LOG_001.ldf" transaction log file.
- Start the SQL Server service again.

If you try to access the database, you get the familiar error messages, just as described in section 4.1.

This time we are going to "detach" the database, so that SQL Server places it in a specific state, just as if
it does not know the SALES database anymore.
But again, it's advisable to create copies of the database files first (if possible), while SQL Server is down.

Let's try the following:

sp_detach_db sales

Now, you may receive output that the command completed, but it's more likely you get output similar to this:

Msg 947, Level 16, State 1, Line 1
Error while closing database 'SALES'. Check for previous additional errors and retry the operation.

Now, don't worry about the message above. The database is de-registered from SQL Server.
Also, we still have the original database files (.mdf and .ndf files), and next we tell SQL Server to attach
the SALES database, just as if you were moving it from one Server to another.

CREATE DATABASE SALES
ON (FILENAME = 'c:\mssql\data\sales.mdf')
FOR ATTACH_FORCE_REBUILD_LOG

You probably get the following output, which actually means success, if you inspect the second line closely:

File activation failure. The physical file name "c:\mssql\data\SALES_LOG_001.ldf" may be incorrect.
New log file 'c:\mssql\data\SALES_log.LDF' was created.

Next, try if you can for example select the EMPLOYEE table again.
Hopefully it works, and we conclude that we have repaired the SALES database.
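Since a forced log rebuild means that transactional consistency may have been lost, it is advisable to check the physical consistency of the reattached database as well:

-- Verify the integrity of the reattached database:
DBCC CHECKDB('SALES')
GO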


Chapter 5: Inaccessible or Suspect database, due to Severe corruption.

5.1 A few remarks on database pages: the ordinary datapages and the special pages

Chapters 2 and 3 dealt with corruption on (ordinary) data pages. These pages will generally hold table- and index data.
If objects on ordinary data or index pages go corrupt, SQL Server generally does not alter the "state of the database",
unless there is a reason to put it into the "Suspect" state (for example when, at startup, a rollback is needed but the object is corrupt).

You know that a database will have one .mdf (primary) file, and usually one or more .ndf (secondary) files,
which will hold (or are supposed to hold) the actual data (tables, indexes etc..).
The first couple of pages in the .mdf or any .ndf file are "special", and will contain (among other things)
identification data and metadata, describing the database and the (starting) location of all objects in the file.
These pages form the "administration" of the files and the database.

A page is a unit of 8192 bytes with a specific structure. The larger part of a database file is further organized
into extents. An extent is a collection of eight physically contiguous pages; extents are used to efficiently manage the pages.

The "administration" of pages/extents is covered by some very specific pages:

- The "GAM" (Global Allocation Map) and "SGAM" (Shared Global Allocation Map) pages keep track of which extents
have been allocated, and which are free.
- The "PFS" page (Page Free Space page) keeps information about page allocation and free space available on pages.

You will find a GAM and an SGAM page as page 2 and page 3 at the beginning of a file, each covering about 64000 extents (close to 4 GB).
PFS pages can be located at multiple places, but you will find one as page 1 at the beginning of a file.
By the way, the free space information in PFS pages is only maintained for "heaps", which are tables without a clustered index, and for LOB pages.

So, the first pages at the beginning of a database file have a specific purpose. Let's recapitulate them first.

Overview Specific Pages:

page 0: The very first page in any file. It's the "file header" page. It contains information about the file itself,
like to which "dbid" (database id) it belongs, on which pages the GAM and SGAM (sort of directory pages) are located, etc..

page 1: The (first) PFS page. It keeps track of allocated pages and free space in pages.

page 2: The first GAM page.

page 3: The first SGAM page.

pages 4-7: Specific pages; most notably, page 6 is the "DCM" (Differential Changed Map) page, which tracks extents changed since the last full backup, and page 7 is the "BCM" (Bulk Changed Map) page, which tracks extents modified by minimally logged (bulk) operations.

Usually, the first usable page (in an .ndf file) for regular data will start from page 8.
If an index was created in the database, page 9 could also be the first IAM page, unless the first
table creation and inserts of data spanned a number of pages (as of page 8). Then, IAM's occur at later positions.

Only in the primary .mdf datafile, page 9 is the "boot page". It stores specific metadata, as well as Log Sequence Numbers
for all sorts of operations, such as full backups, differential backups, create index LSN's etc..

Let's play around a bit with the DBCC PAGE() and DBCC DBINFO() commands.

Let's first create a new SALES database again. If you still have the old one, then just drop it using SQL Server Management Studio.
Now, create a new SALES database using the script of chapter 1. This script creates a database with
a specific layout with multiple filegroups, and also will create a few tables on different filegroups.

After you have executed the script, let's explore the special pages a bit:

Playing around to view some specific pages:

DBCC TRACEON (3604)
GO

-- 1. let's take a look at page 0 (fileheader) in file no 3 (which is "sales_data-01.ndf");

DBCC PAGE('sales',3,0,3)


Only partial output is shown above.
Do you, for example, notice the location of the GAM and SGAM pages (at the bottom)?


-- 2. let's take a look at page 1 (PFS) in file no 3:

DBCC PAGE('sales',3,1,3)


Only partial output is shown above.
Do you, for example, notice the listing of free and allocated pages (at the bottom)?


-- 3. let's take a look at page 2 (GAM page) in file no 3:

DBCC PAGE('sales',3,2,3)


Only partial output is shown above.
Do you, for example, notice the listing of allocated and free extents (at the bottom)?
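Finally, the DBCC DBINFO() command, mentioned above, dumps the metadata stored in the boot page, such as the database name, version information, and various LSN's:

-- 4. let's take a look at the boot page metadata of the SALES database:

DBCC TRACEON (3604)
GO

DBCC DBINFO('sales')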


5.2 Inaccessible or Suspect database due to corruption of an .ndf file

Note: for information on what to do at a missing or corrupt Transaction log: please see Chapter 4.

Of course, most pages in a database are ordinary data pages (like table- and index pages). If corruption is found
here, we have a good chance to repair it.

Situation 1: Suspect database, and/or ordinary data pages in an .ndf file are corrupted. Then:
  1. Either SQL Server will start normally, if it does not need to access the related objects for recovery purposes.

  2. Or, it might put the database in a suspect state, if at startup it needs to access the corrupt objects (like at a "rollback"/"rollforward").


Situation 2: Severe corruption or missing .ndf file.

In section 5.1, we have seen a few examples of "special" system pages. They surely exist at the beginning of a database file,
but some might be present at certain locations throughout the file (like the PFS pages, IAM's etc..), depending
on the size of the file, and the number of objects.

Situation 1: ordinary data pages in an .ndf file are corrupted

Let's first deal with "Situation 1".
Situation 1 is, in general, survivable using the techniques we have seen before.
But it must be said that if the Transaction Log is (physically or logically) corrupt at the same time, it gets a bit harder.

Let's first explore the more general case:

First, if the database is Suspect, you can try to put the database in Emergency mode.
After that, it is likely to be accessible. Then you can run the DBCC CHECKDB command to obtain a list of objects
which have corrupt pages.
Then, it might become clear that you must run DBCC CHECKTABLE with the "repair_allow_data_loss" option, to repair
the listed objects. This might result in (usually slight) data loss. But at the times I had to use
this procedure "for real", data loss really did not happen often. So, many corruptions are just logical corruptions,
instead of really trashed pages (like we have seen in Chapter 3).

To put the above in a "neat" list:

(1): Database is suspect. You then try:
(2): ALTER DATABASE database_name SET EMERGENCY
(3): If it turns out that the database is accessible (for example, you can query a Table), you then try:
(4): DBCC CHECKDB(database_name) to obtain a list of corrupt objects and other information.
(5): If it turns out that only tables and/or indexes are affected, then:
(6): Run DBCC CHECKTABLE(table_name) for those tables. See what minimum "repair clause" is needed.
(7): Run DBCC CHECKTABLE(table_name,repair_rebuild) or DBCC CHECKTABLE(table_name,repair_allow_data_loss).
(8): If all tables are done, run DBCC CHECKDB(database_name) again.
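The steps in the list above can be sketched as a sequence of TSQL statements, using our SALES demo database and EMPLOYEE table as example names:

-- (1)/(2): Get access to the Suspect database:
ALTER DATABASE SALES SET EMERGENCY
GO

-- (4): Obtain a list of corrupt objects:
DBCC CHECKDB('SALES')
GO

-- The repair options require the database to be in single user mode:
ALTER DATABASE SALES SET SINGLE_USER
GO

-- (6)/(7): Repair an affected table; start with the mildest option:
DBCC CHECKTABLE('EMPLOYEE', repair_rebuild)
GO

-- (8): Verify, and make the database available again:
DBCC CHECKDB('SALES')
GO

ALTER DATABASE SALES SET MULTI_USER
GO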

Hopefully it has worked.

If it turns out that the Transaction log was "bad" as well, try the solutions of sections 4.4 or 4.5.

Situation 2: Severe corruption in an .ndf file

It's possible that, in such a situation, "partial database availability" can be of use.

Since SQL Server 2005, there is a new "state" in which a database can exist, and it's called "partial database availability".

If you are familiar with Oracle, you know that you can put a "tablespace" offline. That feature has been available for a very long time.
The "Tablespace" concept of Oracle is very similar to the "Filegroup" concept in SQL Server.

In principle, it's now possible to place a Filegroup offline, while the rest of the database is still accessible.
Of course, if you only have a few Filegroups, then you are probably not helped much if a large part of the database is not online.
But even if you have only a small number of filegroups, you still might have "distributed" the objects in a sort of
"functionally" meaningful way, so that the application can still function even if a part of the database is offline.
Of course, you cannot expect this to work if you use a simple structure like having one .mdf file and a transaction log.

So, there are a few requirements before you can meaningfully use "partial database availability".

- The database should use the "Full Recovery model" for total recovery, but the feature will still work if the model is "Simple".
- There exists sufficient "structure" in the number and purpose of filegroups.

Let's demonstrate the "partial database availability" feature. Create the SALES database again, using the script from Chapter 1.
After the SALES database is created again, we know that we have the EMPLOYEE table in the "SALES_DATA_01" file group,
while the EMPLOYEE2 table resides in the Primary Filegroup.
Now, we execute the following:

ALTER DATABASE SALES
MODIFY FILE (NAME = 'SALES_DATA_01' , OFFLINE)
GO

ALTER DATABASE SALES SET ONLINE
GO

So, the filegroup "SALES_DATA_01" (and thus the file "c:\mssql\data\SALES_DATA_01.ndf") is offline,
while the rest of the database is still open.
Now, what happens if we query the EMPLOYEE and EMPLOYEE2 tables?
We expect that we are not able to access the EMPLOYEE table, but querying the EMPLOYEE2 table should work.

SELECT * FROM EMPLOYEE

Msg 8653, Level 16, State 1, Line 1
The query processor is unable to produce a plan for the table or view 'EMPLOYEE' because the table resides in a filegroup which is not online.


SELECT * FROM EMPLOYEE2

EMP_ID EMP_NAME SALARY

1 Harry 2000.50
2 Nadia 3000.00
3 Albert 5000.00

This feature indeed works.

So, as of SQL Server 2005, it is possible to mark a lost or corrupt filegroup as offline, and then bring the rest
of the unaffected database online, so that an application might continue to function.
Of course, objects that are located on the offline filegroup are not available.

The larger your database is, and the more it is distributed over different filegroups, the better this feature will work.

Of course, you should restore the filegroup and use transaction log backups to fully recover.
But at least it is a feature with which not your entire database is lost.
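As a sketch only (the backup file names below are just assumptions; your own backup strategy determines the real sequence), restoring a single filegroup could look like this:

-- Restore the damaged filegroup from a (hypothetical) filegroup backup:
RESTORE DATABASE SALES
FILEGROUP = 'SALES_DATA_01'
FROM DISK = 'c:\mssql\backup\sales_fg_data01.bak'
WITH NORECOVERY
GO

-- Then roll forward using the (hypothetical) transaction log backups:
RESTORE LOG SALES
FROM DISK = 'c:\mssql\backup\sales_log.trn'
WITH RECOVERY
GO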



Chapter 6: Overview of SQL Server Repair commands.


1. PROBLEMS WITH TRANSACTION LOG:

See chapter 4 for explanations. You might try one of the following:

1. ALTER DATABASE DATABASE_NAME REBUILD LOG ON (NAME=logical_name,FILENAME=physical_name)

2. CREATE DATABASE DATABASE_NAME ON (FILENAME = 'path to mdf file') FOR ATTACH_FORCE_REBUILD_LOG

3. ALTER DATABASE DATABASE_NAME SET EMERGENCY (not recommended)

4. If everything fails (don't use it, unless out of options):

- Detach the database.
- Shutdown SQL Server.
- Rename .mdf and all .ndf files to .old (all database files)
- Create a small blank database with the same name, and exactly same filegroups and physical files.
- Detach the new database
- Shutdown SQL Server.
- Delete the .mdf and .ndf files of the blank database, except the log .ldf file.
- Rename the .old files to original .mdf and .ndf files.
- Start SQL Server
- Use CREATE DATABASE DATABASE_NAME ON (FILENAME = 'path to mdf file') FOR ATTACH_FORCE_REBUILD_LOG
- If the database opened: use DBCC CHECKDB
- If database Suspect: ALTER DATABASE DATABASE_NAME SET EMERGENCY
- use DBCC CHECKDB, and then DBCC CHECKTABLE() for all affected objects.


2. PROBLEMS WITH OBJECTS LIKE TABLES, CORRUPT DATA PAGES:

See chapters 2 and 3 for explanations. You might try one of the following:

1. Retrieving information on possible corrupt objects:

DBCC CHECKDB(DATABASE_NAME) -- only listing, not repairing

2. Establishing level of corruption of a table and associated objects:

DBCC CHECKTABLE(TABLE_NAME) -- only listing, not repairing

3. Repair logical corruption like checksum, index entries, pointers:

DBCC CHECKTABLE(TABLE_NAME,repair_rebuild) -- no dataloss

4. Repair hard corruption (possible de-allocation of some pages):

DBCC CHECKTABLE (TABLE_NAME,repair_allow_data_loss) -- possibly some dataloss

5. REPAIR A COMPLETE DATABASE (not recommended, see also chapter 3):

DBCC CHECKDB(database_name, repair_rebuild)
or
DBCC CHECKDB(database_name, repair_allow_data_loss)

Note:
If a table has only "soft" corruption (for example a wrong checksum on one or more data pages),
setting the "page verify" database option to NONE allows you to select the table again.
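In TSQL, that option can be changed as follows. Do not forget to set it back to CHECKSUM after you have salvaged the data:

-- Temporarily disable page verification (use with care):
ALTER DATABASE DATABASE_NAME SET PAGE_VERIFY NONE
GO

-- Afterwards, re-enable it:
ALTER DATABASE DATABASE_NAME SET PAGE_VERIFY CHECKSUM
GO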

3. PROBLEMS WITH A FILE OR FILEGROUP:

See chapter 5 for explanations.

=> Try to place the "faulty" Filegroup offline.

-- Execute below statements as "one" batch:

ALTER DATABASE DATABASE_NAME
MODIFY FILE (NAME = 'Filegroup_Name' , OFFLINE)
GO

ALTER DATABASE DATABASE_NAME SET ONLINE
GO

If it works, the rest of the database is online.

=> As another possible method:

(1): Database is suspect. You then try:
(2): ALTER DATABASE database_name SET EMERGENCY
(3): If it turns out that the database is accessible (for example, you can query a Table), you then try:
(4): DBCC CHECKDB(database_name) to obtain a list of corrupt objects and other information.
(5): If it turns out that only tables and/or indexes are affected, then:
(6): Run DBCC CHECKTABLE(table_name) for those tables. See what minimum "repair clause" is needed.
(7): Run DBCC CHECKTABLE(table_name,repair_rebuild) or DBCC CHECKTABLE(table_name,repair_allow_data_loss).
(8): If all tables are done, run DBCC CHECKDB(database_name) again.


4. SOME SUPPORTING SQL

See chapter 2 and 3 for examples.

1. View a database page:

DBCC TRACEON (3604)
GO

DBCC PAGE('database_name', file_id, pagenumber, output_mode) -- e.g.: DBCC PAGE('sales',3,30,3)

2. Get object_id from object_name, or the other way around:

SELECT OBJECT_ID(object_name)
SELECT OBJECT_NAME(object_id)

3. Get a list of tables and their associated indexes:

SELECT
substring(sysobjects.name,1,50) AS TABLENAME,
substring(sysindexes.name,1,50) AS INDEXNAME,
sysobjects.id, sysindexes.indid, sysindexes.groupid,sysindexes.rows
FROM sysobjects, sysindexes
WHERE sysobjects.id=sysindexes.id
ORDER BY sysindexes.rows desc

4. Kill a session (spid):

kill spid -- e.g.: kill 132

5. Show SQL Server version:

SELECT @@VERSION

6. Get Windows version:

-> start cmd -> winver
-> start cmd -> systeminfo
-> start cmd -> systeminfo > list.txt (then open list.txt with notepad)

7. Other 2005/2008:

- Showing sessions, spids, and blocked sessions
- Showing backup history from the MSDB database
- Showing index information (number of rows, fragmentation etc..)
- Showing database file info (sizes, paths, %free/%full etc..)
- Showing SQL Server waits
- Note describing SQL Server backup and restore using TSQL


Chapter 7: Using queries to salvage data from corrupt tables.


Instead of using repair commands, we might be able to use an alternative method. In this case, we simply use queries.
The trick then is to try to select "around" the corrupt pages.

Actually, the best way to repair your table is by using the methods as shown in chapters 2 and 3.
But in the unlikely event that those methods do not work, you might salvage data using queries. We might consider two categories:

1. The table has one or more non-clustered indexes.
2. The table does not have indexes, or those indexes are unusable for our purposes.

Case 1: The table has one or more non-clustered indexes:

Why would indexes help?
Suppose a table is so severely damaged that you are hardly able to query data from the table itself.
Now, you know that a nonclustered index is an external object, and it "contains" one or more columns
of the table, from all rows of the table.
So, if you use a covering query that only selects the columns contained in the index,
and the Query Optimizer decides to use that index, then the table is not accessed at all!
And suppose further that multiple indexes exist; then you might even be able to retrieve a relevant amount of data.

Let's try it using a very simple example. We will use the example of the "hard" corrupted pages
as shown in Chapter 3.
Only this time, for the sake of demonstration, the index on the EMPLOYEE table will use the EMP_NAME and SALARY columns.
So, create the SALES database, the EMPLOYEE and EMPLOYEE2 tables, and create a non-clustered index as follows:

CREATE NONCLUSTERED INDEX indx_employee_empname ON EMPLOYEE(EMP_NAME,SALARY) ON SALESINDEX01
GO

Now, execute the script that inserts 9999 rows in the EMPLOYEE table.
Next, do the procedure as listed in chapter 3, to hard corrupt a couple of data pages of the EMPLOYEE table.

As you have seen in Chapter 3 (and what you maybe are seeing again if you practice this example), you cannot
select from the EMPLOYEE table: SQL Server will come up with an error message preventing you from retrieving the rows.

Now, with our non-clustered index, we might be able to use this trick. Actually, it's not a "trick" of course,
since we will use a "covering" query which forces the optimizer to use the index, so the table is not touched at all.
So, now try this:

SELECT emp_name, salary FROM EMPLOYEE WITH (INDEX(indx_employee_empname))
GO

SQL Server will return all rows, that is, all EMP_NAME and SALARY values. Only the EMP_ID is missing,
since that is not "part" of the index.

True, it's certainly not perfect. But suppose multiple non-clustered indexes exist on your damaged table,
all using different columns; then you would be able to construct several covering queries, and you might be able
to retrieve a relevant amount of data.
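For example, suppose a second (hypothetical) non-clustered index existed that happens to contain EMP_ID and EMP_NAME. Then the results of the two covering queries could even be combined on the common column. The index name below is made up for illustration, and note that this only reassembles the rows correctly if EMP_NAME is unique:

-- Hypothetical second index, containing EMP_ID and EMP_NAME:
-- CREATE NONCLUSTERED INDEX indx_employee_empid ON EMPLOYEE(EMP_ID,EMP_NAME) ON SALESINDEX01

-- Combine the two covering queries on the shared EMP_NAME column:
SELECT i2.EMP_ID, i1.EMP_NAME, i1.SALARY
FROM EMPLOYEE AS i1 WITH (INDEX(indx_employee_empname))
JOIN EMPLOYEE AS i2 WITH (INDEX(indx_employee_empid))
ON i1.EMP_NAME = i2.EMP_NAME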


So, for now I will quit working on this document. Next, I will start a similar note on Oracle.

I certainly hope it was of use !