(scottish) sql bob blog

rss

Some thoughts of your typical data shepard / data groomer / data dance teacher sort of person.


Monitoring replication status using Nagios (using PowerShell script and TSQL)

For various reasons which l have now forgotten, l set up transactional replication for some clients.  The result of this is l am the caretaker of transactional replication for two of our clients, what l lucky person l am !

T-SQL code used to check Transaction replication 
Once l got the process up and running (a very, very long and stressful story).  At that point l realised that I would have to monitor these processes.  Following some goggling this article was found with some TSQL written by SQLSoldier here.  This worked for me and l used this to monitor replication by running the script and checking the results manually.
 
Nagios output 
Our SysAdmin uses a tool called nagios to monitor out IT estate.  So they suggested that a script could be written to monitor the replication process and send an alert if anything needed to be looked at.  This sounded like an excellent idea, how to do it?  The approach that was arrived at, involved using a PowerShell script which would run a SQL query examine the results then respond to Nagios, with the following values  

0 - all ok nothing to see here 
1- something happening nothing to worry about (just now) 
2-yes there is something that really needs some attention

See this page for more guidance.

Stored procedure 

Next we decided to use PowerShell to return the results.  Following some goggling we found this page http://www.madeiradata.com/cross-server-replication-health-check-using-powershell/ which runs a SQL script and returned a data set.  First challenge was the TSQL script from SQLSoldier was rather long for my first powershell script, l wanted some thing smaller.  So l created a stored procedure based on the script, and placed it in my Distribution database on the replicated database server.  Doing this had a two reasons, first less TSQL in the PowerShell script, second changing one of the parameters meant it returns different amounts of data.  The stored procedure takes the following parameters ; 

@Publisher  - The name of the publisher database server instance 
@PublisherDB - The name of the publisher database 
@NagiosOutput - Y = only output return code and short error code( max of 80 characters), N = output all results 

The following script was used to check the results that would be returned to Nagios. 
USE [distribution]; 
DECLARE @Publisher AS sysname 
DECLARE @PublisherDB AS sysname 
DECLARE @NagiosOutput AS Char(1) 
SET @Publisher = 'ReplPublisherServerName' 
SET @PublisherDB = 'ReplPublisherDBName' 
SET @NagiosOutput = 'Y' 
EXEC [dbo].[Check_Replication] @Publisher, @PublisherDB, @NagiosOutput; 

Note the account connecting to the database from Nagios will require execute permissions to the stored procedure otherwise it cannot run the stored procedure.  The code for the stored procedure is here.

PowerShell script 
Having adapted the PowerShell script found here to run the stored procedure.  When the PowerShell script was run by Nagios there was no 'Return Code' returned (this is what Nagios expects).  We did find the solution on this page, and inserted function ExitWithCode, and made a few other changes.  The resulting PowerShell script is below -

## Beginning of Monitor 
##Connection String With Server Variable, Distribution Database name is 'Distribution' 
$con = "server=127.0.0.1;database=Distribution;Integrated Security=sspi" 
 
##Begin SQL Query 
$cmd = "SET NOCOUNT ON; " 
$cmd = $cmd + " USE [distribution];" 
$cmd = $cmd + " DECLARE @Publisher  AS sysname" 
$cmd = $cmd + " DECLARE @PublisherDB AS sysname" 
$cmd = $cmd + " DECLARE @NagiosOutput AS Char(1)" 
$cmd = $cmd + " SET @Publisher = 'ReplPublisherServerName' " 
$cmd = $cmd + " SET @PublisherDB = 'ReplPublisherDBName' " 
$cmd = $cmd + " SET @NagiosOutput = 'Y'" 
$cmd = $cmd + " EXEC [dbo].[Check_Replication] @Publisher, @PublisherDB, @NagiosOutput;" 
 
##Creating DataSet Object 
$set = new-object system.data.dataset 
##Running Query 
$da = new-object System.Data.SqlClient.SqlDataAdapter ($cmd, $con) 
##Filling DataSet With Results 
$da.fill($set) | out-null 
##Creating Table Object and Inserting DataSet 
$dt = new-object System.Data.DataTable 
$dt = $set.Tables[0] 
 
## loop over each column in the DataSet 
foreach ($row in $set.Tables[0].Rows) 

##write out the 2nd row which contains the message text 
write-host $row[1].ToString() 

$exitcode = $row[0].ToString() 
 
## The 'exit code' fragment below was adapted from: 
## http://weblogs.asp.net/soever/returning-an-exit-code-from-a-powershell-script 
##SysAdmin, 2015-Nov 
function ExitWithCode  
{  
    param  
    (  
        $exitcode  
    ) 
    $host.SetShouldExit($exitcode)  
    exit  

 
The output from stored procedure when  is very short.  Note that Nagios only permits a maximum of 80 characters to be returned.  Hence the sample output if run the TSQL will looking this ; 

No issues
Error Code   ErrorMessage  
0 Replication OK 

Issue(s) requiring attention 
Error Code  ErrorMessage  
2  TNotRepl=45 CNotRepl=4 Latency=5 Status=In progress 

Nagios expects a error message of a maximum of 80 characters which is the reason for the brevity of the error message. The error messages are -

TNotRepl=45
 -  'Transactions not replicated' total number of commands queued to be applied to the subscriber database. 
CNotRepl=4 -  'Commands not replicated' total number of commands queued to be applied to the subscriber database 
Latency=5 - This is the time taken to from when a transaction is applied to the publisher database the amount of time it takes till the same transaction is applied to the subscriber database (in seconds).  
Status=In progress - current status of the replication process

When considering this error message it was primarily to give some guidance as to what might be happening with the replication process.  It is not intended to give any guidance on the underlying reason that is causing the issue.  All that is required is that the Nagios process shows that there is something wrong.  What ever the reason it requires some form of human intervention.  Once an error condition has been detected then the issue will be handed to me to resolve.  At least now l do not have to check the process periodically, now l just have to wait for a message from our sysadmin.