MSSQL – Runaway Job, Missing Max Run Time for SQL Job (Tame Them…)

Hi,

Recently an issue attracted my attention where we needed to tame the runaway SQL jobs, where the step in the job to import a file from the FTP source at times would run up to 2 days (which would normally complete in ~2 minutes).

Well at times ftp downloads by scripts are affected by lots of factors out of our hands, network congestion, provider or what not.

So the only way to make it work was quite simple, kill the job and start again and in the next run every thing would be absolutely fine.

But this cant be done manually some one sitting and looking at each job run time.

So I devised a way in the below script which when run at frequency intervals of 30 seconds to a minute would
monitor job (or step) current run duration and compare it with the meta data in a table. If the duration exceeds the meta-table entry the job is killed and email is send out for notification.

This has proved us very valuable, and hence thought would prove very helpful for any one in a similar situation, till SQL server jobs doesn’t come with a default max run time functionality in new releases

The script in the job step is as follows:


IF OBJECT_ID('tempdb..##KillJobList') IS NOT NULL
	DROP TABLE ##KillJobList;

 SELECT AJ.job_name , AJ.start_execution_date , AJ.current_executed_step_id , AJ.step_name , 
 AJ.RunDurationMinutes , AJ.RunDurationSeconds, JM.max_duration_sec , JM.step_number,
 AJ.CurrentStepRunDurationMinutes , AJ.CurrentStepRunDurationSeconds,

 CASE WHEN JM.step_name IS NOT NULL THEN --Check for the job step duration
	CASE WHEN JM.step_name = AJ.step_name THEN --When the current executing job matches the defination
		CASE WHEN JM.max_duration_sec < AJ.CurrentStepRunDurationSeconds THEN 1 ELSE 0 END --if the job manager max duration is less than current execution time, kill job by flag 1
	ELSE 0 END 
 ELSE  
	CASE WHEN JM.max_duration_sec < AJ.RunDurationSeconds THEN 1 ELSE 0 END	
 END KillJobFlag

 INTO ##KillJobList

 FROM dbo.vwActiveRunningJob AJ
 INNER JOIN dbo.JobManager JM ON JM.job_name = AJ.job_name
 WHERE JM.duration_check = 1;
 
 
DECLARE @sqlstmt VARCHAR(2000);

DECLARE db_cursor CURSOR FOR
SELECT job_name 
FROM ##KillJobList
WHERE KillJobFlag = 1

OPEN db_cursor   
FETCH NEXT FROM db_cursor INTO @sqlstmt   

WHILE @@FETCH_STATUS = 0   
BEGIN 

--SELECT @sqlstmt
EXEC msdb.dbo.sp_stop_job @job_name = @sqlstmt

FETCH NEXT FROM db_cursor INTO @sqlstmt 

END   

CLOSE db_cursor   
DEALLOCATE db_cursor

IF EXISTS (SELECT * FROM ##KillJobList WHERE KillJobFlag = 1)
BEGIN	
EXEC msdb.dbo.sp_send_dbmail
	@profile_name='--Mail Profile Name--',
	@recipients = 'team@address.com',
	@subject = '[Job Force Killed] Auto Job Manager',
	@body  = 'The attached file contains the list of job that were killed due to exceeding there max run time defined',
	@query = 'SELECT * FROM ##KillJobList WHERE KillJobFlag = 1',
	@body_format='HTML',
	@attach_query_result_as_file = 1,
	@query_attachment_filename = 'job_auto_killed.txt'
END	


EXEC msdb.dbo.sp_purge_jobhistory @job_name = 'Auto Job Manager' --This is necessary as this job will run very frequently and we dont need its every minute history to balloon our logs

The view (dbo.vwActiveRunningJob) mentioned in the above script is as follows:

CREATE VIEW [dbo].[vwActiveRunningJob]
AS
SELECT
ja.job_id,
j.name AS job_name,
ja.start_execution_date,      
ja.last_executed_step_date,
ISNULL(last_executed_step_id,0)+1 AS current_executed_step_id,
js.step_name,
DATEDIFF(MINUTE, ja.start_execution_date, GETDATE()) AS RunDurationMinutes,
DATEDIFF(SECOND, ja.start_execution_date, GETDATE()) AS RunDurationSeconds,

DATEDIFF(MINUTE, ISNULL(ja.last_executed_step_date,ja.start_execution_date), GETDATE()) AS CurrentStepRunDurationMinutes,
DATEDIFF(SECOND, ISNULL(ja.last_executed_step_date,ja.start_execution_date), GETDATE()) AS CurrentStepRunDurationSeconds

FROM msdb.dbo.sysjobactivity ja 
LEFT JOIN msdb.dbo.sysjobhistory jh ON ja.job_history_id = jh.instance_id
JOIN msdb.dbo.sysjobs j ON ja.job_id = j.job_id
JOIN msdb.dbo.sysjobsteps js ON ja.job_id = js.job_id AND ISNULL(ja.last_executed_step_id,0)+1 = js.step_id
WHERE ja.session_id = (SELECT TOP 1 session_id FROM msdb.dbo.syssessions   ORDER BY agent_start_date DESC)
AND start_execution_date IS NOT NULL
AND stop_execution_date IS NULL
GO

And the meta data table which governs the duration’s allowed for job/steps is as follows:

 SELECT job_id,
        job_name,
        step_number,
        step_name,
        max_duration_sec,
        duration_check
 FROM dbo.JobManager;

I hope the above helps to control your environment runaway task. And other application where this technique can be used in processing the cube partitions, where the processing is hung due to other jobs processing the cube partitions.

Leave a Reply

Your email address will not be published. Required fields are marked *

*