A PowerShell Monitor for ‘Leaky Processes’ (Part I – Script Design)

A request came in this week for me to watch all the processes on a box and alert when the handle count of any of them exceeded 7,500.  My first question was: what is the customer really trying to monitor?  The answer is leaking handles.  The 7,500 handle count was an arbitrary number that the customer had come up with. What they really wanted was the ability to say, "Process X has a high number of handles on my server and the count is steadily increasing, which may lead to my server or application crashing."

The next logical question is how to do this in SCOM.  There are a few different ways you could go about it.  The most straightforward would be to create a discovery for all of the processes that run on a box.  We could then create a simple performance monitor for each instance of that class that says: when this process's handle count is above threshold X, go to a warning state; when it exceeds Y, go to a critical state and open a ticket.  There are a number of problems with doing this.

    1. A monitor that fires off when a process crosses a threshold is not really detecting a leak.

      • What I am trying to detect is when a process is continually grabbing more handles.

    2. Discovering all processes running on all servers is generally a bad thing.

      • This is because the instances of this class on each server would be constantly changing. This leads to configuration churn. See Kevin Holman’s blog post for more information on configuration churn.

So what now?  Create a script-based monitor.  Script-based monitors are the most expensive type of monitor you can create (from the perspective of the agent), so you need to ensure that your script is efficient.  A great post on designing MPs to be efficient and scalable is from Kristopher Bash and is available here.  Kristopher points out that the most efficient code execution can be achieved through .NET managed code.  The problem with this is that you now have to distribute your compiled code to every server that will be using the monitor, which is not very supportable.  The next best thing for SCOM R2 is a PowerShell module.  With the addition of a native PowerShell module in R2, the overhead associated with spawning separate processes to execute our scripts is removed (the script now runs under the OpsMgr PowerShell Host).  This means that if you have the choice of PowerShell or WSH, use PowerShell (which in turn means that the agent the code will ultimately execute on must have PowerShell installed). The other key programming concept to keep in mind when designing scripts for a monitor is scalability. You want to design the script so that it has O(n) complexity, which is to say that while you loop through all the instances (in this example, all the processes), you do not want to loop through that instance space again for each instance.
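As a rough illustration of the O(n) point, the sketch below filters a process list in a single pass and uses a hashtable for the ignore lookup instead of a nested loop. The function name `Select-HotProcess`, its parameters, and the sample data are hypothetical and only stand in for `Get-Process` output; this is one way to structure the pass, not the script used later in this post.

```powershell
# Hypothetical helper: the name, parameters and sample data are illustrative only.
function Select-HotProcess {
    param(
        [object[]]$Processes,      # objects with ProcessName, Id and Handles properties
        [int]$HandleThreshold,
        [string[]]$IgnoredNames
    )

    # Build the ignore lookup once; hashtable key lookups avoid rescanning a list.
    $ignored = @{}
    foreach ($name in $IgnoredNames) {
        if ($name) { $ignored[$name.Trim()] = $true }
    }

    # Single pass over the process list: no second loop over the same instances.
    $hot = @{}
    foreach ($p in $Processes) {
        if ($p.Handles -gt $HandleThreshold -and -not $ignored.ContainsKey($p.ProcessName)) {
            $hot["$($p.ProcessName)-$($p.Id)"] = $p.Handles
        }
    }
    return $hot
}

# Made-up data in place of Get-Process output.
$sample = @(
    (New-Object PSObject -Property @{ ProcessName = 'lsass';   Id = 500;  Handles = 9000 }),
    (New-Object PSObject -Property @{ ProcessName = 'myapp';   Id = 1234; Handles = 8200 }),
    (New-Object PSObject -Property @{ ProcessName = 'notepad'; Id = 42;   Handles = 120 })
)
$hot = Select-HotProcess -Processes $sample -HandleThreshold 7500 -IgnoredNames 'lsass','system','svchost'
$hot.Keys   # myapp-1234
```

Only `myapp` survives here: `lsass` is over the threshold but ignored, and `notepad` is under the threshold.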

So how do you ensure that this workflow only runs on servers with PowerShell?  In our environment we have created a class for this.  We use this class to discover additional attributes that we then use for scoping.  To maximize its flexibility we have made it an extension of Windows Computer.  This allows us to build groups of this class based on our newly discovered attributes.  The PowerShell-based monitors are then disabled by default and enabled with an override for all objects in the PowerShell-enabled group.  Kevin Holman has done a nice write-up on the creation of the extension class and the group here.

The next step is to actually create the PowerShell script that will be run and test it to ensure it does what we want.  The script I created monitors all processes above a certain handle threshold.  It tracks the handle counts of these processes and alerts if they are increasing (leaking) by an average of at least LeakAmount per run over the last NumSamples runs.  The script also takes a parameter called ignoredProcessList, a comma-separated list of processes to ignore on a box.

  1. HandleThreshold

    • This is the threshold that a process must exceed in order to be investigated and tracked.

    • Set to 0 to track all processes for a leak

    • We set this to 7500 handles to limit the monitor's scope each time it runs

  2. NumSamples

    • This is the number of times the leaking condition (above HandleThreshold and averaging at least LeakAmount) must be true before a state change is triggered

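    • Set to 1, together with a LeakAmount of 0, to alert whenever a process stays above the HandleThreshold (this turns the monitor into a simple high handle count detection monitor; with the script as written, a NumSamples of 1 always computes an increase of 0, so a LeakAmount above 0 would never change state)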

    • We set this to 4 samples and fire the monitor off every 15 minutes

  3. LeakAmount

    • This is the average number of handles that the process must be increasing by to change state

    • Set to 0 to always change state if a process is over the HandleThreshold for NumSamples

    • We set this to 25, which detects processes that are leaking at a rate of 100 handles or more per hour (the monitor runs every 15 minutes and NumSamples is set to 4)

  4. ignoredProcessList

    • This is the comma-separated list of processes to ignore

    • Set this to an empty string to monitor all processes

    • We set this to lsass,system,svchost in our environment as our default setting. This can be overridden for different groups of servers or individual servers as needed
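To make the interaction between NumSamples and LeakAmount concrete, here is a minimal sketch of the averaging in isolation. The function name `Test-HandleLeak` and the sample values are hypothetical. Note one simplification: this sketch evaluates as soon as NumSamples readings exist, whereas the script in this post waits until a process has more than NumSamples readings on file.

```powershell
# Hypothetical helper that mirrors the averaging described above; names are illustrative.
function Test-HandleLeak {
    param(
        [int[]]$Samples,      # handle counts, oldest first
        [int]$NumSamples,
        [int]$LeakAmount
    )

    # Not enough history yet: report no leak.
    if ($Samples.Count -lt $NumSamples) { return $false }

    # With a single sample there is no slope, so the increase is treated as 0.
    if ($NumSamples -le 1) { return (0 -ge $LeakAmount) }

    # Average increase per run across the newest $NumSamples readings.
    $window  = @($Samples | Select-Object -Last $NumSamples)
    $average = ($window[-1] - $window[0]) / ($NumSamples - 1)
    return ($average -ge $LeakAmount)
}

Test-HandleLeak -Samples 7600,7630,7655,7690 -NumSamples 4 -LeakAmount 25   # True:  (7690 - 7600) / 3 = 30
Test-HandleLeak -Samples 7600,7610,7605,7620 -NumSamples 4 -LeakAmount 25   # False: (7620 - 7600) / 3 is about 6.7
```

Because only the first and last readings in the window are compared, a process whose handle count dips in the middle but ends higher is still flagged, which suits slow leaks with noisy readings.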

Param ([int]$HandleThreshold, [int]$NumSamples, [int]$LeakAmount, [String]$ignoredProcessList)

###############################################################################
#Setup Variables
###############################################################################
$TempPath="c:\HealthServiceTemp\HandleCountMonitoring"
$ProcessList = Get-Process | ? {$_.Handles -gt $HandleThreshold}
$ProcessHashTable = @{}
$HighHandleCountProcesses = "ProcessName-PID-HandleCount"
$retValue = "Good"
$IgnoredProcessArray = New-Object System.Collections.ArrayList

###############################################################################
#Parse Input String for Ignored Processes and load it into an ArrayList
###############################################################################
foreach ($SubString in $ignoredProcessList.Split(','))
{
    #[void] keeps the index returned by Add() out of the script output
    [void]$IgnoredProcessArray.Add($SubString.Trim())
}

###############################################################################
#Parse output of Get-Process and load results into a HashTable
###############################################################################
if($processList)
{
    foreach ($process in $processList)
    {
        If(-not ($IgnoredProcessArray.Contains($process.ProcessName)))
        {
            [String] $FileName = $process.ProcessName + "-" + $process.Id
            $ProcessHashTable.Add($FileName, $process.Handles)
        }
    }
}

###############################################################################
#Test for existence of Folder Structure, create if it does not exist
###############################################################################
if(-not (Test-Path $TempPath))
{
    #Out-Null keeps the new directory object out of the script output
    New-Item $TempPath -type Directory | Out-Null
}

###############################################################################
#Change Directory to Folder Structure
###############################################################################
Set-Location $TempPath

###############################################################################
#Evaluate all files in folder structure.  Remove files that are not associated
#With a currently ‘hot’ (Above $HandleThreshold) ProcessName-PID pair
###############################################################################
$Files = Get-ChildItem $TempPath | ? {$_.Attributes -ne "Directory"}
if($Files)
{
    Foreach ($File in $Files)
    {
        if(-not($ProcessHashTable.ContainsKey($File.Name)))
        {
            Remove-Item $File.FullName
        }
    }
}

###############################################################################
#If there are ‘hot’ processes either create a file for them or add to the
#existing file.  Each line in the file will be the handle count at that check
#Time.  Each file will have a size limit of NumSamples * 100
###############################################################################
if($ProcessHashTable.count -gt 0)
{
    Foreach ($ProcessKey in $ProcessHashTable.Keys)
    {
        $FileName = $TempPath + "\" + $ProcessKey
        if(-not (Test-Path $FileName))
        {
            #Out-Null keeps the new file object out of the script output
            New-Item $FileName -type file | Out-Null
            Add-Content $FileName $ProcessHashTable[$ProcessKey]
        }
        else
        {
            $count = (Get-Content $FileName | Measure-Object).count
            $HandleCount = $ProcessHashTable[$ProcessKey]
            if($count -gt $NumSamples * 100)
            {
                #Drop the oldest sample and append the newest.  Read from $FileName
                #($File is not set by this loop)
                $Content = @(Get-Content $FileName | Select-Object -Skip 1) + $HandleCount
                Set-Content $FileName $Content
            }
            else
            {
                Add-Content $FileName $HandleCount
            }
        }
    }
}

###############################################################################
#Check Folder Structure for files with more than $NumSamples lines in them.
#Check to see if the average change per run across the last $NumSamples entries
#is at least $LeakAmount. Add any matches to the $HighHandleCountProcesses
#string and set the return value to "Bad".  If the Process is not leaking by
#$LeakAmount remove File
###############################################################################
$Files = Get-ChildItem $TempPath | ? {$_.Attributes -ne "Directory"}
if($Files)
{
    Foreach ($File in $Files)
    {
        #@() makes sure a one-line file is still treated as an array
        $Content = @(Get-Content $File.FullName)
        if($Content.Count -gt $NumSamples)
        {
            $Difference = 0
            if($NumSamples -gt 1)
            {
                $Values = @($Content | Select-Object -Last $NumSamples)
                $Difference = ([int]$Values[$NumSamples-1] - [int]$Values[0]) / ($NumSamples-1)
            }
            if($Difference -ge $LeakAmount)
            {
                $retValue = "Bad"
                $HandleCount = $ProcessHashTable[$File.Name]
                $HighHandleCountProcesses += "`n$($File.Name)-$HandleCount"
            }
            else
            {
                Remove-Item $File.FullName
            }
        }
    }
}

###############################################################################
#Return Values
###############################################################################
$api = New-Object -comObject 'Mom.ScriptAPI'
$bag = $api.CreatePropertyBag()
$bag.AddValue("retValue",$retValue)
$bag.AddValue("message",$HighHandleCountProcesses)   
$bag

###############################################################################
#Destroy objects
###############################################################################
#-ErrorAction SilentlyContinue covers variables that were never set on this run
#(for example HandleCount when no process was above the threshold)
Remove-Variable bag, api, retValue, HighHandleCountProcesses, HandleCount, count,
    NumSamples, File, Files, TempPath, ProcessHashTable, filename, ProcessKey,
    HandleThreshold, ProcessList, process, LeakAmount, ignoredProcessList,
    ignoredProcessArray, Content, Values, Difference -ErrorAction SilentlyContinue
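The per-process sample file that the script maintains is essentially a rolling window of handle counts. The sketch below shows that idea on its own, assuming a hypothetical helper named `Add-HandleSample` and a throwaway file in the user's temp directory; it is an illustration of the append-and-trim step, not a drop-in replacement for the section of the script above.

```powershell
# Hypothetical helper: the function name and demo file are illustrative only.
function Add-HandleSample {
    param(
        [string]$Path,
        [int]$HandleCount,
        [int]$MaxLines
    )

    # Read existing samples (if any) as an array, even when the file has one line.
    $lines = @()
    if (Test-Path -LiteralPath $Path) {
        $lines = @(Get-Content -LiteralPath $Path)
    }

    # Append the new sample, then keep only the newest $MaxLines entries.
    $lines += [string]$HandleCount
    if ($lines.Count -gt $MaxLines) {
        $lines = @($lines | Select-Object -Last $MaxLines)
    }
    Set-Content -LiteralPath $Path -Value $lines
}

# Demo against a scratch file: five samples go in, the newest four remain.
$demo = Join-Path ([System.IO.Path]::GetTempPath()) 'handle-sample-demo.txt'
if (Test-Path -LiteralPath $demo) { Remove-Item -LiteralPath $demo }
7600, 7630, 7655, 7690, 7720 | ForEach-Object { Add-HandleSample -Path $demo -HandleCount $_ -MaxLines 4 }
$result = @(Get-Content -LiteralPath $demo)
$result   # 7630, 7655, 7690, 7720 (one per line)
Remove-Item -LiteralPath $demo
```

Rewriting the whole file on every sample is acceptable here because each file holds at most a few hundred short lines.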


The next step is to wrap this into a custom data source and build a composite monitor from that data source.  After that we use the composite monitor in a management pack that extends Windows Server Operating System, disabled by default, and then create an override management pack to enable it for our PowerShell-enabled servers.  Stay tuned!


3 Responses to A PowerShell Monitor for ‘Leaky Processes’ (Part I – Script Design)

  1. Jonathan Almquist says:

    Interesting workflow. Looks like a lot of fun. Couple questions, though. How did you determine that it’s always better to use the powershell module over CScript? What is the reason for using the filesystem as a DS/Processing location? Does customer want to monitor all processes, or are they satisfied with only knowing when the first process changes state of the monitor?

  2. i200908 says:

    Wow, very nice. thank you!
