A while ago a request came in to me to begin alerting on the addition and removal of members for a number of Active Directory groups. After a bit of research on security events, I found that this would be relatively simple to accomplish because all of our DCs have SCOM agents installed on them. Or so I thought!
I did my research on which security events I should care about on Randy Franklin Smith’s website (http://www.ultimatewindowssecurity.com), and I highly suggest looking there for descriptions of any security event. It turns out that there were 6 events we wanted to monitor on Windows Server 2003 DCs (632 through 661 below) and 6 equivalent events on Windows Server 2008 R2 DCs (4728 through 4757 below).
- 632: Security Enabled Global Group Member Added
- 633: Security Enabled Global Group Member Removed
- 636: Security Enabled Local Group Member Added
- 637: Security Enabled Local Group Member Removed
- 660: Security Enabled Universal Group Member Added
- 661: Security Enabled Universal Group Member Removed
- 4728: A member was added to a security-enabled global group
- 4729: A member was removed from a security-enabled global group
- 4732: A member was added to a security-enabled local group
- 4733: A member was removed from a security-enabled local group
- 4756: A member was added to a security-enabled universal group
- 4757: A member was removed from a security-enabled universal group
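The two sets of IDs line up one to one: each Windows Server 2008 R2 ID in the list is the matching Windows Server 2003 ID plus 4096. A minimal Python sketch of that relationship follows; the names `LEGACY_GROUP_EVENTS` and `modern_event_id` are my own and purely illustrative.

```python
# Offset between the Server 2003 IDs and the Server 2008 R2 IDs listed above.
LEGACY_TO_MODERN_OFFSET = 4096

# Server 2003 event IDs and descriptions, copied from the list above.
LEGACY_GROUP_EVENTS = {
    632: "Security Enabled Global Group Member Added",
    633: "Security Enabled Global Group Member Removed",
    636: "Security Enabled Local Group Member Added",
    637: "Security Enabled Local Group Member Removed",
    660: "Security Enabled Universal Group Member Added",
    661: "Security Enabled Universal Group Member Removed",
}


def modern_event_id(legacy_id: int) -> int:
    """Return the Server 2008 R2 event ID for a Server 2003 event ID."""
    return legacy_id + LEGACY_TO_MODERN_OFFSET


# Derive the 2008 R2 IDs from the 2003 IDs.
MODERN_IDS = sorted(modern_event_id(i) for i in LEGACY_GROUP_EVENTS)
print(MODERN_IDS)  # [4728, 4729, 4732, 4733, 4756, 4757]
```

This is handy when you have to write the same rule twice, once per operating system version.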
As these events are all Microsoft events I assumed they were all parameterized, and after a quick check with Log Parser I found that they indeed were. Parameter 3 holds the name of the target account that was modified, which for these events is the group. At this point I created a number of straightforward event-based rules to alert us on changes to important security groups [Builtin Admins, Domain Admins, Schema Admins, Enterprise Admins etc], set up a subscription to the alerts to notify the correct people, and we were off to the races.
Everything is Great
The monitoring worked. We got alerted when we added or removed people from these groups almost immediately. Other groups approached us and asked us if we could provide the same types of monitoring for them. Life was easy and SCOM was doing exactly what it was designed for.
As a SCOM administrator / author, the most common type of rule I make is an event-based rule. I instruct my application owners to create events in the local event logs, then I pick them up and alert on them. As this was a pretty common practice for me I did not see a reason to test these rules in a development environment. In hindsight, I should have.
It was soon noticed that CPU usage by the LSASS process on our three primary Domain Controllers had jumped unexpectedly. We utilize these three Domain Controllers pretty heavily, so at first it was thought that some newly developed application was hammering them too frequently. After a bit of investigation, though, it turned out that SCOM was causing LSASS usage to rise by over 30% of the total processor time available on these Domain Controllers.
We quickly identified the new rules we had written as the culprits. What we couldn’t figure out was why they were causing LSASS, a Domain Controller process, to spike. After talking with Microsoft we found out that the actual security events store GUIDs, not human-readable strings (think Parameter 3, the group name), and that by filtering on the parameter we were forcing that resolution to happen, which put a load on LSASS. But hold on: it is very rare for any of the events listed above to occur in our environment, so why was this a constant increase? The Data Source we were using was a simple Event Provider DS.
We assumed this configuration meant that only if the Event ID matched 632 would it then look at the parameters and check whether Parameter 3 equaled Domain Admins. But if we sit down and look at what the configuration actually says, it is simply: pass the output of this DS on down the line if you find an event with ID 632 and Parameter 3 equal to Domain Admins. That means that for every event dropped in the event log, both its Event ID and its Parameter 3 are checked. Essentially we were causing the Parameter 3 GUID to be resolved to a friendly name for every single event, which translated into the increase in load.
So what do we do now? We need to build in the filtering logic that we originally thought was there! To do this we use the same sort of logic that you use when designing for cookdown. In the Data Source you match only on the Event ID and map the output you want, then pass that output to a condition detection that filters on Parameter 3, and only then to the alert.
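To make the difference concrete, here is a rough Python model of the two approaches. It is only a sketch of the behavior described above, not how SCOM or LSASS is implemented: `RawEvent`, `Resolver`, `DIRECTORY` and the `ref-*` identifiers are all made up for illustration, and the resolver simply counts how often the expensive name lookup is triggered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawEvent:
    event_id: int
    group_ref: str  # hypothetical unresolved identifier stored in the event


# Hypothetical lookup table standing in for the name resolution done by LSASS.
DIRECTORY = {"ref-1": "Domain Admins", "ref-2": "Help Desk"}


class Resolver:
    """Resolves identifiers to names and counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def resolve(self, ref: str) -> str:
        self.calls += 1
        return DIRECTORY.get(ref, "")


def single_stage(events, resolver, event_id, group):
    """Combined filter: Parameter 3 is resolved for every event in the log."""
    matches = []
    for ev in events:
        name = resolver.resolve(ev.group_ref)
        if ev.event_id == event_id and name == group:
            matches.append(ev)
    return matches


def two_stage(events, resolver, event_id, group):
    """Event ID filter first, then a separate filter on the resolved name."""
    candidates = [ev for ev in events if ev.event_id == event_id]
    return [ev for ev in candidates if resolver.resolve(ev.group_ref) == group]


# 998 unrelated events (528 is the Server 2003 logon event) and two group changes.
log = [RawEvent(528, "ref-2")] * 998 + [RawEvent(632, "ref-1"), RawEvent(632, "ref-2")]

combined, staged = Resolver(), Resolver()
hits_combined = single_stage(log, combined, 632, "Domain Admins")
hits_staged = two_stage(log, staged, 632, "Domain Admins")

print(len(hits_combined), len(hits_staged))  # 1 1
print(combined.calls, staged.calls)          # 1000 2
```

Both versions raise the same single alert, but the staged version only pays for resolution on the two events that already matched the Event ID.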
In SCOM this looks like
Or for the XML inclined
<Rule ID="GMI.Security.Rules.DC.2003.EnterpriseAdmins.Removed" Enabled="true" Target="MicrosoftWindowsServerAD2003Discovery!Microsoft.Windows.Server.2003.AD.DomainControllerRole" ConfirmDelivery="true" Remotable="true" Priority="Normal" DiscardLevel="100">
<DataSource ID="DS" TypeID="GMILibrary!GMI.Library.DS.FilteredEventProvider.Security">
<ConditionDetection ID="Filter" TypeID="System!System.ExpressionFilter">
<Value Type="String">Enterprise Admins</Value>
<WriteAction ID="Alert" TypeID="Health!System.Health.GenerateAlert">