This article covers the various performance metrics displayed by the Performance Analysis Dashboard and Performance Analysis Overview, and how to interpret different metric values and combinations of values across different metrics for SQL Sentry.
Note: For Mode: S = Sample and H = History.
SQL Server Metrics
|SQL Server Activity||User Connections||The total number of connections established to the SQL Server. Details for each connection can be viewed by querying the sys.dm_exec_connections DMV.|
|SQL Server Activity||Blocked Processes||The total number of blocked processes detected on the SQL Server. The Dashboard is designed to give you information about the number of blocked processes. The Blocking SQL tab should be used for in depth troubleshooting of any blocking issues.|
|SQL Server Activity||Batches||The total number of select, insert, or delete statements per second, including those inside a stored procedure. The name is somewhat misleading since it doesn't represent batches (groups of multiple statements) in the traditional sense. It's one of the best measures of overall activity on a SQL Server.|
Over 1000 Mb per second is generally considered moderate to high activity. A 100Mb network can reach saturation at around 3000 Mb per second.
If high, use QuickTrace™ to determine which applications, hosts, etc., are responsible.
|SQL Server Activity||Transactions||The total number of transactions per second across all databases on the server. A transaction can be either a user-defined statement block surrounded by a BEGIN TRAN and END TRAN, or an individual DML statement (insert, update or delete).|
Compare with batches per second. On systems with high DML you typically want to see a low ratio of transactions to batches. A low ratio indicates that the individual statements are being bundled together, and can result in dramatically higher throughput and reduced IO due to log flushes.
Transactions per second can be higher on SQL Server 2005 and 2008 due to increased system process activity.
If high, use QuickTrace to determine which applications, hosts, etc., are responsible.
|SQL Server Activity||Compiles||The total number of initial compiles and recompiles per second. The value should generally be < 10 percent of batches per second. Higher values indicate plan reuse is low, and will generally correlate with high CPU, since plan compilation is a CPU intensive operation. It may also correlate with low cache hit ratios for object and/or SQL plans (see "Plan cache hit ratios" below).|
It can also be a strong indicator of memory pressure, since there may not be enough room to keep all plans in cache.
If you see consistently high compiles, run QuickTrace and sort results by Cache misses/sec, then expand details to view the actual queries or procedure calls that are causing the misses, which will be highlighted.
|SQL Server Activity||Recompiles||The number of recompiles per second. The value should generally be < 10 percent of initial compiles per second. Also see Compiles above.|
|SQL Server Activity||Key lookups||The number of times per second that the query processor had to perform a key lookup, across all queries. Lookups occur when the index being used is non-covering, meaning it doesn't include all of the columns required by the query. For each row returned by the index operation, the query processor has to go back to either the clustered index to perform a key lookup, or the base table to perform a RID lookup in the case of a heap. Lookups are a high overhead operation, especially when large number of rows are involved, because each lookup incurs a random I/O and additional processing. This often correlates with higher CPU usage and page reads. Lookups can be eliminated by using a covering index, adjusting joins to reduce the set so the lookup isn't needed, or using multiple indexes (intersection).|
|SQL Server Activity||Forwarded records||The number of times per second that the query processor had to lookup forwarded records, across all queries. Forwarded records occur in tables with no clustered index (heaps) when rows become too large to fit on the page and have to be relocated. Over time, this can cause severe fragmentation and queries to incur much higher than normal I/Os, specifically random reads. This can correlate with high SQL Server page reads, and high SQL Server disk wait time, data file and physical disk latency if the disk system isn't keeping up with the additional reads. On many systems it's not unusual for this counter to stay at zero if all tables have a clustered index, any heaps aren't fragmented, or they just aren't accessed frequently.|
|SQL Server Activity||Backup MB/sec||The data rate in MB/sec for any backup operations taking place on the server.|
|SQL Server Activity||Send Queue / Redo Queue|
The Send Queue and Redo Queue are applicable for servers involved in a database Mirroring Session.
The Send Queue represents unsent the log that has accumulated on the log disk of the principal database. The Redo Queue is the log waiting on the mirror's disk. See also: Database Mirroring Sessions
|SQL Server Waits||CPU percent of Total Waits||CPU percentage of Total Waits represents the percentage of all waits which are signal waits. Signal wait time is the time a thread has spent waiting on the CPU after being signaled that its resource is available. A high CPU percentage of Total Waits percentage may indicate CPU pressure.|
|SQL Server Waits||Wait Time: by Category||The average wait time (in milliseconds) per second for all processes on the server, broken down by major physical resource category (Network, CPU, Memory, Disk). Although there are hundreds of wait types, only the wait types that can definitively be attributed to one of the physical resource categories are included in the calculations for this chart. The Other category is for a few other important wait types that can either affect performance in more than one major category, or cannot be directly attributed to any category with absolute certainty, such as backups and parallelism respectively.|
Waits by category is one of the most important charts on the dashboard, because it provides an instant profile of the SQL Server and where it's spending the most time waiting for physical resources. If waits by category are consistently low, then what the other dashboard charts are showing is less important. For example, if CPU and batches look unusually high, but CPU waits are low, then the server hardware is effectively handling the load.
Total waits of less than 200ms is very good. Between 200ms and 1000ms is average. Greater than 1000ms likely requires some attention to determine where the bottleneck lies. Over 5000ms may indicate severe bottlenecking.
The total wait time may be higher by virtue of a large number of processes (spids) active on the server, because wait time is summed across all processes, it isn't a per process average. This can be especially applicable to the Other category, because several processes experiencing parallelism at the same time can cause it to spike to very high levels.
To view the specific wait classes involved for a particular interval, choose the interval by selecting on the chart, and then select the Sample button on the toolbar.
If you hover over a category, the detailed wait types for that time sample are shown in a tooltip. You can also see the specific wait types for a range by running the SQL Server Wait Stats Analysis report from the Reports >Performance Analysis menu.
|SQL Server Waits||Wait Time: by Class||The average wait time (in milliseconds) per second for all processes on the server, broken down by wait class. There are approximately 30 different wait classes, and each represents a particular SQL Server functional area or type of activity. Meaningless or innocuous wait types such as timer and queue waits are pre-filtered from view. Each class is further broken down into resource and CPU waits.|
If you hover over a class column, the detailed wait types are shown in a tooltip.
|SQL Server Memory||Buffer cache size||The current size of the buffer cache, in MB. You want this to be as large as possible for maximum performance, and on a dedicated SQL Server it should consume most of the SQL Server memory and physical memory.|
|SQL Server Memory||Plan cache size (SQL)||The current size of the cache used for query plans, in MB. This includes adhoc, autoparameterized, and prepared plans. A high value in proportion to the buffer cache may indicate query plans aren't being effectively reused.|
|SQL Server Memory||Plan cache size (Objects)||The current size of the cache used for object plans, in MB. This includes stored procedures, functions, and triggers. A high value in proportion to the buffer cache may indicate query plans aren't being effectively reused.|
|SQL Server Memory||Other cache size||The current size of the cache used for all other plans, in MB. This includes bound trees, extended stored procedures, temporary tables, and table variables. This cache size should be low in proportion to the other plan caches. If it goes over roughly 10 percent of the object or SQL plan size, further investigation may be needed.|
|SQL Server Memory||Plan cache hit ratio (SQL)||The ratio of hits to lookups for the query plan cache. This value should stay above 90 percent.|
|SQL Server Memory||Plan cache hit ratio (Objects)||The ratio of hits to lookups for the object plan cache. This value should stay above 90 percent.|
|SQL Server Memory||In-Mem OLTP|
The current amount of memory in MB dedicated to In-Memory OLTP. This includes Memory-optimized tables, non-durable tables, and natively compiled T-SQL modules.
|SQL Server Memory||Query Grants|
The current amount of memory in MB allocated to the MEMORYCLERK_SQLQERESERVATIONS clerk.
Additional Information: Memory Grants Meditation: The mysterious SQL Server memory consumer with many names (Microsoft Tech Community article).
|SQL Server Memory||Stolen Buffer|
The amount of memory that could be used for buffer, but is consumed by another task and is not available for buffer. Stolen Buffer can be reduced by optimizing queries. Granted workspace memory can reduce the amount of free memory available and increase Stolen Buffer.
|SQL Server Memory||Columnstore||The current amount of memory in MB allocated to columnstore indexes. This includes both clustered and nonclustered indexes.|
|SQL Server Memory||Page reads||The average number of buffer data pages read from disk per second. Ideally, this value should be at or near zero most of the time. If it's above zero, it means that the data wasn't found in the buffer cache, and so it had to be retrieved from disk. If spikes in page reads correlate with high disk latency, the disk system may not be keeping up.|
Querying newly created temp tables will also show up as page reads, as well as activity from internal tempdb objects. This includes hash joins, hash aggregates, sort and query spool operations. This means that you can still see high paging from tempdb due to query activity, even though you aren't explicitly using temp tables.
When page reads and page writes correlate closely, it's a strong indicator that it's related to tempdb activity, because pages are being written to disk when the objects are created, then immediately read back in to memory for use by querying operations.
If lazy writes > zero and track closely with page reads, and page life expectancy < 600, it's a strong indicator of memory pressure, because data is being moved out of buffer to make room for new data coming in.
Lazy writes also cause page writes, but generally much less than tempdb activity. If you see high page reads, and relatively low lazy writes and page writes, it's likely memory pressure and not tempdb activity.
Working with tempdb in SQL Server 2005
|SQL Server Memory||Page writes||The average number of buffer data pages writes to disk per second.|
Page writes can be caused by checkpoints, lazy writes, and tempdb activity. To calculate the approximate amount of writes related to tempdb, for any given interval, subtract checkpoints and lazy writes from total page writes.
If high page writes correlate with high latency, the disk system may not be keeping up.
See Page reads for more details.
|SQL Server Memory||Page life expectancy||The average lifespan of a data page. If this value is less than 600, it's an indicator of memory pressure. Ideally, it should be much higher than 600 if ample memory is available. In general, the larger the buffer cache size, the higher it should be. This is the best universal indicator of memory pressure.|
|SQL Server Memory||Memory grants pending||The number of processes waiting for a query workspace memory grants. Ideally, this value should be zero at all times, but it can go above zero in cases of severe memory pressure.|
When it does, RESOURCE_SEMAPHORE waits will also be > zero, since this wait is a measure of the time that queries had to wait for memory grants. This type will be visible in the Waits chart tooltips for the Memory class and category.
|Database IO||Read latency by database file||The average time in milliseconds each physical disk read is taking for a particular database file. The top 10 database files (data and transaction log) with the highest latency for the specified date range are shown. In History mode, select the dropdown box to determine whether total database latency or a specific database file is shown on the chart.|
See the Disk IO: Read latency section for more details and performance ranges.
|Database IO||Write latency by database file||The average time in milliseconds each physical disk write is taking for a particular database file.|
See Database IO: Read latency and Disk IO: Read latency sections above for more details.
|Database IO||Checkpoint pages||The average pages per second written to disk by the checkpoint process. Checkpoints flush all dirty buffer pages for a given database to disk and are a normal part of SQL Server operations. The frequency of checkpoints and volume of checkpoint pages is dictated directly by the Recovery Interval server option. SQL Server uses checkpoints to batch writes to disk, which is generally more efficient. However, if the volume of each checkpoint is too high and you see a correlation with high disk latency, it may indicate that the disk system isn't keeping up.|
Note: SQL Server 2016 introduced Indirect Checkpoints that are defined and configured at the database level. Indirect Checkpoints are set by default for SQL Server 2016 databases.
Checkpoints may correlate with lower page life expectancy, but only because when pages are flushed from memory PLE can drop, not because there is a direct correlation with memory pressure and checkpoints. It will also correlate directly with high SQL Server page writes. If you see a correlation with high disk latency or high SQL Server disk waits, then the disk system may not be keeping up with the checkpoint volume. By lowering the recovery interval you can increase the frequency and at the same time lower the volume of each checkpoint, which may lessen the overall impact on disk waits. Every system is different in this regard, however, so testing of different recovery intervals should be performed and the impact observed.
This counter may, correlate with lazy writes, because it depends more on the activity profile of the server, even under memory pressure. If checkpoints are happening often enough to meet the specified recovery interval, there may not be much work for the lazywriter threads to do when they wake up.
If the lazywriter threads are staying continually busy, and the number of dirty pages is low enough to meet the recovery interval, the checkpoint process may never have much, if any, work to do.
On SQL Server 2008, you can use the -k startup parameter to control the max MB per second that can be written by the checkpoint process, in the event that the checkpoint process is overwhelming the disk system. This isn't a typical scenario, however, so this option should be tested carefully before use, and other possible solutions should be investigated first.
If write latency exceeds 20ms, the checkpoint process will be automatically throttle to avoid overloading the disk. This can happen even when the disk is capable of easily sustaining higher throughput, and can lead to abnormally long checkpoints, and blocking of the log writer process.
DBCC TRACEON (3502, 3504, -1); will log many additional details about the checkpoint process to the SQL Server error log, and can be very helpful for troubleshooting checkpoint related issues.
SQL Server I/O Basics, Chapter 2
SQL 2016 – It Just Runs Faster: Indirect Checkpoint Default
Changes in SQL Server 2016 Checkpoint Behavior
|Database IO||Lazy writes||The average number of writes per second by the lazy writer. The lazy writer periodically scans the buffer and evicts pages that have low use counts in order to maintain a certain number of pages on the free list. Ideally, this value should be at or near zero most of the time. When there is no memory pressure, the lazy writer will generally leave data pages in memory, even those with low use counts. However, when pressure exists, the lazy writer will continually be working to make room for new data coming into the buffer.|
An indicator of memory pressure is ongoing lazy writes > zero with page reads/writes > zero and page life expectancy< 600.
Also see Checkpoint pages above.
|Database IO||Log flushes||Log flushes occur with every DML operation, and are a normal part of SQL Server activity. It's important to note that log writes to physical disk from updates to buffer pages happen immediately upon transaction commit, whereas writes to physical disk from the changed buffer pages is delayed until the next checkpoint occurs. It's critical that the physical disk system where the transaction log resides is fast enough to keep up with activity. If not, it can slow down all DML operations occurring in the database.|
Ideally each busy transaction log should have its own dedicated disks, so that writes can happen sequentially, which will minimize latency. If log flushes are high and latency is high for a transaction log file, then the disk system is likely underpowered for the current load.
Additional Information: Understanding log buffer flushes