Performance

Did you know that BACKUP automatically tells OpenVMS to bypass the XFC (the Extended File Cache) when it requests data? Most third-party backup applications do not. (Hint: It’s a modifier to the $QIO call requesting the data.) If you have a backup application that uses OpenVMS BACKUP (SLS and ABS are two examples), then you are automatically covered.

But why would you care? Well, when you have an application (backup or otherwise) that walks through a volume, unless the application tells OpenVMS not to cache the data, it will end up bumping everything else out of the XFC. So the next lookup by your Customer Service Rep suddenly has to go all the way to the back end of the array to get the data. As a result, such jobs can cause poor system performance while they run.

So, a large end-of-day job that wades through an entire database, a large SQL query job, a backup job that does not tell OpenVMS to avoid the XFC: all of them can lead to poor performance.
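
To make the effect concrete, here is a minimal sketch in Python. It uses a toy least-recently-used cache, which is my assumption for illustration and not a description of how XFC manages its memory. The class, the function and the block numbers are all hypothetical. It warms the cache with a small "hot" working set, runs a large sequential scan, and then checks how much of the hot set is still cached.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Toy LRU block cache. A stand-in for illustration, not XFC itself."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block, use_cache=True):
        if block in self.blocks:
            # Cache hit: mark the block as most recently used.
            self.blocks.move_to_end(block)
            self.hits += 1
            return
        self.misses += 1
        if not use_cache:
            # Bypass: the data comes from disk but is not kept in the cache.
            return
        self.blocks[block] = True
        if len(self.blocks) > self.capacity:
            # Evict the least recently used block.
            self.blocks.popitem(last=False)


def hot_hit_rate_after_scan(bypass):
    """Hit rate of the hot working set after a big sequential scan."""
    cache = ToyLRUCache(capacity=100)
    hot_blocks = range(100)
    for block in hot_blocks:            # warm the cache
        cache.read(block)
    for block in range(1000, 6000):     # the "backup" walks the volume
        cache.read(block, use_cache=not bypass)
    cache.hits = cache.misses = 0       # measure only what follows
    for block in hot_blocks:            # the interactive users come back
        cache.read(block)
    return cache.hits / (cache.hits + cache.misses)


print("scan goes through the cache:", hot_hit_rate_after_scan(bypass=False))
print("scan bypasses the cache:    ", hot_hit_rate_after_scan(bypass=True))
```

In this toy model the scan that goes through the cache leaves none of the hot blocks behind, while the scan that bypasses the cache leaves all of them in place. Real caches are more sophisticated, so treat the numbers as the extreme case.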

clipped from h71000.www7.hp.com

HP OpenVMS Systems

HP OpenVMS version 8.4
for Integrity server systems and AlphaServer—New features and benefits

Performance and Scaling Enhancements

Dynamic Enabling/Disabling of XFC Cache for Mounted Volumes

  • New features in XFC to dynamically enable/disable cache for mounted
    volumes

  • Users can dynamically disable caching on a volume and then perform huge
    backup, copy and search operations. Once this is complete caching can be
    enabled on that volume.

You may have heard of T4 & Friends. The core of T4 & Friends is the T4 toolkit that you can install on your OpenVMS systems. This kit allows you to track performance of your OpenVMS system and various subsystems, such as storage.

You can read more about T4 & Friends in the VMS Technical Journal:

  • V3 – TimeLine-Driven Collaboration with T4 & Friends: A Time-saving Approach to OpenVMS Performance [ HTML, PDF ]
  • V4 – Adding a Friend to T4 & Friends: Incorporating BEA WebLogic Server 8.1 Performance Data [ HTML, PDF ]
  • V10 – Taking T4 to the Next Level [ HTML, PDF ]
  • V11 – RMS Collector for T4 and Friends [ HTML, PDF ]

If you want to download T4 & Friends, you can do so by entering “OpenVMS T4 download site:hp.com” into your search engine. It should return as the top item the following page:

http://h71000.www7.hp.com/openvms/products/t4/index.html

Edit: Link and Search updated (Thanks to Ian Miller for catching my use of the old link)

Included in the download page, you will find:

  • a PCSI kit for OpenVMS versions V7.3-2 and up [ Alpha and Integrity (V8.3 and up) ]
  • a PCSI kit for OpenVMS versions V7.3-1 [ Alpha ]
  • a Windows Install kit to install TLViz V1.6-09 on Windows systems
  • a Windows Image to update the TLViz image to V1.6-14 where TLViz is previously installed
  • a ZIP file to install CSVPNG V1.0-156 on Windows systems

To help with the installation, the T4 PCSI kits include a “How to install T4” document. The TLViz and CSVPNG kits also include basic documents on how to install and use these analysis tools. For more information on TLViz and CSVPNG, Steve Lieman provides some useful information at http://trendsthatmatter.com/. (Edit: Again, thanks to Ian Miller for adding this pointer in his comment).

Also included is a tool called VEVAMON (the VMS Eva MONitor). As customers moved their OpenVMS systems to use EVA storage arrays, OpenVMS engineering created this tool to better analyze EVA storage performance data. However, this tool only works with a very old version of the EVA controller firmware. The current, recommended and supported method to collect EVA storage array performance data is to use EVAperf on the Storage Management Appliance (SMA) and then convert the EVAperf CSV file to TLViz formatted files. This conversion is done with the EVAperf-to-TLViz converter program that is located in the same directory as the EVAperf software on the SMA.

Finally, for more information, you can read the Frequently Asked Questions. If you don’t find answers there, you can use the online form to email your questions to HP’s T4 team.

Below you will find a screencast of this information.

(to be inserted)

Now that we know how to find where and how we can improve performance, it is also essential to understand that the response time through a service center is directly related to the utilization of that service center. But that direct relationship is NOT a straight line correlation. Instead, what we see is a curve:

Graph: Response Time to Utilization Curve

Let’s use an example that most of us understand from our daily lives. We expect that as we drive on a road, we will be able to get from one place to another in about the same amount of time every time we use that road. We realize that when there are more cars on the road the time will gradually increase. We tend to expect a “straight line correlation” between the utilization of the road and the time it takes to travel the road.

Graph: Response Time to Utilization Assumption

However, what in fact happens is that we encounter the dreaded “rush hour”. But what is “rush hour”? This is when the utilization of the road (that is, the number of cars on the road) increases beyond the capacity of the road to handle traffic without significant slowdowns. This point where the transportation system is no longer able to handle the load without significant slowdowns is known as the “knee of the curve”.

Graph: Knee of the Curve

After this “knee of the curve” we find that as the workload increases, the slope of the line changes. Before this point, we experience a minimal increase in the response time when we significantly increase the utilization. After the knee of the curve the correlation flips. Instead, we see a significant increase in the response time for a minimal increase in the utilization. The response time gets worse and worse. We find ourselves sitting in traffic wondering how long it will take to get off the road.

Graph: Slope of the "line" shows degraded performance
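
If you want to play with the shape of this curve yourself, here is a minimal sketch in Python. It assumes the simplest textbook queueing model (a single service center with random arrivals, often written M/M/1), where response time equals service time divided by (1 minus utilization). That model and the 4 millisecond service time are my assumptions for illustration. Real storage arrays have many service centers and caches, so their knee can show up earlier and more sharply than this model suggests.

```python
def response_time_ms(service_time_ms, utilization):
    """Response time of one service center under the simple M/M/1 model.

    service_time_ms: time to service one request with no queueing.
    utilization: fraction of time the service center is busy, 0 <= u < 1.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be at least 0 and below 1")
    return service_time_ms / (1.0 - utilization)


# Assumed 4 ms service time; watch the response time as the load climbs.
for u in (0.10, 0.50, 0.70, 0.90, 0.95, 0.99):
    print(f"utilization {u:4.0%}: {response_time_ms(4.0, u):7.1f} ms")
```

Even in this simple model, going from 10% to 50% utilization only doubles the response time to 8 ms, while going from 90% to 99% takes it from about 40 ms to about 400 ms. Small increases in utilization past the knee cost far more than large increases before it.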

Unfortunately, we are not dealing with just one service center. In fact, the saturation of the service center tends to cascade and magnify the impact. That’s what we see on a transportation system. If we only had one entry and exit point on the transportation system, we would see some slowdown, but when we have many exit and entry points, the impact is magnified.

Graph: Magnification of the impact with multiple service centers

The highway system is an apt analogy for a storage subsystem. As we saturate various service centers throughout the storage subsystem, the response time of the overall system dramatically increases. For example, if we start with a storage subsystem that provides a response time for a random read of about 4 milliseconds, then often as we approach the 70% “knee of the curve” the response time will approach 40 milliseconds. That is a 10x increase in the response time.  [ LOTS of factors impact the actual numbers. I use this as an example of performance issues I encounter on various storage arrays and different operating systems. This "symptom" is vendor and operating system neutral. ]  However, as the utilization increases we can see the response time climb well beyond the 40 second mark. That’s an increase of 10,000 times over the original 4 millisecond range. And as you can imagine, the result for the user of that storage subsystem is quite undesirable.

This is why I focus so heavily on the utilization of various service centers throughout a storage subsystem. This is why I promote the use of T4 and the T4 analysis tools, such as TLViz and CSVPNG.

Next up: Measuring performance with T4.

It would be wonderful if we could purchase whatever we want. I would love to have a Tier 1 storage array next to my computer. But somehow the cost of the purchase and running it just does not make sense for my own small environment.

Equation: Amdahl's Law

So, how do we decide where to focus our efforts and funds? Here is where we can use Amdahl’s Law to try to obtain a feeling for where we should focus. In Amdahl’s Law, the speed increase (s) can be calculated if we know the fraction of the time (f) we use the faster mode and the amount of speed increase (k) while in the faster mode.
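
For readers who cannot see the equation image, the standard form of Amdahl’s Law written with these three symbols is shown below. I am assuming here that k is expressed as a ratio, so a mode that is 50% faster has k = 1.5.

```latex
s = \frac{1}{(1 - f) + \dfrac{f}{k}}
```

When f is close to 0 the result stays close to 1 (no real gain) no matter how large k is, and only when f approaches 1 does s approach k.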

Let’s assume we have two different arrays. One array has 10K drives and the other has 15K drives. Certainly, if we upgrade the array with 10K drives to 15K drives we will get a 50% increase in performance, right? Right?

Well, not really. Here’s why. Let’s assume that our application processes data sequentially. For example, let’s assume we have two arrays and we shadow data between the two arrays. In that case, if we drop the member on the array with 10K drives and then use that member for backup operations, most of our backups will be fairly sequential in nature. We do a large block transfer (32K or 64K in size) and we tend to start at the beginning of the volume and head to the end of the volume.

Sure, fragmentation will tend to create some randomness, but as long as the volume is not TOO fragmented, the backup tends to start at the beginning and read until the end of the volume. So the reads are very sequential in nature.

How does that change things? Well, in this situation, the storage actually spends very little time doing random I/O requests. It does not matter all that much how fast the platters spin. The heads are usually ready to read the next segment. And most arrays will prestage data into the array cache. Thus, having faster drives just does not matter. So, in this case Amdahl’s Law shows a tiny speed increase (s), because we spend very little time (f) in the faster mode. Though seeking is almost 50% faster (an increase of 0.5), so little time is spent seeking that the actual improvement in performance is minuscule.

But what if the application does a huge amount of random I/O requests (such as for interactive lookups of data within a database)? In that case, the improvement in performance, while not the full 50% (0.5), is high enough to warrant the change. For example, those same two arrays, with an equal workload against them (using Host Based Volume Shadowing), will tend to see about a 20 to 25% increase against the array with 15K drives. OpenVMS automatically prefers the shadowset member with fewer outstanding I/O requests and a lower latency. The 15K based member is better able to respond quickly to that random I/O workload and return data.
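
Here is a minimal sketch in Python of the two cases above. It uses the standard form of Amdahl’s Law, s = 1 / ((1 - f) + f / k), with k = 1.5 for drives that are 50% faster. The two values of f are my own guesses chosen to illustrate the point, not measurements, and the function name is hypothetical.

```python
def amdahl_speedup(f, k):
    """Overall speedup s when a fraction f of the time runs k times faster."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if k <= 0.0:
        raise ValueError("k must be positive")
    return 1.0 / ((1.0 - f) + f / k)


K_FASTER_DRIVES = 1.5  # 15K drives assumed 50% faster than 10K while seeking

# Assumed fractions of time spent in the part that the faster drives help.
workloads = (
    ("mostly sequential backup", 0.05),
    ("random database lookups", 0.60),
)

for label, f in workloads:
    s = amdahl_speedup(f, K_FASTER_DRIVES)
    print(f"{label}: f = {f:.2f}, overall gain = {100.0 * (s - 1.0):.1f}%")
```

With these assumed fractions the sequential backup gains under 2%, while the random workload gains 25%, which is in the same range as the 20 to 25% I tend to see on shadowed arrays.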

So, we can use Amdahl’s Law when we examine service centers throughout the computing and storage subsystem to try to identify the Best “Bang for the Buck”. By examining the potential improvement we can improve the return on investment (ROI).

And that’s the magic of Amdahl’s Law for storage performance analysis. The next blog entry will examine the impact of resource utilization on the responsiveness of work through the service centers.

Let’s dive into one area where I spend a lot of time: Storage Performance Analysis. As you might imagine from my tagline, I tend to take a bit of a different twist when I review “Storage Performance”.

First, I look both at how various storage subsystems are behaving (from their perspective) as well as how the host perceives the storage array. It’s at this intersection of the two technologies where some of the most interesting observations can be made.

Second, as you deal with performance issues, it is essential to understand a little bit about statistics. Though I will spend blog entries wading through how we can use statistics to better understand the performance of the computing and storage subsystems, I highly recommend the course “Meaning from Data: Statistics Made Clear” from the Teaching Company. [The Teaching Company often has deals throughout the year which makes the cost of the course much more approachable.] This course provides an excellent overview of how to use statistics within a business setting. It explains the fundamentals in a method that is both easy to understand and to apply.

So, when we talk about performance management, what does that really mean? We tend to think of making the storage and computing environment perform as well as possible and as consistently as possible. The last thing we want is a surprise.

How do we go about defining what is normal and predictable? We need to measure the performance of the computing and storage subsystems. We cannot measure everything. The very act of measuring the environment distorts the performance of the environment. But we can and should gather a statistical sampling of the performance data. Using this data we can start to define “normal” and “abnormal” performance. We can identify uneven utilization of various “service centers”.

What is a “service center” in a computing and storage system? In 1961 John Little published a proof of a relationship that applies to businesses made up of a series of “service centers” where work is performed. It shows that it is possible to predict the rate of completion of work and/or the time needed to complete a specific item of work. So, if you know the time it takes to complete an average transaction at a bank, and you know the arrival rate of work, you can predict the time it will take for a customer to do their banking.

This understanding of the performance of a “service center” is explained with Little’s Law. It helps us understand there is a predictable behavior as we move work through the computing and storage environment.
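
Little’s Law itself is short: the average number of items in a service center equals the average arrival rate multiplied by the average time each item spends there. Here is a minimal sketch in Python. The function names and the example numbers (the bank and the storage figures) are made up for illustration.

```python
def items_in_system(arrival_rate, time_in_system):
    """Little's Law: average items present = arrival rate * average time spent."""
    return arrival_rate * time_in_system


def time_in_system(items_present, arrival_rate):
    """Little's Law rearranged: average time spent = items present / arrival rate."""
    if arrival_rate <= 0:
        raise ValueError("arrival_rate must be positive")
    return items_present / arrival_rate


# Bank: 2 customers arrive per minute and 10 are inside on average,
# so the average visit takes 5 minutes.
print("average visit (minutes):", time_in_system(10, 2))

# Storage: 2,000 I/Os per second with 8 outstanding on average,
# so the average response time is 0.004 seconds (4 ms).
print("average response (seconds):", time_in_system(8, 2000))
```

The useful part is that any two of the three quantities give you the third, and an I/O rate and an average queue depth are the kind of numbers a performance collector can sample.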

Early computer scientists realized that this same principle applies to computing and storage systems. Thus they realized that you can predict the performance of various “service centers”. This is in fact what we measure and try to characterize as we collect performance data.

My next two blog entries will explore:

From there we will explore how we gather and analyze the data. Future blog entries will explore:

  • use of “Average/Mean”
  • use of 90th and 95th percentile
  • use of standard deviation
  • use of Min/Max
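
As a small preview of why we need more than one of these numbers, here is a minimal sketch in Python. The response times are invented for illustration (18 fast I/Os and two slow ones), the helper names are hypothetical, and the percentile uses the simple nearest-rank method, which is only one of several common definitions.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct * n / 100).

    pct is an integer from 1 to 100.
    """
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def summarize(samples):
    """Collect the summary statistics listed above for one metric."""
    return {
        "mean": statistics.fmean(samples),
        "stdev": statistics.pstdev(samples),
        "min": min(samples),
        "max": max(samples),
        "p90": percentile(samples, 90),
        "p95": percentile(samples, 95),
    }


# Invented response times in milliseconds: mostly 4 ms, two slow outliers.
response_times_ms = [4] * 18 + [40, 400]

for name, value in summarize(response_times_ms).items():
    print(f"{name:>5}: {value:.1f}")
```

In this made-up sample the mean is 25.6 ms even though 90% of the requests completed in 4 ms, and only the maximum and the higher percentiles reveal the slow requests. No single number tells the whole story.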

Finally, we will start to dive into how to use T4 and the TLViz/CSVPNG analysis tools against the performance data from an OpenVMS environment.
