Traditional Simulation

Part IV: Uncertain Numbers that Depend on Other Uncertain Numbers: Statistical Dependence

In the last section we investigated models with only a single uncertain input. With more than one uncertain input the situation becomes more complicated. In particular, some of the uncertain inputs may depend on the values of other uncertain inputs. In this section we will consider two examples involving statistical dependence; one involving failure analysis, the other financial portfolios.

Failure Analysis

We will start with an engineering example, by simulating the dependability of a new industrial engine running in an uncertain environment. To keep the engine operating properly, the cooling system must dissipate heat while the oil pump maintains adequate oil pressure. This new engine is lighter and cheaper to produce than its predecessor, but on the downside there is concern that it will be less dependable under extreme operating conditions. If the oil pressure ever drops below 20 lbs/sq. in. while at the same time the operating temperature exceeds 225 degrees Fahrenheit, it is believed the engine will fail. The problem is to estimate the likelihood of failure given the wide variety of potential operating environments of the new engine. Data from the more robust predecessor engine will be used for this purpose. Temperature and oil pressure for the new engine are expected to be similar to those of the old engine, for which substantial operating data has been recorded.

Tutorial 4a - Estimating Likelihood of Failure

  1. Open Engine.xls. The Data sheet contains operational data on the predecessor engine. One hundred Temperature/Pressure pairs were recorded over a wide variety of operating environments. The Temperature data is stored in a range called Temp_Data while the associated pressure data is stored in Press_Data. The average temperature and pressure over these environments was 200 degrees and 78 lbs./sq. in. as shown.

Histograms created for each quantity using the Simulate Data Range command indicate that neither is normally distributed as shown below.

 

  1. Click on the worksheet tab "Model". Here we see that our model of the new engine will indicate failure if the temperature is greater than 225 and the pressure is less than 20. With the average temperature and pressure values in cells A2 and B2, the engine does not fail, but as you should have learned in the last section, averages are little use in this sort of analysis. Try various values of temperature and pressure. (If the indicator does not turn to "1" for sufficiently high temperatures and low pressures, use the Tools Options command to check that your spreadsheet is in Automatic Calculation mode.) This indicator formula will be used during simulation to determine the probability of engine failure.

  1. Since the distributions of temperature and pressure are not of any well-known type, but we have substantial data, this is an ideal opportunity to use the resampling technique discussed in section II. We will do this in two ways. First, for practice, we will do the simplest type of resampling, which ignores the dependence of oil pressure on temperature. Unlike using average temperature and pressure, resampling the entire range of temperatures and pressures will at least indicate some possibility of failure. However, it will give an inaccurate estimate of the chance of failure. The second, more accurate approach will take into account the relation between operating temperature and oil pressure, and lead to a much more precise result.

To perform the first experiment, place the formula =gen_resample(Temp_Data) in cell A2 and =gen_resample(Press_Data) in cell B2 as shown. Be sure to see gen_Functions.xls for demonstrations of all the random number generators.

Press the F9 key a few times to make sure the formulas are working correctly.

  1. Run a simulation of 1000 trials or more, specifying B5, the failure indicator, as the output cell. You should find an average value of the indicator of around .007. This means that failure occurred in only 0.7% of the trials.

The lesson here is that the average value of an indicator of an event is an estimate of the probability of that event occurring.

This gives us our first indication of the risk of failure, but remember it was based on the naïve assumption that temperature and pressure were unrelated.

  1. In point of fact, although both temperature and pressure are uncertain, high temperature reduces the viscosity of oil, leading to lower oil pressure. Thus temperature and pressure are statistically dependent. This becomes clearly visible in the XY scatter plot of temperature and pressure on the XY Chart tab. NOTE, if the temperature and pressure had not been sampled in pairs, this dependence would not have been preserved. This is an important consideration in collecting the data in the first place.

PUZZLE: The way we used gen_resample in step 3 above did not preserve this statistical dependence. Does this cause the resulting estimate of the likelihood of failure to be too high or too low? Why? You will learn the answer at step 9.

  1. We will now use gen_resample in a more sophisticated manner, which preserves dependence. This uses gen_resample as an array formula, which is a bit tricky, but worth the trouble. Start by placing the cursor in cell A2, then while holding down the left mouse button also select cell B2. The screen should appear as shown below.

  1. The next step is to edit the formula to include the pressure data as a second argument as shown, but don't press the Enter key yet.

  1. Now simultaneously hold down the <Shift> and <Ctrl> keys, then press <Enter> (see Microsoft Excel help on array formulas for more details). If you did everything right, curly brackets will now appear around your formula. Array formulas control multiple cells at once. When gen_resample is used as an array formula, it assures that the random sample taken from the temperature data in cell A2, is paired with its associated pressure data in B2. NOTE: It was not necessary to enter the two data ranges individually, separated by ",". It would also have been possible to enter the single combined range as an argument.

  1. Run a simulation of the failure indicator with the dependence preserved, and you should find a much higher probability of failure of around 5.5%. That's about eight times higher than the previous result.

This tutorial has demonstrated:

Simulating Investment Portfolios

Perhaps the area in which the modeling of statistical dependence has been most institutionalized is that of investment portfolios. Because the distributions of many investment instruments are either normal or log normal, an elegant alternative exists to resampling. Instead, the mean returns and covariance matrix may be calculated from past data. For an algebraic definition of covariance see any text on statistics. For an intuitive geometric definition of covariance, see chapter three of INSIGHT.xla.

Consider three potential investments with expected annual returns and covariance matrix shown below in file Portfolio.xls.

Note that all three potential investments have the same expected return and variance. However, A and B are negatively correlated, A and C are positively correlated while B and C are uncorrelated. In the following tutorial we will simulate three different portfolios comprised of the following investments: 50% of the portfolio in A and 50% in B (we will call this "A+B"), and the other two 50/50 combinations, "A+C" and "B+C". This will demonstrate the well-known correlation effect at the heart of modern portfolio theory (see Reference [2]).

Tutorial 4b

  1. Open the file Portfolio.xls. Use the Simulation, Thaw command to make the formulas live. The first step is to simulate the three correlated investments. This will be done with the gen_MVNormal function as follows. Begin by selecting cell C9:E9 as shown.

  1. Next use the function wizard to insert gen_MVNormal. The arguments are B4:B6 for the mean, and C4:E6 for the covariance matrix. Once the arguments are selected, DO NOT CLICK OK. Instead simultaneously hold down <Shift> and <Ctrl>, then press <Enter>.

  1. Curly brackets should appear around the formula as shown. Press the <F9> key a few times to make sure everything works. Be sure to see gen_Functions.xls for demonstrations of all the random number generators

  1. Next we will model the three portfolios described earlier. The return of the A+B portfolio for example is (.5*C9+.5*B9) as shown below.

PUZZLE: Before proceeding to the next step, what do you think the expected return will be for each portfolio? How would you expect the shapes of the three distributions to differ?

  1. Run a simulation with one output cell for each of the three portfolios. The notice that the expected returns of the three portfolios are nearly identical at roughly 10%. The 5th and 10th percentiles display the value at risk at these levels. For example there is a 5% chance that A+B will return 2.75% or less, A+C will lose 4.6% or more, and B+C will lose 1.7% or more.

Next, create a common histogram of each of the portfolios using Smoothed 2D Lines. This clearly shows the effects of the correlation between the original investments. With A+B, the correlation is negative, with changes in one likely to be cancelled out by opposite changes in the other. This has the effect of narrowing the distribution of the return, and is desirable if you are risk averse. This was the central point of the pioneering work of Harry Markowitz in the early 1950's (see Reference [2]). With the A+C portfolio, the correlation is positive, with changes in one likely to be amplified by similar changes in the other, hence the wide distribution. This is the least desirable for the risk averse investor. There is no correlation between B+ C, resulting in a distribution that is between the other two.

It is also instructive to use the interactive common histogram. Use <Ctrl><PgUp> and <Ctrl><PgDn> to cycle through the histograms.

 

Summary

In this section we introduced the concept of an indicator variable whose average estimated the probability that an event would occur.

We used resampling to simulate a distribution based on historical data and demonstrated the importance of capturing statistical dependence if present.

Finally we showed how to generate Multivariate Normal random variables using their means and covariance matrix.