Traditional Simulation
Part IV: Uncertain Numbers that Depend on Other Uncertain Numbers: Statistical
Dependence
In the last section we investigated models with only a single uncertain
input. With more than one uncertain input the situation becomes more
complicated. In particular, some of the uncertain inputs may depend on the
values of other uncertain inputs. In this section we will consider two examples
involving statistical dependence; one involving failure analysis, the other
financial portfolios.
Failure Analysis
We will start with an engineering example, by simulating the dependability
of a new industrial engine running in an uncertain environment. To keep the
engine operating properly, the cooling system must dissipate heat while the oil
pump maintains adequate oil pressure. This new engine is lighter and cheaper to
produce than its predecessor, but on the downside there is concern that it will
be less dependable under extreme operating conditions. If the oil pressure ever
drops below 20 lbs/sq. in. while at the same time the operating temperature
exceeds 225 degrees Fahrenheit, it is believed the engine will fail. The
problem is to estimate the likelihood of failure given the wide variety of
potential operating environments of the new engine. Data from the more robust
predecessor engine will be used for this purpose. Temperature and oil pressure
for the new engine are expected to be similar to those of the old engine, for
which substantial operating data has been recorded.
Tutorial 4a - Estimating Likelihood of Failure
- Open
Engine.xls. The Data sheet contains operational data on the predecessor
engine. One hundred Temperature/Pressure pairs were recorded over a wide
variety of operating environments. The Temperature data is stored in a
range called Temp_Data while the associated pressure data is stored in
Press_Data. The average temperature and pressure over these environments
was 200 degrees and 78 lbs./sq. in. as shown.

Histograms created for each quantity using the Simulate Data Range command indicate that
neither is normally distributed as shown below.

- Click on the
worksheet tab "Model". Here we see that our model of the new engine will indicate
failure if the temperature is greater than 225 and the pressure is less
than 20. With the average temperature and pressure values in cells A2 and
B2, the engine does not fail, but as you should have learned in the last
section, averages are little use in this sort of analysis. Try various
values of temperature and pressure. (If the indicator does not turn to
"1" for sufficiently high temperatures and low pressures, use
the Tools Options command to check that your spreadsheet is in Automatic
Calculation mode.) This indicator formula will be used during simulation
to determine the probability of engine failure.

- Since the
distributions of temperature and pressure are not of any well-known type,
but we have substantial data, this is an ideal opportunity to use the
resampling technique discussed in section II. We will do this in two ways.
First, for practice, we will do the simplest type of resampling, which
ignores the dependence of oil pressure on temperature. Unlike using
average temperature and pressure, resampling the entire range of
temperatures and pressures will at least indicate some possibility of
failure. However, it will give an inaccurate estimate of the chance of
failure. The second, more accurate approach will take into account the
relation between operating temperature and oil pressure, and lead to a
much more precise result.
To perform the first experiment, place the formula
=gen_resample(Temp_Data) in cell A2 and =gen_resample(Press_Data) in cell B2 as
shown. Be sure to see gen_Functions.xls for demonstrations of all the random number generators.

Press the F9 key a few times to make sure the
formulas are working correctly.
- Run a
simulation of 1000 trials or more, specifying B5, the failure indicator,
as the output cell. You should find an average value of the indicator of
around .007. This means that failure occurred in only 0.7% of the trials.
The lesson here is that the average value of an
indicator of an event is an estimate of the probability of that event
occurring.
This gives us our first indication of the risk of
failure, but remember it was based on the naïve assumption that temperature and
pressure were unrelated.
- In point of
fact, although both temperature and pressure are uncertain, high
temperature reduces the viscosity of oil, leading to lower oil pressure.
Thus temperature and pressure are statistically dependent.
This becomes clearly visible in the XY scatter plot of temperature and
pressure on the XY Chart tab. NOTE, if the temperature and pressure had
not been sampled in pairs, this dependence would not have been preserved.
This is an important consideration in collecting the data in the first
place.

PUZZLE: The way we used gen_resample in step
3 above did not preserve this statistical dependence. Does this cause the
resulting estimate of the likelihood of failure to be too high or too low? Why?
You will learn the answer at step 9.
- We will now
use gen_resample in a more sophisticated manner, which preserves
dependence. This uses gen_resample as an array formula, which is a bit
tricky, but worth the trouble. Start by placing the cursor in cell A2,
then while holding down the left mouse button also select cell B2. The
screen should appear as shown below.

- The next
step is to edit the formula to include the pressure data as a second
argument as shown, but don't press the Enter key yet.

- Now
simultaneously hold down the <Shift> and <Ctrl> keys, then
press <Enter> (see Microsoft Excel help on array formulas for more
details). If you did everything right, curly brackets will now appear
around your formula. Array formulas control multiple cells at once. When
gen_resample is used as an array formula, it assures that the random
sample taken from the temperature data in cell A2, is paired with its
associated pressure data in B2. NOTE: It was not necessary to enter the
two data ranges individually, separated by ",". It would also
have been possible to enter the single combined range as an argument.

- Run a
simulation of the failure indicator with the dependence preserved, and you
should find a much higher probability of failure of around 5.5%. That's
about eight times higher than the previous result.
This tutorial has demonstrated:
- The use of indicator formulas to estimate the
probability of an event.
- The use of
gen_resample to simulate a distribution form past data.
- The array version of
gen_resample to simulate multiple random inputs with dependence.
Simulating Investment Portfolios
Perhaps the area in which the modeling of statistical dependence has been
most institutionalized is that of investment portfolios. Because the distributions
of many investment instruments are either normal or log normal, an elegant
alternative exists to resampling. Instead, the mean returns and covariance
matrix may be calculated from past data. For an algebraic definition of
covariance see any text on statistics. For an intuitive geometric definition of
covariance, see chapter three of INSIGHT.xla.
Consider three potential investments with expected annual returns and
covariance matrix shown below in file Portfolio.xls.

Note that all three potential investments have the same expected return and
variance. However, A and B are negatively correlated, A and C are positively
correlated while B and C are uncorrelated. In the following tutorial we will
simulate three different portfolios comprised of the following investments: 50%
of the portfolio in A and 50% in B (we will call this "A+B"), and the other two 50/50 combinations,
"A+C" and "B+C". This will demonstrate the well-known correlation effect at the
heart of modern portfolio theory (see Reference [2]).
Tutorial 4b
- Open the
file Portfolio.xls. Use the Simulation, Thaw command to make the formulas live. The first step is to simulate the three correlated
investments. This will be done with the gen_MVNormal function as follows.
Begin by selecting cell C9:E9 as shown.

- Next use the
function wizard to insert gen_MVNormal. The arguments are B4:B6 for the
mean, and C4:E6 for the covariance matrix. Once the arguments are
selected, DO NOT CLICK OK. Instead simultaneously hold down <Shift>
and <Ctrl>, then press <Enter>.

- Curly
brackets should appear around the formula as shown. Press the <F9>
key a few times to make sure everything works. Be sure to see gen_Functions.xls for demonstrations of all the random number generators

- Next we will
model the three portfolios described earlier. The return of the A+B
portfolio for example is (.5*C9+.5*B9) as shown below.

PUZZLE: Before proceeding to the next step,
what do you think the expected return will be for each portfolio? How would you
expect the shapes of the three distributions to differ?
- Run a
simulation with one output cell for each of the three portfolios. The
notice that the expected returns of the three portfolios are nearly
identical at roughly 10%. The 5th and 10th percentiles display the value
at risk at these levels. For example there is a 5% chance that A+B will
return 2.75% or less, A+C will lose 4.6% or more, and B+C will lose 1.7%
or more.

Next, create a common histogram of each of the
portfolios using Smoothed 2D Lines. This clearly shows the effects of the
correlation between the original investments. With A+B, the correlation is
negative, with changes in one likely to be cancelled out by opposite changes in
the other. This has the effect of narrowing the distribution of the return, and
is desirable if you are risk averse. This was the central point of the
pioneering work of Harry Markowitz in the early 1950's (see Reference [2]). With the A+C
portfolio, the correlation is positive, with changes in one likely to be
amplified by similar changes in the other, hence the wide distribution. This is
the least desirable for the risk averse investor. There is no correlation
between B+ C, resulting in a distribution that is between the other two.

It is also instructive to use the interactive
common histogram. Use <Ctrl><PgUp> and <Ctrl><PgDn> to
cycle through the histograms.

Summary
In
this section we introduced the concept of an indicator variable whose average
estimated the probability that an event would occur.
We
used resampling to simulate a distribution based on historical data and demonstrated
the importance of capturing statistical dependence if present.
Finally
we showed how to generate Multivariate Normal random variables using their
means and covariance matrix.