Jump to content

User:NorwegianBlue/refdesk/statistics

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by NorwegianBlue (talk | contribs) at 18:44, 22 May 2008 (Hat problem). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Statistics & Propagation of error with least square fits

I originally came across this question my freshman year (at college), and although I am now a senior, I have yet to get a concrete answer. Basically, in my program (chemistry), we must do a lot of propagation of error, and there is one particular point that confuses me. Basically, if I plot a set of data that already has an associated error with it, how can that error be incorporated into the slope and intercept of a best-fit line? For example, if I plot absorbance vs. concentration of some solution, each point will have an error in the X direction (concentration, due to errors in preparation) and an error in the Y direction (absorbance, error and variablity due to instrument). If I calculate a best fit line, how can I incorporate these errors with the normal errors generated with a best fit? Let me know if you don't understand and I will dig up an example. As a similar question, if I take a set of points with error and try to take an average, how can these be incorporated into the error (and/or standard deviation) of the average? Again if you need an example, let me know. I would be happy with just a name of an equation or point me to a page. I would be very grateful :) Thanks a lot --Bennybp 02:39, 11 January 2007 (UTC)[reply]

Well I sort of found an answer in Bayesian linear regression, although it looks much more complicated than I wanted it to be. Anything simpler? --Bennybp 03:06, 11 January 2007 (UTC)[reply]
If your data sets are humungous, incorporating the individual point errors will have only a minimal effect. Otherwise, one method you can use is to replace each data point in your data set by K copies, each of which is randomly perturbed by X and Y amounts having the known error distribution for that point. Then compute the mean and variance of the miraculously multiplied data set. The mean should be the same, but the variance should be larger. If N was the size of the original set and you would have used the bias correction for the variance estimator of replacing N by N−1 in the divisor, then now replace KN by KN−K. The same method can be used for least square fits; in fact, computing an average is a form of least square fit in one dimension. Unless you use a fixed pseudo-random sequence you should get slightly different results each time; if the differences are too large, then N was not large enough.  --LambiamTalk 05:54, 11 January 2007 (UTC)[reply]
Depending on how variable the errors in the X variable are relative to the errors in Y, the standard least squares "best-fit" line may be a very poor fit. Our regression dilution article gives a good overview of the issues, with good references. You might also be interested in errors-in-variables models. -- Avenue 15:39, 11 January 2007 (UTC)[reply]
The average is easy: the variance of the sum (of independent variables) is the sum of the variances, and the standard deviation scales linearly with what it describes. Therefore the standard deviation of the average is just as the average is the sum divided by N; . I separated the two factors of to emphasize that the standard deviation of the average is the RMS of the individual standard deviations, divided by . It is that division which makes repeating a measurement worthwhile; the resulting error is better than the "average" errors of the individual measurements.
Handling something like least-squares is more complicated, but it can be handled by general error propagation methods — see the first equation for in that article. You have two functions f for the slope and intercept of your line; if you do the fit with only symbolic values (this gets complicated for large N), you can then take the required partial derivatives and evaluate the total error. As a trivial example, consider just two points: obviously the line passes through both so the slope is just and the intercept is . The partial derivatives for m are
we can use those (with the chain rule) to easily evaluate the ones for b:
The total errors (using Greek letters for uncertainties for brevity, with ) are then
.
Throwing in numbers I get . The much higher uncertainty for b comes from the distance the points are from the y-axis; perhaps a more useful measure would be the uncertainty at the midpoint, which is . (I doubt it's a coincidence that with two points the uncertainty at the middle is just the distance to the middle times the uncertainty in the slope.) Obviously this algebraic approach would become tedious for N any bigger, but one can probably make it somewhat easier using the linear algebra approach to LSF and a CAS. Does this help? --Tardis 17:06, 11 January 2007 (UTC)[reply]
That all seems like a big help to me. I didn't think of applying propagation of uncertainty to the linear regression equation. And I usually get everything mixed up anyway (error due to measurements, standard deviation, etc.). Thanks a lot :) --Bennybp 00:47, 12 January 2007 (UTC)[reply]

Sample size vs. extreme observations

For a one-variable statistic with an infinite, standard-normally-distributed population, what is the relationship between the sample size and the expected highest value observed? NeonMerlin 23:49, 15 January 2007 (UTC)[reply]

Given a single measurement of a random variable with a cumulative distribution function F, the probability of getting a result that does not exceed s is F(s). For N measurements, the probability of not exceeding s in any of the measurements is F(s)^N. The derivative of F(s)^N with respect to s will thus tell you the probability of getting no result larger than s+ds in any of them, but a result of at least s in at least one of the measurements. That is thus the probability that your highest observed value was s. In other words, F(s)^N would be the cumulative distribution function of the highest observed value. I made that up from scratch, but a quick googling confirms it, see for example the abstract of [1] --mglg(talk) 00:51, 16 January 2007 (UTC)[reply]
That gives me, for the standard normal distribution, a CDF of Φ(x)n = and a PDF of d/dx that. I imagine someone who knows integral calculus will be able to simplify these, but how do I turn either of them into the expected value? NeonMerlin 01:40, 16 January 2007 (UTC)[reply]
In general, given a PDF you can calculate the expectation value as Integral[x PDF(x) dx]. I don't think you can get any analytic answer in this case. There is probably a useful approximation for large N, though, but I don't know it. By the way, (essentially) your integral above has a name: the Error function. --mglg(talk) 03:40, 16 January 2007 (UTC)[reply]
Let Z denote the highest of n independent random variables having the standard normal distribution. It is possible to give the following upper lower bound on the expected value of Z: E(Z) > Φ−1(n/(n+1)). This can be seen as follows. Φ is a monotonic functions, so it commutes with max. So Φ(Z) is the highest of the Φ-values of n random variables whose (cumulative) distribution function is Φ. But these Φ-values are then simply uniformly distributed on the interval (0,1), so Φ(Z) has distribution function F(x) = xn (as per above), and E(Φ(Z)) is now easily found to be n/(n+1). Since function Φ−1 is concave convex in the area of interest (for n > 1), E(Z) = E(Φ−1(Φ(Z))) > Φ−1(E(Φ(Z))) = Φ−1(n/(n+1)). I have not tried to estimate how tight this is, but I wouldn't be too surprised if E(Z) = o(Φ−1(n/(n+1))).  --LambiamTalk 05:21, 16 January 2007 (UTC)[reply]
Nice, Lambiam. But shouldn't it be "convex" and "lower bound", i.e. E(Z) > Φ−1(n/(n+1)) ? --mglg(talk) 19:12, 16 January 2007 (UTC)[reply]
You're right. In the meantime I have done some calculations, which suggest that the difference is not a runaway one; putting E(Z) = Φ−1(n/(nn)), the quantity δn appears to decrease very slowly as n increases:
      n  delta_n
  -----  -------
      1  1.00000
      2  0.80231
      4  0.71501
      8  0.67000
     16  0.64408
     32  0.62783
     64  0.61688
    128  0.60909
    256  0.60326
    512  0.59874
   1024  0.59511
   2048  0.59213
   4096  0.58965
   8192  0.58755
  16384  0.58578
It looks like δn will remain above 0.5.  --LambiamTalk 21:51, 16 January 2007 (UTC)[reply]

How do I tell if I've chosen an appropriate statistical distribution

I have a group of 12 observations. I'd like to predict what my observations will be in the future. I also need the distribution to apply Bayes Theorem.

Right now, I'm using the normal distribution but I don't know if that's the right choice. I've calculated the skewness and kurtosis of the data, but I don't have any idea what they're supposed to be! I mean, I know if my observations were truly normally distributed, the skewness would be zero, but I don't know if my skewness of 1.65 is "close enough" or what. Are there rules of thumb for this? moink 05:50, 1 June 2006 (UTC)[reply]

Under the assumption of normal distribution, the probability that a sample of 12 observations has a skewness whose absolute value is at least 1.65 is about 0.002. That is fairly low, and normally ground to reject the null hypothesis of normalcy. What is the source of the observations and how critical is the accuracy of the estimated distribution? Often the physical or other origin of the data suggests a plausible crude model for the distribution that is good enough in practice. --LambiamTalk 06:36, 1 June 2006 (UTC)[reply]
It is not particularly critical. I was actually kinda hoping not to have to share the type of data, but since it's apparently a very poor fit to the normal distribution, I guess I will. It's the length of my menstrual cycle. Now all the boys on the math RD can get all grossed out.  :) I like to know if I should carry tampons on me, and the Bayes' theorem thing... well, if you're very bright you may be able to figure it out but I will not provide an explanation. Here's the data: 32, 29, 28, 28, 26, 27, 27, 29, 36, 25, 26, 28. moink 07:41, 1 June 2006 (UTC)[reply]
You know, you could use a neural network for precisely this task. Neural networks can be used to predict the length of menstrual cycles as well as stock market values or other things. Choose an encoding for the lengths, train some sort of recurrent network on the data you have, and then get it to generate predictions. If I get time, and I am sufficiently bored, I might even try this for you. Dysprosia 07:49, 1 June 2006 (UTC)[reply]
Sounds cool but beyond my abilities. Right now, though, I'm less interested in predicting exactly the length of the next cycle, and more in knowing the approximate probability that it is at least some length so I can apply Bayes' theorem. moink 07:56, 1 June 2006 (UTC)[reply]
Using your data and the formula at skewness, I find a skewness of 1.27, which is still significantly different from the null hypothesis but less so. Looking at the data, the problem appears to be the outliers at the high side. If you censor the data by discarding values > 30, you get a good agreement with a normal distribution. Given the application censoring at the high side is acceptable, since you want confidence at the low side. The sample is still a bit small, though, to really confidently assume the low end behaves normally, without outliers. --LambiamTalk 14:38, 1 June 2006 (UTC)[reply]
So much for trusting my spreadsheet software. I thought about dropping the large ones, but it's in the higher range that I'm most interested in the probability, and since it seems that it does occasionally get that large, and not that rarely, I wanted to take that into account. moink 15:28, 1 June 2006 (UTC)[reply]
Well, I'm by no means suggesting that this is what you are trying to calculate, but just for the sake of argument: if A=pregnant, and B=menstruation has not yet occurred, and one were interested in P(A|B), then P(B|A) would of course be very close to unity, but what value should be used for P(A)? Would the age specific fertility rate be correct? --NorwegianBlue 15:01, 1 June 2006 (UTC)[reply]
Addendum: P(A) would obviously have to be either zero, or a lot higher... --NorwegianBlue 15:11, 1 June 2006 (UTC)[reply]
Why would you say that? I mean, it could be zero, but it could be the small numbers you'd get using the failure rates of certain contraceptives. Even with several instances of unprotected sex in a month, it will generally not go above 25-30%. moink 15:24, 1 June 2006 (UTC)[reply]
Agreed. You are right. --NorwegianBlue 16:35, 1 June 2006 (UTC)[reply]
Chi squared might be your answer to whether the data is normal or not. Basically this works by dividing up the domain in to a number of boxes, you then count how many of your data items fall into each box and compare with the number predicted from the normal distribution. Add up the square of differences and compare with the approptite Chi-squared statistic. This should give a confidence interval as to whether the difference is significant or not. I suspect with only twelve points you don't really have enough data to meaningfully talk about skew. --Salix alba (talk) 15:10, 1 June 2006 (UTC)[reply]
Sigh. Ok, so I'm transparent. My prior distribution in this context is from a record of instances of penetrative sexual intercourse along with the underlying numbers used by this site combined with a pdf of the date of ovulation using the pdf above and the possibly quite poor assumption of a constant luteal phase of 14 days. moink 15:14, 1 June 2006 (UTC)[reply]
If the goal is to get pregnant, I'd start off by measuring my body temperature, to get a more precise estimate of the time of ovulation. After one year with no success, I would definitely go see a gynecologist. If, on the other hand, the goal is not to get pregnant, and you want a statistical tool to tell you when to start worrying, I'm afraid your approach won't work. Biological distributions tend to have very heavy tails, and you simply do not have enough data to make a sensible estimate of the distribution. With a limited dataset, however, you could make control charts. Here's a link to a how-to (powerpoint), courtesy of the British NHS Modernisation Agency. --NorwegianBlue 17:18, 1 June 2006 (UTC)[reply]
When reading my previous comment: forgetting to mention this was maybe a male freudian slip, but anyway: if the goal is getting pregnant, it would be a good idea to have your partner checked as well. --NorwegianBlue 21:33, 1 June 2006 (UTC)[reply]
Well, the goal is complicated. It is one, the other, or both of the above, in addition to saving on costs of Human chorionic gonadotropin tests (which have high false negative rates, especially when used too early) by using them at the right time. For example, applying an additional Bayesian update rule, a negative test with a sensitivity of 25 mIU of hCG would reduce my probability by a factor of nearly three if I used it today, while it would reduce the probability by a factor of eight if used tomorrow. And buying a basal thermometer would negate those cost savings.  :) The other main goal is the fun of overanalyzing these things. :) moink 04:12, 2 June 2006 (UTC)[reply]
If you check out the presentation that I linked to, and are able not to get too irritated about the "for dummies" manner in which it is presented, you will see that this might be exactly the tool that you are looking for. It is a tool for decision-making, primarily in the manufacturing industry, but it is now mandatory also in blood banks throughout the EU. As you can see from this article, it has been around for a long time, and has stood the test of time. It is a curious mix of parametric and non-parametric statistics.
Your statement on the goal leaves me with the impression that timing is a rather critical issue. I would definitely invest in that thermometer! I wish you all the best, and hope that you achieve your goal and that it brings you happiness. Best regards, --NorwegianBlue 23:39, 2 June 2006 (UTC)[reply]

Linear regression problem (Statistics)

I have some data that I tried analysing using a linear model with R (programming language). The dependent variable is Y. There is an explanatory variable X1. The data consists of two populations, represented by another variable X2. I did a regression analysis Y~X1 in each subpopulation (X2=1 and X2=2), as well as a regression Y~X1 for the combined data set. My problem is that the regression lines, when plotted, are quite different from what I would have expected from the scatterplot. I've put the data, plots and R code on an external page, and would be extremely grateful for feedback on whether I am making a silly mistake in the regression analysis or plotting, or if I might need to consult an opthalmologist.

The first figure shows the results of the regression analysis for each subpopulation (red and blue), as well as the result for the whole data set (green). The second figure repeats the regression line for the combined data set (green), and shows where I had expected to find the regression line(orange). Thanks in advance for any feedback. --NorwegianBlue talk 13:46, 28 January 2007 (UTC)[reply]

1.
Different lines can be obtained by using different weights to off mean points eg |xn-xav| , (xn-xav)2 it depends on how you want to emphasise the shape of the scatter plot. You could do an analysis of the scatter plot to see if the spread of values seems to match any known distribution.87.102.33.144 14:23, 28 January 2007 (UTC)[reply]
2.
The slope of the line of the combined data set didn't match what you expected - here's what I would do.
There are n points. Take a pair of points and find the slope and second variable (constant) for this pair of points eg y=ax+c, y2-y1=(x2-x1)a. Repeat this for each combination of points - there are n(n-1)/2 pairs of points..
Now you should have n(n-1)/2 values for 'a' (the slope) and 'c' (the constant). You can analyse this data set for and average 'a' and 'c', and also find the variance of both.
This should give you not only average slope but a range (amount of accuracy) of values in which 'a' can be expected to lay based on your data set.
Question - does this range of values include your expected (orange) line.. Hope that helps. I'm suggesting you find the 'amount of error' in the line.87.102.33.144 14:40, 28 January 2007 (UTC)[reply]
Thanks! Re 1: I was thinking plain old least squares linear regression. My problem isn't that I had some preconceived expectations about the regression lines, but that they are less steep than what my eyes are telling me that they should be when I look at the scatter plot. I have plotted the residuals against quantiles of the normal distribution (qqnorm in R). Except for two outliers at the lower end, and four at the upper end, the residuals appeared to be reasonably normally distributed. And with approx. 1500 data points, I wouldn't expect 6 outliers to have a tremendous effect.
Re 2. That's interesting, I'll try to do what you suggest. Should the average slope and intercept using this procedure be identical to those obtained using least squares linear regression? --NorwegianBlue talk 15:11, 28 January 2007 (UTC)[reply]
As I understand it - not identical - method of least squares emphasises 'erratic' points more than a magnitude (absolute value) method. So there will be a minor difference (except of course when all the points are exactly on a line - not the case here).
I've just 'thought' of something else - is your method using - vertical offsets (my guess is it is - as it is simpler and more common) - if so you could try perpendicular offsets (much better) see http://mathworld.wolfram.com/LeastSquaresFitting.html (second set of diagrams down) - I think the picture speaks for itself. If your not using perpendicular offsets you really should try this - it does give better results (especially when the slope of the line is high) - you might find it gives a result closer to the one you expected - the eye is usually a good measurer..87.102.33.144 16:19, 28 January 2007 (UTC)[reply]
You also might want to look at http://www.tufts.edu/~gdallal/slr.htm which gives "confidence bands from the regression line " - I haven't checked the maths on this - what you need to do is find "the confidence bands from the regression line" - a google search might help or you could ask here - I'd be very suprised if your orange line didn't lie within the confidence bands.
The confidence bands effectively give a measure (statistical) of where the actual line can be expected to lie (within certain probabilities) - like a broad thick line that the actual line must lie within.87.102.33.144 16:19, 28 January 2007 (UTC)[reply]
I think you are right in that the lm() function of R minimizes the squared vertical offsets. Browsing the help files, I found that R has a procedure called line() which does robust line fitting. The line thus obtained was slightly "better" than the green line obtained using lm(), but still far from my orange line. --NorwegianBlue talk 18:48, 28 January 2007 (UTC)[reply]
Looking at your picture, the orange line is more or less the major axis of an ellipse containing most of the points of the joint scatter plot. In general, the line obtained by linear regression has a slope that is less steep. Let U and V be two independent normally distributed (μ = 0, σ = 1) random variables. If X = aU + b, Y = mX + c + dV, for sufficiently large samples the least squares linear regression fit will approximate the line y = mx + d. The lines of equal density in a scatter plot are then ellipses of the form u2 + v2 = r2, for various values of r, where u = (x−b)/a and v = (y-(mx+c))/d. Expressed in x-y coordinates: ((x−b)/a)2 + ((y-(mx+c))/d)2 = r2. In general, the slope of the major axis of these ellipses is not equal to m. As you change the scale of Y, the value of m changes accordingly. But the major/minor axes of an ellipse are not projective invariants. In particular, they are perpendicular, but linear scale transformations do not preserve angles and consequently do not preserve these axes.  --LambiamTalk 12:36, 29 January 2007 (UTC)[reply]
Thanks, Lambiam. It's correct that I expected the regression line to more or less follow the major axis of that ellipse, and I didn't know that a less steep line was to be expected. Can I infer from your comment, then, that there is nothing obviously wrong with the regression lines as plotted? --NorwegianBlue talk 18:58, 29 January 2007 (UTC)[reply]
You can infer from my comment that I see nothing obviously wrong with these lines. For a simple visual check, if you have a nicely ellipse-shaped cluster, take the vertical tangents at the left and right sides of the ellipse. The regression line should pass through the points where these lines touch the ellipse. For each point on the regression line, the vertical extents above and below should be (roughly) equal. As far as I can see without a visit to the ophthalmologist, your red, blue and green lines pass this visual check; the orange line does not.  --LambiamTalk 13:36, 30 January 2007 (UTC)[reply]
Thanks a lot! --NorwegianBlue talk 22:11, 30 January 2007 (UTC)[reply]

Arcsin sqrt(x) transformation

My officemate had a reviewer on a paper write "data should be transformed by arcsin sqrt(x)." I recall (I think) that this is a normalizing transformation for some kind of data set (binomial/n, I think), but neither she nor I can find a reference on it. Any ideas? --TeaDrinker 00:30, 20 January 2007 (UTC)[reply]

Google gave an immediate reference: http://www.tina-vision.net/tina-knoppix/tina-memo/2002-007.pdf
The point seems to be that if the data are binomially-distributed, the transformation gives a variance independent of the mean. I question, however, the arrogance of the reviewer in saying what should be done, without explanation or even considering that another approach could be valid.86.132.163.126 12:25, 27 January 2007 (UTC)[reply]

Statistical process control

In statistical process control using control charts, I have noticed that presenters often recommend calculating the standard deviation in a, so to speak, nonstandard way. The recommended procedure is to calculate a mean moving range, i.e. , using a relatively small dataset, and then divide the mean moving range by the magic number 1.128. If you google for "(1.128 and calculate)" and are feeling lucky today, you will find a such a presentation. The number 1.128 is often represented by the symbol d2. Does anybody know the maths behind this non-standard estimator of the standard deviation? --NorwegianBlue 19:00, 1 June 2006 (UTC)[reply]

If the SD (innate variability about the current mean) was estimated from all the values, there would be an over-estimate in the presence of a trend (shift of mean), whether upward, downward or cyclic. Using the difference between successive observations removes this effect. If the sum of the squares of these differences is used, it has to be divided by 2(n-1) and square-rooted to estimate SD. Your formula uses absolute differences, without a "2" in the denominator. But for a Normal distribution, Mean Absolute Deviation is SD*root(2/pi).
Combining these corrections, 1.128 is root (4/pi).
86.132.237.108 19:49, 2 September 2006 (UTC)[reply]

Vysochanskiï-Petunin inequality

I read the article on the Vysochanskiï-Petunin inequality, and am trying to understand it, and its implications on the interpretation of control charts using 3-sigma limits, i.e. lambda=3. The article states: "The sole restriction the distribution is that it be unimodal and have finite variance. (This implies that it is a continuous probability distribution except at the mode, which may have a non-zero probability.)" I am having problems with the above, because up until now I have always thought that a probability distribution is either continuous or discrete, but the statement in parenthesis seems to imply that it can be both (continuous for some arguments, discrete for others). To me, this raises several questions:

  • Is there such a thing as a unimodal discrete probability distribution? The statement cited above seems to indicate that the answer is "no", but oviously many discrete probability distributions have only one mode.
  • If the answer to the preceding question is "yes", does the Vysochanskiï-Petunin inequality apply to a unimodal discrete probability distribution?
  • What is the probability density of a "continuous" probability distribution at the mode, if the probability is non-zero? Undefined?
  • Could someone give an example of a reasonably occurring probability density with mode=0, nonzero probability at 0, and arguments that are positive or zero, which would either satisfy nor not satisfy the requirements of the inequality?

Thank you! --NorwegianBlue talk 12:24, 20 October 2007 (UTC)[reply]

Probability distributions can be quite general. Given your sample space, say and any σ-algebra on it, you can define any function from that σ-algebra to [0, 1] which satisfies a few axioms, and you get a probability distribution. Continuous and discrete are just two narrow, yet common, kinds. The distribution kind alluded to is not really one of them, but resembles one or the other in different regions.
  • No (unless there is just one point where the probability is positive), and this becomes clear if you take a look at the definition of a Unimodal function. The probability density for a discrete distribution is 0 for most real numbers. Hence, for any number where the probability is positive, you have a local maximum, and thus the distribution is not unimodal.
  • At the basic level it is undefined, but the Dirac delta function can be introduced to deal with it.
  • That "reasonably occurring" bit is tricky... Can't think of any right now.
-- Meni Rosenfeld (talk) 13:06, 20 October 2007 (UTC)[reply]
Thanks a lot, Meni, I think I understood. To verify, I'll try to give examples of "reasonably occuring" probability distributions with nonzero probability at 0, and which satisfy and don't satisfy the requirement. Am I correct in thinking that the distribution
p(x<0) = 0
p(x=0) = 0.3
p(x>0) = 0.7*exp(-x)
satisfies the requirement, whereas the distribution
p(x<0) = 0
p(x=0) = 0.2
p(x>0) = 0.7*exp(-x)+0.1*g(x)

where g(x) has the properties
g(x) = 0 when x<0
The integral of g from -inf to +inf is 1
g(5) > exp(-5)
does not? --NorwegianBlue talk 15:34, 20 October 2007 (UTC)[reply]
The idea for the first looks good, but the notation isn't. The distribution you mean can be described by:
As for the second, same problem with the notation, and I'm not sure we know enough about g to deduce anything. If, on the other hand, we know that , I think we can conclude the distribution isn't unimodal. -- Meni Rosenfeld (talk) 16:49, 20 October 2007 (UTC)[reply]
Thanks again. I'm sorry about the notation, I knew it was awful and should have apologized beforehand. I also realize that I failed to make a clear distinction between densities and probabilities, and that I failed to take into account the factors 0.7 and 0.1 that I was multiplying exp(-x) and g(x) with. By cutting and pasting from the wiki math notation in your reply, I'll have a second go at rephrasing what I was trying to express:
The first distribution was intended to be:
The second distribution was intended to be:
As you point out, the requirements to g should have been:
Sorry about persevering, I just want to make sure I understand:
  • Did I get the notation right?
  • If so, is my first distribution just a more awkward way of expressing what you did in two lines?
  • Does the first distribution satisfy the requirements of the Vysochanskiï-Petunin inequality?
  • Does the second distribution violate the requirements of the Vysochanskiï-Petunin inequality?
-NorwegianBlue talk 19:53, 20 October 2007 (UTC)[reply]
  • Much better. It seems a bit unconventional, but it gets the necessary information across.
  • Pretty much, yes. Basically, by calculating using my formula, you will get the same results as your formula (there are some technical details involved).
  • Yes.
  • I think not. The factor of 7 wasn't the only thing I have changed - I have given a condition on the derivative of g rather than g itself. To ensure being non-unimodal, there must be some point where the derivative of the density function (which is also the second derivative of the cumulative density function) is positive. Ensuring that with a condition on g is trickier.
-- Meni Rosenfeld (talk) 22:23, 20 October 2007 (UTC)[reply]
I understand. Your replies have been very helpful. Thank you! --NorwegianBlue talk 11:16, 21 October 2007 (UTC)[reply]

Histograms and probability distributions (Statistics)

What are the correct terms to distinguish between what you get when you smoothen out a histogram of observed data, correcting scales such that the area under the curve is 1, and the underlying true probability distribution? --NorwegianBlue talk 09:24, 29 October 2007 (UTC)[reply]

By smoothing you are trying to estimate the probability density function of the population, much like the arithmetic average of a sample is used to estimate the population average. So you could call it the "estimated density (function)". A common class of methods for estimating the p.d.f. from the observed data that bypasses the histogram is described under Kernel density estimation.  --Lambiam 09:50, 29 October 2007 (UTC)[reply]
Thank you!
  • Can I extrapolate your answer into stating that the expression "observed probability distribution" (~550 google hits) should be avoided?
P.S. I fully appreciate that "estimated density (function)" is better than "estimated probability distribution". The reason I'm asking is that I'm preparing a talk for an audience which is not very mathematically sophisticated, and I'd prefer to avoid talking about densities. At the same time, I need to be precise about the distinction between our observations, and the underlying processes that we imagine are responsible for generating the data. I'm thinking of drawing a cartoon of Plato's allegory of the cave, with the histogram on the cave wall, and the density function behind the viewer. --NorwegianBlue talk 10:54, 29 October 2007 (UTC)[reply]
Indeed, there is no such thing as "observed probability distribution". I have a coin — maybe fair, maybe not. I flip it once, and you record the outcome. Have you observed the probability distribution? Absurd, eh? Obviously not. I flip again. Now? No; but then when? Even a fair coin can produce ten heads in a row! From statistical observations we can only estimate properties of the generator. The more observations, the better the estimate. Plato's cave is a pretty good metaphor, but perhaps the coin drives the point home better. --KSmrqT 14:27, 29 October 2007 (UTC)[reply]
Thanks for the suggestion! Good point, I'll use it. --NorwegianBlue talk 16:51, 29 October 2007 (UTC)[reply]

Exponentially distributed data (Statistics)

When counting residual (unwanted) cells in blood components, I find that such counts tend to be best described by the exponential distribution. The illustration is a histogram of platelet counts in centrifuged plasma. The exponential distribution (red) gives a decent fit, while a normal distribution (blue) obviously is way off.

The article exponential distribution gives some examples of real-world scenarios which tend to be exponentially distributed, such as the time or distance between poisson-distributed events, but none of these really resemble my example, as far as I can see. I suspect that the principal source of variation in my data is the handling skills of the operator who carried the centrifuged blood bag from the centrifuge and placed it in the separator device (the device then applies gentle pressure to the bag, and lets the plasma escape through a tube on top of the bag).

I would like to understand why the exponential distribution gives such a good fit with the cell count data. Thank you! --NorwegianBlue talk 10:17, 29 October 2007 (UTC)[reply]

A possible explanation. Assume, that there are say a N platelet and each platelet has a small but finite chance, p, of escaping through the tube. You can use the binomial distribution to calculate the chance of n platelets escaping: NCn p^n (1-p)^(N-n). Now for very large n and very small p, the binomial is approximated by the Poisson distribution with mean p*n. This would also fix your data. Just a thought. --Salix alba (talk) 11:23, 29 October 2007 (UTC)[reply]
Platelet conc Occurences
0 1
1 103
2 415
3 212
4 112
5 106
6 64
7 72
8 50
9 36
10 42
11 30
12 28
13 29
14 15
15 18
16 12
17 11
18 10
19 5
20 6
21 5
22 2
23 3
24 3
25 3
26 2
27 5
28 1
29 2
30 2
35 2
37 1
43 1
52 1
63 1
I don't think that's quite it. As mentioned above, I think the key lies in what happens when the bag is carried from the centrifuge, and put in the separating device. After centrifugation, there is an interface (buffy coat) between the plasma and the red cells. Most of the platelets are there. Although the bags are handled very gently to preserve the buffy coat layer, occationally a bag might be shaken slightly, causing some of the platelets to mix with the plasma. This would be a rare event, and it would have varying strength, i.e. disturb the interface to a varying extent. Would such a scenario be expected to lead to an exponential distribution? --NorwegianBlue talk 12:11, 29 October 2007 (UTC)[reply]

The result of counting is a nonnegative integer. So you need a model providing nonnegative integers for its outcome. The normal distribution provides negative outcomes and it provides noninteger outcomes. The exponential distribution it provides noninteger outcomes. As Salix alba explained, the poisson distribution is a suitable model, providing nonnegative integer outcomes. Knowing the parameter λ the probability of observing the count i is e−λλi/i! . If you sum this expression over i, from zero to infinity, you get 1, to confirm that it is a discrete probability distribution function. If you multiply this expression by i and sum over i from zero to infinity you get λ, to confirm that λ is the mean value of i. Similarily you may compute the standard deviation of i as λ1/2. Knowing the parameter λ the poisson distribution model provides an estimate for the observation i~λ±λ1/2. However, your problem is the dual one, having made the observation i you need to estimate the parameter λ. Keeping i constant and letting λ be a nonnegative real variable, the above expression e−λλi/i! is a continuous distribution function called the gamma distribution. Integrating over λ gives 1, confirming that it is a probability distribution function. The special case i=0 is the exponential distribution function. The mean value of λ is i+1 and the standard deviation of λ is (i+1)1/2. So the gamma distribution model provides an estimate for the parameter λ ~(i+1)±(i+1)1/2. Bo Jacoby 12:43, 29 October 2007 (UTC).[reply]

I don't know why people insist on mentioning the Poisson distribution, which quite obviously does not fit the data. The exponential distribution has both a continuous and a discrete version, the latter being called the Geometric distribution. I suspect that the cell count, although discrete, strongly depends on the (continuous) time period between two events, which for some reason is distributed exponentially. I do not understand biology well enough to speculate as to why that is so, but I do suggest that time is the key here. -- Meni Rosenfeld (talk) 13:18, 29 October 2007 (UTC)[reply]

It is not obvious that the poisson distribution does not fit the data for some λ in the interval 0 < λ < 1. Bo Jacoby 14:39, 29 October 2007 (UTC).[reply]

Except for the fact that the expectation of the given distribution is roughly 6. -- Meni Rosenfeld (talk) 14:47, 29 October 2007 (UTC)[reply]

Thank you all for taking the time to respond to my question! I realise, of course, that just about any distribution representing measurements of physical quantities is in principle a discrete distribution. When weighing a sample of a pure substance, you are in essence counting the number of molecules in the sample. I should have pointed out, also, that the unit on the x-axis is platelets × 109/L, so the number of platelets counted in each measurement is much larger than the numbers on the axis suggest. I did suspect, beforehand, that the gamma distribution might be appropriate. This is because it is unreasonable to assume that the mode of the distribution is zero. When reexamining the data, I see that the mode is in fact 2. The data was recorded without any decimals, and is detailed in the table on the right. I read the section about parameter estimation in the article, but alas, this was way above my head.

If a gamma distribution is indeed appropriate, is there a not-too-difficult way of calculating the scale and rate parameters? Specifically, does anyone know whether R (programming language) is able to do this? To Meni, thanks for the suggestion about a time factor being responsible, I'll certainly look closer into that. --NorwegianBlue talk 15:00, 29 October 2007 (UTC)[reply]

This changes everything. Since the problem is essentially continuous, the Poisson distribution is completely irrlevant. The data is much too skew to be normal, and the exponential distribution seems less likely now that we see that the mode isn't 0 (note that my suggestion to look at time was based on the assumption that the distribution is exponential, though it could still be relevant). What you can do is to calculate the moments of the data (that is, the quantities derived from the moments - mean, variance, skewness and excess kurtosis), and compare them to those of different distributions (our articles have those). The first few can be used to find the parameters which give the best fit, and the rest can tell you how good the fit is. For the gamma distribution (note that Bo has suggested it not as the distribution of the data), for example, denoting the mean by and the variance by , you have and , and if, for your data, the skewness is and the kurtosis is , you have a good match.
Note that the data also seems to be quite noisy, so no common distribution will be a great match. -- Meni Rosenfeld (talk) 16:28, 29 October 2007 (UTC)[reply]
Unless I have made a mistake, for our data we have . If you try to fit this as a gamma distribution, you'll get , so it would be exponential; but the skewness would be only 2, so I'd say that's off the table. You can double-check my calculations, and try your luck with any of these distributions. -- Meni Rosenfeld (talk) 16:52, 29 October 2007 (UTC)[reply]
I think you do have a normal distribution in the chart. The graph doesn't show it because you lumped the 0-2 range together to get a value greater than that for the 3-4 range. However, if you plot each value separately, it looks like it will match a normal distribution fairly well. One non-math comment: someone should develop a machine to transport the bag of centrifuged blood so it won't be shaken up, or at least so the degree of shaking will be consistent. StuRat 17:23, 29 October 2007 (UTC)[reply]
Are you kidding? The normal distribution is symmetric, this one is anything but. It has a skewness of 3; It has a mode at 2 and mean at 5.5; it has a long tail on the right and a ridiculously short one on the left. Not to mention that the fitted normal in the picture doesn't even remotely resemble the distribution, even if we split the first bin. -- Meni Rosenfeld (talk) 18:02, 29 October 2007 (UTC)[reply]
It looks to me like it might well be truncated, but symmetrical (or at least somewhat close), if you toss out the single the value at 0. And no, the blue curve isn't correct, a different normal curve would be needed to fit the data. The maximum point might be around 2.3. StuRat 16:16, 30 October 2007 (UTC)[reply]

The poisson distributions constitute a one-parameter family of distributions having nonnegative unlimited integer outcome. That's why you would try it first. Calculation in J:

  concentrations  NB. the new table of data
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 35 37 43 52 63
  occurences
1 103 415 212 112 106 64 72 50 36 42 30 28 29 15 18 12 11 10 5 6 5 2 3 3 3 2 5 1 2 2 2 1 1 1 1
  ] N =: +/ occurences NB. compute and display the number of occurences
1411
  ] S =: +/ occurences * concentrations NB. sum of concentrations
7839
  ] L =: S % N NB. mean value of the concentrations
5.55563
  ] SS=:+/ occurences * *: concentrations NB. sum of squares
87477
  ] var=:(SS%N)-*:S%N NB. variance of concentrations
31.1235

As the variance is greater than the mean value, the distribution is not poisson. Sticking to discrete distributions, it might be a negative binomial distribution. (See also cumulant). Bo Jacoby 18:05, 29 October 2007 (UTC).[reply]

Seriously, don't you read what the OP is saying? The data is not discrete. It is a continuous variable, which is presented on an arbitrary scale and rounded. -- Meni Rosenfeld (talk) 18:19, 29 October 2007 (UTC)[reply]

I'm sorry that I when posting the question didn't state clearly that the concentrations were based on much larger counts than the numbers on the x-axis suggested, and that the concentration is for practical purposes a continuous variable, although it unfortunately was recorded without decimals.

To Sturat: The cellular content is not a great concern, because (a) the plasma is pooled and processed further industrially, and (b) units which obviously have been shaken are re-centrifuged before further processing.

To Meni: Thanks again. I did some further experimentation in R. When examining the histogram with bin sizes of 1, I get the impression that the distribution is bimodal may be a mixture of two distributions, with a second peak the second distribution peaking somewhere in the vicinity of 8. Something like 0.7*gamma_density(shape=2, scale=1)+0.3*normal_density(mu=8, sd=4) fits reasonably well. For practical purposes (making control charts), however, I think I'll treat the data as exponentially distributed, and use the transformation y=x0.2777, which according to textbooks on statistical process control should result in a Weibull random variable which is well approximated by the normal distribution. --NorwegianBlue talk 18:37, 29 October 2007 (UTC)[reply]

Within a single bin, the number should be an observation of a random variable with a Poisson distribution. That means that in bins 6, 7 and 8 we have an uncertainty of about ±8 in the frequency of occurrence. The local hump is slight in comparison and insufficient to suggest bimodality. The data roughly fits a log-normal distribution.  --Lambiam 21:12, 29 October 2007 (UTC)[reply]
However, I cannot get a good fit with a log-normal distribution. For the maximum likelihood fit, the expected # of occurrences for the bin c = 2 is only 218.8, differing wildly from the observed value of 415.  --Lambiam 19:34, 30 October 2007 (UTC)[reply]

Do all the bins have the same width, or has bin 0 half the width of the other bins because the variable is nonnegative? Bo Jacoby 23:01, 29 October 2007 (UTC)[reply]

Evaluating a definite integral (d2, SPC)

I'm curious about the function

,

where , i.e. the cumulative standard normal distribution, and n is an integer greater than one. I suspect that the usage of α without any indication that it is a function of x may be non-standard (at least, it had me confused). However, I wanted to reproduce it exactly as it appears in the paper in which I found it: Tippet, LHC. On the Extreme Individuals and the Range of Samples Taken From a Normal Population. Biometrika vol 17, 364-387, 1925. It is my understanding that this function gives the expected value of the range of a sample of size n, taken from a standard normal distribution. In the field of Statistical process control, the function w is known as d2.

These are my questions:

  • For n = 2, it is easy to show that the integral is numerically equal to 2/sqrt(π) within machine precision, and I feel reasonably certain that 2/sqrt(π) is indeed the exact value. I would like to know how one determines whether this is the case. As I am neither a statistician nor a mathematician, I would need the details spelled out.
  • Can this integral be expressed in terms of simple functions for values of n greater than 2? If so, how?
  • Is my suspicion avove, that the notation today would be considered non-standard, correct? If so, what would standard notation be?

Thanks. --NorwegianBlue talk 14:45, 9 February 2008 (UTC)[reply]

I haven't done much probability, so I'm not really sure what's going on there, but it does seem odd to me to be integrating the CDF - normally you integrate the density function to get the CDF, so I'm pretty much lost. As for the standard notation, wouldn't that just be:
?
It's not particularly unusual not to use a notation that doesn't explicitly state a dependence, although when you're only dealing with functions one variable, I can't see much point in not being explicit. --Tango (talk) 13:40, 10 February 2008 (UTC)[reply]
Thanks! Well, the function works, and there's no doubt that it's the CDF that is to be integrated. Below is a very crude implementation in R. The standard normal density in R is dnorm, while the distribution function is pnorm:
     d2 = function(n, distribution=pnorm, lolim=-12, hilim=12)
     {
         dx = 0.01
         steps = seq(lolim, hilim, by=dx)
         alpha=distribution(steps)
         rectangles = (1 - (1-alpha)^n - alpha^n)*dx
         return(sum(rectangles))
     }
And here's the output when testing it:
     > format(d2(2), digits=15)
     [1] "1.12837916709551"
     > format(2/sqrt(pi), digits=15)
     [1] "1.12837916709551"
--NorwegianBlue talk 14:08, 10 February 2008 (UTC)[reply]


A little more experimenting shows that d2(3)=3/sqrt(pi). However d2(4) = 2.058751, while 4/sqrt(pi) = 2.256758. Can anyone see a pattern? --NorwegianBlue talk 00:32, 11 February 2008 (UTC)[reply]

I can't offer much, but Mathematica confirms that and . It is also worth noting that . It doesn't readily succeed in calculating symbolically. The numerical value is 2.058750746..., for which Plouffe's inverter gives no match. So I doubt there is any significantly simpler way to express it. -- Meni Rosenfeld (talk) 09:05, 11 February 2008 (UTC)[reply]
Thanks a lot, Meni. --NorwegianBlue talk 13:02, 11 February 2008 (UTC)[reply]

Help needed in understanding 1925 Biometrika paper

I'm reading the paper Tippet, LHC. On the Extreme Individuals and the Range of Samples Taken From a Normal Population. Biometrika vol 17, 364-387, 1925., temporarily available here.

The paper defines the function

,

where initially was defined as , where , i.e. it is a cumulative distribution function, and n is an integer greater than one. The function returns the expected value of the range of a sample of size n, taken from the probability distribution defined by . If is the standard normal distribution, w is known as the control chart constant d2 in the field of statistical process control.

I posted a related question recently about this function. I have found that the function

is easily evaluated numerically, using a naive brute force approach such as the following C++ implementation:

double d2(int n)
{
    double x, dx;
    dx = 0.01;
    double sum = 0.0;
    for (x = -12; x <= 12; x = x + dx)
    {
        double alpha = cumulative_normal_distribution(x);
        sum = sum + (1 - pow(1-alpha,n) - pow(alpha,n))*dx;
    }
    return sum;
}

I know that there are far more efficient and elegant ways of evaluating integrals numerically. Nevertheless, this implementation works, and mirrors my simple-minded mental image of what is going on, so I'll stick with it for now. Tippett then defines as equation (10) the following:

and states (page 372) that

"The second moment, and hence the standard deviation of the distribution, was determined in several cases by equation (10). The work is very laborious, as it involves cubature, and even so, the result can only be given to a few figures. It is believed that the values given in Table IV are correct to the last figure."

The return values square root of should correspond to the control chart constant d3, and Table IV confirms that this is indeed the case. My problem is that I haven't been able to evaluate , and I suspect I've misunderstood something, possibly something pretty elementary about double integrals (I have no mathematical training beyond this, and it's a long time ago). Here's my implementation, again using a similar brute force approach. I've highlighted in red the part that I was most unsure of:

//NOTE: Comments in green added by OP after the question was posted, based on Meni Rosenfeld's answers.

double d3(int n)
{ 
    double x_1, x_n, dx_1, dx_n;
    dx_1 = dx_n = 0.01;
    double sum = 0.0;
    for (x_n = -12; x_n <= 12; x_n = x_n + dx_n)
    {
        double alpha_n = cumulative_normal_distribution(x_n); // WRONG - should be doing alpha_1 in the outer loop!
        // Making x_1 (-inf ... +inf) the outer loop, and x_n (-inf ... x_1) the inner loop fixes the problem.
        for (x_1 = -12; x_1 <= x_n; x_1 = x_1 + dx_1) // WRONG - see the code in red: x_1 should be the limit of integration. 
        {
            double alpha_1 = cumulative_normal_distribution(x_1); // WRONG - should do alpha_n in the inner loop!
            sum = sum + (1 - pow(alpha_1,n) - pow(1 - alpha_n,n) - pow(alpha_1 - alpha_n, n))*dx_1*dx_n;
                    // As Meni pointed out there's an error here ^ . The minus sign should be plus.
        }
    }
    double variance = 2*sum - pow(d2(n),2);
    return sqrt(variance);
}


The return values of this function are way off the correct values. My question is: can someone spot the misunderstanding(s) in my interpretation of the paper, the maths, and/or the error(s) in my implementation? Your help would be much appreciated. Thanks in advance, --NorwegianBlue talk 16:13, 23 February 2008 (UTC)[reply]

I don't see a problem with the way you are calculating the double integral. However, there is something terribly wrong about the function you are trying to integrate. To explain, I'll introduce some notation:
In order for any of this to be meaningful, b must exist for any . For this to happen, at the very least we must have for every . This is clearly not the case, because and so . Thus the function is not integrable. Check that you have copied the equations exactly as they appear in the paper, and that the notation means exactly what you think it does; If so, there is possibly a mistake in the paper. -- Meni Rosenfeld (talk) 12:10, 24 February 2008 (UTC)[reply]
Hmm. Well, I've checked the equations, and believe my reproduction is correct. I also very much doubt that there is a mistake in the paper, as it was widely cited, and equation (10) was used by subsequent authors to tabulate d3. Therefore, the most likely explanation is that I've misunderstood the notation. I prepared a small .PNG file from the paper with the relevant equations and a figure which explains the notation. x1 refers to the highest value in a sample of n values, and xn refers to the lowest, as is shown in the figure. The .PNG is here, the paper is here. Thanks again! --NorwegianBlue talk 13:20, 24 February 2008 (UTC)[reply]
Note the step from equation (9) to (10), where turns into . The plus is correct, and everything falls into place nicely if you make this correction. So it turns out even widely cited papers can have errors. Also, the table you linked to has some errors, for example . -- Meni Rosenfeld (talk) 14:36, 24 February 2008 (UTC)[reply]
Thanks a lot! The error in the step between equations (9) and (10) was glaringly obvious, once I became aware of it. I've looked through the steps from equation (4) to (9) for more errors, but unfortunately, most of it exceeds my mathematical capabilities. I'm also confused about the notation. What does the S-notation mean? Is appears in places which might suggest summation, is it the same as sigma notation?
Unfortunately, everything did not fall nicely into place. I corrected the minus sign to plus, but the function still does not work correctly. The entire program I've used is here. The section of the paper immediately preceding equation (10) might suggest that equation (10) only works for even values of x. Here's what I found, with the correction applied:
            n   program   correct
            ---------------------
            2    23.94    0.8525
            3     8.65    0.8884
            4    23.90    0.8794
            5    10.01    0.8641
            6    23.86    0.8480
            7    10.70    0.8332
            8    23.82    0.8198
            9    11.15    0.8078
Even values slowly decreasing from 23.84, odd values increasing from 8.65. Strange. A standard deviation of >23 for the range of samples of n=2 drawn from a standard normal distribution is obviously not right! Any suggestions of what else might be wrong? --NorwegianBlue talk 16:58, 24 February 2008 (UTC)[reply]
Well, there aren't any problems with the mathematics. I have calculated the sum in Mathematica and it gives the correct result. A step of 0.01 is enough to get a few correct digits, and an integration bound of 12 is overkill - 5 will be more than enough. So you need to debug your code - it looks generally okay but I don't speak C++ fluently. Try to define intermediate functions as I have done above, calculate some values of them, and either check them for plausibility or put them here so I can have a look.
The S notation is unfamiliar to me - it is probably, indeed, an archaic version of the Sigma notation. -- Meni Rosenfeld (talk) 17:40, 24 February 2008 (UTC)[reply]
Or start throwing cout to test intermediate values of variables in between every single line. x42bn6 Talk Mess 19:49, 24 February 2008 (UTC)[reply]
Thanks a million, Meni!!! I implemented your a, b, c and d3 functions exactly as you wrote them, and the result was reasonably close to the correct one (overestimated it slightly). It turned out that the step size was important. The smaller I made it, the closer I got to the values listed as correct above. I was able to achieve 3 digits using a step size of 1/2048.
So I went back to the original function, and compared the code, to see why it behaved differently. The problem was related to what I had highlighted in red. When I made x_1 the outer loop, and x_n the inner loop, and changed the test in the inner loop to x_n <= x_1, the modified original function gave the same results as the a, b, c and d3 functions. Thanks again, this was beginning to drive me nuts! By the way, did Mathematica give results with high accuracy? I'm particularly interested in a good approximation of d3(2). --NorwegianBlue talk 20:23, 24 February 2008 (UTC)[reply]
Glad to be of assistance, and sorry for not picking up earlier that it should be . Mathematica can give arbitrarily good accuracy, but since this is a double integral it can take some time. should be correct up to the digits specified. I'm running it now for more digits in case you'll need it. A few additional values (n=3 and up) are 0.888368, 0.879808, 0.864082, 0.84804, 0.833205, 0.819831, 0.807834, 0.797051. -- Meni Rosenfeld (talk) 22:04, 24 February 2008 (UTC)[reply]
Thank you again for taking the time. You have helped me tremendously, both in increasing my understanding of the maths underlying SPC, and in solving a tough practical problem. --NorwegianBlue talk 15:52, 25 February 2008 (UTC)[reply]
Gah! I looked at this yesterday, but posted nothing because I didn't see that ordering problem. Shame on the original paper (not the OP!) for writing a double integral as ! Treating a differential (infinitesimal) as a first-class entity is fine, but writing the normal definite integral operation without nesting it properly leads to madness, since the doesn't get a variable name attached to it. --Tardis (talk) 16:39, 25 February 2008 (UTC)[reply]
As stated above, my mathematical training is limited, so I need to be spoonfed :-). You are saying that my approach, i.e. making the inner integral the inner loop, would normally be the way to go, right?
Would
be a more standard way of writing the integral? I was a bit confused about this, and that's why I flagged the x_1<=x_n test (which should have been x_n<=x_1) in the code. --NorwegianBlue talk 18:54, 25 February 2008 (UTC)[reply]
Yes. We are essentially calculating here the integral of a function which is itself an integral. The inner function is , and thus it should appear as inner in the formula and be implemented as inner. We then calculate the integral of this, , thus the formula should be . This would be even more critical if we wanted to calculate, say, - we need to know which is the variable that only goes up to 0, and flipping the order changes this. In our integral, we can understand from the context that relates to (it wouldn't make much sense for to have itself for an upper integration bound). In the implementation, it doesn't really matter in which order you make the loops, since you can change the order of integration as long as you keep the bounds right. The bit clearly implies that the bound is , and it's unfortunate it took us so long to notice that. -- Meni Rosenfeld (talk) 23:14, 25 February 2008 (UTC)[reply]
Thanks again, and thanks to Tardis for making me aware of exactly why I tripped. I've learned a lot from this thread! --NorwegianBlue talk 10:05, 26 February 2008 (UTC)[reply]

Hat problem

I came across the monthy hall problem some time ago, and I'm curious about how it applies to more complex situations. Anybody have a clue? Bastard Soap (talk) 21:42, 20 May 2008 (UTC)[reply]

What kind of more complex situations did you have in mind? There are all kinds of situations in probability were intuition turns out to be wildly incorrect. Take a look at Prosecutor's fallacy for one example. Also, if you haven't seen it already, we have an article on the Monty Hall problem. --Tango (talk) 21:55, 20 May 2008 (UTC)[reply]
I’ve seen this problem which is somewhat similar cause much debate (and I remember it because I was convinced I had the right answer for several hours until I figured out otherwise).
The situation is that there are 3 people, and they will enter the same room and receive a hat (either red or blue, each with a 50% chance). They cannot communicate whatsoever once in the room, but they can collaborate and determine a strategy before entering the room.
Then at the exact same time each of them is to guess what color their hat is (or they may choose not to guess at all). If at least one person guesses correctly and nobody guesses wrong, they win a prize.
The question is, given an optimal strategy, what is the probability that they will get the prize? GromXXVII (talk) 22:36, 20 May 2008 (UTC)[reply]
Give me a minute... --Tango (talk) 22:44, 20 May 2008 (UTC)[reply]
Nope, not happening. At least one person has to guess, and they have a 50% chance of getting it wrong and blowing everything, so there is no way to improve on 50%. However, if the answer was 50%, you wouldn't have asked the question... Damn you... Let me sleep on it. --Tango (talk) 22:50, 20 May 2008 (UTC)[reply]
75%. Turning my computer off and going to sleep: Take 2. --Tango (talk) 23:02, 20 May 2008 (UTC)[reply]
I had first confused this with another problem where after receiving the hats, the players are asked one by one if they know their own hat's color, and after at most 5 (IIRC) queries, someone will know. BTW, Tango's going to sleep quip is actually quite a cute coincidence, as a New York Times <SPOILER ALERT>article </SPOILER ALERT> covering the problem mentions someone going to bed after solving it and recognizing its relevance to coding theory as he fell asleep. And yes, the answer is 75% Baccyak4H (Yak!) 04:11, 21 May 2008 (UTC)[reply]
I'm having a hard time understanding the problem completely. Each one gets a hat (at the same time?), but can't see what color their own hat is? Can they see the color of each of the others' hats? --Prestidigitator (talk) 04:55, 21 May 2008 (UTC)[reply]
You have to assume that they can in order to give them a better than 50% chance of getting it right. And it took me a while to work out what the strategy is but I knew it had something to do with what to guess given what you see. (Here's a hint: no matter what the arrangement of hats, there will always be at least two people with the same colour. If you're a prisoner who can see two hats the same colour, or two hats of different colour, see if that affects your optimum strategy for guessing your own colour.) Confusing Manifestation(Say hi!) 06:42, 21 May 2008 (UTC)[reply]
Interesting how the game show contestants became prisoners there. --tcsetattr (talk / contribs) 07:10, 21 May 2008 (UTC)[reply]
Am I missing something? If all are given with 50% probability then whatever colour hats the other people have won't affect the probability of guessing your own colour. The obvious answer seems to be that they should decide that just one should guess having a 50% chance. -- Q Chris (talk) 07:36, 21 May 2008 (UTC)[reply]
Yeah, you're missing it. The external link given by Baccyak4H contains a pretty thorough spoiler which should have you slapping your forehead and saying "Of course!" --tcsetattr (talk / contribs) 07:58, 21 May 2008 (UTC)[reply]
You’re right that each person that guesses has a 50% chance of guessing right, or guessing wrong regardless of any other factors. But the problem has three people, and whether one person guesses right or wrong is not independent of whether somebody else does (in an optimal strategy, which too my knowledge results in 75%). GromXXVII (talk) 10:52, 21 May 2008 (UTC)[reply]
The key point to note is that the method which the hats were assigned was made explicit; there is a prior distribution on what the hats can be. While when reading the problem the description seems trivial, it turns out to be crucial. If no information about that is known then of course there is no improvement over 50%. So look at this as a conditional probability problem, and consider Thomas Bayes your friend... Baccyak4H (Yak!) 14:28, 21 May 2008 (UTC)[reply]
Or just right out all the possible combinations - worked for me! --Tango (talk) 15:17, 21 May 2008 (UTC)[reply]
Right, but that still presupposes the prior. <SPOILER ALERT>What if the prior was probability 1 for all hats being red? What is the chance that the strategy will work in that case? </SPOILER ALERT> Was your sleeping comment intentional, or just a coincidence, with respect to the article I linked to above? Baccyak4H (Yak!) 16:37, 21 May 2008 (UTC)[reply]
Well, yes, but we were given the prior. I wasn't suggesting you were wrong, just that there's a less technical approach. Pure coincidence - I read the problem just as I was about to turn the computer off and go to sleep, attempted it, gave up, turned the computer off, immeadiately realised the answer and turned the computer back on again! (I'm not entirely sure why... bragging rights, I guess.) --Tango (talk) 18:30, 21 May 2008 (UTC)[reply]

So it's not really correct that events which aren't linked don't "influence" each other? You can get the flow of the event by looking at other manifestations of it? Bastard Soap (talk) 08:40, 21 May 2008 (UTC)[reply]

Having read the spoiler I don't think that's right. The events don't influence eachother, the strategy just means that when a wrong guess is made it will be by all three people at the same time, whereas a correct guess will be made by just one. Since each geuess has a 50/50 chance three out of four times a correct guess will be made, one out of four three incorrect guesses will be made. -- Q Chris (talk) 10:54, 21 May 2008 (UTC)[reply]

Incidentally, is there a simple proof that I'm missing that the 75% strategy is optimal? I'm pretty sure it is, but I can't see how to prove it. --Tango (talk) 11:46, 21 May 2008 (UTC)[reply]

I saw more or less the same problem a while ago. Here's one form: you have (say) 1023 (this is a hint) people, each provided with a random hat as before. This time everyone must vote, and the group wins if the majority vote correctly. A variant allows weighted voting, so you can (say) cast 100 votes that you have a red hat. I was reminded of these because the forced voting clause makes it easy to prove the optimal strategies are indeed optimal, which I can't quite see how to do for the problem here. Algebraist 12:53, 21 May 2008 (UTC)[reply]
This might work...although I haven’t tried to actually prove the optimality before
Consider that regardless of the strategy everyone will guess wrong 50% of the time. If two people guess wrong together in a case, that provides correctness of at most 67%. If all three people guess wrong together in a case, then this provides correctness of at most 75%. This exhausts all cases. It also assumes that one can partition the sample space though, which isn’t always possible, but nonetheless should provide the sufficient upper bound. (for instance, I don’t think it’s possible to have a strategy which wins two thirds of the time because 8 isn’t divisible by 3). GromXXVII (talk) 16:42, 21 May 2008 (UTC)[reply]
It's not possible to have a deterministic strategy with 2/3 chance of winning, if you allow non-deterministic strategies, it should be doable. --Tango (talk) 18:25, 21 May 2008 (UTC)[reply]