# Averaging versus Stacking

OK Forum Friends, I am still a bit new at this, so be kind. :)

From what I can understand, AAVSO recommends more than one image of a field. I take four as the minimum needed to get a standard deviation of the average magnitude of four estimates. So, when I have a target with SNR > 100 I compute the magnitude for the target in each image and then average the results and obtain an appropriate uncertainty with N=4. I do not stack. I figure that the uncertainties associated with each image estimate are not relevant and it is the uncertainty of the average over four estimates that is relevant. So each image estimate might have a rather large standard deviation due to differences in color in the ensemble, but the mean and standard deviation of the four images is more reflective of the actual uncertainty.

Example: NSVS J0426046+163402 (SNR = 98, pretend it does not vary over short time periods)

N=4

Range = 12.765-12.787 (VMag)

Average = 12.77V; StDev target = 0.013

Now, if you take any one image, the Std of the variable and the ensemble is somewhere in the range of 0.04 - 0.08 (VPHOT error). This can be improved for a single image by eliminating comp stars. However, I thought the idea of multiple images and ensembles was not to eliminate comps but let the "law of averages" work over several images. If you look at any ensemble it is quite possible to eliminate comps to improve your uncertainty, but I don't think it really helps compared to averaging over images. For example, the best I could do by eliminating all but 2 comps for a single image is VPHOT error = 0.011. But if I am concerned about a 0.002 difference, I can simply add another image or integrate longer for better SNR. (For example, if I simply add a fifth image that is around the mean I get a StDev of the variable of 0.011.) Further, even if I eliminate comps to "beat down" the uncertainty of single images, neither the average nor the uncertainty over the four images necessarily improves; in fact, it can be worse.

For example, I eliminated two comps bringing my individual image uncertainty down about 0.02/image. But my average did not change (12.77V) and uncertainty over the four images increased to 0.015. Naturally, I have no degrees of freedom to argue this point as this is a single example. But I am wondering if I am on the right track.
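The bookkeeping described here can be sketched in a few lines of Python. The four magnitudes below are made-up values inside the quoted range, not the actual measurements from this example, and `summarize` is just an illustrative helper name.

```python
import statistics

def summarize(mags):
    """Return (mean, sample standard deviation, standard deviation of the mean)."""
    n = len(mags)
    mean = statistics.mean(mags)
    sd = statistics.stdev(mags)  # sample SD, N-1 in the denominator
    return mean, sd, sd / n ** 0.5

# Hypothetical per-image V magnitudes (illustrative only).
four = [12.765, 12.787, 12.766, 12.786]
mean4, sd4, sem4 = summarize(four)

# Adding a fifth image that lands exactly on the mean shrinks the sample SD.
five = four + [mean4]
mean5, sd5, sem5 = summarize(five)

print(f"N=4: mean={mean4:.3f} sd={sd4:.4f}")
print(f"N=5: mean={mean5:.3f} sd={sd5:.4f}")
```

A fifth point sitting on the mean leaves the sum of squared deviations unchanged, so the sample SD falls by a factor of sqrt(3/4), about 0.87. That is in line with the 0.013 to 0.011 drop mentioned above.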

BTW, if I stack I get a full ensemble of comp stars and a variable uncertainty of 0.020. I can improve that by eliminating comp stars, but is this good practice given that I have adequate SNR in each image and thus can average?

Ed

I take 3 images of a star per filter, measure each separately and average the result. The error is then the SD of the measurements. I then transform the results (not using ensemble but a single comp).

This seems easier than aligning and stacking the separate images before taking the measurements.

Cheers

Terry

Should you average, stack, or what?

First, some extreme cases:

- If you must capture 'fast' behavior of a faint star in time-series data collection...you can't stack because you lose time resolution. If the target is faint and you get an estimated SNR of 40 (bad, yes?)...but the amplitude of the fast-acting variable is 1 magnitude...then you can happily meet the science needs even though your data plot looks (and is) noisy. (Sometimes I'm trying to capture 'fast' behavior of a mag 16 target...taking 15 second exposures, and subframing to shorten download time...so that my cadence/cycle time is 20 seconds or shorter. This can be challenging photometry.)

- If the target/comps are very bright...forcing you to use a 'short' exposure (2 seconds, for example)...then take enough images to get 10 - 30 seconds of open-shutter time...to beat down scintillation noise...and then average all the values to one data point for reporting. (It is instructive in these cases to also examine the standard deviation of your individual measures...to get a feel for how bad scintillation can be for your rig/exposure time/seeing conditions/etc.)

- If the target is very faint, slow-acting, and your rig can't take exposures long enough to make a single 'high' SNR exposure...then you should stack.

OK, what about for more 'typical' observing runs/targets? I recommend a simple test.

Image a field of known/constant stars that span a decent range of magnitudes... so that for your V filter a 30-second exposure will not saturate the brighter stars. Take a time-series (on a good night, at low air mass) of about 30 images.

Now analyze the actual scatter you get for various stars from very bright, to barely detectable/low-SNR.

You will probably note the following:

- for the brightest (but not saturated) stars, your photometry software will estimate SNR to be approaching 1000 (based on Poisson statistics/photon counting)...but the actual standard deviation of the measures of those stars will be more like 0.003 or 0.004. In other words your real-world performance will be worse than predictions by a factor of 2 or maybe 3 or 4. I call this the 'disappointment zone'...your actual performance is considerably worse than Poisson statistics would imply. I see little point in exposing stars to this ADU level...unless they are very bright and you have no choice but to expose to put them close to the saturation limit. (Don't stack here. Average if you feel you need more precision from your measurements.)

- for stars a couple magnitudes fainter than the saturation limit...your photometry software will estimate an SNR of X (e.g. 120)...and your standard deviation of the actual measures will be in close agreement to that estimate. I call this the 'sweet spot'. Predictions are close to reality...therefore you know what you will get from your rig. (You probably don't need to stack here. Average if you feel you need more precision from your measurements.)

- for stars about 4 - 6 mags fainter than the saturation limit...your SNR is rather low...such as 50 or worse. This is the 'almost mandatory stack zone'...especially as SNR gets down near 20 or 10.

Keep note of your results from this test. If you find, for example, that for a 30-second V exposure on a mag 11 star...that your real-world standard deviation is 0.009 mag...then perhaps you should report that value for similar cases?
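To compare a software SNR estimate with measured scatter, the SNR first has to be turned into a magnitude error. A minimal sketch, assuming the usual small-error approximation that the 1-sigma magnitude error is about 1.0857/SNR; the function names are illustrative and not from any photometry package.

```python
import math

def mag_error_from_snr(snr):
    """Approximate 1-sigma magnitude error for a given SNR (small-error limit)."""
    return 2.5 / math.log(10) / snr  # about 1.0857 / SNR

def compare(snr_estimate, measured_sd):
    """Ratio of measured scatter to the SNR-based prediction (1.0 means agreement)."""
    return measured_sd / mag_error_from_snr(snr_estimate)

# Round numbers quoted in this post.
print(f"predicted error at SNR 1000: {mag_error_from_snr(1000):.4f} mag")
print(f"measured 0.003 vs predicted at SNR 1000: ratio {compare(1000, 0.003):.1f}")
print(f"predicted error at SNR 120: {mag_error_from_snr(120):.4f} mag")
```

With these inputs a measured scatter of 0.003 mag at an estimated SNR of 1000 comes out a bit under three times the prediction, which falls inside the "factor of 2 or maybe 3 or 4" described for the disappointment zone.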

These are starting points and rules of thumb, not iron rules.

I try to avoid stacking unless I have no choice. I'd rather see individual measures from each frame...and then QC/reject/filter obvious outliers...and then average the remaining measures to one data point.

The above 30-image time-series test probably does not help identify/measure systematic errors. If you have problems with vignetting/flat fields, or transforms and weird/extreme star colors...that can produce significant systematic errors that make your measures consistently different from data submitted by other folks.

But I hope this helps.

Tom: I think we are in agreement.

Ed

KTC wrote:

"- for the brightest (but not saturated) stars, your photometry software will estimate SNR to be approaching 1000 (based on Poisson statistics/photon counting)...but the actual standard deviation of the measures of those stars will be more like 0.003 or 0.004."

I'm not real sure what you mean by "actual standard deviation" but I'm guessing you mean to average the 30 or so proposed measurements and compute the standard deviation of that sample? The specifics here will depend upon the details but I think a look at what Poisson statistics means is in order.

IF (note that is a BIG if) we measure a star with a "constant" output we believe there is an average (or mean) value that can be determined within some boundaries of uncertainty, ie, we can put "error bars" around some measured value and claim that the "true" value lies within those error bars to some degree of probability. In particular, a SINGLE measurement of the star gives us an estimate of the underlying constant mean with a standard deviation of 1/SNR. IE, a single measurement of the flux tells us the "true" value is within +/- 1/SNR from that measurement to a confidence of about 68% (+/- 3/SNR for 99% confidence). For your example of a measurement with SNR = 1000, the "true" value of the star's flux would be known to +/- 0.003 for a 99% confidence level.

Now, if you make 30 or so measurements, each with a SNR ~1000 one would expect ALL of them (30x0.99) to fall within +/- 0.003 of the first one IF the star has a constant output over the measurement interval. If you find one or more of the measurements fall outside that range, your only proper conclusion is that the star is NOT constant (with a less than 1 chance in 100 of being wrong). IE, you have a *variable star.* (Note: If all your measurements indicate the star is constant, you can refine your estimate of the standard deviation by 1/sqr(30).)
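A quick way to put numbers on the confidence levels quoted above, assuming independent Gaussian errors (an idealization that real photometry only approximates). The helper name `prob_within` is illustrative. This sketch measures deviations from the true value; the scatter between two individual measurements is wider by a factor of sqrt(2).

```python
import math

def prob_within(k_sigma):
    """Probability that a Gaussian measurement lies within +/- k sigma of the true value."""
    return math.erf(k_sigma / math.sqrt(2))

p1 = prob_within(1)          # the "about 68%" case
p3 = prob_within(3)          # the 3-sigma case
n = 30
p_all_within_3 = p3 ** n     # chance that all n independent measures stay inside 3 sigma

print(f"1 sigma: {p1:.4f}, 3 sigma: {p3:.4f}")
print(f"all {n} within 3 sigma: {p_all_within_3:.3f}")
```

Under these assumptions the chance that every one of 30 measurements stays inside the 3-sigma band is roughly 92%, so how much weight a single outlier deserves depends on the run length as well as on the per-measurement confidence level.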

Even if you conclude the star is variable over some range during the measurement interval, you can report the average of those measurements with the overall STD but you would be masking the underlying variability which, at the millimag levels of this example may be ok?

I'm not real sure what you mean by "actual standard deviation" but I'm guessing you mean to average the 30 or so proposed measurements and compute the standard deviation of that sample?

Yes, your actual, real-world measurements...not an estimate by software.

IF (note that is a BIG if) we measure a star with a "constant" output

I've heard quotes that about 98% of stars are not variable...at least at the amplitudes and time scales we can deal with below this atmosphere. If you want to be conservative, then you can say that 90% of all stars are not variable. (Kepler mission scientists have more problems...they are seeing 10 - 30 parts per million variation in star brightness...and that is making it tougher to find exoplanets. We will never work at that precision at sea level, or even a mountaintop.) In other words, it's pretty safe that randomly chosen stars that are not cataloged as variable...are 'constant enough' for the purposes of testing and characterizing your rig.

Now, if you make 30 or so measurements, each with a SNR ~1000 one would expect ALL of them (30x0.99) to fall within +/- 0.003 of the first one IF the star has a constant output over the measurement interval. If you find one or more of the measurements fall outside that range, your only proper conclusion is that the star is NOT constant (with a less than 1 chance in 100 of being wrong). IE, you have a *variable star.*

What about cosmic rays and other shot noise in your image?

Below is a spreadsheet analysis I did of a time-series test from one of the scopes here. (C-11, ST-7XME, V filter, 30-seconds)

| | 11.2 vs … | 11.85 | 11.74 | 12.29 | 12.64 | 13.22 | 13.21 | 13.84 | 14.36 | 14.66 | 14.92 | 15.36 | 16.09 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SNR | 400 | 400 | 400 | 400 | 350 | 240 | 240 | 156 | 106 | 84 | 67 | 48 | 25 |
| stdev | 0.003 | 0.003 | 0.003 | 0.003 | 0.004 | 0.005 | 0.006 | 0.007 | 0.012 | 0.014 | 0.015 | 0.020 | 0.037 |
| AIP4WIN | 0.002 | 0.002 | 0.002 | 0.003 | 0.003 | 0.005 | 0.005 | 0.007 | 0.010 | 0.013 | 0.016 | 0.023 | 0.044 |

My real-world standard deviation never got better than about 0.003...maybe 0.0025...or a maximum SNR of 400. My software estimates of SNR went higher than that...getting closer to 600, maybe even higher.

For the brightest stars the actual performance was worse than the estimated value (zone of disappointment). Around SNR 240 or so...estimates match real-world measurements (sweet spot!).
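The same comparison can be run directly on the numbers in the table. This sketch assumes the "stdev" row is the measured scatter and the "AIP4WIN" row is the software's error estimate for each star; that reading of the rows is an interpretation, and the variable names are illustrative.

```python
# Values copied from the table above, brightest star first.
snr = [400, 400, 400, 400, 350, 240, 240, 156, 106, 84, 67, 48, 25]
measured_sd = [0.003, 0.003, 0.003, 0.003, 0.004, 0.005, 0.006,
               0.007, 0.012, 0.014, 0.015, 0.020, 0.037]
software_est = [0.002, 0.002, 0.002, 0.003, 0.003, 0.005, 0.005,
                0.007, 0.010, 0.013, 0.016, 0.023, 0.044]

# Ratio above 1 means the real scatter is worse than the software estimate.
ratios = [m / e for m, e in zip(measured_sd, software_est)]

for s, m, e, r in zip(snr, measured_sd, software_est, ratios):
    print(f"SNR {s:4d}: measured {m:.3f}, estimated {e:.3f}, ratio {r:.2f}")
```

Read this way, the ratio sits near 1.5 for the brightest stars, reaches 1.0 around SNR 240, and drops below 1.0 for the faintest stars.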

I now have a decent idea of SNR/exposure/filter/aperture for one scope...and can plan future observations accordingly.

I'm looking forward to seeing your test results.

Thanks in advance.

ROE makes some good points, however I have to disagree with this:

“Now, if you make 30 or so measurements, each with a SNR ~1000 one would expect ALL of them (30x0.99) to fall within +/- 0.003 of the first one IF the star has a constant output over the measurement interval. If you find one or more of the measurements fall outside that range, your only proper conclusion is that the star is NOT constant (with a less than 1 chance in 100 of being wrong). IE, you have a *variable star.* (Note: If all your measurements indicate the star is constant, you can refine your estimate of the standard deviation by 1/sqr(30).”

I think that statistics were invented because we cannot assume that any experiment is perfect. An image and a measurement are both “experiments.” The star might, indeed, be perfectly constant but that does not mean the conditions are perfect or your optical train is perfect or there are no changes in seeing or transparency or that we might have imperfect flats and darks, or a hot pixel not picked up in calibration, or… etc. etc. In other words, all experiments have variables over which we have some control but not total control.

My opinion, uninformed as it might be: if I have a single measure that is different than the other 29 my first assumption is that experimental error is at work. So I do not agree that “your only proper conclusion is that the star is NOT constant.” My first reaction would be to ask what went wrong with the image. If nothing, then my next reaction, if I suspect it might be a variable, would be to take several time series over several nights to see if there is a pattern of variation. One night of 30 images with one outlier? No degrees of freedom to speculate, it's a singular event even if you believe it. Five nights of consistent variation, 4 degrees of freedom to speculate (err, hypothesize). Repeatability is the name of the game, as in all experiments. It would be the pattern in light of the uncertainty that would guide me to my conclusion.

Tom's point (and mine) is that averaging over several images and reporting the uncertainty as the standard deviation of those measures is a more realistic measure of uncertainty when appropriate (which is not always). I think it better reflects the varying conditions under which I operate, image to image, night to night, one set of calibration images to another.

However, that was not all of my post: there is also the issue of picking and choosing among the available comps or using the entire ensemble, warts and all. Anyone have comment on that?

There are lots of other variables that would give changing results in the example of a 30 image run that would give fluctuating measurements.

These include passing clouds, dew etc.

Another problem, I think, is that of residual bulk image (RBI).

This seems to make the raw measurement of the first image in a run slightly lower than subsequent images. Of course this depends on the CCD as some are more prone to RBI than others.

Cheers

Terry

WEY wrote:

"ROE makes some good points, however I have to disagree with this:"

I think you missed the point of my post. I specifically limited it to 30 essentially identical measurements of flux (no mixing in uncertainties of other stars as in deriving magnitudes) wherein no passing clouds would have caused the SNR to be lower, etc. The point went to the heart of why we need uncertainty estimates which goes to how we use them. We use uncertainty estimates to make decisions - basically to use or not use a data point in some analysis. This is a point I felt was not adequately covered in the uncertainty course. Consider.

There was an example in the uncertainty course of two measurements: 9.5 +/- 0.5 and 9.8 +/- 0.2. What can one say about these and, more importantly, what should one do about them? Depends. The 9.5 measurement includes the 9.8 measurement in its uncertainty so they could be comparable. If they were both taken in the same time interval I would reject the 9.5 measurement and go with the 9.8 measurement. I certainly wouldn't average them, for example. If they were taken a day apart I would have to keep both and suspect the possibility of variability but I couldn't reject the hypothesis that the star is actually constant over that time period. More data needed.
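One common way to quantify "could be comparable" is to express the difference in units of the combined uncertainty. A minimal sketch, assuming independent Gaussian errors; `sigma_separation` is an illustrative name.

```python
import math

def sigma_separation(m1, e1, m2, e2):
    """Difference between two measurements in units of their combined 1-sigma error."""
    return abs(m1 - m2) / math.hypot(e1, e2)

z = sigma_separation(9.5, 0.5, 9.8, 0.2)
print(f"separation = {z:.2f} sigma")
```

The two values differ by well under one combined sigma, so on these numbers alone they are statistically consistent. What to do with them still depends on when they were taken, as described above.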

The same process applies to 30 measurements made over a, say, 30 minute time interval. You are testing the hypothesis that the star is either constant or not constant (ie, variable). You can only do so within the confidence levels indicated by your uncertainty estimates. If you decide the star is not constant during that time interval you have evidence of variability. If you average the measurements you will destroy that evidence and, in effect, claim the star appears to be constant but with a larger uncertainty in that claim.

Is it important? Depends. Cadence is important. You get the same effect with your exposure times and sampling rates. Time exposures average the incoming flux over the exposure interval. The blinking neutron star in M1 cannot be observed with 1 sec exposures for example but it has been observed with much higher resolution exposures.

Bottom line for me - if you have high SNR data that shows a variability over your measurement interval, go ahead and report the individual measurements (rather than average them to one single measurement). A researcher not interested in short term phenomena can just run a low-pass filter over the data to get what he/she is looking for but someone interested in the faster stuff cannot retrieve the info wiped out by pre-processing via averaging.

KTC wrote:

"I've heard quotes that about 98% of stars are not variable"

Hmmm. I've heard that all red stars are variable. It also seems to me that most stars are red, at least compared to Vega. Maybe there is a dividing line between "red" and "really red?"

Roe wrote: " You are testing the hypothesis that the star is either constant or not constant (i.e., variable)."

Now I think I understand your approach. We have different approaches to hypothesis testing and to the data and thus different assumptions. To wit:

You might be testing the null that the variable is constant and the alternate that the variable is not constant. But I am not; in fact, I am not testing any hypothesis, I am collecting measures given some assumptions.

When I average I am accepting the work of previous investigators that this particular star is not expected to show short-term variation within the time frame I am taking data (thus uncovering my Bayesian tendencies). Thus I have embraced the hypothesis that there is no short-term variation a priori. I cannot test what I have assumed. My additional assumptions are about which central tendency to report and how to characterize the variation.

Since I am assuming that any variation can be ascribed to measurement error, I try to deal with measurement error by averaging over individual measures and report the uncertainty of that measurement error.

I am not saying your null is uninteresting and I don't think that you can claim that averaging is not appropriate under the right circumstances. We have adopted different approaches to the data. I typically take 4 images for each filter, it's all over in 8 minutes per filter. Perhaps the Miras I study do show such short-term variation and I (and by inference AAVSO given guidelines) am missing the boat.

You, on the other hand, may take enough images to see shorter-term variation. In fact, that might be what you are looking for and you have designed your experiments to search for this variation. Different research programs, different assumptions, different treatments of data. Nothing wrong here.

On the other hand there are stars with which I am in complete agreement with your point about reporting all observations; if I happen to be taking data for a star that is expected to vary over the time period in which I am collecting data, or if the VSX tells me that the star is of unknown quality (say an "S"), then I assume nothing and report all the observations. I do this frequently since there are lots of variables in my Mira FOVs. Different kinds of stars, different assumptions, different treatments of the data.

WEY wrote:

"You, on the other hand may take enough images to see shorter-term variation."

OK. We are not communicating and I take full responsibility for it. Sorry. I have said nothing about what I do, nor commented on your research program. Your original post was about averaging your data points. Along the way KTC described an experiment in which some 30 or so bright (SNR ~ 1000) stars might be studied. It was to that comment that I responded. I suppose I should have forked the thread.

That being said, your technique of averaging data points is common, I've done it myself. I was just hoping to point out that there is the possibility of losing data that may (someday) be important. My example of the neutron star in M1 is a good one, I think. Any ordinary measurement schedule would show it to be essentially constant but it has been shown to "blink" some 30 times per second. That seems to me to be a pretty interesting fact. Your mileage may vary. I'm not saying you and I are likely to achieve that sort of time resolution any time soon, but just suppose one of those slowly varying, long term Miras is undergoing eclipses with, say, an amplitude of 0.1 mag and you are taking images with a SNR of 50 so you stack, or average, to try to raise the SNR or reduce the uncertainty and you never see it? It's your call, go for it. IMHO, reporting 4 SNR = 50 measurements compared to one SNR = 100 measurement will not confuse a competent researcher.

"- If the target is very faint, slow-acting, and your rig can't take exposures long enough to make a single 'high' SNR exposure...then you should stack"

Tom, I am curious as to why you say stack rather than average in this case. Is it simply that the measurement software can't detect star centroids adequately and may drift off the stars you are trying to measure, but there are brighter stars in the image that can be used to align and stack? Then I agree, unless it is possible to manually adjust the measurement aperture locations and the quantity of images involved doesn't make manual adjustment impractical.

Mathematically there is no difference between averaging and stacking, except for noise introduced by image alignment. Assuming the SNRs of the images are relatively equal, stacking, say, 10 images with individual SNRs of 40 results in an SNR of SQRT(10)*40 and therefore, the estimated STDEV is reduced by a factor of 1/SQRT(10). Averaging 10 images of SNR 40 gives a STDEV OF THE MEAN that is also reduced compared to the individual images by a factor of 1/SQRT(10). If the SNRs vary significantly between images then you should calculate a weighted average SUM(Mi/σi^2)/SUM(1/σi^2) and the standard deviation of the mean becomes SQRT(1/SUM(1/σi^2)). With averaging, however, you have the individual measurements that allow you to calculate the standard deviation of the sample directly from the distribution of the individual magnitudes from the mean (average). In general you want to report the STDEV of the population, not the STDEV of the mean.
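The weighted-average formulas above translate directly into code. A minimal sketch; the magnitudes and per-image sigmas are hypothetical, and `weighted_mean` is an illustrative name.

```python
import math

def weighted_mean(mags, sigmas):
    """Inverse-variance weighted mean and the standard deviation of that mean."""
    weights = [1.0 / s ** 2 for s in sigmas]
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    sigma_mean = math.sqrt(1.0 / sum(weights))
    return mean, sigma_mean

# Hypothetical per-image magnitudes; the third image is noisier than the rest.
mags = [14.02, 13.98, 14.05, 13.97]
sigmas = [0.03, 0.03, 0.06, 0.03]
mean, sigma_mean = weighted_mean(mags, sigmas)
print(f"weighted mean = {mean:.3f} +/- {sigma_mean:.3f}")

# With equal sigmas the result reduces to sigma / sqrt(N):
# ten images at about 0.027 mag each (roughly SNR 40).
_, equal_case = weighted_mean([14.0] * 10, [0.027] * 10)
print(f"equal-weight case: {equal_case:.4f}")
```

The equal-weight case reproduces the 1/SQRT(10) reduction described above, while the noisier image in the first example is simply given less weight instead of being discarded.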

The other issue is that in stacking you need to align the images. If there is significant drift among the images, the alignment process will add noise to the images in the process of aligning star centroids. Individual pixel values from the original image can be re-allocated across pixel boundaries, particularly if there is some field curvature in the image.

If you are binning images in a time series, averaging also retains the data of the individual images within the bins so that you can run Chi square analysis of check stars or comparison star residuals to better estimate the uncertainty of the measurements.

Brad Walter, WBY

Good question.

I stack when individual images are such low SNR that the object is close to limiting magnitude, or fainter.

In other words, if you can't measure/detect the object in an individual image...stack.

I only stack when I have no other choice.

I deal with a lot of faint targets, 15-19th mag in V, so I stack just about everything 3-4 times. The end result is better SNR and lower error bars in my reported data. Here are screen shots from VPHOT demonstrating the results.

The target is V391 Lyrae, a Z Cam candidate that is normally 16th - 17th mag in quiescence. A single 180 second exposure in V on a 14-inch scope results in a measure with SNR of 26 and an err based on the SNR of 0.041. The std between my 3 comp stars with SNR>100 is 0.002, so all in all this is a pretty decent result at 16th mag.

Now if I stack four 180-second images my SNR goes up to 43 and the error computed from SNR is down to 0.026, much more like what I prefer to see. I also have 5 comp stars with SNR>100 to use if I choose to do so.
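For reference, the textbook expectation for stacking N similar frames is an SNR gain of sqrt(N). A small sketch comparing that ideal with the numbers quoted above; it assumes identical frames with uncorrelated noise, which real data only approximate, and the function name is illustrative.

```python
import math

def ideal_stacked_snr(single_snr, n_images):
    """SNR expected from stacking n similar frames with uncorrelated noise."""
    return single_snr * math.sqrt(n_images)

ideal = ideal_stacked_snr(26, 4)   # single-frame SNR and frame count from this post
observed = 43                      # stacked SNR reported above
efficiency = observed / ideal

print(f"ideal SNR = {ideal:.0f}, observed = {observed}, ratio = {efficiency:.2f}")
```

The reported stacked SNR is about 83% of the ideal value of 52, so the sqrt(N) rule is best treated as an upper bound when planning how many frames to stack.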

NOTE: You should never get a better Err than your Err(SNR), so that is the target I shoot for (if you do you're cheating or messing up).

So stacking is NOT the same as averaging, because you are not increasing the SNR by simply averaging a number of short exposures.

Ideally, you should try to have at least 100 data points per cycle or period. So if your star has a 300 day period, once every three days is really covering it well. You can expose for an hour if you want to. If it has a two hour period or less (UGSU fans) you need to do 60 second exposures or less to really cover the target.
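The 100-points-per-cycle rule of thumb is easy to turn into a cadence calculator. A minimal sketch; `max_sampling_interval` is an illustrative name, and the result is in whatever time unit the period is given in.

```python
def max_sampling_interval(period, points_per_cycle=100):
    """Longest spacing between data points that still gives the desired coverage."""
    return period / points_per_cycle

lpv_days = max_sampling_interval(300.0)            # 300-day period, in days
ugsu_seconds = max_sampling_interval(2 * 3600.0)   # two-hour period, in seconds

print(f"300 d period: one point every {lpv_days:.0f} days")
print(f"2 h period: one point every {ugsu_seconds:.0f} seconds")
```

The two-hour case gives a 72-second cycle time, which includes download overhead, so exposures of 60 seconds or less follow naturally.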

If I were doing time series I would rather have 60 data points per three hours than 15 stacked data points, but I am not doing time series. Z Cams typically have periods in excess of 3-4 hours, some much longer! This method works very well for LPVs, RCBs, YSOs and many other typical AAVSO targets whose periods are also long or irregular.

Deciding on what integration times, filters and analysis methods to use requires some familiarity with the objects you're observing. It's not all about the math and statistics.

I deal with a lot of faint targets, 15-19th mag in V....

Your problem is an aperture deficiency. Get a bigger scope.

Progress on Autoscope22 has stalled...but it doesn't have to be that way.

I am definitely in the camp of average binned data vs. Stacking. My contention is that if:

1. You are comparing the correct statistics and

2. You have sufficient image depth that your software can determine the star centroid so that it can distinguish the star from the background,

then you are better off averaging because you don't introduce variations due to alignment, and if you are doing anything other than just shifting the image you don't have to worry about maintaining the noise structure (net flux) of your star images, and finally, you are able to preserve the most information about the uncertainty of your measurements.

Please bear with me, this gets a little complicated. I did a little experiment the other night. I imaged Landolt standard star SAO 107-359 15 times in an R filter. The exposures were short, 5 seconds, so there will be some scintillation. Only one star was measured on each image so that magnitudes shown are raw magnitudes. There was very little movement of the measured star, less than 10 pixels ~ 6 arcsec maximum displacement among all images (peak to peak displacement). Therefore, even if I hadn't been using an SBIG ST7, shutter vignetting would not be a factor. There could be, however, some variation in what amounts to zero point offsets during the sequence which are included in the spread of the magnitude data. I would expect that variation to increase with the number of images since the elapsed time of the sequence increases about 20 seconds per image including download time.

I then compared the result of stacking the images vs. binning the data for the first 5, then the first 10 and, finally, for all 15 images. For the binned data I calculated uncertainty from the standard deviation of the individual magnitudes of the binned images, the average of the full CCD uncertainty equation values (error column) for the binned images and the average of the 1/SNR uncertainty estimates of the binned images. To compare the uncertainty of the average (mean) of the magnitudes of the binned images to the uncertainty of the stacked images you have to calculate the uncertainty of the mean value of the binned images. Therefore the uncertainties of the three sets of binned images were divided by the square root of the number of images being binned. That is why I stated above that you have to compare the correct statistics and it is another reason I don't like stacking. The uncertainty you get when you stack is mathematically the same as the standard deviation of the mean of the underlying sample, rather than the standard deviation of the underlying sample itself. I checked the standard deviation of the CCD uncertainty equation values to verify that there wasn’t a large difference in uncertainty between images, which would require the more complicated calculation of the uncertainty of the mean from data points with varying uncertainties.
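The "compare the correct statistics" step can be sketched as follows: for each bin size, compute the sample standard deviation of the individual magnitudes and divide by the square root of the number of images to get the uncertainty of the mean, which is the quantity comparable to a stacked-image uncertainty. The raw magnitudes below are hypothetical stand-ins, not the measured data, and `sd_and_sem` is an illustrative name.

```python
import statistics

def sd_and_sem(mags):
    """Sample standard deviation and standard deviation of the mean."""
    sd = statistics.stdev(mags)
    return sd, sd / len(mags) ** 0.5

# Hypothetical raw instrumental magnitudes for 15 short exposures.
raw = [-8.512, -8.498, -8.521, -8.505, -8.490,
       -8.515, -8.502, -8.509, -8.495, -8.518,
       -8.500, -8.507, -8.493, -8.511, -8.504]

# Bins made of the first 5, first 10, and all 15 images.
results = {n: sd_and_sem(raw[:n]) for n in (5, 10, 15)}

for n, (sd, sem) in results.items():
    print(f"first {n:2d} images: sd = {sd:.4f}, sd of mean = {sem:.4f}")
```

The sample standard deviation stays roughly constant as images are added, while the standard deviation of the mean keeps shrinking; only the latter should be set against the uncertainty reported for a stack of the same images.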

The standard deviations of the mean of the averaged image magnitudes were then compared to the uncertainty of the binned images for the bins of 5, 10 and 15 images. When you do this comparison you have to compare uncertainties calculated using the same method of estimating uncertainty. That is, you have to compare uncertainty determined from the CCD uncertainty equation of the binned images with the uncertainty calculated from the CCD uncertainty equation for the stacked image and uncertainties calculated from 1/SNR to each other. You wouldn't normally use 1/SNR as an estimate of uncertainty in data with SNR as low as in the unstacked images, but I did it anyway just to see how it would work out. For the stacked images there is no uncertainty calculated directly from the standard deviation of the magnitudes themselves since there is only one magnitude. However, it is interesting to see how the uncertainties of the mean calculated directly from the magnitudes compare to the other, less direct methods of estimation for the binned data. This gives an indication as to whether any of the methods is drastically underestimating the uncertainty.

In all cases the 1/SNR of the stacked images gives the smallest uncertainty estimate. In the case of 10 and 15 images it gives a much lower estimate even when compared to the CCD uncertainty equation estimate of the stacked images. This leads me to conclude that it is significantly underestimating the error, I think primarily because it doesn’t include read noise and some of the other minor contributors which start to have an effect when you are combining larger numbers of images.
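To illustrate why an SNR estimate that leaves out read noise and background can look too optimistic, here is a sketch of the widely used CCD equation, SNR = S / sqrt(S + n_pix * (sky + dark + read_noise^2)), with all quantities in electrons. Whether this is the exact form used by any particular software is an assumption, and every input value below is hypothetical.

```python
import math

def ccd_snr(star_e, n_pix, sky_e_per_pix, dark_e_per_pix, read_noise_e):
    """SNR from the standard CCD equation (all inputs in electrons)."""
    background_var = n_pix * (sky_e_per_pix + dark_e_per_pix + read_noise_e ** 2)
    return star_e / math.sqrt(star_e + background_var)

def photon_only_snr(star_e):
    """SNR if only the star's own Poisson noise is counted."""
    return math.sqrt(star_e)

# Hypothetical short exposure of a faint star in an 80-pixel aperture.
star_e = 2000.0
full = ccd_snr(star_e, n_pix=80, sky_e_per_pix=5.0,
               dark_e_per_pix=0.5, read_noise_e=15.0)
simple = photon_only_snr(star_e)

print(f"full CCD equation SNR = {full:.1f}")
print(f"photon-only SNR       = {simple:.1f}")
```

For short exposures the read-noise term dominates the noise budget in this example, so an estimate that ignores it overstates the SNR and understates the uncertainty.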

The CCD uncertainty equation results for stacked and binned data agree extremely well for the 5 image and 10 image data and reasonably well for the 15 image data. Also for all three bin sizes the CCD uncertainty equation seems to track the uncertainty of the mean derived from the standard deviations of the individual magnitudes, but at a lower value.

It is also noteworthy that the magnitudes derived from average of the binned image data and the stacked images agree very closely.

The one thing that really surprised me is that the 1/SNR uncertainty estimate for the 5 binned images agreed very closely with the uncertainty calculated from the magnitudes and was larger than the uncertainty estimate given by the CCD equation. This normally isn't the case.

I think the results tend to support my claim that there isn’t any statistically significant difference between the results one gets from stacking images and binning data from the individual images provided the images aren’t so noisy that the measurement software can't reliably determine the centroids. The trick is to understand if a lower uncertainty estimate really represents higher precision or whether it underestimates the uncertainty in the data. That is why I state you have to compare statistics from the same method of estimation. I don't think, for example, it makes sense in low SNR data to compare statistics derived from 1/SNR to those from the CCD uncertainty equation or to those derived directly from the standard deviation of magnitude measurements (which I think is what Tom Kraijzi means by the "actual" standard deviation).

I was told until a star is saturated it is better to stack images, so that you improve the SNR, but I'm also curious to see the advice of Arne.

I think it also depends on your target. If you observe an LPV stacking probably works, but in the case of time series with stacking you lose the cadence.