# Statistics for Business Intelligence – Sampling

Before gathering data for analysis, it is necessary to understand sampling techniques. Some important terms are:
Population – This is the complete set under consideration. For example, a survey of food choices in a country might consider all citizens of that country.
Frame – The frame is the part of the population that the survey targets, usually expressed as a list of units. For example, a survey of sports interests among school children considers all schools, so a list of schools can serve as the frame.

Random Sampling or probabilistic sampling – In this kind of sampling every unit of the population has a known, nonzero chance of being selected in the sample; in the simplest case every unit has an equal chance.
Non Random Sampling or non probabilistic sampling – In this sampling method the chance of a unit being selected is not governed by a random mechanism and is generally unknown, i.e. the selection can be biased towards some units.

Types of Random Sampling
Simple Random Sampling – Each unit is assigned a number, and a table of random numbers (or a random number generator) is used to select units. For example, if the population has 30 members, each member is assigned a number from 1 to 30. A random number between 1 and 30 is then drawn, and the member with that number is selected. This is repeated until the sample reaches the required size.
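The 30-member example can be sketched in Python, with a pseudo-random generator standing in for the table of random numbers. This is a minimal illustration, not a prescribed method; the function name `simple_random_sample` and the seed value are assumptions made for the example.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct units, each equally likely to be chosen."""
    rng = random.Random(seed)  # seeded generator so the draw can be repeated
    return rng.sample(population, sample_size)  # sampling without replacement

# Members are numbered 1 to 30, as in the example in the text.
members = list(range(1, 31))
chosen = simple_random_sample(members, 5, seed=42)
print(chosen)
```

Sampling without replacement is used here so that no member can be selected twice.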

Stratified Random Sampling – The population is divided into different strata. These strata are non-overlapping. Random methods are used to select members from each stratum. This method helps in selecting a sample that is representative of the population and prevents the researcher from collecting units from only one subgroup of the population. The strata can be formed logically; for example, in the survey of sports choices, the population can be divided into girls and boys.
In proportionate stratified random sampling the number of units selected from each stratum is proportional to the number of members in that stratum. So if a school has 70 boys and 30 girls, and a sample of 10 is required, then 7 boys and 3 girls would be selected.
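The 70 boys / 30 girls allocation can be written out as a short Python sketch. The function name `proportionate_stratified_sample` and the student labels are hypothetical. Simple rounding is assumed for the allocation, so for awkward proportions the stratum sizes may not add up exactly to the requested sample size.

```python
import random

def proportionate_stratified_sample(strata, sample_size, seed=None):
    """Sample each stratum at random, in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        # Allocation proportional to stratum size (simple rounding, an assumption).
        k = round(sample_size * len(units) / total)
        sample[name] = rng.sample(units, k)
    return sample

# Hypothetical school with 70 boys and 30 girls, as in the text.
school = {
    "boys": [f"boy_{i}" for i in range(70)],
    "girls": [f"girl_{i}" for i in range(30)],
}
picked = proportionate_stratified_sample(school, 10, seed=1)
print({name: len(units) for name, units in picked.items()})
# {'boys': 7, 'girls': 3}
```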

Systematic Sampling – In this method every kth element of the population is selected. This method is easy to implement, but it gives a biased sample if the population has a periodic pattern whose period coincides with k.
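A rough Python sketch of systematic sampling follows. The random starting position is an assumption (the text only says every kth element is taken), and the function name and the weekly sales figures are made up for illustration. The second half shows the periodicity problem: when k equals the period of the data, every selected element falls on the same point of the cycle.

```python
import random

def systematic_sample(population, k, seed=None):
    """Take every kth element, starting from a random offset in 0..k-1."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random start is an assumption of this sketch
    return population[start::k]

units = list(range(1, 101))
sample = systematic_sample(units, 10, seed=7)
print(sample)

# Periodicity problem: hypothetical daily sales with a peak every 7th day.
daily_sales = [100 if day % 7 == 0 else 10 for day in range(70)]
biased = daily_sales[0::7]  # k equals the period and the start is a peak day
print(sum(biased) / len(biased), sum(daily_sales) / len(daily_sales))
```

In the second case the sample contains only peak days, so its mean greatly overstates the true average.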

Cluster or Area sampling – In this method the population is broken down into logical clusters or areas. For example, a state population can be broken into cities for the purpose of sampling. This technique is mostly used for its convenience. In contrast to stratified sampling, the population inside each cluster is heterogeneous, whereas in stratified sampling each stratum is homogeneous in terms of the property influencing the survey.
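One possible way to sketch this in Python is one-stage cluster sampling: whole clusters are chosen at random and every unit inside a chosen cluster is surveyed. The text does not say how units are picked within a cluster, so that part is an assumption, and the city names and resident IDs are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """One-stage cluster sampling: pick whole clusters at random, keep all their units."""
    rng = random.Random(seed)
    chosen_names = rng.sample(sorted(clusters), n_clusters)
    units = [unit for name in chosen_names for unit in clusters[name]]
    return chosen_names, units

# Hypothetical state broken into cities; resident IDs are made up for illustration.
state = {
    "city_a": ["a1", "a2", "a3"],
    "city_b": ["b1", "b2"],
    "city_c": ["c1", "c2", "c3", "c4"],
    "city_d": ["d1", "d2"],
}
chosen_cities, surveyed = cluster_sample(state, 2, seed=3)
print(chosen_cities, len(surveyed))
```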

Nonrandom sampling – Nonrandom sampling is generally not advised when inferential techniques need to be applied. Also, the sampling error calculated for a nonrandom sample may be incorrect.

Convenience Sampling – Elements for the sample are selected as per the convenience of the surveyor. The sample may contain less variation than the population. However, the cost of sampling may be reduced since the elements are taken from a convenient location.

Judgement sampling – The elements for sampling are chosen by the judgement of the researcher. Studies show that random sampling gives a better estimate of the population mean than judgement sampling. This kind of sampling also introduces the biases of the researcher.

Quota sampling – Quota sampling divides the population into subgroups or strata as in stratified sampling; however, members are selected from the strata using non-random techniques. The number of members to be selected from each stratum is proportional to the population of that subgroup and is called a quota.
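As a rough illustration, quota sampling can be sketched in Python by filling each quota with the first matching people encountered, which is one of many possible non-random selection rules. The function name `quota_sample`, the arrival order, and the quota numbers are all hypothetical.

```python
def quota_sample(arrivals, quotas):
    """Fill each subgroup's quota with the first matching people encountered."""
    filled = {group: [] for group in quotas}
    for person, group in arrivals:
        # Non-random rule (an assumption): accept whoever turns up first.
        if group in filled and len(filled[group]) < quotas[group]:
            filled[group].append(person)
        if all(len(filled[g]) == quotas[g] for g in quotas):
            break  # every quota is full
    return filled

# Hypothetical stream of people in the order the surveyor meets them.
arrivals = [(f"p{i}", "boy" if i % 2 == 0 else "girl") for i in range(40)]
quotas = {"boy": 7, "girl": 3}  # proportional to a 70/30 population split
result = quota_sample(arrivals, quotas)
print({group: len(people) for group, people in result.items()})
# {'boy': 7, 'girl': 3}
```

The subgroup sizes match those of a proportionate stratified sample, but who ends up in each subgroup depends on the arrival order rather than on chance.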

Snowball sampling – In this kind of sampling members are selected based on referrals from other members. The advantage is that members for the survey can be identified easily. However, the technique is non-random.

Sample mean distribution – A population with a known distribution is chosen. Samples are taken from this population and the mean is calculated for each sample. The probability distribution of these sample means is governed by the Central Limit Theorem. It is a powerful theorem and states that if samples of size n are taken randomly from a population having a mean of mu and a standard deviation of sigma, then the sample means are approximately normally distributed for large sample sizes (typically n >= 30), regardless of the shape of the distribution of the population. However, if the population is normally distributed, the sample means are also normally distributed for all values of n. Mathematically, the mean of the sample means is equal to the population mean mu, and the standard deviation of the sample means is the standard deviation of the population divided by the square root of the sample size, i.e. sigma / sqrt(n).

The power of the theorem lies in the fact that even if the population is not normally distributed, the probability of observing a given range of sample means can be calculated for large samples, since the distribution of the sample mean is then approximately normal.
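The behaviour of the sample mean can be checked with a small simulation. The sketch below assumes a skewed (exponential) population with mean 1 and standard deviation 1; the sample size of 36, the number of repetitions, the seed, and the threshold 1.25 are arbitrary choices for illustration. It compares the simulated spread of the sample means with sigma / sqrt(n), then uses the normal approximation to estimate the probability that a sample mean exceeds the threshold.

```python
import math
import random
import statistics

rng = random.Random(0)
n = 36                 # size of each sample
num_samples = 5000     # number of repeated samples
mu, sigma = 1.0, 1.0   # mean and SD of an exponential population with rate 1

# Draw many samples from the skewed population and record each sample mean.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_se = sigma / math.sqrt(n)  # SD of sample means predicted by the theorem

print(mean_of_means, mu)
print(sd_of_means, expected_se)

# Normal approximation: probability that a sample mean exceeds 1.25.
approx = statistics.NormalDist(mu, expected_se)
p_normal = 1 - approx.cdf(1.25)
p_simulated = sum(m > 1.25 for m in sample_means) / num_samples
print(p_normal, p_simulated)
```

The two probabilities should be close but not identical, because with n = 36 the distribution of the sample mean is only approximately normal.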