Systematic Sampling in R

Sampling is a method used in research to gather information about a population by selecting a subset, or sample, of individuals or items from that population. Instead of studying every single member of the population, researchers collect data from a smaller group that represents the whole. Sampling is used across many fields to understand and analyze populations that are too large to study in full.

What is Systematic Sampling?

Systematic sampling is a statistical sampling method in which elements of a larger population are selected at a fixed interval. After a random start, every kth element of a list is selected, where k is the sampling interval.

For example, if a teacher wanted to sample 100 students from a school with 1000 students, the sampling interval would be 1000/100 = 10. The teacher would pick a random starting point among the first 10 students on a list sorted by, say, student ID number, and then select every 10th student from there.

Step 1: Determine the Population Size (N)

Identify the total number of elements in the population that we want to sample from.

Step 2: Calculate the Sampling Interval (k)

Decide on the sampling interval, which is the gap between selected elements. The sampling interval (k) is calculated as N divided by the sample size, where the sample size is the number of elements to be sampled.

k = N / sample size

Step 3: Random Start

Choose a random starting point between 1 and k. This starting point determines which element will be the first in the sample.

Step 4: Select Systematic Sample

Start from the randomly chosen point and select every kth element until the end of the population is reached. Systematic sampling is easy to implement, and in some situations it is more efficient than simple random sampling.
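The four steps above can be sketched in R as follows. The function name systematic_sample_idx and its arguments are illustrative and do not come from any package, and the sketch assumes that N is a multiple of the sample size so that k is a whole number.

```r
# Hypothetical helper: returns the positions selected by systematic sampling.
# Assumes N is divisible by n, so the interval k is an integer.
systematic_sample_idx <- function(N, n) {
  k <- N / n                          # Step 2: sampling interval
  start <- sample(seq_len(k), 1)      # Step 3: random start between 1 and k
  seq(from = start, to = N, by = k)   # Step 4: every kth position
}

set.seed(42)  # only to make the example repeatable
ids <- systematic_sample_idx(N = 1000, n = 100)
length(ids)        # number of selected positions
unique(diff(ids))  # gap between consecutive positions
```

Applied to the school example, this selects 100 student positions spaced 10 apart, starting somewhere in the first 10.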

Systematic Sampling in R

In the R programming language, systematic sampling can be implemented with the ‘seq()’ function, which generates the regularly spaced positions of the elements to select.
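A minimal sketch of this approach, assuming a population of the numbers 1 to 100, a sampling interval of 10, and a start fixed at 1 (in practice the start would be drawn at random, for example with sample(1:k, 1)). The variable names are illustrative.

```r
# Population: the integers 1 to 100
population <- 1:100

k <- 10      # sampling interval (assumed for this sketch)
start <- 1   # starting point, fixed here; normally sample(1:k, 1)

# Positions start, start + k, start + 2k, ... up to the population size
idx <- seq(from = start, to = length(population), by = k)
systematic_sample <- population[idx]
print(systematic_sample)
```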

Output:

 [1] 1 11 21 31 41 51 61 71 81 91

Here the population consists of the numbers from 1 to 100.

The output [1] 1 11 21 31 41 51 61 71 81 91 is the systematic sample obtained from this population with a starting point of 1 and a sampling interval of 10.

Systematic Sampling on mtcars dataset

Here we use the built-in ‘mtcars’ dataset in R, which contains information about 32 car models.
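A sketch of how such a sample could be drawn from mtcars. The interval k = 7 and the starting row 6 are assumptions chosen for illustration; with a random start the selected rows would differ.

```r
# Load the built-in dataset and inspect the first rows
data(mtcars)
head(mtcars)

k <- 7       # sampling interval (assumed)
start <- 6   # starting row (assumed; normally sample(1:k, 1))

# Take every kth row beginning at `start`
idx <- seq(from = start, to = nrow(mtcars), by = k)
systematic_sample <- mtcars[idx, ]
systematic_sample
```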

Output:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

systematic_sample

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Merc 450SL     17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2

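In this example, the mtcars dataset is loaded first and its first six rows are displayed. The systematic sample then contains rows 6, 13, 20, and 27 of the 32 rows, that is, every 7th row starting from row 6.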

Output:

# View first six rows of data frame
  first_name exam_score
1      onxel       82.9
2      snmqc       74.7
3      nfwxh       78.1
4      crlpw       67.3
5      jbhim       74.4
6      rxfkp       74.9

# View first six rows of systematic sample
    first_name exam_score
40       jgozy       71.0
90       pygyq       74.2
140      unoio       71.0
190      mzgwl       70.1
240      iucgb       76.0
290      mvngl       74.8

# View dimensions of systematic sample
[1] 10  2

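In this example, a seed is set for reproducibility before the data are generated. The rows shown for the systematic sample are 50 apart, starting from row 40, and the sample has 10 rows and 2 columns.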
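An example of this kind could be sketched as follows. The data are simulated, so the names and scores will not match the output above; the population size of 500, the interval of 50, and the start at row 40 are assumptions for illustration, and random_name is a hypothetical helper.

```r
set.seed(1)  # make the simulated data repeatable

# Simulated population: 500 students with random 5-letter names and scores
N <- 500
random_name <- function() paste(sample(letters, 5, replace = TRUE), collapse = "")
df <- data.frame(
  first_name = replicate(N, random_name()),
  exam_score = round(rnorm(N, mean = 75, sd = 5), 1)
)
head(df)

k <- 50      # sampling interval (assumed)
start <- 40  # starting row (assumed; normally sample(1:k, 1))

# Every kth row of the data frame, beginning at `start`
sys_sample <- df[seq(from = start, to = N, by = k), ]
head(sys_sample)
dim(sys_sample)
```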

Uses of Systematic Sampling

  1. Large Populations: When dealing with a large population, it can be challenging and expensive to conduct a simple random sample. Systematic sampling provides a more practical and efficient way to obtain a representative sample by selecting every kth element.
  2. Efficiency: Systematic sampling is often quicker to carry out than simple random sampling because only the starting point is chosen at random; every other element follows from the interval. This makes it a suitable choice when time and budget constraints are significant considerations.
  3. Homogeneous Population: If the population is relatively homogeneous and there is no significant order or pattern in the data, systematic sampling can give representative results.
  4. Regular Data Collection: In situations where data is collected at regular intervals, systematic sampling can align with the natural order of the data collection process. This can simplify the sampling procedure and make it more practical.

Limitations of Systematic Sampling

  1. Bias Risk: Systematic sampling may introduce bias if there’s a hidden pattern or periodicity in the population aligned with the sampling interval.
  2. Skewed Representation: If the sampling interval coincides with a recurring characteristic of the list, some groups can be under- or overrepresented in the sample.
  3. Dependency on Ordering: The effectiveness relies on the order of elements; specific arrangements may affect representativeness.
  4. Sensitivity to Outliers: Outliers can have a significant impact, especially if they are consistently spaced based on the sampling interval.
  5. Inapplicability for Unordered Populations: The method needs a complete, ordered list of the population, so it is not suitable when no such listing exists.
  6. Complexity in Unequal Probability: Adjusting for unequal probabilities can add complexity, potentially negating the simplicity of systematic sampling.
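
The periodicity risk in the first item can be illustrated with a small simulation. The data are made up: 52 weeks of daily values with a weekly cycle in which two days out of every seven are much higher than the rest.

```r
# Made-up population: 52 weeks of daily values; days 6 and 7 of each week are high
population <- rep(c(10, 10, 10, 10, 10, 50, 50), times = 52)
pop_mean <- mean(population)

# Interval equal to the period: every sampled day is day 6 of its week
aligned <- population[seq(from = 6, to = length(population), by = 7)]

# Interval that does not match the period: the sample cycles through all weekdays
unaligned <- population[seq(from = 1, to = length(population), by = 5)]

c(population = pop_mean, aligned = mean(aligned), unaligned = mean(unaligned))
```

With the interval aligned to the weekly cycle, every sampled value is 50, far above the population mean of about 21.4, while the interval of 5 gives a sample mean close to the population value.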

Conclusion

In summary, systematic sampling in R is a straightforward and efficient method suitable for ordered populations. It’s easy to implement and resource-efficient for large datasets. However, caution is needed to avoid biases caused by hidden patterns aligned with the sampling interval. While offering simplicity and practicality, systematic sampling may not be ideal for all scenarios, and researchers should be mindful of its limitations.