Calculate Sample Sizes for Web Scrapers


Overview

Sampling from websites and APIs can be tricky. Servers restrict the load they accept (e.g., through retrieval limits), and snowball designs grow quickly (e.g., starting from a seed of 100 users, sampling 100 of their peers, and obtaining all of those peers' consumption patterns for 50 weeks). Besides the minimum sample size needed to satisfy statistical power requirements, an important consideration is therefore the technically feasible sample size: the sample size that can actually be obtained from a website or API given resource constraints.

Formula

$N = \frac{req \times S}{r \times freq}$

where

- $N$ = sample size (i.e., number of instances of an entity to extract data from),
- $req$ = retrieval limit (maximum number of requests per time unit allowed for each scraper or authenticated API user),
- $S$ = number of scrapers used (e.g., computers with separate IP addresses, or authenticated users of an API),
- $r$ = number of URL calls to make to obtain data for each instance, and
- $freq$ = the desired sampling frequency for each entity per time unit.
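As a minimal sketch, the formula can be wrapped in a small Python function. The function name `feasible_sample_size` and its parameter names are illustrative and not part of any library. Rounding down is an assumption made here, on the grounds that a fraction of an instance cannot be sampled.

```python
import math


def feasible_sample_size(req, scrapers, calls_per_instance, freq):
    """Return N = (req * S) / (r * freq), rounded down.

    req                -- retrieval limit per scraper, per time unit
    scrapers           -- number of scrapers (S)
    calls_per_instance -- URL calls needed per instance (r)
    freq               -- desired visits per instance, per time unit

    req and freq must be expressed in the same time unit.
    """
    if min(req, scrapers, calls_per_instance, freq) <= 0:
        raise ValueError("all inputs must be positive")
    # Total request budget divided by the requests one instance consumes.
    return math.floor((req * scrapers) / (calls_per_instance * freq))


# 18,000 requests/hour, 1 scraper, 2 calls per user, 4 visits/hour
print(feasible_sample_size(18_000, 1, 2, 4))  # 2250
```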

Tip

Convert all input parameters to the same time unit (e.g., the retrieval limit may initially be expressed in fifteen-minute intervals but needs to match the desired sampling frequency, which may be expressed in hours)!
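One way to keep units consistent is to normalize every rate to a single time unit before applying the formula. The sketch below assumes hours as the common unit. The helper name `to_per_hour` is hypothetical, and the 900-requests-per-fifteen-minutes limit is an invented illustration rather than the limit of any real service.

```python
SECONDS_PER_HOUR = 60 * 60


def to_per_hour(count, period_seconds):
    """Convert 'count events per period_seconds' into events per hour."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return count * SECONDS_PER_HOUR / period_seconds


# Hypothetical retrieval limit: 900 requests per fifteen-minute window
print(to_per_hour(900, 15 * 60))  # 3600.0

# Desired sampling frequency: one visit every fifteen minutes
print(to_per_hour(1, 15 * 60))    # 4.0
```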

Example

Suppose you wish to know the technically feasible sample size for collecting data from an online social network. In other words, you want to solve for $N$.

The input parameters are:

- $req$ = 5 requests per second = $5 \times 60 \times 60$ = 18,000 requests per hour
- $S$ = 1 scraper, authenticated via the service's API
- $r$ = 2 (the scraper needs to visit two URLs per user: one to obtain the user's metadata, and one to obtain the user's usage history)
- $freq$ = 4 visits per hour (each user should be visited at least once every fifteen minutes)

$N = \frac{req \times S}{r \times freq} = \frac{18{,}000 \times 1}{2 \times 4} = 2{,}250$

With these settings, a single scraper can follow at most 2,250 users.
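The same formula can be rearranged to ask how many scrapers a given sample size would require: $S = N \times r \times freq / req$, rounded up. The sketch below is one possible way to compute this. The function name `scrapers_needed` is illustrative, and the target of 10,000 users is a hypothetical figure that does not come from the example above.

```python
import math


def scrapers_needed(target_n, req, calls_per_instance, freq):
    """Return the smallest S such that (req * S) / (r * freq) >= target_n.

    req and freq must be expressed in the same time unit.
    """
    if min(target_n, req, calls_per_instance, freq) <= 0:
        raise ValueError("all inputs must be positive")
    # Requests needed per time unit, divided by one scraper's budget.
    return math.ceil(target_n * calls_per_instance * freq / req)


# Hypothetical target: 10,000 users, other parameters as in the example
print(scrapers_needed(10_000, 18_000, 2, 4))  # 5

# The example's own sample size fits exactly on one scraper
print(scrapers_needed(2_250, 18_000, 2, 4))   # 1
```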
