How to Handle Pagination in AI Scraping

Updated May 2026

Sites that display large datasets rarely serve them in a single response. They split the data across numbered pages, infinite scroll, load-more buttons, or cursor-based navigation, and an AI scraping tool has to navigate whichever pattern is in use to capture the complete dataset. This guide covers how to identify each pagination type and implement reliable navigation logic that captures all available data without duplicates or infinite loops.

Missing pagination handling is one of the most common causes of incomplete data in scraping pipelines. A scraper that processes only the first page of search results, or only the initially visible items in an infinite scroll feed, captures a fraction of the available data.

Identify the Pagination Pattern

Before implementing pagination handling, determine which pattern your target site uses. Open the page in a browser, scroll down, and observe how new content loads. Check the URL bar for page number parameters. Open browser developer tools and watch network requests as you navigate between pages.

Numbered pagination shows page numbers or next/previous buttons at the bottom of the results. The URL typically includes a page parameter like ?page=2 or ?offset=20. Each page loads as a full page navigation with a new URL.

Infinite scroll loads new content automatically as you scroll toward the bottom of the page. There are no page numbers visible. Network requests fire as you scroll, fetching new batches of items from an API endpoint. Social media feeds and modern product catalogs commonly use this pattern.

Load-more buttons work like infinite scroll but require a click to trigger the next batch. The pattern is a hybrid of numbered pagination and infinite scroll: there is a visible control element, but clicking it appends items to the current page instead of navigating to a new URL.

Cursor-based pagination uses opaque tokens rather than page numbers. Each response includes a cursor value that must be passed as a parameter to fetch the next batch. This pattern is common in API-level pagination and some modern web applications.

Implement Page Navigation

For numbered pagination, the simplest approach is URL construction. Identify the pagination parameter in the URL (page, p, offset, start) and generate URLs for each page by incrementing the parameter. Fetch each URL independently, extract data, and move to the next page. This is the most straightforward pattern because each page is a standalone URL that can be fetched without browser interaction.
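
As a minimal sketch of the URL construction described above, the helper below rewrites one query parameter and leaves the rest of the URL alone. The function name page_urls, its arguments, and the example.com URLs are illustrative assumptions, not part of any particular scraping tool.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(base_url, param="page", first=1, last=5, step=1):
    """Yield one URL per page by setting `param` to first, first+step, ..., last."""
    parts = urlsplit(base_url)
    # Keep every existing query parameter except the pagination one.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    for value in range(first, last + 1, step):
        new_query = urlencode(query + [(param, str(value))])
        yield urlunsplit(parts._replace(query=new_query))


if __name__ == "__main__":
    # Page-number style: ?page=1, ?page=2, ?page=3
    for url in page_urls("https://example.com/search?q=shoes&page=1", last=3):
        print(url)
    # Offset style: ?offset=0, ?offset=20, ?offset=40
    for url in page_urls("https://example.com/items", param="offset",
                         first=0, last=40, step=20):
        print(url)
```

The same helper covers offset-style parameters by setting step to the page size. The value of last is a guess until the real end of results is known, so treat it as an upper bound.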

For infinite scroll, the headless browser must scroll programmatically. The typical implementation scrolls to the bottom of the page, waits for new content to appear (detected by DOM element count changes or network activity), and repeats. The wait time between scrolls should be generous enough for content to load, typically one to three seconds, with adaptive timing that increases if content takes longer than expected.
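
The adaptive wait can be sketched independently of any browser library. In the example below, count_items is a hypothetical callable that returns the number of item elements currently in the DOM (in practice it would wrap a call into your headless browser), and the 1.0 to 3.0 second range and 1.5 backoff factor are assumed defaults taken from the range mentioned above.

```python
import time


def wait_for_new_items(count_items, previous_count, initial_wait=1.0,
                       max_wait=3.0, backoff=1.5, sleep=time.sleep):
    """Poll until the item count grows past previous_count.

    Each unsuccessful poll waits longer than the last, up to max_wait.
    Returns the new item count, or previous_count if nothing new appeared.
    """
    wait = initial_wait
    while True:
        sleep(wait)
        current = count_items()
        if current > previous_count:
            return current
        if wait >= max_wait:
            # The longest wait has been tried; report that nothing loaded.
            return previous_count
        wait = min(wait * backoff, max_wait)
```

Passing sleep as a parameter is a design choice for testability: a test can substitute a recorder for time.sleep and run instantly.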

For load-more buttons, the browser identifies the button element (by text content, class, or position) and clicks it. After clicking, the implementation waits for new content to render before extracting. The button may change its text (from "Load More" to "Loading...") or disappear entirely when all content has been loaded.

For cursor-based pagination, the scraper must extract the cursor value from each response and include it in the request for the next batch. If the pagination is at the API level (discovered through network inspection), you may be able to call the API directly without browser rendering, which is faster and cheaper.

Detect End of Results

Every pagination implementation needs a reliable stopping condition to prevent infinite loops. Without proper end detection, a scraper can cycle endlessly through empty pages or scroll indefinitely on a page that has loaded all its content.

For numbered pagination, check whether the current page returned any results. An empty results page indicates you have passed the last page. Additionally, some sites include total page counts or result counts that you can use to calculate the expected number of pages upfront.
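
Both stopping conditions for numbered pagination can be combined in one loop. This is a sketch under assumptions: fetch_page is a hypothetical callable that returns the list of items extracted from a given page number, and per_page and total_results are only supplied when the site actually reports them.

```python
import math


def scrape_numbered(fetch_page, per_page=None, total_results=None, max_pages=1000):
    """Collect items until an empty page, the known last page, or max_pages."""
    last_page = max_pages
    if per_page and total_results is not None:
        # Expected page count when the site reports a total result count.
        last_page = min(max_pages, math.ceil(total_results / per_page))
    items = []
    for page in range(1, last_page + 1):
        batch = fetch_page(page)
        if not batch:
            # An empty page means we have passed the last page.
            break
        items.extend(batch)
    return items
```

When the total is known, the loop avoids the extra request for the first empty page. When it is not, max_pages keeps the loop bounded.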

For infinite scroll, compare the item count before and after scrolling. If scrolling produces no new items after two or three attempts with appropriate wait times, the feed has been fully loaded. A maximum scroll count provides a safety limit against feeds that never truly end (like algorithmic social media feeds that generate content indefinitely).
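
A sketch of that stopping logic follows. scroll_and_count is a hypothetical callable that scrolls once, waits for content, and returns the current item count. The limits of three stalled attempts and 200 scrolls are assumed defaults, not recommendations for any specific site.

```python
def scroll_until_done(scroll_and_count, initial_count=0, max_stalls=3, max_scrolls=200):
    """Scroll until max_stalls consecutive scrolls add no items.

    Returns (final_item_count, scrolls_performed).
    """
    count = initial_count
    stalls = 0
    scrolls = 0
    while scrolls < max_scrolls and stalls < max_stalls:
        new_count = scroll_and_count()
        scrolls += 1
        if new_count > count:
            count = new_count
            stalls = 0  # progress resets the stall counter
        else:
            stalls += 1
    return count, scrolls
```

Resetting the stall counter on any progress means a single slow load does not end the run early, while max_scrolls bounds feeds that never stop producing items.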

For load-more buttons, check whether the button is still present and clickable after loading new content. When the button disappears or becomes disabled, all content has been loaded. Some implementations replace the button with a "No more results" message.
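
The click loop can be sketched against a small assumed interface and no specific browser library. Here find_button is a hypothetical callable that returns a button-like object exposing is_enabled() and click(), or None once the button is gone, and wait_for_content is a hypothetical callable that blocks until the new batch has rendered.

```python
def click_until_exhausted(find_button, wait_for_content, max_clicks=500):
    """Click a load-more button until it disappears, is disabled, or max_clicks is hit.

    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        # Look the button up again each time: the page may re-render it.
        button = find_button()
        if button is None or not button.is_enabled():
            break
        button.click()
        clicks += 1
        wait_for_content()
    return clicks
```

Sites that replace the button with a "No more results" message are covered by the None case, provided find_button matches only the real button.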

For cursor-based pagination, a null or empty cursor in the response typically indicates the last page. Some APIs include a "has_next" boolean field. Always implement a maximum page count as a fallback to prevent runaway pagination.
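
A cursor loop with these stopping conditions might look like the sketch below. fetch_batch is a hypothetical callable that takes a cursor (None for the first request) and returns a parsed response. The field names "items" and "next_cursor" are assumptions that vary by API. The repeated-cursor check is an extra safeguard added here, not something every API requires.

```python
def fetch_all_by_cursor(fetch_batch, max_pages=1000):
    """Follow cursors until one is missing or empty, has_next is false,
    a cursor repeats, or max_pages is reached."""
    items = []
    cursor = None  # None requests the first batch
    seen_cursors = set()
    for _ in range(max_pages):
        response = fetch_batch(cursor)
        items.extend(response.get("items", []))
        cursor = response.get("next_cursor")
        if not cursor or response.get("has_next") is False:
            break
        if cursor in seen_cursors:
            # A repeated cursor would make the loop fetch the same data forever.
            break
        seen_cursors.add(cursor)
    return items
```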

Handle Deduplication

Pagination can produce duplicate items, especially with infinite scroll where content shifts between loads. If a new item is added to the feed between scroll events, all subsequent items shift by one position, potentially causing an item to appear in both the current and previous batch. Similar duplication can occur with numbered pagination on sites with rapidly changing content.

Track seen items using a unique identifier for each item (product ID, URL, or content hash). Before adding an extracted item to the results, check whether its identifier has already been recorded. This deduplication step is inexpensive and prevents downstream systems from processing duplicate records.
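
As a sketch of this check, the code below derives a key from the first identifier available and filters repeated items across batches. It assumes items are dictionaries and that the keys "id" and "url" hold the product ID and URL. Both names are illustrative.

```python
import hashlib
import json


def item_key(item):
    """Prefer a product ID, then a URL, then a hash of the item's content."""
    if item.get("id") is not None:
        return f"id:{item['id']}"
    if item.get("url"):
        return f"url:{item['url']}"
    payload = json.dumps(item, sort_keys=True, default=str).encode("utf-8")
    return "hash:" + hashlib.sha256(payload).hexdigest()


def dedupe(batches):
    """Yield each item once, in first-seen order, across all batches."""
    seen = set()
    for batch in batches:
        for item in batch:
            key = item_key(item)
            if key in seen:
                continue
            seen.add(key)
            yield item
```

Sorting keys before hashing keeps the content hash stable when the same fields arrive in a different order.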

For very large datasets spanning thousands of pages, use a space-efficient data structure like a Bloom filter for deduplication tracking rather than storing every seen identifier in memory. A Bloom filter never misses an item it has already recorded, but it occasionally reports a new item as seen, so a small fraction of genuinely new items is dropped. For large-scale pagination, that loss is usually an acceptable price for the memory savings.
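
To show the trade-off concretely, here is a minimal Bloom filter sketch built on the standard library. The class, its parameter names, and the 0.1% default false positive rate are assumptions for illustration. A production pipeline would more likely use an existing, tested library.

```python
import hashlib
import math


class BloomFilter:
    """Set-like membership test with false positives but no false negatives."""

    def __init__(self, expected_items, false_positive_rate=0.001):
        n = max(1, expected_items)
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.size = max(8, math.ceil(-n * math.log(false_positive_rate) / math.log(2) ** 2))
        self.hash_count = max(1, round(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two 64-bit hash values.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a nonzero step
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```

Usage mirrors a set: check `key in bloom` before keeping an item, then call `bloom.add(key)`. With these sizing formulas the filter needs roughly 14 to 15 bits per expected item at a 0.1% false positive rate, regardless of how long the identifiers are.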

Key Takeaway

Identify which pagination pattern your target uses before implementing navigation logic. Every pagination handler needs a reliable stopping condition and deduplication tracking. Test your pagination handling on the full dataset to verify that all items are captured without duplicates.