Beyond Public Pages: How to Use Crawlio to Scrape Member-Only Content
Web scraping has traditionally been about extracting content from publicly accessible websites — think product listings, blog posts, or news feeds. But what happens when the content you need lives behind a login? Dashboards, subscriptions, profile data, internal analytics — these are valuable sources of structured information, but until now, they’ve been difficult to access reliably without building your own headless login logic from scratch.
With Crawlio’s latest update, that changes.
🍪 Introducing Cookie Injection in Crawlio
Our `/scrape` endpoint and SDKs (Node.js and Python) now support a new option: `cookies`. This allows you to send authenticated session cookies along with your scraping request — meaning you can target pages that require login without needing to simulate login forms or manage sessions in your own code.
Here’s a quick look at how it works:
```javascript
const result = await client.scrape({
  url: 'https://example.com/account',
  cookies: [
    {
      name: 'session_id',
      value: 'abc123xyz',
      domain: 'example.com',
      path: '/',
      httpOnly: true,
      secure: true,
      sameSite: 'Lax'
    }
  ],
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})
```
Crawlio will inject this cookie into the virtual browser’s request context before loading the page, allowing it to behave just like an authenticated user.
🔐 When to Use It
This feature is ideal for scenarios like:
- Scraping user dashboards from internal platforms
- Extracting content from private CMS views or SaaS panels
- Archiving member-only blog posts or documentation
- Feeding authenticated content into AI summarizers or search pipelines
However, this also introduces new responsibility — which brings us to an important point.
🤝 Use It Ethically
Authenticated scraping should always respect the terms of service and the spirit of consent. Some quick guidelines:
- Only scrape authenticated content you have access to
- Don’t reuse other users’ session cookies or automate login to accounts you don’t own
- Be mindful of rate limits and bot detection — Crawlio includes built-in throttling, but respect goes beyond the tech
- Avoid scraping PII or sensitive data unless you are authorized to access it
We’ve built this feature to unlock legitimate use cases — like exporting your own data or building integrations with tools you already use — not to enable abuse.
🔧 How It Works Under the Hood
When you include the `cookies` array in a `/scrape` request, Crawlio:
- Spawns a headless browser container
- Injects the cookies using browser-level APIs (not just headers)
- Sets a custom `User-Agent` if provided
- Executes any defined workflow steps (like `scroll`, `click`, or `eval`)
- Captures the resulting HTML, Markdown, screenshots, and URLs
This makes the session indistinguishable from a real browser visit — ideal for sites that dynamically generate content after login.
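The injection step above can be sketched as a small normalization pass: before cookies are handed to a browser-level API, each entry gets the defaults a browser cookie store expects. This is an illustration of the idea, not Crawlio's actual internals — the helper name and defaulting behavior are assumptions, with field names mirroring the cookie format used throughout this post.

```javascript
// Illustrative sketch only (not Crawlio's real internals): normalize
// request cookies into the shape a browser-level cookie API expects,
// filling in browser-typical defaults for any omitted fields.
function toBrowserCookies(cookies, targetUrl) {
  const { hostname } = new URL(targetUrl);
  return cookies.map((c) => ({
    name: c.name,
    value: c.value,
    domain: c.domain ?? hostname, // default to the target host
    path: c.path ?? '/',          // default to the whole site
    httpOnly: c.httpOnly ?? false,
    secure: c.secure ?? false,
    sameSite: c.sameSite ?? 'Lax',
  }));
}

const injected = toBrowserCookies(
  [{ name: 'session_id', value: 'abc123xyz' }],
  'https://example.com/account'
);
// injected[0] now carries domain 'example.com' and path '/'
```

Because the cookies land in the browser's cookie store rather than a one-off header, they are sent on every subsequent request the page makes — including XHR and fetch calls after load.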
🧪 Example: Scraping a Subscription Dashboard
Let’s say you subscribe to a newsletter service and want to archive your own saved issues. Here’s how you might approach that:
```javascript
await client.scrape({
  url: 'https://newsletters.example.com/dashboard',
  cookies: [mySessionCookie],
  includeOnly: ['.newsletter-entry'],
  markdown: true,
  returnUrls: true
})
```
In this example, you extract only the newsletter entries, convert them to Markdown, and return all embedded URLs for future follow-up scrapes.
🔁 Combine with Workflows
You can also use this in combination with workflow actions like:
- `wait` (pause for content to load)
- `scroll` (trigger lazy-loaded sections)
- `screenshot` (visual archiving)
- `eval` (custom JavaScript logic)
This is especially useful for dashboards that load asynchronously or require user interaction to display content.
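To make the combination concrete, here is what a request pairing cookies with workflow actions might look like. The `workflow` schema below is an assumption for illustration — the action names come from the list above, but check the Crawlio docs for the real request shape.

```javascript
// Hypothetical request options combining cookies with workflow steps.
// The `workflow` array shape is assumed for illustration; only the
// action names (wait, scroll, screenshot) come from the post itself.
const options = {
  url: 'https://app.example.com/dashboard',
  cookies: [
    { name: 'session_id', value: 'abc123xyz', domain: 'app.example.com', path: '/' }
  ],
  workflow: [
    { type: 'wait', ms: 2000 },  // give post-login content time to load
    { type: 'scroll' },          // trigger lazy-loaded sections
    { type: 'screenshot' },      // keep a visual archive of the result
  ],
  markdown: true,
};
```

Ordering matters here: waiting before scrolling avoids capturing a half-rendered dashboard.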
🧁 Cookie Format Recap
Make sure your cookie object matches this structure:
```json
{
  "name": "session_id",
  "value": "abc123xyz",
  "path": "/",
  "domain": "example.com",
  "httpOnly": true,
  "secure": true,
  "sameSite": "Lax"
}
```
Be sure the `domain` and `path` match the target URL exactly — otherwise the browser will silently ignore the cookie.
🚨 Pro Tips
- If your cookie isn't working, double-check that:
  - It's not expired
  - The `domain`/`path` matches exactly
  - The `SameSite` policy isn't blocking it
- Use browser dev tools (Application > Cookies tab) to extract valid session cookies
- Set a realistic `User-Agent` to mimic browser traffic
- Use workflow `wait` steps to account for post-login content loading
🧭 What’s Next?
We’re continuing to expand Crawlio’s support for advanced browsing scenarios — including upcoming features like:
- Full browser login flows
- Cookie persistence between scrapes
- Session-aware crawling across multiple pages
If you’ve got ideas or use cases that would benefit from these, we’d love to hear from you.
✅ Conclusion
Authenticated scraping used to mean browser automation, headless logins, and brittle selectors. With Crawlio’s new cookie support, it’s now a clean, declarative one-liner. Whether you're archiving your own dashboards or building integrations with third-party tools, this feature makes it easier — and safer — than ever.
As always, use it responsibly. And if you build something cool with it, let us know. We love seeing what the community is up to.