AWS Lambda for web scraping - The good, bad & ugly

When building Hyikko's flight search engine, I needed a scalable solution for web scraping. AWS Lambda seemed like the perfect fit - serverless, scalable, and cost-effective. Here's my honest review after months of production use.

First, why serverless in the first place?

If you have stateless logic that you need to run concurrently (in my case, flight search logic), you're faced with two main options:

1. Use a pool of workers (a K8s cluster or a bunch of VMs) and a platform to manage executions between them (e.g. Spark, Celery).

2. Use a serverless solution like AWS Lambda, don't worry about a pool of servers, and just focus on your logic.

Each has its pros and cons, but for many early-stage projects, option #1 is not even an option if you need hundreds of concurrent executions, because holding that many CPUs would cost you a fortune, not to mention the effort of managing executions.

This is exactly where serverless comes in.
This blog post gives an overview of the trade-offs in place when choosing a serverless solution (specifically AWS Lambda).

The Good

1. Infinite Scalability

Lambda automatically scales from 0 to hundreds of concurrent executions. For you, it means your logic can run in parallel without you having to worry about it.

2. Cost Efficiency

You only pay for what you use. During development and low-traffic periods, costs are minimal. Even with thousands of searches per day, Lambda costs are a fraction of what dedicated servers would cost.

What's more, AWS Lambda has a free tier of 1,000,000 invocations per month.

This basically kept my development and early-stage costs at 0.
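
To make the pricing argument concrete, here's a rough back-of-the-envelope estimator. The per-request price, the per-GB-second price, and the 400,000 GB-second compute allowance are assumptions taken from the public pricing page (x86, us-east-1), so check current pricing for your region. The function name and the example numbers are made up for illustration.

```python
# Rough Lambda cost sketch. All prices below are ASSUMPTIONS, not quotes.
REQUEST_PRICE_PER_MILLION = 0.20   # USD per 1M requests (assumed)
GB_SECOND_PRICE = 0.0000166667     # USD per GB-second, x86 (assumed)
FREE_REQUESTS = 1_000_000          # free invocations per month
FREE_GB_SECONDS = 400_000          # free compute per month (assumed)


def monthly_lambda_cost(invocations, avg_seconds, memory_gb):
    """Estimate a monthly bill in USD after subtracting the free tier."""
    gb_seconds = invocations * avg_seconds * memory_gb
    billable_requests = max(0, invocations - FREE_REQUESTS)
    billable_gb_seconds = max(0.0, gb_seconds - FREE_GB_SECONDS)
    return (billable_requests / 1_000_000 * REQUEST_PRICE_PER_MILLION
            + billable_gb_seconds * GB_SECOND_PRICE)


# Hypothetical workload: 3,000 searches a day, 4 seconds each, 1 GB of memory.
print(monthly_lambda_cost(90_000, 4, 1.0))  # -> 0.0 (still inside the free tier)
```

Note that with long-running scrapes, the compute allowance tends to run out well before the invocation allowance does.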

3. Built-in Monitoring

CloudWatch provides detailed metrics, logs, and error tracking - out of the box.

I can monitor execution times, error rates, and view logs at my leisure without additional setup.

4. Easy Deployment

Deploying updates is as simple as a docker push and the press of a button (and can be simplified even further with triggers).

The Bad

1. Cold Start Latency

The first request after inactivity can take 2-5 seconds. For web scraping, this means users might wait longer for their first search. I've implemented warm-up strategies, but it's still a concern.
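
One common warm-up pattern, sketched here under assumptions, is to have a scheduled rule ping the function every few minutes and let the handler return early for those pings. The `{"warmup": true}` payload, the `run_search` placeholder, and the `cold_start` field are all hypothetical names for illustration.

```python
# Module-level code runs once per execution environment, i.e. on a cold start.
_COLD_START = True


def run_search(event):
    """Hypothetical placeholder for the real scraping logic."""
    return {"results": [], "query": event.get("query")}


def handler(event, context):
    global _COLD_START
    was_cold = _COLD_START
    _COLD_START = False

    # Assumed convention: the scheduled warm-up ping sends {"warmup": true}.
    # Returning early keeps the environment alive without doing real work.
    if event.get("warmup"):
        return {"warmed": True, "cold_start": was_cold}

    response = run_search(event)
    response["cold_start"] = was_cold  # handy for spotting cold starts in logs
    return response
```

Keep in mind that a single ping only keeps one execution environment warm. A burst of concurrent searches will still hit cold starts on the extra environments.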

2. Local Debugging Complexity

To run your Lambda locally, you'll need the AWS Lambda Runtime Interface Emulator (it ships with the AWS base images).

Basically, I ran my Lambda end-to-end locally using Docker alone, and had a different setup for testing specific parts.

Not the end of the world, but it definitely complicates things.
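
For reference, here's a minimal sketch of invoking a locally running Lambda container from Python. It assumes the image is based on an AWS base image (which bundles the Runtime Interface Emulator) and was started with something like `docker run -p 9000:8080 <your-image>`. The helper names and the example event are hypothetical.

```python
import json
import urllib.request

# The emulator listens on port 8080 inside the container; 9000 is the
# host port assumed in the docker run command above.
RIE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def build_invoke_request(event, url=RIE_URL):
    """Build the POST request that carries the event as a JSON body."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke_local(event, url=RIE_URL, timeout=60):
    """Send the event to the local container and decode the JSON reply."""
    request = build_invoke_request(event, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the container to be running; the event shape is made up.
    print(invoke_local({"query": "TLV-LHR"}))
```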

3. Remote Debugging Complexity

Some logic, especially scraping logic, can behave differently locally and remotely (even when using Docker). Different user permissions on certain files, different IPs, timezones and more can all cause issues.

Debugging serverless functions is more complex than traditional servers. You can't SSH into a Lambda function, so you end up having to debug through logs when such issues arise.
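
Since logs are all you get, it can help to log a small snapshot of the runtime environment at the start of each invocation and compare it with the same snapshot taken locally. This is only a sketch: the field names are made up, and the public IP is left out because it would need a network call. `AWS_LAMBDA_FUNCTION_NAME` is an environment variable that Lambda sets; locally it will simply be missing.

```python
import json
import os
import tempfile
import time


def runtime_snapshot():
    """Collect environment details that commonly differ between local and remote runs."""
    return {
        "function": os.environ.get("AWS_LAMBDA_FUNCTION_NAME"),  # None when run locally
        "uid": os.getuid() if hasattr(os, "getuid") else None,
        "timezone": list(time.tzname),
        "cwd": os.getcwd(),
        "cwd_writable": os.access(os.getcwd(), os.W_OK),
        "tmp_dir": tempfile.gettempdir(),
        "tmp_writable": os.access(tempfile.gettempdir(), os.W_OK),
    }


def log_runtime_snapshot():
    """Print the snapshot as one JSON line; on Lambda, stdout ends up in CloudWatch."""
    snapshot = runtime_snapshot()
    print(json.dumps({"event": "runtime_snapshot", **snapshot}))
    return snapshot
```

Calling `log_runtime_snapshot()` as the first line of the handler makes permission and timezone mismatches show up in the logs before the scraping logic fails on them.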

The Ugly

1. Using AWS Base Image

To have the same environment as your Lambda, you'll need to use the AWS base image. This is a terrible inconvenience, especially if you have a lot of dependencies.

There are ways around that, and I've had to use a lot of workarounds to get my Lambda to work, but it can be a real pain in the a$$.

2. Concurrency Limitation

Remember we said at the start "Infinite Scalability"? Well, it's not true.

Lambda has a default concurrency limit of 1,000 concurrent executions per account, per region. This means that if you have more than 1,000 concurrent requests, you'll start getting throttling errors.

This is why I've had to implement a queue on my end and other strategies to avoid this.
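
A minimal sketch of the client-side throttling idea, not necessarily the exact queue described above: cap the number of in-flight invocations with a thread pool and let everything else wait in the pool's internal queue. The `invoke` callable is injected so the sketch runs without AWS. In a real setup it would wrap something like boto3's `invoke` call on the Lambda client. `MAX_IN_FLIGHT` and `run_bounded` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed safety margin below the default limit of 1,000 concurrent executions.
MAX_IN_FLIGHT = 900


def run_bounded(payloads, invoke, max_in_flight=MAX_IN_FLIGHT):
    """Call invoke(payload) for every payload, with at most max_in_flight
    calls running at once. Results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # Payloads beyond max_in_flight wait in the executor's queue
        # until a worker thread frees up.
        return list(pool.map(invoke, payloads))


if __name__ == "__main__":
    # Stand-in for a real Lambda invocation.
    def fake_invoke(payload):
        return {"echo": payload}

    print(run_bounded(["TLV-LHR", "TLV-JFK"], fake_invoke, max_in_flight=2))
```

This only protects you from your own callers. If other functions in the same account and region share the limit, you'd still want retries with backoff on throttling errors.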

Update - Apparently you can talk to AWS and ask for a higher concurrency limit.

3. Vendor Lock-in

I can (not so proudly) say that moving my logic out of AWS Lambda to something like GCP Cloud Functions would be a terrible pain.

I'm not sure how the other serverless solutions work or whether they have their own runtimes and base images, but moving away from AWS Lambda now would be painful, which causes me a certain anxiety.

What about the other serverless solutions?

The honest answer is that I've never used any of the other serverless solutions. AWS Lambda is the most popular one out there and, from what I've heard and read, the most performant too (in terms of cold starts etc).

Their documentation is pretty good, their always-free tier is tempting to start with, and the community is big.

Would I Choose Lambda Again?

For a bootstrapped project like Hyikko, yes. In hindsight, I would say that using a serverless solution like AWS Lambda in the initial stages of your project lets you grow super quickly, and that benefit outweighs the challenges.

Final Verdict

AWS Lambda is excellent for web scraping and other stateless logic that needs to scale. The key is understanding the limitations and trade-offs in place.

