System Design
  • Introduction
  • Glossary of System Design
    • System Design Basics
    • Key Characteristics of Distributed Systems
    • Scalability - Harvard lecture
      • Scalability for Dummies - Part 1: Clones
      • Scalability for Dummies - Part 2: Database
      • Scalability for Dummies - Part 3: Cache
      • Scalability for Dummies - Part 4: Asynchronism
    • Trade-off
      • CAP Theorem
      • Performance vs scalability
      • Latency vs throughput
      • Availability vs consistency
    • Load Balancing
      • Load balancer
    • Proxies
      • Reverse proxy
    • Cache
      • Caching
    • Asynchronism
    • Processing guarantee in Kafka
    • Database
      • Relational database management system (RDBMS)
      • Redundancy and Replication
      • Data Partitioning
      • Indexes
      • NoSQL
      • SQL vs. NoSQL
      • Consistent Hashing
    • Application layer
    • DNS
    • CDN
    • Communication
      • Long-Polling vs WebSockets vs Server-Sent Events
    • Security
    • Lambda Architecture
  • OOD design
    • Concepts
      • Object-Oriented Basics
      • OO Analysis and Design
      • What is UML?
      • Use Case Diagrams
    • Design a parking lot
  • System Design Cases
    • Overview
    • Design a system that scales to millions of users on AWS
    • Designing a URL Shortening service like TinyURL
      • Design Unique ID Generator
      • Designing Pastebin
      • Design Pastebin.com (or Bit.ly)
    • Design notification system (scott)
      • Designing notification service
    • Designing Chat System
      • Designing Slack
      • Designing Facebook Messenger
    • Design Top K System
    • Designing Instagram
    • Design a newsfeed system
      • Designing Facebook’s Newsfeed
      • Design the data structures for a social network
    • Designing Twitter
      • Design the Twitter timeline and search
      • Designing Twitter Search
    • Design Youtube - Scott
      • Design live commenting
      • Designing Youtube or Netflix
    • Designing a Web Crawler
      • Designing a distributed job scheduler
      • Designing a Web Crawler/Archive (scott)
      • Design a web crawler
    • Designing Dropbox
    • Design Google Doc
    • Design Metrics Aggregation System
      • Design Ads Logging System
    • Design Instacart
    • Design a payment system
      • Airbnb - Avoiding Double Payments in a Distributed Payments System
    • Design Distributed Message Queue
      • Cherami: Uber Engineering’s Durable and Scalable Task Queue in Go
    • Design Distributed Cache
      • Design a key-value cache to save the results of the most recent web server queries
    • Design a scalable file distribution system
    • Design Amazon's sales ranking by category feature
    • Design Mint.com
    • Design Autocomplete System
      • Designing Typeahead Suggestion
    • Designing an API Rate Limiter
      • Designing Rate Limiter
    • Design Google Map
      • Designing Yelp or Nearby Friends
      • Designing Uber backend
    • Designing Ticketmaster
      • Design 12306 - Scott
    • Design AirBnB or a Hotel Booking System
  • Paper Reading
    • MapReduce
  • Other Questions
    • What happened after you input the url in the browser?
Powered by GitBook
On this page
  • Design Pastebin.com (or Bit.ly)
  • Step 1: Outline use cases and constraints
  • Use cases
  • Constraints and assumptions
  • Step 2: Create a high level design
  • Step 3: Design core components
  • Use case: User enters a block of text and gets a randomly generated link
  • Use case: User enters a paste's url and views the contents
  • Use case: Service tracks analytics of pages
  • Use case: Service deletes expired pastes
  • Step 4: Scale the design
  • Additional talking points

Was this helpful?

  1. System Design Cases
  2. Designing a URL Shortening service like TinyURL

Design Pastebin.com (or Bit.ly)

Design Pastebin.com (or Bit.ly)

Step 1: Outline use cases and constraints

Gather requirements and scope the problem. Ask questions to clarify use cases and constraints. Discuss assumptions.

Use cases

  • User enters a block of text and gets a randomly generated link

    • Expiration

      • Default setting does not expire

      • Can optionally set a timed expiration

  • User enters a paste's url and views the contents

  • User is anonymous

  • Service tracks analytics of pages

    • Monthly visit stats

  • Service deletes expired pastes

  • Service has high availability

Out of scope

  • User registers for an account

    • User verifies email

  • User logs into a registered account

    • User edits the document

  • User can set visibility

  • User can set the shortlink

Constraints and assumptions

State assumptions

  • Traffic is not evenly distributed

  • Following a short link should be fast

  • Pastes are text only

  • Page view analytics do not need to be realtime

  • 10 million users

  • 10 million paste writes per month

  • 100 million paste reads per month

  • 10:1 read to write ratio

Calculate usage

  • Size per paste

    • 1 KB content per paste (1 KB = 1000 bytes)

    • shortlink - 7 bytes

    • expiration_length_in_minutes - 4 bytes

    • created_at - 5 bytes

    • paste_path - 255 bytes

    • total = ~1.27 KB

  • 12.7 GB of new paste content per month

    • 1.27 KB per paste * 10 million pastes per month

    • ~450 GB of new paste content in 3 years

    • 360 million shortlinks in 3 years

    • Assume most are new pastes instead of updates to existing ones

  • 4 paste writes per second on average

  • 40 read requests per second on average

Handy conversion guide:

  • 2.5 million seconds per month

  • 1 request per second = 2.5 million requests per month

  • 40 requests per second = 100 million requests per month

  • 400 requests per second = 1 billion requests per month

Step 2: Create a high level design

Step 3: Design core components

Use case: User enters a block of text and gets a randomly generated link

We could use a relational database as a large hash table, mapping the generated url to a file server and path containing the paste file.

Instead of managing a file server, we could use a managed Object Store such as Amazon S3 or a NoSQL document store.

An alternative to a relational database acting as a large hash table, we could use a NoSQL key-value store. We should discuss the tradeoffs between choosing SQL or NoSQL. The following discussion uses the relational database approach.

  • The Client sends a create paste request to the Web Server, running as a reverse proxy

  • The Web Server forwards the request to the Write API server

  • The Write API server does the following:

    • Generates a unique url

      • Checks if the url is unique by looking at the SQL Database for a duplicate

      • If the url is not unique, it generates another url

      • If we supported a custom url, we could use the user-supplied (also check for a duplicate)

    • Saves to the SQL Database pastes table

    • Saves the paste data to the Object Store

    • Returns the url

The pastes table could have the following structure:

shortlink char(7) NOT NULL
expiration_length_in_minutes int NOT NULL
created_at datetime NOT NULL
paste_path varchar(255) NOT NULL
PRIMARY KEY(shortlink)

To generate the unique url, we could:

    • MD5 is a widely used hashing function that produces a 128-bit hash value

    • MD5 is uniformly distributed

    • Alternatively, we could also take the MD5 hash of randomly-generated data

    • Base 62 encodes to [a-zA-Z0-9] which works well for urls, eliminating the need for escaping special characters

    • There is only one hash result for the original input and Base 62 is deterministic (no randomness involved)

    • Base 64 is another popular encoding but provides issues for urls because of the additional + and / characters

    • def base_encode(num, base=62):
        digits = []
        while num > 0
            remainder = modulo(num, base)
            digits.push(remainder)
            num = divide(num, base)
        digits = digits.reverse
  • Take the first 7 characters of the output, which results in 62^7 possible values and should be sufficient to handle our constraint of 360 million shortlinks in 3 years:

url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]

We'll use a public REST API:

$ curl -X POST --data '{ "expiration_length_in_minutes": "60", "paste_contents": "Hello World!" }' https://pastebin.com/api/v1/paste

Response:

{
    "shortlink": "foobar"
}

Use case: User enters a paste's url and views the contents

  • The Client sends a get paste request to the Web Server

  • The Web Server forwards the request to the Read API server

  • The Read API server does the following:

    • Checks the SQL Database for the generated url

      • If the url is in the SQL Database, fetch the paste contents from the Object Store

      • Else, return an error message for the user

REST API:

$ curl https://pastebin.com/api/v1/paste?shortlink=foobar

Response:

{
    "paste_contents": "Hello World"
    "created_at": "YYYY-MM-DD HH:MM:SS"
    "expiration_length_in_minutes": "60"
}

Use case: Service tracks analytics of pages

Since realtime analytics are not a requirement, we could simply MapReduce the Web Server logs to generate hit counts.

Clarify with your interviewer how much code you are expected to write.

class HitCounts(MRJob):

    def extract_url(self, line):
        """Extract the generated url from the log line."""
        ...

    def extract_year_month(self, line):
        """Return the year and month portions of the timestamp."""
        ...

    def mapper(self, _, line):
        """Parse each log line, extract and transform relevant lines.

        Emit key value pairs of the form:

        (2016-01, url0), 1
        (2016-01, url0), 1
        (2016-01, url1), 1
        """
        url = self.extract_url(line)
        period = self.extract_year_month(line)
        yield (period, url), 1

    def reducer(self, key, values):
        """Sum values for each key.

        (2016-01, url0), 2
        (2016-01, url1), 1
        """
        yield key, sum(values)

Use case: Service deletes expired pastes

To delete expired pastes, we could just scan the SQL Database for all entries whose expiration timestamp are older than the current timestamp. All expired entries would then be deleted (or marked as expired) from the table.

Step 4: Scale the design

Important: Do not simply jump right into the final design from the initial design!

State you would do this iteratively: 1. Benchmark/Load Test; 1. Profile for bottlenecks; 1. address bottlenecks while evaluating alternatives and trade-offs; 1. repeat.

See Design a system that scales to millions of users on AWS as a sample on how to iteratively scale the initial design.

It's important to discuss what bottlenecks you might encounter with the initial design and how you might address each of them. For example, what issues are addressed by adding a Load Balancer with multiple Web Servers? CDN? Master-Slave Replicas? What are the alternatives and Trade-Offs for each?

We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not shown to reduce clutter.

The Analytics Database could use a data warehousing solution such as Amazon Redshift or Google BigQuery.

An Object Store such as Amazon S3 can comfortably handle the constraint of 12.7 GB of new content per month.

To address the 40 average read requests per second (higher at peak), traffic for popular content should be handled by the Memory Cache instead of the database. The Memory Cache is also useful for handling the unevenly distributed traffic and traffic spikes. The SQL Read Replicas should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.

4 average paste writes per second (with higher at peak) should be do-able for a single SQL Write Master-Slave. Otherwise, we'll need to employ additional SQL scaling patterns:

We should also consider moving some data to a NoSQL Database.

Additional talking points

Additional topics to dive into, depending on the problem scope and time remaining.

NoSQL

Caching

  • Where to cache

  • What to cache

  • When to update the cache

Asynchronism and microservices

Communications

  • Discuss tradeoffs:

Security

Latency numbers

Ongoing

  • Continue benchmarking and monitoring your system to address bottlenecks as they come up

  • Scaling is an iterative process

PreviousDesigning PastebinNextDesign notification system (scott)

Last updated 4 years ago

Was this helpful?

Setting the primary key to be based on the shortlink column creates an that the database uses to enforce uniqueness. We'll create an additional index on created_at to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.

Take the hash of the user's ip_address + timestamp

encode the MD5 hash

The following runs in O(k) time where k is the number of digits = 7:

For internal communications, we could use .

To avoid repeating discussions, refer to the following for main talking points, tradeoffs, and alternatives:

External communication with clients -

Internal communications -

Refer to the .

See .

index
MD5
Base 62
Base 62 pseudocode
Remote Procedure Calls
system design topics
DNS
CDN
Load balancer
Horizontal scaling
Web server (reverse proxy)
API server (application layer)
Cache
Relational database management system (RDBMS)
SQL write master-slave failover
Master-slave replication
Consistency patterns
Availability patterns
Federation
Sharding
Denormalization
SQL Tuning
Key-value store
Document store
Wide column store
Graph database
SQL vs NoSQL
Client caching
CDN caching
Web server caching
Database caching
Application caching
Caching at the database query level
Caching at the object level
Cache-aside
Write-through
Write-behind (write-back)
Refresh ahead
Message queues
Task queues
Back pressure
Microservices
HTTP APIs following REST
RPC
Service discovery
security section
Latency numbers every programmer should know