Design a payment system

Payment Problems

  • Pay per transaction: From rider to Uber

    • Paid pre-transaction: Amazon

    • Paid post-transaction: Uber trip

  • Scheduled batched monthly payment: payout to the driver

  • Discuss:

    • physical infrastructure,

    • data stores,

    • data model,

    • security,

    • performance considerations


Requirements

Functional Requirements

  • Post-transaction: Uber trip

    • Account balance (Uber credit), account info: saved payment methods (credit card numbers, bank accounts, etc.)

    • Currency, localization? (USD only, or global: multiple DCs across multiple regions)

    • A user may have multiple accounts

  • Scheduled/batch payments

    • Monthly (configurable, hard-code is fine)

    • Batch several accounts into one payment request?

    • One scheduled payout per account per month

Non-functional Requirements

  • Design a scalable, reliable payment service that the company can use to process payments: it reads the account info and payment info for a user and executes the payment.

  • Ledger system as the source of truth; reduce the total number of transactions by batching (each transaction incurs a 1-2% fee)

  • Additionally, the latency of a payment transaction to the bank can be on the order of ~1 second.

Constraints

  • Scalable: 30M transactions per day ≈ 350 QPS

    • peak: 10 times more: ~3.5K QPS

  • Durable: hardware/software failures

  • Fault-tolerant: any part of the system goes down/failures

  • Consistency: Strong consistency.

  • Availability: 99.99% (~53 mins downtime/year)

  • Latency: p99 SLA < 200 ms for everything other than the external bank call
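The constraints above can be sanity-checked with quick back-of-envelope arithmetic (a rough sketch; the 10x peak factor is the assumption from the list):

```python
# Back-of-envelope check of the scalability and availability numbers.
transactions_per_day = 30_000_000
seconds_per_day = 24 * 60 * 60            # 86,400

avg_qps = transactions_per_day / seconds_per_day
peak_qps = avg_qps * 10                   # assumed 10x peak factor

print(f"average: ~{avg_qps:.0f} QPS")     # ~347 QPS
print(f"peak:    ~{peak_qps:.0f} QPS")    # ~3472 QPS

# 99.99% availability -> allowed downtime per year
minutes_per_year = 365 * 24 * 60
downtime_min = minutes_per_year * (1 - 0.9999)
print(f"allowed downtime: ~{downtime_min:.0f} min/year")  # ~53 min/year
```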


What could go wrong?

Common pain points in payment system design:

  • Lack of payment

    • Insufficient funds (the bank denies the charge)

  • Double spending/payout

    • Charged twice (distinct from using the same money to pay for two things)

    • Correctness under at-least-once delivery

  • Incorrect currency conversion

    • n/a

  • Dangling authorization

    • Not client-side auth; user login is out of scope here

    • Auth to the payment system

    • Auth from bank

      • PSP (Payment Service Provider) auth (credentials passed through to the bank)

  • Incorrect payment

    • Discrepancy from downstream services??

  • Incompatible IDs (only temporarily unique)

    • Idempotency key? Different ID systems (IDs only temporarily unique)

  • 3rd party PSP outage

    • Any downstream service outages

  • Data discrepancy of charges from Stripe / Braintree

    • Same as incorrect payment


High-level Architecture

  1. When a user clicks the “Buy” button, a payment event is generated and sent to the payment service.

  2. The payment service stores the payment event in the database.

  3. Sometimes a single payment event may contain several payment orders.

    • For example, you may select products from multiple sellers in a single checkout process. The payment service will call the payment executor for each payment order.

  4. The payment executor stores the payment order in the database.

  5. The payment executor calls an external PSP to finish the credit card payment.

  6. After the payment executor has successfully executed the payment, the payment service will update the wallet to record how much money a given seller has.

  7. The wallet server stores the updated balance information in the database.

  8. After the wallet service has successfully updated the seller’s balance information, the payment service will call the ledger to update it.

  9. The ledger service appends the new ledger information to the database.

  10. Every night the PSP or banks send settlement files to their clients. The settlement file contains the balance of the bank account, together with all the transactions that took place on this bank account during the day.
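The ten steps above can be condensed into a single orchestration sketch. All names (`call_psp`, the in-memory dicts standing in for the event, order, wallet, and ledger databases) are hypothetical, not from any real implementation:

```python
import uuid

# Minimal in-memory stand-ins for the components in the architecture above.
# Real implementations would be separate services backed by databases.
payment_events, payment_orders, wallets, ledger = {}, {}, {}, []

def call_psp(order):
    """Pretend external PSP call; always succeeds in this sketch."""
    return {"order_id": order["id"], "status": "succeeded"}

def process_payment_event(orders):
    # 1-2. Generate and persist the payment event.
    event_id = str(uuid.uuid4())
    payment_events[event_id] = {"orders": [o["id"] for o in orders]}
    # 3-5. One event may contain several payment orders (multiple sellers);
    # each order is persisted and executed against the PSP.
    for order in orders:
        payment_orders[order["id"]] = order
        result = call_psp(order)
        if result["status"] != "succeeded":
            continue
        # 6-7. Update the seller's wallet balance.
        wallets[order["seller"]] = wallets.get(order["seller"], 0) + order["amount"]
        # 8-9. Append (never update) to the ledger: the source of truth.
        ledger.append({"event": event_id, "order": order["id"],
                       "credit": order["seller"], "amount": order["amount"]})
    return event_id

process_payment_event([{"id": "o1", "seller": "s1", "amount": 25.0},
                       {"id": "o2", "seller": "s2", "amount": 10.0}])
print(wallets)       # {'s1': 25.0, 's2': 10.0}
print(len(ledger))   # 2
```

Step 10 (nightly settlement files) is deliberately left out: reconciliation is an offline batch job, not part of the request path.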


Detailed Design

Request and Status

  • Retry and de-duplication via an idempotent API

  • Request: idempotency key (UUID, or a deterministic ID)

  • Status: success/failure (retryable/non-retryable)
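A minimal sketch of this status model, assuming downstream failures surface as HTTP-style status codes (the names are illustrative, not from any real API; the 5xx/4xx split matches the write-back step in the life cycle below):

```python
from enum import Enum

class PaymentStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    RETRYABLE_FAILURE = "retryable_failure"           # e.g. 5xx, timeout
    NON_RETRYABLE_FAILURE = "non_retryable_failure"   # e.g. 4xx, insufficient funds

def classify_failure(http_status: int) -> PaymentStatus:
    """5xx errors are transient (safe to retry with the same idempotency
    key); 4xx errors are permanent and must not be retried."""
    if 500 <= http_status < 600:
        return PaymentStatus.RETRYABLE_FAILURE
    return PaymentStatus.NON_RETRYABLE_FAILURE

print(classify_failure(503))  # PaymentStatus.RETRYABLE_FAILURE
print(classify_failure(402))  # PaymentStatus.NON_RETRYABLE_FAILURE
```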


Request life cycle

  1. Payment event arrives from a durable event queue, e.g. AWS SQS (at-least-once delivery)

    • Only one worker may work on a given event (two workers must not process the same event concurrently, or the payment would be executed twice!)

    • If the worker failed/timeout, the event would go back to the queue to be picked up by another worker

    • If it fails a configurable number of times, it goes to the DLQ (dead-letter queue). Failing repeatedly within the normal retry window suggests some other root cause that needs investigation, so the event is parked in the DLQ for inspection.

    • Scale up workers for post processing

    Q: Does Amazon SQS guarantee delivery of messages?

    • Standard queues provide at-least-once delivery, which means that each message is delivered at least once.

    • FIFO queues provide exactly-once processing, which means that each message is delivered once and remains available until a consumer processes it and deletes it. Duplicates are not introduced into the queue.

  2. Backend worker pulls an event from the queue (event ID as the idempotency key)

    • DB record: idempotency key

    • status: success/retryable_failure/non-retryable_failure (classify failure reasons)

      • queued/running states should be added as well, so a request that fails mid-execution can still be tracked

    • a. Prepare (DB row-level lock on the idempotency key, with a lease expiry; the expiry is a configurable value > max downstream processing time). In effect, first check in the DB whether this event has already been processed.

      • Get or create an entry in db

      • If the entry exists with status finished, return

      • If the entry is in a retryable-failure state, or doesn’t exist, continue processing

    • b. Process: talk to downstream services (PSP, bank) for the actual money transfer

      • Async, circuit breaker (fail fast), retries with exponential backoff

    • c. After we get the response

    • d. Write back: update the DB record

      • First update the status to success or failure (5xx: retryable failure; 4xx: non-retryable failure)

      • Release the lock on the idempotency key

      • Delete the event from the queue

    • e. Scenario 1: server 1 locks request A, then server 1 crashes. This does not deadlock: the lock has an expiration (lease), after which another worker can pick the event up; the crashed process just needs to restart.

      • Queue: send and forget (at-least-once processing)

      • SQS: a message’s lifecycle ends only when processing is fully done (the consumer explicitly deletes it)

      • MQ / Kafka: a message’s lifecycle ends when it is consumed, so the backend process must restart and resume the work

    • f. Scenario 2: lease expiry time of 5 minutes.

        1. Server_1 locks row_1, then hits a stop-the-world GC pause of more than 5 minutes; the lease expires while it still believes it holds the lock, and it later writes stale data.

        2. Server_2 acquires the lock on row_1 and writes data; Server_1’s late write then corrupts it. (This is the problem fencing tokens solve; see Distributed Locking below.)

    • g. Scenario 3: the worker’s payment actually went through, but the acknowledgement didn’t make it back to SQS.

      • Failed to write to DB as success:

        • retry -> idempotent -> no side effect

      • Failed to delete the event:

        • the next worker gets the event, checks the DB, sees it’s already finished, does nothing, and deletes the event

    • h. Distributed locking with fencing tokens

      • The lock service hands out a monotonically increasing ID (fencing token)

      • server1 acquires the lock with id 10; after its lease expires, server2 acquires it with id 11

      • When server1 comes back online and tries to commit, it sees that its ID is stale and aborts the transaction

  3. After processing finishes, bookkeeping for audit/data-analysis purposes

    • Save transaction history and bookkeeping data (money in/out for each account)

    • Sweep process to clean up/purge the tables (delete rows more than 1 year old)
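The prepare/process/write-back steps above can be condensed into a worker sketch. The table, lock, and queue are simulated in memory and all names are illustrative; in production the lock would be a DB row-level lock with a lease expiry and the queue would be SQS:

```python
import time

# In-memory stand-in for the DB table keyed by the idempotency key.
db = {}                 # idempotency_key -> {"status": ..., "lease_until": ...}
LEASE_SECONDS = 300     # configurable; must exceed max downstream processing time

def handle_event(event_id, pay):
    """Process one queued payment event with effectively-once semantics.
    Returns True when the event can be deleted from the queue."""
    now = time.time()
    # a. Prepare: get-or-create the record and take the lease (row lock).
    record = db.setdefault(event_id, {"status": "queued", "lease_until": 0})
    if record["status"] in ("success", "non_retryable_failure"):
        return True                      # already settled: delete from queue
    if record["lease_until"] > now:
        return False                     # another worker holds the lease
    record.update(status="running", lease_until=now + LEASE_SECONDS)
    # b. Process: call downstream (PSP/bank); real code adds retries,
    # exponential backoff, and a circuit breaker here.
    try:
        pay()
        record["status"] = "success"     # d. write back the final status
        return True                      # caller deletes the event from SQS
    except Exception:
        record["status"] = "retryable_failure"  # classify for real: 5xx vs 4xx
        return False                     # event goes back to the queue
    finally:
        record["lease_until"] = 0        # release the lock on the key

# Retrying after success is a no-op: the prepare step short-circuits,
# so the money is never moved twice (the second pay() is never called).
assert handle_event("evt-1", lambda: None) is True
assert handle_event("evt-1", lambda: 1 / 0) is True
```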


Idempotent Requests

Ref reading: Stripe Idempotent Requests

The API supports idempotency for safely retrying requests without accidentally performing the same operation twice.

To perform an idempotent request, provide an additional Idempotency-Key: <key> header to the request.

An idempotency key is a unique value generated by the client which the server uses to recognize subsequent retries of the same request. How you create unique keys is up to you, but we suggest using V4 UUIDs, or another random string with enough entropy to avoid collisions. Idempotency keys can be up to 255 characters long.

Keys are eligible to be removed from the system automatically after they're at least 24 hours old, and a new request is generated if a key is reused after the original has been pruned.
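A client-side sketch of the retry pattern Stripe describes: generate a V4 UUID once per logical operation, then reuse it on every retry (the `send` function is a hypothetical transport, not Stripe's client library):

```python
import uuid

def send(headers, body):
    """Hypothetical transport; a real client would POST to the API here."""
    return {"ok": True, "key": headers["Idempotency-Key"]}

def charge_with_retries(body, max_attempts=3):
    # One key per logical operation, reused across retries, so the server
    # can recognize attempts 2..n as duplicates of attempt 1.
    key = str(uuid.uuid4())
    headers = {"Idempotency-Key": key}
    for attempt in range(max_attempts):
        resp = send(headers, body)
        if resp["ok"]:
            return resp
    raise RuntimeError("all attempts failed")

r = charge_with_retries({"amount": 1000, "currency": "usd"})
print(len(r["key"]))  # 36 characters: well under the 255-character limit
```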


Distributed Locking

Ref reading: How to do distributed locking - Martin Kleppmann

  • Redlock on top of Redis for distributed locking.

Protecting a resource with a lock

  • For example, say you have an application in which a client needs to update a file in shared storage (e.g. HDFS or S3). A client first acquires the lock, then reads the file, makes some changes, writes the modified file back, and finally releases the lock. The lock prevents two clients from performing this read-modify-write cycle concurrently, which would result in lost updates. The code might look something like this:

// THIS CODE IS BROKEN
function writeData(filename, data) {
    var lock = lockService.acquireLock(filename);
    if (!lock) {
        throw 'Failed to acquire lock';
    }

    try {
        var file = storage.readFile(filename);
        var updated = updateContents(file, data);
        storage.writeFile(filename, updated);
    } finally {
        lock.release();
    }
}
  • Unfortunately, even if you have a perfect lock service, the code above is broken. The following diagram shows how you can end up with corrupted data:

    • This bug is not theoretical: HBase used to have this problem. Normally, GC pauses are quite short, but “stop-the-world” GC pauses have sometimes been known to last for several minutes – certainly long enough for a lease to expire. Even so-called “concurrent” garbage collectors like the HotSpot JVM’s CMS cannot fully run in parallel with the application code – even they need to stop the world from time to time.

Making the lock safe with fencing

  • The fix for this problem is actually pretty simple: you need to include a fencing token with every write request to the storage service. In this context, a fencing token is simply a number that increases (e.g. incremented by the lock service) every time a client acquires the lock. This is illustrated in the following diagram:
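A sketch of fencing, assuming the lock service hands out a monotonically increasing token and the storage service remembers the highest token it has seen (class and values are illustrative):

```python
class FencedStorage:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.data = None
        self.highest_token = -1

    def write(self, token, value):
        # A token at or below the highest seen one must come from a client
        # whose lease already expired, so the write is refused.
        if token <= self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data = value

storage = FencedStorage()
storage.write(33, "from client 1")    # client 1 holds token 33
storage.write(34, "from client 2")    # lease expired; client 2 got token 34
try:
    storage.write(33, "late write")   # client 1 wakes up after a GC pause
except PermissionError as e:
    print(e)                          # stale fencing token 33
assert storage.data == "from client 2"
```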


Database choice

Storage Estimation

  • 1 row: ~10 KB

  • 30 M req/day -> 300 GB/day × 180 days (retain 6 months) -> ~54 TB -> ~100 DB instances (500 GB per instance)
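The estimate above, checked step by step (the 10 KB/row, 6-month retention, and 500 GB/instance figures are from the notes):

```python
# Storage back-of-envelope: rows/day -> GB/day -> total TB -> DB instances.
row_kb = 10
rows_per_day = 30_000_000

gb_per_day = rows_per_day * row_kb / 1_000_000   # KB -> GB
total_tb = gb_per_day * 180 / 1000               # 6-month retention
instances = total_tb * 1000 / 500                # 500 GB per instance

print(gb_per_day)  # 300.0 GB/day
print(total_tb)    # 54.0 TB (the notes round to ~50 TB)
print(instances)   # 108.0 (~100 instances)
```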

DB partition

  • Sharding based on the transaction key - the Uber trip ID

  • Cross-shard transactions (minimized by co-locating all of a trip’s rows on one shard)
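A sketch of hash-based sharding on the trip ID, assuming a fixed shard count (the function name and `NUM_SHARDS` value are illustrative):

```python
import hashlib

NUM_SHARDS = 100

def shard_for(trip_id: str) -> int:
    """Map a trip ID to a shard deterministically. Uses md5 rather than
    Python's hash() so the mapping is stable across processes/restarts."""
    digest = hashlib.md5(trip_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same trip always lands on the same shard, so its payment event,
# orders, and ledger rows can live together in one DB instance and
# commit in a single local transaction.
assert shard_for("trip-42") == shard_for("trip-42")
print(shard_for("trip-42"))  # some stable value in [0, 100)
```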

DB

  • account, request tables (purge every 6 months): SQL (ACID, transaction)

  • transaction_history, bookkeeping tables: NoSQL


Further reading
