1/ @bitfinex infrastructure thread
Until Jan 2018, Bitfinex used a cloud hosting provider (AWS) as the main site for the platform.
We soon figured out that it was almost impossible to scale a CEX in the cloud.
Here is the train of thought we had in the mighty infrastructure team.
2/
Problems (1):
- order/execution full roundtrip should be ~1ms or lower, and should improve over time (0.1ms target)
- manage 10k+ to 100k+ orders/sec
- record all events to historical data stores, available to users via API and UI
3/
Problems (2):
- handling billions of events per day and recording them is one of the main bottlenecks of (crypto) exchanges -> can the DBMS record 100k events per second, sustained?
- in the cloud you're not sure how CPU gets allocated (the trading engine is a CPU-bound process)
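One way to feel that write bottleneck yourself: batch the inserts. A minimal sketch using stdlib sqlite3 (an assumption for illustration; the thread doesn't say which DBMS Bitfinex uses). Per-row commits are capped by disk fsync rate; batched transactions are not:

```python
import sqlite3
import time

def record_events(conn, events, batch_size=10_000):
    # Insert events in batches inside explicit transactions;
    # committing per row would cap throughput at the fsync rate.
    cur = conn.cursor()
    for i in range(0, len(events), batch_size):
        cur.executemany(
            "INSERT INTO events (ts, payload) VALUES (?, ?)",
            events[i:i + batch_size],
        )
        conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # point at a file to test real disk
    conn.execute("CREATE TABLE events (ts INTEGER, payload TEXT)")
    events = [(i, f"order-{i}") for i in range(200_000)]
    t0 = time.perf_counter()
    record_events(conn, events)
    rate = len(events) / (time.perf_counter() - t0)
    print(f"{rate:,.0f} events/sec")
```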
4/
Problems (3):
- disk throughput can be throttled (gp2, io1 volume types)
- if you use SSDs in the cloud: which model do you get? are you the only writer? how do you get a proper RAID setup? ...
- how do you minimize the number of network switch hops between your instances?
- unpredictable network behavior
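To check whether a volume is throttling you, here is a crude sequential-write probe (a Python sketch; serious benchmarking would use a tool like fio, and the path/sizes are placeholders):

```python
import os
import time

def write_throughput(path: str, total_mb: int = 256, chunk_mb: int = 4) -> float:
    """Sequentially write total_mb to path and return MB/s.

    On burst-credit cloud volumes (e.g. gp2), rerun until the
    rate stabilizes to see the sustained, post-burst figure.
    """
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    t0 = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # ensure bytes actually reach the device
    elapsed = time.perf_counter() - t0
    os.remove(path)
    return total_mb / elapsed
```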
5/
Problems (4):
- reliable UDP multicast: neither AWS nor Google Cloud seems to support UDP multicast. And anyway, what's the expected packet loss?
- security: Spectre-like attacks, data hosted on a 3rd party, ...
- #crypto is growing. When is the next bull run? How do we ensure 10x/100x scalability?
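For context, joining a multicast group is only a few lines on your own metal. The sketch below uses an assumed group address (239.1.1.1) and port for illustration; on managed clouds these datagrams are typically just not routed between instances:

```python
import socket
import struct

MCAST_GRP = "239.1.1.1"   # assumed group address, illustrative only
MCAST_PORT = 5007

def make_receiver() -> socket.socket:
    # Join the multicast group; this membership is what managed
    # clouds generally won't honor across instances.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock

def make_sender() -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # TTL 1 keeps the traffic on the local segment.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    return sock
```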
6/
Path to metal

Our infra team has outstanding experience in secure, god-wrath-resistant deployments, so we discussed the pros and cons at length.
Here are the 2 cons, summarized:
- hardware decay
- physical host maintenance (trips to the datacenter, hardware expertise, ...)
7/
Benefit (1):
- the entire infra behavior is predictable
- RAID-enabled, sharded historical data warehouses kick ass
- KYD, KYM and KYC: Know Your Disk, Know Your Mem and Know Your CPU
- dedicated switches and firewalls, back-to-back connectivity for core servers
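KYD/KYM/KYC in code terms: even a stdlib-only sketch shows the kind of hardware facts you can pin down on your own metal but have to guess at in the cloud (a hypothetical helper, not our tooling):

```python
import os
import platform
import shutil

def know_your_host() -> dict:
    # Basic facts that are fixed and known on owned hardware,
    # but opaque or variable on a shared cloud instance.
    disk = shutil.disk_usage("/")
    return {
        "cpu_count": os.cpu_count(),
        "machine": platform.machine(),
        "disk_total_gb": disk.total / 1e9,
        "disk_free_gb": disk.free / 1e9,
    }
```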
8/
Benefit (2):
- reliable UDP multicast
- bare-metal Linux installs
- you're the only one using your CPU (see Spectre-like attacks)
- colocation service offerings
10/
here is a potato