at://bnewbold.net/com.whtwnd.blog.entry/3kwzl7tye6u2y
Record JSON
{
"$type": "com.whtwnd.blog.entry",
"blobs": [
{
"blobref": {
"mimeType": "image/png",
"original": {
"$type": "blob",
"ref": {
"$link": "bafkreid4rb2kcokdrgpmj54wnsnfti2rkr34ebsik3rrrp342rumvtcjpi"
},
"mimeType": "image/png",
"size": 93798
},
"ref": {
"$link": "bafkreid4rb2kcokdrgpmj54wnsnfti2rkr34ebsik3rrrp342rumvtcjpi"
},
"size": 93798
},
"encoding": "image/png",
"name": "2024-07-11T11:05:24,561816603-07:00.png"
}
],
"content": "These are some informal notes on setting up a full-network atproto Relay, using the `bigsky` relay software developed by Bluesky. This is the same software we run ourselves at \u003chttps://bsky.network\u003e. The focus here is on the compute resources necessary to replicate the type of full-network, full-featured service that Bluesky currently operates with the size of the network that exists today.\n\nThe demo Relay described here is running at `relay-ovh.demo.bsky.dev`. It handles crawling and re-publishing the full network firehose, with headroom for traffic spikes and growth on a number of dimensions: new accounts (repos), new PDS instances, more content being created (firehose event rate), number of services consuming from the firehose, etc.\n\n## Changes\n\nRunning through this demo setup turned up some sharp edges and missing configuration knobs in `bigsky`. For example, trying to backfill the full network with default configuration was resulting in OOM errors with an instance this size. Tweaks and configuration have been merged to the `main` branch of `bigsky`, along with additions to [the README](https://github.com/bluesky-social/indigo/tree/main/cmd/bigsky).\n\n## Network Scaling\n\nHow big can this instance scale? Hard to tell exactly, my guess is that it could do an order of magnitude more event rate, but will run out of disk before too long (eg, in the next year).\n\nThere are a number of possibilities for improving Relay efficiency to make this kind of service cheaper: There are implementation details (like using alternative database engines, or not storing repo data as millions of small files on disk). Data and work could be sharded across multiple machines. Not every Relay needs to crawl and mirror the entire network, and Relays could potentially be simplified to not maintain a full mirror of network content. 
On the other hand, actually running critical services would have a number of needs not covered here: legal and administrative burdens, monitoring and alerting, etc.\n\n\n## Shopping for a Server\n\nMy assumption is that the main thing needed is a relatively large and reasonably fast disk. The Bluesky production Relays currently use about 1 TByte for PostgreSQL and 1 TByte for CAR storage on local disk. The CAR storage filesystem should also be XFS (not ext4), to handle many millions of small files. When shopping for instances I looked for around 2 TByte for PostgreSQL and 2 TByte for CAR storage so that this setup would be realistic even with growth of the network over time. This storage could all be one disk/filesystem or two separate disks/filesystems. Regular SSD would probably work fine; NVMe is nice, especially for backfill.\n\nMore RAM always helps (page cache and other caches). Don’t need much CPU; the relay process is highly concurrent, but mostly I/O bound. Do want a decent monthly network quota.\n\nDisk is definitely the hard part. Network block storage (eg, AWS EBS) is pretty expensive even from cheaper providers, and usually costs more monthly than an entire bare metal instance with larger disks. 
Bare metal instances are mostly spinning disk or NVMe, not SSD; I assume that spinning disk isn't realistic for a fast backfill demo.\n\nI ended up selecting an OVH instance for about $150/month:\n\n- `ADVANCE-2-LE`: https://www.ovhcloud.com/en/bare-metal/advance/adv-2/\n- 12 vCPU (Intel Xeon-E 2136 - 6c/12t - 3.3 GHz/4.5 GHz)\n- 32 GB RAM (32 GB ECC 2666 MHz)\n- disks: 2×1.92 TB NVMe\n- 1Gbit/s unmetered and guaranteed\n- $152/month plus one-time $92 setup fee (no commitment)\n\nThat exact config isn’t available now (a week later), but a very similar one is:\n\n- `ADVANCE-1` : https://www.ovhcloud.com/en/bare-metal/advance/adv-1/\n- 12 vCPU (AMD EPYC 4244P - 6c/12t - 3.8GHz/5.1GHz)\n- 32 GB RAM (32GB DDR5 ECC 5200MHz)\n- rootfs disk: 2x NVMe 960GB (RAID)\n- data disks: 2x 1.92TB NVMe\n- 1Gbit/s unmetered and guaranteed\n- $153/month plus one-time $93 setup (no commitment)\n\nIn both cases the setup fee is waived with a 6-month commitment, and there are discounts on the monthly rate with longer commitments.\n\n## Host Provisioning\n\nUsing the OVH web interface, provisioned the server with Ubuntu 24.04. With the `ADVANCE-2-LE` host, I specified partitioning to not use RAID. I let the setup wizard use one of the two disks for rootfs, boot, and swap. With all defaults this resulted in ext4. The second disk was not partitioned or configured using the wizard (I got to that later on the server itself).\n\nConfigured a DNS A record to point at the IPv4 address that OVH gave us.\n\nLogged in to the server and ran commands similar to this:\n\n```\nhostnamectl hostname relay-example.demo.bsky.dev\n\napt update\napt upgrade\napt install ripgrep fd-find dstat htop iotop iftop pg-activity httpie caddy golang postgresql yarnpkg\n\n# set up yarn command; could also have used nvm\nln -s /usr/bin/yarnpkg /usr/bin/yarn\n\n# punch holes in default firewall for HTTP/S\nufw allow 80/tcp\nufw allow 443/tcp\n```\n\nRan through partitioning of the second NVMe with XFS. 
Note that on a real machine you'd want to set up `fstab` so this mounts automatically on a reboot.\n\n```\n# create a partition\nsudo fdisk /dev/nvme1n1\n# n (new partition), default (primary), default (1), default (start sector), default (entire disk), w (write)\n\n# create XFS filesystem on that partition\nsudo mkfs.xfs /dev/nvme1n1p1\n\n# mount that filesystem to /data\nsudo mkdir -p /data\nsudo mount /dev/nvme1n1p1 /data\n```\n\nPull the indigo codebase and build; ran this as the `ubuntu` user not `root`:\n\n```\n# /data is owned by root, so create the directories with sudo, then\n# hand them to whichever user will be running the service\nsudo mkdir -p /data/bigsky\nsudo mkdir -p /data/bigsky/events\nsudo chown ubuntu:ubuntu /data/bigsky/\nsudo chown ubuntu:ubuntu /data/bigsky/events\n\n# pull source code and build. if you had patches or a working branch, would modify here\ncd\ngit clone https://github.com/bluesky-social/indigo\ncd indigo\nmake build-relay-ui build\n```\n\nConfigure PostgreSQL (`sudo -u postgres psql`); replace CHANGEME with a secure password of your choice:\n```\nCREATE DATABASE bgs;\nCREATE DATABASE carstore;\n\nCREATE USER bigsky WITH PASSWORD 'CHANGEME';\nGRANT ALL PRIVILEGES ON DATABASE bgs TO bigsky;\nGRANT ALL PRIVILEGES ON DATABASE carstore TO bigsky;\n\n-- these are needed for newer versions of postgres\n\\c bgs postgres\nGRANT ALL ON SCHEMA public TO bigsky;\n\n\\c carstore postgres\nGRANT ALL ON SCHEMA public TO bigsky;\n```\n\nCreate a config file at `~/indigo/.env`:\n\n```\nENVIRONMENT=production\nDATABASE_URL=\"postgres://bigsky:CHANGEME@localhost:5432/bgs\"\nCARSTORE_DATABASE_URL=\"postgres://bigsky:CHANGEME@localhost:5432/carstore\"\nDATA_DIR=/data/bigsky\nRELAY_PERSISTER_DIR=/data/bigsky/events\nGOLOG_LOG_LEVEL=info\n# or whatever DNS you want to use for handle resolution\nRESOLVE_ADDRESS=\"8.8.8.8:53\"\nFORCE_DNS_UDP=true\nRELAY_COMPACT_INTERVAL=0\nRELAY_DEFAULT_REPO_LIMIT=500000\n\n# these were somewhat tuned to this instance 
size\nMAX_CARSTORE_CONNECTIONS=12\nMAX_METADB_CONNECTIONS=12\nMAX_FETCH_CONCURRENCY=25\nRELAY_CONCURRENCY_PER_PDS=20\nRELAY_MAX_QUEUE_PER_PDS=200\n\n#RELAY_ADMIN_KEY=CHANGEME\n```\n\n**UPDATE:** renamed `BGS_COMPACT_INTERVAL` to `RELAY_COMPACT_INTERVAL`, and added `RELAY_PERSISTER_DIR`.\n\nUncomment `RELAY_ADMIN_KEY` and set it to a strong random value, and substitute the database password chosen earlier into `DATABASE_URL` and `CARSTORE_DATABASE_URL`. You can create a random value with:\n\n```\nopenssl rand -base64 30\n```\n\nCreate a system-wide Caddy config at `/etc/caddy/Caddyfile`. Substitute in your hostname, and comment out any other lines in the file:\n\n```\nrelay-example.demo.bsky.dev {\n reverse_proxy 127.0.0.1:2470\n}\n```\n\nRestart caddy: `sudo systemctl restart caddy`\n\n## Running `bigsky` and Backfilling\n\nRun the actual service! For example, in a `screen` session, or a service management tool of your choice:\n\n```\ncd ~/indigo\n./bigsky --api-listen 127.0.0.1:2470\n```\n\nConfirm that everything is working by connecting using the `gosky` command from a laptop (which is in the `indigo` repo). Won’t get events (because the Relay hasn't subscribed to anything yet), but should connect successfully:\n\n```\ngosky readStream wss://relay-example.demo.bsky.dev\n```\n\nYou can also connect to the web management interface at \u003chttps://relay-example.demo.bsky.dev/dash\u003e. This lets you view basic stats per PDS, modify limits, add new PDS instances to crawl, take down individual repos (by DID), block PDS instances by domain suffix, etc.\n\n![Relay admin interface screenshot](https://morel.us-east.host.bsky.network/xrpc/com.atproto.sync.getBlob?did=did%3Aplc%3A44ybard66vv44zksje25o7dz\u0026cid=bafkreid4rb2kcokdrgpmj54wnsnfti2rkr34ebsik3rrrp342rumvtcjpi)\n\nTo start backfills from a laptop, create a `hosts.txt` file with PDS hostnames, then run the initial crawl command. 
These commands run from a checkout of the `indigo` repo:\n\n```\ncd ~/indigo/cmd/bigsky\nexport RELAY_ADMIN_KEY=CHANGEMESECRET\nexport RELAY_HOST=relay-example.demo.bsky.dev\n\ncat hosts.txt | parallel -j1 ./crawl_pds.sh {}\n```\n\nLet that bake for a few hours or overnight. Only accounts with new commits will get backfilled. A 24 hour period is usually around 10% of the network.\n\nThen can start explicit backfills per-PDS (\"resync\"). This will pull a complete list of DIDs hosted on the PDS (or at least, which the PDS *thinks* are still hosted on the PDS; this might not yet handle migrations). Don’t want to do full PDS backfills for all the big PDS instances at once, or the relay will get overwhelmed (eg, OOM). Instead, do 4-8 at a time, modifying `hosts.txt` or the `head` command as needed:\n\n```\nhead -n 4 hosts.txt | parallel -j1 ./sync_pds.sh {}\n\n# check progress\nhead -n 4 hosts.txt | parallel -j1 ./sync_status_pds.sh {}\n```\n\nSmaller self-hosted instances can be backfilled in big batches (eg, hundreds of backfills at the same time).\n\nWhile running backfill, some new PDS instances will be discovered, even if not crawled specifically, and even with “spidering” disabled. My guess is that accounts which migrate away from our PDS instances are still listed by the original PDS. When `bigsky` does a backfill, it resolves all the DIDs, and sees a different PDS in the atproto service entry, and adds it to the PDS list.\n\nThe entire backfill took a couple days of casual checking in and poking it along.\n\nHow does one get a complete list of PDS instances in the network? It would be helpful if Relays had a public endpoint to scroll through all known PDS instances, and indicate if they are active, blocked/suspended, and roughly how many repos there are. For now, you can pull hostnames from public listings like \u003chttps://blue.mackuba.eu/directory/\u003e and \u003chttps://bsky-debug.app/\u003e. 
Or scrape the complete DID PLC directory (which is public and enumerable), and extract all PDS service endpoints.\n\n## Rough Performance Stats\n\nHere are some quick/informal system performance snapshots. The backfill period (fetching all previous repo content from the network) is far more resource intensive than steady operation.\n\nI didn't run any compactions manually, and disabled automatic/periodic compactions. These are resource intensive to process, but free up disk and database space. Compactions are a feature specific to `bigsky` and its data storage system.\n\n**UPDATE:** you can re-enable compactions by editing the `RELAY_COMPACT_INTERVAL` environment variable. The default is `4h`; it is disabled (set to zero) in the template env file above.\n\nDuring the early phase of backfill:\n\n```\n# dstat\n\n----total-usage---- -dsk/total- -net/total- ---paging-- ---system--\nusr sys idl wai stl| read writ| recv send| in out | int csw \n 32 16 44 5 0| 17M 1410M| 20M 879k| 0 0 | 107k 173k\n 52 13 29 4 0| 13M 1495M| 18M 765k| 0 0 | 134k 117k\n 74 13 13 1 0| 20M 623M| 18M 722k| 0 0 | 47k 62k\n 26 14 52 6 0| 12M 1610M| 14M 613k| 0 0 | 133k 120k\n 51 16 30 2 0| 19M 928M| 18M 813k| 0 0 | 55k 118k\n 29 16 47 6 0| 15M 1842M| 13M 587k| 0 0 | 133k 114k\n 26 14 55 3 0| 16M 1124M| 12M 537k| 0 0 | 69k 165k\n 24 15 52 7 0| 14M 1600M| 13M 575k| 0 0 | 131k 138k\n 30 14 51 3 0| 16M 1041M| 18M 786k| 0 0 | 62k 155k\n 20 14 57 7 0| 13M 1719M|8916k 406k| 0 0 | 137k 121k\n```\n\n```\n# pg_analyze (as postgres user)\n\nPostgreSQL 16.3 - relay-ovh - postgres@/var/run/postgresql:5432/postgres - Ref.: 2s -\n * Global: 38 minutes uptime, 12.54G dbs size - 14.84M/s growth, 90.60% cache hit ratio\n Sessions: 81/100 total, 42 active, 39 idle, 0 idle in txn, 0 idle in txn abrt, 0 waiting\n Activity: 4382 tps, 74089 insert/s, 164 update/s, 0 delete/s, 28383 tuples returned/s, 0\n * Worker processes: 0/8 total, 0/4 logical workers, 0/8 parallel workers\n Other processes \u0026 info: 0/3 autovacuum 
workers, 0/10 wal senders, 0 wal receivers, 0/10\n * Mem.: 31.12G total, 756.90M (2.38%) free, 14.14G (45.44%) used, 16.24G (52.19%)\n Swap: 512.00M total, 511.00M (99.80%) free, 1.00M (0.20%) used\n IO: 155846/s max iops, 2.15K/s - 0/s read, 608.78M/s - 155846/s write\n Load average: 8.19 7.33 5.16\n```\n\nI don’t have stats, but at a later phase of backfill, I/O wait was pretty high and disk read/write were more symmetrical around 500MB/sec (should have taken a snapshot of that!), and CPU wait was only single-digit.\n\nAfter all major backfills, just cruising along at a normal firehose subscription:\n\n```\n# dstat\n\n----total-usage---- -dsk/total- -net/total- ---paging-- ---system--\nusr sys idl wai stl| read writ| recv send| in out | int csw \n 1 1 97 1 0|5794k 8046k| 232k 6851B| 0 0 |4439 7848 \n 1 0 98 0 0|3615k 7399k| 219k 5071B| 0 0 |4062 7163 \n 1 1 98 0 0|4645k 13M| 198k 5184B| 0 0 |4473 7194 \n 1 0 98 0 0|4831k 8142k| 242k 7273B| 0 0 |4174 7581 \n 1 0 98 0 0|3264k 7092k| 178k 4784B| 0 0 |3625 6619 \n 1 1 98 0 0|3564k 7336k| 153k 3394B| 0 0 |3253 5502 \n 1 0 98 0 0|4930k 9139k| 239k 6719B| 0 0 |4119 7364 \n 2 1 97 1 0|6430k 14M| 313k 10k| 0 0 |6372 13k\n 1 0 98 0 0|3359k 7422k| 172k 5255B| 0 0 |3670 6860 \n 1 0 98 0 0|3929k 10M| 206k 7088B| 0 0 |4036 8954 \n 1 1 98 0 0|3732k 7560k| 212k 6173B| 0 0 |3789 6771 \n 1 1 97 1 0|5694k 8819k| 267k 7758B| 0 0 |4630 8511 \n 1 0 98 0 0|3480k 11M| 175k 4565B| 0 0 |3764 5758\n```\n\n```\n# pg_analyze\n\nPostgreSQL 16.3 - relay-ovh - postgres@/var/run/postgresql:5432/postgres - Ref.: 2s -\n * Global: 4 days, 22 hours and 23 minutes uptime, 445.38G dbs size - 162.93K/s growth, 79.55% cache hit ratio\n Sessions: 22/100 total, 1 active, 21 idle, 0 idle in txn, 0 idle in\n Activity: 514 tps, 912 insert/s, 0 update/s, 0 delete/s, 1028 tuples returned/s, 0 temp files, 0B temp size\n * Worker processes: 0/8 total, 0/4 logical workers, 0/8 parallel workers\n Other processes \u0026 info: 0/3 autovacuum workers, 0/10 wal 
senders, 0\n * Mem.: 31.12G total, 676.83M (2.12%) free, 17.36G (55.78%) used, 13.10G (42.09%) buff+cached\n Swap: 512.00M total, 608.00K (0.12%) free, 511.40M (99.88%) used\n IO: 0/s max iops, 0B/s - 0/s read, 0B/s - 0/s write\n Load average: 0.46 0.44 0.38\n```\n\n```\n# df -h\nFilesystem Size Used Avail Use% Mounted on\ntmpfs 3.2G 1.6M 3.2G 1% /run\nefivarfs 192K 37K 151K 20% /sys/firmware/efi/efivars\n/dev/nvme0n1p3 1.8T 452G 1.2T 28% /\ntmpfs 16G 1.1M 16G 1% /dev/shm\ntmpfs 5.0M 0 5.0M 0% /run/lock\n/dev/nvme0n1p2 974M 182M 725M 21% /boot\n/dev/nvme0n1p1 511M 5.2M 506M 2% /boot/efi\n/dev/nvme1n1p1 1.8T 722G 1.1T 41% /data\ntmpfs 3.2G 12K 3.2G 1% /run/user/1000\n\n# df -i (inodes)\nFilesystem Inodes IUsed IFree IUse% Mounted on\ntmpfs 4078568 1019 4077549 1% /run\nefivarfs 0 0 0 - /sys/firmware/efi/efivars\n/dev/nvme0n1p3 117080064 230862 116849202 1% /\ntmpfs 4078568 3 4078565 1% /dev/shm\ntmpfs 4078568 3 4078565 1% /run/lock\n/dev/nvme0n1p2 65536 603 64933 1% /boot\n/dev/nvme0n1p1 0 0 0 - /boot/efi\n/dev/nvme1n1p1 187537280 25138236 162399044 14% /data\ntmpfs 815713 32 815681 1% /run/user/1000\n\n# sudo du -sh /var/lib/postgresql/16/\n447G /var/lib/postgresql/16/\n\n# lsblk (for reference)\nNAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS\nsda 8:0 1 0B 0 disk \nsr0 11:0 1 1024M 0 rom \nnvme1n1 259:0 0 1.7T 0 disk \n└─nvme1n1p1 259:7 0 1.7T 0 part /data\nnvme0n1 259:1 0 1.7T 0 disk \n├─nvme0n1p1 259:2 0 511M 0 part /boot/efi\n├─nvme0n1p2 259:3 0 1G 0 part /boot\n├─nvme0n1p3 259:4 0 1.7T 0 part /\n├─nvme0n1p4 259:5 0 512M 0 part [SWAP]\n└─nvme0n1p5 259:6 0 2M 0 part \n```\n\nPostgreSQL table sizes:\n\n```\nbgs=# SELECT\n table_name,\n pg_size_pretty(table_size) AS table_size,\n pg_size_pretty(indexes_size) AS indexes_size,\n pg_size_pretty(total_size) AS total_size\nFROM (\n SELECT\n table_name,\n pg_table_size(table_name) AS table_size,\n pg_indexes_size(table_name) AS indexes_size,\n pg_total_relation_size(table_name) AS total_size\n FROM (\n SELECT ('\"' || 
table_schema || '\".\"' || table_name || '\"') AS table_name\n FROM information_schema.tables\n WHERE table_schema != 'pg_catalog' AND table_schema != 'information_schema'\n ) AS all_tables\n ORDER BY total_size DESC\n) AS pretty_sizes;\n table_name | table_size | indexes_size | total_size \n-------------------------------+------------+--------------+------------\n \"public\".\"repo_event_records\" | 18 GB | 541 MB | 18 GB\n \"public\".\"actor_infos\" | 987 MB | 993 MB | 1980 MB\n \"public\".\"users\" | 749 MB | 1060 MB | 1809 MB\n \"public\".\"pds\" | 3752 kB | 32 kB | 3784 kB\n \"public\".\"auth_tokens\" | 16 kB | 48 kB | 64 kB\n \"public\".\"slurp_configs\" | 16 kB | 32 kB | 48 kB\n \"public\".\"feed_posts\" | 8192 bytes | 24 kB | 32 kB\n \"public\".\"vote_records\" | 8192 bytes | 16 kB | 24 kB\n \"public\".\"follow_records\" | 8192 bytes | 16 kB | 24 kB\n \"public\".\"domain_bans\" | 8192 bytes | 16 kB | 24 kB\n \"public\".\"repost_records\" | 8192 bytes | 8192 bytes | 16 kB\n(11 rows)\n\n\ncarstore=# SELECT\n table_name,\n pg_size_pretty(table_size) AS table_size,\n pg_size_pretty(indexes_size) AS indexes_size,\n pg_size_pretty(total_size) AS total_size\nFROM (\n SELECT\n table_name,\n pg_table_size(table_name) AS table_size,\n pg_indexes_size(table_name) AS indexes_size,\n pg_total_relation_size(table_name) AS total_size\n FROM (\n SELECT ('\"' || table_schema || '\".\"' || table_name || '\"') AS table_name\n FROM information_schema.tables\n WHERE table_schema != 'pg_catalog' AND table_schema != 'information_schema'\n ) AS all_tables\n ORDER BY total_size DESC\n) AS pretty_sizes;\n table_name | table_size | indexes_size | total_size \n-----------------------+------------+--------------+------------\n \"public\".\"block_refs\" | 192 GB | 217 GB | 409 GB\n \"public\".\"car_shards\" | 4011 MB | 4088 MB | 8098 MB\n \"public\".\"stale_refs\" | 6245 MB | 576 MB | 6821 MB\n(3 rows)\n```",
"createdAt": "2024-11-08T18:03:58.412Z",
"theme": "github-light",
"title": "Notes on Running a Full-Network atproto Relay (July 2024)",
"visibility": "public"
}