at://bnewbold.net/com.whtwnd.blog.entry/3kwzl7tye6u2y

# Notes on Running a Full-Network atproto Relay (July 2024)

These are some informal notes on setting up a full-network atproto Relay, using the `bigsky` relay software developed by Bluesky. This is the same software we run ourselves at <https://bsky.network>. The focus here is on the compute resources necessary to replicate the type of full-network, full-featured service that Bluesky currently operates, at the size of the network that exists today.

The demo Relay described here is running at `relay-ovh.demo.bsky.dev`. It handles crawling and re-publishing the full network firehose, with headroom for traffic spikes and growth on a number of dimensions: new accounts (repos), new PDS instances, more content being created (firehose event rate), number of services consuming from the firehose, etc.

## Changes

Running through this demo setup turned up some sharp edges and missing configuration knobs in `bigsky`. For example, trying to backfill the full network with the default configuration resulted in OOM errors on an instance this size. Tweaks and configuration have been merged to the `main` branch of `bigsky`, along with additions to [the README](https://github.com/bluesky-social/indigo/tree/main/cmd/bigsky).

## Network Scaling

How big can this instance scale? Hard to tell exactly; my guess is that it could handle an order of magnitude more event rate, but would run out of disk before too long (eg, in the next year).

There are a number of possibilities for improving Relay efficiency to make this kind of service cheaper. There are implementation details (like using alternative database engines, or not storing repo data as millions of small files on disk). Data and work could be sharded across multiple machines. Not every Relay needs to crawl and mirror the entire network, and Relays could potentially be simplified to not maintain a full mirror of network content. On the other hand, actually running critical services would have a number of needs not covered here: legal and administrative burdens, monitoring and alerting, etc.

## Shopping for a Server

My assumption is that the main thing needed is a relatively large and reasonably fast disk. The Bluesky production Relays currently use about 1 TByte for PostgreSQL and 1 TByte for CAR storage on local disk. The CAR storage filesystem should also be XFS (not ext4), to handle many millions of small files. When shopping for instances I looked for around 2 TByte for PostgreSQL and 2 TByte for CAR storage, so that this setup would be realistic even with growth of the network over time. This storage could all be one disk/filesystem or two separate disks/filesystems. Regular SSD would probably work fine; NVMe is nice, especially for backfill.

More RAM always helps (page cache and other caches). Don't need much CPU; the relay process is highly concurrent, but mostly I/O bound. Do want a decent monthly network quota.

Disk is definitely the hard part. Network block storage (eg, AWS EBS) is pretty expensive even from cheaper providers, and usually costs more monthly than an entire bare metal instance with larger disks. Bare metal instances are mostly spinning disk or NVMe, not SSD; I assume that spinning disk isn't realistic for a fast backfill demo.

I ended up selecting an OVH instance for about $150/month:

- `ADVANCE-2-LE`: https://www.ovhcloud.com/en/bare-metal/advance/adv-2/
- 12 vCPU (Intel Xeon-E 2136 - 6c/12t - 3.3 GHz/4.5 GHz)
- 32 GB RAM (32 GB ECC 2666 MHz)
- disks: 2×1.92 TB NVMe
- 1Gbit/s unmetered and guaranteed
- $152/month plus one-time $92 setup fee (no commitment)

That exact config isn't available now (a week later), but a very similar one is:

- `ADVANCE-1`: https://www.ovhcloud.com/en/bare-metal/advance/adv-1/
- 12 vCPU (AMD EPYC 4244P - 6c/12t - 3.8GHz/5.1GHz)
- 32 GB RAM (32GB DDR5 ECC 5200MHz)
- rootfs disk: 2x NVMe 960GB (RAID)
- data disks: 2x 1.92TB NVMe
- 1Gbit/s unmetered and guaranteed
- $153/month plus one-time $93 setup (no commitment)

In both cases the setup fee is waived with a 6 month commitment, and there are discounts on the monthly rate with longer commitments.

## Host Provisioning

Using the OVH web interface, provisioned the server with Ubuntu 24.04. With the `ADVANCE-2-LE` host, I specified partitioning to not use RAID. I let the setup wizard use one of the two disks for rootfs, boot, and swap. With all defaults this resulted in ext4. The second disk was not partitioned or configured using the wizard (I got to that later on the server itself).

Configured a DNS A record to point at the IPv4 address that OVH gave us.

Logged in to the server and ran commands similar to this:

```
hostnamectl hostname relay-example.demo.bsky.dev

apt update
apt upgrade
apt install ripgrep fd-find dstat htop iotop iftop pg-activity httpie caddy golang postgresql yarnpkg

# set up yarn command; could also have used nvm
ln -s /usr/bin/yarnpkg /usr/bin/yarn

# punch holes in default firewall for HTTP/S
ufw allow 80/tcp
ufw allow 443/tcp
```

Ran through partitioning of the second NVMe with XFS. Note that on a real machine you'd want to set up `fstab` so this mounts automatically on a reboot (an example entry is sketched below).

```
# create a partition
sudo fdisk /dev/nvme1n1
# n (new partition), default (primary), default (1), default (start sector), default (entire disk), w (write)

# create XFS filesystem on that partition
sudo mkfs.xfs /dev/nvme1n1p1

# mount that filesystem to /data
sudo mkdir -p /data
sudo mount /dev/nvme1n1p1 /data
```
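
A minimal sketch of that `fstab` entry, assuming the partition layout above (pull the real UUID from `blkid /dev/nvme1n1p1`; `noatime` is an optional choice, not part of the original setup):

```
# /etc/fstab -- mount the XFS data partition automatically at boot
UUID=<uuid-from-blkid>  /data  xfs  defaults,noatime  0  2
```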

Pull the indigo codebase and build; ran this as the `ubuntu` user, not `root`:

```
# depending on user that will be running the service
mkdir -p /data/bigsky
mkdir -p /data/bigsky/events
sudo chown ubuntu:ubuntu /data/bigsky/
sudo chown ubuntu:ubuntu /data/bigsky/events

# pull source code and build. if you had patches or a working branch, would modify here
cd
git clone https://github.com/bluesky-social/indigo
cd indigo
make build-relay-ui build
```

Configure PostgreSQL (`sudo -u postgres psql`); replace CHANGEME with a secure password of your choice:

```
CREATE DATABASE bgs;
CREATE DATABASE carstore;

CREATE USER bigsky WITH PASSWORD 'CHANGEME';
GRANT ALL PRIVILEGES ON DATABASE bgs TO bigsky;
GRANT ALL PRIVILEGES ON DATABASE carstore TO bigsky;

-- these are needed for newer versions of postgres
\c bgs postgres
GRANT ALL ON SCHEMA public TO bigsky;

\c carstore postgres
GRANT ALL ON SCHEMA public TO bigsky;
```

Create a config file at `~/indigo/.env`:

```
ENVIRONMENT=production
DATABASE_URL="postgres://bigsky:CHANGEME@localhost:5432/bgs"
CARSTORE_DATABASE_URL="postgres://bigsky:CHANGEME@localhost:5432/carstore"
DATA_DIR=/data/bigsky
RELAY_PERSISTER_DIR=/data/bigsky/events
GOLOG_LOG_LEVEL=info
# or whatever DNS you want to use for handle resolution
RESOLVE_ADDRESS="8.8.8.8:53"
FORCE_DNS_UDP=true
RELAY_COMPACT_INTERVAL=0
RELAY_DEFAULT_REPO_LIMIT=500000

# these were somewhat tuned to this instance size
MAX_CARSTORE_CONNECTIONS=12
MAX_METADB_CONNECTIONS=12
MAX_FETCH_CONCURRENCY=25
RELAY_CONCURRENCY_PER_PDS=20
RELAY_MAX_QUEUE_PER_PDS=200

#RELAY_ADMIN_KEY=CHANGEME
```

**UPDATE:** renamed `BGS_COMPACT_INTERVAL` to `RELAY_COMPACT_INTERVAL`, and added `RELAY_PERSISTER_DIR`.

Set `RELAY_ADMIN_KEY` (commented out in the template above) to a strong random value, and substitute the earlier database password into `DATABASE_URL` and `CARSTORE_DATABASE_URL`. You can generate a suitable random key with:

```
openssl rand -base64 30
```

Create a system-wide Caddy config at `/etc/caddy/Caddyfile`. Substitute in your hostname, and comment out any other lines in the file:

```
relay-example.demo.bsky.dev {
  reverse_proxy 127.0.0.1:2470
}
```

Restart caddy: `sudo systemctl restart caddy`

## Running `bigsky` and Backfilling

Run the actual service! For example, in a `screen` session, or under a service management tool of your choice:

```
cd ~/indigo
./bigsky --api-listen 127.0.0.1:2470
```

Confirm that everything is working by connecting with the `gosky` command (also in the `indigo` repo) from a laptop. You won't get any events (the Relay hasn't subscribed to anything yet), but it should connect successfully:

```
gosky readStream wss://relay-example.demo.bsky.dev
```

You can also connect to the web management interface at <https://relay-example.demo.bsky.dev/dash>. This lets you view basic stats per PDS, modify limits, add new PDS instances to crawl, take down individual repos (by DID), block PDS instances by domain suffix, etc.

![Relay admin interface screenshot](https://morel.us-east.host.bsky.network/xrpc/com.atproto.sync.getBlob?did=did%3Aplc%3A44ybard66vv44zksje25o7dz&cid=bafkreid4rb2kcokdrgpmj54wnsnfti2rkr34ebsik3rrrp342rumvtcjpi)

To start backfills, create a `hosts.txt` file with PDS hostnames, then run the initial crawl command. This can be done from a laptop:

```
cd ~/indigo/cmd/bigsky
export RELAY_ADMIN_KEY=CHANGEMESECRET
export RELAY_HOST=relay-example.demo.bsky.dev

cat hosts.txt | parallel -j1 ./crawl_pds.sh {}
```
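
If you'd rather request a crawl by hand instead of via the helper script, the public `com.atproto.sync.requestCrawl` XRPC endpoint takes a single PDS hostname. A sketch using `httpie` (installed earlier); the hostname is a placeholder, and the helper scripts may go through the admin API instead, so treat this as an approximation:

```
# ask the relay to start crawling a single PDS host
http POST https://relay-example.demo.bsky.dev/xrpc/com.atproto.sync.requestCrawl \
    hostname=pds.example.com
```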

Let that bake for a few hours or overnight. Only accounts with new commits will get backfilled; a 24 hour period usually covers around 10% of the network.

Then you can start explicit backfills per-PDS ("resync"). This will pull a complete list of DIDs hosted on the PDS (or at least, the DIDs the PDS *thinks* it still hosts; this might not yet handle migrations). You don't want to do full PDS backfills for all the big PDS instances at once, or the relay will get overwhelmed (eg, OOM). Instead, do 4-8 at a time, modifying `hosts.txt` or the `head` command as needed:

```
head -n 4 hosts.txt | parallel -j1 ./sync_pds.sh {}

# check progress
head -n 4 hosts.txt | parallel -j1 ./sync_status_pds.sh {}
```

Smaller self-hosted instances can be backfilled in big batches (eg, hundreds of backfills at the same time).

While running backfill, some new PDS instances will be discovered, even if not crawled specifically, and even with “spidering” disabled. My guess is that accounts which have migrated away from our PDS instances are still listed by the original PDS; when `bigsky` does a backfill, it resolves all the DIDs, sees a different PDS in the atproto service entry, and adds it to the PDS list.

The entire backfill took a couple days of casual checking in and poking it along.

How does one get a complete list of PDS instances in the network? It would be helpful if Relays had a public endpoint to scroll through all known PDS instances, and indicate whether they are active, blocked/suspended, and roughly how many repos they host. For now, you can pull hostnames from public listings like <https://blue.mackuba.eu/directory/> and <https://bsky-debug.app/>. Or scrape the complete DID PLC directory (which is public and enumerable), and extract all PDS service endpoints.
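
For the PLC-scraping approach, here is a rough sketch of pulling one page of the directory export and extracting PDS endpoints. It assumes `jq` is installed; the export returns JSON Lines and is paginated, so a complete scrape would loop with the `after` parameter set to the last `createdAt` seen, and the `.operation.services.atproto_pds.endpoint` path only matches current-style operations (older legacy entries have a different shape and are skipped here):

```
# one page of PLC directory operations; pull out atproto PDS service endpoints
curl -s 'https://plc.directory/export?count=1000' \
  | jq -r '.operation.services.atproto_pds.endpoint // empty' \
  | sort -u
```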

## Rough Performance Stats

Here are some quick/informal system performance snapshots. The backfill period (fetching all previous repo content from the network) is far more resource intensive than steady operation.

I didn't run any compactions manually, and disabled automatic/periodic compactions. These are resource intensive to process, but free up disk and database space. Compactions are a feature specific to `bigsky` and its data storage system.

**UPDATE:** you can re-enable compactions by editing the `RELAY_COMPACT_INTERVAL` environment variable. The default is `4h`; it is disabled (set to zero) in the template env file above.

During an early phase of backfill:

```
# dstat

----total-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw 
 32  16  44   5   0|  17M 1410M|  20M  879k|   0     0 | 107k  173k
 52  13  29   4   0|  13M 1495M|  18M  765k|   0     0 | 134k  117k
 74  13  13   1   0|  20M  623M|  18M  722k|   0     0 |  47k   62k
 26  14  52   6   0|  12M 1610M|  14M  613k|   0     0 | 133k  120k
 51  16  30   2   0|  19M  928M|  18M  813k|   0     0 |  55k  118k
 29  16  47   6   0|  15M 1842M|  13M  587k|   0     0 | 133k  114k
 26  14  55   3   0|  16M 1124M|  12M  537k|   0     0 |  69k  165k
 24  15  52   7   0|  14M 1600M|  13M  575k|   0     0 | 131k  138k
 30  14  51   3   0|  16M 1041M|  18M  786k|   0     0 |  62k  155k
 20  14  57   7   0|  13M 1719M|8916k  406k|   0     0 | 137k  121k
```

```
# pg_activity (as postgres user)

PostgreSQL 16.3 - relay-ovh - postgres@/var/run/postgresql:5432/postgres - Ref.: 2s -
 * Global: 38 minutes uptime, 12.54G dbs size - 14.84M/s growth, 90.60% cache hit ratio
   Sessions: 81/100 total, 42 active, 39 idle, 0 idle in txn, 0 idle in txn abrt, 0 waiting
   Activity: 4382 tps, 74089 insert/s, 164 update/s, 0 delete/s, 28383 tuples returned/s, 0
 * Worker processes: 0/8 total, 0/4 logical workers, 0/8 parallel workers
   Other processes & info: 0/3 autovacuum workers, 0/10 wal senders, 0 wal receivers, 0/10
 * Mem.: 31.12G total, 756.90M (2.38%) free, 14.14G (45.44%) used, 16.24G (52.19%)
   Swap: 512.00M total, 511.00M (99.80%) free, 1.00M (0.20%) used
   IO: 155846/s max iops, 2.15K/s - 0/s read, 608.78M/s - 155846/s write
   Load average: 8.19 7.33 5.16
```

I don't have stats, but at a later phase of backfill, I/O wait was pretty high and disk read/write were more symmetrical, around 500MB/sec (should have taken a snapshot of that!), and CPU wait was only single-digit.
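
If you want a record of the heavy backfill phase rather than wishing for a snapshot after the fact, `dstat` can log its samples to a CSV file while still printing to the terminal; something like the following (interval and filename are just illustrative):

```
# sample every 10 seconds and also append to a CSV file for later reference
dstat --time --cpu --disk --net --output backfill-dstat.csv 10
```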

After all major backfills, just cruising along at a normal firehose subscription:

```
# dstat

----total-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw 
  1   1  97   1   0|5794k 8046k| 232k 6851B|   0     0 |4439  7848 
  1   0  98   0   0|3615k 7399k| 219k 5071B|   0     0 |4062  7163 
  1   1  98   0   0|4645k   13M| 198k 5184B|   0     0 |4473  7194 
  1   0  98   0   0|4831k 8142k| 242k 7273B|   0     0 |4174  7581 
  1   0  98   0   0|3264k 7092k| 178k 4784B|   0     0 |3625  6619 
  1   1  98   0   0|3564k 7336k| 153k 3394B|   0     0 |3253  5502 
  1   0  98   0   0|4930k 9139k| 239k 6719B|   0     0 |4119  7364 
  2   1  97   1   0|6430k   14M| 313k   10k|   0     0 |6372    13k
  1   0  98   0   0|3359k 7422k| 172k 5255B|   0     0 |3670  6860 
  1   0  98   0   0|3929k   10M| 206k 7088B|   0     0 |4036  8954 
  1   1  98   0   0|3732k 7560k| 212k 6173B|   0     0 |3789  6771 
  1   1  97   1   0|5694k 8819k| 267k 7758B|   0     0 |4630  8511 
  1   0  98   0   0|3480k   11M| 175k 4565B|   0     0 |3764  5758
```

```
# pg_activity

PostgreSQL 16.3 - relay-ovh - postgres@/var/run/postgresql:5432/postgres - Ref.: 2s -
 * Global: 4 days, 22 hours and 23 minutes uptime, 445.38G dbs size - 162.93K/s growth, 79.55% cache hit ratio
   Sessions: 22/100 total, 1 active, 21 idle, 0 idle in txn, 0 idle in
   Activity: 514 tps, 912 insert/s, 0 update/s, 0 delete/s, 1028 tuples returned/s, 0 temp files, 0B temp size
 * Worker processes: 0/8 total, 0/4 logical workers, 0/8 parallel workers
   Other processes & info: 0/3 autovacuum workers, 0/10 wal senders, 0
 * Mem.: 31.12G total, 676.83M (2.12%) free, 17.36G (55.78%) used, 13.10G (42.09%) buff+cached
   Swap: 512.00M total, 608.00K (0.12%) free, 511.40M (99.88%) used
   IO: 0/s max iops, 0B/s - 0/s read, 0B/s - 0/s write
   Load average: 0.46 0.44 0.38
```

```
# df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.2G  1.6M  3.2G   1% /run
efivarfs        192K   37K  151K  20% /sys/firmware/efi/efivars
/dev/nvme0n1p3  1.8T  452G  1.2T  28% /
tmpfs            16G  1.1M   16G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/nvme0n1p2  974M  182M  725M  21% /boot
/dev/nvme0n1p1  511M  5.2M  506M   2% /boot/efi
/dev/nvme1n1p1  1.8T  722G  1.1T  41% /data
tmpfs           3.2G   12K  3.2G   1% /run/user/1000

# df -i (inodes)
Filesystem        Inodes    IUsed     IFree IUse% Mounted on
tmpfs            4078568     1019   4077549    1% /run
efivarfs               0        0         0     - /sys/firmware/efi/efivars
/dev/nvme0n1p3 117080064   230862 116849202    1% /
tmpfs            4078568        3   4078565    1% /dev/shm
tmpfs            4078568        3   4078565    1% /run/lock
/dev/nvme0n1p2     65536      603     64933    1% /boot
/dev/nvme0n1p1         0        0         0     - /boot/efi
/dev/nvme1n1p1 187537280 25138236 162399044   14% /data
tmpfs             815713       32    815681    1% /run/user/1000

# sudo du -sh /var/lib/postgresql/16/
447G    /var/lib/postgresql/16/

# lsblk (for reference)
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda           8:0    1    0B  0 disk 
sr0          11:0    1 1024M  0 rom  
nvme1n1     259:0    0  1.7T  0 disk 
└─nvme1n1p1 259:7    0  1.7T  0 part /data
nvme0n1     259:1    0  1.7T  0 disk 
├─nvme0n1p1 259:2    0  511M  0 part /boot/efi
├─nvme0n1p2 259:3    0    1G  0 part /boot
├─nvme0n1p3 259:4    0  1.7T  0 part /
├─nvme0n1p4 259:5    0  512M  0 part [SWAP]
└─nvme0n1p5 259:6    0    2M  0 part 
```
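
The inode count on `/data` above is presumably mostly individual CAR shard files under the `DATA_DIR` configured earlier; a blunt way to check the raw file count (slow with tens of millions of files, but illustrative):

```
# count files under the carstore data directory; expect this to take a while
sudo find /data/bigsky -type f | wc -l
```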

PostgreSQL table sizes:

```
bgs=# SELECT
    table_name,
    pg_size_pretty(table_size) AS table_size,
    pg_size_pretty(indexes_size) AS indexes_size,
    pg_size_pretty(total_size) AS total_size
FROM (
    SELECT
        table_name,
        pg_table_size(table_name) AS table_size,
        pg_indexes_size(table_name) AS indexes_size,
        pg_total_relation_size(table_name) AS total_size
    FROM (
        SELECT ('"' || table_schema || '"."' || table_name || '"') AS table_name
        FROM information_schema.tables
        WHERE table_schema != 'pg_catalog' AND table_schema != 'information_schema'
    ) AS all_tables
    ORDER BY total_size DESC
) AS pretty_sizes;
          table_name           | table_size | indexes_size | total_size 
-------------------------------+------------+--------------+------------
 "public"."repo_event_records" | 18 GB      | 541 MB       | 18 GB
 "public"."actor_infos"        | 987 MB     | 993 MB       | 1980 MB
 "public"."users"              | 749 MB     | 1060 MB      | 1809 MB
 "public"."pds"                | 3752 kB    | 32 kB        | 3784 kB
 "public"."auth_tokens"        | 16 kB      | 48 kB        | 64 kB
 "public"."slurp_configs"      | 16 kB      | 32 kB        | 48 kB
 "public"."feed_posts"         | 8192 bytes | 24 kB        | 32 kB
 "public"."vote_records"       | 8192 bytes | 16 kB        | 24 kB
 "public"."follow_records"     | 8192 bytes | 16 kB        | 24 kB
 "public"."domain_bans"        | 8192 bytes | 16 kB        | 24 kB
 "public"."repost_records"     | 8192 bytes | 8192 bytes   | 16 kB
(11 rows)


carstore=# SELECT
    table_name,
    pg_size_pretty(table_size) AS table_size,
    pg_size_pretty(indexes_size) AS indexes_size,
    pg_size_pretty(total_size) AS total_size
FROM (
    SELECT
        table_name,
        pg_table_size(table_name) AS table_size,
        pg_indexes_size(table_name) AS indexes_size,
        pg_total_relation_size(table_name) AS total_size
    FROM (
        SELECT ('"' || table_schema || '"."' || table_name || '"') AS table_name
        FROM information_schema.tables
        WHERE table_schema != 'pg_catalog' AND table_schema != 'information_schema'
    ) AS all_tables
    ORDER BY total_size DESC
) AS pretty_sizes;
      table_name       | table_size | indexes_size | total_size 
-----------------------+------------+--------------+------------
 "public"."block_refs" | 192 GB     | 217 GB       | 409 GB
 "public"."car_shards" | 4011 MB    | 4088 MB      | 8098 MB
 "public"."stale_refs" | 6245 MB    | 576 MB       | 6821 MB
(3 rows)
```
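
As a final rough check after backfill, counting rows in the `users` table shown above should roughly correspond to the number of repos the relay is tracking (an assumption based on the table name, not something documented here):

```
# rough repo count; assumes one row per tracked repo in the bgs users table
sudo -u postgres psql bgs -c 'SELECT count(*) FROM users;'
```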