at://did:plc:44ybard66vv44zksje25o7dz/pub.leaflet.document/3m7e3hk57rs2u
{
"$type": "pub.leaflet.document",
"author": "did:plc:44ybard66vv44zksje25o7dz",
"description": "",
"pages": [
{
"$type": "pub.leaflet.pages.linearDocument",
"blocks": [
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://en.wikipedia.org/wiki/BEAM_(Erlang_virtual_machine)"
}
],
"index": {
"byteEnd": 217,
"byteStart": 213
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://en.wikipedia.org/wiki/Gleam_(programming_language)"
}
],
"index": {
"byteEnd": 285,
"byteStart": 280
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://slices.network/"
}
],
"index": {
"byteEnd": 305,
"byteStart": 299
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://github.com/bluesky-social/indigo/pull/1170"
}
],
"index": {
"byteEnd": 367,
"byteStart": 364
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://compare.hose.cam/"
}
],
"index": {
"byteEnd": 597,
"byteStart": 589
}
}
],
"plaintext": "A couple of weeks ago at Eurosky I talked to some friendly Erlang hackers who wanted to get involved with AT development. The AT network is a big data-intensive distributed system, and it is the sort of thing the BEAM runtime is well-suited for. I know Chad Miller has been using Gleam for parts of Slices, and services like relays, jetstream, and the forthcoming tap tool could all be re-implemented using Erlang-y tools. But I think these tools, written in Go, are already in a pretty good place: they are efficient enough, are not too hard to operate (IMO), and have capacity to scale. hose.cam lists 9 full-network relays from 7 distinct parties."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "I think the harder technical problem in the network today is patterns for big indexing: datastores for indexing billions of records. This is an area where there aren't clear choices, and folks are feeling pain. There probably isn't going to be one solution that works for everything, and more experiments and write-ups would be welcome, particularly from folks from outside the AT ecosystem who have a hammer they love (eg, a particular database) and are looking for a nail (a fun use case)."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "There definitely are folks doing great work in this direction already:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.unorderedList",
"children": [
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"children": [],
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://www.microcosm.blue/"
}
],
"index": {
"byteEnd": 9,
"byteStart": 0
}
}
],
"plaintext": "microcosm is a really impressive collection of generic AT services and APIs that work with all data in the network, is open source, and runs on affordable hardware"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"children": [],
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://www.graze.social/"
}
],
"index": {
"byteEnd": 12,
"byteStart": 0
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://bsky.app/profile/devingaffney.com/post/3m6x5vl65lc24"
}
],
"index": {
"byteEnd": 60,
"byteStart": 34
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://clickhouse.com/blog/building-a-medallion-architecture-for-bluesky-json-data-with-clickhouse"
}
],
"index": {
"byteEnd": 119,
"byteStart": 100
}
}
],
"plaintext": "graze.social uses Clickhouse, and talked about some benefits recently. Clickhouse itself also did a case study/tutorial using AT data earlier this year"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"children": [],
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://www.greenearth.social/p/introducing-greenearth"
}
],
"index": {
"byteEnd": 10,
"byteStart": 0
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://github.com/greenearth-social/ingex/tree/main/ingest"
}
],
"index": {
"byteEnd": 52,
"byteStart": 26
}
}
],
"plaintext": "greenearth is building an ingest and indexing system on Elasticsearch for their feed project"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"children": [],
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://newsletter.pragmaticengineer.com/p/bluesky"
}
],
"index": {
"byteEnd": 110,
"byteStart": 90
}
}
],
"plaintext": "Bluesky itself switched from PostgreSQL to ScyllaDB for our appview, as described in this April 2024 deep dive. That was mostly driven by request volume, not indexing workload"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"children": [],
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://whtwnd.com/futur.blue/3ls7sbvpsqc2w"
}
],
"index": {
"byteEnd": 23,
"byteStart": 11
}
}
],
"plaintext": "futur.blue demonstrated a near-full-network Bluesky appview running PostgreSQL on a $200/month Hetzner instance"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"children": [],
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://github.com/blacksky-algorithms/rsky/tree/rude1/backfill/rsky-wintermute"
}
],
"index": {
"byteEnd": 115,
"byteStart": 105
}
}
],
"plaintext": "Blacksky has been building a full-network Bluesky-compatible appview, using a custom Rust indexing tool (wintermute)"
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"children": [],
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://skyfeed.app/"
}
],
"index": {
"byteEnd": 7,
"byteStart": 0
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://github.com/skyfeed-dev"
}
],
"index": {
"byteEnd": 25,
"byteStart": 14
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://github.com/surrealdb/surrealdb"
}
],
"index": {
"byteEnd": 66,
"byteStart": 57
}
}
],
"plaintext": "skyfeed is an open source feed builder system which uses SurrealDB for indexing"
}
}
]
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Building alternative bluesky-compatible appviews is definitely a big motivating use case, but I think the need is broader than something that works for one project and codebase. At a minimum, it should be possible to develop new product features which would require additional data types and indices, like Blacksky is planning with community features. And we expect other projects and apps in the network to grow over time: we need to be ready for non-bsky record types with millions and billions of records, which will have their own unique indexing needs."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "The sweet spot to me is: what systems and design patterns work well to index tens of billions of records on low-end bare metal servers? By \"low-end\" I mean cheap dedicated servers on the order of $100 to $600 a month: very expensive compared to a Raspberry Pi or basic VPS, but much cheaper than buying a $20k to $50k closet monster, and usually a lot cheaper than a dedicated/managed database server (eg, a big AWS RDS instance). This class of machine usually has tens of GB of RAM, 12+ vCPUs, and most importantly, several TBytes of fast directly attached NVMe storage. You can get good deals on this sort of hardware from OVH or Hetzner; don't bother with cloud providers like AWS, GCP, or Azure. Ideally the datastore would support horizontal scalability for read load and availability, but still work well on a single instance for prototyping."
}
},
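{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "For a rough sense of scale (the per-record size here is my assumption, not a measurement): at an average of 300 bytes per record, 20 billion records works out to roughly 6 TB of raw data before any indices, which is why several TBytes of fast NVMe is the defining feature of this hardware class."
}
},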
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Some of the broad categories I can think of are:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.unorderedList",
"children": [
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"children": [],
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "getting regular PostgreSQL to scale better. I don't mean sharding or Citus or alternative data wrappers (though those approaches are also fine), I just mean running regular modern PostgreSQL with TBytes of storage and indices on a single machine. This would mostly be about careful schema and query design. The big win here is that PostgreSQL \"scales down\" very well, lots of developers are familiar with it, the pitfalls are known, and there is a lot of good tooling available. As a side note, the current Bluesky appview codebase includes a PG backend, but it isn't well optimized for current network scale."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"children": [],
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://github.com/vitessio/vitess"
}
],
"index": {
"byteEnd": 47,
"byteStart": 41
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://www.yugabyte.com/"
}
],
"index": {
"byteEnd": 59,
"byteStart": 49
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://www.pingcap.com/tidb/"
}
],
"index": {
"byteEnd": 65,
"byteStart": 61
}
},
{
"features": [
{
"$type": "pub.leaflet.richtext.facet#link",
"uri": "https://clickhouse.com/"
}
],
"index": {
"byteEnd": 77,
"byteStart": 67
}
}
],
"plaintext": "NewSQL distributed database systems like Vitess, yugabyteDB, TiDB, ClickHouse, and many others. These are a bit more familiar to work with (still SQL), though might require distinct data layouts or have other quirks."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"children": [],
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Columnar stores like Cassandra-compatibles, DuckDB, etc. Some of the NewSQL projects overlap with this category. The big strength is most likely compression efficiency, requiring much less disk space and enabling faster aggregation queries. The difficulties might be random write indexing performance (eg, from firehose), and cost/limitations on joins and secondary indices."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"children": [],
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Key/Value stores like FoundationDB and ScyllaDB (which might overlap with other categories). These often have great performance and operations, but big learning curves around data layout, and don't \"scale down\" well."
}
},
{
"$type": "pub.leaflet.blocks.unorderedList#listItem",
"children": [],
"content": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Hybrid, multi-modal, and \"all others\", like Elasticsearch."
}
}
]
}
},
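{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "To make the plain-PostgreSQL option concrete, here is a minimal sketch (in Go, against pgx) of the kind of schema and index thinking involved. The table layout, index names, partial-index choice, and connection string are all hypothetical illustrations, not a tested recommendation:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.code",
"language": "go",
"plaintext": "// Hypothetical sketch: one flat record table keyed by (did, collection, rkey),\n// plus a partial index so hot queries don't pay for the whole multi-TB table.\npackage main\n\nimport (\n    \"context\"\n    \"log\"\n\n    \"github.com/jackc/pgx/v5\"\n)\n\nconst schema = `\nCREATE TABLE IF NOT EXISTS record (\n    did        text NOT NULL,\n    collection text NOT NULL,\n    rkey       text NOT NULL,\n    cid        text NOT NULL,\n    raw        jsonb NOT NULL,\n    indexed_at timestamptz NOT NULL DEFAULT now(),\n    PRIMARY KEY (did, collection, rkey)\n);\n-- partial index: keeps one hot collection cheap to scan by recency\nCREATE INDEX IF NOT EXISTS record_recent_posts ON record (indexed_at)\n    WHERE collection = 'app.bsky.feed.post';`\n\nfunc main() {\n    ctx := context.Background()\n    conn, err := pgx.Connect(ctx, \"postgres://localhost/atindex\")\n    if err != nil {\n        log.Fatal(err)\n    }\n    defer conn.Close(ctx)\n    if _, err := conn.Exec(ctx, schema); err != nil {\n        log.Fatal(err)\n    }\n}"
}
},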
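{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "And to illustrate the key/value learning curve: every scan pattern you will ever want has to be baked into the key layout at write time, because there is no ad-hoc secondary index to fall back on. This key scheme is a toy example I made up, not how any ScyllaDB or FoundationDB deployment in the network actually lays out data:"
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.code",
"language": "go",
"plaintext": "// Toy key layout for an ordered key/value store: the same record is written\n// under two keys, because each scan pattern needs its own clustering order.\npackage main\n\nimport \"fmt\"\n\n// recordKey clusters by repo: cheap \"all records for one DID\" range scans.\nfunc recordKey(did, collection, rkey string) []byte {\n    return []byte(fmt.Sprintf(\"rec/%s/%s/%s\", did, collection, rkey))\n}\n\n// collectionKey is a second, denormalized write of the same record, ordered\n// by collection then revision, for backfill-style \"scan one lexicon\" reads.\nfunc collectionKey(collection, rev, did, rkey string) []byte {\n    return []byte(fmt.Sprintf(\"col/%s/%s/%s/%s\", collection, rev, did, rkey))\n}\n\nfunc main() {\n    did, rkey, rev := \"did:plc:example1234abcdef\", \"3kabc123\", \"3m7e3ho3dw22e\"\n    fmt.Printf(\"%s\\n\", recordKey(did, \"app.bsky.feed.post\", rkey))\n    fmt.Printf(\"%s\\n\", collectionKey(\"app.bsky.feed.post\", rev, did, rkey))\n}"
}
},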
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "Datastores are often branded or perceived as more transaction-oriented (OLTP) or analytics-oriented (OLAP), but can often work well enough for the opposite use case, especially if there is flexibility around performance or eventual consistency."
}
},
{
"$type": "pub.leaflet.pages.linearDocument#block",
"block": {
"$type": "pub.leaflet.blocks.text",
"facets": [],
"plaintext": "What would love to see emerge is a bunch of blog posts and trip reports talking about big AT data indexing attempts, and what the resource costs, bottlenecks, and pain points were. Maybe even a benchmark/leaderboard could emerge around how long it takes to backfill the full network and how much it costs (though this might be reductive and hard to do fair comparisons). I'm less interested in making a big list of hypothetical options, or \"what about XYZ\" questions: there are a bajillion ideas and options, we need real attempts for real use cases."
}
}
],
"id": "019af5d7-5489-7332-99eb-b18d01aae61a"
}
],
"postRef": {
"cid": "bafyreicbhnl6pslgjitntrbkjyoovdltlzyeqosyvmy6q7h6cnavqnpame",
"commit": {
"cid": "bafyreiemx4soxo2l4ycy63mx3bnufu4p337ux43iamrczkvkwvb7vs2wga",
"rev": "3m7e3ho3dw22e"
},
"uri": "at://did:plc:44ybard66vv44zksje25o7dz/app.bsky.feed.post/3m7e3hnyh5c2u",
"validationStatus": "valid"
},
"publication": "at://did:plc:44ybard66vv44zksje25o7dz/pub.leaflet.publication/3m2x76zrtrs23",
"publishedAt": "2025-12-06T22:48:06.795Z",
"title": "Big Indexing"
}