The Future Data Stack: Why We Built RoboSystems

The Modern Data Stack Is Breaking

Take a look at the typical enterprise data architecture in 2026:

Source Systems → Fivetran → Snowflake → dbt → Reverse ETL → Downstream Apps
                    ↓            ↓         ↓
              Airbyte      Databricks    Hightouch
                    ↓            ↓         ↓
              Stitch       BigQuery      Census

Each arrow is a point of failure. Each tool is a vendor contract. Each hop adds latency, cost, and complexity.

We've created a Rube Goldberg machine where data travels through seven systems just to answer a simple question. And when something breaks at 2 AM? Good luck figuring out which of your seventeen integrations failed.

This isn't sustainable. It never was.

What Comes Next

The next generation of data platforms won't look like this. They'll be characterized by three fundamental shifts:

Unified databases that handle transactions and analytics without copying data between systems
Observable orchestration where every pipeline run is traceable, replayable, and auditable
Semantic layers that give AI agents actual understanding, not just data access

We've spent years building RoboSystems around these principles. Not because we predicted the future—but because we refused to accept the present.

Orchestration You Can Actually See

Here's the dirty secret of most data infrastructure: nobody knows what's running.

Cron jobs fire into the void. Airflow DAGs fail silently. Lambda functions timeout with cryptic errors. When the CEO asks "why is this number wrong?" the answer is usually "let me check CloudWatch logs for the next three hours."

We built our orchestration layer on Dagster because it solves a different problem than most orchestration tools: visibility.

Every job run is logged. Every failure is traceable. Every asset has lineage. When something breaks, you know what broke, when it broke, and what data was affected.

# Assets define WHAT should exist, not scripts that run
@asset(
    deps=["sec_raw_filings"],
    group_name="sec_pipeline",
)
def sec_graph_materialized(
    context: AssetExecutionContext,
    graph: GraphResource,
    staging_result: dict,
) -> dict:
    """Materialize staged SEC data into the knowledge graph."""
    graph_id = staging_result["graph_id"]

    # Dagster tracks: what ran, when, what it produced, what failed
    result = graph.materialize_staged_data(graph_id, staging_result["tables"])

    context.log.info(f"Materialized {result['nodes_created']} nodes")
    return result

This is asset-centric thinking. You don't write scripts that "run the SEC pipeline." You define assets—SEC filings, staged data, materialized graphs—and Dagster figures out what needs to run to make them exist.

The difference matters when something goes wrong:

| Traditional Orchestration | Asset-Centric Orchestration | | ------------------------- | ----------------------------------- | | "The 2 AM job failed" | "The SEC Q4 filings asset is stale" | | "Check the logs" | "Here's the exact run that failed" | | "Re-run everything" | "Re-materialize just this asset" | | "Hope it works this time" | "Replay with the same inputs" |

We still run scheduled jobs—monthly billing, daily cleanup, nightly processing. But every run is observable. Every asset has provenance. When the auditor asks "where did this number come from?" you can trace it back to the exact source file, transformation, and timestamp.

This matters more than ever as AI enters the picture. Agents need to trust the data they're working with. If you can't prove where a number came from, AI can't rely on it—and neither can you. Observable pipelines don't just help debugging. They make data AI-ready.

The Unified Database Reality

Every few years, someone promises a database that handles everything. They usually fail because "everything" is too broad.

We took a different approach: build for analytical graph workloads specifically. LadybugDB, our embedded graph engine, combines:

Columnar storage for the aggregations financial analysis actually needs
Graph semantics for the relationships that make financial data meaningful
Embedded architecture for true multi-tenant isolation

This isn't a general-purpose database trying to do everything. It's a specialized engine that does one thing exceptionally well: make financial relationships queryable at scale.

// One query that would require joins across five tables in SQL
MATCH (company:Entity)-[:FILED]->(report:Report)
      -[:CONTAINS]->(fact:Fact)-[:MEASURES]->(element:Element)
WHERE element.qname = 'us-gaap:Revenues'
  AND report.period_end >= date('2024-01-01')
RETURN company.name, fact.numeric_value, report.period_end
ORDER BY company.name, report.period_end

The query doesn't just retrieve data—it traverses relationships. Revenue connects to reports connects to companies connects to industries. The graph isn't an alternative to SQL; it's what SQL was always trying to be.

Semantic Understanding, Not Just Data Access

Here's the uncomfortable truth about most AI-powered data tools: they retrieve information, but they don't understand meaning.

RAG systems fetch relevant chunks. Text-to-SQL tools generate queries. But when your CFO asks "What's driving our margin compression?"—retrieving the top 10 relevant documents isn't the same as understanding causality. Pattern-matching against your schema isn't reasoning.

Knowledge graphs change this equation. When your data is stored as explicit relationships, AI agents can traverse meaning, not just retrieve text:

Revenue is connected to Customers
Customers are connected to Segments
Segments have different Profitability metrics
Profitability affects Cash Flow projections

This isn't semantic search—it's semantic reasoning. The difference matters when you ask: "What happens to our cash position if we lose our top three customers?"

A SQL generator gives you a query template. A graph-aware AI agent traces the actual relationships in your business to model the cascade effect.

Infrastructure You Actually Own

Here's something the enterprise software industry doesn't want you to know: you can run your own data infrastructure.

We built RoboSystems to deploy in your AWS account with a single command:

just bootstrap              # Set up OIDC federation
just deploy prod            # Deploy everything

That's it. No SaaS dependency. No data leaving your network. No per-seat licensing that scales with your headcount.

What gets deployed:

| Component | Service | Purpose | | ------------- | ----------- | ----------------------------- | | API | ECS Fargate | Application endpoints | | Orchestration | ECS Fargate | Dagster workers and scheduler | | PostgreSQL | RDS | User accounts and billing | | LadybugDB | EC2 ARM64 | Graph storage and queries | | Cache | ElastiCache | Session and query caching |

The entire stack uses GitHub OIDC federation—no AWS credentials stored anywhere. One-hour sessions that can't be abused if compromised.

This isn't just a preference—it's becoming a fiduciary responsibility. Regulations tighten every year. Your clients expect to know where their data lives. Your auditors need to verify your controls. "It's in some vendor's cloud" is no longer an acceptable answer for sensitive financial data.

Data sovereignty isn't paranoia. It's prudence.

Fork-Friendly by Design

Most open-source projects make the same mistake: they assume you'll contribute upstream or stay on the main branch forever.

Real enterprises don't work that way. You need custom integrations. Proprietary data sources. Industry-specific logic.

We designed RoboSystems with explicit merge boundaries:

adapters/
├── sec/                 # ← Upstream maintains
├── quickbooks/          # ← Upstream maintains
├── plaid/               # ← Upstream maintains
│
└── custom_*/            # ← Your namespace (we never touch)
    ├── custom_erp/      #    Your ERP integration
    ├── custom_bank/     #    Your bank API
    └── custom_crm/      #    Your CRM connector

The custom_* namespace is a contract: we will never create files there. Your fork can pull upstream updates without merge conflicts, without losing your customizations, without worrying about what we changed.

This extends to Dagster assets, schema extensions, and operation modules. Build what you need. Keep getting updates.

The Jobs That Run Your Business

Orchestration isn't just about data pipelines—it's about the operational heartbeat of your platform.

Here's what runs automatically in a production RoboSystems deployment:

Billing Operations:

Monthly credit allocation and overage processing
Daily storage usage billing
Hourly usage snapshots for analytics
Monthly usage report generation

Infrastructure Maintenance:

Hourly cleanup of expired API keys and sessions
Weekly health checks on credit system integrity
Instance health monitoring and metrics collection
Registry cleanup for orphaned resources

Data Pipelines:

SEC filing discovery and processing
QuickBooks transaction synchronization
Plaid account and transaction sync
Staged file validation and graph materialization

Each job is observable in Dagster's UI. Each run is logged. Each failure triggers alerts. No more mystery batch jobs that "run somewhere."

The Path Forward

We're not claiming to have solved data engineering. But we've made deliberate architectural choices that position RoboSystems for what comes next:

Instead of multi-hop pipelines: Direct integration from source to graph. QuickBooks transactions don't travel through three systems—they flow directly to your knowledge graph.

Instead of invisible pipelines: Observable orchestration where every run is logged, every failure is traceable, and every asset has lineage you can audit.

Instead of separate OLTP/OLAP: A unified graph layer that handles both relationship traversal and analytical aggregations.

Instead of vendor lock-in: Deploy in your own AWS account. Fork and customize. Own your infrastructure.

Instead of AI that generates SQL: Semantic understanding through graph relationships. AI agents that reason about your business, not just pattern-match against your schema.

Building for the Next Decade

The data engineering world is about to undergo a major consolidation. The seventeen-tool stack will collapse into something simpler. Companies that built on foundations designed for this complexity will struggle to adapt.

We built RoboSystems for what comes after.

Not because we're smarter. Not because we predicted the future. But because we refused to accept that financial intelligence requires copying data through seven systems.

The future of data engineering isn't more tools. It's better foundations.

Ready to build on foundations designed for the next decade? Deploy RoboSystems in your own AWS account and own your data infrastructure.

Want to see the orchestration in action? Explore our Dagster implementation and see how asset-centric pipelines actually work.

The future of data engineering isn't more complexity. It's better architecture.

The Future Data Stack: Why We Built RoboSystems

The Modern Data Stack Is Breaking

What Comes Next

Orchestration You Can Actually See

The Unified Database Reality

Semantic Understanding, Not Just Data Access

Infrastructure You Actually Own

Fork-Friendly by Design

The Jobs That Run Your Business

The Path Forward

Building for the Next Decade

Joey French

Ready to Build Your Knowledge Graph?

More from RoboSystems Blog

Building Financial Context Graphs: Beyond RAG to Semantic Understanding

The Financial Knowledge Graph Manifesto

Knowledge Graphs: The Missing Link in Financial Intelligence