The Future Data Stack: Why We Built RoboSystems
The Modern Data Stack Is Breaking
Take a look at the typical enterprise data architecture in 2026:
Source Systems → Fivetran → Snowflake → dbt → Reverse ETL → Downstream Apps
↓ ↓ ↓
Airbyte Databricks Hightouch
↓ ↓ ↓
Stitch BigQuery Census
Each arrow is a point of failure. Each tool is a vendor contract. Each hop adds latency, cost, and complexity.
We've created a Rube Goldberg machine where data travels through seven systems just to answer a simple question. And when something breaks at 2 AM? Good luck figuring out which of your seventeen integrations failed.
This isn't sustainable. It never was.
What Comes Next
The next generation of data platforms won't look like this. They'll be characterized by three fundamental shifts:
- Unified databases that handle transactions and analytics without copying data between systems
- Observable orchestration where every pipeline run is traceable, replayable, and auditable
- Semantic layers that give AI agents actual understanding, not just data access
We've spent years building RoboSystems around these principles. Not because we predicted the future—but because we refused to accept the present.
Orchestration You Can Actually See
Here's the dirty secret of most data infrastructure: nobody knows what's running.
Cron jobs fire into the void. Airflow DAGs fail silently. Lambda functions timeout with cryptic errors. When the CEO asks "why is this number wrong?" the answer is usually "let me check CloudWatch logs for the next three hours."
We built our orchestration layer on Dagster because it solves a different problem than most orchestration tools: visibility.
Every job run is logged. Every failure is traceable. Every asset has lineage. When something breaks, you know what broke, when it broke, and what data was affected.
# Assets define WHAT should exist, not scripts that run
@asset(
deps=["sec_raw_filings"],
group_name="sec_pipeline",
)
def sec_graph_materialized(
context: AssetExecutionContext,
graph: GraphResource,
staging_result: dict,
) -> dict:
"""Materialize staged SEC data into the knowledge graph."""
graph_id = staging_result["graph_id"]
# Dagster tracks: what ran, when, what it produced, what failed
result = graph.materialize_staged_data(graph_id, staging_result["tables"])
context.log.info(f"Materialized {result['nodes_created']} nodes")
return result
This is asset-centric thinking. You don't write scripts that "run the SEC pipeline." You define assets—SEC filings, staged data, materialized graphs—and Dagster figures out what needs to run to make them exist.
The difference matters when something goes wrong:
| Traditional Orchestration | Asset-Centric Orchestration | | ------------------------- | ----------------------------------- | | "The 2 AM job failed" | "The SEC Q4 filings asset is stale" | | "Check the logs" | "Here's the exact run that failed" | | "Re-run everything" | "Re-materialize just this asset" | | "Hope it works this time" | "Replay with the same inputs" |
We still run scheduled jobs—monthly billing, daily cleanup, nightly processing. But every run is observable. Every asset has provenance. When the auditor asks "where did this number come from?" you can trace it back to the exact source file, transformation, and timestamp.
This matters more than ever as AI enters the picture. Agents need to trust the data they're working with. If you can't prove where a number came from, AI can't rely on it—and neither can you. Observable pipelines don't just help debugging. They make data AI-ready.
The Unified Database Reality
Every few years, someone promises a database that handles everything. They usually fail because "everything" is too broad.
We took a different approach: build for analytical graph workloads specifically. LadybugDB, our embedded graph engine, combines:
- Columnar storage for the aggregations financial analysis actually needs
- Graph semantics for the relationships that make financial data meaningful
- Embedded architecture for true multi-tenant isolation
This isn't a general-purpose database trying to do everything. It's a specialized engine that does one thing exceptionally well: make financial relationships queryable at scale.
// One query that would require joins across five tables in SQL
MATCH (company:Entity)-[:FILED]->(report:Report)
-[:CONTAINS]->(fact:Fact)-[:MEASURES]->(element:Element)
WHERE element.qname = 'us-gaap:Revenues'
AND report.period_end >= date('2024-01-01')
RETURN company.name, fact.numeric_value, report.period_end
ORDER BY company.name, report.period_end
The query doesn't just retrieve data—it traverses relationships. Revenue connects to reports connects to companies connects to industries. The graph isn't an alternative to SQL; it's what SQL was always trying to be.
Semantic Understanding, Not Just Data Access
Here's the uncomfortable truth about most AI-powered data tools: they retrieve information, but they don't understand meaning.
RAG systems fetch relevant chunks. Text-to-SQL tools generate queries. But when your CFO asks "What's driving our margin compression?"—retrieving the top 10 relevant documents isn't the same as understanding causality. Pattern-matching against your schema isn't reasoning.
Knowledge graphs change this equation. When your data is stored as explicit relationships, AI agents can traverse meaning, not just retrieve text:
Revenue is connected to Customers
Customers are connected to Segments
Segments have different Profitability metrics
Profitability affects Cash Flow projections
This isn't semantic search—it's semantic reasoning. The difference matters when you ask: "What happens to our cash position if we lose our top three customers?"
A SQL generator gives you a query template. A graph-aware AI agent traces the actual relationships in your business to model the cascade effect.
Infrastructure You Actually Own
Here's something the enterprise software industry doesn't want you to know: you can run your own data infrastructure.
We built RoboSystems to deploy in your AWS account with a single command:
just bootstrap # Set up OIDC federation
just deploy prod # Deploy everything
That's it. No SaaS dependency. No data leaving your network. No per-seat licensing that scales with your headcount.
What gets deployed:
| Component | Service | Purpose | | ------------- | ----------- | ----------------------------- | | API | ECS Fargate | Application endpoints | | Orchestration | ECS Fargate | Dagster workers and scheduler | | PostgreSQL | RDS | User accounts and billing | | LadybugDB | EC2 ARM64 | Graph storage and queries | | Cache | ElastiCache | Session and query caching |
The entire stack uses GitHub OIDC federation—no AWS credentials stored anywhere. One-hour sessions that can't be abused if compromised.
This isn't just a preference—it's becoming a fiduciary responsibility. Regulations tighten every year. Your clients expect to know where their data lives. Your auditors need to verify your controls. "It's in some vendor's cloud" is no longer an acceptable answer for sensitive financial data.
Data sovereignty isn't paranoia. It's prudence.
Fork-Friendly by Design
Most open-source projects make the same mistake: they assume you'll contribute upstream or stay on the main branch forever.
Real enterprises don't work that way. You need custom integrations. Proprietary data sources. Industry-specific logic.
We designed RoboSystems with explicit merge boundaries:
adapters/
├── sec/ # ← Upstream maintains
├── quickbooks/ # ← Upstream maintains
├── plaid/ # ← Upstream maintains
│
└── custom_*/ # ← Your namespace (we never touch)
├── custom_erp/ # Your ERP integration
├── custom_bank/ # Your bank API
└── custom_crm/ # Your CRM connector
The custom_* namespace is a contract: we will never create files there. Your fork can pull upstream updates without merge conflicts, without losing your customizations, without worrying about what we changed.
This extends to Dagster assets, schema extensions, and operation modules. Build what you need. Keep getting updates.
The Jobs That Run Your Business
Orchestration isn't just about data pipelines—it's about the operational heartbeat of your platform.
Here's what runs automatically in a production RoboSystems deployment:
Billing Operations:
- Monthly credit allocation and overage processing
- Daily storage usage billing
- Hourly usage snapshots for analytics
- Monthly usage report generation
Infrastructure Maintenance:
- Hourly cleanup of expired API keys and sessions
- Weekly health checks on credit system integrity
- Instance health monitoring and metrics collection
- Registry cleanup for orphaned resources
Data Pipelines:
- SEC filing discovery and processing
- QuickBooks transaction synchronization
- Plaid account and transaction sync
- Staged file validation and graph materialization
Each job is observable in Dagster's UI. Each run is logged. Each failure triggers alerts. No more mystery batch jobs that "run somewhere."
The Path Forward
We're not claiming to have solved data engineering. But we've made deliberate architectural choices that position RoboSystems for what comes next:
Instead of multi-hop pipelines: Direct integration from source to graph. QuickBooks transactions don't travel through three systems—they flow directly to your knowledge graph.
Instead of invisible pipelines: Observable orchestration where every run is logged, every failure is traceable, and every asset has lineage you can audit.
Instead of separate OLTP/OLAP: A unified graph layer that handles both relationship traversal and analytical aggregations.
Instead of vendor lock-in: Deploy in your own AWS account. Fork and customize. Own your infrastructure.
Instead of AI that generates SQL: Semantic understanding through graph relationships. AI agents that reason about your business, not just pattern-match against your schema.
Building for the Next Decade
The data engineering world is about to undergo a major consolidation. The seventeen-tool stack will collapse into something simpler. Companies that built on foundations designed for this complexity will struggle to adapt.
We built RoboSystems for what comes after.
Not because we're smarter. Not because we predicted the future. But because we refused to accept that financial intelligence requires copying data through seven systems.
The future of data engineering isn't more tools. It's better foundations.
Ready to build on foundations designed for the next decade? Deploy RoboSystems in your own AWS account and own your data infrastructure.
Want to see the orchestration in action? Explore our Dagster implementation and see how asset-centric pipelines actually work.
The future of data engineering isn't more complexity. It's better architecture.
Joey French
Contributing to the future of graph databases and AI-powered business intelligence.
Ready to Build Your Knowledge Graph?
Experience the power of graph databases with RoboSystems.
Get Started on GitHubMore from RoboSystems Blog
Building Financial Context Graphs: Beyond RAG to Semantic Understanding
Jan 12, 2026
Most AI projects fail because the data isn't ready. Context graphs are the semantic layer that transforms scattered financial data into meaning AI can actually understand.
The Financial Knowledge Graph Manifesto
Sep 1, 2025
Why the future of business intelligence lies not in isolated metrics, but in the rich, interconnected relationships between financial outcomes and the operational activities that drive them.
Knowledge Graphs: The Missing Link in Financial Intelligence
Sep 6, 2025
Your business runs on relationships—between customers and revenue, costs and activities, risks and opportunities. So why does your financial system pretend these connections don't exist?