Top Themes in Data Transcript
Slide 1
Clearing: While the data world consolidates, capabilities have exploded with AI.
Content:
AI is rewriting every rule about what’s possible with data
Those two forces in tension will make for an exciting 2025
Slide 2
Clearing: My name is Tomasz Tunguz, founder and general partner at Theory.
Content:
I’ve been investing in data for the last 17 years and have worked with companies like Looker, Monte Carlo, Hex, Omni, Tobiko Data, and MotherDuck
I founded Theory, a venture firm managing $700M with the idea that all modern software companies will be underpinned by data and AI
We run a research-oriented firm, informed by 200 buyers of data and AI software
Transition:
These are the themes we predict for the world of data
Slide 3
Clearing: Every transformation follows a pattern. Today, three powerful movements are reshaping how enterprises work with data.
Content:
First, we’re witnessing the Great Consolidation. After a decade of expanding complexity in the modern data stack, companies are dramatically simplifying their architectures – and getting better results
Second, we’re seeing a renaissance of scale-up computing. The distributed systems that dominated the 2010s are giving way to powerful single machines and Python-first workflows
Third, we’re entering the age of agentic data – where AI doesn’t just analyze data, but actively manages it. Production AI systems are transforming both how we operate our data systems and how we extract insights from them
Transition:
These aren’t isolated trends. They’re converging to create a fundamentally new way of working with data
Slide 4
Clearing: Let’s talk about the great consolidation.
Content:
We’ve seen the modern data stack explode in recent years
There’s a tool for everything
Transition:
But this has led to a lot of complexity
Slide 5
Clearing: Buyers are overwhelmed. I’m hearing more and more of them say, “Don’t sell me another tool!”
Content:
They want simplification, not more point solutions
Companies want to optimize costs. Fewer vendors mean fewer licenses and less overhead
The office of the CFO is pressuring data leaders for ROI from billions invested over the last decade
We will see enterprises standardizing on certain technologies, especially the broadest ones, even if the individual point solutions are not the best in that layer
Expect more mergers and acquisitions as companies try to assemble their versions of the most prized data layers
Transition:
This consolidation is pushing us towards more flexible and scalable data architectures, driven not only by cost and simplicity but also capabilities, which brings us to…
Slide 6
Clearing: That MacBook Pro should be called a mainframe pro. It’s just that powerful.
Content:
I use my MacBook Pro to run 70-billion-parameter models, which are roughly equivalent to GPT-3.5
With that kind of power, I can develop the vast majority of data workloads on my local machine
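To make this concrete, here is a minimal sketch of that local-first workflow, assuming the Ollama runtime is installed along with its Python client and a quantized Llama model has already been pulled with "ollama pull llama3.3":

```python
# A minimal sketch of local-first AI development: querying a locally
# hosted model through the Ollama Python client. Assumes the Ollama
# server is running and `ollama pull llama3.3` has been executed.
import ollama

response = ollama.chat(
    model="llama3.3",  # a 70-billion-parameter model, quantized to fit on a laptop
    messages=[{"role": "user", "content": "Summarize the key trends in this quarter's sales data."}],
)
print(response["message"]["content"])
```

Everything runs on the laptop: no tokens leave the machine, and there is no per-call API bill.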
Transition:
As a new generation of developers, especially Python developers, starts working with data, they prefer local-first development. Scale-up architectures let them start small and then migrate their workloads to bigger machines, which satisfy more than 80% of current workloads
Slide 7
Clearing: Decoupling storage and compute is all about unlocking flexibility.
Content:
We’re not talking about the scale-out architecture that separated storage and compute for Snowflake
Instead, we’re talking about a logical separation between the query engine and the data storage
Traditionally, these have been tightly coupled. But now, we’re seeing them decoupled, with technologies like Iceberg leading the way
This allows us to:
Use different query engines for different tasks, optimizing for both price and performance
Create intellectual property around AI by building proprietary models
Improve data governance, access control, and privacy compliance
New query engines are emerging:
DuckDB is an in-process analytical database designed for efficient queries on larger datasets
DataFusion is an extensible query engine written in Rust
We’re also seeing greater use of Python data wrangling tools:
dlt (data load tool) is a Python library for building data loading pipelines
Polars is a fast and efficient DataFrame library, a high-performance alternative to pandas
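To ground this, a minimal sketch of the decoupled pattern, where the flights.parquet file is a hypothetical local dataset:

```python
# A minimal sketch: an in-process query engine (DuckDB) reads open-format
# files directly, with no warehouse coupling, then hands the result to
# Polars for further wrangling. flights.parquet is a hypothetical file.
import duckdb
import polars as pl

con = duckdb.connect()  # in-process: no server or cluster to manage

busiest = con.execute("""
    SELECT origin, COUNT(*) AS total_flights
    FROM 'flights.parquet'
    GROUP BY origin
    ORDER BY total_flights DESC
    LIMIT 10
""").pl()  # fetch the result as a Polars DataFrame

# Continue wrangling in Polars: add each airport's share of total flights
busiest = busiest.with_columns(
    (pl.col("total_flights") / pl.col("total_flights").sum()).alias("share")
)
print(busiest)
```

The same SQL could run against Iceberg tables in object storage; the query engine and the storage remain logically independent.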
Transition:
Centralized control of data and purpose-built data engines enable AI
Slide 8
Clearing: AI is changing the way software and data engineering teams work together.
Content:
Jensen Huang, the CEO of NVIDIA, has a great way of putting it. He says the IT department of the future will be like the HR department for AI agents
We’ll be managing and ’training’ these agents to work with our data
Transition:
This change starts first within the engineering org
Slide 9
Clearing: Historically, there’s been a divide between software engineering and AI/ML teams.
Content:
AI teams often worked downstream of the application, building offline models for analysis, clustering, and segmentation, combined with the work of financial analysts
Data engineering teams and software engineering teams are writing separate pipelines
Operating in separate environments with different technologies
Merging the two over the last decade has been extremely difficult
At the same time, managing these parallel systems can be extremely expensive
Transition:
AI changes this topology
Slide 10
Clearing: AI is a core part of many products, and in the future, every software company will be an AI company.
Content:
Data scientists are now building production models
Software engineers are hitting AI endpoints to build agents inside modern applications
Python has become the dominant language of AI and a popular language for software development
There’s an opportunity to fuse those two environments
Data teams need to adopt software engineering best practices including:
Virtual development environments
Regression and integration testing
Cost optimization
Tobiko Data, with SQLMesh, reduces cloud data warehouse (CDW) costs by 50% while also enabling this transition to virtual development environments
We’re seeing this occur within our startups
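As a minimal sketch of what hitting an AI endpoint looks like from the application side (assuming the OpenAI Python client and an API key; the model name and prompt are illustrative, and any hosted or local endpoint works the same way):

```python
# A minimal sketch of application code calling an AI endpoint. Assumes
# the openai package is installed and OPENAI_API_KEY is set in the
# environment. Model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # a small model; swap in any chat model
    messages=[{
        "role": "user",
        "content": "Tag this support ticket as bug, billing, or feature: "
                   "'I was charged twice this month.'",
    }],
)
print(completion.choices[0].message.content)
```

The same Python toolchain, virtual environments, and tests that software engineers already use apply directly to this code, which is what makes the fusion practical.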
Transition:
Speaking of cost, let’s talk about the expense of AI
Slide 11
Clearing: In the 24 months after ChatGPT was released, a parameter race was unleashed, with model sizes growing ever larger, culminating most recently in Llama 3.1 at 405 billion parameters.
Content:
These electron-guzzling monoliths are incredibly powerful, containing a compressed version of the 20 trillion or so words written on the internet and the ability to process them
At the same time, there have been parallel research efforts optimizing smaller and smaller models
Transition:
While large models are essential in use cases where the universe of inputs is infinite, not every business workload needs a Wikipedia on every API call
Slide 12
Clearing: Databricks’ most recent State of Data + AI report, published earlier this year, shows that small models are the most popular.
Content:
Small models now represent a majority of deployed AI models
When we interview AI buyers, the pressure from the CFO is stark
In contrast to the decade of data, when spending grew unabated for the 12 years before 2022, cost pressure on AI has been present from day one
With financial pressure, resourceful data teams have resorted to smaller models
Transition:
But it is not performance at any price
Slide 13
Clearing: Plotting MMLU, a high-school-equivalency benchmark, over time, you can see that small, medium, and large models are converging around 70 to 80% accuracy.
Content:
This isn’t a one-time trend
Overall AI inference costs have fallen 1000x in the US in the last three years
Newer models might cost two orders of magnitude less to train
Jevons Paradox is in full force – OpenAI materially underestimated how much people would use their software
Transition:
With performance relatively similar, it’s no surprise enterprises are moving to smaller models. But it’s not just about performance equivalency
Slide 14
Clearing: In addition, smaller models offer significantly better latency.
Content:
Latency is three to four times better with a smaller model
Google found that even small increases in latency have a significant, linear effect on search engagement
It’s no different within modern software applications
Smaller models offer significantly better user experience
Transition:
And they do it at a fraction of the cost. Just how much is the cost difference?
Slide 15
Clearing: Docspot tracks these prices and plots them on a logarithmic chart.
Content:
Gemini’s 8-billion-parameter Flash model costs about 10 cents per million tokens
OpenAI’s GPT-4 costs more than $60 per million tokens
That’s nearly three orders of magnitude of difference: $60 ÷ $0.10 = 600x more expensive
Some new AI architectures run multiple queries for the same user workflow to ensure higher accuracy
Transition:
Smaller models offer near-equivalent performance, significantly lower latency, and orders-of-magnitude lower cost. We believe they will be dominant within the enterprise. But smaller models do require one thing
Slide 16
Clearing: Data modeling isn’t just back – it’s become the foundation of reliable AI.
Content:
Without it, we’re building AI castles on sandy data
Our current AI models are text models, not numerical models
To drive maximum performance we need to model the data
This limits the universe of potential outcomes and dramatically improves quality
Data modeling significantly improves the developer experience for software engineers
Transition:
Let me show you what I mean
Slide 17
Clearing: Here I created a little TypeScript application that processes the famous FAA data. I did this in 15 minutes.
Content:
I recorded a video of my request to show me the busiest airports by total flights in 2023
The text-to-SQL model underpinning this is hitting a data model
The data model provides additional context to help translate the structure of the underlying database
For large enterprises with tens of thousands of tables, this is the only way to drive accuracy
This provides a great API endpoint for software engineers to hit
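A hedged sketch of the pattern just described: the data model is serialized into the prompt so the model only reasons over curated tables and metrics. The schema, model name, and helper function below are all illustrative:

```python
# A minimal sketch of text-to-SQL grounded in a data model. The schema
# below is a hypothetical, curated data model; passing it to the model
# limits the universe of potential outputs and improves accuracy.
from openai import OpenAI

DATA_MODEL = """
table: flights
  columns:
    origin_airport TEXT  -- IATA code of the departure airport
    flight_date DATE
  metrics:
    total_flights = COUNT(*)
"""

client = OpenAI()

def text_to_sql(question: str) -> str:
    # Constrain the model to the curated schema rather than raw DDL
    prompt = (
        "Using only the tables, columns, and metrics defined below, "
        "write a SQL query that answers the question.\n"
        f"{DATA_MODEL}\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(text_to_sql("Show me the busiest airports by total flights in 2023"))
```

Wrapped in a web service, a function like this becomes exactly the kind of API endpoint a software engineer can hit.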
Transition:
The impact of enabling AI to work within data organizations is not trivial
Slide 18
Clearing: Many organizations, especially the leading ones, are starting to use AI in a meaningful way.
Content:
25% of new code at Google is written by AI
Microsoft and ServiceNow have both reported 50% developer productivity boosts
Amazon saved $275 million migrating from one version of Java to another using AI
These productivity impacts will benefit data teams
Models need to understand the underlying data through data models
Once a data model is in place, we can build applications on top
This data model will basically be an ORM for the entire data stack
Transition:
Imagine being the first data team to save your company $10 million by producing the right analysis for the CFO or the board, especially in this environment of consolidation. That’s a surefire way to earn a promotion! One of the first applications of these data models is BI, and BI is changing too
Slide 19
Clearing: Data governance isn’t about control anymore – it’s about enablement.
Content:
The best governance frameworks today are built on collaboration, not restriction
The core of BI is data governance
It may look like fancy charts, but the most important thing is providing accurate data
Data teams face a dilemma:
Decentralized access means greater accessibility but more risk of misinterpretation
Data centralization means higher quality data but less velocity
Transition:
We’re finally reaching a place where you can have both
Slide 20
Clearing: The business intelligence ecosystem has been a pendulum oscillating between centralized and decentralized control.
Content:
Early 2000s: The Era of Centralized BI
Companies like MicroStrategy, Cognos, BusinessObjects, and Hyperion
Powerful but slow and IT-dependent reporting solutions
High accuracy, low agility
2003: The Rise of Self-Service Analytics
Tableau revolutionized the industry
Empowered business users to directly access and analyze data
The Cloud Data Warehouse Revolution:
Cloud platforms like Snowflake and BigQuery enabled massive scalability
Tools like Looker emerged for consistent and governed access
The Challenge of Balancing:
Data democratization is crucial
Centralized control is essential
Omni enables a hybrid approach:
Both centralized teams and individual marketers can define and share metrics
Everyone uses the same trusted data while maintaining flexibility
Transition:
Underpinning BI, data models, and new architectures is observability
Slide 21
Clearing: I believe data pipelines are the backbone of any modern AI system.
Content:
They’re not just for analytics anymore; they’re essential for the entire machine learning lifecycle
Key functions of an intelligent pipeline:
Ensures data quality through cleaning, transformation, and validation
Enforces consistency using standardized formats
Guarantees timely delivery
Data observability acts as a health monitor:
Detect issues proactively
Troubleshoot problems faster
Build more trust in data
Pipelines are getting more complex:
Data coming from everywhere
Need for real-time processing is growing rapidly
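To ground the observability idea above, a minimal sketch of proactive pipeline checks. The table name, timestamp column, and thresholds are illustrative; production teams use platforms like Monte Carlo for this:

```python
# A minimal sketch of data observability: freshness and volume checks
# that surface pipeline issues proactively. warehouse.db, the flights
# table, and the loaded_at column are all hypothetical.
from datetime import datetime, timedelta
import duckdb

con = duckdb.connect("warehouse.db")

def check_freshness(table: str, ts_column: str, max_lag_hours: int = 24) -> bool:
    """Fail if the newest row is older than the allowed lag."""
    latest = con.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()[0]
    return latest is not None and datetime.now() - latest < timedelta(hours=max_lag_hours)

def check_volume(table: str, min_rows: int = 1000) -> bool:
    """Fail if the table has suspiciously few rows after a load."""
    count = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return count >= min_rows

for name, ok in [("freshness", check_freshness("flights", "loaded_at")),
                 ("volume", check_volume("flights"))]:
    print(f"{name}: {'PASS' if ok else 'FAIL, investigate upstream'}")
```

Checks like these are cheap enough to run on every pipeline execution, which is what makes proactive detection possible.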
Transition:
With reliable and observable data flowing, we can leverage powerful new techniques, like…
Slide 22
Clearing: This slide really captures the essence of why intelligent data pipelines are so vital.
Content:
They’re the backbone of any modern AI system
Key elements include:
INPUTS: databases, APIs, streaming data, IoT sensors
Processing: ensuring quality, consistency, and timely delivery
OUTPUTS: machine learning models, dashboards, applications
Critical components:
OBSERVABILITY and EVALS
Constant monitoring
Proactive issue detection
Growing demands for:
Speed and accuracy
Consistency across AI and BI systems
Meeting regulatory requirements
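As a minimal sketch of that inputs-to-outputs flow, using dlt (mentioned earlier) to land API data in DuckDB; the endpoint URL and table names are hypothetical:

```python
# A minimal sketch of an intelligent pipeline with dlt: pull records
# from an API (INPUT), let dlt normalize and type them (PROCESSING),
# and land a queryable table (OUTPUT). The endpoint is hypothetical.
import dlt
import requests

@dlt.resource(table_name="flights")
def flights():
    # INPUT: stream records from a hypothetical API endpoint
    resp = requests.get("https://example.com/api/flights")
    resp.raise_for_status()
    yield from resp.json()

pipeline = dlt.pipeline(
    pipeline_name="flights_demo",
    destination="duckdb",        # OUTPUT: a local, queryable database
    dataset_name="analytics",
)

# PROCESSING: dlt infers the schema, normalizes nested fields, and loads
info = pipeline.run(flights())
print(info)  # load summary, useful raw material for observability
```

From here, the observability checks sketched earlier would monitor the resulting tables.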
Slide 23
Clearing: Every transformation follows a pattern. Today, three powerful movements are reshaping how enterprises work with data.
Content:
The Great Consolidation:
After a decade of expanding complexity
Companies are dramatically simplifying architectures
Renaissance of scale-up computing:
Distributed systems giving way to powerful single machines
Python-first workflows
Age of agentic data:
AI actively manages data
Production AI systems transform operations and insights
Transition:
These aren’t isolated trends. They’re converging to create a fundamentally new way of working with data