How Big Is JPMorgan’s Data Compared to OpenAI?

If you only remember one thing from this post, make it this: JPMorgan is sitting on orders of magnitude more proprietary data than what OpenAI reportedly used to train GPT‑4, and that imbalance is going to define the next decade of AI power.

Most people think “AI power” means bigger models and more GPUs. But once models saturate on public internet-scale text, the real moat becomes who owns the best, densest, most continuous stream of real‑world data. JPMorgan is a perfect case study of what that looks like when a bank quietly turns into a data empire.


The 10‑Second Answer

  • JPMorgan’s internal data estate is on the order of hundreds of petabytes (roughly 150–500+ PB, depending on which parts you count).
  • Public estimates suggest GPT‑4 trained on under 1 petabyte of text/code data, corresponding to around 10–13 trillion tokens.
  • So in raw storage, JPMorgan has something like 150–500× more data than the corpus reportedly used for GPT‑4 pre‑training—but it’s a very different kind of data.
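Those ratios are easy to sanity‑check. A minimal back‑of‑envelope script, using only the soft public estimates quoted above:

```python
# Back-of-envelope comparison using the ranges quoted above.
# All figures are public estimates, not official numbers.
jpm_core_pb = 150     # structured, analytics-ready core (PB)
jpm_total_pb = 500    # upper bound across lakes, archives, systems (PB)
gpt4_corpus_pb = 1    # "under 1 PB" reported for GPT-4's training corpus

ratio_low = jpm_core_pb / gpt4_corpus_pb
ratio_high = jpm_total_pb / gpt4_corpus_pb
print(f"JPMorgan holds roughly {ratio_low:.0f}x to {ratio_high:.0f}x "
      f"the storage of GPT-4's reported corpus")
```

The spread is wide because the inputs are soft, but even the low end is more than two orders of magnitude.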

The interesting question is not “who has more bytes?” but: who can turn their data into models that actually move PnL, risk, and power?


Part 1: How Big Is JPMorgan’s Data Really?

Numbers first, then meaning.

1.1 Reported scale in petabytes

Different public descriptions give slightly different slices of JPMorgan’s data estate:

  • Articles on its Hadoop/big‑data stack cite “more than 150 petabytes of data,” supporting roughly 30,000 databases and billions of accounts.
  • A data‑infrastructure case study mentions 450+ PB powering 6,500+ applications across the firm.
  • A more recent AI/data strategy discussion refers to around 500 petabytes of data driving 300+ AI/ML use cases and billions of dollars in annual business value from AI and machine learning.
  • Commentary from Alexandr Wang and others often quotes ~150 PB of structured, continuously updated financial data as a benchmark example.

Put together, a reasonable mental range is:

  • Core, highly structured, analytics‑ready data: ~150 PB.
  • Total data footprint across logs, historical archives, data lakes, and systems: ~450–500 PB.

This is not static cold storage; it’s a live environment serving thousands of production systems and hundreds of AI applications.

1.2 Daily data exhaust

A large bank like JPMorgan throws off a ridiculous stream of new data every day:

  • One analysis estimates JPMorgan generates 12–27 terabytes of new data per day, which annualizes to roughly 4–10 petabytes of fresh information per year even before you count derived features.
  • This spans payments, trades, risk, customer interactions, internal systems logs, research, quant factor libraries, and more.

The important bit: this is a continuous time series of economic behavior, not one‑off static snapshots.
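Annualizing that daily exhaust is simple arithmetic; the sketch below uses the 12–27 TB/day estimate cited above, with decimal units (1 PB = 1000 TB) assumed:

```python
# Annualize the estimated daily data exhaust (12-27 TB/day).
TB_PER_PB = 1000   # decimal units assumed
DAYS = 365

low_tb_day, high_tb_day = 12, 27
low_pb_year = low_tb_day * DAYS / TB_PER_PB    # ~4.4 PB/year
high_pb_year = high_tb_day * DAYS / TB_PER_PB  # ~9.9 PB/year
print(f"roughly {low_pb_year:.1f} to {high_pb_year:.1f} PB of new data per year")
```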


Part 2: What Kind of Data Does JPMorgan Have?

Raw volume is a vanity metric; density and structure are the real story. JPMorgan isn’t just hoarding PDFs and log files—it is curating a multi‑decade, high‑frequency view of the financial system.

2.1 Categories of JPMorgan data

Some representative buckets:

  • Transaction flows:

    • Global payments, securities settlement, trade finance, derivatives workflows, collateral moves.
    • This underpins the oft‑quoted stat that JPMorgan moves trillions of dollars per day and generates double‑digit terabytes of data daily in the process.
  • Market and macro time series:

    • Internal macro and quant datasets engineered for backtests and systematic strategies.
    • Millions of time series in internal platforms that investors and desks use to build and stress‑test systematic strategies.
  • Client interaction & research:

    • Research portals, client emails, content downloads, and portal interactions.
    • Electronic, machine‑readable research streams that clients can pipe directly into their systems.
  • Operational and risk data:

    • KYC/AML, fraud patterns, operational risk incidents, compliance workflows, dispute histories.
    • These are gold for anomaly detection, graph models, and credit/fraud LLM copilots.
  • Internal code, documentation, and tickets:

    • Source code, config files, internal wikis, tickets, incident reports, which are perfect for internal code and infra copilots.

The result is a vertically integrated, multi‑modal representation of how money, risk, and information move through the global economy, with identities attached and timestamps down to the millisecond.

2.2 Why this data is “dense”

Compared to random internet text, JPMorgan’s data is:

  • Structured: schema‑rich tables, reference data, hierarchies.
  • Labelled by construction: every transaction has counterparties, products, channels, timestamps, statuses, outcomes.
  • Economically grounded: each row corresponds to real money, risk, or regulatory exposure, so signals are not just linguistic—they are financial.

Public web data is broad but noisy. JPMorgan’s data is narrower in topic, but incredibly high signal per byte for financial and enterprise AI.
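As a toy illustration of “labelled by construction”, consider a simplified payment record; the schema and field names here are hypothetical stand‑ins, not JPMorgan’s actual data model:

```python
# Hypothetical, simplified payment record: the schema itself supplies the
# labels (counterparties, product, status, timing) that web text lacks.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PaymentRecord:
    payment_id: str
    sender_id: str       # counterparty identity, attached by construction
    receiver_id: str
    product: str         # e.g. "wire", "ach", "card"
    amount_usd: float
    status: str          # e.g. "settled", "rejected", "flagged"
    timestamp: datetime  # millisecond-resolution event time

record = PaymentRecord(
    payment_id="p-001",
    sender_id="acct-123",
    receiver_id="acct-456",
    product="wire",
    amount_usd=25_000.0,
    status="settled",
    timestamp=datetime(2024, 1, 15, 9, 30, 0, 123000),
)
# Every field doubles as a supervised label for fraud, risk, or flow models.
```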


Part 3: How Big Is OpenAI’s Data (For GPT‑4)?

OpenAI hasn’t published an official dataset size for GPT‑4, but multiple independent analyses and leaks converge on some ballpark numbers.

3.1 Tokens, not terabytes

Most of the GPT‑4 discussion is framed in tokens, not bytes:

  • One widely cited technical leak claims that GPT‑4 was trained on roughly 13 trillion tokens, corresponding to around 10 trillion words of text, with two epochs on text data and more passes for code.
  • This corpus spans web scrapes, books, code, proprietary licensed data, and instruction/fine‑tuning datasets, with millions of rows of high‑quality instructions.

Storage‑wise:

  • Multiple AI infra commentators assert that GPT‑4’s training data fits under 1 petabyte of storage.
  • That is intuitively reasonable: 13 trillion tokens at a few bytes per token works out to tens of terabytes of raw text, so even with duplication, overhead, and metadata the corpus stays far below hundreds of petabytes.
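Spelled out as a quick calculation (bytes‑per‑token is an assumption; real tokenizers average a few bytes of English text per token):

```python
# Rough storage math for the reported GPT-4 corpus.
tokens = 13e12           # ~13 trillion tokens (reported, not official)
bytes_per_token = 4      # assumption; tokenizers vary
raw_bytes = tokens * bytes_per_token

TB = 1e12
PB = 1e15
print(f"~{raw_bytes / TB:.0f} TB of raw text, i.e. ~{raw_bytes / PB:.2f} PB")
# prints: ~52 TB of raw text, i.e. ~0.05 PB
```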

3.2 Scaling beyond GPT‑4

Looking forward:

  • Analysts speculating about a “GPT‑4.5” suggest that training tokens might scale into the 20–50 trillion range, with roughly an order of magnitude more training FLOPs than GPT‑4.
  • Even then, the total unique data reused across epochs is still likely in the few‑petabyte regime, not 100+ PB.

This matters because it puts a ceiling: once a lab has scraped the open web and licensed everything reasonably available, getting more high‑quality general data becomes very hard.


Part 4: Raw Size vs Strategic Power

Now the punchline: JPMorgan almost certainly has hundreds of times more bytes than OpenAI used to train GPT‑4. But it is not an apples‑to‑apples comparison.

4.1 Data size comparison

A very rough side‑by‑side:

| Aspect | JPMorgan Data Estate | OpenAI GPT‑4 Training Data |
| --- | --- | --- |
| Approx. size (storage) | ~150 PB structured core; 450–500 PB total across lakes/systems | Reportedly under 1 PB of corpus data |
| Daily growth | 12–27 TB/day (~4–10 PB/year of new data) | Mostly static corpus; incremental refreshes |
| Domain | Financial transactions, markets, clients, operations | General web, books, code, licensed corpora |
| Structure | Highly structured, time‑series, labeled by processes | Mostly unstructured/semi‑structured text/code |
| Access model | Private, regulated, behind firewalls and KYC | Foundation model pre‑training and fine‑tuning |

On storage alone, JPMorgan “wins” by a massive margin. But the right way to think about it is:

  • OpenAI’s corpus:
    • Broad, noisy, general world knowledge.
    • Optimized for building a general‑purpose cognitive engine once.
  • JPMorgan’s corpus:
    • Deep, narrow, economically potent signals.
    • Optimized for thousands of domain‑specific AI agents, risk engines, and copilots.

4.2 Neural scaling laws and diminishing returns

Neural scaling laws say that model performance improves as a power law with model size, data size, and compute—but with diminishing returns.

  • Once a model has ingested almost all high‑quality internet text, adding more generic data gives smaller gains.
  • Gains increasingly come from better, more specialized data and smarter data curation/fine‑tuning techniques.
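The diminishing‑returns point can be made concrete with the data term of a Chinchilla‑style loss, L(N, D) = E + A/N^α + B/D^β; the coefficients below loosely follow published fits but should be read as illustrative only:

```python
# Illustrative Chinchilla-style scaling: loss falls as a power law in
# dataset size D, so each doubling of generic data buys less and less.
def loss(D, E=1.69, B=410.7, beta=0.28):
    """Data-limited slice of L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + B / D ** beta

# Loss after each doubling of the token count D:
losses = [loss(D) for D in (1e12, 2e12, 4e12, 8e12, 16e12)]

# Improvement bought by each successive doubling shrinks monotonically:
deltas = [a - b for a, b in zip(losses, losses[1:])]
assert all(d1 > d2 > 0 for d1, d2 in zip(deltas, deltas[1:]))
```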

That’s exactly where JPMorgan’s edge lies:

  • It doesn’t need a bigger “internet”; it already has more task‑relevant data than it can currently exploit.
  • Its challenge is converting that into labeled, governed training sets for production‑grade models.


Part 5: Who Actually Has the Advantage?

So who is “ahead”—OpenAI with a world‑scale foundation model, or JPMorgan with a financial data firehose?

5.1 OpenAI’s edge: general intelligence and tooling

OpenAI’s strengths:

  • Frontier model capabilities: GPT‑4 and its successors are among the strongest general‑purpose reasoning and language engines.
  • Ecosystem and tools: APIs, tool calling, function integration, embeddings, and plugins that let enterprises plug their data into a powerful generic engine.

In other words, OpenAI builds the brains.

5.2 JPMorgan’s edge: proprietary, regulated, monetizable data

JPMorgan’s strengths:

  • Depth of domain data: 150–500 PB of highly structured financial and client data spanning decades.
  • Production footprint: Hundreds of AI use cases, large numbers of production AI/ML deployments, and billions in business value already documented.
  • Operational integration: AI embedded in payments, risk, fraud, research, and client workflows, not just in a lab.

This supports a broader point: governments and large enterprises may ultimately wield more effective AI power than the model labs, because they control the highest‑value data.

5.3 The real future: OpenAI × JPMorgan, not OpenAI vs JPMorgan

The most realistic path forward is a hybrid stack:

  • Foundation models (from OpenAI or open source) provide:
    • General reasoning, language understanding, and coding ability.
  • Enterprise data estates (like JPMorgan’s) provide:
    • Ground truth, labels, and constraints for domain‑specific copilots, agents, and risk engines.

This is exactly the “data‑centric foundation model development” pattern the tooling ecosystem is moving toward: use foundation models to help label and curate enterprise data, then train or fine‑tune smaller, specialized models that run inside the firm’s governance perimeter.
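A minimal sketch of that loop, with every function a hypothetical stand‑in rather than any vendor’s actual API:

```python
# Data-centric loop sketch: a foundation model proposes labels for
# proprietary records, governance confirms them, and the confirmed set
# feeds fine-tuning of a smaller in-house model. All functions are
# hypothetical stand-ins for illustration only.

def foundation_model_label(record):
    # Stand-in for a foundation-model call; here a trivial keyword heuristic.
    return "fraud_review" if "chargeback" in record else "routine"

def governance_check(record, label):
    # Stand-in for human review / policy rules inside the firm's perimeter.
    return True

def build_finetune_set(records):
    # Keep only labels that survive governance; the result becomes the
    # training set for a smaller, specialized internal model.
    return [
        {"text": rec, "label": foundation_model_label(rec)}
        for rec in records
        if governance_check(rec, foundation_model_label(rec))
    ]

examples = [
    "wire transfer settled normally",
    "card payment disputed, chargeback filed",
]
train_set = build_finetune_set(examples)
```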

In that world, the question “who has more data?” is almost the wrong question. The right one is:

Who can transform their proprietary data into a defensible, compounding AI advantage faster—without blowing up on regulation, security, or alignment?

Today, OpenAI leads in models; JPMorgan leads in financial data. Over the next decade, the winners will be the institutions that master both.